A few years ago, one of the companies I worked with created a “Production Application Support Team”, or PAST. These were the sysadmins and programmers intended as the next line of support after the helpdesk, who could triage some problems and perhaps actually fix them.
The team had a problem: vague complaints that the website was “slow” for login. Yet anytime a leader got the phone call, they would give the site a try and see reasonable response times. PAST had a motivated programmer named Doug and a free copy of a functional test tool.
They gave the tool to Doug and asked him to figure it out.
Doug was a sharp guy, and he had an idea. He used the tool to record the login operation and write the timing of the full page download to a database, then set it to log in every five minutes, all day long. Then he used a different tool to query the database and create graphs of average time-to-load over time, all from an end-user perspective.
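To make the idea concrete, here is a minimal sketch of that kind of probe. It is not Doug's actual setup: the original used a functional test tool and a separate graphing tool, while this version assumes Python with SQLite, and the table name `login_samples` and every function name are made up for illustration. The `action` you pass in would be whatever drives a real login from the outside.

```python
import sqlite3
import time
from datetime import datetime
from typing import Callable, List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the sample store. Table name is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS login_samples ("
        "taken_at TEXT, seconds REAL, ok INTEGER)"
    )
    return conn


def time_action(action: Callable[[], None]) -> Tuple[float, bool]:
    """Run one end-user operation; return (elapsed seconds, succeeded)."""
    started = time.perf_counter()
    try:
        action()
        ok = True
    except Exception:
        # A timeout or error still counts as a sample, just a failed one.
        ok = False
    return time.perf_counter() - started, ok


def store_sample(conn: sqlite3.Connection, taken_at: datetime,
                 seconds: float, ok: bool) -> None:
    """Write one timing sample to the database."""
    conn.execute(
        "INSERT INTO login_samples VALUES (?, ?, ?)",
        (taken_at.strftime("%Y-%m-%d %H:%M:%S"), seconds, int(ok)),
    )
    conn.commit()


def average_by_hour(conn: sqlite3.Connection) -> List[Tuple[int, float, int]]:
    """Return (hour of day, average seconds, failed samples) per hour."""
    return conn.execute(
        "SELECT CAST(strftime('%H', taken_at) AS INTEGER) AS hour, "
        "AVG(seconds), SUM(ok = 0) "
        "FROM login_samples GROUP BY hour ORDER BY hour"
    ).fetchall()
```

Run from a scheduler every five minutes, each pass would call `time_action` on the login step and hand the result to `store_sample`. Grouping by hour of day is what makes a spike at a particular time stand out, where a single all-day average would hide it.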
In a week or so, Doug had a decent performance monitoring tool up … and he found something interesting.
Around 8:00AM, performance of the application fell through the floor. I mean through the floor; it took up to 35 seconds to log in, with an occasional outright timeout. By 9:00AM, things were good again, with three to five second logins. (Of course, this was a few years ago, before Google made customer expectations go up a notch.)
Then around 1:00PM performance fell apart again, to get fixed by 2:00 or so.
What’s going on here?
Doug took the problem to the web team: “Why could system performance suddenly degrade at exactly 8:00AM and 1:00PM?”
Paul, the team lead, had the answer: This was a business application, designed almost entirely for employees who used the system from eight AM to five PM.
And it had a one-hour no-activity timeout.
That meant that every day, at 8:00AM and 1:00PM, something like a thousand simultaneous users were directed to the login screen, and the application fell over.
The team had several options for fixing the code, including optimizing login for performance. The technical team's objection to that was that performance was just fine for most of the users, most of the time.
The first, immediate fix was to change the timeout to twenty hours plus a random one to three hours. This spread the logins out over time and bought the team some breathing room. Of course, they still had problems on Mondays and holidays, but it turned out that enough people were late back into the office on those days that performance was still reasonable.
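The timeout change can be sketched in a few lines. This assumes the extra one to three hours is drawn at random for each session, which the story implies but does not spell out, and the function names here are mine, not the team's.

```python
import random
from datetime import datetime, timedelta
from typing import Optional


def jittered_timeout(rng: Optional[random.Random] = None) -> timedelta:
    """Twenty hours plus a random one to three hours of allowed inactivity.

    Assumption: the jitter is picked independently per session.
    """
    rng = rng or random.Random()
    return timedelta(hours=20) + timedelta(minutes=rng.uniform(60, 180))


def session_expired(last_activity: datetime, timeout: timedelta,
                    now: datetime) -> bool:
    """True if the user has been idle for at least `timeout`."""
    return now - last_activity >= timeout
```

With a timeout somewhere between 21 and 23 hours, someone who stops working at 5:00PM is still logged in at 8:00AM the next morning, so the morning and after-lunch stampedes go away. A weekend or a holiday is longer than any of these timeouts, which is why Mondays were still a problem.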
I take a few lessons from this story. First, if there is a performance problem with the software, there is probably someone on the team who can find the root cause; that person only needs the right information.
Our job as performance testers is to get the right information to the right people, then facilitate the fix.
Second, there is often an incredibly small, immediate fix that will get us to an 80% or 90% solution for a fraction of the work. This is not just my experience; I hear the same thing nearly every time I compare notes with other performance testers.
Third, there are many ways to gather data. My earlier post was on average response time, this post is on sampling a single feature over time, and next time I’ll be writing about frequency distributions.
Each of these graphs and stats is a way to analyze the system.
Doug tied a functional tool to a database, and solved the problem by setting a timeout cookie in the browser.
Or, to borrow a line from Shakespeare: It turns out there are more ways to do performance testing under heaven and earth than are dreamt of in most philosophies!