Recently, I was invited to observe a colleague’s monthly stress test exercise. There is a large room filled with managers and technicians, experts in various aspects of the system, general network specialists, application specialists, customer support representatives and a collection of people with backgrounds in various aspects of testing.
Once a month, they get a group of people together and observe how the systems behave under massive load levels. They make observations and notes and examine the possible causes of various behaviors. Then they compare the results with what happens in their test environment.
Questions and Answers
This caught me by surprise so I asked the question: “You don’t run this exercise in your test environment? You do this live?”
The person I was sitting next to seemed baffled at the idea. “Of course we can’t run this exercise in the test environment. It is far too small and far too limited. We don’t have enough licenses to run the performance testing tool at the levels that would generate these results. Even if we did, the test environment is nothing like the production environment. The test environment is mostly virtual machines that are running some form of the application.”
That struck me as a little odd. I asked the next question: “You run some virtual users in a test environment that is considerably different than the production environment. Does this give you much information you can act on for your production environment?”
“Well, sort of,” was the answer I received. “We haven’t been able to recreate in the live environment some of the stuff we see in the test environment. And we can’t really recreate the live environment’s problems in the test environment, because it always crashes before we can get any real number of instances running.”
“So, does this really give you any useable information? Is there something you can act on that will allow you to improve the performance of the live environment?”
The answer? “Well, we are still trying to figure out how what we are finding can be of use.”
At this point something dawned on me. There was far more going on than I first thought. I found it kind of sad. This got me thinking about something far more fundamental that I suspected I had missed before.
So, I asked the real question I wanted to ask, “How did this start? Why do you do this?”
The interesting thing was that the answer took a bit of time to formulate.
It seems that some three years ago, there were significant problems in the production environment during the most crucial time in the company’s monthly processing. The system crashed hard and it took considerable effort to get it back up and running. The lost revenue was massive.
They decided to avoid a repeat. The result was to make sure that technicians, and the support they needed, were available during the same processing window in which the system had crashed. So each month, after running “extensive” performance tests in the test environment, technicians and managers and experts and support people gather in a large room and monitor the system’s behavior.
At great expense, these people monitored the system each month, every month, for three years.
I blinked, then asked the question that begged to be asked. “Did the problems that originally occurred come back again?”
“Well, no. But because the problem never happened again, they think it is worth the trouble and expense.”
I was not terribly surprised. What did surprise me was the answer to my next question.
“I see all these people recording information and examining behavior in the system. Is anything done with the observations or findings?”
“I don’t know. I know everyone records timings for transactions, but I don’t know if anyone does anything with them. The weird thing is that each time we do this, things slow down starting around 10:00 and get worse as we approach midnight. Sometimes, we get a bunch of strange errors popping up as we get close to midnight. I don’t know if anything is ever done to fix them for the next month.”
Thoughts on This
People more expert in software testing than I am can list all the problems and potential problems in that scenario. What strikes me is that people confuse the relationships between causes, actions and results. No one actually knows what is going on in the system at these times.
If you have an environment that you cannot replicate and you cannot recreate the behavior from one environment in the other, why go through the exercise? If you have system behaviors that are repeated regularly and still are unable to make changes needed to avoid them, why track them?
It strikes me that the real question is beyond the question of monitoring the behavior of the system.
When I asked “the experts” in the room about the differences between environments, the answer was that they exercised the scenarios in the test environment, then extrapolated the results to the production environment. This makes no sense to me. If they were really able to do that, would they run into the same set of problems each month at about the same time and the same level of load?
This logic reminds me of someone wanting to test the performance of a top-end sports car on a race track, but only having the budget for a Yugo, which they would drive as fast as they can around the parking lot and extrapolate the results. It sounds good as far as it goes but does it really work?
Can the testing effort exercise the system adequately to show the performance of the system in the wild? Does it make sense to try?