If you've ever heard me speak, you've probably heard me talking about how we need to stop thinking about concurrent users, except for a very small number of circumstances. However, a few weeks ago I came across a goal metric much, much worse than concurrent users. This customer was still using BroadVision and wanted to achieve a certain number of Active Sessions in BroadVision (shudder).
I wanted to bring up this anecdote during my CMG tutorial today, but I skipped it in the interest of time. If you were there, read and enjoy!
Goal metrics should be realistic. This customer wanted to achieve 110,000 active sessions in BroadVision (BV). They had a 30-minute session timeout, so I had to generate at least 110,000 new sessions every 30 minutes, which works out to 220,000 user sessions per hour (SPH). In addition, they wanted me to ramp this up within 15 minutes. I advised against this approach, saying that a) it was unrealistic to ramp that quickly for their type of application, and b) active sessions was a poor metric to choose. But they were the customer and they wanted things their way. And they got it.
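The arithmetic behind that number is simple enough to sketch. This is only my back-of-the-envelope conversion, assuming every session stays "active" for exactly one timeout window; the function name is made up for illustration.

```python
def required_sessions_per_hour(target_active_sessions: int, timeout_minutes: float) -> float:
    """Sessions per hour needed to start `target_active_sessions` new
    sessions inside one timeout window.

    Assumption: a session counts as active for exactly one timeout window.
    """
    if timeout_minutes <= 0:
        raise ValueError("timeout_minutes must be positive")
    return target_active_sessions * 60.0 / timeout_minutes


# The customer's goal: 110,000 active sessions with a 30-minute timeout.
print(required_sessions_per_hour(110_000, 30))  # 220000.0
```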
I structured a load test that went from 0 to 110,000 SPH in 10 minutes, then from 110,000 to 220,000 SPH in 5 minutes, then held that rate for another 75 minutes. This makes a 90-minute test. Their site was "virtually unusable" within 7 minutes, but they wanted to see 110,000 active sessions. It took a while because users were having trouble getting to the site. Many users were seeing HTTP 503 (Server Too Busy) messages on the home page, some were getting Java stack traces, and others saw HTTP 504 (Gateway Timeout Error) messages. These were in addition to the connection timeouts, GZIP encoding errors, etc. This was occurring from all 9 load generation locations and every Internet backbone. It was really, really ugly. Their application servers were all running at 100%, and some were restarting themselves. Users were abandoning the site in droves.
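For anyone who wants to picture the ramp, here is a rough sketch of that test profile. It assumes each ramp segment is linear, which is my simplification for illustration; `target_sph` is a hypothetical helper, not part of any load testing tool.

```python
def target_sph(minute: float) -> float:
    """Target sessions per hour at a given minute of the 90-minute test.

    Assumption: each ramp segment is linear.
    """
    if minute < 0 or minute > 90:
        raise ValueError("minute must be between 0 and 90")
    if minute <= 10:
        # First ramp: 0 to 110,000 SPH over 10 minutes.
        return 110_000 * minute / 10
    if minute <= 15:
        # Second ramp: 110,000 to 220,000 SPH over 5 minutes.
        return 110_000 + 110_000 * (minute - 10) / 5
    # Hold at 220,000 SPH for the remaining 75 minutes.
    return 220_000.0


for m in (0, 7, 10, 15, 90):
    print(m, target_sph(m))
```

Under that linear assumption, minute 7 corresponds to a target of roughly 77,000 SPH, barely a third of the final rate.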
This customer had 83 minutes of mostly worthless load testing: their site was barely usable from minute 7 on, and all the metrics that were collected said that things were simply terrible. You couldn't get any sense of application server time or application script execution time because some pages outright timed out, and the rest were so slow as to be unbelievable. The funny part is that one of the engineers wanted to tell me that their site wasn't down, that users could still get through if they would only wait 2 or 3 minutes on the home page and refresh their browser any time they saw any HTTP errors or stack traces. I had trouble convincing him that things were really, really bad.
This customer's site eventually reached about 135,000 active sessions, and I think this was because their session clean-up routine wasn't working effectively. They saw that the system didn't "crash", meaning that it wasn't hard down. They didn't want to hear me say that it may as well have been down for all the good it was doing them. Users would probably have gotten such an unfavorable impression of the site that they would never return, and might not even set foot in one of their bricks-and-mortar locations again.
As always, I advocate setting your goal as a rate or throughput. Don't let yourself get fooled into testing to an arbitrary metric, which, if you think hard enough, is probably undesirable anyway.
(On a side note, my biggest piece of advice, beyond doing a better load test, was to reduce their BV session timeout. 20 minutes is more than adequate, especially if your average session duration is only 7 minutes.)
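To see why the timeout matters so much for this metric, here is a rough sketch of how the active session count moves with the timeout at a fixed arrival rate. It assumes a session is held for roughly one timeout window and ignores the session's own duration; the function name is hypothetical.

```python
def approx_active_sessions(sessions_per_hour: float, timeout_minutes: float) -> float:
    """Approximate steady-state active sessions at a fixed arrival rate.

    Assumption: each session is held for about one timeout window.
    """
    return sessions_per_hour * timeout_minutes / 60.0


for timeout in (30, 20):
    print(timeout, round(approx_active_sessions(220_000, timeout)))
```

Same traffic, a third fewer sessions held open, which is one more reason the active session count says little about real load.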