Fundamentals - load test in a production-like environment
One of the keys to a realistic load test is to test on the same infrastructure and equipment that you will use in production. The corollary is to make sure that any scheduled jobs that run during production peak hours also run during the test, and that jobs that only run at off-peak times do NOT run during the test. On Friday night we were testing a site that started to fail at the 50% load level. Their web servers were not especially busy, and their application servers were only moderately loaded. The culprit? Rsync and other batch jobs running to replicate content and perform routine site maintenance. The system administrators had to kill a number of processes and put the cron jobs on hold. The data for that load level was largely useless.
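As a rough illustration of that corollary, here is a minimal sketch that sorts scheduled jobs into "keep running" and "put on hold" based on whether they overlap the production peak window. The job names, hours, and the Job, hours_covered, and plan_jobs helpers are all hypothetical; a real schedule would come from your own crontab or scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    start_hour: int  # hour of day the job starts (0-23)
    end_hour: int    # hour of day the job ends, exclusive (0-24)


def hours_covered(start, end):
    """Return the set of clock hours in [start, end), wrapping past midnight."""
    if start <= end:
        return set(range(start, end))
    return set(range(start, 24)) | set(range(0, end))


def plan_jobs(jobs, peak_start, peak_end):
    """Split jobs into those to keep running during the test and those to hold.

    A job is kept if it overlaps the production peak window at all,
    because the test is meant to simulate that window.
    """
    peak = hours_covered(peak_start, peak_end)
    keep, suspend = [], []
    for job in jobs:
        if hours_covered(job.start_hour, job.end_hour) & peak:
            keep.append(job.name)
        else:
            suspend.append(job.name)
    return keep, suspend


if __name__ == "__main__":
    # Hypothetical schedule, for illustration only.
    jobs = [
        Job("catalog_refresh", 0, 24),  # runs all day, including peak
        Job("nightly_rsync", 23, 3),    # off-peak, wraps past midnight
        Job("log_rotation", 2, 3),      # off-peak
    ]
    keep, suspend = plan_jobs(jobs, peak_start=10, peak_end=20)
    print("keep running during the test:", keep)
    print("put on hold during the test:", suspend)
```

With this made-up schedule and an assumed 10:00 to 20:00 peak, only the all-day catalog refresh stays on; the rsync and log rotation jobs would be held until the test is over.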
This test was interesting from another point of view. The total bandwidth from the load agents was about 100 megabits per second. 100M is a magic number for anyone looking at the network, since it suggests a saturated 100M link. However, total bandwidth was simply a red herring for this test. Indeed, the customer had multiple 100M pipes to their datacenter. The culprit was that the CPUs on both their web servers and application servers were pegged at 100%. Their operations team could not find any background processes that were running or consuming extra CPU. In fact, once the nightly processing jobs were cancelled, the test was fairly normal as load tests go.
Community Stress Testing
Lastly, a note about running production jobs during a load test: if you have processes that run during the day when you are experiencing peak traffic, you need to keep these jobs running during the test. Doing so does not pollute the data; it is what makes the data correct. I have retail customers who refresh their product database and descriptions every 30 minutes, even during peak periods. These full or partial updates are a business requirement. Turning them off during a test invalidates the test. Moreover, it is important to have other background traffic going on. For example, if you are a retail web site that also has a bricks-and-mortar component, and you share network connections for store support or back-end systems, then you owe it to yourself to simulate that load in addition to the application that you are planning to stress. For clients with a large number of applications, we call this a “community stress test”. Each application may perform perfectly in isolation, but the entire system can come tumbling down when stressed as a group. Application interaction under stress can be, and often is, completely unpredictable.
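To make the idea of a community stress test a little more concrete, here is a minimal sketch of how the combined load for one test step might be planned. It assumes that the applications under test are scaled to the chosen load level while background traffic (store support, catalog refreshes) stays at its normal production rate. All names and rates are hypothetical, and community_plan is not part of any load testing tool.

```python
def community_plan(app_peak_rps, background_rps, level_percent):
    """Return the target request rate per traffic source for one test step.

    app_peak_rps:   expected production peak rate per application under test
    background_rps: background traffic that runs at its normal rate regardless
                    of the load level (an assumption of this sketch)
    level_percent:  load level for this step, e.g. 50 for the 50% step
    """
    if level_percent <= 0:
        raise ValueError("level_percent must be positive")
    # Scale every application under test to the requested load level.
    plan = {name: peak * level_percent / 100
            for name, peak in app_peak_rps.items()}
    # Background sources are added unscaled; names are assumed not to collide.
    plan.update(background_rps)
    return plan


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    apps = {"storefront": 200, "checkout": 40}
    background = {"store_support": 15}
    plan = community_plan(apps, background, level_percent=50)
    for source, rate in plan.items():
        print(f"{source}: {rate} requests/sec")
    print("combined:", sum(plan.values()), "requests/sec")
```

The point of planning it this way is that every step of the test carries the whole community of traffic, not just the one application you happen to care about.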
Customer or Server Pain
My latest saying is that as a business, you should worry about your customers feeling pain, not your servers. Test to levels that are as realistic as possible, not to the levels you believe your servers already support.

