Some of you may have noticed over the past week or so that Basecamp has felt a bit zippier. Good news: it wasn’t your imagination.
Let’s set the stage. Below, I’ve included a chart showing our performance numbers for Monday from four weeks ago. We can see that at the peak usage period between 11 AM and Noon Eastern, Basecamp was handling around 9,000 requests per minute. During the same period, it was responding in around 320 ms on average, or roughly 1/3 of a second. I know quite a few people who would be very pleased with a 320 ms average response time, but I’m not one of them.
June 29, 2009 – 09:00 – 21:00 EDT
For months, we’ve been running our applications on virtualized instances. We have a bunch of Dell 2950 servers, each with 8×2.5 GHz “Harpertown” CPU cores (Intel Xeon L5420) and 32GB of RAM that we use to run our own private compute cloud. A typical Basecamp instance in this environment has 4 virtual CPUs allocated to it and 4GB of RAM. At the time the chart above was created, we had 10 of these instances running.
For some time I’ve wanted to run some tests to see what the current performance of Basecamp on the virtualized instances was versus the performance of dedicated hardware. I finally found the time to run these tests a few weeks ago. We had just ordered a few new Dell 2950 virtualization servers and I decided that I would run my test on one of them before putting it into production in our private cloud.
Similarly, I’ve been curious to see how the newest Intel Xeons, code named “Nehalem”, would perform with a production load. Since we host all our infrastructure with Rackspace it was a pretty simple matter to get a new Dell R710 installed in our racks to include in the testing. The R710 was configured with 8×2.27 GHz “Nehalem” CPU cores (Intel Xeon L5520) and 12GB of RAM.
Once we had the servers in place, I quickly installed a base operating system and configured them to act as Basecamp application servers using our Chef configuration management recipes and put them into production. To make a long story a little less long, we saw some pretty extreme performance improvements from moving Basecamp out of a virtualized environment and back onto dedicated hardware.
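For anyone curious what driving that setup from Chef looks like, here’s a minimal sketch of an app-server recipe. The package, template, and service names here are hypothetical stand-ins, not our actual cookbook.

```ruby
# Hypothetical sketch of a Chef app-server recipe; the package, template,
# and service names are illustrative, not the real Basecamp cookbook.
%w[ruby rubygems].each do |pkg|
  package pkg
end

# Render the app-server config from a template in the cookbook.
template "/etc/app_server/basecamp.yml" do
  source "basecamp.yml.erb"
  owner  "deploy"
  mode   "0644"
  notifies :restart, "service[app_server]"
end

# Make sure the app-server service comes up on boot and is running now.
service "app_server" do
  action [:enable, :start]
end
```

The nice part is that once a recipe like this exists, bringing a fresh box into production is just a matter of bootstrapping it and letting Chef converge.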
The Nehalem vs. Harpertown battle was a bit harder fought, with the R710 delivering roughly a 20-25% response time improvement versus the older 2950 as long as they both still had excess CPU capacity. While that was an interesting number, I suspected that it didn’t tell the whole story, and I wanted to see how they performed when they got close to being saturated. Throughout the day, I increased the load that the R710 and 2950 were handling until I was able to saturate the 2950 and see response times start to degrade rapidly. When I reached that point, the R710 still had roughly 30% of its CPU capacity idle.
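A quick back-of-the-envelope calculation shows what that idle headroom implies, assuming throughput scales roughly linearly with CPU (an approximation, not a measurement):

```ruby
# Back-of-the-envelope headroom estimate. Assumes throughput scales
# roughly linearly with CPU utilization, which is only an approximation.
r710_idle_fraction = 0.30                  # ~30% CPU idle when the 2950 saturated
r710_busy_fraction = 1.0 - r710_idle_fraction

# Total R710 capacity relative to the load that saturated the 2950.
extra_capacity = 1.0 / r710_busy_fraction  # ≈ 1.43

puts format("R710 could handle ~%.0f%% more load than the saturated 2950",
            (extra_capacity - 1.0) * 100)
```

In other words, at the point where the 2950 fell over, the R710 still had room for something like 40% more traffic.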
Armed with these numbers, we decided to get some additional R710 servers and move all of Basecamp’s application server processing to them. A week or so later, that transition was complete, and we ended up with the chart below.
July 27, 2009 – 09:00 – 21:00 EDT
Narrating this chart like I did the one above, we see that the peak traffic was between 10 AM and 11 AM Eastern, with Basecamp handling a little over 11,000 requests per minute. During this time period, the average response time remained a hair under 100 ms, or roughly 1/10 of a second.
We were able to cut response times to about 1/3 of their previous levels even when handling over 20% more requests per minute.
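The arithmetic behind that summary, using the figures quoted from the two charts:

```ruby
# Sanity-check the before/after numbers quoted in the charts above.
before_ms,  after_ms  = 320.0, 100.0    # average response time
before_rpm, after_rpm = 9_000, 11_000   # peak requests per minute

response_ratio = after_ms / before_ms                         # ≈ 0.31, about 1/3
traffic_growth = (after_rpm - before_rpm) / before_rpm.to_f   # ≈ 0.22, over 20%

puts format("response times: ~%.0f%% of previous; traffic: +%.0f%%",
            response_ratio * 100, traffic_growth * 100)
```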
July 27, 2009 vs. June 29, 2009.
We’ll continue to strive to make Basecamp and our other applications as fast as possible. We hope you’re enjoying the performance boost.
Special thanks to New Relic for the absolutely phenomenal RPM performance monitoring tool for Rails applications. It would have been much more difficult for me to run the performance tests I did without it. We use it to constantly monitor our application performance and it hasn’t let us down yet.