Performance tuning is a fun sport, but how you’re keeping score matters more than you think, if winning is to have real impact. When it comes to web applications, the first mistake is to start with what’s easiest to measure: server-side generation times.
In Rails, that’s the almighty X-Runtime header, reported to the 6th decimal of a second for that extra punch of authority. A clear target, easily measured, and in that safe realm of your own code to make it appear fully controllable and scientific. But what good is shaving off milliseconds for a 50ms internal target, if your shit (or non-existent!) CDNs are costing you seconds in New Zealand? Pounds, not pennies, is where the wealth is.
Yet that’s still the easy, level one, part of the answer: Don’t worry too much about your internal performance metrics until you’ve cared enough about the full stack of SSL termination overhead, CDN optimization, JS/CSS asset minimization, and client-side computational overhead. The latter easily catches out people following the “just do a server-side API” mantra: the JSON may well generate in 50ms, but then the client-side computation takes a full second on a below-average device. Doh!
Level two, once reasonable efforts have been made to trim the fat around the X-Runtime itself, is getting some big numbers up on the board: Mean and the 90th percentile. Those really are great places to start. If your mean is an embarrassing 500ms+, well, then you have some serious, fundamental problems that need fixing, which will benefit everyone using your app. Get to it.
Keep going beyond even the 99th
Just don’t stop there. Neither at the mean nor the 90th. Don’t even stop at the 99th! At Basecamp, we sorta fell into that trap for a while. Our means were looking pretty at around 60ms, the 90th was 200ms, and even the 99th was a respectable 700ms. Victory, right?
Well, victory for the requests that fell into the 1st to 99th percentile. But when you process about fifty million requests a day, there’s still an awful lot of requests hidden on the far side of the 99th. And there, young ones, is where the dragons lie.
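To make that concrete, here is a minimal Ruby sketch of how a mean and a couple of percentiles can look fine while the tail goes unnoticed. The sample durations are made up for illustration (they are not Basecamp's numbers), and `percentile` is a hypothetical helper using the simple nearest-rank method, which is only one of several ways to define a percentile.

```ruby
# Nearest-rank percentile over an already sorted array of durations (seconds).
def percentile(sorted, pct)
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

def mean(values)
  values.sum / values.length.to_f
end

# Hypothetical sample: 1,000 requests, most of them fast, one very slow.
durations = (Array.new(900, 0.05) + Array.new(90, 0.2) + Array.new(9, 0.7) + [45.0]).sort

p99 = percentile(durations, 99)
beyond_p99 = durations.count { |seconds| seconds > p99 }

puts format("mean: %.3fs", mean(durations))
puts format("p90:  %.2fs", percentile(durations, 90))
puts format("p99:  %.2fs", p99)
puts format("max:  %.2fs", durations.last)
puts "requests slower than p99: #{beyond_p99}"
```

In this made-up sample the 99th percentile reports 200ms, while ten requests sit beyond it and one of them took 45 seconds. None of the summary numbers except the max would tell you that.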
A while back we started shining the light into that cave. And even while I expected there to be dragons, I was still shocked at just how large and plentiful they were at our scale. Just 0.4% of requests took 1-2 seconds to resolve, but that’s still a shocking 200,000 requests when you’re doing those fifty million requests.
Yet it gets worse. Just 0.0025% of requests took 10-30 seconds, but that’s still a whopping 1,250 requests. While some of those come from API requests that users do not see directly, a fair slice is indeed from real, impatient human beings. That’s just embarrassing! And a far, far away land from that pretty picture painted by the 60ms mean. Ugh.
Finally, there was the true elite: The 0.0001%, for a total of 50 instances. Those guys sat and waited between 30 and 60 seconds for their merry request to complete. Triple ugh.
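The arithmetic behind those counts is trivial, which is sort of the point: anyone can turn a feel-good percentage into a body count. A quick Ruby sketch, using the fifty million daily requests and the shares quoted above (the constant and method names are just illustrative; `Rational` is used so the tiny percentages don't pick up floating-point noise):

```ruby
DAILY_REQUESTS = 50_000_000

# Share of daily requests (in percent) per latency band, as quoted above.
SLOW_BANDS = {
  "1-2s"   => "0.4",
  "10-30s" => "0.0025",
  "30-60s" => "0.0001"
}.freeze

# Convert a percentage (given as a string) into an absolute request count.
def requests_in_band(percent, total)
  (Rational(percent) * total / 100).to_i
end

SLOW_BANDS.each do |band, percent|
  count = requests_in_band(percent, DAILY_REQUESTS)
  puts "#{band}: #{percent}% of requests = #{count} requests a day"
end
```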
Dragon slaying
Since lighting the cave, we’ve already been pointed to big, obvious holes in our setup that we weren’t looking at before. One simple example was file uploads: We’d stage files in one area, then copy them over to their final resting place as part of the record creation process. That’s no problem when it’s a couple of 10MB audio files, but try that again with twenty 400MB video files. It takes a while! So now we stage straight in the final resting place, and cut out the copy process. Voilà: Lots of dragons dead.
There’s still much more work to do. Not just because it sucks for the people who actually hit those monster requests, but also because it’s a real drain on the rest of the system. Maybe it’s an N+1 case that really only appears under very special circumstances, but every time the request hits, it’s still an onslaught on the database, and everyone else’s fast queries might well be slowed down as a result.
But it really does also just suck for those who actually have to sit through a 30-second request. It doesn’t really help them very much to know that everyone else is having a good time. In fact, that might just piss them off.
It’s like going to the lobby of your hotel to complain about the cockroaches, and then seeing the smug smile of the desk clerk saying “oh, don’t worry about that, none of our other 499 guests have that problem… just deal with it”. You wouldn’t come back next summer.
So do have a look at the far side of your histogram. And use actual request counts, not just feel-good percentiles.
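If you want a starting point, here is a rough Ruby sketch of that kind of histogram: bucket the request durations and print the actual count next to the percentage. The bucket boundaries, the `bucket_counts` helper, and the sample durations are all assumptions for illustration. In a real app you'd feed it durations pulled from your logs or metrics store.

```ruby
# Upper bound (seconds, exclusive) and label for each bucket.
# The last bucket catches everything else.
BUCKET_LIMITS = [
  [0.2, "under 200ms"],
  [1.0, "200ms-1s"],
  [2.0, "1-2s"],
  [10.0, "2-10s"],
  [30.0, "10-30s"],
  [Float::INFINITY, "30s and up"]
].freeze

# Count how many durations fall into each bucket, keeping empty buckets visible.
def bucket_counts(durations)
  counts = BUCKET_LIMITS.to_h { |_, label| [label, 0] }
  durations.each do |seconds|
    _, label = BUCKET_LIMITS.find { |limit, _| seconds < limit }
    counts[label] += 1
  end
  counts
end

# Hypothetical request durations in seconds.
sample = [0.05, 0.06, 0.07, 0.15, 0.7, 1.4, 12.0, 45.0]

counts = bucket_counts(sample)
counts.each do |label, count|
  share = 100.0 * count / sample.length
  puts format("%-12s %6d requests (%.1f%%)", label, count, share)
end
```

Printing the raw count first is deliberate: at fifty million requests a day, a rounding error of a percentage is still a crowd of people.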