Performance tuning is a fun sport, but how you’re keeping score matters more than you think, if winning is to have real impact. When it comes to web applications, the first mistake is to start with what’s easiest to measure: server-side generation times.

In Rails, that’s the almighty X-Runtime header, reported to the 6th decimal of a second for that extra punch of authority. A clear target, easily measured, and in that safe realm of your own code, which makes it appear fully controllable and scientific. But what good is shaving off milliseconds to hit a 50ms internal target, if your shitty (or non-existent!) CDN setup is costing you seconds in New Zealand? Pounds, not pennies, is where the wealth is.

Yet that’s still the easy, level-one part of the answer: Don’t worry too much about your internal performance metrics until you’ve cared enough about the full stack of SSL termination overhead, CDN optimization, JS/CSS asset minimization, and client-side computational overhead. That last one easily catches out people following the “just do a server-side API” approach, since the JSON may well generate in 50ms, but then the client-side computation takes a full second on a below-average device. Doh!
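
If you want a feel for that gap, here’s a minimal sketch. The URL is a placeholder, and this only captures the network side of a single HTML request (not asset loading or client-side work): hit a page the way a client would and compare wall-clock time against what X-Runtime claims.

```ruby
require "net/http"
require "uri"

# Fetch the page and compare total wall-clock time to the server's own
# X-Runtime claim. The difference is everything the internal metric never
# sees: DNS, TLS handshakes, CDN hops, and the wire itself.
uri = URI("https://your-app.example.com/some/page") # placeholder URL

started  = Process.clock_gettime(Process::CLOCK_MONOTONIC)
response = Net::HTTP.get_response(uri)
elapsed  = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

server_time = response["X-Runtime"].to_f # seconds, as reported by Rack::Runtime

puts "Server says:     #{(server_time * 1000).round}ms"
puts "Client waited:   #{(elapsed * 1000).round}ms"
puts "Everything else: #{((elapsed - server_time) * 1000).round}ms"
```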

Level two, once reasonable efforts have been made to trim the fat around the X-Runtime itself, is getting some big numbers up on the board: the mean and the 90th percentile. Those really are great places to start. If your mean is an embarrassing 500ms+, well, then you have some serious, fundamental problems that need fixing, and fixing them will benefit everyone using your app. Get to it.
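
If you’re not already getting these numbers from an APM, they’re easy to compute from whatever runtimes you can scrape out of your logs. A rough sketch, with made-up sample data and a simple nearest-rank percentile:

```ruby
# Given an array of request runtimes in milliseconds (from log scraping,
# your APM's export, whatever you have), the headline numbers are easy.
runtimes = [42, 51, 48, 220, 65, 38, 1800, 55, 90, 47]

# Nearest-rank percentile: the value below which roughly pct% of samples fall.
def percentile(sorted, pct)
  sorted[((pct / 100.0) * sorted.length).ceil - 1]
end

sorted = runtimes.sort
mean   = runtimes.sum.to_f / runtimes.length

puts "mean: #{mean.round}ms"
puts "p90:  #{percentile(sorted, 90)}ms"
puts "p99:  #{percentile(sorted, 99)}ms"
```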

Keep going beyond even the 99th

Just don’t stop there. Neither at the mean nor the 90th. Don’t even stop at the 99th! At Basecamp, we sorta fell into that trap for a while. Our means were looking pretty at around 60ms, the 90th was at 200ms, and even the 99th was a respectable 700ms. Victory, right?

Well, victory for the requests that fell into the 1st to 99th percentile. But when you process about fifty million requests a day, there’s still an awful lot of requests hidden on the far side of the 99th. And there, young ones, is where the dragons lie.

A while back we started shining a light into that cave. And even though I expected there to be dragons, I was still shocked at just how large and plentiful they were at our scale. Just 0.4% of requests took 1-2 seconds to resolve, but that’s still a shocking 200,000 requests when you’re doing fifty million of them a day.

Yet it gets worse. Just 0.0025% of requests took 10-30 seconds, but that’s still a whopping 1,250 requests. While some of those come from API requests that users do not see directly, a fair slice is indeed from real, impatient human beings. That’s just embarrassing! And a far, far away land from that pretty picture painted by the 60ms mean. Ugh.

Finally, there was the true elite: the 0.0001%, a total of 50 instances. Those guys sat and waited between 30 and 60 seconds for their merry request to complete. Triple ugh.
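
Getting at those numbers doesn’t require anything fancy. Here’s a rough sketch of the idea, not our actual tooling: bucket every collected runtime and print raw counts, so a “mere” 0.0025% has nowhere to hide.

```ruby
# Bucket every collected runtime (in milliseconds) and report raw counts.
# The thresholds are arbitrary; the point is that a tiny percentage of
# fifty million requests is still a very big number.
BUCKETS = {
  "under 1s" => 0...1_000,
  "1-2s"     => 1_000...2_000,
  "2-10s"    => 2_000...10_000,
  "10-30s"   => 10_000...30_000,
  "30s+"     => 30_000..Float::INFINITY
}.freeze

def tail_report(runtimes_ms)
  counts = Hash.new(0)
  runtimes_ms.each do |ms|
    label, _range = BUCKETS.find { |_, range| range.cover?(ms) }
    counts[label] += 1
  end
  BUCKETS.each_key { |label| puts "#{label.ljust(8)} #{counts[label]} requests" }
end
```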

Dragon slaying

Since lighting up the cave, we’ve already been pointed to big, obvious holes in our setup that we weren’t looking at before. One simple example was file uploads: We’d stage files in one area, then copy them over to their final resting place as part of the record creation process. That’s no problem when it’s a couple of 10MB audio files, but try that again with twenty 400MB video files: it takes a while! So now we stage straight into the final resting place and cut out the copy process. Voila: lots of dragons dead.
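
In code, the before-and-after looks roughly like this. The paths and helpers are illustrative, not our actual implementation; the point is simply cutting out the redundant copy.

```ruby
require "fileutils"

# The old shape, roughly: write the upload to a staging area, then copy it
# into place during record creation. Fine for a couple of 10MB audio files,
# painful for twenty 400MB videos.
def store_with_staging(uploaded_file, final_dir)
  staged = File.join("/tmp/staging", uploaded_file.original_filename)
  File.binwrite(staged, uploaded_file.read)
  FileUtils.cp(staged, final_dir) # the slow, redundant copy
end

# The fix, in spirit: write straight into the file's final resting place
# and skip the staging hop entirely.
def store_directly(uploaded_file, final_dir)
  destination = File.join(final_dir, uploaded_file.original_filename)
  File.binwrite(destination, uploaded_file.read)
end
```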

There’s still much more work to do. Not just because it sucks for the people who actually hit those monster requests, but also because it’s a real drain on the rest of the system. Maybe it’s an N+1 case that only appears under very special circumstances, but every time that request hits, it’s still an onslaught on the database, and everyone else’s fast queries might well be slowed down as a result.
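
For the Rails-minded, the usual shape of that fix is eager loading. The models here are hypothetical, but the pattern is the standard one:

```ruby
# The N+1 shape: one query for the projects, plus one more per project.
projects = Project.where(account_id: account.id)
projects.each { |project| puts project.todos.count }

# Eager loading collapses it to two queries, so the rare request that pulls
# hundreds of projects stops hammering the database for everyone else.
projects = Project.includes(:todos).where(account_id: account.id)
projects.each { |project| puts project.todos.size }
```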

But it really does also just suck for those who actually have to sit through a 30 second request. It doesn’t really help them very much to know that everyone else is having a good time. In fact, that might just piss them off.

It’s like going to the lobby of your hotel to complain about the cockroaches, only to be met with the smug smile of a desk clerk saying “oh, don’t worry about that, none of our other 499 guests have that problem… just deal with it”. You wouldn’t come back next summer.

So do have a look at the far side of your histogram. And use actual request counts, not just feel-good percentiles.