We rely way too much on averages to run our business. We have our average response time, our average conversion rate, our average lifetime value per customer, and a thousand other averages.

The problem with averages is that they tell you nothing about the actual incidents, and they often give you a misleading big picture.

Our average response time for Basecamp right now is 87ms, says New Relic. That sounds fantastic, doesn’t it? And it easily leads you to believe that all is well and that we don’t need to spend any more time optimizing performance.

Wrong. That average number is completely skewed by tons of super-fast responses to feed requests and other cached replies. If you have 1,000 requests that return in 5ms, you can have another 200 requests taking 1,000ms and still get a respectable 171ms average. Useless.
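A quick sketch of that arithmetic in Ruby (using 1,000 requests at 5ms and 200 slow requests at 1,000ms as the assumed sample):

```ruby
# 1,000 fast requests plus 200 slow ones still average out to a
# "respectable" number that hides the slow tail entirely.
fast = Array.new(1000, 5)    # 1,000 requests at 5ms
slow = Array.new(200, 1000)  # 200 requests at 1,000ms
times = fast + slow

mean = times.sum.to_f / times.size
puts "mean: #{mean.round(1)}ms"                              # ~170.8ms -- looks fine
puts "over 1s: #{(slow.size * 100.0 / times.size).round(1)}%" # yet 16.7% of requests take a full second
```

The mean lands near 171ms even though one in six requests takes a full second.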

What we need instead are histograms. That way you can pick off clusters much easier and decide whether you want to deal with them or not. Outliers are given a more appropriate weight and you’re more likely to make good decisions from the data.

P.S.: Listing the standard deviation helps very little when there’s great variability. When some of your requests take 5ms and others take 5,000ms, the standard deviation is not of much use.

David wrote this on Aug 06 2009
There are 34 comments.

Bil Kleb

on 06 Aug 09

Amen.

Chad Fowler

on 06 Aug 09

Nice point.

This is where it might be useful to look into statistical control charts. Those take into account the average, the standard deviation, and whether or not a process is trending out of an acceptable range (with some simple heuristics).

Nate

on 06 Aug 09

I assume you guys spare no expense on New Relic and see the transaction traces that we see on the Silver plan. Aren’t there any of your 200 2s+ controller actions in there that you can see are occurring in bunches?

terry Heath

on 06 Aug 09

The standard deviation on the example you gave is huge.

I’m not sure if NR offers that, but I agree: means without standard deviations are useless.

Allen Pike

on 06 Aug 09

Histograms are a complex metric that is hard to track progress on. Amazon puts a lot of focus on a simple metric: the response time at the 99th percentile. They ask the question, “How bad are the worst customer experiences?” since this is what makes people give up or get frustrated.

MI

on 06 Aug 09

Nate: Yup, we are Gold subscribers and we love New Relic. That data, along with the Apdex, helps to give an idea of the outliers but a histogram would be really nice to have as well.

Terry: Yes, New Relic does provide the standard deviation.

Allen: Yeah, Apdex is the metric we’re really trying to move these days rather than response time since our average response times are so low.

Derek

on 06 Aug 09

I have had this same thought about divorce rates. Supposedly 40% of marriages end in divorce. Really? And how many of those people have been divorced multiple times, how many were divorced in the first six months, and how many people live together forever and never get married?

That stat of 40% ending in divorce really scares people off of marriage. But in a lot of cases I’m not sure the stat really applies directly.

Averages and stats are good indicators but not terribly good explanations.

Dan W

on 06 Aug 09

@David

I read last week’s post about how Basecamp is now a lot faster, but for some reason the site actually appears much slower to me.

I’m on a T3 connection, and within my project page it can take a good 2-3 seconds for the page to render, which “feels” much longer than it did in the past.

I’m curious to know if New Relic is measuring the right things.

Marcin

on 06 Aug 09

Histograms are kinda big, and it’s hard to do them for data over time. Box plots, however, work great, especially if you use them on standard deviations, not quartiles.

Joe Mako

on 06 Aug 09

If you want to look at things like the mean, mode, standard deviation, min, max, quartiles, or percentiles, and display the data with consideration for date/time/weekday/month groupings, box plots would be something to look into.

Tim

on 06 Aug 09

When DID using averages ever become okay?

It’s like Goldman Sachs: sure, the “average” salary at Goldman is $600k, but the standard deviation is huge.

Ben Atkin

on 06 Aug 09

Even histograms could be improved by excluding data that’s not in the context you’re interested in. I think that for the purpose of measuring UX, a histogram with RSS feed response times excluded would be better than one with them included.

Tyler

on 06 Aug 09

At the risk of being skewered for bringing Zed Shaw into the conversation, he’s got some good points on this subject:

http://www.zedshaw.com/essays/programmer_stats.html

Mark Glossop

on 06 Aug 09

If you just need to measure a number, rather than carrying out a visual inspection of a histogram, then calculating the statistical kurtosis of your response times could be of value. The skewness may also be of use, but not as much as the kurtosis for the analysis requested here.

See skewness and kurtosis – in particular: “Higher kurtosis means more of the variance is due to infrequent extreme deviations, as opposed to frequent modestly-sized deviations.”

If you sample your response times over a given period, and calculate the kurtosis for that period, you then have a measurement of not just the variance, but a measure of how strongly the variance is due to excessively large outliers.

It would also be a lot easier to plot the calculated kurtosis over a longer period of time: any systematic change in the kurtosis would probably indicate some issue [either a performance issue or an improvement in functionality.]

Statistics can be used to “prove” many things, but that reputation isn’t helped by those not using statistical analyses and measurements correctly. FTR I hate statistics with a passion :-) But I know when/where stats can be of value.
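As a sketch of Mark’s suggestion (using population moments on made-up samples), excess kurtosis is the fourth central moment over the squared variance, minus 3:

```ruby
# Excess kurtosis: grows when variance comes from rare extreme
# outliers rather than frequent modest deviations.
def excess_kurtosis(xs)
  n    = xs.size.to_f
  mean = xs.sum / n
  m2   = xs.sum { |x| (x - mean)**2 } / n   # variance
  m4   = xs.sum { |x| (x - mean)**4 } / n   # fourth central moment
  m4 / m2**2 - 3
end

steady   = [90, 95, 100, 105, 110] * 200  # frequent, modest deviations
outliers = [100] * 999 + [5000]           # one extreme outlier

puts excess_kurtosis(steady).round(2)     # negative: flat, well-behaved
puts excess_kurtosis(outliers).round(2)   # large positive: outlier-driven
```

A single number like this is easy to plot over time, which is exactly the monitoring use case described above.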

David Hobbs

on 06 Aug 09

Agreed. I think the bottom line is to think about which metrics reflect the user experience (especially for the users having issues!). An old related blog post of mine: Client-centric metrics.

Rick G.

on 06 Aug 09

The reality of the issue at hand is ensuring that all of your users have reasonably fast access. That’s the trigger point for action, and your metric for monitoring performance should reflect it. Instead of using an average, even combined with a measure of variance, consider using a threshold metric: if a certain number or percentage of your requests exceed an acceptable limit (or of the users making those requests, which is perhaps the bigger concern), that should be the trigger which throws the warning.

And if you know that page-cached requests will skew the figures, consider removing all requests which come in under a certain threshold, say 20ms, as a proxy for isolating non-cached requests.

Dan Kjaergaard

on 06 Aug 09

You guys should read “The Black Swan” by Nassim Taleb. Seriously!

The damages caused by analyzing and relying on averages can have far greater impact than most are willing to admit.

George Feil

on 06 Aug 09

Excellent article, and your point is well taken.

Histogram charts in New Relic RPM are an idea that has been bounced around. The tricky bit is implementing them without exploding the amount of data reported by the agents.

Apdex helps to solve this problem somewhat, as it gives more weight toward the dissatisfying actions. But a histogram would provide a finer granularity.

Dan Tylenda-Emmons

on 06 Aug 09

Agreed. This is why averages shouldn’t be used by doctors when treating patients, especially in the NICU. I’m experiencing that problem with my newborn daughter Ruby (yes, named after the language).

Lew Cirne

on 06 Aug 09

Point well taken, David. I agree that average response time can be deceiving if it’s the only customer-experience indicator you use. Apdex is our first step, and an important one, towards providing more meaningful metrics for measuring application performance and the customer experience in a more complete and actionable way. As you do at 37signals, we always try to build into our product as much value with the least amount of complexity. So when we look at something like histograms, we ask questions to make sure our customers will get good information without a lot of complexity: how many buckets should we support? Should the user be able to provide the threshold settings for each bucket? How will the user be able to pre-determine what those threshold settings should be? Stuff like that.

We like Apdex because it simplifies the problem down to one customizable setting: what is a satisfactory response time for the app? They call this setting T. Then it breaks down all transactions into 3 categories (or buckets): Satisfied (<= T), Tolerating (> T and <= 4T), and Frustrated (> 4T, plus errors). (For those who want to learn more about Apdex scores in New Relic, here is a short video: http://newrelic.com/demos/apdex.html)

The question is: what increased value will arbitrary specification of thresholds and buckets buy our typical (never “average” ;) customer, and is it worth the added user-interface complexity? The question is open, and we will continue to strive for the right balance. 37signals sets a great example of finding the right depth of functionality and configurability to satisfy the majority of its target users and use cases. We strive to do the same! Thanks!

Lew Cirne, New Relic
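For reference, a minimal sketch of the standard Apdex arithmetic, score = (satisfied + tolerating / 2) / total, with T the satisfactory-response threshold and the sample times made up:

```ruby
# Apdex score from a list of response times (ms) and a threshold T (ms).
def apdex(times_ms, t_ms)
  satisfied  = times_ms.count { |ms| ms <= t_ms }
  tolerating = times_ms.count { |ms| ms > t_ms && ms <= 4 * t_ms }
  # Anything above 4T counts as frustrated and contributes nothing.
  (satisfied + tolerating / 2.0) / times_ms.size
end

times = [50] * 800 + [300] * 150 + [3000] * 50  # ms
puts apdex(times, 100)  # => 0.875
```

One threshold, one score, yet the frustrated bucket still drags the number down in a way a raw average would not.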

Will Leinweber

on 06 Aug 09

“P.S.: Listing the standard deviation helps very little when there’s great variability. When some of your requests take 5ms and others take 5,000ms, the standard deviation is not of much use.”

Standard deviation is a measure of variance. It lets you know when the mean is a good measure of the distribution and when it isn’t.
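A sketch of the 5ms/5,000ms case from the post, with made-up counts, shows what the standard deviation does and doesn’t tell you here:

```ruby
# On a bimodal mix, the standard deviation dwarfs the mean: it flags
# that the mean is unrepresentative, but not why.
times = Array.new(900, 5) + Array.new(100, 5000)

mean     = times.sum.to_f / times.size
variance = times.sum { |t| (t - mean)**2 } / times.size
sd       = Math.sqrt(variance)

puts "mean: #{mean.round(1)}ms, sd: #{sd.round(1)}ms"
# sd (~1498.5ms) is nearly three times the mean (~504.5ms): huge spread,
# but no hint whether it's one slow endpoint or many.
```

Which supports both sides: the deviation does signal that the mean is useless, but only a histogram says where the time actually goes.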

Andrew Banks

on 07 Aug 09

If you’re a statistician, there are actually three ways to “average.”

The Mean. Add up all the numbers. Divide the total by the number of samples. This is what most people mean (ahem, no pun intended) when they say the “average”.

The Median. Don’t add up the numbers. Line them all up, in order. Go to the middle of the line and pick that number.

The Mode. Again, don’t add up the numbers. Add up how many times each number happens. Pick the number that happens most often.

Each has its uses. For your response times example, I would first round your samples to the nearest tenth of a second. Then find the Mode.
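The three “averages” above, sketched on a made-up response-time sample (values in ms):

```ruby
times = [5, 5, 5, 5, 80, 80, 2000]

mean   = times.sum.to_f / times.size
sorted = times.sort
median = sorted[sorted.size / 2]  # middle element (odd-count shortcut)
mode   = times.tally.max_by { |_value, count| count }.first

puts "mean: #{mean.round(1)}, median: #{median}, mode: #{mode}"
# mean ~311.4 is dragged up by the one slow request; median and mode
# both sit at 5ms, down in the fast cluster.
```

Note that the median and mode land on the fast cached responses here, which is exactly the skew Andrew concedes in his follow-up.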

Andrew Banks

on 07 Aug 09

Eh, I’m an idiot. Even the mode would be skewed by hundreds of short responses.

I guess you just have to start with, what are my longest response times, and do they happen rarely or a lot?

Kenneth Corrêa

on 07 Aug 09

Listen to Mr. Dan Kjaergaard: you should really listen to Mr. Taleb’s talk, or even better, read his book, The Black Swan.

Rob Chanter

on 07 Aug 09

Heat maps, a la Sun’s Fishworks, are an even better alternative to histograms. You get histogram-like breakdowns over time, so you can see the effects of things like cache ramp-up. See for example some of Brendan Gregg’s work at http://blogs.sun.com/brendan/category/Fishworks. There’s no way you could get that sort of info with averaging tools like sar or iostat.

Lonny Eachus

on 07 Aug 09

You seem to have forgotten that there are other kinds of “average”. In cases like the ones cited, the median and even the mode can relay a lot of useful information, without displaying a lot of meaningless data to wade through.

Berserk

on 07 Aug 09

Both the mode and the median will be heavily skewed by the massive number of fast responses. I would not be surprised if both the mode and the median are lower than the mean in cases like these (with severe positive skewness).

While some sort of variation analysis could be fun to do, I think I lean toward the percentile camp (of which the median is a special case, though too far to the “left” to be of real use here, imho). I don’t think one should stick to just one percentile; if I did this I would probably choose the 75th, 90th, 95th, 99th, and 99.5th percentiles (since 1% is too big for 37s :)) or something like that.
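A sketch of that percentile report, using the nearest-rank method (one of several common definitions) on a made-up sample of 950 fast and 50 slow requests:

```ruby
# Nearest-rank percentile: the value at rank ceil(p/100 * n).
def percentile(sorted, p)
  sorted[((p / 100.0) * sorted.size).ceil - 1]
end

times = (Array.new(950) { |i| 5 + i % 20 } + Array.new(50, 2000)).sort

[75, 90, 95, 99, 99.5].each do |p|
  puts "p#{p}: #{percentile(times, p)}ms"
end
# The lower percentiles stay in the 20-25ms range; p99 and p99.5
# jump to 2000ms, exposing the slow tail the mean would hide.
```

The jump between p95 and p99 is the whole story: that is where the 5% of slow requests live.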

ssp

on 07 Aug 09

I haven’t been to this blog for a WHILE and I was pretty freaked out.

This is a screengrab of what I saw and how I interpreted it.

Would it kill you guys to squeeze the year in there?

Eamon

on 07 Aug 09

We use percentiles instead of averages for just about everything. Perl’s Statistics::Descriptive makes it super simple. Here’s a one-liner:

sar | cut -c27-31 | grep '\.' | perl -MStatistics::Descriptive -e 'my $stat = Statistics::Descriptive::Full->new(); $stat->add_data(<>); for (50, 75, 90, 95, 99) { printf("%d%% were under %.2f\n", $_, $stat->percentile($_)); }'

This produces output like so:

50% were under 8.96

75% were under 9.85

90% were under 10.33

95% were under 12.28

99% were under 15.24

Tor Løvskogen Bollingmo

on 08 Aug 09

You don’t want faster respond time, you want faster respond times ;-)

Bill Kayser

on 09 Aug 09

We would all be a lot better off if we relied on the geometric mean instead of the arithmetic mean. If you study the distributions of response times of web applications you’ll discover very quickly they never fit a normal distribution.

A geometric mean is much more appropriate for the long tail distributions you see in response time histograms. Unlike the traditional arithmetic mean it does not get skewed by outliers.

And the arithmetic standard deviation is practically meaningless on this kind of data. In nearly every graph I’ve looked at, the deviation is greater than the mean.

A median value would also give a much more meaningful representation of the user experience, since it too is impervious to outliers, but it’s an expensive piece of data to collect.

I recommend this paper by David Ciemiewicz which lays out a pretty strong case why mean is misleading and geometric mean is much more useful.

http://bit.ly/teBRz

Just to be clear, I also agree that a histogram is the ultimate answer to understanding the user experience.
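A sketch contrasting the two means on a made-up long-tail sample; the geometric mean is the exponential of the mean of the logs, so outliers enter multiplicatively rather than additively:

```ruby
# Arithmetic vs geometric mean on a long-tail sample: 990 fast
# requests plus 10 extreme outliers.
times = [5.0] * 990 + [5000.0] * 10

arith = times.sum / times.size
geo   = Math.exp(times.sum { |t| Math.log(t) } / times.size)

puts "arithmetic: #{arith.round(1)}ms"  # dragged up to ~55ms by 1% of requests
puts "geometric:  #{geo.round(1)}ms"    # stays near the typical 5ms
```

Ten outliers in a thousand move the arithmetic mean by an order of magnitude while barely nudging the geometric mean.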

David Lifson

on 09 Aug 09

When I was at Amazon.com, we asked, “What is the latency for the worst 0.1% of calls? How about the worst 1%? 10%?” We called these the tp99.9, tp99, and tp90, respectively.

We almost exclusively ignored the tp50, because it was always good. We used the tp99.9 most frequently, as that was a great pointer to trouble spots.

Another way to look at it is “percentage of calls above x”. So the u1000 is the percentage of calls that took longer than 1000ms.
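A sketch of that “percentage of calls above x” metric on made-up numbers:

```ruby
# uX metric: the percentage of calls slower than a threshold, so u1000
# is the share of calls that took longer than 1,000ms.
def u_metric(times_ms, threshold_ms)
  100.0 * times_ms.count { |ms| ms > threshold_ms } / times_ms.size
end

times = [50] * 970 + [1500] * 30
puts "u1000: #{u_metric(times, 1000)}%"  # => 3.0%
```

Like the tp-style percentiles, this is a single trackable number, but framed as “how many users are hurting” rather than “how slow is the tail.”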

Ben Darlow

on 10 Aug 09

Why are standard deviations “of little use”? This is precisely the kind of scenario that makes standard deviations useful. The important takeaway from having “some of your requests take 5ms and others take 5,000ms” is in knowing just what proportion of those requests are tending towards the higher number.

Sigh. Yet another 37signals post that dismisses something which has been established in another field for centuries.

Axie

on 10 Aug 09

What you need is to quantify uncertainty. What you need are Monte Carlo simulation methods.

This has been a solved problem since the 1940s.

This discussion is closed.

About David

Creator of Ruby on Rails, partner at 37signals, best-selling author, public speaker, race-car driver, hobbyist photographer, and family man.
