In discussions on our posts about A/B testing the Highrise home page, a number of people asked about sample size and how long to run a test for. It’s a good question, and one that’s important to understand. Running an A/B test without thinking about statistical confidence is worse than not running a test at all—it gives you false confidence that you know what works for your site, when the truth is that you don’t know any better than if you hadn’t run the test.
There’s no simple answer or generic “rule of thumb” that you can use, but you can very easily determine the right sample size to use for your test.
What drives our needed sample size?
There are a few concerns that drive the sample size required for a meaningful A/B test:
1) We want to be reasonably sure that we don’t have a false positive—that there is no real difference, but we detect one anyway. Statisticians call this Type I error.
2) We want to be reasonably sure that we don’t miss a positive outcome (or get a false negative). This is called Type II error.
3) We want to know whether a variation is better, worse or the same as the original. Why do we want to know the difference between worse vs same? I probably won’t switch from the original if the variation performs worse, but I might still switch even if it’s the same—for a design or aesthetic preference, for example.
What not to do
There are a few “gotchas” that are worth watching out for when you start thinking about the statistical significance of A/B tests:
1) Don’t look at your A/B testing tool’s generic advice that “about 100 conversions are usually required for significance”. Your conversion rate and desired sensitivity determine the sample size you actually need, and A/B testing tools are always biased toward making you think you have significant results as quickly as possible.
2) Don’t continuously test for significance as your sample grows, or blindly keep the test running until you reach statistical significance. Evan Miller wrote a great explanation of why you shouldn’t do this, but briefly:
- If you stop your test as soon as you see “significant” differences, you might not have actually achieved the outcome you think you have. As a simple example, imagine you have two coins, and you think they might be weighted. If you flip each coin 10 times, you might get heads on one all of the time, and tails on the other all of the time. If you run a statistical test comparing the proportion of flips that came up heads between the two coins after these 10 flips, you’ll get what looks like a statistically significant result—if you stop now, you’ll think they’re weighted heavily in different directions. If you keep going and flip each coin another 100 times, you might now see that they are in fact balanced coins and there is no statistically significant difference in the number of heads or tails. (The simulation sketch after this list shows how much this kind of repeated peeking inflates the false positive rate.)
- If you keep running your test forever, you’ll eventually reach a large enough sample size that a 0.00001% difference tests as significant. This isn’t particularly meaningful, however.
3) Don’t rely on a rule of thumb like “16 times your standard deviation squared divided by your sensitivity squared”. Same thing with the charts you see on some websites that don’t make their assumptions clear. It’s better than a rule of thumb like “100 conversions”, but the math isn’t so hard that it’s worth skipping over, and you’ll gain an understanding of what’s driving the required sample size in the process.
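To see the peeking problem from the second point above in action, here’s a minimal simulation sketch of my own (not from any A/B testing tool): two perfectly fair coins are compared with a significance test after every 10 flips, stopping at the first “significant” result. Because there is no real difference, every “significant” result is a false positive, and repeated peeking produces one far more often than the nominal 5% of the time.

# Illustration of how repeatedly checking for significance and stopping early
# inflates the false positive rate well beyond the nominal 5%.
set.seed(42)
peek_until_significant <- function(total_flips = 200, check_every = 10) {
  a <- rbinom(total_flips, 1, 0.5)  # two fair "coins": no real difference exists
  b <- rbinom(total_flips, 1, 0.5)
  for (n in seq(check_every, total_flips, by = check_every)) {
    p <- suppressWarnings(prop.test(c(sum(a[1:n]), sum(b[1:n])), c(n, n))$p.value)
    if (!is.na(p) && p < 0.05) return(TRUE)  # stop at the first "significant" peek
  }
  FALSE
}
mean(replicate(1000, peek_until_significant()))  # typically well above 0.05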
How to calculate your needed sample size
Instead of continuously testing or relying on generic rules of thumb, you can calculate the needed sample size and statistical significance very easily. For simplicity, I’ve assumed you’re doing an A vs B test (two variations), but this same approach can be scaled for other things.
1) Specify the outcome you’re trying to measure. We typically measure conversion to signup as the primary measure, but depending on what you’re testing, it might be button clicks, newsletter signups, etc. In almost every case, you’ll be measuring a proportion—e.g., the portion of landing page visitors who complete signup, or the portion of landing page visitors who sign up for a newsletter.
2) Decide how substantial a difference you’d like to detect – this is the sensitivity of the test. I generally target an A/B test with a statistically meaningful sample size that detects a 10% difference in conversion rate (e.g., to detect 11% vs. 10% conversion rate). This is a somewhat arbitrary decision you’ll have to make—testing a reasonably large difference will help to make sure you don’t spend forever testing in a local minimum, but instead that you are moving on to test potentially bigger changes. Jesse Farmer has a great article on balancing speed vs. certainty in A/B testing.
3) Calculate the required sample size based on your baseline conversion rate and your desired sensitivity. Since we’re dealing with proportions, we want to perform a simple statistical analysis called a “power analysis for two independent proportions”. Let’s break this down:
- power analysis – a statistical tool to determine the minimum sample size required so that you can be reasonably confident that you are detecting meaningful differences between two values.
- two independent – since we fully separate visitors (they see only the A or only the B variant), our test is nominally independent; the results for variation A aren’t based on the results for variation B.
- proportions – we’re comparing conversion rates, which are a proportion.
Virtually any statistical programming tool will let you do this, and there are free graphical tools (both native and web-based) that will do it if you search for the term above. I’ll use R to demonstrate how to calculate sample size, but the general principles apply to any tool.
The function in R we will use is power.prop.test:
power.prop.test(n = NULL, p1 = NULL, p2 = NULL, sig.level = 0.05, power = NULL, alternative = c("two.sided", "one.sided"), strict = FALSE)
We’re going to leave n null, since that’s what we’re solving for. p1 and p2 are set based on our baseline conversion level (10% in our example) and the sensitivity we’re trying to detect (a 10% difference vs. baseline conversion, or 11% in our example). We want a two-sided alternative, because we’re interested in testing whether the variation is either higher or lower than the original.
sig.level (significance level) and power are a little bit more complicated to explain, but briefly:
- Significance level governs the chance of a false positive. A significance level of 0.05 means that there is a 5% chance of a false positive. As Wikipedia puts it, “choosing level of significance is an arbitrary task, but for many applications, a level of 5% is chosen, for no better reason than that it is conventional.”
- Statistical power governs the chance of a false negative. A power of 0.80 means that there is an 80% chance that, if there were a real effect, we would detect it (or a 20% chance that we’d miss the effect). Again, Wikipedia has wisdom about what power to pick—”...there are no formal standards for power…most researchers assess the power of their tests using 0.80 for adequacy”.
Picking a significance level of 0.05 and a power of 0.8 means that we are 4 times more likely to get a false negative (a 20% chance) than a false positive (a 5% chance). We’re generally more concerned about getting a false positive (making a change that doesn’t actually improve things) than we are about missing an improvement and not making a change at all, which is why we accept a greater likelihood of a false negative.
When we plug these values into R, we get results like:
> power.prop.test(p1=0.1, p2=0.11, power=0.8, alternative='two.sided', sig.level=0.05)

     Two-sample comparison of proportions power calculation

              n = 14750.79
             p1 = 0.1
             p2 = 0.11
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
This means that we need about 15k observations for each variation to be confident that the two conversion rates are significantly different. For a test with just a variation and an original, this means we need about 30k observations in total. This is based on testing two groups, but if we wanted to add a third, we could do this by just adding another 15k for that variation, as long as we’re only comparing each variation to the original.
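As an aside, the same power.prop.test call shows just how strongly the required sample size depends on your baseline conversion rate, and why generic advice like “about 100 conversions” can’t be trusted (the totals in the comment below are approximate):

# Approximate n per group needed to detect a 10% relative lift at several
# baseline conversion rates, holding sig.level = 0.05 and power = 0.8 fixed.
sapply(c(0.02, 0.05, 0.10, 0.50), function(p) {
  ceiling(power.prop.test(p1 = p, p2 = p * 1.1, sig.level = 0.05, power = 0.8)$n)
})
# Roughly 81k, 31k, 15k and 1.6k observations per group, respectively.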
4) At the end of your test, if you’ve reached your pre-determined sample size and see a difference greater than your minimum sensitivity, you should have a statistically significant result. You can explicitly test for this, but I’ll leave that as an exercise for the reader or for a later post.
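If you want a head start on that exercise, a minimal sketch of the end-of-test comparison using R’s prop.test (with made-up counts: 1,500 of 15,000 original visitors converting vs. 1,700 of 15,000 variation visitors) could look like:

# Hypothetical end-of-test significance check on the final conversion counts.
prop.test(x = c(1500, 1700), n = c(15000, 15000))
# If the reported p-value is below your chosen sig.level (0.05 here), the
# difference between the two conversion rates is statistically significant.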
Finally, don’t be discouraged by the sample sizes required – in almost every case, they’re bigger than you’d like them to be. If you’re fortunate enough to have a high-traffic website, you can test a new variation every few days, but otherwise, you may need to run your tests for several weeks. It’s still much better to be testing something slowly than to test nothing at all.
Full disclosure: I am not a statistician. This is not a statistics textbook.
Ian Clarke
on 20 Sep 11
I outline a more sophisticated approach in an answer to a Quora question.
It’s an algorithm that decides in real time how much traffic to give to different versions of a site, based on their performance (e.g. conversion rate).
To begin with they’ll all get the same amount of traffic, but as it gathers performance data it will gradually give more and more traffic to the best performing version until it’s getting all the traffic.
One nice thing about this is that you can add new versions to the test at any time, and it will handle them accordingly.
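For the curious, here’s a minimal sketch of one common way to do this kind of adaptive allocation, Thompson sampling with Beta priors (an illustration with made-up conversion rates, not necessarily the algorithm from the Quora answer):

# Each visitor is routed to the version whose sampled conversion rate is highest,
# so better-performing versions gradually receive more of the traffic.
set.seed(1)
true_rates <- c(A = 0.10, B = 0.11)   # hypothetical true conversion rates
successes <- c(A = 0, B = 0)
failures  <- c(A = 0, B = 0)
for (visitor in 1:10000) {
  sampled <- rbeta(2, successes + 1, failures + 1)  # one draw per version from its Beta posterior
  pick    <- which.max(sampled)                     # send this visitor to the best-looking version
  converted <- rbinom(1, 1, true_rates[pick])
  successes[pick] <- successes[pick] + converted
  failures[pick]  <- failures[pick] + (1 - converted)
}
successes + failures  # visitors served to each version; skews toward the better one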
brion
on 20 Sep 11
A few things to consider to keep it simple. 1: Drop the term observation and only think conversions. 2: You will want to see 4-500 conversion per tested recipe/variation. 3: Most will also want to understand and compare both days of the week. Meaning, if you even get your 500 conversions per recipe, you will want to let the test run for 2 weeks regardless. 4: Upon your findings from an A/B…N campaign, patience and certainty are paramount. If you are testing A/B/C/D/E and C wins (compared to the default), run the campaign again with only A and C, repeat the findings, then push 100% of traffic to recipe C to monetize on the findings.
NL
on 20 Sep 11
@Brion – advice like “4-500 conversion per tested recipe” is perhaps a bit dangerous, because the statistical power that delivers is widely different at 2% vs. 10% vs. 50% conversion rate. While that might be adequate for some conversion rates, it’s wholly insufficient for others.
This is why general advice that A/B testing tools provide about sample size is usually less than desirable, since it isn’t based on your baseline conversion rate.
Michael
on 20 Sep 11
Thanks, Noah. I’d be interested to read more from you about R in particular.
Rahul
on 21 Sep 11
Thanks Noah!
David
on 21 Sep 11
I prefer to do it in a more qualitative fashion. Allow the server to alternate or randomize A and B, and simply plot a graph of conversions/views ratio over time. For one application I worked on, A and B seemed equal, but then we noticed that A worked slightly better during the day, and B worked better at night. By allowing the page itself to morph over time, we were able to increase conversions quite a bit more than just a different version. Since then we’ve found a G that works for both time periods, but it earned us a lot of customers early on.
Anonymous Coward
on 22 Sep 11
Sample size requirements are necessary when you are trying to control the false negative rate (saying there is no difference when there is one, i.e. 1 – power of the test).
You can still control the false positive rate (Type I error) with smaller sample sizes using standard hypothesis testing. If you control for false positives and the A/B test comes back positive, then you have some evidence of a difference in conversion rate. A false positive is much worse than a false negative, since usually a positive result converts into something actionable.
False negatives are really important in clinical trials since they can kill a line of medical investigation. For A/B testing, a false negative means you still have no idea about the relative merits of A or B. Without testing you would have just picked one; the outcome is the same as with a test that comes back negative.
Therefore it still makes sense to test even if you cannot target a predetermined power level, as long as you control for false positives. A false negative, while undesirable, carries the same disutility for an A/B test as not testing at all.
Whenever acting on the outcome of a hypothesis test, you have to do a risk assessment and consider the user’s cost (hazard) of Type I and Type II errors; there is a tradeoff in the rates when data is limited.
Shane Johnston
on 26 Sep 11
Thanks 37Signals for putting this out there. I can’t tell you how sick I am of hearing about statistical significance as if it were an absolute. Once again, it all comes down to context and knowing what you’re measuring.
One gotcha I’ve seen is: Don’t fall in love with the tool. Ask the same question in different ways, don’t just run a singular study and shout the results from a mountain top. If the effect you observed is real, you should be able to replicate it.
Vit
on 27 Sep 11
http://headmetrics.com/ – cheap a/b testing tool