In discussions on our posts about A/B testing the Highrise home page, a number of people asked about sample size and how long to run a test for. It’s a good question, and one that’s important to understand. Running an A/B test without thinking about statistical confidence is worse than not running a test at all—it gives you false confidence that you know what works for your site, when the truth is that you don’t know any better than if you hadn’t run the test.

There’s no simple answer or generic “rule of thumb” that you can use, but you can very easily determine the right sample size to use for your test.

What drives our needed sample size?

There are a few concerns that drive the sample size required for a meaningful A/B test:

1) We want to be reasonably sure that we don’t have a false positive—that there is no real difference, but we detect one anyway. Statisticians call this Type I error.

2) We want to be reasonably sure that we don’t miss a positive outcome (or get a false negative). This is called Type II error.

3) We want to know whether a variation is better, worse, or the same as the original. Why do we care about the difference between “worse” and “the same”? I probably won’t switch from the original if the variation performs worse, but I might still switch if it performs the same (for a design or aesthetic preference, for example).

What not to do

There are a few “gotchas” that are worth watching out for when you start thinking about the statistical significance of A/B tests:

1) Don’t look at your A/B testing tool’s generic advice that “about 100 conversions are usually required for significance”. Your conversion rate and desired sensitivity determine the sample size you need, and A/B testing tools are always biased toward telling you that you have significant results as quickly as possible.

2) Don’t continuously test for significance as your sample grows, or blindly keep the test running until you reach statistical significance. Evan Miller wrote a great explanation of why you shouldn’t do this, but briefly:

  • If you stop your test as soon as you see “significant” differences, you might not have actually achieved the outcome you think you have. As a simple example, imagine you have two coins and you suspect they might be weighted. If you flip each coin 10 times, you might get heads on one every time and tails on the other every time. If you run a statistical test comparing the proportion of heads between the two coins after those 10 flips, you’ll get what looks like a statistically significant result; if you stop now, you’ll conclude they’re weighted heavily in different directions. If you keep going and flip each coin another 100 times, you might find that they are in fact balanced coins and there is no statistically significant difference in the number of heads or tails. (A short simulation below shows how this kind of early stopping inflates false positives.)
  • If you keep running your test forever, you’ll eventually reach a large enough sample size that a 0.00001% difference tests as significant. This isn’t particularly meaningful, however.

3) Don’t rely on a rule of thumb like “16 times your standard deviation squared divided by your sensitivity squared”. The same goes for the charts you’ll find on some websites that don’t make their assumptions clear. These are better than a rule of thumb like “100 conversions”, but the math isn’t hard enough to be worth skipping, and doing it yourself will give you an understanding of what drives the required sample size.
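
To make the second gotcha concrete, here’s a minimal simulation sketch in R (the 10% conversion rate, 500-visitor batches, and 20 peeks are all arbitrary assumptions). It runs many A/A tests in which the two variations are identical, checks for significance after every batch, and stops as soon as a result looks significant; the share of tests that falsely come out “significant” ends up well above the nominal 5%.

     # Sketch of the early-stopping ("peeking") problem. Both variations share
     # the same true conversion rate, so any "significant" result is a false positive.
     set.seed(42)

     peeking_test <- function(p = 0.10, batch = 500, peeks = 20) {
       a <- 0; b <- 0; n <- 0
       for (i in seq_len(peeks)) {
         a <- a + rbinom(1, batch, p)
         b <- b + rbinom(1, batch, p)
         n <- n + batch
         if (prop.test(c(a, b), c(n, n))$p.value < 0.05) {
           return(TRUE)   # stopped early on an apparently "significant" difference
         }
       }
       FALSE
     }

     # Fraction of A/A tests that wrongly look significant when you peek after
     # every batch -- well above the 5% you'd expect from the significance level.
     mean(replicate(2000, peeking_test()))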

How to calculate your needed sample size

Instead of continuously testing or relying on generic rules of thumb, you can calculate the needed sample size and statistical significance very easily. For simplicity, I’ve assumed you’re doing an A vs. B test (two variations), but the same approach can be extended to tests with more variations.

1) Specify the outcome you’re trying to measure. We typically measure conversion to signup as the primary measure, but depending on what you’re testing, it might be button clicks, newsletter signups, etc. In almost every case, you’ll be measuring a proportion—e.g., the portion of landing page visitors who complete signup, or the portion of landing page visitors who sign up for a newsletter.

2) Decide how substantial of a difference you’d like to detect – this is the sensitivity of the test. I generally target an A/B test with a sample size large enough to detect a 10% relative difference in conversion rate (e.g., to detect 11% vs. 10% conversion rate). This is a somewhat arbitrary decision you’ll have to make: testing for a reasonably large difference helps ensure you don’t spend forever fine-tuning around a local optimum, and instead move on to test potentially bigger changes. Jesse Farmer has a great article on balancing speed vs. certainty in A/B testing.
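
To get a feel for how much this choice matters, here’s a quick sketch using R’s power.prop.test function (we’ll walk through it in the next step), assuming a 10% baseline conversion rate: the smaller the lift you want to be able to detect, the more visitors you need per variation.

     # How the required sample size per variation changes with the sensitivity
     # you pick, assuming a 10% baseline conversion rate for this sketch.
     baseline <- 0.10
     for (lift in c(0.05, 0.10, 0.20)) {
       n <- power.prop.test(p1 = baseline, p2 = baseline * (1 + lift),
                            power = 0.8, sig.level = 0.05)$n
       cat(sprintf("detect a %.0f%% relative lift: ~%s visitors per variation\n",
                   lift * 100, format(ceiling(n), big.mark = ",")))
     }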

3) Calculate the required sample size based on your baseline conversion rate and your desired sensitivity. Since we’re dealing with proportions, we want to perform a simple statistical analysis called a “power analysis for two independent proportions”. Let’s break this down:

  • power analysis – a statistical tool to determine the minimum sample size required so that you can be reasonably confident you’re detecting meaningful differences between two values.
  • two independent – since we fully separate visitors (they see only the A or only the B variant), our test is nominally independent; the results for variation A aren’t based on the results for variation B.
  • proportions – we’re comparing conversion rates, which are a proportion.

Virtually any statistical programming tool will let you do this, and there are free native and web-based graphical tools you can find by searching for the term above. I’ll use R to demonstrate how to calculate sample size, but the general principles apply to any tool.

The function in R we will use is power.prop.test:

     power.prop.test(n = NULL, p1 = NULL, p2 = NULL, sig.level = 0.05,
                     power = NULL,
                     alternative = c("two.sided", "one.sided"),
                     strict = FALSE)

We’re going to leave n null, since that’s what we’re solving for. p1 and p2 are set based on our baseline conversion rate (10% in our example) and the sensitivity we’re trying to detect (a 10% relative difference from baseline, i.e. 11% in our example, not an absolute jump to 20%). We want a two-sided alternative, because we’re interested in testing whether the variation is either higher or lower than the original.

sig.level (significance level) and power are a little bit more complicated to explain, but briefly:
  • Significance level governs the chance of a false positive. A significance level of 0.05 means that there is a 5% chance of a false positive. As Wikipedia puts it, “choosing level of significance is an arbitrary task, but for many applications, a level of 5% is chosen, for no better reason than that it is conventional.”
  • Statistical power governs the chance of a false negative: it’s the probability that, if a real effect exists, the test will detect it. A power of 0.80 means there is an 80% chance that we would detect a real effect (or a 20% chance that we’d miss it). Again, Wikipedia has wisdom about what power to pick: “...there are no formal standards for power…most researchers assess the power of their tests using 0.80 for adequacy”.

Picking a significance level of 0.05 and a power of 0.8 means that we are four times more likely to get a false negative than a false positive (a 20% false negative rate vs. a 5% false positive rate). We’re generally more concerned about a false positive (making a change that doesn’t actually improve things) than we are about missing an improvement and not making a change at all, which is why we accept a greater likelihood of a false negative.

When we plug these values into R, we get results like:

> power.prop.test(p1=0.1, p2=0.11, power=0.8, alternative='two.sided', sig.level=0.05)

     Two-sample comparison of proportions power calculation

              n = 14750.79
             p1 = 0.1
             p2 = 0.11
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

 NOTE: n is number in *each* group

This means that we need about 15k observations for each variation to reliably detect a 10% relative difference in conversion rate at our chosen significance level and power. For a test with just a variation and an original, that means about 30k observations in total. This is based on testing two groups, but if we wanted to add a third variation, we could simply add another 15k observations for it, as long as we’re only comparing each variation against the original.
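
Your baseline conversion rate matters just as much as the sensitivity you chose in step 2. Here’s a quick sketch that holds the 10% relative lift and 80% power fixed and varies the baseline rate (the specific baselines are just examples); lower-converting pages need considerably more traffic:

     # Required observations per variation to detect a 10% relative lift at
     # several example baseline conversion rates.
     sapply(c(0.02, 0.05, 0.10, 0.20), function(p) {
       ceiling(power.prop.test(p1 = p, p2 = p * 1.1, power = 0.8,
                               sig.level = 0.05)$n)
     })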

4) At the end of your test, if you’ve reached your pre-determined sample size and see a difference greater than your minimum sensitivity, you should have a statistically significant result. You can explicitly test for this, but I’ll leave that as an exercise for the reader or for a later post.
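
If you’d like a head start on that exercise, one straightforward way to run the check in R is prop.test on the final counts from each variation. The numbers below are purely hypothetical, just to show the shape of the call:

     # Hypothetical final results: conversions and visitors for the original (A)
     # and the variation (B). A p-value below our 0.05 significance level would
     # indicate a statistically significant difference.
     conversions <- c(1560, 1700)
     visitors    <- c(14800, 14800)
     prop.test(conversions, visitors)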

Finally, don’t be discouraged by the sample sizes required – in almost every case, they’re bigger than you’d like them to be. If you’re fortunate enough to have a high-traffic website, you can test a new variation every few days, but otherwise, you may need to run your tests for several weeks. It’s still much better to be testing something slowly than to test nothing at all.
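
For a rough sense of how long that will take, divide the total observations you need by the traffic you can send to the test. The daily visitor figure below is just an assumption for illustration:

     # Back-of-the-envelope duration estimate for a two-variation test.
     n_per_group    <- ceiling(power.prop.test(p1 = 0.10, p2 = 0.11, power = 0.8,
                                               sig.level = 0.05)$n)
     daily_visitors <- 2000    # assumed eligible visitors per day (illustrative)
     ceiling(2 * n_per_group / daily_visitors)    # approximate days to completion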

Full disclosure: I am not a statistician. This is not a statistics textbook.