In college, I worked for a couple of years in a lab that tested the effectiveness of surgical treatments for ACL rupture using industrial robotics. Sometimes the reconstructions didn’t hold, which could frustrate the surgeons involved; it can be hard to look at data showing that something you did didn’t work. But for the scientists and engineers, all that mattered was that we’d followed our testing protocol and gathered some new data. I came to learn that this attitude is exactly what it takes to be a successful scientist over the long term and not merely a one-hit wonder.

Occasionally, when we’re running an A/B test, someone will ask me what I’d call “success” for that test. My answer is perhaps a bit surprising to some:

  • I don’t judge a test based on what feedback we might have gotten about it.
  • I don’t judge a test based on what we think we might have learned about why a given variation performed the way it did.
  • I don’t judge a test based on the improvement in conversion or any other quantitative measure.

I only judge a test based on whether we designed and administered it properly.

As an industry, we don’t yet have a complete analytical model of how people make decisions, so we can’t know in advance which variations will work. This means there’s no shame in running variations that don’t improve conversion. We also lack any real ability to understand why a variation may have succeeded, so I don’t care much whether or not we understand the results at a deeper level.

The only thing we can fully control is how we set up the experiment, and so I judge a test based on criteria like:

  • Did we have clear segmentation of visitors into distinct variations?
  • Did we have clear, measurable, quantitative outcomes linked to those segments?
  • Did we determine our sample size using appropriate statistical standards before we started the test, and then run it as planned rather than succumbing to a testing tool’s biased measure of significance? (A sketch of this kind of up-front calculation follows this list.)
  • Can we run the test again and reproduce the results? Did we?
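To make the sample-size criterion concrete, here’s a minimal sketch of the kind of power calculation I have in mind, in Python with statsmodels. The baseline and target conversion rates are purely illustrative assumptions, not numbers from any real test; the point is that the required sample size is fixed before the first visitor is bucketed into a variation.

    # Fix the sample size up front with a standard power calculation.
    # The conversion rates below are illustrative assumptions.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    baseline = 0.04   # assumed current conversion rate
    target = 0.05     # smallest lift worth detecting
    alpha = 0.05      # acceptable false-positive rate
    power = 0.80      # chance of detecting a lift this large if it exists

    # Cohen's h for the two proportions, then solve for visitors per group.
    effect = proportion_effectsize(target, baseline)
    n_per_variation = NormalIndPower().solve_power(
        effect_size=effect, alpha=alpha, power=power, ratio=1.0
    )
    print(f"Visitors needed per variation: {int(round(n_per_variation))}")

The specific library doesn’t matter; what matters is that the number comes from a calculation you committed to in advance rather than from a dashboard that declares significance whenever the lines happen to cross.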

This might sound a lot like the way a chemist evaluates an experiment on a new drug, and that’s no accident. I look at running an A/B test much the same way I looked at the work in that lab: if you run well-designed, carefully implemented experiments, the rest will eventually take care of itself.

You might hit paydirt this time, or it might take 100 more tests, but all that matters is that you keep trying carefully. I evaluate the success of our A/B testing regimen as a whole on whether it improves our performance over time, but I don’t evaluate individual tests that way; each test is just one step along what we know will be a much longer road.