A/B Testing: It's not about the results, and it's definitely not about the why

In college, I worked for a couple of years in a lab that tested the effectiveness of surgical treatments for ACL rupture using industrial robotics. Sometimes, the reconstructions didn’t hold. The surgeons involved were sometimes frustrated; it can be hard to look at data showing that something you did didn’t work. But for the scientists and engineers, all that mattered was that we’d followed our testing protocol and gathered some new data. I came to learn that this attitude is exactly what it takes to be a successful scientist over the long term and not merely a one-hit wonder.

Occasionally, when we’re running an A/B test, someone will ask me what I call “success” for a given test. My answer is perhaps a bit surprising to some:

I don’t judge a test based on what feedback we might have gotten about it.
I don’t judge a test based on what we think we might have learned about why a given variation performed.
I don’t judge a test based on the improvement in conversion or any other quantitative measure.

I only judge a test based on whether we designed and administered it properly.

As an industry, we don’t yet have a complete analytical model of how people make decisions, so we can’t know in advance what variations will work. This means that there’s no shame in running variations that don’t improve conversion. We also lack any real ability to understand why a variation may have succeeded, so I don’t care much whether or not we understood the results at a deeper level.

The only thing we can fully control is how we set up the experiment, and so I judge a test based on criteria like:

Did we have clear segmentation of visitors into distinct variations?
Did we have clear, measurable, quantitative outcomes linked to those segments?
Did we determine our sample size using appropriate standards before we started running the test, and run the test as planned, not succumbing to a testing tool’s biased measure of significance?
Can we run the test again and reproduce the results? Did we?
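
For the sample-size criterion above, one common way to settle the number in advance is the standard normal-approximation formula for comparing two conversion rates. The sketch below is a minimal illustration under stated assumptions, not a description of any particular testing tool's procedure: the function name, the default two-sided significance level (0.05), and the default power (0.80) are all hypothetical choices made for the example.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Visitors needed in EACH variation to detect a change from
    baseline_rate to expected_rate with a two-sided two-proportion z-test.

    The defaults (alpha=0.05, power=0.80) are illustrative assumptions.
    """
    if not (0 < baseline_rate < 1 and 0 < expected_rate < 1):
        raise ValueError("conversion rates must be strictly between 0 and 1")
    if baseline_rate == expected_rate:
        raise ValueError("the two rates must differ to define an effect size")

    # Critical values of the standard normal distribution.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)

    # Variance under the null (pooled rate) and under the alternative.
    pooled = (baseline_rate + expected_rate) / 2
    null_sd = math.sqrt(2 * pooled * (1 - pooled))
    alt_sd = math.sqrt(
        baseline_rate * (1 - baseline_rate) + expected_rate * (1 - expected_rate)
    )

    effect = expected_rate - baseline_rate
    n = (z_alpha * null_sd + z_power * alt_sd) ** 2 / effect ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical numbers: 10% baseline conversion, hoping to detect 12%.
    print(sample_size_per_variation(0.10, 0.12))
```

With a 10% baseline and a hoped-for 12%, this comes out to a few thousand visitors per variation. The point is that the number is fixed before the test starts, and the test runs until it is reached rather than stopping when a dashboard first reports significance.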

This might sound a lot like the way a chemist evaluates an experiment about a new drug, and that’s not by accident. The way I look at running an A/B test is much the same as I did when I was working in that lab: if you run well-designed, carefully implemented experiments, the rest will take care of itself eventually.

You might hit paydirt this time, or it might take 100 more tests, but all that matters is that you keep trying carefully. I evaluate our overall A/B testing regimen, not individual tests, based on whether it improves our overall performance; individual tests are just one step along what we know will be a much longer road.

Noah wrote this on May 01 2012. There are 8 comments.

condor

on 01 May 12

Noah, I completely understand where you’re coming from. It’s a very logical approach, but it can be a slippery slope; don’t lose sight of the fact that there are actual humans behind the activities you’re testing. It’s easy to abstract away the actors when you’re focused on the mechanics. http://37signals.com/01.html

Eric

I agree that haphazard testing can be bad, often worse than none, but if you need to run 100 tests before hitting paydirt then you are doing something wrong; you are probably choosing the wrong inputs or measuring the wrong outputs.

A/B tests, just like source code tests, are not free and should be biased toward the parts that will make the most difference to the user’s experience.

Carlos del Rio

I very much agree that A/B testing is process driven. If you follow good workflow and procedures, all data is good data.

However, A/B testing is not a theoretical endeavour. As Eric points out, all of your success and decision making should center on the economic significance of your testing workflow. The outcome of a test should be held against what decision-making processes it can improve.

Bill Wagner

This is a great post from someone who is tasked with ‘running’ the tests. The message for those of us who are not tactically deployed in a testing role is that a commitment to testing should be central to how you run your business. Folks who ‘don’t get it’ think you can run a handful of A/B tests and magically gain insight that drives growth. In reality, it takes a concerted, sustained effort: from picking the right things to test, to running clean tests, to having people who can digest the results and add their own insights. Once the commitment to a ‘testing culture’ is made, getting a consistent, repeatable methodology in place for the entire process – from conception to application of results – is key to the overall success of the business.

Lance Jones

As someone who currently leads Web optimization for Adobe’s consulting arm (formerly Omniture), I must respectfully disagree, Noah. :-)

I have led the creation of hundreds of tests (prior to Adobe, I drove the testing program for Intuit’s global division) and you can, in fact, predict visitor behavior once you learn what makes them tick.

When I started working for Intuit, we had about a 10% win rate for our split/multivariate tests. After 3-4 years of running tests and learning from each winner and loser, we achieved a win rate of 40% (i.e., 4 in every 10 tests produced a statistically meaningful lift).

I continue to use this same approach at Adobe, with great success.

The more you know about your customers & visitors, the better you can predict their behavior. It is absolutely doable.

NL

@Lance (and @Eric and @Carlos too)-

I absolutely did not mean to say that you can’t improve at designing tests such that you have a greater win rate. You absolutely can, and the sort of improvement you describe is admirable.

I’m also not trying to suggest or endorse a shotgun approach where you try random things until you find out what works.

Even with a 40% hit rate - heck, even with a 90% hit rate - you’ll still have misses, and while you can and absolutely should learn from them, a miss doesn’t make a test a failure.

In my opinion, a test is a success as long as you learn something from it. The only way I know of to ensure that we’re able to learn something from a test is to ensure that it’s properly set up and run—that’s the only thing we can definitively control.

By treating every test that teaches you something as a success, you build a culture in which testing is a methodical and regular part of the way you work.

We’ve seen our overall performance from testing improve over time, and I’m very aware of the economic significance of it. Each individual test in isolation is a single experiment in a much broader view of things, and I’m much more interested in whether the experiment moves us in an overall more positive direction (by teaching us something) than whether it was a smashing success on its own.

Sorry if I was unclear.

Jeff Link

on 02 May 12

A/B testing in a lab, I would think, is a lot different than A/B testing on a website. @Eric, testing 100 different things is probably not the best, but when you’re talking about website visitors, giving 100, 1000 or 10000 people a different way to look at the site (different color, different button placement, etc.) can be immensely useful to see how those users react to it.

Maybe I’m simplifying it, but I think that once simple A/B tests are, as Noah points out, set up correctly, the setup becomes LESS important (not saying not important, just less): if you can make A/B testing an easy and repeatable process, then your worry/focus becomes less about how you’re doing the test and more about WHAT you’re doing in the test.

My friend had a great article written up about him, and I think 37signals uses his service. More info on A/B testing: http://www.wired.com/wiredenterprise/2012/04/ff_abtesting/

Alejandro Sanchez

on 07 May 12

I’ve been doing some A/B testing on my new site, and it’s crazy how many things will throw people off. I was working on a client’s website that sold diamond supply snapbacks. Did some page recording and found that somehow, even though there were 3 different checkout buttons visible from the checkout and 1 visible from the snapback page itself, they couldn’t find it. I kept increasing the size until it was comically huge. Guess what? Snapback25 now sells over 30% more supreme and obey products than before!