A few weeks ago, we shared some of what we’ve been testing with the Highrise marketing page. We’ve continued to test different concepts for that page and we’ll be sharing some of the results from those tests in the next few weeks, but before we do that, I wanted to share some of how we approach and implement A/B tests like this.

Deciding what to test

Our ideas for what to test come from everywhere: reading industry blogs (some examples: Visual Website Optimizer, ABtests.com), a landing page someone saw, an ad in the newspaper (our long form experiments were inspired in part by the classic “Amish heater” ads you frequently see in newspapers), etc. Everyone brings ideas to the table, and we keep a rough running list of ideas – big and small – to test.

My general goal is to have at least one, and preferably several, A/B tests running at any given time across one or more of our marketing sites. There’s no “perfect” when it comes to marketing sites, and the only way to learn what works and what doesn’t is to test continuously.

We might be testing a different landing page, the order of plans on the plan selection page, and the wording on a signup form all at the same time. These tests aren’t always big changes, and may only be exposed to a small portion of traffic, but any time you aren’t testing is an opportunity you’re wasting. People have been testing multiple ‘layers’ in their sites and applications like this for a long time, but Google has really popularized it lately (some great reading on their infrastructure is available here).
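To make that layering idea a little more concrete, here’s a minimal sketch of deterministic, per-layer bucketing; the layer names, variation names, and hash function are hypothetical and aren’t taken from our actual setup:

```javascript
// Hypothetical sketch of layered bucketing: each visitor gets one variation per
// independent layer, so several tests can run at once without interfering.
// None of these names come from our real setup.
function bucket(visitorId, layerName, variations) {
  // Deterministic hash of visitor + layer: the same visitor always lands in the
  // same variation within a layer, but layers are independent of each other.
  var str = visitorId + ':' + layerName;
  var hash = 0;
  for (var i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
  }
  return variations[Math.abs(hash) % variations.length];
}

var visitorId = '8f3a21c'; // e.g. read from a first-party cookie

var assignments = {
  landingPage: bucket(visitorId, 'landing-page', ['original', 'long-form']),
  planOrder:   bucket(visitorId, 'plan-order',   ['original', 'cheapest-first']),
  signupCopy:  bucket(visitorId, 'signup-copy',  ['original', 'variation-a'])
};
```

Because each layer hashes independently, finishing or pausing one test doesn’t disturb the assignments in any of the others.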

Implementing the tests

We primarily use two services and some homegrown glue to run our A/B tests. Essentially, our “tech stack” for running A/B tests goes like this:

  1. We set up the test using Optimizely, which makes it incredibly easy for anyone to set up tests – it doesn’t take any knowledge of HTML or CSS to change the headline on a page, for example. At the same time, it’s powerful enough for almost anything you could want to do (it’s using jQuery underneath, so you’re only limited by the power of the selector), and for wholesale rewrites of a page we can deploy an alternate version and redirect to that page. There are plenty of alternatives to Optimizely as well – Visual Website Optimizer, Google Website Optimizer, etc. – but we’ve been quite happy with Optimizely.
  2. We add to the stock Optimizely setup a Javascript snippet, inserted on all pages (experimental and original), that identifies the test and variation to Clicky, which we use for tracking behavior on the marketing sites (a rough sketch follows this list). Optimizely’s tracking is quite good (and has improved drastically over the last few months), but we still primarily use Clicky for this tracking since it’s already nicely set up for our conversion “funnel” and offers API access.
  3. We also add another piece of Javascript to Optimizely that rewrites all the URLs on the marketing pages to “tag” each visitor that’s part of an experiment with their experimental group (also sketched after this list). When a visitor completes signup, Queenbee – our admin and billing system – stores that tag in a database. This lets us easily track plan mix, retention, etc. across experimental groups (and we’re able to continue doing so far into the future).
  4. Finally, we do set up some click and conversion goals in Optimizely itself. This primarily serves as a validation—visitor tracking is not an exact science, and so I like to verify that the results we tabulate from our Clicky tracking are at least similar to what Optimizely measures directly.
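To give a feel for steps 2 and 3, here’s a rough sketch of what those snippets can look like, assuming jQuery is available on the page and that Clicky picks up custom session data from a global clicky_custom object; the experiment name, variation name, and “abtest” query parameter are made up for illustration:

```javascript
// Hypothetical sketch of the extra Javascript attached to a test (steps 2 and 3).
// The experiment/variation names and the "abtest" parameter are illustrative only.
var experiment = 'highrise_landing_round_5';
var variation  = 'long_form';

// Step 2: pass the experiment and variation to Clicky as custom session data,
// so the conversion funnel can be segmented by variation.
var clicky_custom = clicky_custom || {};
clicky_custom.session = clicky_custom.session || {};
clicky_custom.session.abtest = experiment + ':' + variation;

// Step 3: tag every link on the page with the experimental group, so the tag
// survives through signup and can be stored by the billing system.
$(function() {
  $('a[href]').each(function() {
    var href = $(this).attr('href');
    var separator = href.indexOf('?') === -1 ? '?' : '&';
    $(this).attr('href', href + separator + 'abtest=' +
      encodeURIComponent(experiment + ':' + variation));
  });
});
```

The tagged URL is what carries the experimental group all the way through to signup, where Queenbee can store it alongside the new account.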

Evaluating the results

Once we start a test, our Campfire bot ‘tally’ takes center stage to help us evaluate the test.

We’ve set up tally to respond to a phrase like “tally abtest highrise landing page round 5” with two sets of information:

  1. The “conversion funnel” for each variation—what portion of visitors reached the plan selection page, reached the signup form, and completed signup. For each variation, we compare these metrics against the original for statistical significance. In addition, tally estimates the sample size required to detect a 10% difference in performance, and we let the experiment run until it reaches that point (the math is sketched after this list; for a nice explanation of why you should let tests run to a predetermined sample size rather than stopping as soon as you think you’ve hit a significant result, see here).
  2. The profile of each variation’s “cohort” that has completed signup. This includes the portion of signups that were for paying plans, the average price of those plans, and the net monthly value of a visitor to any given variation’s landing page (we also have a web-based interface to let us dig deeper into these cohorts’ retention and usage profiles). These numbers are important—we’d rather have lower overall signups if it means we’re getting a higher value signup.
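For the curious, the back-of-the-envelope math behind those two reports looks roughly like this; all of the inputs below are made-up examples rather than real Highrise numbers, and tally’s actual implementation differs in the details:

```javascript
// Rough sketch of the math tally reports. All inputs here are made-up examples.

// Visitors needed per variation to detect a given relative change in conversion
// rate, using a standard two-proportion sample size formula.
function sampleSizePerVariation(baselineRate, relativeLift, zAlpha, zBeta) {
  var p1 = baselineRate;                       // e.g. 5% of visitors complete signup
  var p2 = baselineRate * (1 + relativeLift);  // e.g. 5.5% for a 10% relative lift
  var variance = p1 * (1 - p1) + p2 * (1 - p2);
  var delta = p2 - p1;
  return Math.ceil(Math.pow(zAlpha + zBeta, 2) * variance / (delta * delta));
}

// At 95% confidence (z = 1.96) and 80% power (z = 0.84), detecting a 10%
// relative lift on a 5% baseline takes roughly 31,000 visitors per variation.
sampleSizePerVariation(0.05, 0.10, 1.96, 0.84);

// Net monthly value of a visitor to a variation's landing page: how many
// visitors sign up, how many of those pay, and what they pay on average.
function valuePerVisitor(visitors, signups, payingSignups, averageMonthlyPrice) {
  return (signups / visitors) * (payingSignups / signups) * averageMonthlyPrice;
}

// e.g. 10,000 visitors, 520 signups, 140 on paying plans averaging $49/month
valuePerVisitor(10000, 520, 140, 49); // ≈ $0.69 per visitor per month
```

That per-visitor value is exactly why we’d happily trade a lower signup count for a more valuable mix of signups.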

Tally sits in a few of our Campfire rooms, so anyone at 37signals can check the results of any test that’s running or recently finished in just a few seconds.

Once a test has finished, we don’t just sit back and bask in our higher conversion rates or increased average signup value—we try to infer what worked and what didn’t work, design a new test, and get back to experimenting and learning.