A few weeks ago, we shared some of what we’ve been testing with the Highrise marketing page. We’ve continued to test different concepts for that page and we’ll be sharing some of the results from those tests in the next few weeks, but before we do that, I wanted to share some of how we approach and implement A/B tests like this.

Deciding what to test

Our ideas for what to test come from everywhere: reading industry blogs (some examples: Visual Website Optimizer, ABtests.com), a landing page someone saw, an ad in the newspaper (our long form experiments were inspired in part by the classic “Amish heater” ads you frequently see in newspapers), etc. Everyone brings ideas to the table, and we keep a rough running list of ideas – big and small – to test.

My general goal is to have at least one, and preferably several, A/B tests running at any given time across one or more of our marketing sites. There’s no “perfect” when it comes to marketing sites, and the only way to learn what works and what doesn’t is to test continuously.

We might be testing a different landing page, the order of plans on the plan selection page, and the wording on a signup form all at the same time. These tests aren’t always big changes, and may only be exposed to a small portion of traffic, but any time you aren’t testing is an opportunity you’re wasting. People have been testing multiple ‘layers’ in their sites and applications like this for a long time, but Google has really popularized it lately (some great reading on their infrastructure is available here).
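To make that layering idea a little more concrete, here’s a minimal sketch of deterministic, per-layer bucketing; the layer names, variation names, and hash function are hypothetical and aren’t taken from our actual setup:

```javascript
// Hypothetical sketch of layered bucketing: each visitor gets one variation per
// independent layer, so several tests can run at once without interfering.
// None of these names come from our real setup.
function bucket(visitorId, layerName, variations) {
  // Deterministic hash of visitor + layer: the same visitor always lands in the
  // same variation within a layer, but layers are independent of each other.
  var str = visitorId + ':' + layerName;
  var hash = 0;
  for (var i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
  }
  return variations[Math.abs(hash) % variations.length];
}

var visitorId = '8f3a21c'; // e.g. read from a first-party cookie

var assignments = {
  landingPage: bucket(visitorId, 'landing-page', ['original', 'long-form']),
  planOrder:   bucket(visitorId, 'plan-order',   ['original', 'cheapest-first']),
  signupCopy:  bucket(visitorId, 'signup-copy',  ['original', 'variation-a'])
};
```

Because each layer hashes independently, finishing or pausing one test doesn’t disturb the assignments in any of the others.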

Implementing the tests

We primarily use two services and some homegrown glue to run our A/B tests. Essentially, our “tech stack” for running A/B tests goes like this:

  1. We set up the test using Optimizely, which makes it incredibly easy for anyone to set up tests – it doesn’t take any knowledge of HTML or CSS to change the headline on a page, for example. At the same time, it’s powerful enough for almost anything you could want to do (it’s using jQuery underneath, so you’re only limited by the power of the selector), and for wholesale rewrites of a page we can deploy an alternate version and redirect to that page. There are plenty of alternatives to Optimizely as well – Visual Website Optimizer, Google Website Optimizer, etc. – but we’ve been quite happy with Optimizely.
  2. We add to the stock Optimizely setup a Javascript snippet, inserted on all pages (experimental and original), that identifies the test and variation to Clicky, which we use for tracking behavior on the marketing sites (a rough sketch follows this list). Optimizely’s tracking is quite good (and has improved drastically over the last few months), but we still primarily use Clicky for this tracking since it’s already nicely set up for our conversion “funnel” and offers API access.
  3. We also add another piece of Javascript to Optimizely that rewrites all the URLs on the marketing pages to “tag” each visitor that’s part of an experiment with their experimental group (also sketched after this list). When a visitor completes signup, Queenbee – our admin and billing system – stores that tag in a database. This lets us easily track plan mix, retention, etc. across experimental groups (and we’re able to continue doing so far into the future).
  4. Finally, we do set up some click and conversion goals in Optimizely itself. This primarily serves as a validation—visitor tracking is not an exact science, and so I like to verify that the results we tabulate from our Clicky tracking are at least similar to what Optimizely measures directly.
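To give a feel for steps 2 and 3, here’s a rough sketch of what those snippets can look like, assuming jQuery is available on the page and that Clicky picks up custom session data from a global clicky_custom object; the experiment name, variation name, and “abtest” query parameter are made up for illustration:

```javascript
// Hypothetical sketch of the extra Javascript attached to a test (steps 2 and 3).
// The experiment/variation names and the "abtest" parameter are illustrative only.
var experiment = 'highrise_landing_round_5';
var variation  = 'long_form';

// Step 2: pass the experiment and variation to Clicky as custom session data,
// so the conversion funnel can be segmented by variation.
var clicky_custom = clicky_custom || {};
clicky_custom.session = clicky_custom.session || {};
clicky_custom.session.abtest = experiment + ':' + variation;

// Step 3: tag every link on the page with the experimental group, so the tag
// survives through signup and can be stored by the billing system.
$(function() {
  $('a[href]').each(function() {
    var href = $(this).attr('href');
    var separator = href.indexOf('?') === -1 ? '?' : '&';
    $(this).attr('href', href + separator + 'abtest=' +
      encodeURIComponent(experiment + ':' + variation));
  });
});
```

The tagged URL is what carries the experimental group all the way through to signup, where Queenbee can store it alongside the new account.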

Evaluating the results

Once we start a test, our Campfire bot ‘tally’ takes center stage to help us evaluate the test.

We’ve set up tally to respond to a phrase like “tally abtest highrise landing page round 5” with two sets of information:

  1. The “conversion funnel” for each variation—what portion of visitors reached the plan selection page, reached the signup form, and completed signup. For each variation, we compare these metrics against the original for statistical significance. In addition, tally estimates the sample size required to detect a 10% difference in performance, and we let the experiment run until it reaches that point (the math is sketched after this list; for a nice explanation of why you should let tests run to a predetermined sample size rather than stopping as soon as you think you’ve hit a significant result, see here).
  2. The profile of each variation’s “cohort” that has completed signup. This includes the portion of signups that were for paying plans, the average price of those plans, and the net monthly value of a visitor to any given variation’s landing page (we also have a web-based interface to let us dig deeper into these cohorts’ retention and usage profiles). These numbers are important—we’d rather have lower overall signups if it means we’re getting a higher value signup.
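For the curious, the back-of-the-envelope math behind those two reports looks roughly like this; all of the inputs below are made-up examples rather than real Highrise numbers, and tally’s actual implementation differs in the details:

```javascript
// Rough sketch of the math tally reports. All inputs here are made-up examples.

// Visitors needed per variation to detect a given relative change in conversion
// rate, using a standard two-proportion sample size formula.
function sampleSizePerVariation(baselineRate, relativeLift, zAlpha, zBeta) {
  var p1 = baselineRate;                       // e.g. 5% of visitors complete signup
  var p2 = baselineRate * (1 + relativeLift);  // e.g. 5.5% for a 10% relative lift
  var variance = p1 * (1 - p1) + p2 * (1 - p2);
  var delta = p2 - p1;
  return Math.ceil(Math.pow(zAlpha + zBeta, 2) * variance / (delta * delta));
}

// At 95% confidence (z = 1.96) and 80% power (z = 0.84), detecting a 10%
// relative lift on a 5% baseline takes roughly 31,000 visitors per variation.
sampleSizePerVariation(0.05, 0.10, 1.96, 0.84);

// Net monthly value of a visitor to a variation's landing page: how many
// visitors sign up, how many of those pay, and what they pay on average.
function valuePerVisitor(visitors, signups, payingSignups, averageMonthlyPrice) {
  return (signups / visitors) * (payingSignups / signups) * averageMonthlyPrice;
}

// e.g. 10,000 visitors, 520 signups, 140 on paying plans averaging $49/month
valuePerVisitor(10000, 520, 140, 49); // ≈ $0.69 per visitor per month
```

That per-visitor value is exactly why we’d happily trade a lower signup count for a more valuable mix of signups.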

Tally sits in a few of our Campfire rooms, so anyone at 37signals can check the results of any test that’s running or recently finished in just a few seconds.

Once a test has finished, we don’t just sit back and bask in our higher conversion rates or increased average signup value—we try to infer what worked and what didn’t work, design a new test, and get back to experimenting and learning.