Every website has a purpose. Maybe you want people to buy a product, donate to your cause, download your app, or subscribe to your mailing list. How can you confidently modify your site to make it more effective with respect to this goal?

A/B testing is a process in which multiple variants of a website are randomly presented to different users, and statistical tools are used to determine whether any variant is more effective according to an overall goal metric such as conversions or revenue.
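
The mechanics of splitting visitors are usually simple. The sketch below (in Python, purely for illustration; it is not how ShrimpTest itself assigns visitors) shows one common approach: hash a visitor identifier so that each visitor consistently sees the same variant while the audience as a whole is split roughly evenly. The visitor ID and variant names are hypothetical.

    import hashlib

    def assign_variant(visitor_id, variants=("control", "variant_b")):
        """Deterministically assign a visitor to one variant.

        Hashing the visitor ID means the same visitor always sees the same
        variant, while the audience as a whole splits roughly evenly.
        """
        digest = hashlib.sha256(visitor_id.encode("utf-8")).hexdigest()
        return variants[int(digest, 16) % len(variants)]

    print(assign_variant("visitor-12345"))  # hypothetical visitor ID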

A/B testing and its kin are used frequently by large web properties you know and trust, like Google, Amazon, and Microsoft. Success stories abound.

Getting results with “confidence”

Suppose you want to know whether a coin is fair, so you decide to flip it five hundred times. It comes up heads 316 times, or 63% of the time, rather than the expected 50%. Can you confidently say that this coin is not fair?

Now what if you’d stopped after 250 trials, at which point it was 57% heads? Could you say it’s not fair then? And what if you’d stopped after the first ten trials, where you got five heads and five tails? Could you stop then and confidently say that it is fair?
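
To make this concrete, here is a minimal sketch in Python (purely illustrative, and not how ShrimpTest itself computes its statistics) of an exact binomial test applied to the three scenarios above. For each run of flips it reports a confidence figure: the probability that a deviation this large is not just normal random variation.

    from math import comb

    def fair_coin_confidence(heads, flips):
        """Exact two-sided binomial test against a fair coin (p = 0.5)."""
        def prob(k):
            return comb(flips, k) * 0.5 ** flips   # P(exactly k heads)
        observed = prob(heads)
        # p-value: total probability of every outcome at least as unlikely as ours.
        p_value = sum(prob(k) for k in range(flips + 1)
                      if prob(k) <= observed * (1 + 1e-9))
        return 1 - min(1.0, p_value)

    for heads, flips in [(316, 500), (142, 250), (5, 10)]:   # 63%, ~57%, and 50% heads
        print(f"{heads}/{flips} heads: "
              f"confidence the coin is unfair = {fair_coin_confidence(heads, flips):.1%}")

Running it shows the pattern described here: 316 heads out of 500 is essentially impossible to explain by chance alone, while five heads out of ten tells you nothing either way.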

Statistics gives us a way to quantify the intuitive notion of “confidence” that we feel when we say that it’s an unfair coin after 500 trials, but might not be sure either way after just ten. The confidence level is the probability that the difference observed between the variant and the control is *not* simply due to normal variation within the results. For our coin-flipping example, then, it’s the probability that the deviation from the expected result (it not being heads exactly 50% of the time) is not merely the product of random variation.

After all, if we flip a coin 100 times, we’re not guaranteed to get exactly fifty heads and fifty tails; there’s a certain amount of normal variation. We want to be confident that the effect we observe actually means the coin is not fair. “Confident” is normally quoted at the 95% confidence level (a 1 in 20 chance of a false positive), and we say “very confident” at 99%.

When an experiment has been completed, ShrimpTest will tell you both the difference between each candidate and the control and the confidence level of that result. Smaller differences are harder to observe, so you need more trials to detect them confidently, while larger differences are easier to detect and thus take fewer trials. Results without 95% confidence, or results quoted before an experiment completes, should not be used to make decisions.
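
For intuition about what such a comparison involves, here is a rough sketch of one common approach, a two-proportion z-test on conversion rates. It is meant only to illustrate quoting a difference together with a confidence level; ShrimpTest performs its own calculations, which may differ, and the visitor and conversion counts below are invented.

    from math import erf, sqrt

    def compare_to_control(control_conv, control_visits, variant_conv, variant_visits):
        """Difference in conversion rate between variant and control, and the
        confidence that the difference is not just normal variation."""
        p_c = control_conv / control_visits
        p_v = variant_conv / variant_visits
        # Pooled rate under the assumption that both convert identically.
        pooled = (control_conv + variant_conv) / (control_visits + variant_visits)
        se = sqrt(pooled * (1 - pooled) * (1 / control_visits + 1 / variant_visits))
        z = (p_v - p_c) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided
        return p_v - p_c, 1 - p_value

    # The same one-point lift (5% to 6%) with different sample sizes:
    print(compare_to_control(200, 4000, 240, 4000))   # 4,000 visitors per variant
    print(compare_to_control(50, 1000, 60, 1000))     # 1,000 visitors per variant

With 4,000 visitors per variant the lift sits right around the 95% confidence mark, while with only 1,000 visitors the same lift is far from significant, which is exactly why smaller differences need more trials.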

What A/B testing is not

A/B testing is a great tool because it can tell you with confidence that one treatment of your website is better than another, taking the guesswork out of the design process. However, if you need to understand why your visitors behave a certain way or prefer a certain treatment, you will need qualitative methods such as focus groups or surveys. A/B testing cannot tell you why one treatment is better than another.

This makes it doubly important to agree upon a goal metric ahead of time. Is your goal to increase pageviews? Conversions? Repeat visit rates? Purchases? Total revenue? Moving the goal posts down the line makes it harder, and often impossible, to know which variant is best.

A/B testing is also only a tool: it won’t point you to other variations that you should be testing, and thus is not a replacement for good design. The usefulness of testing is limited by the tests that you set up. If you only test very slight variations, you’ll only learn about the relative effectiveness of those variants; there’s no guarantee that a radically different design, layout, header, etc., wouldn’t do better if you tried it. Luckily, some references (such as ABTests.com) exist to give you ideas of what kinds of variation you can test.

Finally, A/B testing only compares the effectiveness of variants within a single experiment, assuming all other variables are held constant. This means that you should refrain from running multiple simultaneous experiments against the same goal metric. For example, suppose you want to create a button that people will click. If you vary the size of the button in one experiment and the text on the button in another, running both at once would invalidate the results of both.

If you do have a particular need to see how different parameters combine to become more or less effective (called “interactions”), the more advanced method of multivariate testing is required. ShrimpTest does not offer multivariate testing.

Next steps:

If you would like to dive right into testing, learn about creating experiments and interpreting results with ShrimpTest. Alternatively, read more about A/B testing best practices.

Learn more:

Some good references for A/B testing on the web:

Most of the material in this section is covered in my talk on ShrimpTest: