Last week I stated that my goal was to build a functioning (albeit UI-less) system for visitor randomization, variant selection, and counting for a simple conversion metric.

I accomplished this goal and ran an A/A test over the weekend. I created two treatments of my YARPP page, a control and a variant, and told ShrimpTest to assign 30% of the testing population to the variant. I then counted a conversion-type metric: the number of people who clicked to download the plugin.

Here are the two treatments I ran:

[Screenshots: yarpp.png (Control) and yarpp.png (Variant)]

If you think the control and the variant look the same, well, you’d be right! This is an A/A test, so the point isn’t to actually learn whether one treatment fares better than another, but rather to establish that the system is working and a baseline for what behavior is “normal.”

The first thing to verify is that our random assignment distribution is working as we intended, and we see no problem there: 4158 cookies were assigned to the control, and 1760 to the variant. That’s 70.26% for the control and 29.74% for the variant. Right on.
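If you want to go one step further than eyeballing the percentages, here’s a minimal Python sketch (not part of ShrimpTest; just the arithmetic) that checks the observed split against the intended 30% with a one-sample z-test, using the cookie counts above:

```python
import math

# Observed cookie assignments from the A/A test (numbers from the post).
control_cookies = 4158
variant_cookies = 1760
total = control_cookies + variant_cookies

intended_variant_share = 0.30          # we asked ShrimpTest for a 70/30 split
observed_variant_share = variant_cookies / total

# One-sample z-test: is the observed share consistent with the intended 30%?
standard_error = math.sqrt(intended_variant_share * (1 - intended_variant_share) / total)
z = (observed_variant_share - intended_variant_share) / standard_error

print(f"observed variant share: {observed_variant_share:.4f}")  # ~0.2974
print(f"z = {z:.2f}")  # ~ -0.44, comfortably inside +/-1.96, so no cause for concern
```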

The second thing to check is whether we can say, with confidence, that either treatment was better than the other. Again, because this is an A/A test, we want the null hypothesis to win. The null hypothesis states that any differences we see between the treatments are simply due to chance.

First we calculate the average conversion rates for each of the treatments. Here’s the raw data:

                                  Control   Variant
N (total visitors in the group)       779       330
Converted                              76        23
Proportion converted               0.0975    0.0696

So it looks like the variant did worse than the control. The question is, can we say with confidence that this amount of variation was not simply due to chance?

The way we do this is to calculate the z-score (or standard score) between the variant and the control. The z-score is a measure of how far apart the two conversion rates are, expressed in units of standard error. I’m going to gloss over the mathematical details here (I hope to write an article or do a video in the future explaining the stats), but our z-score here is -1.58, and our 95% confidence threshold is -1.645 (a good primer on why this magical number is our threshold is available here). As -1.58 is closer to 0 than -1.645, we cannot reject the null hypothesis with 95% confidence: the difference in performance is consistent with chance. Good thing, too, because we know the two treatments here were exactly the same. And if we kept running this test for a longer period of time, we would expect the z-score to keep hovering around zero rather than drift out past the threshold.
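For the curious, here’s a short Python sketch of the two-proportion z-score described above, using the numbers from the table. This isn’t ShrimpTest’s actual code; it uses the unpooled standard error, which appears to be the form that reproduces the -1.58 figure (a pooled standard error would give roughly -1.49):

```python
import math

# Raw numbers from the table above.
control_n, control_conversions = 779, 76
variant_n, variant_conversions = 330, 23

p_control = control_conversions / control_n   # ~0.0975
p_variant = variant_conversions / variant_n   # ~0.0696

# Two-proportion z-score with an unpooled standard error.
standard_error = math.sqrt(
    p_control * (1 - p_control) / control_n +
    p_variant * (1 - p_variant) / variant_n
)
z = (p_variant - p_control) / standard_error

threshold = -1.645  # one-tailed 95% confidence threshold
print(f"z = {z:.2f}")  # ~ -1.58
print("reject null" if z < threshold else "fail to reject null: difference consistent with chance")
```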

However, even though we know that the two treatments were exactly the same, and so the null hypothesis is in fact true, it’s important to note that our statistics here are at 95% confidence: that means that, on average, one in twenty times we run such an A/A test, we’ll get a false positive.
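To see that one-in-twenty figure in action, here’s an illustrative Monte Carlo sketch (not based on any real traffic): it simulates many A/A tests with the same true conversion rate on both sides, using sample sizes like ours and an assumed true rate of 9%, and counts how often the one-tailed test trips the threshold anyway.

```python
import math
import random

def simulate_aa_test(n_control=779, n_variant=330, true_rate=0.09):
    """Simulate one A/A test where both treatments share the same true rate."""
    conv_c = sum(random.random() < true_rate for _ in range(n_control))
    conv_v = sum(random.random() < true_rate for _ in range(n_variant))
    p_c, p_v = conv_c / n_control, conv_v / n_variant
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_v * (1 - p_v) / n_variant)
    return (p_v - p_c) / se if se > 0 else 0.0

random.seed(42)
trials = 10_000
false_positives = sum(simulate_aa_test() < -1.645 for _ in range(trials))
print(f"false positive rate: {false_positives / trials:.3f}")  # roughly 0.05
```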