In coming up with the notion of “experiment completion” and creating the associated events, I’ve been reading a bit more about experiment sample size and strategies for computing it. In particular, I found the following post both very inspiring and troubling: How Not To Run An A/B Test. The upshot is that simply running an experiment until a result hits 95% confidence and ending it there carries the danger of producing invalid results. The solution is to compute a fixed sample size in advance.

The problem with sample size

The problem is that the usual way of computing an experiment’s sample size is based on power considerations, where power is the probability that the test will correctly reject a false null hypothesis. That probability depends, in turn, on how much variance there is in the goal metric being recorded. Suppose our goal metric is “number of dollars spent at the store”. Our control might look like this:

CONTROL: $10, $11, $9.5, $10, $10, $10.5, $9, $5, $11, $11, $10.5…

and our variant looks like this:

VARIANT 1: $7, $6, $7, $8, $8.5, $5.5…

It looks very much like this variant is behaving differently from the control; in particular, it seems to be performing about 20% (or more) worse. But what if the control had instead looked like the following?

HYPOTHETICAL CONTROL: $8, $14, $12, $6, $7.5, $5, $16, $12, $18, $9, $11.5…

This distribution has approximately the same average as the first control, but much more variance. If we were asked whether we could confidently say that variant 1 above looks the same as or different from this hypothetical control, we would be much more hesitant.
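As a quick illustration, here is a small sketch using Python and SciPy (purely for demonstration; neither is implied to be part of this project’s stack) that runs a standard two-sample t-test of variant 1 against each control. The noisier hypothetical control yields a noticeably larger p-value, i.e. a less confident result, even though both controls have roughly the same average.

```python
# Illustrative only: the dollar figures are taken straight from the examples above.
from scipy import stats

control = [10, 11, 9.5, 10, 10, 10.5, 9, 5, 11, 11, 10.5]
hypothetical_control = [8, 14, 12, 6, 7.5, 5, 16, 12, 18, 9, 11.5]
variant_1 = [7, 6, 7, 8, 8.5, 5.5]

# Two-sample t-tests: variant 1 against each control.
t_low_var, p_low_var = stats.ttest_ind(control, variant_1)
t_high_var, p_high_var = stats.ttest_ind(hypothetical_control, variant_1)

# The high-variance control produces a larger p-value (weaker evidence of a
# difference) even though both controls average roughly the same amount.
print(f"vs. low-variance control:  p = {p_low_var:.4f}")
print(f"vs. high-variance control: p = {p_high_var:.4f}")
```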

This is why the variance of the goal metric is crucial in determining the confidence of a result and, thus, the required sample size to detect differences.
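For reference, the usual fixed-sample-size calculation for comparing two means makes this dependence explicit: the required per-group sample size is proportional to the square of the metric’s standard deviation. A minimal sketch, again in Python/SciPy and purely illustrative (the significance level, power, and minimum detectable difference are made-up example values, not recommendations):

```python
# Sketch of the standard fixed-sample-size formula for comparing two means:
#   n per group = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2
# where sigma is the metric's standard deviation and delta is the smallest
# difference we want to be able to detect.
import math
from scipy.stats import norm

def sample_size_per_group(std_dev, min_detectable_diff, alpha=0.05, power=0.8):
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # desired power (1 - beta)
    n = 2 * (std_dev ** 2) * (z_alpha + z_beta) ** 2 / (min_detectable_diff ** 2)
    return math.ceil(n)

# Using roughly the sample standard deviations of the two controls above:
# the noisier control needs several times the sample size to detect the
# same difference, since n scales with the square of the standard deviation.
print(sample_size_per_group(std_dev=1.7, min_detectable_diff=1.0))
print(sample_size_per_group(std_dev=4.1, min_detectable_diff=1.0))
```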

The problem here is that, most of the time, we won’t know the variance (or standard deviation) of the metric before running the experiment. Some solutions exist, such as Bayesian approaches, but at the cost of greatly complicating our statistics algorithms, which I would prefer not to do.

The question is, then, how do we compute a valid sample size ahead of time, or early on in the experiment period?