A/B testing is a useful tool for determining which page layout or copy best drives users toward a given goal. Companies like 37signals use A/B testing to improve conversion rates on their sites, HubSpot uses it to increase email conversions, and Zynga uses it to increase engagement in its games [1,2,3].
Whenever you run an A/B test, you must decide when you have gathered enough data to pick the winning idea and implement it for all users. People typically plug the conversion numbers into an online calculator*, and if the result is 'significant' they pick the winner.
However, deciding when to stop a test based on significance is wrong.
Significance testing is useful when the goal is inference. If we want to make falsifiable statements and draw conclusions through experimentation, we use statistical significance to measure certainty. In a business setting, however, we want to make a decision with some other goal in mind: increasing conversions, improving ease of use, maximizing profit, or some other objective. In these cases there are better criteria for choosing when to stop an experiment.
Suppose we are running a test with two ideas, call them A and B, and one idea is better than the other. The longer we run the test, the better we can quantify how much better one idea is than the other. However, the longer we run the test, the more users we expose to the inferior idea.
In a classical testing environment, we would decide that we want to detect differences in conversion rate as small as 1 percentage point, pick a confidence level (say 95%), find the appropriate sample size, and run the test. At the end of the test we could make one of three statements: A is better than B, B is better than A, or A and B are within 1 percentage point of each other. We would have 95% confidence that the statement we made was correct.
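To make "find the appropriate sample size" concrete, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The function name, the 80% power, and the 20% and 21% baseline rates are assumptions for illustration; the text above only fixes the 1 percentage point difference and the 95% confidence level.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_a, p_b, confidence=0.95, power=0.80):
    """Approximate visitors needed per arm for a two-sided two-proportion z-test.

    power=0.80 is an assumed value; the article does not specify one.
    """
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two arms.
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)


# Assumed rates: 21% vs 20%, i.e. a 1 percentage point difference.
n_per_arm = sample_size_per_arm(0.21, 0.20)
print(n_per_arm)
```

Because the required sample size scales with the inverse square of the difference, halving the difference you want to detect roughly quadruples the number of visitors you need.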
With a Bayesian approach, by contrast, we first estimate how many people will be exposed to the result of the test. We then balance the cost of running the test against the cost of making the wrong decision. The more users who will be exposed to the result, the higher the cost of making the wrong decision, so we can justify running a longer test.
This cost is formally called 'regret'. It is measured as the difference between the revenue actually realized and the optimal revenue that could have been realized with perfect information.
These two approaches have been debated in the clinical trial literature. In clinical trials, scientists must balance giving a potentially inferior treatment to current patients against the knowledge gained to help future patients. The Bayesian approach was developed by Anscombe in the 1960s, and is widely used in clinical trials today.
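As a small worked example of this definition, the sketch below computes regret in expected conversions rather than revenue, which is a simplifying assumption. The function name and the visitor counts are hypothetical.

```python
def expected_regret(rate_a, rate_b, shown_a, shown_b):
    """Expected conversions lost compared with always showing the better idea.

    rate_a, rate_b: true conversion rates (known only in a simulation).
    shown_a, shown_b: number of visitors who saw each idea.
    """
    best = max(rate_a, rate_b)
    optimal = best * (shown_a + shown_b)
    realized = rate_a * shown_a + rate_b * shown_b
    return optimal - realized


# Hypothetical scenario: 100,000 visitors, 5,000 per arm during the test,
# then the remaining 90,000 all see the idea that was picked.
picked_right = expected_regret(0.21, 0.20, shown_a=95_000, shown_b=5_000)
picked_wrong = expected_regret(0.21, 0.20, shown_a=5_000, shown_b=95_000)
print(picked_right, picked_wrong)
```

Under these assumed numbers, picking the wrong idea costs far more than the test itself, which is why a larger future audience justifies a longer test.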
Anscombe provides a formula to determine the stopping point of an experiment. The experiment should be terminated when the following condition is true.
Here y is the difference between the results of A and B, k is the expected number of future users who will be exposed to the result, n is the number of users who have been exposed to the test so far, and Phi-inverse is the quantile function of the standard normal distribution.
So what does this mean? How does using the result from Anscombe's paper affect actual performance?
In the following example we simulate 100,000 visits to the site with two ideas: idea A has a 21% conversion rate, and idea B has a 20% conversion rate. We compare three ways of deciding when to stop: repeated significance testing, Anscombe's stopping rule, and a fixed sample size of 10K.
We repeat the simulation 10,000 times and calculate the regret for each method.
| Method | Mean regret | 95% quantiles | Correct version chosen |
|---|---|---|---|
| Repeated significance | 150 | (-4.0, 620) | 72% |
| Fixed sample size | 115 | (10, 225) | 96% |
Below are two plots of a typical path of an experiment. We plot the advantage of A over B. While A, in the long run, is better than B, there are short spells where B performs better than A. We also see a visual representation of the two confidence intervals. Zooming in on the first 2,000 visitors, we see the problem with the repeated significance testing. A short run of conversions on idea B results in B being declared the winner around 100 visitors into the test.
Using Anscombe's stopping rule is much better than using significance testing: it yields 40% less regret than repeated significance testing. The traditional practice of repeated significance testing leads to higher regret, and produces the correct answer less often, than either Anscombe's method or a fixed sample size.
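The comparison between repeated significance testing and a fixed sample size can be sketched with a small simulation. This is not the code behind the numbers above. It is a minimal illustration under stated assumptions: visitors alternate between A and B, the significance check is a pooled two-proportion z-test applied after every pair once each arm has 50 visitors, the fixed sample is 10,000 per arm, regret is counted in expected conversions, and only 100 runs are used to keep it fast. Anscombe's rule is omitted, and all function names are hypothetical. The exact numbers depend on the seed and will not match the table.

```python
import random
from math import sqrt

RATE_A, RATE_B = 0.21, 0.20  # true conversion rates; A is the better idea


def z_score(conv_a, conv_b, n):
    """Pooled two-proportion z statistic with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_a - conv_b) / (n * se)


def repeated_significance(conv_a, conv_b, n, min_n=50):
    """Stop as soon as the test looks significant at the 5% level."""
    return n >= min_n and abs(z_score(conv_a, conv_b, n)) > 1.96


def fixed_sample(per_arm):
    """Stop only once each arm has seen per_arm visitors."""
    return lambda conv_a, conv_b, n: n >= per_arm


def run_test(rng, horizon, should_stop):
    """Run one experiment; return (picked_a, expected regret over the horizon)."""
    conv_a = conv_b = n = 0
    while 2 * (n + 1) <= horizon:
        n += 1
        conv_a += rng.random() < RATE_A
        conv_b += rng.random() < RATE_B
        if should_stop(conv_a, conv_b, n):
            break
    # Pick the leader; break exact ties with a coin flip.
    picked_a = conv_a > conv_b or (conv_a == conv_b and rng.random() < 0.5)
    gap = RATE_A - RATE_B
    # During the test, n visitors saw B. Afterwards, everyone sees the pick.
    regret = gap * n + (0 if picked_a else gap * (horizon - 2 * n))
    return picked_a, regret


def simulate(should_stop, runs=100, horizon=100_000, seed=0):
    """Return (mean regret, share of runs that picked A)."""
    rng = random.Random(seed)
    results = [run_test(rng, horizon, should_stop) for _ in range(runs)]
    mean_regret = sum(r for _, r in results) / runs
    correct = sum(picked for picked, _ in results) / runs
    return mean_regret, correct


rep_regret, rep_correct = simulate(repeated_significance)
fix_regret, fix_correct = simulate(fixed_sample(10_000))
print(f"repeated significance: regret={rep_regret:.0f}, correct={rep_correct:.0%}")
print(f"fixed sample size:     regret={fix_regret:.0f}, correct={fix_correct:.0%}")
```

Checking for significance after every pair gives an early lucky streak many chances to cross the threshold, which is the failure mode described above.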
In conclusion, using Anscombe's method minimizes regret, at the cost of giving up the ability to make inferences. If you aim to make inferences about which ideas work best, you should pick a sample size prior to the experiment and run the experiment until that sample size is reached. But if you want to maximize conversions, use Anscombe's rule.
*There are a lot of A/B testing calculators out there. If you do decide to pick your sample size in advance, I personally like ABBA, since it uses Agresti-Coull confidence intervals and performs multiple test correction, which avoids two other pitfalls not covered in this article.
Anscombe, F. J. (1963). Sequential Medical Trials. Journal of the American Statistical Association, 58(302), 365-383.
I'd like to thank Eric Schwartz for introducing me to the idea of optimizing sequential trials.