A/B Testing: What Is a Type 1 and Type 2 Error and How to Avoid Them


One of the main reasons to carry out A/B testing is to get verifiable results that are repeatable. The only way to achieve this is to use scientific methods. The goal is to obtain the objective truth – free from guesswork, conjecture, and any personal feelings on which variation is best.

However, testers sometimes make errors that are easily overlooked, leading to bad results. When marketers run A/B tests or multivariate tests during their conversion rate optimization work, every test is subject to several possible types of error. The most common are the type 1 error and the type 2 error.

Despite how easy tools make A/B testing, you as the user must understand both scientific methodology and how to interpret the results to avoid making bad decisions.

It’s your job to design the tests, and this is where errors tend to arise: within the experimental design. No A/B testing tool can detect these errors. It’s up to you to spot them when they do occur – or, better, prevent them from happening in the first place.

So what are these errors, what’s the difference between a type 1 and a type 2 error, and how do you avoid them?

Let’s find out! 

What is a type 1 error - false positive?

A false positive can occur when testing a new popup overlay (variation B) vs the original control (variation A). You decide to change the background image to test a more emotive one.

After 10 days of running variation A vs variation B, you check the outcome. The results seem clear, showing a big improvement in conversion. Consequently, the A/B testing is concluded and variation B is implemented as the winner.

However, after several months, the results are no better than the original – in fact, they are worse.

This is an example of a false positive and a type 1 error. 

A type 1 error (false positive) is when a test result suggests a significant difference – indicating a superior variation – when in reality no such difference exists.

How is this possible?

Simply put, it’s the human factor introducing errors – often the result of not doing sufficient research into what should be tested. There are many variables to account for when designing tests; you only need to miss one for your test hypothesis to be wrong.

If all things had been equal and free from outside influences, this A/B test would have produced correct results. If you find yourself in this position, you missed something or let external factors influence the results.

Ultimately, there was a flaw in your scientific method – the point is, YOU as the tester didn’t account for it.
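It’s also worth knowing that even a flawless setup produces type 1 errors by pure chance, at roughly the rate implied by your significance level. A minimal simulation sketch (using a standard two-proportion z-test; the conversion rate and visitor counts are illustrative, not from any real test) runs many A/A tests – both “variations” identical – and counts how often the result looks significant:

```python
import math
import random

random.seed(42)

def ab_significant(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-proportion z-test: True if the difference is 'significant' at ~95%."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return se > 0 and abs(p_a - p_b) / se > z_crit

# A/A test: both "variations" share the same true 3% conversion rate,
# so every significant result here is a false positive (type 1 error).
trials, n, false_positives = 2000, 1000, 0
for _ in range(trials):
    conv_a = sum(random.random() < 0.03 for _ in range(n))
    conv_b = sum(random.random() < 0.03 for _ in range(n))
    if ab_significant(conv_a, n, conv_b, n):
        false_positives += 1

print(f"False positive rate: {false_positives / trials:.1%}")  # close to 5%
```

In other words, at 95% significance roughly 1 in 20 tests of identical variations will still declare a “winner” – which is exactly why a single surprising result shouldn’t be trusted blindly.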

Why do split tests fail?

  • Your persona is too broad
  • Your sample size is too small
  • You are testing the wrong thing
  • Your test duration is too short

What is a type 2 error - false negative?

Let’s work with the same scenario as above: the original control (A) vs the new variation (B). This time, the result shows no change in conversion between the two, so you might decide to keep the original or switch to the new version based on other factors, such as personal preference.

In this scenario, the null hypothesis (defined below) is considered correct – incorrectly.

In reality, the test was flawed and version B was the much better option, so the scenario potentially leads to an incorrect decision. Worse, you would likely never know that version B was better – unless you eliminate the error and retest.

A type 2 error is when the null hypothesis (no difference) is considered correct – incorrectly.
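The usual culprit behind false negatives is an underpowered test: too few visitors to reliably detect a real improvement. A simulation sketch (same two-proportion z-test as above; the 3.0% vs 3.9% rates and 500-visitor samples are illustrative assumptions) shows how often a genuine lift goes undetected:

```python
import math
import random

random.seed(7)

def ab_significant(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-proportion z-test: True if the difference is 'significant' at ~95%."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return se > 0 and abs(p_a - p_b) / se > z_crit

# Variation B genuinely converts better (3.9% vs 3.0%), but with only
# 500 visitors per variation the test usually misses it. Every run that
# comes back "no significant difference" is a false negative (type 2 error).
trials, n, detected = 1000, 500, 0
for _ in range(trials):
    conv_a = sum(random.random() < 0.030 for _ in range(n))
    conv_b = sum(random.random() < 0.039 for _ in range(n))
    if ab_significant(conv_a, n, conv_b, n):
        detected += 1

print(f"The real improvement was detected in only {detected / trials:.0%} of tests")
```

With samples this small, the large majority of runs wrongly conclude “no difference” – which is why sample size matters so much in the next section.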

Testing significance

Before you run your test, you need to calculate what the level of significance should be for the test. Here you are deciding what result determines success.

Generally, this should be based on the Null Hypothesis, which is the default position that there is no significant difference between the two.

What positive deviation from this position should you deem significant? The general consensus is to keep testing until your statistical significance is at least 90% – preferably 95% or over – before making a decision based upon it; in other words, until your confidence that the difference is real exceeds 95%.
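That confidence figure isn’t magic – it can be sketched directly from the raw conversion counts. The following is a minimal illustration using a standard two-sided two-proportion z-test (the `confidence` helper and the 300-vs-345 numbers are made up for the example):

```python
import math
from statistics import NormalDist

def confidence(conv_a, n_a, conv_b, n_b):
    """Two-sided confidence (1 - p-value) that A and B truly differ."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = abs(p_a - p_b) / se
    return 2 * NormalDist().cdf(z) - 1

# A 15% relative lift (345 vs 300 conversions out of 10,000 visitors each)
# looks convincing, but the confidence comes out at roughly 93% – above the
# 90% floor yet still short of the 95% bar, so keep the test running.
print(f"{confidence(300, 10_000, 345, 10_000):.1%}")
```

Note how a lift that looks large on a dashboard can still fall short of the 95% threshold – stopping the test at that point is exactly how type 1 errors get made.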

Another factor that must be considered is the sample size: the smaller the sample, the greater the margin of error. Conversely, the higher your conversion rate, the smaller the sample size you need to measure a given improvement.

Check out this sample size calculator to understand what I mean by this and to see what sample size your A/B test should have.
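The arithmetic behind such calculators can be sketched with the standard two-proportion sample size formula. This is a rough approximation, not any particular tool’s method, and the baseline rates and 20% lift below are illustrative assumptions:

```python
import math
from statistics import NormalDist

def sample_size_per_variation(base_rate, min_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation in a two-proportion test.

    base_rate: baseline conversion rate (e.g. 0.03 for 3%)
    min_lift:  smallest relative improvement worth detecting (0.20 = 20%)
    """
    p1 = base_rate
    p2 = base_rate * (1 + min_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# A 3% baseline needs far more traffic than a 10% baseline
# to detect the same 20% relative lift at 95% significance / 80% power:
print(sample_size_per_variation(0.03, 0.20))   # roughly 14,000 per variation
print(sample_size_per_variation(0.10, 0.20))   # roughly 3,800 per variation
```

This also makes the earlier point concrete: the higher the baseline conversion rate, the fewer visitors you need to detect the same relative improvement.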

How to avoid type 1 and type 2 errors?

  • Generally only A/B test one change at a time
  • Don’t A/B test if you have a low-traffic website with a conversion volume below 1,000 per month. It’s just not worth your time.
  • Make sure you are testing the right thing.
