# Background

A/B Testing is a term typically used in commercial settings to refer to the design and analysis of controlled experiments. Introductory posts on A/B Testing typically focus on simple between-subjects randomised controlled trials (RCTs). In such experiments, participants (e.g. users, customers) are randomised into one of two groups. One group gets some treatment (e.g. a voucher code) and the other receives nothing (‘business-as-usual’), or some lesser treatment without the feature the trial is aiming to investigate. This post will focus on these simple treatment-control designs, but the lessons will generalise to other designs.

Controlled experiments can come in a variety of flavours, not limited to those described above. For example, the psychology experiments I carried out during my PhD used within-subjects designs where each participant receives both a treatment and a control. Not only that, participants were exposed to multiple trials of both the treatment and control. Experiments can also have multiple treatment and control groups, rather than just one of each. For example, if you’re interested in improving an email campaign then multiple different emails (treatments) could be tested in the same trial. There could also be one control group who receive nothing, then another who receive a standard email that has been used in the past. With this more complex design you could ask questions like:

- Is one of our new emails better than the other new ones?
- Is *any* email better than no email?
- Which email is best?
- Are any of the new emails better than our old standard email?

In addition to the range of options for designing experiments, participants can be sampled in a number of ways. Consider our email example above. We could simply randomly assign our customers into the different treatment and control groups, but this isn’t our only option. There might be various subgroups of our customers that we’re interested in. For example, our email might be communicating important information about a customer’s account. It would be important to ensure that the email is engaging and accessible for all customers, not just the majority. With these subgroups in mind, we could randomly sample *within* the groups we’re interested in (e.g. age, location). Assuming every customer is in exactly one group, this would be a case of stratified sampling ^{1}.
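As a rough sketch, stratified random assignment could look something like this in R. The `age_band` column and the customer table are made up for illustration:

```
library(dplyr)

# hypothetical customer table with an age_band stratum
customers <- data.frame(
  id = 1:1000,
  age_band = sample(c('18-34', '35-54', '55+'), 1000, replace = TRUE)
)

# randomise to treatment/control separately within each stratum
set.seed(42)
assigned <- customers %>%
  group_by(age_band) %>%
  mutate(group = sample(rep(c('control', 'treatment'), length.out = n()))) %>%
  ungroup()

# each stratum now has a roughly 50/50 split
table(assigned$age_band, assigned$group)
```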

# Some complications in A/B Testing

This post will focus on an issue commonly faced when analysing A/B tests – dealing with baseline differences between the treatment and control groups. We’ll focus on two different forms this typically takes:

- Baseline differences in the outcome of interest
- Baseline differences in additional covariates that predict the outcome

Throughout we’ll use the example of a simple RCT with one treatment and one control, where our outcome of interest is engagement with an email we’ve sent out.

# Baseline differences in outcomes

Suppose we’ve completed our RCT and find that the customers in our treatment group historically engaged with emails at a higher rate than our control group. A common thing we might be tempted to do is test whether this baseline difference is significant (Boer et al., 2015). These tests are often used to determine “whether randomization was successful” (Boer et al., 2015), but this represents a misunderstanding of why we randomise (Harvey, 2018). We do not randomise to ensure that there are no baseline differences between our groups, we randomise to remove any relationship between baseline values and treatment assignment. The test of baseline differences is then testing a hypothesis we already know to be true: that there is no association between baseline engagement and treatment assignment. We *know* that the baseline differences are the result of chance because we did the randomisation! ^{2} As Bland and Altman (2011) put it (as cited by Harvey, 2018):

“….performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance. Such a procedure is clearly absurd.”
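A quick simulation makes the point concrete: under pure randomisation, a baseline significance test comes out “significant” about 5% of the time by construction (at the conventional 0.05 threshold), telling us nothing about the quality of the randomisation:

```
set.seed(1)
p_values <- replicate(1000, {
  baseline <- rnorm(200)                 # pre-experiment engagement
  group <- sample(rep(0:1, each = 100))  # random assignment
  t.test(baseline ~ group)$p.value
})
mean(p_values < 0.05)  # roughly 0.05, as expected under the null
```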

## Difference of differences

If testing for baseline differences is irrelevant, then what should we do? In a simple experiment where we have no covariates, we might consider doing a difference-of-differences analysis. For example, consider the table below:

| Group | Pre-experiment engagement | Post-experiment engagement |
|---|---|---|
| Control | 0.3 | 0.29 |
| Treatment | 0.4 | 0.41 |

Naively, we might look at the post-experiment difference between the treatment and control groups and conclude our new email is a great success. However, there is clearly a big difference in the pre-experiment engagement between our groups. We could instead do a difference-of-differences analysis. Here we’d subtract the baseline engagement from the post-experiment value for each group. Looking at this diff-of-diffs, our email doesn’t look quite as good.

| Group | Pre engagement | Post engagement | Diff from pre |
|---|---|---|---|
| Control | 0.3 | 0.29 | -0.01 |
| Treatment | 0.4 | 0.41 | 0.01 |
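Using the (made-up) numbers from the table, the calculation is just:

```
# engagement rates from the table above
pre  <- c(control = 0.30, treatment = 0.40)
post <- c(control = 0.29, treatment = 0.41)

# within-group change from baseline
diff_from_pre <- post - pre  # control: -0.01, treatment: 0.01

# diff-of-diffs estimate of the treatment effect
unname(diff_from_pre['treatment'] - diff_from_pre['control'])  # 0.02
```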

### Limitations

While a diff-of-diffs analysis is certainly a useful starting point, it has some limitations. Most commonly people point out that it assumes the coefficient predicting post-experiment outcomes from pre-experiment values is 1 (e.g. Jackson, 2018; Harien & Li, 2019). This might sound a bit esoteric – it certainly did to me at first – so let’s go through what it means.

In our initial naive analysis we wanted to predict engagement with the emails simply from group, which we can represent as

\[ engagement = \beta_0 + \beta_{group} * group \] The thing we’re interested in is \(\beta_{group}\), which represents the average treatment effect. In the diff-of-diffs analysis the \(engagement\) outcome variable is replaced with the difference in engagement. To see why this assumes that the coefficient predicting post-engagement from pre-engagement is 1, let’s add pre-engagement to our model.

\[ engagement = \beta_{group} * group + \beta_{pre} * pre \] Here \(\beta_{pre}\) represents the relationship between pre and post engagement (the intercept is dropped to keep things simple). If we set \(\beta_{pre}\) to 1 we can simplify things a bit.

\[ \begin{align} engagement & = \beta_{group} * group + 1 * pre \\ engagement & = \beta_{group} * group + pre \\ engagement - pre & = \beta_{group} * group \\ engagement \ diff & = \beta_{group} * group \\ \end{align} \] By setting \(\beta_{pre}\) to 1, the model simplifies to the diff-of-diffs analysis.

However, there’s no reason why \(\beta_{pre}\) should always be one. Instead of assuming this with a diff-of-diffs analysis, we could just include baseline performance in our model and allow \(\beta_{pre}\) to be different from 1.
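In R this just means fitting a model with pre-experiment engagement as a covariate rather than subtracting it from the outcome. A sketch, assuming a data frame `df` with hypothetical `post`, `pre`, and `group` columns:

```
# naive model: group only
naive_fit <- lm(post ~ group, data = df)

# diff-of-diffs as a model: implicitly fixes the baseline coefficient at 1
dod_fit <- lm(I(post - pre) ~ group, data = df)

# covariate-adjusted model: lets the data estimate beta_pre
ancova_fit <- lm(post ~ group + pre, data = df)
```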

## When to include baseline performance

It might be tempting to think that we only need to include baseline performance in our model when there are differences between treatment and control at baseline. However, adjusting for baseline values by including them in our model will increase the precision of our estimated treatment effect, even when there aren’t baseline differences (Senn, 2019). This fact is exploited by CUPED, a technique for variance reduction in experiments by adjusting for baseline data (Deng, Xu, Kohavi, & Walker, 2013; Xie & Aurisset, 2016; Jackson, 2018).
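As a rough sketch of the CUPED idea (following Deng et al., 2013), the outcome is adjusted using the pre-experiment covariate before comparing groups. Variable names here are illustrative, again assuming a data frame `df` with `post`, `pre`, and `group` columns:

```
# theta is the OLS coefficient of the outcome on the pre-experiment covariate
theta <- cov(df$post, df$pre) / var(df$pre)

# CUPED-adjusted outcome: same mean as post, but lower variance
df$post_cuped <- df$post - theta * (df$pre - mean(df$pre))

# the treatment effect is then estimated on the adjusted outcome
lm(post_cuped ~ group, data = df)
```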

# Handling other covariates

Baseline performance is just one covariate we may have available. In our email example, we might have plenty of additional information about the customers, such as their time spent on-site or the amount of money they have spent with us. We can include these additional covariates in our model as well to produce a more reliable estimate of the treatment effect, controlling for these factors. Below are the results of a simulation showing the benefit of including additional covariates.

1,000 datasets were simulated with 1,000 participants each. 50% of participants are in the treatment group and the rest are in the control group. There are also additional covariates that predict the outcome. The plot below shows the distribution of estimated treatment effects for a simple model with only group as a predictor versus one with the additional covariates included. The true effect size is shown by the vertical line. (The code for this simulation is at the end of the post.)

On average the estimated treatment effect is the same for each model (0.2 vs 0.202). This makes sense as participants were randomly assigned to treatment and control. The covariates had an equivalent effect on both the control and treatment groups, meaning they’ll wash out on average leaving the correct treatment effect (for more discussion of this point see Senn, 2005).

However, the estimates are clearly far more variable for the simple model. This means that while the simple model might get the estimate right on average, there are many cases where including the covariates would improve the estimate. The plot below shows this more clearly. Each line is a single sample, with the estimates for the simple model on the left and the model with covariates on the right. We can see many cases where including the covariates brings the estimate closer to the true value shown by the dashed line. Overall, the estimate for the model with covariates is closer to the true value 66.1% of the time.

This reduction in variance also improves our ability to detect the treatment effect, as well as reducing our uncertainty about the effect size (Senn, 2019; Jackson, 2018). For the model without the covariates, a significant effect of treatment is found in 58.9% of samples versus 89.8% for the covariate model. In other words, including the covariates increases our ability to detect the true effect. The simple model is therefore underpowered compared to the covariate model to detect an effect of the size used here (for more on power see Morey, 2019).

# Conclusion

We’ve seen how making use of baseline values for our outcome measure, as well as additional covariates, helps us estimate and detect treatment effects. The approach taken here is very similar to what CUPED does (Deng, Xu, Kohavi, & Walker, 2013; Xie & Aurisset, 2016; Jackson, 2018). Indeed, Uber make use of CUPED for bias reduction when running online experiments (Harien & Li, 2019).

# Code

Below is the code for the simulations and density plot.

```
# some packages
library(MASS)
library(tidyverse)
library(broom)
library(viridis)
library(scales)

# simulation function
lr_sim <- function(N = 1000,          # sample size
                   beta_0 = 1,        # intercept
                   beta_group = 0.2,  # effect of group
                   # covariate coefficients
                   beta_covar = c(0.7, 0.4, 0.2)) {
  # create the predictors,
  # assuming standardised predictors
  x_group <- rep(0:1, each = N / 2) # control (0) vs treatment (1) flag
  # covariance matrix for the (correlated) covariates
  sigma <- matrix(c(1, 0.4, 0.3,
                    0.4, 1, 0.6,
                    0.3, 0.6, 1), nrow = 3)
  # create the covariates
  x_covar <- mvrnorm(N, mu = rep(0, 3), Sigma = sigma)
  # create the outcome variable by
  # combining the data and the coefficients
  # with some random noise added
  engagement <- beta_0 + x_group * beta_group + x_covar %*% beta_covar + rnorm(N)
  # create a df from the data
  df <- data.frame(engagement, x_group, x_covar)
  # set the colnames
  colnames(df) <- c('engagement', 'group', 'pre', 'time', 'spend')
  # fit a linear regression and
  # get a tidy df of the coefficients
  coef_simple <- lm(engagement ~ group, data = df) %>%
    tidy()
  # do the same with covariates
  coef_covar <- lm(engagement ~ ., data = df) %>%
    tidy()
  # combine the two dfs and filter to
  # just include the row for the
  # group effects
  bind_rows(Simple = coef_simple,
            Covariates = coef_covar,
            .id = 'model') %>%
    filter(term == 'group')
}

# call the simulation function 1,000 times
set.seed(2019)
sim_results <- map_df(1:1000, ~ lr_sim())

# plot densities for the estimates
sim_results %>%
  ggplot() +
  aes(estimate, color = model, fill = model) +
  # geoms -----------------------------
  geom_density(alpha = .4) +
  geom_rug() +
  geom_vline(xintercept = 0.2, linetype = 2) +
  # annotations for labels ------------
  annotate(
    'text',
    x = 0.375,
    y = 5,
    label = 'Model with covariates',
    family = 'mono',
    color = plasma(2, end = 0.6)[1],
    fontface = 'bold'
  ) +
  annotate(
    'text',
    x = 0,
    y = 2,
    label = 'Simple model',
    family = 'mono',
    color = plasma(2, end = 0.6)[2],
    fontface = 'bold'
  ) +
  # scales and theme ------------------
  scale_x_continuous(breaks = seq(0, 0.4, by = 0.1)) +
  scale_color_viridis_d(option = 'plasma', end = .6) +
  scale_fill_viridis_d(option = 'plasma', end = .6) +
  theme_minimal(base_family = 'mono') +
  theme(legend.position = 'none',
        panel.grid = element_blank()) +
  # labels ----------------------------
  labs(
    x = 'Estimated treatment effect',
    y = 'Density',
    title = 'Distribution of estimated treatment effects',
    subtitle = 'For a group-only vs a group + covariates model'
  )
```

This code creates the second plot showing the paired estimates for each sample.

```
sim_results %>%
  mutate(idx = rep(1:1000, each = 2)) %>%
  ggplot() +
  aes(fct_rev(model), estimate, group = idx) +
  geom_line(alpha = .1, color = plasma(1, begin = 0.3)) +
  geom_hline(yintercept = 0.2, linetype = 2) +
  theme_minimal(base_family = 'mono') +
  labs(x = 'Model',
       y = 'Estimated treatment effect',
       title = 'Treatment effects for the simple \nand covariate models',
       subtitle = 'Each line is one of the 1,000 simulations')
```

1. Analysis of such trials won’t be discussed here, but would probably be best suited to a multi-level model. We could also fit a multi-level model and *post*-stratify if the sample wasn’t stratified. I’m planning to talk about multi-level models and post-stratification in more detail in a future post. ↩
2. More technically, the p-value for a difference test between treatment and control represents the probability of observing a test statistic as-or-more extreme than the one observed, assuming there is no difference. But due to our randomisation we already know the null (no difference) is true, so any significant difference is the result of chance. ↩