Simulating A/B tests with data.table

library(tidyverse)
library(data.table)

set.seed(63)

knitr::opts_chunk$set(warning = F, message = F)

A/B tests involve comparing two different groups, typically with a randomised trial. Traditionally this might be a clinical trial where a new drug is given to a treatment group who are compared with a control group who did not receive the drug. Other early A/B tests were done in agriculture, where fields would be split into sections that were treated differently. A/B tests are also used widely in business to test things like new website features, emails, or machine learning models.

This post describes some functions to simulate A/B tests. Subsequent posts will then make use of this code to explore topics like early stopping for A/B tests. The type of A/B test being simulated here is a simple two-group test where the outcome is binary (0 = fail, 1 = success). In the context of a new website feature this could represent clicks or conversion. For emails, a binary outcome might be whether the email was opened. The function will allow for multiple weeks of testing to be simulated. This feature will then facilitate future analysis of early stopping.

Simulating a week of a test

The basic building block for simulating A/B tests is generating a week's worth of data for the test and control groups. The core function used here is rbinom(). When first tackling this problem I played around with Rcpp to create my own version of rbinom(). However, a quick bit of benchmarking showed it was no faster than the rbinom() function built into R. If we print out this function we can see why.

rbinom
## function (n, size, prob) 
## .Call(C_rbinom, n, size, prob)
## <bytecode: 0x7f888d76ade8>
## <environment: namespace:stats>

The output shows that rbinom() is already calling C with .Call(C_rbinom, n, size, prob)! While it was fun trying out Rcpp, I could have saved myself some time by not optimising too early.
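As a quick sanity check of my own (not from the original benchmark), rbinom() with size = 1 draws Bernoulli 0/1 values, and the sample mean should sit close to the requested probability:

```r
# draw 100,000 Bernoulli trials with a 0.3 success probability;
# the sample mean should land very close to 0.3
set.seed(63)
x <- rbinom(1e5, 1, 0.3)
mean(x)  # close to 0.3
```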

The function below simulates a single week of testing. We’re making use of data.table for its combination of speed and relatively simple syntax. I initially worked on a set of functions using the tidyverse but found them to be too slow.

A data.table is returned by the function with two columns, control and treatment. For the control column, we draw n_per_week values from a binomial distribution with a baseline probability of success (1). For the treatment column, the probability of the value being 1 is shifted by treatment_effect.

sim_week <- function(n_per_week, baseline, treatment_effect) {
  data.table(
    control = rbinom(n_per_week, 1, baseline),
    treatment = rbinom(n_per_week, 1, baseline + treatment_effect)
  )
}

Below the function is called with treatment_effect = 0, meaning there's no real difference between the control and treatment groups.

sim_week(1000, 0.5, 0) %>%
  colMeans()
##   control treatment 
##     0.498     0.518

The treatment_effect could be set to 0.1 to give a 60% chance of the outcome being 1 for the treatment group, versus 50% for control.

sim_week(1000, 0.5, 0.1) %>%
  colMeans()
##   control treatment 
##     0.483     0.593

Simulate multiple weeks of testing

Now our basic building block is in place, we can move to a slightly more complicated problem: simulating multiple weeks of a test. The complication here stems from the fact that weeks aren't independent: the data for the second week of a test accumulates on top of the first week. In terms of arguments to the function, the only new one is n_weeks, the number of weeks of data we want to simulate.

sim_weeks <-
  function(n_per_week,
           baseline,
           treatment_effect,
           n_weeks) {
    # create list where each element is the data
    # for a week created by sim_week()
    weeks_list = replicate(n_weeks,
                           sim_week(n_per_week, baseline, treatment_effect),
                           simplify = F)
    
    # collapse the list of weeks into a
    # data.table with a week id column
    weeks_dt <- rbindlist(weeks_list, idcol = 'week')
    
    weeks_dt %>%
      # melt into long format with week as the id
      melt(id.vars = 'week',
           variable.name = 'group',
           value.name = 'success') %>%
      # sum success flag by week and group (treat vs control)
      .[, .(successes_per_week = sum(success)), by = c('week', 'group')] %>%
      # create the cumulative successes for each week by group
      .[, successes := cumsum(successes_per_week), by = 'group'] %>%
      # create the cumulative failures by week
      .[, failures := week * n_per_week - successes] %>%
      # order by week
      .[order(week)]
  }

To begin with, the function creates a list with the new data for each week. These samples are independent because they only reflect the new data for each week, without combining with previous weeks. This list is then combined into a single data.table with a column to identify the week, using the handy rbindlist() function provided by data.table.

# create list where each element is the data
# for a week created by sim_week()
weeks_list = replicate(n_weeks,
                       sim_week(n_per_week, baseline, treatment_effect),
                       simplify = F)

# collapse the list of weeks into a
# data.table with a week id column
weeks_dt <- rbindlist(weeks_list, idcol = 'week')

The next step is to take these samples and combine them with previous weeks. We take our data.table containing the new data for each week and transform it in a number of steps.

  1. melt() into a long format.
  2. Create a column called successes_per_week by summing the binary success value by week and group (treatment vs control).
  3. Calculate the cumulative sum of successes within each group, ordered by week, into a column called successes.
  4. Create a failures column containing the cumulative number of failures (zeros) up to each week.
  5. Order by week.

weeks_dt %>%
  # melt into long format with week as the id
  melt(id.vars = 'week',
       variable.name = 'group',
       value.name = 'success') %>%
  # sum success flag by week and group (treat vs control)
  .[, .(successes_per_week = sum(success)), by = c('week', 'group')] %>%
  # create the cumulative successes for each week by group
  .[, successes := cumsum(successes_per_week), by = 'group'] %>%
  # create the cumulative failures by week
  .[, failures := week * n_per_week - successes] %>%
  # order by week
  .[order(week)]

Calling this function, we can see the output it gives us. This has been designed to make statistical tests for each week easy to add, which we’ll turn to next.

sim_weeks(500, 0.5, 0.2, 2) %>%
  head()
##    week     group successes_per_week successes failures
## 1:    1   control                247       247      253
## 2:    1 treatment                342       342      158
## 3:    2   control                226       473      527
## 4:    2 treatment                345       687      313
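A useful consistency check on the output above (my own illustration, using the printed numbers directly rather than the data.table itself): for every row, the cumulative successes and failures should add up to week * n_per_week, here 500 per week.

```r
# rows from the printed output above: week, cumulative successes, failures
rows <- list(
  c(week = 1, successes = 247, failures = 253),
  c(week = 1, successes = 342, failures = 158),
  c(week = 2, successes = 473, failures = 527),
  c(week = 2, successes = 687, failures = 313)
)

# each row should satisfy successes + failures == week * 500
chk <- sapply(rows, function(r) r[['successes']] + r[['failures']] == r[['week']] * 500)
all(chk)  # TRUE
```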

Add statistical tests each week

In a future post we'll explore early stopping for A/B tests. To facilitate this, the next function runs a prop.test() for each week of cumulative data.

sim_test <-
  function(n_per_week,
           baseline,
           treatment_effect,
           n_weeks) {
    # get the data
    sim_result <-
      sim_weeks(n_per_week, baseline, treatment_effect, n_weeks)
    
    # create an empty dt to populate below
    dt_prop_test_result = data.table(
      week = 1:n_weeks,
      control_prop = vector(mode = 'numeric', length = n_weeks),
      treatment_prop = vector(mode = 'numeric', length = n_weeks),
      p_value = vector(mode = 'numeric', length = n_weeks)
    )
    
    for (w in 1:n_weeks) {
      # create a matrix with successes and failures
      # in the columns, control and treatment in the rows
      mat = as.matrix(sim_result[week == w, .(successes, failures)])
      
      # do the prop.test
      prop_test_result = prop.test(mat)
      
      # update the results data.table
      dt_prop_test_result[week == w, `:=`(
        control_prop = prop_test_result$estimate[1],
        treatment_prop = prop_test_result$estimate[2],
        p_value = prop_test_result$p.value
      )]
    }
    
    return(dt_prop_test_result)
  }

Breaking this down a bit, the function starts by simply calling sim_weeks(). Next, an empty data.table is instantiated, to be populated with the week, the proportions in the control and treatment groups, and the p-value from prop.test().

# create an empty dt to populate below
dt_prop_test_result = data.table(
  week = 1:n_weeks,
  control_prop = vector(mode = 'numeric', length = n_weeks),
  treatment_prop = vector(mode = 'numeric', length = n_weeks),
  p_value = vector(mode = 'numeric', length = n_weeks)
)

Finally, the function loops through the weeks, first creating a matrix of successes and failures in the format favoured by prop.test(). The prop.test() is then run and the results are added into dt_prop_test_result.

for (w in 1:n_weeks) {
  # create a matrix with successes and failures
  # in the columns, control and treatment in the rows
  mat = as.matrix(sim_result[week == w, .(successes, failures)])
  
  # do the prop.test
  prop_test_result = prop.test(mat)
  
  # update the results data.table
  dt_prop_test_result[week == w, `:=`(
    control_prop = prop_test_result$estimate[1],
    treatment_prop = prop_test_result$estimate[2],
    p_value = prop_test_result$p.value
  )]
}
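To make the matrix format concrete, here is a standalone illustration of my own, built from the week 1 counts printed earlier: prop.test() accepts a two-column matrix with one row per group, successes in the first column and failures in the second.

```r
# one row per group; columns are successes then failures
mat <- matrix(c(247, 253,    # control: week 1 counts from the earlier output
                342, 158),   # treatment
              nrow = 2, byrow = TRUE,
              dimnames = list(c('control', 'treatment'),
                              c('successes', 'failures')))

res <- prop.test(mat)
res$estimate  # sample proportions: 247/500 = 0.494 and 342/500 = 0.684
res$p.value   # very small here, as the two proportions clearly differ
```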

Running the function we can see the output in a neat format for analysis.

sim_test(500, 0.5, 0.1, 2) %>%
  head()
##    week control_prop treatment_prop      p_value
## 1:    1        0.490          0.624 2.650377e-05
## 2:    2        0.502          0.597 2.394067e-05

Simulate multiple tests

The final thing we'll want for running simulation experiments is the ability to run multiple tests. The function below allows us to do that.

sim_tests <-
  function(n_per_week,
           baseline,
           treatment_effect,
           n_weeks,
           n_tests) {
    # replicate the test n_tests times
    tests_list = replicate(n_tests,
                           sim_test(n_per_week, baseline, treatment_effect, n_weeks),
                           simplify = F)
    
    # return a data.table with an id col called test
    rbindlist(tests_list, idcol = 'test')
  }

First off, a list of results from running sim_test() n_tests times is created.

# replicate the test n_tests times
tests_list = replicate(
  n_tests,
  sim_test(n_per_week, baseline, treatment_effect, n_weeks),
  simplify = F
)

This is then combined into a single data.table with an id column called test.

# return a data.table with an id col called test
rbindlist(tests_list, idcol = 'test')

Running the final function gives an output that can then be used to analyse A/B tests.

sim_tests(
  n_per_week = 500,
  baseline = 0.5,
  treatment_effect = 0,
  n_weeks = 2,
  n_tests = 5
) %>%
  head(10)
##     test week control_prop treatment_prop    p_value
##  1:    1    1        0.462          0.516 0.10001468
##  2:    1    2        0.467          0.511 0.05441929
##  3:    2    1        0.478          0.506 0.41090763
##  4:    2    2        0.480          0.506 0.26350568
##  5:    3    1        0.446          0.524 0.01619845
##  6:    3    2        0.487          0.510 0.32517730
##  7:    4    1        0.514          0.508 0.89931895
##  8:    4    2        0.519          0.481 0.09798734
##  9:    5    1        0.498          0.478 0.56910237
## 10:    5    2        0.488          0.508 0.39548488

A simple example of how this function could be used is to check the false-positive rate for tests. In the example below there's no true treatment effect, meaning we expect about 5% of p-values to fall below 0.05. A future post will look into some of these topics in more depth.

dat_sim_results <- 
  sim_tests(
    n_per_week = 500,
    baseline = 0.5,
    treatment_effect = 0,
    n_weeks = 1,
    n_tests = 1000
  )

# false positive rate
mean(dat_sim_results$p_value < 0.05)
## [1] 0.056
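The flip side of the false-positive rate is statistical power. As a hedged sketch of my own (using prop.test() directly rather than the functions above, so it stays self-contained), we can estimate the share of tests that correctly reject the null when a true effect of 0.1 exists:

```r
# simulate 1000 one-week tests with a true 0.1 lift (0.5 vs 0.6)
# and collect the p-value from prop.test() for each
set.seed(63)
p_values <- replicate(1000, {
  control <- rbinom(500, 1, 0.5)
  treatment <- rbinom(500, 1, 0.6)
  counts <- matrix(c(sum(control), 500 - sum(control),
                     sum(treatment), 500 - sum(treatment)),
                   nrow = 2, byrow = TRUE)
  prop.test(counts)$p.value
})

# estimated power for a 0.1 lift with 500 per group
mean(p_values < 0.05)
```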

Benchmarking

On my 2015 MacBook Pro, the sim_tests() function takes about 10 seconds to simulate 500 tests with 1000 customers per week and 4 weeks.

bench = bench::mark(
  sim_tests = sim_tests(1000, 0.5, 0, 4, 500),
  min_iterations  = 5,
  check = FALSE
)

Appendix: code

#======================================
# simulate a week of a test
#======================================

sim_week <- function(n_per_week, baseline, treatment_effect) {
  data.table(
    control = rbinom(n_per_week, 1, baseline),
    treatment = rbinom(n_per_week, 1, baseline + treatment_effect)
  )
}

#======================================
# simulate weeks of testing
#======================================

sim_weeks <-
  function(n_per_week,
           baseline,
           treatment_effect,
           n_weeks) {
    # create list where each element is the data
    # for a week created by sim_week()
    weeks_list = replicate(n_weeks,
                           sim_week(n_per_week, baseline, treatment_effect),
                           simplify = F)
    
    # collapse the list of weeks into a
    # data.table with a week id column
    weeks_dt <- rbindlist(weeks_list, idcol = 'week')
    
    weeks_dt %>%
      # melt into long format with week as the id
      melt(id.vars = 'week',
           variable.name = 'group',
           value.name = 'success') %>%
      # sum success flag by week and group (treat vs control)
      .[, .(successes_per_week = sum(success)), by = c('week', 'group')] %>%
      # create the cumulative successes for each week by group
      .[, successes := cumsum(successes_per_week), by = 'group'] %>%
      # create the cumulative failures by week
      .[, failures := week * n_per_week - successes] %>%
      # order by week
      .[order(week)]
  }

#======================================
# simulate weeks of testing with
# a prop.test for each week
#======================================

sim_test <-
  function(n_per_week,
           baseline,
           treatment_effect,
           n_weeks) {
    # get the data
    sim_result <-
      sim_weeks(n_per_week, baseline, treatment_effect, n_weeks)
    
    # create an empty dt to populate below
    dt_prop_test_result = data.table(
      week = 1:n_weeks,
      control_prop = vector(mode = 'numeric', length = n_weeks),
      treatment_prop = vector(mode = 'numeric', length = n_weeks),
      p_value = vector(mode = 'numeric', length = n_weeks)
    )
    
    for (w in 1:n_weeks) {
      # create a matrix with successes and failures
      # in the columns, control and treatment in the rows
      mat = as.matrix(sim_result[week == w, .(successes, failures)])
      
      # do the prop.test
      prop_test_result = prop.test(mat)
      
      # update the results data.table
      dt_prop_test_result[week == w, `:=`(
        control_prop = prop_test_result$estimate[1],
        treatment_prop = prop_test_result$estimate[2],
        p_value = prop_test_result$p.value
      )]
    }
    
    return(dt_prop_test_result)
  }

#======================================
# simulate a set of tests of the same
# type using the above functions
#======================================

sim_tests <-
  function(n_per_week,
           baseline,
           treatment_effect,
           n_weeks,
           n_tests) {
    # replicate the test n_tests times
    tests_list = replicate(n_tests,
                           sim_test(n_per_week, baseline, treatment_effect, n_weeks),
                           simplify = F)
    
    # return a data.table with an id col called test
    rbindlist(tests_list, idcol = 'test')
  }
