Crash Course in A/B testing

Crash Course in A/B testing: A statistical perspective. Wayne Tai Lee.

Transcript of Crash Course in A/B testing

Page 1: Crash Course in A/B testing

Crash Course in A/B testing: A statistical perspective

Wayne Tai Lee

Page 2: Crash Course in A/B testing

Roadmap

• What is A/B testing?
• Good experiments and the role of statistics
• Similar to proof by contradiction
• “Tests”
• Big data meets classic asymptotics
• Complaints with classical hypothesis testing
• Alternatives?

Page 3: Crash Course in A/B testing

What is A/B Testing?

• An industry term for a controlled, randomized experiment comparing treatment and control groups.
• An age-old problem… especially with humans.

Page 4: Crash Course in A/B testing

What most people know:

[Diagram: two groups, A and B]
Gather samples -> Assign treatments -> Apply treatments -> Measure outcome -> Compare

Page 5: Crash Course in A/B testing

What most people know:

[Diagram: two groups, A and B, compared]

Only difference is in the treatment!

Page 6: Crash Course in A/B testing

Reality:

[Diagram: groups A and B with three sources of variability:]
- Variability from samples/inputs
- Variability from treatment/function
- Variability from measurement

How do we account for all that?

Page 7: Crash Course in A/B testing

Confounding:

• If there is variability in addition to the treatment effect, how can we identify and isolate the effect of the treatment?

Page 8: Crash Course in A/B testing

3 Types of Variability:

• Controlled variability
  • Systematic and desired
  • i.e. our treatment
• Bias
  • Systematic but not desired
  • Anything that can confound our study
• Noise
  • Random error and not desired
  • Won’t confound the study but makes it hard to make a decision.

Page 9: Crash Course in A/B testing

How do we categorize each?

[Diagram: groups A and B with three sources of variability:]
- Variability from samples/inputs
- Variability from treatment/function
- Variability from measurement

Page 10: Crash Course in A/B testing

Reality:

[Diagram: groups A and B as before; measurement variability highlighted]

Good instrumentation!

Page 11: Crash Course in A/B testing

Reality:

[Diagram: groups A and B as before; sample/assignment variability highlighted]

Randomize assignment! Convert bias to noise.

Page 12: Crash Course in A/B testing

Reality:

[Diagram: groups A and B as before; sample/assignment variability highlighted]

Randomize assignment! Convert bias to noise.

Your population can be skewed or biased… but that only restricts the generalizability of the results.

Page 13: Crash Course in A/B testing

Reality:

[Diagram: groups A and B as before]

Think about what you want to measure and how! Minimize the noise level/variability in the metric.

Page 14: Crash Course in A/B testing

A good experiment in general:

- Good design and implementation to avoid bias.
- For unavoidable biases, use randomization to turn them into noise.
- Good planning to minimize noise in the data.

Page 15: Crash Course in A/B testing

How do we deal with noise?

- The bread and butter of statisticians!
- Quantify the magnitude of the treatment effect
- Quantify the magnitude of the noise
- Just compare… most of the time

Page 16: Crash Course in A/B testing

Formalizing the Comparison

Similar to proof by contradiction:
- You assume the difference is due to chance (noise)

Page 17: Crash Course in A/B testing

Formalizing the Comparison

Similar to proof by contradiction:
- You assume the difference is due to chance (noise)
- See how the data contradicts the assumption

Page 18: Crash Course in A/B testing

Formalizing the Comparison

Similar to proof by contradiction:
- You assume the difference is due to chance (noise)
- See how the data contradicts the assumption
- If the surprise surpasses a threshold, we reject the assumption.
- … nothing is “100%”

Page 19: Crash Course in A/B testing

Difference due to chance?

ID         PV    Group
Person 1    39   treatment
Person 2   209   treatment
Person 3    31   treatment
Person 4    98   control
Person 5     9   treatment
Person 6   151   control

(Red = treatment, black = control on the original slide; the split shown here is the one implied by the group means on the next page.)

Page 20: Crash Course in A/B testing

Difference due to chance?

ID         PV    Group       Group mean
Person 1    39   treatment   72
Person 2   209   treatment
Person 3    31   treatment
Person 4    98   control     124.5
Person 5     9   treatment
Person 6   151   control

(Red = treatment, black = control on the original slide.)

Let’s measure the difference in means!

Diff = 72 – 124.5 = -52.5 … so what?

Page 21: Crash Course in A/B testing

Difference due to chance?

ID         PV
Person 1    39
Person 2   209
Person 3    31
Person 4    98
Person 5     9
Person 6   151

(The slide shows the same table twice: once with the original red/black treatment labels and once with the labels shuffled.)

If there was no difference from the treatment, shuffling the treatment status can emulate the randomization of the samples.

Page 22: Crash Course in A/B testing

Difference due to chance?

ID         PV    Shuffled group
Person 1    39   control
Person 2   209   treatment
Person 3    31   treatment
Person 4    98   treatment
Person 5     9   control
Person 6   151   treatment

(One shuffle of the red/black labels; the split shown is the one implied by the stated means.)

Diff = 122.25 – 24 = 98.25

Page 23: Crash Course in A/B testing

Difference due to chance?

ID         PV    Shuffled group
Person 1    39   treatment
Person 2   209   treatment
Person 3    31   treatment
Person 4    98   control
Person 5     9   control
Person 6   151   treatment

(Another shuffle; again the split implied by the stated means.)

Diff = 107.5 – 53.5 = 54

Page 24: Crash Course in A/B testing

Difference due to chance?

Our original diff: -52.5

50,000 repeats later…

[Histogram of the differences across the shuffles]

Page 25: Crash Course in A/B testing

Difference due to chance?

Our original diff: -52.5

46.5% of the permutations yielded a difference at least as large in magnitude as our original sample’s. Are you surprised by the initial result?

Page 26: Crash Course in A/B testing

“Tests”

Congratulations!

- You just learned the permutation test!
- The 46.5% is the p-value under the permutation test.

Page 27: Crash Course in A/B testing

“Tests”

Congratulations!

- You just learned the permutation test!
- The 46.5% is the p-value under the permutation test.

Problems:
- Permuting the labels can be computationally costly.
  - Not possible before computers!
- Statistical theory says there are many tests out there.
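A minimal sketch of this permutation test in Python, assuming NumPy; the treatment/control split is the one implied by the group means on Page 20:

```python
import numpy as np

rng = np.random.default_rng(0)

# Page views for Persons 1-6; the treatment/control split is the one
# implied by the group means on Page 20 (72 vs 124.5).
pv = np.array([39, 209, 31, 98, 9, 151])
is_treatment = np.array([True, True, True, False, True, False])

def mean_diff(values, treated):
    """Treatment mean minus control mean."""
    return values[treated].mean() - values[~treated].mean()

observed = mean_diff(pv, is_treatment)  # -52.5

# Shuffle the treatment labels many times and count how often the shuffled
# difference is at least as large in magnitude as the observed one.
n_repeats = 50_000
hits = sum(
    abs(mean_diff(pv, rng.permutation(is_treatment))) >= abs(observed)
    for _ in range(n_repeats)
)
print(f"observed diff = {observed}, p-value ~ {hits / n_repeats:.3f}")
# Exact enumeration over all 15 possible label splits gives 7/15 ~ 46.7%,
# in line with the ~46.5% reported on the slides.
```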

Page 28: Crash Course in A/B testing

“Tests”


Standard t-test:
1) Calculate delta: Δ = mean_treatment – mean_control
2) Assume Δ follows a Normal distribution, then calculate the p-value.
3) If the p-value < 0.05, we reject the assumption that there is no difference between treatment and control.

[Figure: Normal density centered at 0; p-value = sum of the red tail areas]
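To show the mechanics on the toy data above, a minimal sketch assuming SciPy; Welch's unequal-variance variant is used here as one common default, and six data points are of course far too few for the Normal approximation to be trusted:

```python
from scipy import stats

# Toy data from Page 20; the split is the one implied by the group means.
treatment = [39, 209, 31, 9]
control = [98, 151]

# Welch's variant of the t-test: does not assume the two groups share
# the same variance.
result = stats.ttest_ind(treatment, control, equal_var=False)
print(f"delta = {sum(treatment) / 4 - sum(control) / 2}")   # -52.5
print(f"p-value = {result.pvalue:.3f}")

# The decision rule from step 3:
if result.pvalue < 0.05:
    print("Reject the no-difference assumption.")
else:
    print("Cannot reject the no-difference assumption.")
```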

Page 29: Crash Course in A/B testing


Big data meets classic Stats

Wait, our metrics may not be Normal!

Page 30: Crash Course in A/B testing

Big Data meets Classic Stat


Wait, our metrics may not be Normal!

We care about the “mean of the metric” and not the actual metric distribution.

Page 31: Crash Course in A/B testing

Big Data meets Classic Stat


Wait, our metrics may not be Normal!

We care about the “mean of the metric” and not the actual metric distribution.

Central Limit Theorem: the “mean of the metric” will be Normal if the sample size is LARGE!
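A minimal simulation sketch of the Central Limit Theorem at work, assuming NumPy and a made-up heavy-tailed page-view metric:

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(x):
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**3))

# A heavy-tailed, skewed "page view" metric: nothing like a Normal.
raw = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)

# Repeat the experiment many times and keep only the mean of the metric.
means = np.array([
    rng.lognormal(mean=3.0, sigma=1.5, size=10_000).mean()
    for _ in range(2_000)
])

# The raw metric is wildly skewed; the sample means are nearly symmetric,
# which is the approximate Normality the t-test relies on.
print(f"skewness of the raw metric:   {skewness(raw):.1f}")
print(f"skewness of the sample means: {skewness(means):.2f}")
```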

Page 32: Crash Course in A/B testing

Big Data meets Classic Stat


Assumptions with the t-test:
- Normality of %delta
  - Guaranteed with large sample sizes
- Independent samples
- Not too many 0’s

That’s IT!!!
- Easy to automate.
- Simple and general.

Page 33: Crash Course in A/B testing

What are “Tests”?


• Statistical tests are just procedures that depend on data to make a decision.

• Engineerify: statistical tests are functions that take in data and treatments, and return a boolean.
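A minimal sketch of that “data in, boolean out” view, assuming SciPy; the name ab_test and the choice of Welch's t-test are illustrative, not from the slides:

```python
from scipy import stats

def ab_test(treatment_metrics, control_metrics, alpha=0.05):
    """A statistical test, engineerified: data in, boolean out.

    Returns True when the test says a difference exists at
    significance level alpha.
    """
    p_value = stats.ttest_ind(
        treatment_metrics, control_metrics, equal_var=False
    ).pvalue
    return p_value < alpha

# Usage on the toy data from Page 20: returns False (not enough evidence).
print(ab_test([39, 209, 31, 9], [98, 151]))
```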

Page 34: Crash Course in A/B testing

What are “Tests”?


• Statistical tests are just procedures that depend on data to make a decision.

• Engineerify: statistical tests are functions that take in data and treatments, and return a boolean.

Guarantees:
• By comparing the p-value to a 5% threshold, we control
  P( Test says difference exists | In reality NO difference ) <= 5%

Page 35: Crash Course in A/B testing

What are “Tests”?


• Statistical tests are just procedures that depend on data to make a decision.

• Engineerify: statistical tests are functions that take in data and treatments, and return a boolean.

Guarantees:
• By comparing the p-value to a 5% threshold, we control
  P( Test says difference exists | In reality NO difference ) <= 5%
• By setting the power of the test to be 80%, we control
  P( Test says difference exists | In reality difference exists ) >= 80%

Page 36: Crash Course in A/B testing

What are “Tests”?


• Statistical tests are just procedures that depend on data to make a decision.

• Engineerify: statistical tests are functions that take in data and treatments, and return a boolean.

Guarantees:
• By comparing the p-value to a 5% threshold, we control
  P( Test says difference exists | In reality NO difference ) <= 5%
• By setting the power of the test to be 80%, we control
  P( Test says difference exists | In reality difference exists ) >= 80%
  • Increasing this often requires more data
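A minimal sketch of that power/data tradeoff, assuming the statsmodels package is available; the effect sizes are illustrative numbers:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Samples per group needed to detect a 0.1-standard-deviation effect at
# the conventional 5% significance level with 80% power.
n = analysis.solve_power(effect_size=0.1, alpha=0.05, power=0.8)
print(f"~{n:.0f} samples per group")          # roughly 1,570

# Halving the detectable effect roughly quadruples the required data.
n_small = analysis.solve_power(effect_size=0.05, alpha=0.05, power=0.8)
print(f"~{n_small:.0f} samples per group")    # roughly 6,280
```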

Page 37: Crash Course in A/B testing

Meaning:


Reality:            Difference exists       No difference
All treatments:     Impactful treatments    Useless treatments

Page 38: Crash Course in A/B testing

Meaning:


All treatments, split by reality and by the test's decision:

                               Reality: Difference exists    Reality: No difference
                               (impactful treatments)        (useless treatments)
Test says no difference
Test says difference exists

Page 39: Crash Course in A/B testing

Meaning:


                               Reality: Difference exists    Reality: No difference
                               (impactful treatments)        (useless treatments)
Test says no difference        <20%                          >95%
Test says difference exists    >=80%                         <=5%

Guarantees through conventional thresholds.

Page 40: Crash Course in A/B testing

Meaning:


                               Reality: Difference exists    Reality: No difference
                               (impactful treatments)        (useless treatments)
Test says no difference        <20%                          >95%
Test says difference exists    >=80%                         <=5%

Guarantees through conventional thresholds.

Jargon: the >=80% guarantee is the power; the <=5% guarantee is the significance level.

Page 41: Crash Course in A/B testing

Meaning:


- Most appropriate over repeated decision making
  - E.g. spammer or not

Page 42: Crash Course in A/B testing

Meaning:


- Most appropriate over repeated decision making
  - E.g. spammer or not
- Not seeing a difference could mean
  - There is no difference
  - Not enough power

Page 43: Crash Course in A/B testing

Meaning:


- Most appropriate over repeated decision making
  - E.g. spammer or not
- Not seeing a difference could mean
  - There is no difference
  - Not enough power
- Seeing a difference could mean
  - There is a difference
  - Got lucky/unlucky

Page 44: Crash Course in A/B testing

Meaning:


- Most appropriate over repeated decision making
  - E.g. spammer or not
- Not seeing a difference could mean
  - There is no difference
  - Not enough power
- Seeing a difference could mean
  - There is a difference
  - Got lucky/unlucky
- For your specific test, the treatment is either impactful or not (100% or 0%).

Not what most people want to hear…

Page 45: Crash Course in A/B testing

Complaints with Hypothesis Testing


• People get really stuck on p-values and tests.
  • Confusing, boring, and formulaic.

Page 46: Crash Course in A/B testing

Complaints with Hypothesis Testing


• People get really stuck on p-values and tests.
  • Confusing, boring, and formulaic.

• Statistical significance != scientific significance
  • You could detect a 0.000001 difference, so what?

Page 47: Crash Course in A/B testing

Complaints with Hypothesis Testing


• People get really stuck on p-values and tests.
  • Confusing, boring, and formulaic.

• Statistical significance != scientific significance
  • You could detect a 0.000001 difference, so what?

• Multiple hypothesis testing
  • 5% false positives is 1 out of 20. Quite high!
  • http://xkcd.com/882/
  • Most published research findings are false (Ioannidis 2005)

Page 48: Crash Course in A/B testing

Complaints with Hypothesis Testing


• People get really stuck on p-values and tests.
  • Confusing, boring, and formulaic.

• Statistical significance != scientific significance
  • You could detect a 0.000001 difference, so what?

• Multiple hypothesis testing
  • 5% false positives is 1 out of 20. Quite high!
  • http://xkcd.com/882/
  • Most published research findings are false (Ioannidis 2005)

• What is it answering?
  • Nothing specific about your test… probabilities are over repeated trials.
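A minimal simulation sketch of the multiple-testing problem, assuming NumPy and SciPy: twenty comparisons with no true difference at all, as in the xkcd strip:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Twenty "A/A" tests: treatment and control drawn from the SAME distribution,
# so every significant result is a false positive.
n_tests, n_users = 20, 1_000
p_values = [
    stats.ttest_ind(rng.normal(size=n_users), rng.normal(size=n_users)).pvalue
    for _ in range(n_tests)
]

print(f"significant at 5%: {sum(p < 0.05 for p in p_values)} of {n_tests}")
# With 20 independent tests, P(at least one false positive) is
# 1 - 0.95**20, roughly 64%.
```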

Page 49: Crash Course in A/B testing

Abuse: Prosecutor Fallacy


Both children of a British mother died within a short period of time. The mother was convicted of murder because the p-value was low.

If she was innocent, the chance of both children dying is low:

p-value = P( two deaths | innocent )

Page 50: Crash Course in A/B testing

Abuse: Prosecutor Fallacy


Both children of a British mother died within a short period of time. The mother was convicted of murder because the p-value was low.

If she was innocent, the chance of both children dying is low:

p-value = P( two deaths | innocent )

In fact, we should be looking at P( innocent | two deaths ).

This is the prosecutor’s fallacy.

Page 51: Crash Course in A/B testing

Example:


[Diagram: all mothers split into guilty mothers and innocent mothers; each group contains a subset with two deaths]

Page 52: Crash Course in A/B testing

Example: the base rate matters!


[Diagram: all mothers split into guilty mothers and innocent mothers; each group contains a subset with two deaths]

The p-value can be small, but the base rate (the share of innocent mothers) can be huge.
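A minimal numeric sketch of why the base rate dominates here; all the probabilities below are made up for illustration and are not from the actual case:

```python
# Illustrative numbers only: none of these come from the actual case.
p_guilty = 1e-6                       # prior: double murder is extremely rare
p_two_deaths_given_innocent = 1e-4    # the low "p-value"-like quantity
p_two_deaths_given_guilty = 1.0

# Bayes' rule for P( innocent | two deaths )
p_innocent = 1 - p_guilty
numerator = p_two_deaths_given_innocent * p_innocent
denominator = numerator + p_two_deaths_given_guilty * p_guilty
print(f"P( innocent | two deaths ) ~ {numerator / denominator:.2f}")
# ~0.99: even though P( two deaths | innocent ) is tiny, innocence remains
# far more likely because innocent mothers vastly outnumber guilty ones.
```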

Page 53: Crash Course in A/B testing

Any Alternatives?


P( innocent | two deaths ) is what we want… but does it make sense?

Bayesian methodology:
P( difference exists | data )

This requires knowing P( difference exists ), i.e. the prior:
- Philosophical debate: “What is a probability?”
- Easy to cheat the numbers
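A minimal Bayesian A/B sketch, assuming NumPy; the conversion counts and the Beta(1, 1) priors are made-up choices, and the prior is exactly where the debate above enters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conversion counts for variants A and B.
conv_a, n_a = 120, 1_000
conv_b, n_b = 100, 1_000

# With a Beta(1, 1) prior, the posterior of each conversion rate is
# Beta(1 + conversions, 1 + non-conversions).
samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

# P( A's rate > B's rate | data ): a direct answer of the Bayesian form,
# at the price of having chosen a prior.
print(f"P( A better than B | data ) ~ {(samples_a > samples_b).mean():.2f}")
```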

Page 54: Crash Course in A/B testing

Questions?


- How do we deal with multiple hypothesis testing?
- What are we doing in the company?
- Rumor has it that “multi-armed bandit > A/B testing”?