Crash Course in A/B testing

Posted on 12-Apr-2017


Crash Course in A/B Testing: A Statistical Perspective

Wayne Tai Lee

Roadmap
• What is A/B testing?
• Good experiments and the role of statistics
• Similar to proof by contradiction
• “Tests”
• Big data meets classic asymptotics
• Complaints with classical hypothesis testing
• Alternatives?

What is A/B Testing?
• An industry term for a controlled, randomized experiment comparing treatment and control groups.
• An age-old problem… especially with humans.

What most people know:

[Diagram: gather samples → assign treatments (A or B) → apply treatments → measure outcome → compare]

What most people know:

The only difference between A and B is the treatment!

Reality:

[Diagram: the A/B pipeline again, now with question marks at every stage]
• Variability from samples/inputs
• Variability from treatment/function
• Variability from measurement

How do we account for all that?

• If there are sources of variability beyond the treatment effect, how can we identify and isolate the effect of the treatment?

3 Types of Variability:

• Controlled variability: systematic and desired, i.e. our treatment.
• Bias: systematic but not desired; anything that can confound our study.
• Noise: random error, not desired; it won’t confound the study, but it makes it hard to make a decision.

How do we categorize each source of variability: samples/inputs, treatment/function, measurement?


Reality:

Good instrumentation!

Reality:

Randomize assignment! Convert bias to noise.
Your population can still be skewed or biased… but that only restricts the generalizability of the results.

Reality:

Think about what you want to measure and how! Minimize the noise level/variability in the metric.

A good experiment in general:

- Uses good design and implementation to avoid bias.
- For unavoidable biases, uses randomization to turn them into noise.
- Plans carefully to minimize noise in the data.

How do we deal with noise?

- The bread and butter of statisticians!
- Quantify the magnitude of the treatment effect.
- Quantify the magnitude of the noise.
- Just compare… most of the time.

Formalizing the Comparison

Similar to proof by contradiction:
- Assume the difference is due to chance (noise).
- See how the data contradicts that assumption.
- If the surprise surpasses a threshold, reject the assumption.
- … nothing is “100%”.

Difference due to chance?

ID         PV    Group
Person 1    39   treatment
Person 2   209   treatment
Person 3    31   treatment
Person 4    98   control
Person 5     9   treatment
Person 6   151   control

(The slide marks treatment in red and control in black; the assignment shown here is recovered from the group means given on the next slide.)

Difference due to chance?

Let’s measure the difference in means!

Treatment mean = (39 + 209 + 31 + 9) / 4 = 72
Control mean = (98 + 151) / 2 = 124.5
Diff = 72 − 124.5 = −52.5… so what?

Difference due to chance?

If there were no difference from the treatment, shuffling the treatment status would emulate the randomization of the samples.

Difference due to chance?

One shuffle: treatment = {Persons 2, 3, 4, 6}, control = {Persons 1, 5}.
Diff = 122.25 − 24 = 98.25

Difference due to chance?

Another shuffle: treatment = {Persons 1, 2, 3, 6}, control = {Persons 4, 5}.
Diff = 107.5 − 53.5 = 54

Difference due to chance?

50,000 repeats later… 46.5% of the permutations yielded a difference at least as large in magnitude as our original −52.5. Are you surprised by the initial result?

“Tests”

Congratulations!
- You just learned the permutation test!
- The 46.5% is the p-value under the permutation test.

Problems:
- Permuting the labels can be computationally costly. Not possible before computers!
- Statistical theory says there are many other tests out there.
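For the curious, here is a minimal sketch of the permutation test on the slides’ toy numbers (NumPy assumed; the treatment/control split is the one recovered from the group means, and the exact p-value will wobble around the slides’ 46.5% depending on the random seed):

```python
import numpy as np

# Page views from the slides, in person order (1-6).
pv = np.array([39, 209, 31, 98, 9, 151])
# 1 = treatment (red), 0 = control (black); assignment recovered from the
# slides' group means (288 / 4 = 72 and 249 / 2 = 124.5).
labels = np.array([1, 1, 1, 0, 1, 0])

observed = pv[labels == 1].mean() - pv[labels == 0].mean()  # -52.5

rng = np.random.default_rng(0)
n_reps = 50_000
hits = 0
for _ in range(n_reps):
    shuffled = rng.permutation(labels)  # shuffle the treatment status
    diff = pv[shuffled == 1].mean() - pv[shuffled == 0].mean()
    if abs(diff) >= abs(observed):  # at least as extreme, in magnitude
        hits += 1

print(f"observed diff = {observed}")                  # -52.5
print(f"permutation p-value = {hits / n_reps:.3f}")   # close to the slides' 46.5%
```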

“Tests”

Standard t-test:
1) Calculate the delta: Δ = mean_treatment − mean_control.
2) Assume Δ follows a Normal distribution, then calculate the p-value.
3) If the p-value < 0.05, reject the assumption that there is no difference between treatment and control.

[Figure: Normal curve centered at 0; the p-value is the sum of the red tail areas.]
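As a sketch, here is the same comparison with SciPy’s ttest_ind (Welch’s variant); note that with only six observations the Normality assumption in step 2 is shaky, so this is illustration only:

```python
from scipy import stats

treatment = [39, 209, 31, 9]  # red group from the earlier slides
control = [98, 151]           # black group

# Welch's t-test: two-sided p-value for mean_treatment - mean_control
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(t_stat, p_value)  # p well above 0.05 here: do not reject
```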

Big Data Meets Classic Stats

Wait, our metrics may not be Normal! But we care about the “mean of the metric”, not the actual metric distribution.

Central Limit Theorem: the “mean of the metric” will be approximately Normal if the sample size is LARGE!

Assumptions of the t-test:
- Normality of the %delta: guaranteed with large sample sizes.
- Independent samples.
- Not too many 0’s.

That’s IT!
- Easy to automate.
- Simple and general.
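A quick simulation can make the CLT claim concrete. The metric below is an assumed lognormal (nothing in the slides specifies a distribution): the raw values are badly skewed, yet the sample means cluster in a tight, bell-shaped pile:

```python
import numpy as np

rng = np.random.default_rng(0)

# An assumed, heavily skewed metric (lognormal "page views"): the raw
# distribution looks nothing like a Normal, but the distribution of the
# sample MEAN is approximately Normal once each sample is large.
sample_means = [rng.lognormal(mean=3.0, sigma=1.0, size=5_000).mean()
                for _ in range(2_000)]
print(np.mean(sample_means), np.std(sample_means))  # tight and bell-shaped
```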

What are “Tests”?

• Statistical tests are just procedures that depend on data to make a decision.
• Engineerify: statistical tests are functions that take in data and treatment assignments, and return a boolean.

Guarantees:
• By comparing the p-value to a 5% threshold, we control
  P( test says a difference exists | in reality NO difference ) <= 5%
• By setting the power of the test to 80%, we control
  P( test says a difference exists | in reality a difference exists ) >= 80%
  Increasing the power often requires more data.
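In that engineer’s spirit, a hedged sketch of a test as a data-in, boolean-out function (`ab_test` is a hypothetical name, and Welch’s t-test stands in for whatever test you actually run):

```python
from scipy import stats

def ab_test(treatment, control, alpha=0.05):
    """Data in, boolean out: True means the test says a difference exists."""
    # Welch's t-test stands in for whatever test is actually used.
    _, p_value = stats.ttest_ind(treatment, control, equal_var=False)
    return bool(p_value < alpha)

print(ab_test([39, 209, 31, 9], [98, 151]))  # False: no detectable difference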

Meaning:

                                  Test says:         Test says:
                                  no difference      difference exists
Impactful treatments
(reality: difference exists)      <= 20%             >= 80%  (jargon: power)
Useless treatments
(reality: no difference)          >= 95%             <= 5%   (jargon: significance level)

These guarantees come through the conventional thresholds.
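To see why increasing power “often requires more data”, one can invert the table’s thresholds into a sample-size calculation, for example with statsmodels (the effect size of 0.2 is an assumed “small” effect, not a number from the slides):

```python
from statsmodels.stats.power import TTestIndPower

# Samples needed per group to detect an assumed "small" effect
# (Cohen's d = 0.2) at 5% significance with 80% power.
n_per_group = TTestIndPower().solve_power(effect_size=0.2,
                                          alpha=0.05, power=0.8)
print(n_per_group)  # about 393.4, i.e. ~394 users per group
```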

Meaning:

- These guarantees are most appropriate over repeated decision making (e.g., deciding spammer or not).
- Not seeing a difference could mean: there is no difference, or there is not enough power.
- Seeing a difference could mean: there is a difference, or we got lucky/unlucky.
- Your specific test is either impactful or not (100% or 0%). Not what most people want to hear…

Complaints with Hypothesis Testing

• People get really stuck on p-values and tests. Confusing, boring, and formulaic.
• Statistical significance != scientific significance. You could detect a 0.000001 difference; so what?
• Multiple hypothesis testing: a 5% false-positive rate is 1 in 20. Quite high! (http://xkcd.com/882/) Most published results are still false (Ioannidis 2005).
• What is it answering? Nothing specific about your particular test… the probabilities are over repeated trials.
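The multiple-testing complaint is easy to quantify: under the conventional 5% threshold, the chance of at least one false positive across 20 independent null tests (the xkcd jelly-bean scenario) is already about 64%:

```python
# The xkcd jelly-bean scenario: 20 independent tests at the 5% level.
alpha, n_tests = 0.05, 20
print(1 - (1 - alpha) ** n_tests)  # ~0.64: at least one false positive is likely
```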

Abuse: The Prosecutor’s Fallacy

Both children of a British mother died within a short period of time. The mother was convicted of murder because the p-value was low: if she was innocent, the chance of both children dying is low.

p-value = P( two deaths | innocent )

In fact, we should be looking at P( innocent | two deaths ). This is the prosecutor’s fallacy.

Example: the base rate matters!

[Diagram: among all mothers, innocent mothers vastly outnumber guilty mothers; “two deaths” occurs in both groups.]

The p-value P( two deaths | innocent ) can be small, but the base rate of innocent mothers is huge, so innocent mothers with two deaths can still outnumber guilty ones.
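A toy Bayes calculation, with loudly made-up numbers (none are from the slides or the actual case), shows how the huge base rate of innocence flips the conclusion:

```python
# Made-up numbers, chosen only to show the base-rate effect via Bayes' rule.
p_innocent = 0.999999                 # almost every mother is innocent
p_guilty = 1 - p_innocent
p_two_deaths_given_innocent = 1e-5    # rare: the "small p-value"
p_two_deaths_given_guilty = 0.5

p_two_deaths = (p_two_deaths_given_innocent * p_innocent
                + p_two_deaths_given_guilty * p_guilty)
print(p_two_deaths_given_innocent * p_innocent / p_two_deaths)
# ~0.95: even with a tiny p-value, innocence is still far more likely
```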

Any Alternatives?

P( innocent | two deaths ) is what we want… but does it make sense?

Bayesian methodology: P( difference exists | data )

This requires knowing P( difference exists ), i.e. the prior.
- Philosophical debate: “What is a probability?”
- Easy to cheat the numbers.

Questions?

- How do we deal with multiple hypothesis testing?
- What are we doing at the company?
- Rumor has it that “multi-armed bandit > A/B testing”?