Chris Stucchio - Data science - Conversion Hotel 2015
BAYESIAN STATISTICS
Chris Stucchio @stucchio
https://chrisstucchio.com | https://github.com/stucchio
09 Oct 2015
INGREDIENTS:
A Null Hypothesis: Version A and B have the same conversion rate.
An Alternative Hypothesis: Version B’s conversion rate is 5% or more higher than A’s.
A Test Statistic: Which we expect to be close to 0 if the null hypothesis is true and far from 0 if it is false. For example:
T = (CONVERSIONS A / VISITORS A) - (CONVERSIONS B / VISITORS B)
How Frequentist A/B Tests Work
TWO PIECES OF MATH
T = (CONVERSIONS A / VISITORS A) - (CONVERSIONS B / VISITORS B)
• If N is at least a certain size, then the probability of T exceeding a certain cutoff is less than 0.05 (the significance cutoff) assuming the null hypothesis is true
• If N is at least a certain size, then the probability of T being smaller than a certain cutoff is less than 0.20 (the power cutoff) assuming the alternative hypothesis is true
EXAMPLE
Suppose the control conversion rate is 5%, and we are seeking a 20% lift in an experiment.
• If we have at least 7,600 samples per variation, then there is a 5% chance of a false positive assuming both variations are equal.
• There is also a 20% chance of a false negative assuming B has at least a 6% conversion rate.
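A sample size of this order can be approximated with the textbook normal-approximation formula for comparing two proportions. The sketch below is an assumption about how such a number might be derived, not the calculation behind the slide: for a two-sided 5% test with 80% power it lands in the same ballpark (around 8,000 per variation) rather than reproducing the 7,600 figure exactly. The function name `samples_per_variation` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def samples_per_variation(p_control, relative_lift, alpha=0.05, power=0.80):
    """Approximate per-variation sample size for a two-sided two-proportion z-test."""
    p_variant = p_control * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance cutoff
    z_power = NormalDist().inv_cdf(power)          # power cutoff
    # Sum of the per-visitor variances of the two conversion rates.
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Control converts at 5%, and we want to detect a 20% relative lift (5% -> 6%).
n = samples_per_variation(p_control=0.05, relative_lift=0.20)
print(n)
```

Smaller lifts shrink `effect`, and because it is squared in the denominator, halving the lift roughly quadruples the required sample.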
P-VALUE: A probability of a false positive “at least as extreme” as the result you just saw in a hypothetical A/A test.
SIGNIFICANCE LEVEL (= 100% - P-VALUE): A probability of NOT seeing a false positive at least as extreme.
These numbers are highly dependent on your null and alternative hypothesis, so you have to choose them carefully.
(Many vendors, including VWO until recently, incorrectly referred to the significance level as “Chance to Beat Control”.)
You've run an A/B test. Your A/B testing software has given you a p-value of p=0.03 for a one-tailed test.
WHICH OF THE FOLLOWING IS TRUE?
(Note that several or none of the statements may be correct.)
• You have disproved the null hypothesis (that is, there is no difference between the variations).
• The probability of the null hypothesis being true is 0.03.
• You have proved your experimental hypothesis (that the variation is better than the control).
• The probability of the variation being better than control is 97%.
• You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision is 3%.
• You have a reliable experimental finding in the sense that if the experiment were repeated a great number of times, you would obtain a significant result on 97% of occasions.
ALL ARE FALSE. BUT TRY TELLING THAT TO CUSTOMERS.
Study shows 100% of psychology graduates and 80% of professors get that question wrong.
Source: “Misinterpretations of Significance: A Problem Students Share with Their Teachers?”
A PRACTICAL QUIZ
An A/B test is run, and it is observed that B has a higher mean than A with a p-value of 4%. What is the probability that B is really better than A?
‣ 96%
‣ 95%
‣ 80%
‣ Cannot be determined from the information given
So how do we compute this probability?
OUR FIRST BAYESIAN CALCULATION
Make unrealistic assumptions to simplify the math. (This is a pedagogical exercise.)
ASSUME THERE ARE ONLY TWO POSSIBILITIES IN THE WORLD
• Null Hypothesis (Control and Variation are Equal)
• Alternate Hypothesis (Variation beats control by at least 20%)
WE WILL ASSUME EACH OF THESE OCCURS WITH FIXED PROBABILITY
CONSIDER A SPHERICAL COW (Physics phrase describing calculations that illustrate the point, but are ridiculously oversimplified.)
Need to know the base rate - the fraction of A/B tests which actually have a true result.
Suppose base rate is 5% - i.e., 95% of ideas suck.
This means exactly 5% of tests have a variation which is 20% better than control, and 95% of tests have a variation identical to control.
Consider 1000 A/B tests:
               TEST SAYS WIN     TEST SAYS LOSE    TOTAL
REAL WINNER    40 (80% of 50)    10                50
REAL LOSER     47 (5% of 950)    903               950
TOTAL          87                913               1000
PROBABILITY OF REAL WINNER: 40 / 87 = 46%
Suppose base rate is 30% - i.e., 70% of ideas suck.
Consider 1000 A/B tests:
               TEST SAYS WIN    TEST SAYS LOSE    TOTAL
REAL WINNER    240              60                300
REAL LOSER     35               665               700
TOTAL          275              725               1000
PROBABILITY OF REAL WINNER: 240 / 275 = 87%
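The two tables are the same Bayes' rule calculation run with different base rates. A minimal sketch, assuming the 80% power and 5% false positive rate used in the tables (the function name `probability_real_winner` is hypothetical):

```python
def probability_real_winner(base_rate, power=0.80, alpha=0.05):
    """P(variation is truly better | test says win), via Bayes' rule."""
    true_positives = base_rate * power          # real winners the test detects
    false_positives = (1 - base_rate) * alpha   # real losers the test flags anyway
    return true_positives / (true_positives + false_positives)


for base_rate in (0.05, 0.30):
    print(f"base rate {base_rate:.0%}: {probability_real_winner(base_rate):.0%}")
# base rate 5%: 46%
# base rate 30%: 87%
```

The same p-value cutoff gives very different answers depending on the base rate, which is why the quiz question cannot be answered from the p-value alone.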
THE PRIOR: Our opinion before we have any data.
“When events change, I change my mind. What do you do?” - PAUL SAMUELSON
THE POSTERIOR: We’ve changed our opinion after seeing the data.
BAYESIAN STATISTICS
‣ Come up with a subjective Prior opinion
‣ Gather evidence
‣ Change your opinion and form a Posterior
BAYES RULE
The mathematically optimal way to change your opinion
IMPROVING THE ACCURACY OF OUR MODEL
Unrealistic Assumptions
• Only possible conversion rates are 5% and 6% - why not 4.3% or 5.5%?
• Ignores cost/benefit. If B has a 3% conversion rate, choosing it is very bad. If B has a 4.99% conversion rate, choosing it is almost harmless.
• Results in previous test based on looking at results only once, then making decision. Our users check test results every day.
THERE ARE MORE THAN TWO POSSIBLE CONVERSION RATES
It’s not realistic to assume that conversion rates are either 5% or 6%.
This is just not a useful picture of reality:
Conversion rate can be 4%, 5%, 5.34%, 6.21%, or any other value between 0 and 100%. Represent with continuous functions.
CREDIBLE INTERVALS: 99% probability that true conversion rate is at least 16.9% and not more than 23.3%.
THE PRIOR: We generally think conversion rates are low.
ONE VISITOR, ONE CLICK: Our opinion updates, and higher conversion rates are more likely.
6 VISITORS, 1 CLICK: We update our opinion, that first click was probably a fluke.
22 VISITORS, 1 CLICK: We update our opinion, that first click was probably a fluke.
207 VISITORS, 4 CLICKS: We are confident the CR is approximately 2%.
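The sequence of updates above can be sketched numerically by multiplying a prior by the likelihood of the observed clicks on a grid of candidate conversion rates. The prior shape used here, proportional to (1 - p)^20, is an assumption chosen only because it favours low conversion rates; the slides do not state which prior they use. The function names are hypothetical.

```python
def posterior_on_grid(visitors, clicks, prior_weight=20, points=2001):
    """Return (grid, probabilities) for the conversion rate after seeing the data.

    Assumed prior: density proportional to (1 - p) ** prior_weight, which
    puts most of its mass on low conversion rates.
    """
    grid = [i / (points - 1) for i in range(points)]
    weights = [
        (1 - p) ** prior_weight * p ** clicks * (1 - p) ** (visitors - clicks)
        for p in grid
    ]
    total = sum(weights)
    return grid, [w / total for w in weights]


def summarize(grid, probs, mass=0.99):
    """Posterior mean and a central credible interval holding `mass` probability."""
    mean = sum(p * w for p, w in zip(grid, probs))
    tail = (1 - mass) / 2
    cumulative, low, high = 0.0, None, None
    for p, w in zip(grid, probs):
        cumulative += w
        if low is None and cumulative >= tail:
            low = p
        if high is None and cumulative >= 1 - tail:
            high = p
    return mean, low, high


# The same visitor/click counts as the sequence above, starting from no data.
for visitors, clicks in [(0, 0), (1, 1), (6, 1), (22, 1), (207, 4)]:
    grid, probs = posterior_on_grid(visitors, clicks)
    mean, low, high = summarize(grid, probs)
    print(f"{visitors:>3} visitors, {clicks} clicks: "
          f"mean {mean:.3f}, 99% interval [{low:.3f}, {high:.3f}]")
```

With 207 visitors and 4 clicks the posterior mean settles near 2% and the interval narrows, matching the progression described above.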
PRIORS ARE SUBJECTIVE: Bayesian Analysis starts by “pulling a prior out of your posterior”.
POSTERIORS CONVERGE: Theorem (stylized): Rational Bayesians never “agree to disagree” when sufficient data is available.
JOINT POSTERIORS - REPRESENTING ALL VARIATIONS
So far we only form opinions about conversion rate of one variation.
Need to represent probability of things like “conversion rate of A is 4.5% and conversion rate of B is 6.3%”.
SOLUTION IS CALLED JOINT POSTERIOR
TWO POSTERIORS ON TWO DIFFERENT CONVERSION RATES COMBINE TO FORM JOINT POSTERIOR
Point (0.10, 0.15) represents “A has a conversion rate of 10%, B has a conversion rate of 15%”.
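One way to work with a joint posterior is to sample from it: each draw is a point (CR A, CR B), and any question about both variations becomes a count over the draws. The sketch below assumes independent uniform Beta(1, 1) priors and uses made-up visitor counts; neither comes from the slides, and `sample_joint_posterior` is a hypothetical name.

```python
import random


def sample_joint_posterior(visitors_a, conversions_a, visitors_b, conversions_b,
                           draws=20_000, seed=0):
    """Draw (CR A, CR B) pairs from the joint posterior.

    Assumes independent uniform Beta(1, 1) priors, so each marginal
    posterior is Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    return [
        (rng.betavariate(1 + conversions_a, 1 + visitors_a - conversions_a),
         rng.betavariate(1 + conversions_b, 1 + visitors_b - conversions_b))
        for _ in range(draws)
    ]


# Hypothetical data, chosen only for illustration: A converts 100 of 1000, B 150 of 1000.
samples = sample_joint_posterior(1000, 100, 1000, 150)
chance_b_beats_a = sum(b > a for a, b in samples) / len(samples)
print(f"P(CR B > CR A) is approximately {chance_b_beats_a:.3f}")
```

The fraction of draws with B above A estimates the probability that B is really better than A, which is the quantity the earlier quiz asked about.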
Opinions About the World
• Start with an uneducated opinion, the prior.
• Gather data.
• Change your opinion and end up educated with a posterior.
MAKING DECISIONS
Maximize Revenue, don’t Test for Truth
Hypothesis Testing
• Designed by and for scientists.
• Question: “Do jellybeans cause acne?”
• Run A/B test, give B group jellybeans. Measure amount of acne in both groups.
• If p < 0.05, publish paper in good journal - “Jellybeans cause acne.”
• If p >= 0.05, publish paper in bad journal - “Jellybeans don’t cause acne, but we did a good experiment to check.”
Goal of hypothesis testing is to avoid publishing false results.
Think Like a Trader
SCIENTISTS look for interesting phenomena, and publish papers when they find them.
TRADERS buy and sell stocks with the goal of making money.
• CRO is more like trading - the goal is to get more conversions = $.
• If A == B, thinking A > B is harmless; instead of getting a 5% conversion rate with B, you are stuck with a 5% conversion rate with A. Money lost: $0.
• If the CR of A is 4.9% and B is 5%, a wrong decision costs only 0.1%. If CR of A is 4%, a wrong decision costs 10x more!
ASYMMETRIC COSTS AND FALSE POSITIVES
            B > A (50% CHANCE)    B = A (50% CHANCE)
DEPLOY A    Lose                  Even
DEPLOY B    Win                   Even
Smart decision: Deploy B.
Heads you win, tails you don’t lose.
Cost of a Mistake
Suppose we choose variation x. The cost of this choice is:
Loss[x] = Max over all variations i of (CR[i] - CR[x])
This is simple opportunity cost - it’s the difference between the best choice and our choice.
Key point: bigger mistakes cost us more money.
EXAMPLE
True conversion rates: A. 5%   B. 6%   C. 4.5%
‣ Loss[A] = Max(5% - 5%, 6% - 5%, 4.5% - 5%) = 1%
‣ Loss[B] = Max(5% - 6%, 6% - 6%, 4.5% - 6%) = 0%
‣ Loss[C] = Max(5% - 4.5%, 6% - 4.5%, 4.5% - 4.5%) = 1.5%
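The loss in this example is a one-line calculation once the true conversion rates are known. A minimal sketch, with rates expressed in percent and a hypothetical `loss` helper:

```python
def loss(choice, conversion_rates):
    """Opportunity cost of deploying `choice`: best rate minus chosen rate."""
    return max(rate - conversion_rates[choice] for rate in conversion_rates.values())


# Conversion rates from the example, in percent.
rates = {"A": 5.0, "B": 6.0, "C": 4.5}
for name in rates:
    print(f"Loss[{name}] = {loss(name, rates)}%")
# Loss[A] = 1.0%
# Loss[B] = 0.0%
# Loss[C] = 1.5%
```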
Expected Loss
BEFORE HAVING ANY DATA: Only problem - we don’t know true conversion rate. So we compute expected value.
Loss from choosing A (probability of each cell is 1/9):
             CR A = 4%    CR A = 5%    CR A = 6%
CR B = 4%    0%           0%           0%
CR B = 5%    1%           0%           0%
CR B = 6%    2%           1%           0%
EXPECTED LOSS FOR A IS = (1/9) 1% + (1/9) 2% + (1/9) 1% = 0.44%
Loss from choosing B (probability of each cell is 1/9):
             CR A = 4%    CR A = 5%    CR A = 6%
CR B = 4%    0%           1%           2%
CR B = 5%    0%           0%           1%
CR B = 6%    0%           0%           0%
EXPECTED LOSS FOR B IS = (1/9) 1% + (1/9) 2% + (1/9) 1% = 0.44%
No decision: expected loss > threshold of caring = 0.01%
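The expected loss before any data is just the probability-weighted sum over the nine cells. A small sketch of that arithmetic, using exact fractions and the uniform 1/9 weighting stated above (the names `prior` and `expected_loss` are hypothetical):

```python
from fractions import Fraction
from itertools import product

rates = [4, 5, 6]  # possible conversion rates, in percent
# Before any data, every (CR A, CR B) cell is assumed equally likely.
prior = {cell: Fraction(1, 9) for cell in product(rates, rates)}


def expected_loss(choice, belief):
    """Probability-weighted loss of deploying `choice` ("A" or "B")."""
    total = Fraction(0)
    for (cr_a, cr_b), probability in belief.items():
        chosen = cr_a if choice == "A" else cr_b
        total += probability * (max(cr_a, cr_b) - chosen)
    return total


print(float(expected_loss("A", prior)))  # 4/9, about 0.44 (percent)
print(float(expected_loss("B", prior)))  # 4/9, about 0.44 (percent)
```

Passing a different `belief` dictionary, with probabilities concentrated on the cells the data still allows, gives the expected loss after gathering data.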
AFTER GATHERING DATA, WE RULE OUT SOME POSSIBILITIES:
(All black cells have probability ¼, grey cells have probability 0. WILD OVERSIMPLIFICATION.)
Loss from choosing A:
             CR A = 4%    CR A = 5%    CR A = 6%
CR B = 4%    0%           0%           0%
CR B = 5%    1%           0%           0%
CR B = 6%    2%           1%           0%
EXPECTED LOSS FOR A IS = (1/4) 1% + (1/4) 2% + (1/4) 1% = 1%
Loss from choosing B, after gathering data:
             CR A = 4%    CR A = 5%    CR A = 6%
CR B = 4%    0%           1%           2%
CR B = 5%    0%           0%           1%
CR B = 6%    0%           0%           0%
EXPECTED LOSS FOR B IS = 0% < 0.01%
Smart Decision
How to run a Bayesian A/B test
• Identify a threshold of caring - a value so small that if your conversion rate drops by less than this, you don’t care.
• Example: I sell $10,000 of product/week on a 2% conversion rate. A 0.05% threshold of caring corresponds to a $250/week change in revenue.
• Run A/B test.
• Periodically (not more than once a week!) compute the expected loss for each variation. If the expected loss for some variation drops below the threshold of caring, deploy that variation.
NOT NECESSARILY A WINNER, BUT IT WON’T LOSE.
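The procedure above can be sketched as a Monte Carlo estimate of each variation's expected loss, compared against the threshold of caring. This is an illustration under stated assumptions, not the implementation behind the slides: it assumes independent uniform Beta(1, 1) priors, the visitor counts are made up, and `expected_losses` and `decide` are hypothetical names.

```python
import random


def expected_losses(data, draws=20_000, seed=0):
    """Monte Carlo estimate of each variation's expected loss.

    `data` maps a variation name to (visitors, conversions). Assumes
    independent uniform Beta(1, 1) priors on each conversion rate.
    """
    rng = random.Random(seed)
    totals = {name: 0.0 for name in data}
    for _ in range(draws):
        # One draw from the joint posterior over all variations.
        sampled = {
            name: rng.betavariate(1 + conv, 1 + visitors - conv)
            for name, (visitors, conv) in data.items()
        }
        best = max(sampled.values())
        for name, rate in sampled.items():
            totals[name] += best - rate  # opportunity cost in this draw
    return {name: total / draws for name, total in totals.items()}


def decide(data, threshold_of_caring):
    """Return the variation to deploy, or None if the test should keep running."""
    losses = expected_losses(data)
    name, smallest = min(losses.items(), key=lambda item: item[1])
    return name if smallest < threshold_of_caring else None


# Hypothetical counts; 0.0005 is the 0.05% threshold of caring from the example.
print(decide({"A": (10_000, 200), "B": (10_000, 260)}, threshold_of_caring=0.0005))
```

Extra considerations such as a cost of switching would go into the per-draw loss, for instance by adding a fixed penalty to every variation other than the current control.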
Advantages
• Bayesian tests are insensitive to peeking - it’s fine to stop a test early.
• “Chance to beat control” is really the chance that a variation is better than the control.
• Get additional numbers, e.g. chance to beat all - what is the probability that B is better than A, C and D?
• Credible intervals bound uncertainty - when a winner is deployed, you’ll be told “variation B is between 0.01% and 25% better than A”. (Confidence intervals do NOT provide this information.)
• Easy to understand and extend. Is there a cost of switching? Want to account for other factors? Just include it in the loss function. (Question asked by Denis @ booking.com, and in Bayesian framework answer was obvious.)
MORE ACCURATE CALCULATIONS: Central Limit Theorem with 10,000 data points
MORE ACCURATE CALCULATIONS: Central Limit Theorem with 100 data points
WHY THE WORLD DIDN’T GO BAYESIAN SOONER: Bayesian calculations are 10 million times slower than frequentist - Charles Pickering and his computers couldn’t handle it.