Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
Ronny Kohavi, Microsoft
Joint work with Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, Ya Xu
Based on KDD 2012 talk, available at http://exp-platform.com
Which Test Won 8/27/2012
Background
This is an extended presentation of the KDD paper presented in Beijing a few weeks ago (KDD = Knowledge Discovery and Data Mining)
At Bing, we ran thousands of experiments
It is not uncommon to see experiments that impact annual revenue by millions of dollars, sometimes tens of millions
Trustworthiness is critical, so surprising results are investigated
We share puzzling results that each took weeks to months to analyze deeply, understand, and explain
Moreover, the issues uncovered in these specific examples surfaced in multiple other experiments, so they are not isolated incidents
Twyman’s Law
Any figure that looks interesting or different is usually wrong
If something is “amazing,” find the flaw! It’s usually there.
Examples
  If you have a mandatory birth date field and people think it’s unnecessary, you’ll find lots of 11/11/11 or 01/01/01
  If you have an optional drop-down, do not default to the first alphabetical entry, or you’ll have lots of jobs = Astronaut
  Traffic to web sites doubled between 1-2AM November 6, 2011 for many sites, relative to the same hour a week prior. Why?
In this talk, we share puzzling results that triggered Twyman’s law for us, so we investigated and found the flaw
Warm-up Experiment: MSN Real Estate
“Find a house” widget variations
Which is best for the OEC (Overall Evaluation Criterion) of revenue to Microsoft, generated every time a user clicks?
[Figure: six widget variations, labeled A through F]
MSN Real Estate
Version C was 8.5% better
Since this is the #1 monetization for MSN Real Estate, it improved revenues significantly
In the “throwdown” (vote for the winning variant), nobody from MSN Real Estate or the company that did the creative voted for the winning widget
This is very common: we are terrible at correctly assessing the value of our own ideas/designs
This is why running controlled experiments is so critical if we want to be data-driven
Controlled Experiments in One Slide
Concept is trivial
  Randomly split traffic between two (or more) versions
    A/Control
    B/Treatment
  Collect metrics of interest
  Analyze
Unless you are testing on one of the largest sites in the world, use 50/50% (high stat power)
[Figure: 100% of users are split 50/50 between Control (existing system) and Treatment (existing system with Feature X); user interactions are instrumented, analyzed, and compared at the end of the experiment]
Must run statistical tests to confirm differences are not due to chance
Best scientific way to prove causality, i.e., the changes in metrics are caused by changes introduced in the treatment(s)
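
As an illustration of the "run statistical tests" step, here is a minimal sketch of a two-sample comparison on a per-user metric. It is not the platform's actual analysis: the function name `welch_z_test`, the simulated data, and the use of a large-sample normal approximation are all assumptions made for this example.

```python
import math
import random
from statistics import mean, variance


def welch_z_test(control, treatment):
    """Return (relative_effect, z_statistic, two_sided_p_value).

    Hypothetical helper: uses a normal approximation, which is only
    reasonable when both samples are large.
    """
    m_c, m_t = mean(control), mean(treatment)
    # Standard error of the difference between the two sample means.
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    z = (m_t - m_c) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return (m_t - m_c) / m_c, z, p_value


# Simulated (made-up) per-user metric values for a 50/50 split.
rng = random.Random(42)
control = [rng.gauss(10.0, 3.0) for _ in range(5000)]
treatment = [rng.gauss(10.3, 3.0) for _ in range(5000)]

effect, z, p = welch_z_test(control, treatment)
print(f"relative effect={effect:+.2%}  z={z:.2f}  p={p:.4f}")
```

A small p-value suggests the observed difference is unlikely to be due to chance alone; comparing a sample against itself gives an effect of zero and a p-value of 1.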
Our Intuition is Poor… Really Poor
Your baby is not as beautiful as you think
Our statistic from thousands of controlled experiments: only 10-30% of experiments move the metrics they were designed to improve
“Google ran approximately 12,000 randomized experiments in 2009, with [only] about 10 percent of these leading to business changes” -- Jim Manzi
“80% of the time you/we are wrong about what a customer wants” -- Avinash Kaushik
“Netflix considers 90% of what they try to be wrong” -- Mike Moran
Puzzle 1: OEC for Search
An OEC is the Overall Evaluation Criterion
It is a metric (or set of metrics) that guides the org as to whether A is better than B in an A/B test
In prior work, we emphasized long-term focus and thinking about customer lifetime value, but operationalizing it is hard
Search engines (Bing, Google) are evaluated on query share (distinct queries) and revenue as long-term goals
Puzzle
  A ranking bug in an experiment resulted in very poor search results
  Distinct queries went up over 10%, and revenue went up over 30%
  What metrics should be in the OEC for a search engine?
Puzzle 1 Explained
Degraded (algorithmic) search results cause users to search more to complete their task, and ads appear more relevant
Analyzing queries per month, we have
  Queries/Month = Queries/Session × Sessions/User × Users/Month
where a session begins with a query and ends with 30 minutes of inactivity. (Ideally, we would look at tasks, not sessions.)
Key observation: we want users to find answers and complete tasks quickly, so queries/session should be smaller
In a controlled experiment, the variants get (approximately) the same number of users by design, so the last term is about equal
The OEC should therefore include the middle term: sessions/user
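
The decomposition can be checked with a few lines of arithmetic. The sketch below uses invented numbers (they are not from the Bing experiment) to show how total queries can rise while sessions/user, the term the OEC should track, falls.

```python
def queries_per_month(queries_per_session, sessions_per_user, users_per_month):
    """Queries/Month = Queries/Session x Sessions/User x Users/Month."""
    return queries_per_session * sessions_per_user * users_per_month


# Hypothetical values: both variants get the same number of users by design.
users = 1_000_000
control = {"queries_per_session": 2.0, "sessions_per_user": 5.0}
degraded = {"queries_per_session": 2.3, "sessions_per_user": 4.9}

base = queries_per_month(**control, users_per_month=users)
broken = queries_per_month(**degraded, users_per_month=users)
lift = (broken - base) / base

print(f"control queries/month:  {base:,.0f}")
print(f"degraded queries/month: {broken:,.0f}")
print(f"query lift: {lift:+.1%}, even though sessions/user dropped")
```

With these made-up inputs, the degraded variant shows more queries purely because users need more queries per session, which is why query volume alone is a misleading OEC.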
Puzzle 2: Click Tracking
A piece of code was added, such that when a user clicked on a search result, additional JavaScript was executed (a session cookie was updated with the destination) before navigating to the destination page
This slowed down the user experience slightly, so we expected a slightly negative experiment.
Results showed that users were clicking more!
Why?
Puzzle 2: Click Tracking - Background
User clicks (and form submits) are instrumented and form the basis for many metrics
Instrumentation is typically done by having the web browser request a web beacon (1x1 pixel image)
Classical tradeoff here
  Waiting for the beacon to return slows the action (typically navigating away)
  Making the call asynchronous is known to cause click-loss, as the browsers can kill the request (classical browser optimization because the result can’t possibly matter for the new page)
Small delays, on-mouse-down, or redirect are used
Puzzle 2: Click Tracking Explained
Click-loss varies dramatically by browser
  Chrome, Firefox, Safari are aggressive at terminating such requests. Safari’s click loss > 50%.
  IE respects image requests for backward compatibility reasons
White paper available on this issue
Other cases where this impacts experiments
  Opening link in new tab/window will overestimate the click delta
  Because the main window remains open, browsers can’t optimize and kill the beacon request, so there is less click-loss
  Using HTML5 to update components of the page instead of refreshing the whole page has the same overestimation problem
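
To see how click-loss alone can produce an apparent lift, consider the sketch below. The loss rates and click counts are invented for illustration; the only idea taken from the slides is that a small delay before navigation lets more beacon requests complete.

```python
def observed_clicks(true_clicks, loss_rate):
    """Clicks that reach the logs after a fraction of beacons is lost."""
    return true_clicks * (1.0 - loss_rate)


# Hypothetical setup: users behave identically in both variants.
true_clicks = 100_000
control_loss = 0.20    # assumed loss rate without the extra JavaScript
treatment_loss = 0.10  # assumed lower loss: the delay lets beacons finish

obs_control = observed_clicks(true_clicks, control_loss)
obs_treatment = observed_clicks(true_clicks, treatment_loss)
delta = (obs_treatment - obs_control) / obs_control

print(f"observed click delta: {delta:+.1%} with zero change in real clicks")
```

Under these assumed rates the logs would show 12.5% more clicks in the treatment, which is an instrumentation artifact and not a change in user behavior.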
Background: Primacy and Novelty Effects
Primacy effect occurs when you change the navigation on a web site
  Experienced users may be less efficient until they get used to the new navigation
  Control has a short-term advantage
Novelty effect happens when a new design is introduced
  Users investigate the new feature, click everywhere, and introduce a “novelty” bias that dies quickly if the feature is not truly useful
  Treatments have a short-term advantage
Puzzle 3: Effects Trend
Given the high failure rate of ideas, new experiments are followed closely to determine if a new idea is a winner
Multiple graphs of effect look like this
  Negative on day 1: -0.55%
  Less negative on day 2: -0.38%
  Less negative on day 3: -0.21%
  Less negative on day 4: -0.13%
The experimenter extrapolates linearly and says: primacy effect. This will be positive in a couple of days, right?
Wrong! This is expected
[Figure: cumulative effect by day, 8/30/2011 through 9/4/2011]
Puzzle 3: Effects Trend
For many metrics, the standard deviation of the mean is proportional to 1/√n, where n is the number of users
As we run an experiment longer, more users are admitted into the experiment, so n grows and the conf interval shrinks
The first days are highly variable
The first day has a 67% chance of falling outside the 95% CI at the end of the experiment
The second day has a 55% chance of falling outside this bound.
[Figure: effect vs. experiment days (0 to 20), showing the 95% bound and the 21-day bound]
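
The 67% and 55% figures can be reproduced with a short calculation. This sketch assumes a 21-day A/A test, the same number of new users each day, normally distributed cumulative estimates, and a final 95% bound centered on the true effect of zero; the function name is hypothetical.

```python
import math


def prob_outside_final_bound(day, total_days=21, z=1.96):
    """Chance that the cumulative estimate after `day` days lies outside
    the 95% bound computed at the end of the experiment.

    After `day` days the estimate has standard deviation sigma/sqrt(day*m),
    while the final bound is z*sigma/sqrt(total_days*m), so the bound sits
    at z*sqrt(day/total_days) standard deviations of the early estimate.
    """
    threshold = z * math.sqrt(day / total_days)
    return math.erfc(threshold / math.sqrt(2))  # two-sided normal tail


for day in (1, 2, 7, 21):
    print(f"day {day:2d}: {prob_outside_final_bound(day):.0%}")
```

Under these assumptions day 1 gives about 67% and day 2 about 55%, and the probability falls to 5% only on the final day, so early trends toward zero are expected noise.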
Puzzle 3: Effects Trend
The longer graph
This was an A/A test, so the true effect is 0
[Figure: cumulative effect over the extended time range]
Puzzle 4: Statistical Power
We expect the standard deviation of the mean (and thus the confidence interval) to be proportional to 1/√n, where n is the number of users
So as the experiment runs longer and more users are admitted, the confidence interval should shrink
But in a graph of the relative confidence interval size for Sessions/User over a month, it is NOT shrinking as expected
Puzzle 4: Statistical Power
The distribution is impacted by these factors
  Users churn, so they contribute zero visits
  New users join with fresh count of one
  We have a mixture
The conf interval of the percent effect is proportional to (std-dev/mean)/√n
Most of the time, std-dev/mean is constant, but for metrics like Sessions/UU, it grows as fast as √n, as the graph shows
Running an experiment longer does not increase statistical power for some metrics
You must increase the variant sizes
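
A small numeric sketch of the proportionality above, ignoring constant factors. The daily user count and the starting std-dev/mean ratio are made-up values, and modeling the ratio as growing exactly like the square root of the day is an assumption used only to mirror the slide's claim.

```python
import math


def relative_ci_half_width(cv, n, z=1.96):
    """Half-width proportional to (std-dev/mean)/sqrt(n), where cv = std-dev/mean."""
    return z * cv / math.sqrt(n)


users_per_day = 10_000  # hypothetical number of new users admitted per day
base_cv = 2.0           # hypothetical std-dev/mean on day 1

for day in (1, 4, 16):
    n = users_per_day * day
    constant_cv = relative_ci_half_width(base_cv, n)
    growing_cv = relative_ci_half_width(base_cv * math.sqrt(day), n)
    print(f"day {day:2d}: constant cv -> {constant_cv:.4f}, "
          f"cv growing like sqrt(n) -> {growing_cv:.4f}")
```

With a constant ratio the interval halves every time the user count quadruples; when the ratio grows as fast as √n the interval stays flat, so only larger variants (more users per day) help.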
Puzzle 5: Carryover Effects
Experiment is run, results are surprising. (This by itself is fine, as our intuition is poor.)
Rerun the experiment, and the effects disappear
Reason: bucket system recycles users, and the prior experiment had carryover effects
These can last for months!
Must run A/A tests, or re-randomize
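
One common way to re-randomize is to hash the user id together with a per-experiment salt. The sketch below is an assumption about how such a bucket system might look, not a description of Bing's implementation; the `bucket` function, the salts, and the user ids are hypothetical.

```python
import hashlib


def bucket(user_id: str, salt: str, num_buckets: int = 100) -> int:
    """Map a user to a bucket; changing the salt reshuffles all users."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


users = [f"user{i}" for i in range(10_000)]

# Users who received the treatment in a prior experiment (buckets 0-49).
prior_treated = {u for u in users if bucket(u, "experiment-A") < 50}

# Reusing the same salt recycles exactly the same users into treatment.
recycled = {u for u in users if bucket(u, "experiment-A") < 50}
# A new salt re-randomizes, spreading prior exposure across both variants.
rehashed = {u for u in users if bucket(u, "experiment-B") < 50}

overlap_recycled = len(prior_treated & recycled) / len(recycled)
overlap_rehashed = len(prior_treated & rehashed) / len(rehashed)
print(f"previously treated share, same salt: {overlap_recycled:.0%}")
print(f"previously treated share, new salt:  {overlap_rehashed:.0%}")
```

With a new salt, roughly half of each variant was previously treated, so any carryover from the earlier experiment is balanced between control and treatment instead of concentrated in one of them.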
Summary
OEC: evaluate long-term goals through short-term metrics
The difference between theory and practice is greater in practice than in theory
  Instrumentation issues (e.g., click-tracking) must be understood
  Carryover effects impact the “bucket systems” used by Bing, Google, and Yahoo, which require rehashing and A/A tests
Experimentation insight:
  Effect trends are expected
  Longer experiments do not increase power for some metrics. Fortunately, we have a lot of users
Other Papers
Multiple papers available at http://exp-platform.com
  Survey and practical guide
  Seven Pitfalls to Avoid when Running Controlled Experiments on the Web
  Online Experimentation at Microsoft
Talks and tutorials at http://www.exp-platform.com/Pages/talks.aspx
Questions?