    Experiments at Airbnb

Airbnb [1] is an online two-sided marketplace that matches people who rent out their homes (hosts) with people who are looking for a place to stay (guests). We use controlled experiments to learn and make decisions at every step of product development, from design to algorithms. They are equally important in shaping the user experience.

While the basic principles behind controlled experiments are relatively straightforward, using experiments in a complex online ecosystem like Airbnb during fast-paced product development can lead to a number of common pitfalls. Some, like stopping an experiment too soon, are relevant to most experiments. Others, like the issue of introducing bias on a marketplace level, start becoming relevant for a more specialized application like Airbnb. We hope that by sharing the pitfalls we've experienced and learned to avoid, we can help you to design and conduct better, more reliable experiments for your own application.

Why experiments?

Experiments provide a clean and simple way to make causal inference. It's often surprisingly hard to tell the impact of something you do by simply doing it and seeing what happens, as illustrated in Figure 1.


Figure 1: It's hard to tell the effect of this product launch.

The outside world often has a much larger effect on metrics than product changes do. Users can behave very differently depending on the day of week, the time of year, the weather (especially in the case of a travel company like Airbnb), or whether they learned about the website through an online ad or found the site organically. Controlled experiments isolate the impact of the product change while controlling for the aforementioned external factors. In Figure 2, you can see an example of a new feature that we tested and rejected this way. We thought of a new way to select what prices you want to see on the search page, but users ended up engaging less with it than the old filter, so we did not launch it.


Figure 2: Example of a new feature that we tested and rejected.

When you test a single change like this, the methodology is often called A/B testing or split testing. This post will not go into the basics of how to run a basic A/B test. There are a number of companies that provide out-of-the-box solutions to run basic A/B tests, and a couple of bigger tech companies have open sourced their internal systems for others to use. See Cloudera's Gertrude [5], Etsy's Feature [6], and Facebook's PlanOut [7], for example.

    The case of Airbnb

At Airbnb we have built our own A/B testing framework to run experiments, which you will be able to read more about in our upcoming blog post on the details of its implementation. There are a couple of features of our business that make experimentation more involved than a regular change of a button color, and that's why we decided to create our own testing framework.

First, users can browse when not logged in or signed up, making it more difficult to tie a user to actions. People often switch devices (between web and mobile) in the midst of booking. Also, given that bookings can take a few days to confirm, we need to wait for those results. Finally, successful bookings are often dependent on available inventory and the responsiveness of hosts, factors that are out of our control.

Our booking flow is also complex. First, a visitor has to make a search. The next step is for a searcher to actually contact a host about a listing. Then the host has to accept the inquiry, and finally the guest has to actually book the place. In addition, we have multiple flows that can lead to a booking: a guest can instantly book some listings without a contact, and can also make a booking request that goes straight to booking. This four-step flow is visualized in Figure 3. We look at the process of going through these four stages, but the overall conversion rate between searching and booking is our main metric.


Figure 3: Example of an experiment result broken down by booking flow steps.
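The step-by-step breakdown shown in Figure 3 amounts to computing a conversion rate for each funnel stage plus the overall search-to-book rate for each group. A minimal sketch of that computation, using made-up counts rather than real Airbnb data, might look like this:

```python
# Minimal sketch: comparing a four-step booking funnel (search -> contact ->
# accept -> book) between control and treatment. All counts are made up.

funnel_steps = ["search", "contact", "accept", "book"]

# Hypothetical visitor counts reaching each step in each group.
counts = {
    "control":   {"search": 100_000, "contact": 9_000, "accept": 5_400, "book": 3_000},
    "treatment": {"search": 100_000, "contact": 9_300, "accept": 5_500, "book": 3_050},
}

for group, c in counts.items():
    print(f"--- {group} ---")
    for prev, step in zip(funnel_steps, funnel_steps[1:]):
        print(f"{prev} -> {step}: {c[step] / c[prev]:.1%}")
    # The overall search-to-book conversion is the headline metric.
    print(f"overall search -> book: {c['book'] / c['search']:.2%}")
```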

    How long do you need to run an experiment?

A very common source of confusion in online controlled experiments is how much time you need to make a conclusion about the results of an experiment. The problem with the naive method of using the p-value as a stopping criterion is that the statistical test that gives you a p-value assumes that you designed the experiment with a sample size and effect size in mind. If you continuously monitor the development of a test and the resulting p-value, you are very likely to see an effect, even if there is none. Another common error is to stop an experiment too early, before an effect becomes visible.

Here is an example of an actual experiment we ran. We tested changing the maximum value of the price filter on the search page from $300 to $1000, as displayed below.


Figure 4: Example experiment testing the value of the price filter.

In Figure 5 we show the development of the experiment over time. The top graph shows the treatment effect (Treatment / Control - 1) and the bottom graph shows the p-value over time. As you can see, the p-value curve hits the commonly used significance level of 0.05 after 7 days, at which point the effect size is 4%. If we had stopped there, we would have concluded that the treatment had a strong and significant effect on the likelihood of booking. But we kept the experiment running and found that, actually, the experiment ended up neutral. The final effect size was practically null, with the p-value indicating that whatever the remaining effect size was, it should be regarded as noise.


Figure 5: Result of the price filter experiment over time.

Why did we know not to stop when the p-value hit 0.05? It turns out that this pattern of hitting significance early and then converging back to a neutral result is actually quite common in our system. There are various reasons for this. Users often take a long time to book, so the early converters have a disproportionately large influence at the beginning of the experiment. Also, even small sample sizes in online experiments are massive on the scale of the classical statistics in which these methods were developed. Since the statistical test is a function of the sample and effect sizes, if an early effect size is large through natural variation, it is likely for the p-value to be below 0.05 early. But the most important reason is that you are performing a statistical test every time you compute a p-value, and the more you do it, the more likely you are to find an effect.
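To make the peeking problem concrete, here is a small simulation (not our production code; all parameters are made up) that runs many A/A experiments with no real effect and checks the p-value after every simulated day. A large fraction of runs cross p < 0.05 at some peek even though nothing changed, while checking only once at the fixed end stays near the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_experiments = 2000      # simulated A/A experiments (no real effect)
days = 30                 # how long each experiment runs
users_per_day = 500       # new users per group per day
p_base = 0.05             # true conversion rate in both groups

peeked_significant = 0    # runs that hit p < 0.05 at *any* daily peek
final_significant = 0     # runs significant only at the pre-set end

for _ in range(n_experiments):
    conv_c = conv_t = n = 0
    hit_early = False
    for _ in range(days):
        n += users_per_day
        conv_c += rng.binomial(users_per_day, p_base)
        conv_t += rng.binomial(users_per_day, p_base)
        # Two-proportion test via a 2x2 contingency table.
        table = [[conv_c, n - conv_c], [conv_t, n - conv_t]]
        _, p, _, _ = stats.chi2_contingency(table, correction=False)
        if p < 0.05:
            hit_early = True
    peeked_significant += hit_early
    final_significant += p < 0.05   # p from the last day only

print(f"significant at some daily peek: {peeked_significant / n_experiments:.1%}")
print(f"significant at the fixed end:   {final_significant / n_experiments:.1%}")
```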

As a side note, people familiar with our website might notice that, at the time of writing, we did in fact launch the increased max price filter, even though the result was neutral. We found that certain users like the ability to search for high-end places and decided to accommodate them, given there was no dip in the metrics.

How long should experiments run for, then? To prevent a false negative (a Type II error), the best practice is to determine the minimum effect size that you care about and compute, based on the sample size (the number of new samples that come in every day) and the certainty you want, how long to run the experiment for, before you start the experiment. Here [14] is a resource that helps with that computation. Setting the time in advance also minimizes the likelihood of finding a result where there is none.
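As an illustration, here is a sketch of the standard two-proportion power computation of the kind that resource [14] performs. The 3% baseline booking rate, 5% relative lift, and daily traffic below are hypothetical numbers; plug in your own.

```python
from scipy.stats import norm

def required_sample_size(p_base, min_rel_effect, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided two-proportion z-test."""
    p_alt = p_base * (1 + min_rel_effect)      # rate under the minimum effect we care about
    p_bar = (p_base + p_alt) / 2
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p_base * (1 - p_base) + p_alt * (1 - p_alt)) ** 0.5) ** 2
    return numerator / (p_alt - p_base) ** 2

# Hypothetical numbers: a 3% baseline booking rate, a 5% relative lift we would
# not want to miss, and ~5,000 new users per group per day.
n_per_group = required_sample_size(p_base=0.03, min_rel_effect=0.05)
users_per_day = 5_000
print(f"~{n_per_group:,.0f} users per group "
      f"=> roughly {n_per_group / users_per_day:.0f} days")
```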

One problem, though, is that we often don't have a good idea of the size, or even the direction, of the treatment effect. It could be that a change is actually hugely successful and major profits are being lost by not launching the successful variant sooner. Or, on the other side, sometimes an experiment introduces a bug, which makes it much better to stop the experiment early before more users are alienated.

The moment when an experiment dabbles in the otherwise significant region could be an interesting one, even when the pre-allotted time has not passed yet. In the case of the price filter experiment example, you can see that when significance was first reached, the graph clearly did not look like it had converged yet. We have found this heuristic to be very helpful in judging whether or not a result looks stable. It is important to inspect the development of the relevant metrics over time, rather than to consider the single result of an effect with a p-value.

We can use this insight to be a bit more formal about when to stop an experiment, if it's before the allotted time. This can be useful if you do want to make an automated judgment call on whether or not the change that you're testing is performing particularly well, which is helpful when you're running many experiments at the same time and cannot manually inspect them all systematically. The intuition behind it is that you should be more skeptical of early results. Therefore the threshold under which to call a result is very low at the beginning. As more data comes in, you can increase the threshold, as the likelihood of finding a false positive is much lower later in the game.

We solved the problem of how to figure out the p-value threshold at which to stop an experiment by running simulations and deriving a curve that gives us a dynamic (in time) p-value threshold to determine whether or not an early result is worth investigating. We wrote code to simulate our ecosystem with various parameters and used this to run many simulations with varying values for parameters like the real effect size, the variance, and different levels of certainty. This gives us an indication of how likely it is to see false positives or false negatives, and also how far off the estimated effect size is in the case of a true positive. In Figure 6 we show an example decision boundary.


Figure 6: An example of a dynamic p-value curve.

It should be noted that this curve is very particular to our system and the parameters that we used for this experiment. We share the graph as an example for you to use for your own analysis.
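Our simulation code and its parameters are specific to our system, but the general recipe can be sketched as follows: simulate many experiments with no true effect, record the daily p-value trajectory of each, and for every day pick the largest threshold that keeps the cumulative false-positive rate within a budget. Everything below (traffic, conversion rate, the even budget allocation) is illustrative, not our actual implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def p_value_trajectory(p_control, p_treatment, days=30, users_per_day=500):
    """Daily p-values for one simulated experiment (two-proportion test)."""
    conv = np.zeros(2)
    n = 0
    p_values = []
    for _ in range(days):
        n += users_per_day
        conv += rng.binomial(users_per_day, [p_control, p_treatment])
        table = np.array([conv, n - conv]).T      # rows: groups, cols: converted / not
        _, p, _, _ = stats.chi2_contingency(table, correction=False)
        p_values.append(p)
    return np.array(p_values)

# Simulate many "no effect" experiments, then find for each day the largest
# p-value threshold such that early stopping keeps the overall false-positive
# rate under a target (here 5%), with the budget spread evenly across days.
n_sims, days, target_fpr = 2000, 30, 0.05
null_runs = np.array([p_value_trajectory(0.05, 0.05, days) for _ in range(n_sims)])

thresholds = []
already_stopped = np.zeros(n_sims, dtype=bool)
budget_per_day = target_fpr / days
for day in range(days):
    day_p = null_runs[:, day]
    best = 0.0
    for t in np.sort(np.unique(day_p)):
        # Fraction of all runs that would *newly* stop today at threshold t.
        if ((~already_stopped) & (day_p < t)).mean() <= budget_per_day:
            best = t
        else:
            break
    thresholds.append(best)
    already_stopped |= day_p < best

print("dynamic p-value threshold by day:")
print(np.round(thresholds, 4))
```

In a simulation like this the threshold starts out far below 0.05 and grows as the experiment ages, which matches the intuition above: be most skeptical of early results.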

    Understanding results in context

A second pitfall is failing to understand results in their full context. In general, it is good practice to evaluate the success of an experiment based on a single metric of interest. This is to prevent cherry-picking of significant results in the midst of a sea of neutral ones. However, by just looking at a single metric you lose a lot of context that could inform your understanding of the effects of an experiment.

Let's go through an example. Last year we embarked on a journey to redesign our search page. Search is a fundamental component of the Airbnb ecosystem. It is the main interface to our inventory and the most common way for users to engage with our website. So, it was important for us to get it right. In Figure 7 you can see the before and after stages of the project. The new design puts more emphasis on pictures of the listings (one of our assets, since we offer professional photography to our hosts) and the map that displays where listings are located. You can read about the design and implementation process in another blog post here [17].


Figure 7: Before and after a full redesign of the search page.

A lot of work went into the project, and we all thought it was clearly better; our users agreed in qualitative user studies. Despite this, we wanted to evaluate the new design quantitatively with an experiment. This can be hard to argue for, especially when testing a big new product like this. It can feel like a missed marketing opportunity if we don't launch to everyone at the same time. However, to keep in the spirit of our testing culture, we did test the new design to measure the actual impact and, more importantly, gather knowledge about which aspects did and didn't work.

After waiting for enough time to pass, as calculated with the methodology described in the previous section, we ended up with a neutral result. The change in the global metric was tiny and the p-value indicated that it was basically a null effect. However, we decided to look into the context and to break down the result to try to see if we could figure out why this was the case. Because we did this, we found that the new design was actually performing fine in most cases, except for Internet Explorer. We then realized that the new design broke an important click-through action for certain older versions of IE, which obviously had a big negative impact on the overall results. When we fixed this, IE displayed similar results to the other browsers, a boost of more than 2%.


Figure 8: Results of the new search design.

Apart from teaching us to pay more attention to QA for IE, this was a good example of what lessons you can learn about the impact of your change in different contexts. You can break results down by many factors like browser, country, and user type. It should be noted that doing this in the classic A/B testing framework requires some care. If you test breakdowns individually as if they were independent, you run a big risk of finding effects where there aren't any, just like in the example of continuously monitoring the effect in the previous section. It's very common to be looking at a neutral experiment, break it down many ways, and find a single significant effect. Declaring victory for that particular group is likely to be incorrect. The reason for this is that you are performing multiple tests with the assumption that they are all independent, which they are not. One way of dealing with this problem is to lower the p-value threshold at which you decide the effect is real. Read more about this approach here [22]. Another way is to model the effects on all breakdowns directly with a more advanced method like logistic regression.
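As a concrete illustration of the first approach, a standard multiple-testing correction (here Holm's method via statsmodels) can be applied to the per-segment p-values before declaring any single breakdown a winner. The segment names and p-values below are invented for illustration.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from breaking one experiment down eight ways.
segment_p_values = {
    "Chrome": 0.21, "Firefox": 0.34, "Safari": 0.48, "IE": 0.03,
    "US": 0.61, "FR": 0.04, "new users": 0.09, "returning users": 0.72,
}

# Holm's step-down correction controls the family-wise error rate.
reject, p_adjusted, _, _ = multipletests(
    list(segment_p_values.values()), alpha=0.05, method="holm"
)

for (segment, p_raw), p_adj, significant in zip(
        segment_p_values.items(), p_adjusted, reject):
    print(f"{segment:>15}: raw p = {p_raw:.2f}, adjusted p = {p_adj:.2f}, "
          f"significant after correction: {significant}")
```

With these made-up numbers, the two segments that look significant on their own (0.03 and 0.04) no longer survive the correction, which is exactly the kind of premature victory declaration the correction is meant to prevent.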

    Assuming the system works

The third and final pitfall is assuming that the system works the way you think or hope it does. This should be a concern if you build your own system to evaluate experiments, as well as if you use a third-party tool. In either case, it's possible that what the system tells you does not reflect reality. This can happen either because it's faulty or because you're not using it correctly. One way to evaluate the system and your interpretation of it is by formulating hypotheses and then verifying them.


Figure 9: Results of an example dummy experiment.

Another way of looking at this is the observation that results too good to be true have a higher likelihood of being false. When you encounter results like this, it is good practice to be skeptical of them and scrutinize them in whatever way you can think of, before you consider them to be accurate.


A simple example of this process is to run an experiment where the treatment is equal to the control. These are called A/A or dummy experiments. In a perfect world the system would return a neutral result (most of the time). What does your system return? We ran many experiments like this (see an example run in Figure 9) and identified a number of issues within our own system as a result. In one case, we ran a number of dummy experiments with varying sizes of control and treatment groups. A number of them were evenly split, for example with a 50% control and a 50% treatment group (where everybody saw exactly the same website). We also added cases like a 75% control and a 25% treatment group. The results that we saw for these dummy experiments are displayed in Figure 10.


Figure 10: Results of a number of dummy experiments.

You can see that in the experiments where the control and treatment groups are the same size, the results look neutral, as expected (it's a dummy experiment, so the treatment is actually the same as the control). But for the cases where the group sizes are different, there is a massive bias against the treatment group.


We investigated why this was the case and uncovered a serious issue with the way we assigned visitors that are not logged in to treatment groups. The issue is particular to our system, but the general point is that verifying that the system works the way you think it does is worthwhile and will probably lead to useful insights.

One thing to keep in mind when you run dummy experiments is that you should expect some results to come out as non-neutral. This is because of the way the p-value works. For example, if you run a dummy experiment and look at its performance broken down by 100 different countries, you should expect, on average, 5 of them to give you a non-neutral result at the usual 0.05 level. Keep this in mind when you're scrutinizing a third-party tool!
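A quick check of both points is to simulate an A/A comparison broken down by country: each breakdown truly has no effect, yet about 5% of them will come out "significant" at the 0.05 level. The traffic numbers below are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

n_countries = 100
users_per_group = 10_000   # per country, per group
p_conversion = 0.05        # identical in control and treatment: a true A/A test

false_positives = 0
for _ in range(n_countries):
    conv_c = rng.binomial(users_per_group, p_conversion)
    conv_t = rng.binomial(users_per_group, p_conversion)
    table = [[conv_c, users_per_group - conv_c],
             [conv_t, users_per_group - conv_t]]
    _, p, _, _ = stats.chi2_contingency(table, correction=False)
    false_positives += p < 0.05

# Expect roughly 5 "significant" countries even though nothing changed.
print(f"{false_positives} of {n_countries} country breakdowns look significant")
```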

    Conclusion

Controlled experiments are a great way to inform decisions around product development. Hopefully, the lessons in this post will help prevent some common A/B testing errors.

First, the best way to determine how long you should run an experiment is to compute, in advance, the sample size you need to make an inference. If the system gives you an early result, you can try to make a heuristic judgment on whether or not the trends have converged. It's generally good to be conservative in this scenario. Finally, if you do need to make procedural launch and stopping decisions, it's good to be extra careful by employing a dynamic p-value threshold to determine how certain you can be about a result. The system we use at Airbnb to evaluate experiments employs all three ideas to help us with our decision-making around product changes.

It is important to consider results in context. Break them down into meaningful cohorts and try to deeply understand the impact of the change you made. In general, experiments should be run to make good decisions about how to improve the product, rather than to aggressively optimize for a metric. Optimizing is not impossible, but it often leads to opportunistic decisions for short-term gains. By focusing on learning about the product, you set yourself up for better future decisions and more effective tests.

Finally, it is good to be scientific about your relationship with the reporting system. If something doesn't seem right or if it seems too good to be true, investigate it. A simple way of doing this is to run dummy experiments, but any knowledge about how the system behaves is useful for interpreting results. At Airbnb we have found a number of bugs and counter-intuitive behaviors in our system by doing this.

Together with Will Moss, I gave a public talk on this topic in April 2014. You can watch a video recording of it here [27]. We hope this post was insightful for those who want to improve their own experimentation.

References

    1. http://www.airbnb.com/

    2. http://nerds.airbnb.com/wp-content/uploads/2014/05/img1_launch.png

    3. http://nerds.airbnb.com/wp-content/uploads/2014/05/img1_launch.png

    4. http://nerds.airbnb.com/wp-content/uploads/2014/05/img2_price.png

    5. https://github.com/cloudera/gertrude

    6. https://github.com/etsy/feature

    7. http://facebook.github.io/planout/

    8. http://nerds.airbnb.com/wp-content/uploads/2014/05/img3_flow.png

    9. http://nerds.airbnb.com/wp-content/uploads/2014/05/img3_flow.png

    10. http://nerds.airbnb.com/wp-content/uploads/2014/05/img4_max_price.png

    11. http://nerds.airbnb.com/wp-content/uploads/2014/05/img4_max_price.png

    12. http://nerds.airbnb.com/wp-content/uploads/2014/05/img5_max_price_results.png

    13. http://nerds.airbnb.com/wp-content/uploads/2014/05/img5_max_price_results.png

    14. http://www.evanmiller.org/ab-testing/sample-size.html

15. http://nerds.airbnb.com/wp-content/uploads/2014/05/img6_dynamic_p.png

16. http://nerds.airbnb.com/wp-content/uploads/2014/05/img6_dynamic_p.png

    17. http://nerds.airbnb.com/redesigning-search/

    18. http://nerds.airbnb.com/wp-content/uploads/2014/05/img7_magellan.png

    19. http://nerds.airbnb.com/wp-content/uploads/2014/05/img7_magellan.png

    20. http://nerds.airbnb.com/wp-content/uploads/2014/05/img8_magellan_results.png

    21. http://nerds.airbnb.com/wp-content/uploads/2014/05/img8_magellan_results.png



    22. http://www.evanmiller.org/how-not-to-run-an-ab-test.html

    23. http://nerds.airbnb.com/wp-content/uploads/2014/05/img9_dummy.png

    24. http://nerds.airbnb.com/wp-content/uploads/2014/05/img9_dummy.png

    25. http://nerds.airbnb.com/wp-content/uploads/2014/05/img9a_dummy_results.png

    26. http://nerds.airbnb.com/wp-content/uploads/2014/05/img9a_dummy_results.png

    27. https://www.youtube.com/watch?v=lVTIcf6IhY4