Experiments at Airbnb - Airbnb Engineering


Published 29/5/2014 at http://nerds.airbnb.com/experiments-at-airbnb/

    Experiments at Airbnb

Airbnb is an online two-sided marketplace that matches people who rent out their

    homes (hosts) with people who are looking for a place to stay (guests). We use

    controlled experiments to learn and make decisions at every step of product

    development, from design to algorithms. They are equally important in shaping the

    user experience.

    While the basic principles behind controlled experiments are relatively straightforward,

    using experiments in a complex online ecosystem like Airbnb during fast-paced product

    development can lead to a number of common pitfalls. Some, like stopping an

    experiment too soon, are relevant to most experiments. Others, like the issue of

    introducing bias on a marketplace level, start becoming relevant for a more specialized

application like Airbnb. We hope that by sharing the pitfalls we've experienced and

    learned to avoid, we can help you to design and conduct better, more reliable

    experiments for your own application.

Why experiments?

Experiments provide a clean and simple way to make causal inference. It's often

surprisingly hard to tell the impact of something you do by simply doing it and seeing

what happens, as illustrated in Figure 1.


Figure 1 It's hard to tell the effect of this product launch.

    The outside world often has a much larger effect on metrics than product changes do.

    Users can behave very differently depending on the day of week, the time of year, the

    weather (especially in the case of a travel company like Airbnb), or whether they

    learned about the website through an online ad or found the site organically.

    Controlled experiments isolate the impact of the product change while controlling for

the aforementioned external factors. In Figure 2, you can see an example of a new feature that we tested and rejected this way. We thought of a new way to select what

    prices you want to see on the search page, but users ended up engaging less with it

    than the old filter, so we did not launch it.

    [4]

    Figure 2 Example of a new feature that we tested and rejected.

    When you test a single change like this, the methodology is often called A/B testing or

    split testing. This post will not go into the basics of how to run a basic A/B test. There

    are a number of companies that provide out of the box solutions to run basic A/B tests

    and a couple of bigger tech companies have open sourced their internal systems for

others to use. See Cloudera's Gertrude, Etsy's Feature, and Facebook's PlanOut,

    for example.
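To make concrete what such a basic A/B readout computes, here is a minimal sketch of the standard two-proportion z-test behind most split-testing tools. The visitor counts and conversions below are invented for illustration, not Airbnb data.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Invented counts: 10,000 visitors per variant
z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=530, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Note that even a lift that looks meaningful in raw counts (480 vs. 530 conversions) can fail to reach significance at this sample size.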

    The case of Airbnb

At Airbnb we have built our own A/B testing framework to run experiments, which you

    will be able to read more about in our upcoming blog post on the details of its

    implementation. There are a couple of features of our business that make

experimentation more involved than a regular change of a button color, and that's why

    we decided to create our own testing framework.

    First, users can browse when not logged in or signed up, making it more difficult to tie

    a user to actions. People often switch devices (between web and mobile) in the midst of

    booking. Also given that bookings can take a few days to confirm, we need to wait for

    those results. Finally, successful bookings are often dependent on available inventory

and responsiveness of hosts, factors that are out of our control.

    Our booking flow is also complex. First, a visitor has to make a search. The next step is

    for a searcher to actually contact a host about a listing. Then, the host has to accept an

inquiry and then the guest has to actually book the place. In addition, we have multiple

flows that can lead to a booking: a guest can instantly book some listings without a contact, and can also make a booking request that goes straight to booking. This four

step flow is visualized in Figure 3. We look at the process of going through these four

    stages, but the overall conversion rate between searching and booking is our main

    metric.
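As a toy illustration of reading such a funnel, the sketch below computes per-step and overall conversion for four stages like those in Figure 3. All counts are made up for the example.

```python
# Invented counts for the four stages; not real Airbnb numbers.
funnel = [
    ("search", 100_000),
    ("contact", 12_000),
    ("accept", 6_000),
    ("book", 3_600),
]

prev = funnel[0][1]
for step, count in funnel:
    step_rate = count / prev           # conversion from the previous stage
    overall = count / funnel[0][1]     # conversion from the first stage
    print(f"{step:8s}{count:8d}  step={step_rate:6.1%}  overall={overall:6.2%}")
    prev = count
```

Breaking an experiment result down this way shows which stage of the flow a treatment actually moved, while the overall search-to-book rate remains the headline metric.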


    Figure 3 Example of an experiment result broken down by booking flow steps.

    How long do you need to run an experiment?

    A very common source of confusion in online controlled experiments is how much time

    you need to make a conclusion about the results of an experiment. The problem with

the naive method of using the p-value as a stopping criterion is that the statistical

    test that gives you a p-value assumes that you designed the experiment with a sample

    and effect size in mind. If you continuously monitor the development of a test and the

    resulting p-value, you are very likely to see an effect, even if there is none. Another

    common error is to stop an experiment too early, before an effect becomes visible.
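The peeking effect is easy to demonstrate with a simulation. The sketch below runs many null experiments (no true effect in either arm) and compares the false-positive rate of stopping at the first p < 0.05 against testing once at the end. Batch sizes, rates, and trial counts are arbitrary choices for the demo.

```python
import math
import random

random.seed(0)

def z_p_value(succ_a, succ_b, n):
    """Two-sided p-value for an equal-sample two-proportion z-test."""
    p_pool = (succ_a + succ_b) / (2 * n)
    se = math.sqrt(max(p_pool * (1 - p_pool) * 2 / n, 1e-12))
    z = (succ_b - succ_a) / (n * se)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def run_experiment(peek, batches=10, batch=300, rate=0.05):
    """Simulate one null experiment; return True if it (falsely) rejects."""
    a = b = n = 0
    for _ in range(batches):
        a += sum(random.random() < rate for _ in range(batch))
        b += sum(random.random() < rate for _ in range(batch))
        n += batch
        if peek and z_p_value(a, b, n) < 0.05:
            return True                      # stopped early: false positive
    return z_p_value(a, b, n) < 0.05         # one test at the planned end

trials = 300
peeking = sum(run_experiment(True) for _ in range(trials)) / trials
single = sum(run_experiment(False) for _ in range(trials)) / trials
print(f"false-positive rate with peeking: {peeking:.0%}, single look: {single:.0%}")
```

With ten looks per experiment, the peeking strategy rejects the null far more often than the nominal 5%, even though there is nothing to find.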

    Here is an example of an actual experiment we ran. We tested changing the maximum

    value of the price filter on the search page from $300 to $1000 as displayed below.


    Figure 4 Example experiment testing the value of the price filter

In Figure 5 we show the development of the experiment over time. The top graph shows

    the treatment effect (Treatment / Control 1) and the bottom graph shows the p-value

over time. As you can see, the p-value curve hits the commonly used significance

level of 0.05 after 7 days, at which point the effect size is 4%. If we had stopped there, we

    would have concluded that the treatment had a strong and significant effect on the


    likelihood of booking. But we kept the experiment running and we found that actually,

the experiment ended up neutral. The final effect size was practically null, with the

p-value indicating that whatever the remaining effect size was, it should be regarded as

    noise.


    Figure 5 Result of the price filter experiment over time.

How did we know not to stop when the p-value hit 0.05? It turns out that this pattern

    of hitting significance early and then converging back to a neutral result is actually

    quite common in our system. There are various reasons for this. Users often take a long

    time to book, so the early converters have a disproportionately large influence in the

    beginning of the experiment. Also, even small sample sizes in online experiments are

massive by the standards of the classical statistics in which these methods were developed.

Since the statistical test is a function of the sample and effect sizes, if an early effect

    size is large through natural variation it is likely for the p-value to be below 0.05 early.

    But the most important reason is that you are performing a statistical test every time

    you compute a p-value and the more you do it, the more likely you are to find an effect.
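One standard remedy, which the discussion above points toward, is to fix the sample size in advance from a minimum detectable effect and test only once that sample is reached. Below is a sketch of the usual two-proportion sample-size approximation; the baseline rate and lift are invented numbers, not Airbnb figures.

```python
import math

def required_n_per_group(base_rate, min_detectable_lift,
                         alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided two-proportion z-test."""
    p1 = base_rate
    p2 = base_rate * (1 + min_detectable_lift)   # relative lift
    z_alpha = 1.96    # critical value for two-sided alpha = 0.05
    z_beta = 0.84     # for 80% power
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

# How many users per arm to detect a 4% relative lift on a 5% baseline?
n = required_n_per_group(base_rate=0.05, min_detectable_lift=0.04)
print(n)
```

The answer runs into the hundreds of thousands per arm, which is why small relative lifts on low base rates demand long-running experiments and why an early "significant" reading at a fraction of that sample deserves suspicion.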

As a side note, people familiar with our website might notice that, at the time of writing,

    we did in fact launch the increased max price filter, even though the result was neutral.
