Experiments at Airbnb - Airbnb Engineering


Published 29/5/2014 at http://nerds.airbnb.com/experiments-at-airbnb/

    Experiments at Airbnb

Airbnb is an online two-sided marketplace that matches people who rent out their

    homes (hosts) with people who are looking for a place to stay (guests). We use

    controlled experiments to learn and make decisions at every step of product

    development, from design to algorithms. They are equally important in shaping the

    user experience.

    While the basic principles behind controlled experiments are relatively straightforward,

    using experiments in a complex online ecosystem like Airbnb during fast-paced product

    development can lead to a number of common pitfalls. Some, like stopping an

    experiment too soon, are relevant to most experiments. Others, like the issue of

    introducing bias on a marketplace level, start becoming relevant for a more specialized

application like Airbnb. We hope that by sharing the pitfalls we've experienced and

    learned to avoid, we can help you to design and conduct better, more reliable

    experiments for your own application.

Why experiments?

Experiments provide a clean and simple way to make causal inference. It's often

surprisingly hard to tell the impact of something you do by simply doing it and seeing

what happens, as illustrated in Figure 1.


Figure 1 It's hard to tell the effect of this product launch.

    The outside world often has a much larger effect on metrics than product changes do.

    Users can behave very differently depending on the day of week, the time of year, the

    weather (especially in the case of a travel company like Airbnb), or whether they

    learned about the website through an online ad or found the site organically.

    Controlled experiments isolate the impact of the product change while controlling for

the aforementioned external factors. In Figure 2, you can see an example of a new feature that we tested and rejected this way. We thought of a new way to select what

    prices you want to see on the search page, but users ended up engaging less with it

    than the old filter, so we did not launch it.

    [4]

    Figure 2 Example of a new feature that we tested and rejected.

    When you test a single change like this, the methodology is often called A/B testing or

    split testing. This post will not go into the basics of how to run a basic A/B test. There

    are a number of companies that provide out of the box solutions to run basic A/B tests

    and a couple of bigger tech companies have open sourced their internal systems for

others to use. See Cloudera's Gertrude, Etsy's Feature, and Facebook's PlanOut,

    for example.
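To make concrete what such a basic A/B readout computes, here is a minimal sketch of the standard two-proportion z-test behind most split-testing tools. The visitor counts and conversions below are invented for illustration, not Airbnb data.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)        # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Invented counts: 10,000 visitors per variant
z, p = two_proportion_z_test(conv_a=480, n_a=10_000, conv_b=530, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Note that even a lift that looks meaningful in raw counts (480 vs. 530 conversions) can fail to reach significance at this sample size.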

    The case of Airbnb

At Airbnb we have built our own A/B testing framework to run experiments, which you

    will be able to read more about in our upcoming blog post on the details of its

    implementation. There are a couple of features of our business that make

experimentation more involved than a regular change of a button color, and that's why

    we decided to create our own testing framework.

    First, users can browse when not logged in or signed up, making it more difficult to tie

    a user to actions. People often switch devices (between web and mobile) in the midst of

    booking. Also given that bookings can take a few days to confirm, we need to wait for

    those results. Finally, successful bookings are often dependent on available inventory

and responsiveness of hosts, factors that are out of our control.

    Our booking flow is also complex. First, a visitor has to make a search. The next step is

    for a searcher to actually contact a host about a listing. Then, the host has to accept an

inquiry and then the guest has to actually book the place. In addition, we have multiple

flows that can lead to a booking: a guest can instantly book some listings without a contact, and can also make a booking request that goes straight to booking. This four

step flow is visualized in Figure 3. We look at the process of going through these four

    stages, but the overall conversion rate between searching and booking is our main

    metric.
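As a toy illustration of reading such a funnel, the sketch below computes per-step and overall conversion for four stages like those in Figure 3. All counts are made up for the example.

```python
# Invented counts for the four stages; not real Airbnb numbers.
funnel = [
    ("search", 100_000),
    ("contact", 12_000),
    ("accept", 6_000),
    ("book", 3_600),
]

prev = funnel[0][1]
for step, count in funnel:
    step_rate = count / prev           # conversion from the previous stage
    overall = count / funnel[0][1]     # conversion from the first stage
    print(f"{step:8s}{count:8d}  step={step_rate:6.1%}  overall={overall:6.2%}")
    prev = count
```

Breaking an experiment result down this way shows which stage of the flow a treatment actually moved, while the overall search-to-book rate remains the headline metric.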


    Figure 3 Example of an experiment result broken down by booking flow steps.

    How long do you need to run an experiment?

    A very common source of confusion in online controlled experiments is how much time

    you need to make a conclusion about the results of an experiment. The problem with

the naive method of using the p-value as a stopping criterion is that the statistical

    test that gives you a p-value assumes that you designed the experiment with a sample

    and effect size in mind. If you continuously monitor the development of a test and the

    resulting p-value, you are very likely to see an effect, even if there is none. Another

    common error is to stop an experiment too early, before an effect becomes visible.
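The peeking effect is easy to demonstrate with a simulation. The sketch below runs many null experiments (no true effect in either arm) and compares the false-positive rate of stopping at the first p < 0.05 against testing once at the end. Batch sizes, rates, and trial counts are arbitrary choices for the demo.

```python
import math
import random

random.seed(0)

def z_p_value(succ_a, succ_b, n):
    """Two-sided p-value for an equal-sample two-proportion z-test."""
    p_pool = (succ_a + succ_b) / (2 * n)
    se = math.sqrt(max(p_pool * (1 - p_pool) * 2 / n, 1e-12))
    z = (succ_b - succ_a) / (n * se)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def run_experiment(peek, batches=10, batch=300, rate=0.05):
    """Simulate one null experiment; return True if it (falsely) rejects."""
    a = b = n = 0
    for _ in range(batches):
        a += sum(random.random() < rate for _ in range(batch))
        b += sum(random.random() < rate for _ in range(batch))
        n += batch
        if peek and z_p_value(a, b, n) < 0.05:
            return True                      # stopped early: false positive
    return z_p_value(a, b, n) < 0.05         # one test at the planned end

trials = 300
peeking = sum(run_experiment(True) for _ in range(trials)) / trials
single = sum(run_experiment(False) for _ in range(trials)) / trials
print(f"false-positive rate with peeking: {peeking:.0%}, single look: {single:.0%}")
```

With ten looks per experiment, the peeking strategy rejects the null far more often than the nominal 5%, even though there is nothing to find.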

    Here is an example of an actual experiment we ran. We tested changing the maximum

    value of the price filter on the search page from $300 to $1000 as displayed below.


    Figure 4 Example experiment testing the value of the price filter

In Figure 5 we show the development of the experiment over time. The top graph shows

    the treatment effect (Treatment / Control 1) and the bottom graph shows the p-value

over time. As you can see, the p-value curve hits the commonly used significance

level of 0.05 after 7 days, at which point the effect size is 4%. If we had stopped there, we

    would have concluded that the treatment had a strong and significant effect on the


    likelihood of booking. But we kept the experiment running and we found that actually,

the experiment ended up neutral. The final effect size was practically null, with the

p-value indicating that whatever the remaining effect size was, it should be regarded as

    noise.


    Figure 5 Result of the price filter experiment over time.

How did we know not to stop when the p-value hit 0.05? It turns out that this pattern

    of hitting significance early and then converging back to a neutral result is actually

    quite common in our system. There are various reasons for this. Users often take a long

    time to book, so the early converters have a disproportionately large influence in the

    beginning of the experiment. Also, even small sample sizes in online experiments are

massive by the standards of the classical statistics in which these methods were developed.

Since the statistical test is a function of the sample and effect sizes, if an early effect

    size is large through natural variation it is likely for the p-value to be below 0.05 early.

    But the most important reason is that you are performing a statistical test every time

    you compute a p-value and the more you do it, the more likely you are to find an effect.
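One standard remedy, which the discussion above points toward, is to fix the sample size in advance from a minimum detectable effect and test only once that sample is reached. Below is a sketch of the usual two-proportion sample-size approximation; the baseline rate and lift are invented numbers, not Airbnb figures.

```python
import math

def required_n_per_group(base_rate, min_detectable_lift,
                         alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided two-proportion z-test."""
    p1 = base_rate
    p2 = base_rate * (1 + min_detectable_lift)   # relative lift
    z_alpha = 1.96    # critical value for two-sided alpha = 0.05
    z_beta = 0.84     # for 80% power
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

# How many users per arm to detect a 4% relative lift on a 5% baseline?
n = required_n_per_group(base_rate=0.05, min_detectable_lift=0.04)
print(n)
```

The answer runs into the hundreds of thousands per arm, which is why small relative lifts on low base rates demand long-running experiments and why an early "significant" reading at a fraction of that sample deserves suspicion.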

As a side note, people familiar with our website might notice that, at the time of writing,

    we did in fact launch the increased max price filter, even though the result was neutral.
