Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Transcript of Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Page 1: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb
Page 2: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Airbnb Search Architecture

Maxim Charkov, Engineering Manager, @mcharkov

Page 3: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Airbnb

Total Guests: 20,000,000+

Countries: 190

Cities: 34,000+

Castles: 600+

Listings Worldwide: 800,000+

Page 4: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Search

www.airbnb.com

Page 5: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Booking Model

Search -> Contact -> Accept -> Book

Page 6: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Search Backend

Technical Stack ____________________________

DropWizard as a service framework (incl. Jetty, Jersey, Jackson)

Guice dependency injection framework, Guava libraries, etc.

ZooKeeper (via Smartstack) for service discovery.

Lucene for index storage and simple retrieval.

In-house built real time indexing, ranking, advanced filtering.

Page 7: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Search Backend

~150 search threads

4 indexing threads

Data maintained by indexers:

Inverted Lucene index for retrieval

Forward index for ranking signals

Relevance models

JVM
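The split between the two indexes can be sketched in plain Java. This is a minimal illustration, not Airbnb's code: the class name `SearchNodeSketch`, the term format, and the `double[]` signal layout are all assumptions, and ordinary collections stand in for Lucene. The inverted index answers "which listings match?", while the forward index answers "what are this listing's ranking signals?".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: plain collections stand in for Lucene and the signal store.
final class SearchNodeSketch {
    // Inverted index: term (e.g. "room_type:private") -> sorted listing ids.
    private final Map<String, TreeSet<Integer>> inverted = new HashMap<>();
    // Forward index: listing id -> ranking signals (illustrative layout).
    private final Map<Integer, double[]> forward = new HashMap<>();

    void index(int listingId, List<String> terms, double[] signals) {
        for (String term : terms) {
            inverted.computeIfAbsent(term, t -> new TreeSet<>()).add(listingId);
        }
        forward.put(listingId, signals.clone());
    }

    // Retrieval: listings that contain every requested term (AND semantics).
    List<Integer> retrieve(List<String> terms) {
        TreeSet<Integer> result = null;
        for (String term : terms) {
            Set<Integer> postings = inverted.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    // Ranking-time lookup: direct access to a listing's signals by id.
    double[] signals(int listingId) {
        return forward.get(listingId);
    }
}
```

A caller would first call `retrieve(...)` to get candidate ids, then look up `signals(id)` for each candidate while scoring.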

Page 8: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Indexing

What’s in the Lucene index? ____________________________

Positions of listings indexed using Lucene’s spatial module (RecursivePrefixTreeStrategy)

Categorical and numerical properties like room type and maximum occupancy

Calendar information

Full text (descriptions, reviews, etc.)

~40 fields per listing from a variety of data sources, all updated in real time
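A prefix-tree spatial strategy indexes each position as cells of a hierarchical grid, so that nearby points share a common prefix. The talk does not say which grid was used; geohash is one common choice, and the sketch below is a standalone geohash encoder written for illustration (the class name `GeohashSketch` is an assumption, and this is not Lucene's API).

```java
// Minimal geohash encoder: an illustrative prefix-tree style encoding.
final class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    static String encode(double lat, double lon, int precision) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // even bits refine longitude, odd bits refine latitude
        int bits = 0;
        int bitCount = 0;
        while (hash.length() < precision) {
            if (evenBit) {
                double mid = (lonMin + lonMax) / 2;
                if (lon >= mid) { bits = (bits << 1) | 1; lonMin = mid; }
                else { bits <<= 1; lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (lat >= mid) { bits = (bits << 1) | 1; latMin = mid; }
                else { bits <<= 1; latMax = mid; }
            }
            evenBit = !evenBit;
            // Every 5 bits become one base-32 character.
            if (++bitCount == 5) {
                hash.append(BASE32.charAt(bits));
                bits = 0;
                bitCount = 0;
            }
        }
        return hash.toString();
    }
}
```

Because each extra character only narrows the previous cell, a shorter hash is always a prefix of a longer one for the same point, which is what lets a term index answer "everything inside this cell" queries.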

Page 9: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Indexing

Challenges ____________________________

Bootstrap (creating the index from scratch)

Ensuring consistency of the index with ground truth data in real time

Page 10: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Indexing

[Diagram: MySQL databases (master, calendar, fraud) -> SpinalTap -> Medusa (backed by Persistent Storage) -> Search1, Search2, … SearchN]

Page 12: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Indexing

SpinalTap ____________________________

Responsible for detecting updates happening to the ground truth data (no need to maintain search index invalidation logic in application code)

Tails binary update logs from MySQL servers (5.6+)

Converts them into actionable data objects, called “Mutations”

Broadcasts using a distributed queue, like Kafka or RabbitMQ

Page 13: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Indexing

    # sources for mysql binary logs
    sources:
      - name: airslave
        host: localhost
        port: 11
        user: spinaltap
        password: spinaltap
      - name: calendar_db
        host: localhost
        port: 11
        user: spinaltap
        password: spinaltap

    destinations:
      - name: kafka
        clazzName: com.airbnb.spinaltap.destination.kafka.KafkaDestination

    pipes:
      - name: search
        sources: ["airslave", "calendar_db"]
        tables: ["production:listings,calendar_db:schedule2s"]
        destination: kafka

SpinalTap Pipes ____________________________

Each pipe connects one or more binlog sources (MySQL) with a destination (e.g. Kafka)

Configured via YAML files

Page 14: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Indexing

    {
      "seq" : 3,
      "binlogpos" : "mysql-bin.000002:5217:5273",
      "id" : -1857589909002862756,
      "type" : 2,
      "table" : {
        "id" : 70,
        "name" : "users",
        "db" : "my_db",
        "columns" : [
          { "name" : "name", "type" : 15, "ispk" : false },
          { "name" : "age", "type" : 2, "ispk" : false }
        ]
      },
      "rows" : [
        {
          "1" : { "name" : "eric", "age" : 31 },
          "2" : { "name" : "eric", "age" : 28 }
        }
      ]
    }

SpinalTap Mutations ____________________________

Each binlog entry is parsed and converted into one of three event types: “Insert”, “Delete” or “Update”

“Insert” and “Delete” carry the entire row to be inserted or deleted

“Update” mutations contain both the old and the current row

Additional information: unique id, sequence number, column and table metadata
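The three event types described above could be modeled in memory roughly as follows. This is a sketch under assumptions: the class `MutationSketch`, its field names, and the `changedColumns` helper are hypothetical and are not SpinalTap's actual types.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical in-memory shape for a mutation; names are illustrative only.
final class MutationSketch {
    enum Type { INSERT, DELETE, UPDATE }

    final Type type;
    final String table;
    final long sequence;
    final Map<String, Object> oldRow; // null for INSERT
    final Map<String, Object> newRow; // null for DELETE

    private MutationSketch(Type type, String table, long sequence,
                           Map<String, Object> oldRow, Map<String, Object> newRow) {
        this.type = type;
        this.table = table;
        this.sequence = sequence;
        this.oldRow = oldRow;
        this.newRow = newRow;
    }

    // Insert and Delete carry the entire row being inserted or deleted.
    static MutationSketch insert(String table, long sequence, Map<String, Object> row) {
        return new MutationSketch(Type.INSERT, table, sequence, null, row);
    }

    static MutationSketch delete(String table, long sequence, Map<String, Object> row) {
        return new MutationSketch(Type.DELETE, table, sequence, row, null);
    }

    // Update carries both the old and the current row.
    static MutationSketch update(String table, long sequence,
                                 Map<String, Object> oldRow, Map<String, Object> newRow) {
        return new MutationSketch(Type.UPDATE, table, sequence, oldRow, newRow);
    }

    // Columns whose values differ between the old and new row (UPDATE only).
    Map<String, Object> changedColumns() {
        if (type != Type.UPDATE) {
            return Collections.emptyMap();
        }
        Map<String, Object> changed = new HashMap<>();
        for (Map.Entry<String, Object> e : newRow.entrySet()) {
            if (!Objects.equals(oldRow.get(e.getKey()), e.getValue())) {
                changed.put(e.getKey(), e.getValue());
            }
        }
        return changed;
    }
}
```

Carrying both rows on an update lets a consumer compute the changed columns itself, without querying MySQL again.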

Page 15: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Indexing

Medusa ____________________________

Documents in index contain data from ~15 different source tables

Lucene needs a copy of all fields (not just fields that changed) to update the index

We also need a mechanism to build the entire index from scratch, without putting too much strain on MySQL

Page 16: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Indexing

Reads from SpinalTap or directly from MySQL

Data from multiple tables is joined into Thrift objects, which correspond to Lucene documents

The intermediate Thrift objects are persisted in Redis

As changes are detected, updated objects are pushed to the Search instances to update Lucene indexes

Can bootstrap the entire index in 3 minutes via multithreaded streaming

Leader election via ZooKeeper

[Diagram: Medusa (backed by Persistent Storage) -> Search1, Search2, … SearchN]
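The flow on this slide (merge per-table changes into one stored document, then push the complete document to every search instance) can be sketched as below. Everything here is an assumption made for illustration: `DocumentAssemblerSketch` and `SearchInstance` are hypothetical names, a `HashMap` stands in for Redis, and plain maps stand in for the Thrift objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a document assembler; a HashMap stands in for Redis.
final class DocumentAssemblerSketch {
    interface SearchInstance {
        void update(int listingId, Map<String, Object> fullDocument);
    }

    private final Map<Integer, Map<String, Object>> store = new HashMap<>();
    private final List<SearchInstance> instances = new ArrayList<>();

    void register(SearchInstance instance) {
        instances.add(instance);
    }

    // Merge changed columns from one source table into the stored document,
    // then push the complete document (not just the delta) to every instance.
    void apply(int listingId, String table, Map<String, Object> changedColumns) {
        Map<String, Object> doc = store.computeIfAbsent(listingId, id -> new HashMap<>());
        for (Map.Entry<String, Object> e : changedColumns.entrySet()) {
            doc.put(table + "." + e.getKey(), e.getValue());
        }
        Map<String, Object> snapshot = new HashMap<>(doc);
        for (SearchInstance instance : instances) {
            instance.update(listingId, snapshot);
        }
    }

    // Bootstrap: replay every stored document to an instance without reading MySQL.
    void bootstrap(SearchInstance instance) {
        for (Map.Entry<Integer, Map<String, Object>> e : store.entrySet()) {
            instance.update(e.getKey(), new HashMap<>(e.getValue()));
        }
    }
}
```

Keeping the joined document in a store is what makes both requirements workable in this sketch: an update always has all fields available, and a new instance can be filled from the store alone.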

Page 17: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Ranking

Ranking Problem ____________________________

Not a text search problem

Users are almost never searching for a specific item; rather, they’re looking to “Discover”

The most common component of a query is location

Highly personalized – the user is a part of the query

Optimizing for conversion (Search -> Inquiry -> Booking)

Evolution through continuous experimentation

Page 18: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Ranking

Ranking Components ____________________________

Relevance

Quality

Bookability

Personalization

Desirability of location

New host promotion

etc.

Page 19: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Ranking

Several hundred signals determining search ranking:

Properties of the listing (reviews, location, etc.)

Behavioral signals (mined from request logs)

Image quality and clickability (computer vision)

Host behavior (response time/rate, cancellations, etc.)

Host preferences model

[Diagram: signal sources are DB snapshots and logs]

Page 20: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Ranking

    public void attemptLoadData() {
      DateTime remoteTs = dataLoader.getModTime(pathToSignals);

      // Reload only when the remote file is newer than the copy already loaded.
      if (currentTs == null || remoteTs.isAfter(currentTs)) {
        Map<K, D> newSignals = loadData();
        // The first load is accepted as-is; later loads must pass a health check.
        if (newSignals != null && (signalsMap == null || isHealthy(newSignals))) {
          synchronized (this) {
            signalsMap = newSignals;
            currentTs = remoteTs;
            this.notifyAll();
          }
        } else {
          LOG.severe("Failed to load the avro file: " + pathToSignals);
        }
      }
    }

    …

    ThreadedLoader<Integer, QualitySignalsAvro> qualitySignalsLoader =
        loaders.get(LoaderCollection.Loader.QualitySignals);
    final QualitySignalsAvro qs = qualitySignalsLoader.get(hostingId, true);

Loading Signals ____________________________

Storing signals in a separate data structure

Pros:

Good fit for this type of update pattern: not real-time, but almost everything changes on each load

No need for costly Lucene index rebuild

Greatly simplifies design

Cons:

Unable to use Lucene retrieval on such data

Page 21: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Life of a Query

[Diagram: query pipeline.
Stage labels: Query Understanding, Retrieval, External Calls, Populator, Scorer, Third Pass Ranking, Result Generation.
Annotations: Geocoding; Configuring retrieval options; Choosing ranking models; Quality; Bookability; Relevance; Filtering and Reranking; Pricing Service; Social Connections; AirEvents Logging.
Result counts shown along the pipeline: 2000 results and 25 results.]

Page 22: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Ranking

Second Pass Ranking ____________________________

Traditional ranking works like this:

then sort by rr

In contrast, second pass operates on the entire list at once:

Makes it possible to implement features like result diversity, etc.
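The difference between the two approaches can be sketched as follows. The slide's formulas are not preserved in this transcript, so this is only an illustration under assumptions: `SecondPassSketch`, the `neighborhood` diversity key, and the multiplicative `penalty` are hypothetical and are not the ranking function from the talk.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch contrasting per-item ranking with a list-wise second pass.
final class SecondPassSketch {
    static final class Scored {
        final int listingId;
        final String neighborhood; // illustrative diversity key
        final double score;

        Scored(int listingId, String neighborhood, double score) {
            this.listingId = listingId;
            this.neighborhood = neighborhood;
            this.score = score;
        }
    }

    // Traditional ranking: each item was scored on its own; just sort by score.
    static List<Scored> sortByScore(List<Scored> items) {
        List<Scored> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        return sorted;
    }

    // Second pass: sees the whole list. Each repeat of a neighborhood multiplies
    // the score by `penalty`, which spreads similar results apart.
    static List<Scored> diversify(List<Scored> items, double penalty) {
        Map<String, Integer> seen = new HashMap<>();
        List<Scored> adjusted = new ArrayList<>();
        for (Scored s : sortByScore(items)) {
            int repeats = seen.merge(s.neighborhood, 1, Integer::sum) - 1;
            adjusted.add(new Scored(s.listingId, s.neighborhood,
                                    s.score * Math.pow(penalty, repeats)));
        }
        return sortByScore(adjusted);
    }
}
```

The adjustment for one item depends on which other items are in the list, which is exactly what a per-item scoring function cannot express.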

Page 24: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Ranking

Page 25: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Ranking

Page 26: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Ranking

Page 27: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Ranking

Page 28: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb

Outside of the scope of this talk ____________________________

Ranking models

Machine Learning infrastructure

Tools (loadtest, deploy, etc.)

Other Search Infrastructure services: UserProfiler, Pricing, Social, Hoods, etc.

Page 29: Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb