Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb


  • Airbnb Search Architecture

    Maxim Charkov, Engineering Manager maxim.charkov@airbnb.com, @mcharkov

  • Airbnb

    Total Guests: 20,000,000+

    Countries: 190

    Cities: 34,000+

    Castles: 600+

    Listings Worldwide: 800,000+

  • Search

    www.airbnb.com

  • Booking Model

    Search -> Contact -> Accept -> Book

  • Search Backend

    Technical Stack ____________________________

    DropWizard as a service framework (incl. Jetty, Jersey, Jackson)

    Guice dependency injection framework, Guava libraries, etc.

    ZooKeeper (via SmartStack) for service discovery.

    Lucene for index storage and simple retrieval.

    In-house built real time indexing, ranking, advanced filtering.

  • Search Backend

    ~150 search threads

    4 indexing threads

    Data maintained by indexers:

    Inverted Lucene index for retrieval

    Forward index for ranking signals

    Relevance models

    All of the above runs within a single JVM

  • Indexing

    What's in the Lucene index? ____________________________

    Positions of listings indexed using Lucene's spatial module (RecursivePrefixTreeStrategy)

    Categorical and numerical properties like room type and maximum occupancy

    Calendar information

    Full text (descriptions, reviews, etc.)

    ~40 fields per listing from a variety of data sources, all updated in real time

  • Indexing

    Challenges ____________________________

    Bootstrap (creating the index from scratch)

    Ensuring consistency of the index with ground truth data in real time

  • Indexing

    [Diagram: MySQL masters (master, calendar, fraud) -> SpinalTap -> Medusa -> Persistent Storage -> Search1, Search2, ... SearchN]


  • Indexing

    SpinalTap ____________________________

    Responsible for detecting updates happening to the ground truth data (no need to maintain search index invalidation logic in application code)

    Tails binary update logs from MySQL servers (5.6+)

    Converts them into actionable data objects, called Mutations

    Broadcasts using a distributed queue, like Kafka or RabbitMQ

  • Indexing

    # sources for mysql binary logs
    sources:
      - name    : airslave
        host    : localhost
        port    : 11
        user    : spinaltap
        password: spinaltap
      - name    : calendar_db
        host    : localhost
        port    : 11
        user    : spinaltap
        password: spinaltap

    destinations:
      - name      : kafka
        clazzName : com.airbnb.spinaltap.destination.kafka.KafkaDestination

    pipes:
      - name        : search
        sources     : ["airslave", "calendar_db"]
        tables      : ["production:listings", "calendar_db:schedule2s"]
        destination : kafka

    SpinalTap Pipes ____________________________

    Each pipe connects one or more binlog sources (MySQL) with a destination (e.g. Kafka)

    Configured via YAML files

  • Indexing

    {
      "seq": 3,
      "binlogpos": "mysql-bin.000002:5217:5273",
      "id": -1857589909002862756,
      "type": 2,
      "table": {
        "id": 70,
        "name": "users",
        "db": "my_db",
        "columns": [
          { "name": "name", "type": 15, "ispk": false },
          { "name": "age",  "type": 2,  "ispk": false }
        ]
      },
      "rows": [
        {
          "1": { "name": "eric", "age": 31 },
          "2": { "name": "eric", "age": 28 }
        }
      ]
    }

    SpinalTap Mutations ____________________________

    Each binlog entry is parsed and converted into one of three event types: Insert, Delete or Update

    Insert and Delete carry the entire row to be inserted or deleted

    Update mutations contain both the old and the current row

    Additional information: unique id, sequence number, column and table metadata
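    The mutation shape described above can be sketched in plain Java. This is a minimal sketch, not the actual SpinalTap classes: the names `Mutation`, `insert`, and `update` are assumptions, and maps stand in for typed row objects.

    ```java
    import java.util.Map;

    public class MutationSketch {
        enum Type { INSERT, DELETE, UPDATE }

        // One binlog entry: unique id, sequence number, table name, and row data.
        // Insert/Delete carry one row; Update carries both the old and new row.
        record Mutation(long id, long seq, Type type, String table,
                        Map<String, Object> oldRow, Map<String, Object> newRow) {

            static Mutation insert(long id, long seq, String table,
                                   Map<String, Object> row) {
                return new Mutation(id, seq, Type.INSERT, table, null, row);
            }

            static Mutation update(long id, long seq, String table,
                                   Map<String, Object> oldRow,
                                   Map<String, Object> newRow) {
                return new Mutation(id, seq, Type.UPDATE, table, oldRow, newRow);
            }
        }

        public static void main(String[] args) {
            // Mirrors the JSON example: an update to the "users" table.
            Mutation m = Mutation.update(-1857589909002862756L, 3, "users",
                    Map.of("name", "eric", "age", 31),
                    Map.of("name", "eric", "age", 28));
            System.out.println(m.type() + " on " + m.table()
                    + ": age " + m.oldRow().get("age") + " -> " + m.newRow().get("age"));
        }
    }
    ```

    Carrying both rows in an Update lets downstream consumers decide whether an affected index field actually changed, without re-reading MySQL.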

  • Indexing

    Medusa ____________________________

    Documents in index contain data from ~15 different source tables

    Lucene needs a copy of all fields (not just fields that changed) to update the index

    We also need a mechanism to build the entire index from scratch, without putting too much strain on MySQL

  • Indexing

    Reads from SpinalTap or directly from MySQL

    Data from multiple tables is joined into Thrift objects, which correspond to Lucene documents

    The intermediate Thrift objects are persisted in Redis

    As changes are detected, updated objects are pushed to the Search instances to update Lucene indexes

    Can bootstrap the entire index in 3 minutes via multithreaded streaming

    Leader election via ZooKeeper

    [Diagram: Medusa -> Persistent Storage -> Search1, Search2, ... SearchN]
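    The join step can be sketched as follows. This is an illustrative sketch only: the class and field names are assumptions, and plain maps stand in for the Thrift objects Medusa persists in Redis and pushes to the Search instances.

    ```java
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MedusaJoinSketch {
        // Merge per-table rows for one listing into a single flat document,
        // the shape a Lucene document (and its Thrift counterpart) needs:
        // every field present, not just the ones that changed.
        static Map<String, Object> joinListing(long listingId,
                                               List<Map<String, Object>> tableRows) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("listing_id", listingId);
            for (Map<String, Object> row : tableRows) {
                doc.putAll(row); // later tables win on field-name collisions
            }
            return doc;
        }

        public static void main(String[] args) {
            Map<String, Object> doc = joinListing(42L, List.of(
                    Map.of("room_type", "entire_home", "max_occupancy", 4),
                    Map.of("calendar_days_available", 120)));
            System.out.println(doc.get("room_type") + " / "
                    + doc.get("calendar_days_available"));
        }
    }
    ```

    Persisting the joined object means a single-table change only requires re-reading that table's row and re-merging, rather than re-joining all ~15 sources from MySQL.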

  • Ranking

    Ranking Problem ____________________________

    Not a text search problem

    Users are almost never searching for a specific item; rather, they're looking to discover

    The most common component of a query is location

    Highly personalized: the user is part of the query

    Optimizing for conversion (Search -> Inquiry -> Booking)

    Evolution through continuous experimentation

  • Ranking

    Ranking Components ____________________________

    Relevance

    Quality

    Bookability

    Personalization

    Desirability of location

    New host promotion

    etc.

  • Ranking

    Several hundred signals determining search ranking:

    Properties of the listing (reviews, location, etc.)

    Behavioral signals (mined from request logs)

    Image quality and clickability (computer vision)

    Host behavior (response time/rate, cancellations, etc.)

    Host preferences model

    DB snapshots Logs

  • Ranking

    public void attemptLoadData() {
      DateTime remoteTs = dataLoader.getModTime(pathToSignals);

      if (currentTs == null || remoteTs.isAfter(currentTs)) {
        Map newSignals = loadData();
        if (newSignals != null && (signalsMap == null || isHealthy(newSignals))) {
          synchronized (this) {
            signalsMap = newSignals;
            currentTs = remoteTs;
            this.notifyAll();
          }
        } else {
          LOG.severe("Failed to load the avro file: " + pathToSignals);
        }
      }
    }

    ThreadedLoader qualitySignalsLoader = loaders.get(LoaderCollection.Loader.QualitySignals);
    final QualitySignalsAvro qs = qualitySignalsLoader.get(hostingId, true);

    Loading Signals ____________________________

    Storing signals in a separate data structure

    Pros:

    Good fit for this type of update pattern: not real-time, but almost everything changes on each load

    No need for costly Lucene index rebuild

    Greatly simplifies design

    Cons:

    Unable to use Lucene retrieval on such data
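    The swap pattern above can be reduced to a minimal sketch. The names (`SignalStore`, `attemptSwap`, `qualityOf`) are assumptions, and the health check is reduced to a non-empty test; the point is the atomic snapshot swap that avoids a Lucene rebuild.

    ```java
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicReference;

    public class SignalStore {
        // Search threads always read one consistent snapshot of the signals.
        private final AtomicReference<Map<Long, Double>> signals =
                new AtomicReference<>(Map.of());

        // Accept a freshly loaded snapshot only if it passes a sanity check;
        // on failure, keep serving the previous snapshot.
        boolean attemptSwap(Map<Long, Double> fresh) {
            if (fresh == null || fresh.isEmpty()) {
                return false;
            }
            signals.set(fresh);
            return true;
        }

        double qualityOf(long listingId) {
            return signals.get().getOrDefault(listingId, 0.0);
        }

        public static void main(String[] args) {
            SignalStore store = new SignalStore();
            store.attemptSwap(Map.of(42L, 0.9));
            store.attemptSwap(Map.of()); // unhealthy load: rejected
            System.out.println(store.qualityOf(42L));
        }
    }
    ```

    Because almost every signal changes on each load, replacing the whole structure at once is cheaper and simpler than updating Lucene documents field by field.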

  • Life of a Query

    Query Understanding (Geocoding, configuring retrieval options, choosing ranking models: Quality, Bookability, Relevance)

    -> Retrieval (2000 results)

    -> Populator / Scorer

    -> External Calls (Pricing Service, Social Connections)

    -> Third Pass Ranking (Filtering and Reranking: 2000 -> 25 results)

    -> Result Generation (25 results)

    AirEvents Logging runs alongside the pipeline.
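    The funnel can be sketched as a chain of stages. Stage names follow the slide, but the bodies are placeholders (assumed, not Airbnb's code): each stage narrows or enriches the candidate list, from ~2000 retrieved listings down to the 25 returned.

    ```java
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.LongStream;

    public class QueryPipelineSketch {
        // Retrieval: produce the candidate listing ids (placeholder data).
        static List<Long> retrieve() {
            return LongStream.range(0, 2000).boxed().collect(Collectors.toList());
        }

        // Populator / Scorer stand-in: real code attaches signals and
        // sorts by model score; here the order is left unchanged.
        static List<Long> scoreAndRank(List<Long> ids) {
            return ids;
        }

        // Third pass: filter and rerank, keeping the final page of results.
        static List<Long> thirdPass(List<Long> ids) {
            return ids.subList(0, Math.min(25, ids.size()));
        }

        public static void main(String[] args) {
            List<Long> results = thirdPass(scoreAndRank(retrieve()));
            System.out.println(results.size());
        }
    }
    ```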

  • Ranking

    Second Pass Ranking ____________________________

    Traditional ranking computes a score for each result independently, then sorts by that score.

    In contrast, second pass ranking operates on the entire result list at once.

    This makes it possible to implement features like result diversity.



  • Outside of the scope of this talk ____________________________

    Ranking models

    Machine Learning infrastructure

    Tools (loadtest, deploy, etc.)

    Other Search Infrastructure services: UserProfiler, Pricing, Social, Hoods, etc.