Democratizing Data at Airbnb
CHRIS WILLIAMS / JOHN BODLEY / MAY 11, 2017
Airbnb connects people to unique travel experiences
The problem
tribal knowledge |ˈtrībəl ˈnäləj | noun
Tribal knowledge is any unwritten information that is not commonly known by others within a company
Relying on tribal knowledge stifles productivity
As Airbnb grows so do the challenges around the volume, complexity, and obscurity of data
In a large and complex organization, with a sea of data resources, users struggle to find the right data
Data is often siloed, inaccessible, or lacks context
I’m a recovering Data Scientist who wants to democratize data, automate common workflows, surface relevant information, and provide context
200k tables in our Hive data warehouse
> 10,000 Superset charts and dashboards
> 6,000 Experiments and metrics
> 6,000 Tableau workbooks and charts
> 1,500 Knowledge posts
Data resources
Beyond the data warehouse
With many more data sources and data types to love
and most importantly…
> 3,500 Airbnb employees
Portland
San Francisco
Los Angeles
Toronto
New York
Miami
Sao Paulo
Dublin
London
Paris
Barcelona
Berlin
Milan
Copenhagen
New Delhi
Seoul
Beijing
Tokyo
Sydney
Singapore
Washington, DC
> 20 offices around the world
The mandate
To democratize data and empower Airbnb employees to be data-informed by aiding with data exploration, discovery, and trust
The concept
Search…
It should be fairly evident what we feed into the search indices
But are we missing something?
The relevancy of relationships
Nodes and relationships have equal standing
[Figure: "created" and "consumed" relationships attached to a node named "Spoke 3"]
The graph
[Figure: a graph whose edges are labeled created, associated, and consumed]
The construction
6 databases
4 APIs
1 Airflow DAG
We leverage all these data resources to build a graph in Hive comprising nodes and relationships
The workflow runs every day, though the graph is left to soak to prevent flickering
Addressing graph flickering
The issue is that certain types of relationships are sporadic in nature, which causes the graph to flicker
Persistent vs. transient relationships
Persistent relationships represent a snapshot in time
[Figure: a "created" relationship on the node "Spoke 3"]
Transient relationships represent events which are somewhat sporadic in nature
[Figure: a "consumed" relationship on the node "Spoke 3", shown across M Tu W Th F]
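One way to picture the soak: a transient relationship such as consumed stays in the graph if it was observed on any day within a trailing window, so a single quiet day does not remove the edge. The sketch below is a minimal illustration under that assumption. The window length, the `Observation` tuple, and `soak_relationships` are hypothetical names chosen for the example, not the Dataportal implementation.

```python
from datetime import date, timedelta
from typing import Iterable, NamedTuple, Set, Tuple


class Observation(NamedTuple):
    """One sighting of a relationship on a given day (hypothetical shape)."""
    source: str
    rel_type: str
    target: str
    day: date


def soak_relationships(
    observations: Iterable[Observation],
    today: date,
    window_days: int = 7,  # assumed soak period, not taken from the talk
) -> Set[Tuple[str, str, str]]:
    """Keep every relationship seen at least once in the trailing window."""
    cutoff = today - timedelta(days=window_days)
    return {
        (o.source, o.rel_type, o.target)
        for o in observations
        if cutoff < o.day <= today
    }


observations = [
    Observation("jane_doe", "consumed", "dim_users", date(2017, 5, 8)),   # Monday
    Observation("jane_doe", "consumed", "dim_users", date(2017, 5, 10)),  # Wednesday
    Observation("jane_doe", "consumed", "12345", date(2017, 4, 1)),       # stale
]
# On Friday the edge seen on Monday and Wednesday is still present,
# while the edge last seen in April has aged out.
edges = soak_relationships(observations, today=date(2017, 5, 12))
```

Persistent relationships such as created would bypass this filter, since they describe a snapshot rather than an event.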
The winding data path
Airflow: Data transfer
Python: Graph datastore
neo4j-driver: Python Neo4j driver
Neo4j: Graph database
GraphAware: Neo4j/Elasticsearch plugin
Elasticsearch: Search engine
Flask: Python web framework
Hive: Data warehouse
Why we chose Neo4j for our database
The four main reasons
Logical: Given that our data is represented as a graph, it is logical to store it in a graph database
Nimble: Performance wins over relational databases when dealing with connected data
Popular: It is the world’s leading graph database, and the community edition is free
Integrative: It integrates well with Python and Elasticsearch
The Neo4j and Elasticsearch symbiotic relationship
Courtesy of two GraphAware plugins
Neo4j plugin: Provides bi-directional integration which transparently and asynchronously replicates data from Neo4j to Elasticsearch
Elasticsearch plugin: Enables Elasticsearch to consult the Neo4j database during a search query and enrich the search rankings by leveraging the graph topology
Node label hierarchy
:Entity
  :Org
    :Group
    :User
  :Tableau
    :Workbook
    :Chart
  :Hive
    :Schema
    :Table
(:Entity:Org:User {id: 'jane_doe'})
(:Entity:Hive:Table {id: 'dim_users'})
(:Entity:Tableau:Chart {id: '12345'})
MATCH (n:Entity:Org:User {id: '<id>'}) USING INDEX n:User(id) RETURN n
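The lookup above can be generated from a node's label path. The following is a minimal sketch, assuming the Cypher `$id` parameter syntax and that the index hint always targets the most specific (last) label. `build_node_query` is a hypothetical helper written for illustration, not part of the Dataportal code.

```python
from typing import Dict, Sequence, Tuple


def build_node_query(labels: Sequence[str], node_id: str) -> Tuple[str, Dict[str, str]]:
    """Build a parameterized lookup for a node given its hierarchical labels.

    `labels` is the path below :Entity, e.g. ("Org", "User"). The index hint
    targets the last label, mirroring the query shown above.
    """
    if not labels:
        raise ValueError("at least one label is required")
    for label in labels:
        # Labels cannot be passed as Cypher parameters, so validate them
        # before interpolating them into the query text.
        if not label.isidentifier():
            raise ValueError(f"invalid label: {label!r}")
    label_path = ":".join(["Entity", *labels])
    query = (
        f"MATCH (n:{label_path} {{id: $id}}) "
        f"USING INDEX n:{labels[-1]}(id) RETURN n"
    )
    return query, {"id": node_id}


query, params = build_node_query(("Org", "User"), "jane_doe")
```

Passing the ID as a parameter rather than splicing it into the query text keeps the query string identical across lookups of the same node type.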
From local to global uniqueness
A mechanism to reference nodes in an abstract manner
GraphAware UUID plugin: Transparently assigns a globally unique UUID property to newly created elements (nodes and relationships); the property cannot be changed or deleted
Globally unique: Lets us identify any single node via the Entity label and UUID property alone, which allows parameterized queries and, in turn, faster query planning and execution
MATCH (n:Entity {uuid: '<uuid>'}) USING INDEX n:Entity(uuid) RETURN n
/api/graph/nodes/org/user/<id>
/api/graph/nodes/<uuid>
/api/graph/relationships/<uuid>/created/<uuid>
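To illustrate how these three URL shapes could map onto the queries shown earlier, here is a dependency-free sketch. It assumes that the relationship route reads as source UUID, relationship type, target UUID, and that lowercase path segments map to capitalized labels. `ROUTES` and `resolve` are hypothetical names. In a Flask application the patterns would normally be route decorators rather than hand-written regular expressions.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical route table covering the three URL shapes listed above.
ROUTES = [
    ("node_by_id", re.compile(
        r"^/api/graph/nodes/(?P<group>[a-z]+)/(?P<kind>[a-z]+)/(?P<id>[^/]+)$")),
    ("node_by_uuid", re.compile(
        r"^/api/graph/nodes/(?P<uuid>[^/]+)$")),
    ("relationship", re.compile(
        r"^/api/graph/relationships/(?P<source>[^/]+)/(?P<type>[a-z_]+)/(?P<target>[^/]+)$")),
]


def resolve(path: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (cypher, parameters) for a known path, or None otherwise."""
    for name, pattern in ROUTES:
        match = pattern.match(path)
        if match is None:
            continue
        parts = match.groupdict()
        if name == "node_by_id":
            # Assumed mapping: "org/user" in the URL -> :Org:User labels.
            group, kind = parts["group"].capitalize(), parts["kind"].capitalize()
            return (
                f"MATCH (n:Entity:{group}:{kind} {{id: $id}}) "
                f"USING INDEX n:{kind}(id) RETURN n",
                {"id": parts["id"]},
            )
        if name == "node_by_uuid":
            return (
                "MATCH (n:Entity {uuid: $uuid}) USING INDEX n:Entity(uuid) RETURN n",
                {"uuid": parts["uuid"]},
            )
        # Relationship types cannot be parameterized; the regex restricts
        # them to lowercase letters and underscores before interpolation.
        return (
            f"MATCH (a:Entity {{uuid: $source}})-[r:{parts['type']}]->"
            "(b:Entity {uuid: $target}) RETURN r",
            {"source": parts["source"], "target": parts["target"]},
        )
    return None


by_id = resolve("/api/graph/nodes/org/user/jane_doe")
```

The UUID routes need no label information at all, which is the practical payoff of global uniqueness.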
The frontend
web app
Designing the interface and user experience of a data tool should not be an afterthought
User personas
Daphne Data: Technical data power user; the epitome of a tribal knowledge holder
Manager Mel: Less data literate; needs to keep tabs on her team’s resources
Nathan New: New employee, new team, or new to data; has no idea what’s going on
Designing for data exploration, discovery, and trust
Search, Resource details & metadata, User data, Group data, Company data
Google-esque search filters
Resource details & metadata
Context, context, & context
Surface relationships, everything’s a link to promote exploration
Metadata & consumption
Description, external link, social
Column details & value distributions
Table lineage
Enrich metadata on the fly
User details & metadata
What they make, what they consume
Former employees also hold tribal knowledge
Group overview
Thumbnails for maximum context
Basic organization functionality
Pinterest-like curation & suggested content
We gather over 15,000 thumbnails from Tableau, Superset, and the Knowledge Repo
Pinning flow from resource page
Edit mode / draggable grid
Employees can feel disconnected from company-level metrics
The technology stack
Application + dependencies
DOM
Testing: eslint, enzyme, mocha, chai
Application state
Styling: khan/aphrodite
The challenges
Proxy nodes: Abstracting complexity where necessary while accurately modeling the data ecosystem
Graph merging: Non-trivial Git-like merging of graph updates
Data-dense design: Balancing simplicity and functionality is hard; most internal design resources are not made for data-rich apps
Complex dependencies: An umbrella data tool is vulnerable to changes in upstream resource dependencies
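As a rough illustration of what "Git-like" merging involves, the sketch below computes the difference between the graph currently loaded and a freshly built snapshot, restricted to nodes. It is an assumption about the general shape of the problem, not the team's algorithm. `Snapshot`, `GraphDiff`, and `diff_nodes` are hypothetical names, and a real merge would also have to handle relationships and the soak period.

```python
from typing import Any, Dict, NamedTuple, Set

# Hypothetical snapshot shape: node key -> property map.
Snapshot = Dict[str, Dict[str, Any]]


class GraphDiff(NamedTuple):
    added: Set[str]    # present only in the incoming snapshot
    removed: Set[str]  # present only in the current graph
    changed: Set[str]  # present in both, with different properties


def diff_nodes(current: Snapshot, incoming: Snapshot) -> GraphDiff:
    """Compare two node snapshots keyed by a stable node identifier."""
    current_keys, incoming_keys = set(current), set(incoming)
    return GraphDiff(
        added=incoming_keys - current_keys,
        removed=current_keys - incoming_keys,
        changed={
            key
            for key in current_keys & incoming_keys
            if current[key] != incoming[key]
        },
    )


current = {"jane_doe": {"team": "data"}, "dim_users": {"rows": 10}}
incoming = {"jane_doe": {"team": "infra"}, "12345": {"title": "Bookings"}}
diff = diff_nodes(current, incoming)
```

Applying only the difference, rather than rebuilding the database, is what keeps unchanged nodes and their UUIDs stable between runs.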
The future
Gamification: Provide content producers with a sense of value
Alerts & recommendations: Move from active exploration to delivering relevant updates and content suggestions
Certified content: Use certification to build trust and enable users to filter through a sea of stale content
Network analysis: Determine obsolete nodes, critical paths, lines of communication, etc.
The team
The Dataportal team
Analytics & Experimentation Products
John Bodley, Software Engineer
Eli Brumbaugh, Experience Designer
Jeff Feng, Product Manager
Michelle Thomas, Software Engineer
Chris Williams, Data Visualization
Thank you
Appendix
Dealing with mutual relationships
Naturally bidirectional relationships
[Figure: two nodes linked by an "associated" relationship]
Modeling both directions creates an unnecessary relationship
The most efficient solution is to use a single relationship in the many-to-one direction
CREATE TABLE nodes ( labels ARRAY<STRING>, id STRING, properties STRING )
{ labels: ['Org', 'User'], id: 'jane_doe' }
{ labels: ['Hive', 'Table'], id: 'dim_users' }
{ labels: ['Tableau', 'Chart'], id: '12345' }
CREATE TABLE relationships ( source STRUCT<labels:ARRAY<STRING>,id:STRING>, target STRUCT<labels:ARRAY<STRING>,id:STRING>, type STRING, properties STRING )
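Rows from these two tables can be translated into Cypher statements before being sent to Neo4j. The sketch below is one possible translation, assuming that `properties` holds a JSON-encoded map and that every node also receives the :Entity label. `node_pattern`, `node_statement`, and `relationship_statement` are hypothetical helpers, and the resulting statements would be run through the Python driver.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

Statement = Tuple[str, Dict[str, Any]]


def parse_properties(raw: Optional[str]) -> Dict[str, Any]:
    """Decode the properties column (assumed to be a JSON-encoded STRING)."""
    return json.loads(raw) if raw else {}


def node_pattern(labels: List[str], var: str) -> str:
    """Render e.g. (n:Entity:Org:User {id: $n_id}) from a labels array."""
    for label in labels:
        # Labels are interpolated, not parameterized, so validate them.
        if not label.isidentifier():
            raise ValueError(f"invalid label: {label!r}")
    label_path = ":".join(["Entity", *labels])
    return f"({var}:{label_path} {{id: ${var}_id}})"


def node_statement(row: Dict[str, Any]) -> Statement:
    """Turn one row of the nodes table into a MERGE plus its parameters."""
    pattern = node_pattern(row["labels"], "n")
    params = {"n_id": row["id"], "props": parse_properties(row["properties"])}
    return f"MERGE {pattern} SET n += $props", params


def relationship_statement(row: Dict[str, Any]) -> Statement:
    """Turn one row of the relationships table into a MATCH ... MERGE."""
    rel_type = row["type"]
    if not rel_type.isidentifier():
        raise ValueError(f"invalid relationship type: {rel_type!r}")
    source = node_pattern(row["source"]["labels"], "a")
    target = node_pattern(row["target"]["labels"], "b")
    query = f"MATCH {source}, {target} MERGE (a)-[r:{rel_type}]->(b) SET r += $props"
    params = {
        "a_id": row["source"]["id"],
        "b_id": row["target"]["id"],
        "props": parse_properties(row["properties"]),
    }
    return query, params


node_query, node_params = node_statement(
    {"labels": ["Org", "User"], "id": "jane_doe", "properties": '{"name": "Jane"}'}
)
rel_query, rel_params = relationship_statement({
    "source": {"labels": ["Org", "User"], "id": "jane_doe"},
    "target": {"labels": ["Hive", "Table"], "id": "dim_users"},
    "type": "created",
    "properties": None,
})
```

Using MERGE rather than CREATE makes each statement safe to replay when the daily workflow reruns.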
Efficient data retrieval
Restrictions and workarounds with Neo4j indexes
Problem: Indexes provide efficient data retrieval, similar to an RDBMS primary key; however, each index is defined for a single label only, as opposed to our tuple of hierarchical labels
Solution: Create an index for every label, keyed by the ID and UUID properties, which together with index hints provides optimal node retrieval
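The index-per-label workaround can be scripted. This sketch generates the statements using the `CREATE INDEX ON :Label(property)` form from Neo4j 3.x, the syntax current at the time of the talk; newer Neo4j versions use `CREATE INDEX FOR (n:Label) ON (n.property)` instead. The label list is copied from the node label hierarchy, and `index_statements` is a hypothetical helper.

```python
from typing import List, Sequence

# Labels taken from the node label hierarchy in this deck.
LABELS = [
    "Entity", "Org", "Group", "User",
    "Tableau", "Workbook", "Chart",
    "Hive", "Schema", "Table",
]
KEYS = ["id", "uuid"]


def index_statements(labels: Sequence[str], keys: Sequence[str]) -> List[str]:
    """Return one single-label, single-property index statement per pair."""
    return [
        f"CREATE INDEX ON :{label}({key})"
        for label in labels
        for key in keys
    ]


statements = index_statements(LABELS, KEYS)
```

With these in place, a hint such as `USING INDEX n:User(id)` tells the planner which of the node's several label indexes to start from.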