Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts...
![Page 1: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/1.jpg)
April 17th 2019
Jin Hyuk Chang | @jinhyukchang | Engineer, Lyft
Tao Feng | @feng-tao | Engineer, Lyft
Amundsen: A Data Discovery Platform from Lyft
![Page 2: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/2.jpg)
Agenda
• Data at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
• Demo
• Architecture
• Summary
![Page 3: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/3.jpg)
Data platform users
Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, Product Managers
![Page 4: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/4.jpg)
Core Infra high level architecture
Custom apps
![Page 5: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/5.jpg)
Data Discovery
![Page 6: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/6.jpg)
Hi! I am a n00b Data Scientist!
• My first project is to analyze and predict Data Council attendance
• Where is the data?
• What does it mean?
![Page 7: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/7.jpg)
Status quo
• Option 1: Phone a friend!
• Option 2: GitHub search
![Page 8: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/8.jpg)
Understand the context
• What does this field mean?
‒ Does attendance data include employees?
‒ Does it include revenue?
• Let me dig in and understand
![Page 9: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/9.jpg)
Explore
SELECT
*
FROM
default.my_table
WHERE ds = '2018-01-01'
LIMIT 100;
![Page 10: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/10.jpg)
Exploring with SELECT * is EVIL
1. Lack of productivity for data scientists
2. Increased load on the databases
![Page 11: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/11.jpg)
Data Scientists spend up to 1/3 of their time on data discovery...
• Data discovery
‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
![Page 12: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/12.jpg)
Audience for data discovery
![Page 13: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/13.jpg)
Data Discovery - User personas
Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, Product Managers
![Page 14: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/14.jpg)
3 Data Scientist personas
Power user
● All info in their head
● Get interrupted a lot due to questions
Noob user
● Lost
● Ask “power users” a lot of questions
Manager
● Dependencies landing on time
● Communicating with stakeholders
![Page 15: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/15.jpg)
Data Discovery answers 3 kinds of questions
Search based
● Where is the table/dashboard for X? What does it contain?
● Does this analysis already exist?
Lineage based
● I am changing a data model, who are the owner and most common users?
● This table’s delivery was delayed today, I want to notify everyone downstream.
Network based
● I want to follow a power user in my team.
● I want to bookmark tables of interest and get a feed of data delay, schema change, incidents.
![Page 16: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/16.jpg)
Meet Amundsen
First person to reach the South Pole: Norwegian explorer Roald Amundsen
![Page 17: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/17.jpg)
Landing page optimized for search
![Page 18: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/18.jpg)
Search results ranked on relevance and query activity
![Page 19: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/19.jpg)
How does search work?
![Page 20: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/20.jpg)
Relevance - search for “apple” on Google
Low relevance High relevance
![Page 21: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/21.jpg)
Popularity - search for “apple” on Google
Low popularity High popularity
![Page 22: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/22.jpg)
Striking the balance
Relevance
● Names, Descriptions, Tags, [owners, frequent users]
Popularity
● Querying activity
● Dashboarding
● Different weights for automated vs ad hoc querying
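One way to picture the balance between the two signals is a toy ranking function. This is only a sketch under stated assumptions: the weights, the log damping, the formula `rel * (1 + pop)`, and every name below are hypothetical and are not taken from Amundsen.

```python
import math

# Hypothetical weights: the slides only say that automated and ad hoc
# querying are weighted differently, not what the weights are.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1


def popularity(adhoc_queries, automated_queries):
    """Weighted query count, log-damped so very hot tables do not dominate."""
    weighted = ADHOC_WEIGHT * adhoc_queries + AUTOMATED_WEIGHT * automated_queries
    return math.log1p(weighted)


def relevance(query, table):
    """Toy text match: a name hit counts more than a description or tag hit."""
    q = query.lower()
    score = 0.0
    if q in table["name"].lower():
        score += 3.0
    if q in table["description"].lower():
        score += 1.0
    if any(q in tag.lower() for tag in table["tags"]):
        score += 1.0
    return score


def search(query, tables):
    """Return matching table names, best first; non-matching tables are dropped."""
    scored = []
    for table in tables:
        rel = relevance(query, table)
        if rel == 0.0:
            continue
        pop = popularity(table["adhoc"], table["automated"])
        scored.append((rel * (1.0 + pop), table["name"]))
    return [name for _, name in sorted(scored, reverse=True)]


TABLES = [
    {"name": "rides_raw", "description": "raw ride events", "tags": [],
     "adhoc": 2, "automated": 5000},
    {"name": "rides", "description": "cleaned rides", "tags": ["core"],
     "adhoc": 400, "automated": 50},
    {"name": "drivers", "description": "driver profiles", "tags": [],
     "adhoc": 900, "automated": 10},
]

results = search("rides", TABLES)
```

With these made-up numbers, the table that people query by hand ranks above the one that is mostly read by automated jobs, and the popular but irrelevant table does not appear at all.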
![Page 23: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/23.jpg)
Back to mocks...
![Page 24: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/24.jpg)
Search results ranked on relevance and query activity
![Page 25: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/25.jpg)
Detailed description and metadata about data resources
![Page 26: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/26.jpg)
Data Preview within the tool
![Page 27: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/27.jpg)
Computed stats about column metadata
Disclaimer: these stats are arbitrary.
![Page 28: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/28.jpg)
Built-in user feedback
![Page 29: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/29.jpg)
Demo
![Page 30: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/30.jpg)
Open source in mind
• Pluggable code for each microservice via Python entry points, etc.
• Pluggable API endpoint via Blueprint
• Build your ingestion pipeline like a Lego brick
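The "Lego brick" idea can be sketched as small stages that share one record shape and snap together in any order. This is a generic illustration, not Databuilder's actual API; the stage names and the record fields are hypothetical.

```python
def extract_tables():
    """Hypothetical extractor; a real one would read a metastore or database."""
    yield {"schema": "default", "table": "My_Table"}
    yield {"schema": "default", "table": "Events"}


def lowercase_names(records):
    """Transformer brick: normalize table names."""
    for record in records:
        yield {**record, "table": record["table"].lower()}


def add_key(records):
    """Transformer brick: attach a unique key built from schema and table."""
    for record in records:
        yield {**record, "key": f"{record['schema']}.{record['table']}"}


def run_pipeline(extractor, transformers, loader):
    """Chain the bricks: extractor -> each transformer in order -> loader."""
    records = extractor()
    for transform in transformers:
        records = transform(records)
    for record in records:
        loader(record)


# A list stands in for the metadata store in this sketch.
loaded = []
run_pipeline(extract_tables, [lowercase_names, add_key], loaded.append)
```

Swapping a brick (a different extractor, one more transformer, another loader) does not require touching the others, which is the point of the analogy.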
![Page 31: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/31.jpg)
Amundsen’s architecture
![Page 32: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/32.jpg)
[Architecture diagram, by layer]
Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Github Source File
Databuilder Crawler
Neo4j, Elasticsearch
Metadata Service, Search Service
Frontend Service
Other Microservices: ML Feature Service, Security Service
![Page 33: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/33.jpg)
1. Frontend Service
![Page 34: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/34.jpg)
![Page 35: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/35.jpg)
Amundsen table detail page
![Page 36: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/36.jpg)
2. Metadata Service
![Page 37: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/37.jpg)
![Page 38: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/38.jpg)
2. Metadata Service
• A thin proxy layer to interact with the graph database
‒ Currently Neo4j is the default option for the graph backend engine
‒ Working with the community to support Apache Atlas
• Supports a REST API for other services pushing / pulling metadata directly
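The "thin proxy layer" can be pictured as an interface that the service talks to, with the graph backend hidden behind it. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, a dictionary stands in for the graph database, and plain methods stand in for REST handlers.

```python
from abc import ABC, abstractmethod


class MetadataProxy(ABC):
    """Backend-agnostic interface; a Neo4j or Atlas proxy would implement it."""

    @abstractmethod
    def get_description(self, table_key):
        """Return the description for table_key, or None if unknown."""

    @abstractmethod
    def put_description(self, table_key, description):
        """Create or overwrite the description for table_key."""


class InMemoryProxy(MetadataProxy):
    """Stand-in backend for this sketch: a dict instead of a graph database."""

    def __init__(self):
        self._descriptions = {}

    def get_description(self, table_key):
        return self._descriptions.get(table_key)

    def put_description(self, table_key, description):
        self._descriptions[table_key] = description


class MetadataService:
    """Handlers return (status, body) pairs, mimicking REST responses."""

    def __init__(self, proxy):
        self._proxy = proxy

    def handle_get(self, table_key):
        description = self._proxy.get_description(table_key)
        if description is None:
            return 404, {"message": "table not found"}
        return 200, {"key": table_key, "description": description}

    def handle_put(self, table_key, description):
        self._proxy.put_description(table_key, description)
        return 200, {"key": table_key}


service = MetadataService(InMemoryProxy())
```

Because the service only depends on `MetadataProxy`, supporting a second backend means adding one more subclass rather than changing the handlers.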
![Page 39: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/39.jpg)
Trade-off #1: Why choose a graph database?
![Page 40: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/40.jpg)
Why Graph database?
![Page 41: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/41.jpg)
Why Graph database?
![Page 42: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/42.jpg)
Trade-off #2: Why not propagate the metadata back to source?
![Page 43: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/43.jpg)
Why not propagate the metadata back to source
![Page 44: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/44.jpg)
Why not propagate the metadata back to source
![Page 45: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/45.jpg)
Why not propagate the metadata back to source
![Page 46: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/46.jpg)
3. Search Service
![Page 47: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/47.jpg)
![Page 48: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/48.jpg)
3. Search Service
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as the search backend.
• Supports different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then relevancy
‒ Wildcard Search
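The three search patterns can be illustrated with a toy in-memory dispatcher. This sketch does not use Elasticsearch and does no relevancy ranking; the query syntax (`category: term`, `*` for wildcards) follows the examples in the slides, while the record layout and function names are hypothetical.

```python
from fnmatch import fnmatchcase

RECORDS = [
    {"table": "event_rides", "columns": ["ride_id", "is_line_ride"]},
    {"table": "event_payments", "columns": ["payment_id", "amount"]},
    {"table": "drivers", "columns": ["driver_id", "event_count"]},
]


def normal_search(term, records):
    """Match the term against table names and column names alike."""
    return [r["table"] for r in records
            if term in r["table"] or any(term in c for c in r["columns"])]


def category_search(category, term, records):
    """Restrict matching to one kind of field first, then match the term."""
    if category == "table":
        return [r["table"] for r in records if term in r["table"]]
    if category == "column":
        return [r["table"] for r in records if any(term in c for c in r["columns"])]
    return []


def wildcard_search(pattern, records):
    """Shell-style wildcard match on table names, e.g. event_*."""
    return [r["table"] for r in records if fnmatchcase(r["table"], pattern)]


def search(query, records=RECORDS):
    """Pick a search pattern from the shape of the query string."""
    if ":" in query:
        category, term = query.split(":", 1)
        return category_search(category.strip(), term.strip(), records)
    if "*" in query:
        return wildcard_search(query, records)
    return normal_search(query, records)
```

For example, `search("event")` matches all three records (the last one through its `event_count` column), while `search("event_*")` only matches the two tables whose names start with `event_`.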
![Page 49: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/49.jpg)
Challenge #1: How to make the search result more relevant?
![Page 50: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/50.jpg)
How to make the search result more relevant?
• Define a search quality metric
‒ Click-Through-Rate (CTR) over top 5 results
• Search behaviour instrumentation is key
• A couple of improvements:
‒ Boost the exact table ranking
‒ Support wildcard search (e.g. event_*)
‒ Support category search (e.g. column: is_line_ride)
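A click-through rate over the top 5 results could be computed from instrumented search events roughly as follows. The slides do not define the metric precisely, so this is one plausible reading: the share of searches in which the user clicked a result ranked 5 or better. The event format is hypothetical.

```python
def ctr_at_k(search_events, k=5):
    """Fraction of searches with a click on a result ranked within the top k.

    Each event is a dict with "clicked_rank": the 1-based rank of the clicked
    result, or None when the user clicked nothing.
    """
    if not search_events:
        return 0.0
    hits = sum(
        1 for event in search_events
        if event["clicked_rank"] is not None and event["clicked_rank"] <= k
    )
    return hits / len(search_events)


events = [
    {"query": "rides", "clicked_rank": 1},
    {"query": "event_*", "clicked_rank": 3},
    {"query": "payments", "clicked_rank": None},  # abandoned search
    {"query": "drivers", "clicked_rank": 7},      # click outside the top 5
]

ctr = ctr_at_k(events)
```

Tracking this number before and after a ranking change gives a simple way to tell whether the change helped.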
![Page 51: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/51.jpg)
4. Databuilder
![Page 52: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/52.jpg)
![Page 53: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/53.jpg)
Challenge #1: Various forms of metadata
![Page 54: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/54.jpg)
Metadata Sources @ Lyft
![Page 55: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/55.jpg)
Metadata - Challenges
• No Standardization: No single data model that fits all data resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Different Extraction: Each data set's metadata is stored and fetched differently
‒ Hive Table: Stored in Hive metastore
‒ RDBMS (Postgres, etc.): Fetched through DBAPI interface
‒ GitHub source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
‒ …
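One common way to cope with both challenges is to give every source its own extractor while having all of them emit the same record type. The sketch below assumes a deliberately small common model; the class names, the key formats, and the fake source payloads are all hypothetical and do not reflect Amundsen's real data model.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class ResourceMetadata:
    """Minimal shared shape for any data resource in this sketch."""
    resource_type: str  # e.g. "table" or "dashboard"
    key: str
    description: str


class Extractor(Protocol):
    def extract(self) -> Iterator[ResourceMetadata]: ...


class FakeMetastoreExtractor:
    """Stands in for a source that exposes (database, table, comment) rows."""

    def __init__(self, rows):
        self._rows = rows

    def extract(self):
        for database, table, comment in self._rows:
            yield ResourceMetadata("table", f"hive://{database}/{table}", comment)


class FakeDashboardExtractor:
    """Stands in for a source that returns JSON-like dicts from an API."""

    def __init__(self, payload):
        self._payload = payload

    def extract(self):
        for item in self._payload:
            yield ResourceMetadata("dashboard", f"dashboard://{item['id']}", item["title"])


def collect(extractors):
    """Run every extractor and merge the results into one uniform list."""
    return [record for extractor in extractors for record in extractor.extract()]


records = collect([
    FakeMetastoreExtractor([("default", "rides", "one row per ride"),
                            ("default", "drivers", "driver profiles")]),
    FakeDashboardExtractor([{"id": "42", "title": "Weekly rides"}]),
])
```

Downstream code only ever sees `ResourceMetadata`, so adding a new source means writing one more extractor.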
![Page 56: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/56.jpg)
Challenge #2: Pull model vs Push model
![Page 57: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/57.jpg)
Pull model vs. Push model
Pull Model
● Periodically update the index by pulling from the system (e.g. database) via crawlers.
[Diagram: Scheduler, Crawler, Database, Data graph]
Push Model
● The system (e.g. database) pushes metadata to a message bus that downstream consumers subscribe to.
[Diagram: Database, Message queue, Data graph]
![Page 58: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/58.jpg)
Pull model vs. push model
Pull Model
● Onus of integration lies on data graph
● No interface to prescribe, hard to maintain crawlers
Push Model
● Onus of integration lies on database
● Message format serves as the interface
● Allows for near-real time indexing
![Page 59: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/59.jpg)
Pull model vs. push model
Pull Model
● Onus of integration lies on data graph
● No interface to prescribe, hard to maintain crawlers
Preferred if
● Waiting for indexing is ok
● Working with “strapped” teams
● There’s already an interface
Push Model
● Onus of integration lies on database
● Message format serves as the interface
● Allows for near-real time indexing
Preferred if
● Near-real time indexing is important
● Clean interface doesn’t exist
● Other tools like Wherehows are moving towards Push Model
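The difference between the two models can be sketched with a toy database and a dictionary standing in for the data graph. Everything here is hypothetical: a real pull setup would run the crawler from a scheduler, and a real push setup would use a message bus rather than an in-process `queue.Queue`.

```python
import queue


class Database:
    """Toy source system; publishes change messages when given a bus."""

    def __init__(self, bus=None):
        self.tables = {}
        self.bus = bus

    def create_table(self, name, columns):
        self.tables[name] = columns
        if self.bus is not None:
            # Push model: the message format is the interface.
            self.bus.put({"table": name, "columns": columns})


def crawl(database, data_graph):
    """Pull model: one crawler run; a scheduler would call this periodically."""
    data_graph.update(database.tables)


def consume(bus, data_graph):
    """Push model: apply every pending message to the data graph."""
    while True:
        try:
            message = bus.get_nowait()
        except queue.Empty:
            break
        data_graph[message["table"]] = message["columns"]


# Pull: the graph stays stale until the next crawl.
pull_db = Database()
pull_graph = {}
pull_db.create_table("rides", ["ride_id"])
stale_before_crawl = "rides" not in pull_graph
crawl(pull_db, pull_graph)

# Push: the change is available as soon as the message is consumed.
bus = queue.Queue()
push_db = Database(bus=bus)
push_graph = {}
push_db.create_table("rides", ["ride_id"])
consume(bus, push_graph)
```

Both graphs end up with the same content. What differs is who does the integration work (the crawler or the database) and how long the graph lags behind the source.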
![Page 60: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/60.jpg)
4. Databuilder
![Page 61: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/61.jpg)
Databuilder in action
![Page 62: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/62.jpg)
How are we building data? Databuilder
![Page 63: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/63.jpg)
How is databuilder orchestrated?
Amundsen uses Apache Airflow to orchestrate Databuilder jobs
![Page 64: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/64.jpg)
What’s next?
![Page 65: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/65.jpg)
Amundsen seems to be more useful than we thought
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Customer Service!
• Many organizations have similar problems
‒ Collaborating with ING, WeWork and more
‒ We plan to announce open source soon
![Page 66: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/66.jpg)
Impact - Amundsen at Lyft
[Chart annotations: Alpha release, Beta release (internal), Generally Available (GA) release]
![Page 67: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/67.jpg)
Summary
![Page 68: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/68.jpg)
Adding more kinds of data resources
Resources: People, Dashboards, Data sets, Streams, Schemas, Workflows
Phases: Phase 1 (Complete), Phase 2 (In development), Phase 3 (In Scoping)
![Page 69: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/69.jpg)
Summary
• Data discovery makes Data Scientists 30+% more productive
• Metadata is key to the next wave of big data applications
• Amundsen - Lyft’s metadata and data discovery platform
• Blog post with more details: go.lyft.com/datadiscoveryblog
![Page 70: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/70.jpg)
Jin Hyuk Chang | @jinhyukchang
Tao Feng | @feng-tao
Slides at go.lyft.com/amundsen_datacouncil_2019
Blog post at go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/
![Page 71: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed36c89f15ef3476a729bbd/html5/thumbnails/71.jpg)
Backup