A Journey into Databricks' Pipelines: Journey and Lessons Learned
![Page 1: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/1.jpg)
Databricks’ Data Pipelines: Journey and Lessons Learned
Yu Peng, Burak Yavuz
07/06/2016
![Page 2: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/2.jpg)
Who Are We
Yu Peng
Data Engineer at Databricks
Building Databricks’ next-generation data pipeline on top of Apache Spark
BS in Xiamen University
Ph.D in The University of Hong Kong
Burak Yavuz
Software Engineer at Databricks
Contributor to Spark since Spark 1.1
Maintainer of Spark Packages
BS in Mechanical Engineering at Bogazici University
MS in Management Science & Engineering at Stanford University
![Page 3: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/3.jpg)
Building a data pipeline is hard
• At least once or exactly once semantics
• Fault tolerance
• Resource management
• Scalability
• Maintainability
![Page 4: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/4.jpg)
Apache® Spark™ + Databricks = Our Solution
• All ETL jobs are built on top of Apache Spark
  • Unified solution, everything in the same place
• All ETL jobs are run on Databricks platform
  • Platform for Data Engineers and Scientists
• Test out Spark and Databricks new features
Apache, Apache Spark and Spark are trademarks of the Apache Software Foundation
![Page 5: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/5.jpg)
Classic Lambda Data Pipeline
[Diagram: services 0 through x, each with a log collector; a centralized messaging system; Delta ETL and Batch ETL jobs; a storage system.]
![Page 6: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/6.jpg)
Databricks Data Pipeline Overview
[Diagram: customer deployments (CustomerDep 0, 1, 2, ...), the Databricks deployment (DatabricksDep), and Amazon Kinesis.]
![Page 7: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/7.jpg)
Databricks Data Pipeline Overview
[Diagram: customer deployments (CustomerDep 0, 1, 2, ...) and DatabricksDep, each made up of clusters (Cluster 0, 1, 2, ...) that run services alongside a log-daemon; Amazon Kinesis.]
![Page 8: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/8.jpg)
![Page 9: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/9.jpg)
Databricks Data Pipeline Overview
[Diagram: customer deployments and DatabricksDep clusters (services plus log-daemon), Amazon Kinesis, and a Databricks Deployment containing Databricks Filesystem and Databricks Jobs.]
![Page 10: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/10.jpg)
Databricks Data Pipeline Overview
[Diagram: customer deployments and DatabricksDep clusters (services plus log-daemon), Amazon Kinesis, a Databricks Deployment containing Databricks Filesystem and Databricks Jobs, and Real-time analysis.]
![Page 11: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/11.jpg)
Databricks Data Pipeline Overview
[Diagram: customer deployments and DatabricksDep clusters (services plus log-daemon), Amazon Kinesis, and a Databricks Deployment containing DBFS and Databricks Jobs; a Sync daemon and raw record batches (json).]
![Page 12: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/12.jpg)
Databricks Data Pipeline Overview
[Diagram: customer deployments and DatabricksDep clusters (services plus log-daemon), Amazon Kinesis, and a Databricks Deployment containing DBFS and Databricks Jobs; a Sync daemon and raw record batches (json); ETL jobs and tables (parquet).]
![Page 13: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/13.jpg)
Databricks Data Pipeline Overview
[Diagram: customer deployments and DatabricksDep clusters (services plus log-daemon), Amazon Kinesis, and a Databricks Deployment containing DBFS and Databricks Jobs; a Sync daemon and raw record batches (json); ETL jobs and tables (parquet); Data analysis and Real-time analysis.]
![Page 14: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/14.jpg)
Log collection (Log-daemon)
• Fault tolerance and at least once semantics
• Streaming
• Batch
• Spark History Server
• Multi-tenant and config driven
• Spark container
![Page 15: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/15.jpg)
Log Daemon Architecture
[Diagram: each service (Service 1, Service 2, ..., Service x) writes active.log, with log rotation into files such as 2015-11-30-20.log and 2015-11-30-19.log. Inside the Log Daemon, one log stream per service (logStream1, logStream2, ..., logStreamX) has a reader and a producer, together with state files and a Message Producer. Output goes to Kinesis (topic-1, topic-2).]
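The slides do not show how the log daemon's readers and state files work together. As a rough illustration of how at-least-once delivery could be achieved, here is a minimal sketch. It assumes the state file holds a byte offset per log file and that the offset is committed only after a record has been handed to the producer. The function names, the JSON state format, and the `send` callback (standing in for a Kinesis producer) are all hypothetical.

```python
import json
import os


def load_offset(state_path):
    """Return the last committed byte offset, or 0 if no state file exists."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path) as f:
        return json.load(f)["offset"]


def save_offset(state_path, offset):
    """Write the offset to a temp file, then rename it over the state file."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp_path, state_path)


def ship_new_lines(log_path, state_path, send):
    """Send every complete line past the committed offset to `send`.

    The offset is saved only after `send` returns, so a crash in between
    leads to a resend (at-least-once) rather than a lost record.
    """
    offset = load_offset(state_path)
    with open(log_path, "rb") as f:
        f.seek(offset)
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # partial line still being written; pick it up next time
            send(raw.decode("utf-8").rstrip("\n"))
            offset += len(raw)
            save_offset(state_path, offset)
    return offset
```

Committing after the send is the design choice that matters here: committing first would give at-most-once behavior instead.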
![Page 16: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/16.jpg)
Sync Daemon
• Read from Kinesis and Write to DBFS
  • Buffer and write in batches (128 MB or 5 Mins)
  • Partitioned by date
• A long running Apache Spark job
  • Easy to scale up and down
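The "128 MB or 5 Mins" rule can be sketched as a buffer that flushes on whichever threshold is hit first. This is only an illustration under assumptions, not the actual Sync Daemon: the `BatchBuffer` class, the `write_batch` sink, and the `date=YYYY-MM-DD` partition naming are hypothetical, and the partition date is taken from the flush time in UTC.

```python
import time
from datetime import datetime, timezone

MAX_BYTES = 128 * 1024 * 1024  # size threshold from the slide
MAX_AGE_SECONDS = 5 * 60       # time threshold from the slide


class BatchBuffer:
    """Accumulate records and flush when either threshold is reached."""

    def __init__(self, write_batch, max_bytes=MAX_BYTES,
                 max_age=MAX_AGE_SECONDS, clock=time.monotonic):
        self.write_batch = write_batch  # hypothetical sink: (partition, records)
        self.max_bytes = max_bytes
        self.max_age = max_age
        self.clock = clock
        self.records = []
        self.size = 0
        self.started = None  # time the first record of this batch arrived

    def add(self, record):
        if self.started is None:
            self.started = self.clock()
        self.records.append(record)
        self.size += len(record.encode("utf-8"))
        too_big = self.size >= self.max_bytes
        too_old = self.clock() - self.started >= self.max_age
        if too_big or too_old:
            self.flush()

    def flush(self):
        if not self.records:
            return
        # Assumption: batches are partitioned by the UTC date at flush time.
        partition = datetime.now(timezone.utc).strftime("date=%Y-%m-%d")
        self.write_batch(partition, self.records)
        self.records, self.size, self.started = [], 0, None
```

In this sketch the age check only runs when a record arrives; a real long-running job would also need a periodic tick so that a quiet stream still flushes after five minutes.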
![Page 17: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/17.jpg)
ETL Jobs
[Diagram: within the Databricks Deployment, Databricks Jobs take raw records from the Databricks Filesystem and produce ETL tables (Parquet) in the Databricks Filesystem.]
• Delta job (every 10 mins): new files, current day; no dedup; append
• Batch job (daily): all files, previous day; dedup; overwrite
![Page 18: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/18.jpg)
ETL Jobs
• Use the same code for Delta and Batch jobs
• Run as scheduled Databricks jobs
• Use spot instances and fall back to on-demand
• Deliver to Databricks as Parquet tables
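To make the Delta/Batch split concrete, the sketch below shows one way the two jobs could share a single transform while differing only in input scope, dedup, and write mode. It uses plain Python in place of Spark, models the table as a dict keyed by day, and deduplicates on a hypothetical `id` field. None of these names come from the talk.

```python
import json


def parse(raw_records):
    """Shared transform used by both jobs: decode raw JSON records."""
    return [json.loads(line) for line in raw_records]


def dedup(rows):
    """Keep the first row seen for each id (hypothetical dedup key)."""
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique


def run_delta(table, day, new_raw_records):
    """Frequent job: transform only the new files and append, no dedup."""
    table.setdefault(day, []).extend(parse(new_raw_records))


def run_batch(table, day, all_raw_records):
    """Daily job: reprocess the whole day, dedup, and overwrite the partition."""
    table[day] = dedup(parse(all_raw_records))
```

Because the batch job overwrites the day's partition, any duplicates appended by earlier delta runs are replaced by the deduplicated result.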
![Page 19: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/19.jpg)
Lessons Learned - Partition Pruning can save a lot of time and money
Reduced query time from 2800 seconds to just 15 seconds.
Don’t partition on too many levels, as that makes metadata discovery slower and more expensive.
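Partition pruning means a query touches only the partitions its filter can match. The sketch below illustrates the idea on file paths, assuming a Hive-style `date=YYYY-MM-DD` directory layout. The talk only says the data is partitioned by date, so the layout and the `prune` helper are assumptions.

```python
from datetime import date


def prune(paths, start, end):
    """Keep paths whose date=YYYY-MM-DD partition value lies in [start, end]."""
    kept = []
    for path in paths:
        # Extract the partition value between "date=" and the next "/".
        value = path.rsplit("date=", 1)[1].split("/", 1)[0]
        if start <= date.fromisoformat(value) <= end:
            kept.append(path)
    return kept


# A month of daily partitions; a two-day query only needs two of them.
paths = [f"/logs/date=2016-05-{d:02d}/part-0.parquet" for d in range(1, 32)]
wanted = prune(paths, date(2016, 5, 26), date(2016, 5, 27))
```

Here the filter is decided from the path alone, so the remaining 29 days of files would never need to be opened.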
![Page 20: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/20.jpg)
Lessons Learned - High S3 costs: Lots of LIST Requests
Metadata discovery on S3 is expensive. Spark SQL tries to refresh its metadata cache even after write operations.
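A back-of-the-envelope count shows why deep partitioning inflates listing costs. The sketch assumes, as a simplification, that discovery issues one LIST request per directory. Real request counts depend on the client and on result pagination, and `directories_to_list` is a hypothetical helper.

```python
def directories_to_list(level_sizes):
    """Count the directories a recursive discovery has to list.

    level_sizes[i] is the number of distinct values at partition level i.
    The root counts as one directory; each level multiplies the fan-out.
    """
    total, width = 1, 1
    for size in level_sizes:
        width *= size
        total += width
    return total


by_day = directories_to_list([365])               # one level: date
by_day_and_hour = directories_to_list([365, 24])  # two levels: date, hour
```

Under this assumption, a year partitioned by date needs 366 listings, while adding an hour level raises that to 9126.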
![Page 21: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/21.jpg)
Running It All in Databricks - Jobs
![Page 22: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/22.jpg)
Running It All in Databricks - Spark
![Page 23: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/23.jpg)
Data Analysis & Tools
We get the data in. What’s next?
● Monitoring
● Debugging
● Usage Analysis
● Product Design (A/B testing)
![Page 24: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/24.jpg)
Debugging
Access to logs in a matter of seconds thanks to Apache Spark.
![Page 25: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/25.jpg)
Monitoring
Monitor logs by log level. Bug introduced on 2016-05-26 01:00:00 UTC. Fix deployed in 2 hours.
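Monitoring by log level amounts to counting records per time bucket and level, so that a jump in one level stands out. The sketch below shows that aggregation in plain Python. The line format `YYYY-MM-DD HH:MM:SS LEVEL message` and the function name are assumptions, since the talk does not show the log schema.

```python
from collections import Counter


def count_by_hour_and_level(lines):
    """Count log lines per (hour, level).

    Assumes each line looks like 'YYYY-MM-DD HH:MM:SS LEVEL message'.
    """
    counts = Counter()
    for line in lines:
        day, clock, level = line.split(" ", 3)[:3]
        counts[(f"{day} {clock[:2]}:00", level)] += 1
    return counts


sample = [
    "2016-05-26 00:59:10 INFO request served",
    "2016-05-26 01:00:05 ERROR request failed",
    "2016-05-26 01:30:00 ERROR request failed",
]
hourly = count_by_hour_and_level(sample)
```

On tables in Spark the same shape would typically be a group-by on hour and level; the Counter version only shows the logic.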
![Page 26: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/26.jpg)
Usage Analysis + Product Design
SparkR + ggplot2 = Match made in heaven
![Page 27: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/27.jpg)
Summary
Databricks + Apache Spark create a unified platform for:
- ETL
- Data Warehousing
- Data Analysis
- Real time analytics
Issues with DevOps out of the question:
- No need to manage a huge cluster
- Jobs are isolated, they don’t cannibalize each other’s resources
- Can launch any Spark version
![Page 28: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/28.jpg)
Ongoing & Future Work
Structured Streaming
- Reduce Complexity of pipeline: Sync Daemon + Delta + Batch Jobs => Single Streaming Job
- Reduce Latency: Availability of data in seconds instead of minutes
- Event Time Dashboards
![Page 29: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/29.jpg)
Try Apache Spark with Databricks
http://databricks.com/try
![Page 30: A Journey into Databricks' Pipelines: Journey and Lessons Learned](https://reader031.fdocuments.net/reader031/viewer/2022020314/587c05951a28ab03768b4589/html5/thumbnails/30.jpg)
Thank you.
Have questions about ETL with Spark?
Join us at the Databricks Booth 3.45-6.00pm!