Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI &...
![Page 1: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/1.jpg)
Data Reliability for Data Lakes
Michael Armbrust, @michaelarmbrust
![Page 2: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/2.jpg)
The Promise of the Data Lake
1. Collect Everything
2. Store it all in the Data Lake
3. Data Science & Machine Learning
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Garbage In, Garbage Stored, Garbage Out
![Page 3: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/3.jpg)
What does a typical data lake project look like?
![Page 4: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/4.jpg)
Evolution of a Cutting-Edge Data Lake
[Diagram: Events, a "?" placeholder, Data Lake, Streaming Analytics, AI & Reporting]
![Page 5: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/5.jpg)
![Page 6: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/6.jpg)
Challenge #1: Historical Queries?
[Diagram: Events, Data Lake, Streaming Analytics, AI & Reporting, with λ-arch components marked (1)]
![Page 7: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/7.jpg)
Challenge #2: Messy Data?
[Diagram: Events, Data Lake, Streaming Analytics, AI & Reporting, with λ-arch components marked (1) and Validation steps marked (2)]
![Page 8: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/8.jpg)
Challenge #3: Mistakes and Failures?
[Diagram: Events, Data Lake, Streaming Analytics, AI & Reporting, with λ-arch components marked (1), Validation steps marked (2), and Reprocessing marked (3), plus a Partitioned label]
![Page 9: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/9.jpg)
Challenge #4: Updates?
[Diagram: Events, Data Lake, Streaming Analytics, AI & Reporting, with λ-arch components marked (1), Validation steps marked (2), Reprocessing marked (3), and Updates marked (4); labels: Partitioned, UPDATE & MERGE, Scheduled to Avoid Modifications]
![Page 10: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/10.jpg)
Wasting Time & Money
Solving Systems Problems
Instead of Extracting Value From Data
![Page 11: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/11.jpg)
Data Lake Distractions
✗ No atomicity means failed production jobs leave data in a corrupt state, requiring tedious recovery
✗ No quality enforcement creates inconsistent and unusable data
✗ No consistency / isolation makes it almost impossible to mix appends and reads, batch and streaming
![Page 12: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/12.jpg)
Let’s try it instead with Delta Lake
![Page 13: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/13.jpg)
Challenges of the Data Lake
[Diagram: the same pipeline with all four challenges marked: (1) λ-arch, (2) Validation, (3) Reprocessing, (4) Updates]
![Page 14: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/14.jpg)
The Architecture
[Diagram: CSV, JSON, TXT… and Kinesis sources, Data Lake, Streaming Analytics, AI & Reporting]
![Page 15: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/15.jpg)
Full ACID Transaction
Focus on your data flow, instead of worrying about failures.
![Page 16: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/16.jpg)
Open Standards, Open Source
Store petabytes of data without worries of lock-in. Growing community including Presto, Spark and more.
![Page 17: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/17.jpg)
Powered by Apache Spark
Unifies Streaming / Batch. Convert existing jobs with minimal modifications.
![Page 18: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/18.jpg)
The Delta Lake
[Diagram: sources (CSV, JSON, TXT…, Kinesis) feed three tables in the data lake, Bronze (Raw Ingestion), Silver (Filtered, Cleaned, Augmented), and Gold (Business-level Aggregates), which serve Streaming Analytics and AI & Reporting. Quality increases from Bronze to Gold.]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
*Data Quality Levels*
![Page 19: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/19.jpg)
Bronze (Raw Ingestion):
• Dumping ground for raw data
• Often with long retention (years)
• Avoid error-prone parsing
![Page 20: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/20.jpg)
Silver (Filtered, Cleaned, Augmented):
Intermediate data with some cleanup applied. Queryable for easy debugging!
![Page 21: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/21.jpg)
Gold (Business-level Aggregates):
Clean data, ready for consumption. Read with Spark or Presto*
*Coming Soon
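The three quality levels can be illustrated with a small plain-Python sketch. It does not use Spark or Delta Lake; the sample events, the field names `user` and `amount`, and the helper functions `to_silver` and `to_gold` are all hypothetical and chosen only to show raw data being kept verbatim, then cleaned, then aggregated.

```python
import json
from collections import Counter

# Hypothetical raw events as they might arrive from a file or a stream.
raw_lines = [
    '{"user": "a", "amount": 10}',
    '{"user": "b", "amount": "oops"}',  # wrong type for amount
    'not json at all',                   # unparseable record
    '{"user": "a", "amount": 5}',
]

# Bronze: keep every record verbatim, with no parsing that could fail.
bronze = [{"raw": line} for line in raw_lines]


def to_silver(rows):
    """Parse Bronze rows and keep only records that pass basic cleanup."""
    cleaned = []
    for row in rows:
        try:
            event = json.loads(row["raw"])
        except json.JSONDecodeError:
            continue  # the raw record is still available in Bronze
        if (
            isinstance(event, dict)
            and "user" in event
            and isinstance(event.get("amount"), int)
        ):
            cleaned.append({"user": event["user"], "amount": event["amount"]})
    return cleaned


def to_gold(rows):
    """Aggregate Silver rows into a business-level total per user."""
    totals = Counter()
    for row in rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Because Bronze keeps the unparsed input, a later fix to the cleanup rules can be applied by rerunning `to_silver` over the same Bronze rows.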
![Page 22: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/22.jpg)
Streams move data through the Delta Lake
• Low-latency or manually triggered
• Eliminates management of schedules and jobs
![Page 23: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/23.jpg)
Delta Lake also supports batch jobs and standard DML*
Operations: INSERT, UPDATE, DELETE, MERGE, OVERWRITE
• Retention
• Corrections
• GDPR
• UPSERTS
*DML released in 0.3.0
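As a rough illustration of what an upsert means, the plain-Python sketch below updates rows whose key matches and inserts the rest. It models only the semantics; it is not the Delta Lake MERGE API, and the function `merge_upsert` and the sample rows are hypothetical.

```python
def merge_upsert(target, updates, key):
    """Return a new table: rows matching on `key` are replaced, others inserted."""
    merged = {row[key]: row for row in target}
    for row in updates:
        merged[row[key]] = row  # update when matched, insert when not matched
    return list(merged.values())


target = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}]
updates = [{"id": 2, "city": "Milan"}, {"id": 3, "city": "Lima"}]

result = merge_upsert(target, updates, "id")
print(result)
```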
![Page 24: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/24.jpg)
Easy to recompute when business logic changes:
• Clear tables
• Restart streams
[Diagram: two tables marked DELETE]
![Page 25: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/25.jpg)
Who is using Delta Lake?
![Page 26: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/26.jpg)
Used by 1000s of organizations worldwide
> 1 exabyte processed last month alone
![Page 27: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/27.jpg)
Improved reliability: Petabyte-scale jobs
10x lower compute: 640 instances to 64!
Simpler, faster ETL: 84 jobs → 3 jobs, halved data latency
![Page 28: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/28.jpg)
How do I use Delta Lake?
![Page 29: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/29.jpg)
Get Started with Delta using Spark APIs

Instead of parquet...
dataframe.write.format("parquet").save("/data")
… simply say delta
dataframe.write.format("delta").save("/data")

Add Spark Package
pyspark --packages io.delta:delta-core_2.12:0.1.0
bin/spark-shell --packages io.delta:delta-core_2.12:0.1.0

Maven
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.12</artifactId>
  <version>0.1.0</version>
</dependency>
![Page 30: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/30.jpg)
In Progress: Declarative Pipelines
Enforce metadata, storage, and quality declaratively.
dataset("warehouse")
.query(input("kafka").select(…).join(…)) // Query to materialize
.location(…) // Storage Location
.schema(…) // Optional strict schema checking
.metastoreName(…) // Hive Metastore
.description(…) // Human readable description
.expect("validTimestamp", // Expectations on data quality
"timestamp > 2012-01-01 AND …",
"fail / alert / quarantine")
*Coming Soon
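Since the declarative API is marked as coming soon, the sketch below only illustrates one possible reading of the three policies named on the slide. The semantics assumed here are: "fail" aborts on the first bad row, "alert" keeps the row but records a message, and "quarantine" diverts the row. The function `apply_expectation`, the exception class, and the sample rows are hypothetical and written in plain Python rather than against any real pipeline API.

```python
from datetime import datetime


class ExpectationFailed(Exception):
    """Raised when a row violates an expectation under the "fail" policy."""


def apply_expectation(rows, name, predicate, on_violation):
    """Check each row against `predicate` and handle violations by policy.

    Returns (kept, quarantined, alerts). The policy meanings are assumptions
    made for this sketch, not documented behavior.
    """
    kept, quarantined, alerts = [], [], []
    for row in rows:
        if predicate(row):
            kept.append(row)
        elif on_violation == "fail":
            raise ExpectationFailed(f"{name} violated by {row!r}")
        elif on_violation == "alert":
            alerts.append(f"{name} violated by {row!r}")
            kept.append(row)
        elif on_violation == "quarantine":
            quarantined.append(row)
        else:
            raise ValueError(f"unknown policy: {on_violation}")
    return kept, quarantined, alerts


def valid_timestamp(row):
    """Mirror the slide's example rule: timestamp > 2012-01-01."""
    return row["timestamp"] > datetime(2012, 1, 1)


rows = [
    {"id": 1, "timestamp": datetime(2019, 6, 1)},
    {"id": 2, "timestamp": datetime(2001, 1, 1)},
]

kept, quarantined, alerts = apply_expectation(
    rows, "validTimestamp", valid_timestamp, "quarantine"
)
print([row["id"] for row in kept], [row["id"] for row in quarantined])
```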
![Page 31: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/31.jpg)
In Progress: Declarative Pipelines
Enforce metadata, storage, and quality declaratively.
dataset("warehouse")
.query(input("kafka").withColumn(…).join(…)) // Query to materialize
.location(…) // Storage Location
.schema(…) // Optional strict schema checking
.metastoreName(…) // Hive Metastore
.description(…) // Human readable description
.expect("validTimestamp", // Expectations on data quality
"timestamp > 2012-01-01 AND …",
"fail / alert / quarantine")
*Coming Soon
![Page 32: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/32.jpg)
![Page 33: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/33.jpg)
How does Delta Lake work?
![Page 34: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/34.jpg)
Delta On Disk
my_table/
  _delta_log/            Transaction Log
    00000.json           Table Versions
    00001.json
  date=2019-01-01/       (Optional) Partition Directories
    file-1.parquet       Data Files
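The layout can be read mechanically from the paths alone. The sketch below is plain Python over path strings; the helper names `is_commit_file`, `commit_version`, and `partition_values` are hypothetical, and the only assumption is that partition directories follow the `key=value` naming shown on the slide.

```python
from pathlib import PurePosixPath


def is_commit_file(relative_path):
    """True for JSON files that sit directly inside a _delta_log directory."""
    path = PurePosixPath(relative_path)
    return path.parent.name == "_delta_log" and path.suffix == ".json"


def commit_version(relative_path):
    """Read the table version from a commit file name such as 00001.json."""
    return int(PurePosixPath(relative_path).stem)


def partition_values(relative_path):
    """Collect key=value partition directories found above a data file."""
    values = {}
    for part in PurePosixPath(relative_path).parts[:-1]:
        if "=" in part:
            key, value = part.split("=", 1)
            values[key] = value
    return values


paths = [
    "my_table/_delta_log/00000.json",
    "my_table/_delta_log/00001.json",
    "my_table/date=2019-01-01/file-1.parquet",
]

versions = [commit_version(p) for p in paths if is_commit_file(p)]
partitions = partition_values(paths[2])
print(versions, partitions)
```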
![Page 35: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/35.jpg)
Version N = Version N-1 + Actions
Change Metadata – name, schema, partitioning, etc.
Add File – adds a file (with optional statistics)
Remove File – removes a file
Result: Current Metadata, List of Files
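The rule "Version N = Version N-1 + Actions" can be sketched as a fold over the commits. This is a toy model in plain Python: the dictionary shapes used for actions and state are invented for illustration and do not match the real JSON format of the transaction log.

```python
def apply_actions(state, actions):
    """Return the next table version from the previous one plus one commit."""
    metadata = dict(state["metadata"])
    files = set(state["files"])
    for action in actions:
        kind = action["type"]
        if kind == "metadata":
            metadata.update(action["changes"])
        elif kind == "add":
            files.add(action["path"])
        elif kind == "remove":
            files.discard(action["path"])
        else:
            raise ValueError(f"unknown action type: {kind}")
    return {"metadata": metadata, "files": files}


def replay(commits):
    """Fold every commit, in order, into the current metadata and file list."""
    state = {"metadata": {}, "files": set()}
    for actions in commits:
        state = apply_actions(state, actions)
    return state


# Hypothetical two-commit history.
commits = [
    [
        {"type": "metadata", "changes": {"name": "events", "partitioning": ["date"]}},
        {"type": "add", "path": "date=2019-01-01/file-1.parquet"},
    ],
    [
        {"type": "add", "path": "date=2019-01-01/file-2.parquet"},
        {"type": "remove", "path": "date=2019-01-01/file-1.parquet"},
    ],
]

current = replay(commits)
print(current["metadata"], sorted(current["files"]))
```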
![Page 36: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/36.jpg)
Implementing Atomicity
Changes to the table are stored as ordered, atomic units called commits
000000.json: Add 1.parquet, Add 2.parquet
000001.json: Remove 1.parquet, Remove 2.parquet, Add 3.parquet
…
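One way to get all-or-nothing commits on a local POSIX file system is to write the actions to a temporary file and then publish it under its final name in a single step. The sketch below does that with `os.link`, which fails if the target name already exists. This is an illustration only, not how Delta Lake talks to any particular storage system; the function names and the `{"op": ..., "path": ...}` action shape are hypothetical.

```python
import json
import os
import tempfile


def write_commit(log_dir, version, actions):
    """Publish all actions of one commit as a single file, or nothing at all."""
    final_path = os.path.join(log_dir, f"{version:06d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=log_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for action in actions:
                tmp.write(json.dumps(action) + "\n")
        # The finished file appears under its final name in one step.
        # os.link raises FileExistsError if this version is already taken.
        os.link(tmp_path, final_path)
    finally:
        os.remove(tmp_path)
    return final_path


def read_commits(log_dir):
    """Return the actions of every published commit, in version order."""
    commits = []
    for name in sorted(os.listdir(log_dir)):
        if name.endswith(".json"):
            with open(os.path.join(log_dir, name)) as handle:
                commits.append([json.loads(line) for line in handle])
    return commits


with tempfile.TemporaryDirectory() as log_dir:
    write_commit(log_dir, 0, [
        {"op": "add", "path": "1.parquet"},
        {"op": "add", "path": "2.parquet"},
    ])
    write_commit(log_dir, 1, [
        {"op": "remove", "path": "1.parquet"},
        {"op": "remove", "path": "2.parquet"},
        {"op": "add", "path": "3.parquet"},
    ])
    live = set()
    for actions in read_commits(log_dir):
        for action in actions:
            if action["op"] == "add":
                live.add(action["path"])
            else:
                live.discard(action["path"])

print(sorted(live))
```

A reader that lists only the `.json` files never sees a half-written commit, because temporary files use a different suffix.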
![Page 37: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/37.jpg)
Ensuring Serializability
Need to agree on the order of changes, even when there are multiple writers.
[Diagram: User 1 and User 2 writing to a log of 000000.json, 000001.json, 000002.json]
![Page 38: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/38.jpg)
Solving Conflicts Optimistically
1. Record start version
2. Record reads/writes
3. Attempt commit
4. If someone else wins, check if anything you read has changed.
5. Try again.
[Diagram: User 1 and User 2 both commit to a log of 000000.json, 000001.json, 000002.json; each is labeled Write: Append, Read: Schema]
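The five steps can be sketched with an in-memory log. Everything here is a toy model: `CommitLog`, `commit_with_retry`, and `ConflictError` are hypothetical names, and reads and writes are tracked as coarse labels ("schema", "append") in the spirit of the slide rather than as the real conflict-detection rules.

```python
import threading


class ConflictError(Exception):
    """Raised when a winning commit changed something this writer had read."""


class CommitLog:
    """In-memory stand-in for the ordered list of commit files."""

    def __init__(self):
        self._commits = []
        self._lock = threading.Lock()

    def latest_version(self):
        with self._lock:
            return len(self._commits) - 1

    def commits_after(self, version):
        with self._lock:
            return list(self._commits[version + 1:])

    def try_commit(self, version, commit):
        """Store `commit` as `version` only if that version is still free."""
        with self._lock:
            if version != len(self._commits):
                return False
            self._commits.append(commit)
            return True


def commit_with_retry(log, start_version, reads, writes, max_attempts=10):
    """Attempt a commit; on losing, recheck reads against winners and retry."""
    version = start_version                       # step 1: record start version
    commit = {"writes": set(writes)}              # step 2: record reads/writes
    for _ in range(max_attempts):
        if log.try_commit(version + 1, commit):   # step 3: attempt commit
            return version + 1
        winners = log.commits_after(version)      # step 4: someone else won
        for other in winners:
            if set(reads) & other["writes"]:
                raise ConflictError("something this writer read has changed")
        version += len(winners)                   # step 5: try again
    raise RuntimeError("gave up after too many attempts")


log = CommitLog()
log.try_commit(0, {"writes": {"schema"}})  # version 0 creates the table

# Two users both start from version 0, read the schema, and append data.
v1 = commit_with_retry(log, 0, reads={"schema"}, writes={"append"})
v2 = commit_with_retry(log, 0, reads={"schema"}, writes={"append"})

# A schema change at version 3 conflicts with a stale appender.
v3 = commit_with_retry(log, 2, reads={"schema"}, writes={"schema"})
try:
    commit_with_retry(log, 2, reads={"schema"}, writes={"append"})
    conflict_detected = False
except ConflictError:
    conflict_detected = True

print(v1, v2, v3, conflict_detected)
```

The two appends both succeed because neither changes what the other read, which matches the slide's example of two users that each append and read only the schema.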
![Page 39: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/39.jpg)
Handling Massive Metadata
Large tables can have millions of files in them! How do we scale the metadata? Use Spark for scaling!
[Diagram: the actions Add 1.parquet, Add 2.parquet, Remove 1.parquet, Remove 2.parquet, Add 3.parquet, followed by a Checkpoint]
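The idea of a checkpoint can be shown with a small sketch: collapse a prefix of the log into the set of files that are still live, then replay only the commits that came after it. The slide states that Spark is used to scale this; the version below is plain Python on a tiny log, with hypothetical function names and a hypothetical third commit added to show replay after the checkpoint.

```python
def replay(files, commits):
    """Apply add/remove actions from `commits` on top of a starting file set."""
    live = set(files)
    for actions in commits:
        for op, path in actions:
            if op == "add":
                live.add(path)
            elif op == "remove":
                live.discard(path)
            else:
                raise ValueError(f"unknown action: {op}")
    return live


def checkpoint(commits):
    """Collapse a log prefix into the sorted list of files that remain live."""
    return sorted(replay(set(), commits))


log = [
    [("add", "1.parquet"), ("add", "2.parquet")],
    [("remove", "1.parquet"), ("remove", "2.parquet"), ("add", "3.parquet")],
    [("add", "4.parquet")],  # hypothetical commit written after the checkpoint
]

snapshot = checkpoint(log[:2])
from_scratch = replay(set(), log)
from_checkpoint = replay(snapshot, log[2:])

print(snapshot, sorted(from_checkpoint))
```

Both paths reach the same file list, but the second reads one checkpoint and one commit instead of the whole history.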
![Page 40: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/40.jpg)
Road Map
• 0.2.0 – Released!
  • S3 Support
  • Azure Blob Store and ADLS Support
• 0.3.0 – Released!
  • UPDATE (Scala)
  • DELETE (Scala)
  • MERGE (Scala)
  • VACUUM (Scala)
• 0.4.0 – This week!
  • CONVERT TO DELTA
  • DESCRIBE HISTORY
  • Python DML (Update/Delete/Merge)
• Spark 3.0
  • DDL Support / Hive Metastore
  • SQL DML Support
![Page 41: Data Reliability for Data Lakes · 2020-04-11 · Evolution of a Cutting-Edge Data Lake Events AI & Reporting Streaming Analytics ... Spark and more. AI & Reporting Streaming Analytics](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed391b4a1895f794116acc9/html5/thumbnails/41.jpg)
Build your own Delta Lake at https://delta.io