Copyright 2017 © Qubole
Qubole Data Platform – Cloud-native platform for AI, Machine Learning, and Analytics
Data Lake
Data Prep and Ingestion, Analytics, AI and Machine Learning
Self-service access
Multiple use cases
Financial governance
Elastic scale
Security
Data Engineers
Data Analysts
Data Scientists
Platform Administrators
Cloud-Native Data Platform for AI, Machine Learning, and Analytics
● GDPR & CCPA
● Right to erasure
● Right to rectification
● Regenerating data
  ○ New table with deletions/updates on original table
  ○ Drop original
  ○ Rename new table
  ○ Expensive process
● Re-structure tables
  ○ User in partitions
  ○ Fast deletions
  ○ Update limited to partition
  ○ Restructuring and updates expensive
● Fast and inexpensive updates and deletes
● Minimal impact on read performance
● Available/extendible to Apache Hive, Apache Spark and Presto
Solution            Update/Delete?   Primary Engine     Cross Engine Read   Cross Engine Write
Databricks Delta    Yes              Databricks Spark   No                  No
Hive ACID v2        Yes              Hive               No                  Limited
Apache Iceberg(I)   In progress      Spark              Yes                 Limited
● ORC
● Fastest update/deletes
● No degradation in read performance in stable/compacted state
● Open source so extendible to all engines
Transactions
Locks
Differential Files
● Transaction and Write IDs
  ○ Transaction opened and committed/rolled back for each operation
  ○ Aborted and uncommitted transactions are not visible
  ○ Write IDs determine the write location
  ○ Atomicity and Isolation
Query:
  CREATE TABLE sample (a int, b int) TBLPROPERTIES('transactional_properties'='insert_only')

Query:
  INSERT INTO TABLE sample VALUES(10,10)
File System:
  sample
  |_____ delta_00001_00001_0000
         |_______ 000000_0
Metastore:
  Committed Transactions: {1}
  Aborted Transactions: {}

Query:
  INSERT INTO TABLE sample VALUES(20,20)
File System:
  |_____ delta_00002_00002_0000
         |_______ 000000_0
Metastore:
  Committed Transactions: {1, 2}
  Aborted Transactions: {}

Query:
  INSERT INTO TABLE sample VALUES(30,30) -- FAIL
File System:
  |_____ delta_00003_00003_0000
         |_______ 000000_0
Metastore:
  Committed Transactions: {1, 2}
  Aborted Transactions: {3}
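The insert-only example above can be read as a visibility rule: a reader lists the delta directories and keeps only those written by committed transactions. A minimal Python sketch of that rule follows. The function names and the regular expression are illustrative assumptions, not Hive's implementation; the directory names use the shortened form from the slides, and the sketch treats the write ID and the transaction ID as the same number, as the example does.

```python
import re

# Assumed naming pattern, taken from the slide example:
# delta_<minWriteId>_<maxWriteId>_<statementId>
DELTA_PATTERN = re.compile(r"^delta_(\d+)_(\d+)_(\d+)$")


def parse_delta(name):
    """Return (min_write_id, max_write_id) for a delta directory name, or None."""
    match = DELTA_PATTERN.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def visible_deltas(directories, committed):
    """Keep only deltas whose write IDs all belong to committed transactions."""
    visible = []
    for name in directories:
        parsed = parse_delta(name)
        if parsed is None:
            continue  # not a delta directory; ignored in this sketch
        low, high = parsed
        if all(write_id in committed for write_id in range(low, high + 1)):
            visible.append(name)
    return visible


directories = [
    "delta_00001_00001_0000",
    "delta_00002_00002_0000",
    "delta_00003_00003_0000",  # left behind by the failed INSERT
]
print(visible_deltas(directories, committed={1, 2}))
# ['delta_00001_00001_0000', 'delta_00002_00002_0000']
```

The directory written by the aborted transaction still exists on the file system, but the reader skips it because write ID 3 is not in the committed set.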
● Differential Directories
  ○ Delta
  ○ Delete Delta
● Table Types
  ○ Insert Only Tables
    ■ Base + [Delta]
  ○ Full ACID Tables
    ■ Base + [Delta] + [Delete Delta]
    ■ Files have additional columns for Synthetic-RowId
● Base directories
Query:
  CREATE TABLE sample (a int, b int) TBLPROPERTIES('transactional'='true')

Query:
  INSERT OVERWRITE TABLE sample VALUES(10,10)
Base File:
  ..   Txn   Bucket   RowId   a      b
  ..   1     1234     1       10     10

Query:
  DELETE FROM sample WHERE a = 10
Delete_Delta File:
  ..   Txn   Bucket   RowId   a      b
  ..   1     1234     1       null   null
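The base and delete-delta rows above share the same (Txn, Bucket, RowId) triple, which suggests how a reader can reconcile them: any base row whose synthetic row ID also appears in a delete delta is dropped. The following Python sketch illustrates that idea only. The `RowId` and `Record` types and the `apply_delete_deltas` function are hypothetical names introduced for this example, not part of Hive or Presto.

```python
from typing import NamedTuple, Optional


class RowId(NamedTuple):
    """Synthetic row identifier, mirroring the Txn/Bucket/RowId columns above."""
    txn: int
    bucket: int
    row: int


class Record(NamedTuple):
    row_id: RowId
    a: Optional[int]
    b: Optional[int]


def apply_delete_deltas(data_rows, delete_rows):
    """Drop every data row whose row ID appears in a delete delta."""
    deleted = {record.row_id for record in delete_rows}
    return [record for record in data_rows if record.row_id not in deleted]


# Values taken from the example: one base row, then a delete of that row.
base_file = [Record(RowId(1, 1234, 1), 10, 10)]
delete_delta_file = [Record(RowId(1, 1234, 1), None, None)]

print(apply_delete_deltas(base_file, delete_delta_file))
# []
```

The delete-delta record carries no column values because only the row ID is needed to identify the row being removed.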
sample
|_____ delta_00001_00001_0000
|      |_______ 000000_0
|_____ delta_00002_00002_0000
|      |_______ 000000_0
|_____ delete_delta_00003_00003_0000
       |_______ 000000_0

sample
|_____ delta_00001_00001_0000
|      |_______ 000000_0
|_____ delta_00002_00002_0000
|      |_______ 000000_0
|_____ delete_delta_00003_00003_0000
|      |_______ 000000_0
|_____ delta_00001_00002_0000
       |_______ 000000_0

sample
|_____ delta_00001_00001_0000
|      |_______ 000000_0
|_____ delta_00002_00002_0000
|      |_______ 000000_0
|_____ delete_delta_00003_00003_0000
|      |_______ 000000_0
|_____ delta_00001_00002_0000
|      |_______ 000000_0
|_____ base_0004
       |_______ 000000_0

sample
|_____ base_0004
       |_______ 000000_0
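The directory listings above end with only the base directory remaining, which implies a cleanup rule: once a base covers a range of write IDs, the deltas and delete deltas inside that range are no longer needed. The Python sketch below is one way to express that rule under simplifying assumptions. The name patterns are taken from the shortened names in the slides, `obsolete_directories` is a hypothetical helper, and the sketch ignores readers that may still be using the older files.

```python
import re

# Assumed name patterns, based on the shortened names in the listings above.
BASE = re.compile(r"^base_(\d+)$")
DELTA = re.compile(r"^(?:delete_)?delta_(\d+)_(\d+)_\d+$")


def obsolete_directories(directories):
    """Return the directories fully covered by the newest base directory."""
    base_ids = [int(m.group(1)) for m in map(BASE.match, directories) if m]
    if not base_ids:
        return []  # without a base, nothing is covered in this sketch
    newest_base = max(base_ids)

    obsolete = []
    for name in directories:
        base_match = BASE.match(name)
        delta_match = DELTA.match(name)
        if base_match and int(base_match.group(1)) < newest_base:
            obsolete.append(name)  # older base, superseded
        elif delta_match and int(delta_match.group(2)) <= newest_base:
            obsolete.append(name)  # delta range already folded into the base
    return obsolete


listing = [
    "delta_00001_00001_0000",
    "delta_00002_00002_0000",
    "delete_delta_00003_00003_0000",
    "delta_00001_00002_0000",
    "base_0004",
]
removable = obsolete_directories(listing)
print([name for name in listing if name not in removable])
# ['base_0004']
```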
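Presto
Insert Only ACID Tables
Full ACID Tables
[Figure: Presto read path for the query SELECT c1 FROM T WHERE c2 = 'x'. Pages of c2 values (y y x z z y; y y y y y y) carry an isValid column, pass through the filter c2 = 'x', and yield either a Result or an EMPTY PAGE. Labels: TBS, isValid, Filter c2 = 'x', Result, EMPTY PAGE.]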
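The isValid column in the Presto slides can be thought of as a per-row mask that is combined with the query filter, so deleted rows never reach the result. The Python sketch below shows that combination for `SELECT c1 FROM T WHERE c2 = 'x'`. It is an illustration under assumptions: the `c2` values come from the slide, while the `c1` values, the row positions, and both function names are made up for the example and do not reflect Presto's internal API.

```python
def is_valid_mask(row_ids, deleted_row_ids):
    """Build the isValid column: False for rows found in a delete delta."""
    return [row_id not in deleted_row_ids for row_id in row_ids]


def filter_page(c1, c2, is_valid, wanted):
    """Return c1 for rows that are still valid and satisfy c2 = wanted."""
    return [v1 for v1, v2, ok in zip(c1, c2, is_valid) if ok and v2 == wanted]


# One page of six rows; c2 values as on the slide, c1 values hypothetical.
row_ids = [0, 1, 2, 3, 4, 5]
c1 = [1, 2, 3, 4, 5, 6]
c2 = ["y", "y", "x", "z", "z", "y"]

# No deletes: the single matching row is returned.
print(filter_page(c1, c2, is_valid_mask(row_ids, set()), "x"))
# [3]

# The matching row was deleted: the page comes back empty.
print(filter_page(c1, c2, is_valid_mask(row_ids, {2}), "x"))
# []
```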
● Hive’s RecordReader
  ○ + Minimal work required
  ○ - Heavy performance penalty
●
  ○ + Extendible across formats
  ○ - Materialization at Join
  ○ - Load all Delete_Deltas in-memory
● Performance improvements
● Support Bucketed Tables
● ACID tables with non-ACID data from past
● Support ORC generated by Hive Streaming Connection API
Comparisons
Solution            Updates/Deletes   Snapshot Isolation   Compaction/Cleanup   File Formats
Databricks Delta    Yes               Yes                  Manual               Parquet
Hive ACID v2        Yes               Yes                  Automatic            ORC/All
Apache Iceberg(I)   No                Yes                  None                 Avro, Parquet, ORC
Uber Hudi           Yes               Yes                  Manual               Parquet + Avro
Comparisons
Solution            Updates/Deletes     Primary Engine     Compaction/Cleanup   File Formats
Databricks Delta    Available           Databricks Spark   Manual               Parquet
Hive ACID v2        Available           Hive               Automatic            ORC/All
Apache Iceberg(I)   Under development   Spark, Presto      None                 Avro, Parquet, ORC
Uber Hudi           Available           Spark              Manual               Parquet + Avro