Columnar Storage @ Uber
![Page 1: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/1.jpg)
Zhenxiao Luo
Software Engineer @ Uber
Columnar Storage @ Uber
![Page 2: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/2.jpg)
Agenda
● Mission
● Uber Business Highlights
● Analytics Infrastructure @ Uber
● Why Columnar Storage
● Parquet: Columnar Storage for Big Data
● Presto: Interactive SQL engine for Big Data
● Hoodie: Incremental data ingestion library
● Ongoing Work
![Page 3: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/3.jpg)
Uber Mission
Transportation as reliable as running water, everywhere, for everyone
![Page 4: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/4.jpg)
Uber Stats
6 Continents
73 Countries
450 Cities
12,000 Employees
10+ Million Avg. Trips/Day
40+ Million MAU Riders
1.5+ Million MAU Drivers
![Page 5: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/5.jpg)
Analytics Infrastructure @ Uber
![Page 6: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/6.jpg)
What is Columnar Storage
![Page 7: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/7.jpg)
Why Columnar Storage?
Save Disk Space
● Encoding
● Compression
Improve Query Performance
● Only read required data
● No need to decode data for aggregations
● Statistics, Dictionary give potential for optimization
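The points on this slide can be illustrated with a small sketch. It pivots a few made-up trip records into columns, dictionary-encodes one column, and aggregates without touching unrelated columns. The records, field names, and helper functions are assumptions for illustration only and do not reflect Parquet's actual on-disk encoding.

```python
from collections import Counter

# Hypothetical trip records in row-oriented form (illustrative data only).
rows = [
    {"trip_id": 1, "city": "SF", "fare": 12.5},
    {"trip_id": 2, "city": "NYC", "fare": 30.0},
    {"trip_id": 3, "city": "SF", "fare": 8.0},
    {"trip_id": 4, "city": "SF", "fare": 15.5},
]


def to_columns(records):
    """Pivot row records into one list per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def dictionary_encode(values):
    """Replace each value with a small integer id into a dictionary."""
    dictionary = []
    ids = {}
    codes = []
    for v in values:
        if v not in ids:
            ids[v] = len(dictionary)
            dictionary.append(v)
        codes.append(ids[v])
    return dictionary, codes


columns = to_columns(rows)
city_dict, city_codes = dictionary_encode(columns["city"])

# "Only read required data": summing fares touches a single column.
total_fare = sum(columns["fare"])

# Counting trips per city works on the integer codes; the dictionary is
# consulted once per distinct value, not once per row.
trips_per_city = {city_dict[code]: n for code, n in Counter(city_codes).items()}
```

Repeated strings shrink to small integers, which is where the disk savings come from, and the per-city count never materializes the decoded strings row by row.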
![Page 8: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/8.jpg)
Challenges
Data freshness
● Data arrives in big batches
● Updates/Inserts are in rows
● Rows -> Columns transformation
Query Performance
● Query engines prefer columns
![Page 9: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/9.jpg)
Parquet: Columnar Storage for Big Data
![Page 10: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/10.jpg)
Parquet @ Uber
Raw Tables
● No preprocessing
● Highly nested
● ~30 minutes ingestion latency
● Huge tables
Modeled Tables
● Preprocessing via Hive ETL
● Flattened
● ~12 hours ingestion latency
![Page 11: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/11.jpg)
What is Presto: Interactive SQL Engine for Big Data
Interactive query speeds
Horizontally scalable
ANSI SQL
Battle-tested by Facebook, Uber, & Netflix
Completely open source
Access to petabytes of data in the Hadoop data lake
![Page 12: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/12.jpg)
How Presto Works
![Page 13: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/13.jpg)
Why Presto is Fast
● Data in memory during execution
● Pipelining and streaming
● Columnar storage & execution
● Bytecode generation
○ Inline virtual function calls
○ Inline constants
○ Rewrite inner loops
○ Rewrite type-specific branches
![Page 14: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/14.jpg)
Presto Optimizations for Parquet
Example Query:
SELECT base.driver_uuid
FROM rawdata.schemaless_mezzanine_trips_rows
WHERE datestr = '2017-03-02' AND base.city_id IN (12)
Data:
● Up to 15 levels of Nesting
● Up to 80 fields inside each Struct
● Fields are added/deleted/updated inside Struct
![Page 15: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/15.jpg)
Old Parquet Reader
![Page 16: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/16.jpg)
Nested Column Pruning
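The slide's diagram is not recoverable from this transcript, so the following is only a general sketch of the idea the title names. It assumes each leaf field of a nested struct is stored as its own column addressed by a dotted path, and that a pruned read loads only the leaves a query references. The store layout, field names, and values are hypothetical.

```python
# Hypothetical column store: each leaf field of the nested `base` struct is
# kept as its own column, keyed by its dotted path (values are made up).
leaf_columns = {
    "base.driver_uuid": ["d-1", "d-2", "d-3"],
    "base.city_id": [12, 7, 12],
    "base.status": ["completed", "canceled", "completed"],
    "base.fare.amount": [12.5, 0.0, 8.0],
}


def read_whole_struct(store, struct_name):
    """Unpruned read: load every leaf under the struct."""
    prefix = struct_name + "."
    return {path: col for path, col in store.items() if path.startswith(prefix)}


def read_pruned(store, required_paths):
    """Pruned read: load only the leaves the query references."""
    return {path: store[path] for path in required_paths}


# The example query references only base.driver_uuid and base.city_id.
required = ["base.driver_uuid", "base.city_id"]
unpruned = read_whole_struct(leaf_columns, "base")
pruned = read_pruned(leaf_columns, required)
```

With structs of up to 80 fields, as described on the earlier slide, the gap between the two reads grows accordingly.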
![Page 17: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/17.jpg)
Columnar Reads
![Page 18: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/18.jpg)
Predicate Pushdown
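As a rough illustration of the technique named here (the slide itself is a diagram), the sketch below uses per-row-group minimum and maximum statistics to skip row groups that cannot satisfy `city_id = 12`. The `RowGroup` class and its values are assumptions, not Presto's reader API.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical row group carrying min/max statistics for city_id."""
    min_city_id: int
    max_city_id: int
    city_ids: list


def may_contain(group, wanted):
    """True if the statistics cannot rule out the wanted value."""
    return group.min_city_id <= wanted <= group.max_city_id


def scan_city(groups, wanted):
    """Return (matching row count, number of row groups actually read)."""
    matches = 0
    groups_read = 0
    for group in groups:
        if not may_contain(group, wanted):
            continue  # skipped using statistics alone
        groups_read += 1
        matches += sum(1 for c in group.city_ids if c == wanted)
    return matches, groups_read


groups = [
    RowGroup(1, 5, [1, 3, 5]),
    RowGroup(10, 14, [10, 12, 12, 14]),
    RowGroup(20, 30, [20, 25, 30]),
]
result = scan_city(groups, 12)
```

Only the middle row group is read; the other two are excluded by their value ranges.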
![Page 19: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/19.jpg)
Dictionary Pushdown
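A minimal sketch of the idea, assuming dictionary-encoded column chunks: when min/max statistics cannot exclude a value, the chunk's dictionary of distinct values still can. The chunk layout below is hypothetical and much simpler than a real Parquet dictionary page.

```python
def dictionary_may_match(dictionary, wanted):
    """True if any distinct value stored in the chunk equals the wanted value."""
    return wanted in dictionary


# Hypothetical column chunks: distinct values plus dictionary-encoded ids.
chunks = [
    # Range 5..20 contains 12, so min/max statistics alone cannot exclude it.
    {"dictionary": [5, 20], "codes": [0, 1, 1, 0]},
    {"dictionary": [12, 7], "codes": [0, 0, 1]},
]


def count_matches(column_chunks, wanted):
    """Return (matching row count, number of chunks whose codes were scanned)."""
    matches = 0
    chunks_decoded = 0
    for chunk in column_chunks:
        if not dictionary_may_match(chunk["dictionary"], wanted):
            continue  # no distinct value matches, skip the whole chunk
        chunks_decoded += 1
        wanted_code = chunk["dictionary"].index(wanted)
        matches += chunk["codes"].count(wanted_code)
    return matches, chunks_decoded


result = count_matches(chunks, 12)
```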
![Page 20: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/20.jpg)
Lazy Reads
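One way to picture lazy reads, under the assumption that the filter column is evaluated first and projected columns are decoded only if some row matches. `LazyColumn` and the sample data are illustrative names, not part of Presto.

```python
class LazyColumn:
    """Wraps a loader function and decodes the column only on first access."""

    def __init__(self, loader):
        self._loader = loader
        self._values = None
        self.loaded = False

    def values(self):
        if not self.loaded:
            self._values = self._loader()
            self.loaded = True
        return self._values


def select_driver_uuids(city_ids, driver_uuid_column, wanted_city):
    """Evaluate the filter first; touch the projected column only on a match."""
    positions = [i for i, c in enumerate(city_ids) if c == wanted_city]
    if not positions:
        return []  # projected column is never decoded
    decoded = driver_uuid_column.values()
    return [decoded[i] for i in positions]


# Hypothetical data for two row groups.
group_a = LazyColumn(lambda: ["d-1", "d-2", "d-3"])
group_b = LazyColumn(lambda: ["d-4", "d-5"])

hits_a = select_driver_uuids([7, 12, 12], group_a, 12)
hits_b = select_driver_uuids([3, 9], group_b, 12)
```

The second row group has no matching city, so its `driver_uuid` column is never loaded.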
![Page 21: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/21.jpg)
Benchmarking Results
![Page 22: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/22.jpg)
Scale of Presto @ Uber
● 2 clusters
○ Application cluster
■ Hundreds of machines
■ 100K queries per day
■ P90: 30s
○ Ad hoc cluster
■ Hundreds of machines
■ 20K queries per day
■ P90: 60s
● Access to both raw and modeled tables
○ 5+ petabytes of data
● Total 120K+ queries per day
![Page 23: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/23.jpg)
Applications of Presto @ Uber
● Marketplace pricing
○ Real-time driver incentives
● Communication platform
○ Driver quality and action platform
○ Rider/driver cohorting
○ Ops, comms, & marketing
● Growth marketing
○ BI dashboard for growth marketing
● Data science
○ Exploratory analytics using notebooks
● Data quality
○ Freshness and quality check
● Ad hoc queries
![Page 24: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/24.jpg)
Late Arriving Updates
● Updates/Inserts are in rows
● Data already in immutable Parquet format
○ Cannot append to Parquet files
○ HDFS does not support in-place updates
● Updates/Inserts are spread across different directories and files
○ Reading whole Parquet files and rebuilding new ones is time-consuming
![Page 25: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/25.jpg)
Hoodie: Incremental Data Ingestion Library
● New Updates/Inserts are stored in Logs
● (Record Key -> fileId) index
○ Implemented as a bloom filter in the Parquet Footer
● Multiple versions of each file exist under the directory
● Metadata under the directory records the most recent version
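To make the (Record Key -> fileId) index concrete, here is a small Bloom filter sketch. It assumes one filter per file, standing in for the filter the slide places in the Parquet footer. The bit-array size, hash scheme, file ids, and record keys are arbitrary illustrative choices, not Hoodie's actual parameters.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; size and hash count are arbitrary illustrative choices."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Hypothetical files: fileId -> record keys stored in that file.
files = {"file-1": ["trip-a", "trip-b"], "file-2": ["trip-c"]}
footers = {}
for file_id, keys in files.items():
    bloom = BloomFilter()
    for key in keys:
        bloom.add(key)
    footers[file_id] = bloom


def candidate_files(record_key):
    """Files whose filter cannot rule out the key (false positives possible)."""
    return [fid for fid, bloom in footers.items() if bloom.might_contain(record_key)]
```

A Bloom filter never reports a stored key as absent, so an incoming update is always routed to the file that holds its key; an occasional false positive only costs an extra file check.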
![Page 26: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/26.jpg)
Hoodie: Data Ingestion
Every few minutes:
● Read logs
● Get all updated/inserted records
● Get all affected files
● Read Parquet files, apply updates, build new Parquet files
● Build index in Parquet Footer
● Update file version in Metadata
● Clean obsolete versions of files periodically
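The ingestion steps can be sketched with in-memory stand-ins for storage. Everything here is an assumption for illustration: nested dicts stand in for versioned Parquet files, a plain dict replaces the bloom-filter index, and new keys are assigned to an arbitrary file. It is not Hoodie's implementation.

```python
# In-memory stand-ins for storage (all names and layouts are hypothetical).
# versions[file_id][version] -> {record_key: record}
versions = {
    "file-1": {1: {"trip-a": {"fare": 10}, "trip-b": {"fare": 20}}},
    "file-2": {1: {"trip-c": {"fare": 30}}},
}
latest = {"file-1": 1, "file-2": 1}  # metadata: most recent version per file
key_to_file = {"trip-a": "file-1", "trip-b": "file-1", "trip-c": "file-2"}


def ingest(log):
    """Apply a batch of (record_key, record) upserts read from the log."""
    by_file = {}
    for key, record in log:
        # Existing keys use the index; new keys go to an arbitrary default file.
        file_id = key_to_file.setdefault(key, "file-2")
        by_file.setdefault(file_id, {})[key] = record
    for file_id, changes in by_file.items():
        current = versions[file_id][latest[file_id]]
        rewritten = {**current, **changes}  # read old file, apply updates
        new_version = latest[file_id] + 1
        versions[file_id][new_version] = rewritten  # write a new immutable file
        latest[file_id] = new_version  # publish the new version in metadata


def clean(keep=1):
    """Drop all but the newest `keep` versions of each file (keep >= 1)."""
    for by_version in versions.values():
        for v in sorted(by_version)[:-keep]:
            del by_version[v]


ingest([("trip-a", {"fare": 12}), ("trip-d", {"fare": 5})])
```

Old versions stay readable until `clean()` runs, so a query that started before the ingest can still finish against the files it listed.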
![Page 27: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/27.jpg)
Hoodie: Query Engine
● Hoodie library in Presto
● When the Presto Coordinator does a NameNode listing:
○ Read Metadata under directory
○ List all files under directory
○ Return only the latest version of the file for each fileId
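The listing filter described above might look like the following sketch. The `<fileId>_v<version>.parquet` naming scheme and the shape of the metadata are made up for illustration; the slides do not specify Hoodie's real file layout.

```python
def parse_name(name):
    """Split a hypothetical '<fileId>_v<version>.parquet' file name."""
    stem = name[: -len(".parquet")]
    file_id, version = stem.rsplit("_v", 1)
    return file_id, int(version)


def latest_files(listing, metadata):
    """Keep only the file whose version matches the metadata for its fileId."""
    selected = []
    for name in listing:
        file_id, version = parse_name(name)
        if metadata.get(file_id) == version:
            selected.append(name)
    return selected


listing = ["file-1_v1.parquet", "file-1_v2.parquet", "file-2_v1.parquet"]
metadata = {"file-1": 2, "file-2": 1}
visible = latest_files(listing, metadata)
```

Because the filter runs during file listing, queries see exactly one version of each file without any change to how the files themselves are read.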
![Page 28: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/28.jpg)
Presto Ongoing Work
● GeoSpatial query optimization
● Presto Elasticsearch Connector
● Multi-tenancy Support
● All-Active Presto Across Data Centers
● Authentication and Authorization
● Highly Available Coordinator
![Page 29: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/29.jpg)
Hadoop Infrastructure & Analytics
● HDFS Erasure Coding
● HDFS Tiered Storage
● All-Active Hadoop Across Data Centers
● Hive On Spark
● Spark
● Data Visualization
![Page 30: Columnar Storage @ Uber · Applications of Presto @ Uber Update/Insert in Rows Data already in immutable Parquet format Could not append to Parquet HDFS does not support update Update/Insert](https://reader030.fdocuments.net/reader030/viewer/2022040415/5f2ff643816ba37d836b4a44/html5/thumbnails/30.jpg)
Thank you
Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved.
We are Hiring: https://www.uber.com/careers/list/27366/
Send resumes to: [email protected] or [email protected]
Interested in learning more about Uber Eng? Eng.uber.com
Follow us on Twitter: @UberEng