Spark Meetup at Uber

Transcript of Spark Meetup at Uber

  • DATA

    Spark & Hadoop @ Uber

  • Who We Are

    Early Engineers On Hadoop team @ Uber

    Kelvin Chu, Reza Shiftehfar, Vinoth Chandar

  • Agenda

    Intro to Data @ Uber · Trips Pipeline Into Warehouse · Paricon · INotify DStream · Future

  • Uber's Mission

    Transportation as reliable as running water,

    everywhere, for everyone

    300+ Cities 60+ Countries

    And growing...

  • Data @ Uber

    Impact of Data is Huge! 2000+ Unique Users Operating a massive transportation system

    Running critical business operations Payments, Fraud, Marketing Spend, Background Checks

    Unique & Interesting Problems Supply vs Demand - Growth Geo-Temporal Analytics

    Latency Is King Enormous business value in making data available asap

  • Data Architecture: Circa 2014

    Kafka Logs

    Schemaless Databases

    RDBMS Tables

    OLAP Warehouse


    Bulk Uploader

    Amazon S3


    Celery/Python ETL Adhoc SQL

  • Challenges

    Scaling to high volume Kafka streams eg: Event data coming from phones

    Merged Views of DB Changelogs across DCs Some of the most important data - trips (duh!)

    Fragile ingestion model Projections/Transformation in pipelines Data Lake philosophy - raw data on HDFS, transform later using Spark

    Free-form JSON data Data Breakages First order of business - Reliable Data

  • New World Order: Hadoop & Spark

    Kafka Logs

    Schemaless Databases

    RDBMS Tables

    Amazon S3


    OLAP Warehouse


    Adhoc SQL

    Applications Adhoc SQL Machine Learning


    Spark SQL

    Spark /Hive

    Spark Jobs/Oozie


    Data Delivery Services

    RawData Cooked

    Spark/Spark Streaming

  • Trips Pipeline : Problem

    Most Valuable Dataset in Uber (100% Accuracy) Trips stored in Uber's schemaless datastores (sharded MySQL), across DCs, cross-replicated Need a consolidated view across DCs, quickly (~1-2 hr)


    Trip Store (DC1)

    Trip Store (DC2)

    Writes in DC1 Writes in DC2

    Multi Master XDC Replication

  • Trips Pipeline : Architecture

  • Trips Pipeline : ETL via SparkSQL

    Decouples raw ingestion from Relational Warehouse table model

    Ability to provision multiple tables off same data set Picks latest changelog entry in the files

    Applies them in order Applies projections & row level transformations

    Produce ingestible data into Warehouse Uses HiveContext to gain access to UDFs

    explode() etc to flatten JSON arrays. Scheduled Spark Job via Oozie

    Runs every hour (tunable)
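The core of the ETL step above is picking the latest changelog entry per trip and applying entries in order. A minimal Python sketch of that compaction logic (field names like `fare` and the `(key, seq, row)` entry shape are hypothetical, not Uber's actual schema):

```python
from typing import Dict, List, Tuple

def latest_changelog(entries: List[Tuple[str, int, dict]]) -> Dict[str, dict]:
    """Keep only the newest changelog entry per primary key.

    Each entry is (primary_key, sequence_number, row); a later
    sequence number wins, mirroring 'picks latest changelog entry
    in the files, applies them in order'.
    """
    latest: Dict[str, Tuple[int, dict]] = {}
    for key, seq, row in entries:
        if key not in latest or seq > latest[key][0]:
            latest[key] = (seq, row)
    return {k: row for k, (_, row) in latest.items()}
```

In the real pipeline this runs as a SparkSQL job over changelog files; the sketch only shows the last-writer-wins rule.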

  • Paricon : PARquet Inference and CONversion

    Running in production since February 2015 the first Spark application at Uber

  • Motivation 1: Data Breakage & Evolution

    Upstream Data Producers

    Downstream Data Consumers

    JSON at S3 data evolving over time and one day

  • Motivation 1: Why Schema Contract

    multiple teams producers consumers

    Avoid data breakage because we have schema evolution systems

    Data to persist in a typed manner analytics

    Serve as documentation understand data faster

    Unit testable

  • Paricon : Workflow





    JSON / Gzip / S3

    Avro schema

    Parquet /In-house HDFSSchema

    Repository and

    Management Systems

    reviewed / consumed

  • Motivation 2: Why Parquet

    Supports schema 2 to 4 times FASTER than json/gzip

    column pruning wide tables at Uber

    filter predicate push-down compression

    Strong Spark support SparkSQL schema evolution

    schema merging in Spark v1.3 merge old and new compatible schema versions no Alter table ...
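The schema-merging idea above (old and new versions combine as long as they are compatible, with no `ALTER TABLE`) can be sketched in plain Python; the `{field: type}` representation is a simplification of what Spark's Parquet reader actually does:

```python
def merge_schemas(old: dict, new: dict) -> dict:
    """Merge two {field: type} schemas.

    New fields are added; an existing field changing type is an
    incompatible evolution and is rejected.
    """
    merged = dict(old)
    for field, ftype in new.items():
        if field in merged and merged[field] != ftype:
            raise ValueError(f"incompatible type for {field}: {merged[field]} vs {ftype}")
        merged[field] = ftype
    return merged
```

Spark 1.3+ does the equivalent automatically when reading Parquet files written with evolving schemas.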

  • Paricon : Transfer

    distcp on Spark only subset of command-line options currently

    Approach

    compute the files list and assign them to RDD partitions avoid stragglers by randomly grouping different dates

    Extras Uber specific logic

    filename conventions backup policies

    internal Spark eco-system faster homegrown delta computation get around s3a problem in Hadoop 2.6

  • Paricon : Infer

    Infer by JsonRDD but not directly

    Challenge: Data is dirty garbage in garbage out

    Two-pass approach first: data cleaning second: JsonRDD inference

  • Paricon : Infer

    Data cleaning structured as rules-based engine each rule is an expectation all rules are heuristics

    based on business domain knowledge the rules are pluggable based on topics

    Struct@JsonRDD vs Avro: illegal characters in field names repeating group names more
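The rules-based cleaning engine described above can be sketched as a registry of pluggable, per-topic heuristics, where each rule is an expectation that either fixes a record or drops it. The `trips` topic and `trip_id` field below are hypothetical examples:

```python
from typing import Callable, Dict, List, Optional

# A rule takes a record and returns a cleaned record, or None to drop it.
Rule = Callable[[dict], Optional[dict]]
RULES: Dict[str, List[Rule]] = {}

def rule(topic: str):
    """Register a cleaning rule for a topic; rules are pluggable per topic."""
    def register(fn: Rule) -> Rule:
        RULES.setdefault(topic, []).append(fn)
        return fn
    return register

def clean(topic: str, records: List[dict]) -> List[dict]:
    """Run each record through the topic's rules before schema inference."""
    out = []
    for rec in records:
        for r in RULES.get(topic, []):
            rec = r(rec)
            if rec is None:
                break
        if rec is not None:
            out.append(rec)
    return out

@rule("trips")
def drop_missing_id(rec: dict) -> Optional[dict]:
    # expectation (hypothetical): every trip record carries a trip_id
    return rec if rec.get("trip_id") else None
```

Because rules are heuristics based on domain knowledge, new expectations can be added per topic without touching the engine.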

  • Paricon : Convert

    Incremental conversion assign days to RDD partitions computation and checkpoint unit: day new job or after failure: work on those partial days only

    Most number of codes among the four tasks multiple source formats (encoded vs non-encoded) data cleaning based on inferred schema home grown JSON decoder for Avro file stitching
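The incremental-conversion scheme above (day = computation and checkpoint unit; after a failure, only partial days are redone) can be sketched as follows; the checkpoint store is modeled as a plain dict for illustration:

```python
def run_incremental(days, checkpoint, convert):
    """Convert each day not yet checkpointed.

    The checkpoint is written only after a whole day succeeds, so a new
    job, or a restart after failure, works on the partial days only.
    """
    for day in sorted(days):
        if checkpoint.get(day):
            continue          # day already fully converted
        convert(day)          # may raise; day then stays uncheckpointed
        checkpoint[day] = True
```

In production the checkpoint would live in durable storage (e.g. HDFS), not in memory.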

  • Stitching : Motivation

    File size

    Number of files

    HDFS block size

    Inefficient for HDFS Many large files

    break them But a lot more small files

    stitch them

  • Stitching : Goal

    HDFS Block HDFS Block HDFS Block HDFS Block

    Parquet Block Parquet Block Parquet Block

    HDFS Block HDFS Block HDFS Block HDFS Block

    Parquet File Parquet File Parquet File Parquet File

    One Parquet block per file Parquet file slightly smaller than HDFS block


  • Stitching : Algorithms

    Algo1: Estimate a constant before conversion pros: easy to do cons: does not work well with temporal variation

    Algo2: Estimate during conversion per RDD partition each day has its own estimate may even self-tuned during the day

  • Stitching : Experiments

    N: number of Parquet files Si: size of the i-th Parquet file B: HDFS block size First part: local I/O - files slightly smaller than HDFS block Second part: network I/O - penalty of files going over a block

    Benchmark queries

  • Paricon : Validate

    Modeled as Source and converted tables join equi-join on primary key compare the counts compare the columns content

    SparkSQL easy for implementation hard for performance tuning

    Debugging tools
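The validation step above (equi-join source and converted tables on the primary key, compare counts and column contents) can be sketched in plain Python; in production this is done in SparkSQL over full tables, and the `id` key below is an illustrative assumption:

```python
def validate(source, converted, key="id"):
    """Join source and converted rows on the primary key and report mismatches.

    Checks row counts, missing keys, and per-column contents.
    """
    issues = []
    if len(source) != len(converted):
        issues.append(f"count mismatch: {len(source)} vs {len(converted)}")
    by_key = {row[key]: row for row in converted}
    for row in source:
        other = by_key.get(row[key])
        if other is None:
            issues.append(f"missing key {row[key]}")
        elif row != other:
            issues.append(f"column mismatch for key {row[key]}")
    return issues
```

An empty result means the converted table matches the source; any entries feed the debugging tools.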

  • Some Production Numbers

    Inferred: >120 topics Converted: >40 topics Largest single job so far

    process 15TB compressed (140TB uncompressed) data one single topic recover from multiple failures by checkpoints

    Numbers are increasing ...

  • Lessons

    Implement custom finer-grained checkpointing S3 data incurs network fees jobs/tasks failure -> downloading all data repeatedly checkpointing saves money and time

    There is no perfect data cleaning 100% clean is often not needed

    Schema parsing implementation is tricky and takes much time to test

  • Komondor: Problem Statement

    Current Kafka->HDFS ingestion service does too much work:

    Consume from Kafka -> Write Sequence Files -> Convert to Parquet -> Upload to HDFS, HIVE compatible way

    Parquet generation needs a lot of memory Local writing and uploading is slow

    Need to decouple raw ingestion from consumable data Move heavy lifting into Spark -> Keep raw-data delivery service lean

    Streaming job to keep converting raw data into Parquet, as they land!

  • Komondor: Kafka Ingestion Service


    Streaming Raw Data Delivery



    Streaming Ingestion

    Batch Verification & File Stitching

    Raw Data

    Consumable Data

  • Komondor: Goals

    Fast raw data into permanent storage Spark Streaming Ingestor to cook raw data

    For now, Parquet generation But opens up polyglot world for ORC, RCFile,....

    De-duplication of raw data before consumption Shields downstream consumers from at-least-once delivery of pipelines Simply replay events for an entire day, in the event of pipeline outages

    Improved wellness of HDFS Avoiding too many small files in HDFS File stitcher job to combine small files from past days
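The de-duplication goal above shields consumers from at-least-once delivery: a replayed batch must produce no duplicates. A minimal sketch, assuming each event can be identified by a `(topic, offset)` pair (a common Kafka convention, not confirmed by the slides):

```python
def deduplicate(events, seen):
    """Drop events already delivered.

    `seen` is a persistent set of (topic, offset) identities, so replaying
    an entire day's events after a pipeline outage is safe.
    """
    out = []
    for ev in events:
        ident = (ev["topic"], ev["offset"])
        if ident not in seen:
            seen.add(ident)
            out.append(ev)
    return out
```

In a real deployment `seen` would be backed by durable storage rather than an in-memory set.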

  • INotify DStream: Komondor De-Duplication

  • INotify DStream: Motivation

    Streaming Job to pick up raw data files Keeps end-to-end latency low vs batch job

    Spark Streaming FileDStream not sufficient Only works 1 directory deep,

    we need at least two levels Provides the file contents directly

    Loses valuable information in file name, eg: partition num Checkpoint contains an entire file list

    Will not scale to millions of files Too much overhead to run one Job Per Topic

  • INotify DStream: HDFS INotify

    Similar to Linux inotify, to watch file system changes

    Exposes the HDFS Edit Log as an event stream CREATE, CLOSE, APPEND, RENAME, METADATA, UNLINK events Introduced at Hadoop Summit 2015

    Provides transaction id Client can use to resume from a given position

    Event Log Purged every time the FSImage is uploaded

  • INotify DStream: Implementation

    Provides the HDFS INotify events as a Spark DStream Implementation very similar to KafkaDirectDStream

    Checkpointing is straightforward: Transactions have unique IDs. Just save Transaction ID to permanent storage

    Filed SPARK-10555, vote up if you think it is useful :)
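The checkpointing idea above (transactions have unique, increasing IDs; just save the last one) can be sketched as a batch loop; the dict-based checkpoint and `txid` event field are illustrative stand-ins for the real DStream and HDFS edit-log event:

```python
def process_batch(events, checkpoint):
    """Apply only events newer than the saved transaction id, then advance it.

    Because txids increase monotonically, replaying a batch after a
    failure is idempotent: already-applied events are filtered out.
    """
    new = [e for e in events if e["txid"] > checkpoint["txid"]]
    if new:
        checkpoint["txid"] = new[-1]["txid"]
    return new
```

This mirrors how a KafkaDirectDStream-style source resumes from saved offsets rather than re-reading everything.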

  • INotify DStream: Early Results