Taking Spark Streaming to the Next Level with Datasets and DataFrames


  • Taking Spark Streaming to the next level with Datasets and DataFrames

    Tathagata "TD" Das (@tathadas)

    Strata San Jose 2016

  • Streaming in Spark

    Spark Streaming changed how people write streaming apps

    [Diagram: Spark Core with the SQL, Streaming, MLlib, and GraphX libraries on top]

    - Functional, concise and expressive

    - Fault-tolerant state management

    - Unified stack with batch processing

    More than 50% of users consider it the most important part of Spark

  • Streaming apps are growing more complex

  • Streaming computations don't run in isolation

    Need to interact with batch data, interactive analysis, machine learning, etc.

  • Use case: IoT Device Monitoring

    [Diagram: IoT devices produce an event stream consumed by several workloads]

    ETL into long term storage
    - Prevent data loss
    - Prevent duplicates

    Status monitoring
    - Handle late data
    - Process based on event time

    Interactively debug issues
    - Consistency

    Anomaly detection
    - Learn models offline
    - Use online + continuous learning

  • Pain points with DStreams

    1. Processing with event-time, dealing with late data
    - DStream API exposes batch time, hard to incorporate event-time

    2. Interoperate streaming with batch AND interactive
    - RDD/DStream has a similar API, but still requires translation
    - Hard to interoperate with DataFrames and Datasets

    3. Reasoning about end-to-end guarantees
    - Requires carefully constructing sinks that handle failures correctly
    - Data consistency in the storage while being updated

  • Structured Streaming

  • The simplest way to perform streaming analytics is not having to reason about streaming at all

  • Model

    [Diagram: trigger every 1 sec; at times 1, 2, 3 the query runs over the input table containing data up to 1, 2, 3]

    Input: data from source as an append-only table

    Trigger: how frequently to check input for new data

    Query: operations on input
    - usual map/filter/reduce
    - new window, session ops

  • Model

    [Diagram: trigger every 1 sec; at times 1, 2, 3 the query produces a result table with output for data up to 1, 2, 3, and the full result is written out each time (complete output)]

    Result: final operated table, updated every trigger interval

    Output: what part of the result to write to the data sink after every trigger

    Complete output: write the full result table every time

  • Model

    [Diagram: same query as above, now writing delta output - only the rows of the result that changed at each trigger]

    Result: final operated table, updated every trigger interval

    Output: what part of the result to write to the data sink after every trigger
    - Complete output: write the full result table every time
    - Delta output: write only the rows that changed in the result from the previous batch
    - Append output: write only new rows

    *Not all output modes are feasible with all queries

  • Single API!

    Static, bounded table or streaming, unbounded table: both are just a Dataset/DataFrame

  • Batch ETL with DataFrame

    // Read from JSON file
    input = ctxt.read.format("json").load("source-path")

    // Select some devices
    result = input.select("device", "signal").where("signal > 15")

    // Write to Parquet file
    result.write.format("parquet").save("dest-path")

  • Streaming ETL with DataFrame

    // Read from JSON file stream
    input = ctxt.read.format("json").stream("source-path")

    // Select some devices
    result = input.select("device", "signal").where("signal > 15")

    // Write to Parquet file stream
    result.write.format("parquet").outputMode("append").startStream("dest-path")

  • Streaming ETL with DataFrame

    input = ctxt.read.format("json").stream("source-path")

    result = input.select("device", "signal").where("signal > 15")

    result.write.format("parquet").outputMode("append").startStream("dest-path")

    read…stream() creates a streaming DataFrame but does not start any of the computation

    write…startStream() defines where & how to output the data, and starts the processing

  • Streaming ETL with DataFrame

    input = ctxt.read.format("json").stream("source-path")

    result = input.select("device", "signal").where("signal > 15")

    result.write.format("parquet").outputMode("append").startStream("dest-path")

    [Diagram: at triggers 1, 2, 3, new data appended to the input produces new rows in the append-only result table, and only those new rows are written out (append mode)]
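
    For reference, the slides use a pre-release API (read…stream() / write…startStream()). A minimal sketch of the same pipeline in the structured streaming API as it later shipped in Spark 2.x, with placeholder paths and schema fields, would look roughly like this:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("streaming-etl-sketch").getOrCreate()

    // Streaming file sources need an explicit schema (fields here are placeholders).
    val schema = new StructType().add("device", StringType).add("signal", IntegerType)

    val input = spark.readStream.format("json").schema(schema).load("source-path")

    val result = input.select("device", "signal").where("signal > 15")

    // File sinks need a checkpoint location for fault tolerance.
    val query = result.writeStream
      .format("parquet")
      .option("checkpointLocation", "checkpoint-path")
      .outputMode("append")
      .start("dest-path")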

  • Continuous Aggregations

    Continuously compute the average signal of each type of device

    input.groupBy("device-type").avg("signal")

    Continuously compute the average signal of each type of device over the last 10 minutes of event time

    input.groupBy(window("event-time", "10 min"), "device-type").avg("signal")

    - Windowing is just a type of aggregation
    - Simple API for event-time based windowing
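
    A runnable sketch of the windowed aggregation in the released Spark 2.x API (column names follow the slides; the schema, paths, and the assumption that event-time is a timestamp column are placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{avg, window}
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("continuous-agg-sketch").getOrCreate()
    import spark.implicits._

    val schema = new StructType()
      .add("device-type", StringType)
      .add("signal", DoubleType)
      .add("event-time", TimestampType)

    val input = spark.readStream.schema(schema).json("source-path")

    // Average signal per device type over 10-minute event-time windows.
    val averages = input
      .groupBy(window($"event-time", "10 minutes"), $"device-type")
      .agg(avg("signal"))

    // Aggregations keep updating the result table, so write it out in complete mode.
    averages.writeStream
      .outputMode("complete")
      .format("console")
      .start()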

  • Query Management

    query = result.write.format("parquet").outputMode("append").startStream("dest-path")

    query.stop()
    query.awaitTermination()
    query.exception()

    query.sourceStatuses()
    query.sinkStatus()

    query: a handle to the running streaming computation, for managing it
    - Stop it, wait for it to terminate
    - Get status
    - Get error, if terminated

    Multiple queries can be active at the same time

    Each query has a unique name for keeping track
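
    In released Spark 2.x the handle is a StreamingQuery; a minimal usage sketch, building on the hypothetical query handle from the ETL sketch above:

    // Inspect and manage the running query (StreamingQuery API).
    if (query.isActive) {
      println(s"Query ${query.name} (id ${query.id}) is running")
    }

    // If the query has terminated with an error, report it.
    query.exception.foreach(e => println(s"Query failed: ${e.getMessage}"))

    // Stop it, or block the driver until it terminates on its own.
    query.stop()
    // query.awaitTermination()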

  • Logically: DataFrame operations on a table (i.e. as easy to understand as batch)

    Physically: Spark automatically runs the query in a streaming fashion (i.e. incrementally and continuously)

    [Diagram: DataFrame -> Logical Plan -> Catalyst optimizer -> continuous, incremental execution]

  • Structured Streaming

    High-level streaming API built on the Spark SQL engine
    - Runs the same computation as batch queries on Datasets/DataFrames
    - Event time, windowing, sessions, sources & sinks
    - End-to-end exactly-once semantics

    Unifies streaming, interactive and batch queries
    - Aggregate data in a stream, then serve using JDBC
    - Add, remove, change queries at runtime
    - Build and apply ML models

  • Advantages over DStreams

    1. Processing with event-time, dealing with late data

    2. Exactly the same API for batch, streaming, and interactive

    3. End-to-end exactly-once guarantees from the system

    4. Performance through SQL optimizations
    - Logical plan optimizations, Tungsten, codegen, etc.
    - Faster state management for stateful stream processing

  • Underneath the Hood

  • Batch Execution on Spark SQL

    [Diagram: DataFrame/Dataset -> Logical Plan (abstract representation of the query) -> Planner (logical plan optimization, execution plan generation) -> Execution Plan -> RDD job (Tungsten code generation)]
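
    To see these stages on a concrete batch query, the logical and physical plans can be printed with explain(); a small sketch reusing the placeholder path from the ETL example:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("plan-inspection-sketch").getOrCreate()

    val input = spark.read.json("source-path")   // batch read of the placeholder path
    val result = input.select("device", "signal").where("signal > 15")

    // Prints the logical plans (before and after optimization) and the physical
    // execution plan that the planner generates before running it as an RDD job.
    result.explain(true)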

  • Continuous Incremental Execution

    Planner extended to be aware of the streaming logical plans

    Planner generates a continuous series of incremental execution plans, each processing the next chunk of streaming data

    [Diagram: DataFrame/Dataset -> Logical Plan -> Planner -> Incremental Execution Plan 1, 2, 3, 4, ...]

  • Streaming Source

    A streaming data source where
    - Records can be uniquely identified by an offset
    - An arbitrary segment of the stream data can be read based on an offset range
    - Files, Kafka, Kinesis supported

    More restricted than DStream receivers, so it can ensure end-to-end exactly-once guarantees

    [Diagram: logical plan with sources, operations, and sink]
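
    A minimal sketch of what such an offset-based source contract could look like (illustrative only; the trait, types, and method names are assumptions, not Spark's actual internal interface):

    import org.apache.spark.sql.DataFrame

    // Hypothetical offset-based source contract (illustrative, not Spark's internal API).
    trait OffsetBasedSource {
      // Highest offset currently available in the stream, if any data has arrived.
      def latestOffset: Option[Long]

      // Deterministically read the records in the range (start, end]. Replaying the
      // same range must return the same data, which is what enables exactly-once recovery.
      def getBatch(start: Option[Long], end: Long): DataFrame

      def stop(): Unit
    }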

  • Streaming Sink

    Allows output to be pushed to external storage

    Each batch of output should be written to storage atomically and idempotently

    Both are needed to ensure end-to-end exactly-once guarantees AND data consistency

    [Diagram: logical plan with sources, operations, and sink]
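
    And a matching sketch of a sink (again illustrative; the trait and signature are assumptions):

    import org.apache.spark.sql.DataFrame

    // Hypothetical sink contract (illustrative, not Spark's internal API).
    trait IdempotentSink {
      // Write one batch of output. If this batchId was already committed (e.g. the
      // batch is replayed after a failure), the write must have no additional visible
      // effect, so each batch lands in storage atomically and exactly once.
      def addBatch(batchId: Long, data: DataFrame): Unit
    }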

  • Incremental Execution

    Every trigger interval:
    - Planner asks the sources for the next chunk of input data
    - Generates an incremental execution plan with that input
    - Generates the output data
    - Hands it to the sink for pushing out

    [Diagram: planner gets new data from the sources as input, runs the incremental execution plan to generate output, and pushes the output to the sink]
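
    A simplified sketch of that per-trigger loop, reusing the hypothetical OffsetBasedSource and IdempotentSink traits above (this is not Spark's actual driver code):

    def runOneTrigger(source: OffsetBasedSource,
                      sink: IdempotentSink,
                      lastProcessed: Option[Long],
                      batchId: Long): Option[Long] =
      source.latestOffset match {
        case Some(end) if !lastProcessed.contains(end) =>
          val input  = source.getBatch(lastProcessed, end)                      // next chunk of input
          val output = input.select("device", "signal").where("signal > 15")   // incremental plan
          sink.addBatch(batchId, output)                                        // push output to sink
          Some(end)                                                             // new processed offset
        case _ =>
          lastProcessed                                                         // no new data this trigger
      }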

  • Offset Tracking with WAL

    Planner saves the next offset range in a write-ahead log (WAL) on HDFS/S3 before incremental execution starts

    [Diagram: the planner saves the in-progress offset range to the offset WAL (alongside already processed offsets) before handing the input to the incremental execution plan]

  • Recovery from WAL

    After a failure, the restarted planner recovers the last offset range from the WAL and restarts the failed execution

    Output is exactly the same as it would have been without the failure

    [Diagram: restarted planner reads the last in-progress offset range from the offset WAL and re-runs the incremental execution plan on that input]
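
    A toy sketch of such an offset log (a local file standing in for HDFS/S3; the class and log format are made up for illustration):

    import java.nio.file.{Files, Paths, StandardOpenOption}

    // Toy write-ahead log of offset ranges (illustrative only).
    class OffsetLog(path: String) {
      // Record the offset range for a batch *before* processing it.
      def logBatch(batchId: Long, endOffset: Long): Unit =
        Files.write(Paths.get(path), s"$batchId,$endOffset\n".getBytes,
          StandardOpenOption.CREATE, StandardOpenOption.APPEND)

      // On restart, recover the last logged (possibly in-progress) batch so the same
      // offset range can be re-executed, reproducing exactly the same output.
      def lastBatch(): Option[(Long, Long)] =
        if (!Files.exists(Paths.get(path))) None
        else scala.io.Source.fromFile(path).getLines().toSeq.lastOption.map { line =>
          val Array(id, off) = line.split(",")
          (id.toLong, off.toLong)
        }
    }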

  • Streaming source + streaming sink + offset tracking = end-to-end exactly-once guarantees

  • Stateful Stream Processing

    Streaming aggregations require maintaining intermediate "state" data across batches

    [Diagram: input batches "Cat, Dog, Cat", then "Cat, Cow", then "Dog" flow through successive "compute aggregates" steps; the state data grows from {Cat: 2, Dog: 1} to {Cat: 3, Dog: 1, Cow: 1}, and so on]
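
    A runnable sketch of such a stateful streaming aggregation in the released Spark 2.x API (the socket source, host, and port are assumptions for easy local experimentation, e.g. feeding words with nc -lk 9999):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("stateful-agg-sketch").getOrCreate()

    // Each line typed into the socket becomes a row in the "value" column.
    val words = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Running count per value: Spark keeps the counts as state across batches, so
    // inputs Cat, Dog, Cat followed by Cat, Cow accumulate to Cat=3, Dog=1, Cow=1.
    val counts = words.groupBy("value").count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()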