Structured streaming in Spark


  • Structured Streaming: Spark Streaming in 2.0

    Giri R Varatharajan, https://hadoopist.wordpress.com

  • What is Structured Streaming in Apache Spark?

    Continuous data flow programming model, introduced in Spark 2.0

    Low-latency & high-throughput system
    Exactly-once semantics: no duplicates
    Stateful aggregation over time, event, window, or record

    A streaming platform built on top of Spark SQL: express your streaming computation the same way as your batch computation in Spark SQL

    Alpha release shipped with Spark 2.0. Supports HDFS and S3 now; support for Kafka, Kinesis and other sources is coming very soon.

  • Spark Streaming (< 2.0) Behavior

    Micro-batching: streams are represented as Discretized Streams (DStreams)

    Running aggregations must be specified with the updateStateByKey method

    Requires careful construction for fault tolerance.

    Micro Batching
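A minimal sketch of this pre-2.0 DStream style, assuming a local master, a netcat source on port 9999, and an illustrative checkpoint path:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamWordCount")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches
    ssc.checkpoint("/tmp/dstream-checkpoint")        // required by updateStateByKey

    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map((_, 1))

    // Running aggregation: fold each batch's counts into the accumulated state
    val runningCounts = pairs.updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) =>
        Some(newValues.sum + state.getOrElse(0))
    }

    runningCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```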

  • Streaming Model

    Live data streams keep appending to a DataFrame called the Unbounded Table.

    Spark runs incremental aggregates on the Unbounded Table.

  • Spark Streaming

    Continuous data flow: streams are appended to an Unbounded Table, with the DataFrame APIs on top of it.

    No need to specify any method for running aggregates over time, window, or record.

    Look at the network socket word-count program. Streaming is performed in Complete, Append, or Update mode(s).

    Continuous Data Flow

    lines = Input Table; wordCounts = Result Table

  • Streaming Model

    // Socket stream: read lines as and when they arrive on the netcat channel
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
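The slide shows only the input side; a sketch of the rest of the network socket word-count program, assuming a local master (the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[2]")
  .appName("StructuredSocketWordCount")
  .getOrCreate()
import spark.implicits._

// Input table: one row per line arriving on the socket
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Result table: running count per word
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

// Complete mode: write the whole result table to the console on every trigger
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```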

  • Streaming Model

    val windowedCounts = words
      .groupBy(
        window($"timestamp", windowDuration, slideDuration),
        $"word")
      .count()
      .orderBy("window")
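For the windowed query above, each row needs a timestamp column; with the socket source this can be attached at read time via the includeTimestamp option. A sketch, with illustrative window and slide durations:

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

val windowDuration = "10 minutes"
val slideDuration = "5 minutes"

// includeTimestamp adds a 'timestamp' column with each line's arrival time
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

val words = lines.as[(String, java.sql.Timestamp)]
  .flatMap { case (line, ts) => line.split(" ").map(word => (word, ts)) }
  .toDF("word", "timestamp")

// Sliding-window count: one row per (window, word) pair
val windowedCounts = words
  .groupBy(window($"timestamp", windowDuration, slideDuration), $"word")
  .count()
  .orderBy("window")
```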

  • Create/Read Streams

    File source (HDFS, S3; Text, Parquet, CSV, JSON, etc.)

    Socket stream (netcat)

    Kafka, Kinesis and other input sources are under research, so cross your fingers.

    DataStreamReader API
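A sketch of reading a directory of JSON files as a stream with the DataStreamReader API; the schema and input path are illustrative, and file sources require the schema to be supplied up front:

```scala
import org.apache.spark.sql.types._

// File sources need an explicit schema; it is not inferred at stream time
val schema = new StructType()
  .add("name", StringType)
  .add("age", IntegerType)

// Each new file dropped into the directory becomes new rows in the stream
val people = spark.readStream
  .schema(schema)
  .json("/data/people") // hypothetical input directory
```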



  • Outputting Streams

    Output sink types (via DataStreamWriter):

    Parquet Sink - HDFS, S3, Parquet files
    Console Sink - terminal
    Memory Sink - in-memory table that can be queried interactively over time
    Foreach Sink - arbitrary per-record logic

    Output modes:

    Append Mode (default) - only new rows are appended; applicable only to non-aggregated queries (select, where, filter, join, etc.)

    Complete Mode - output the whole result table to the sink; applicable only to aggregated queries (groupBy, etc.)

    Update Mode - rows whose attributes were updated get written to the output sink.
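A sketch of the writeStream call for two of the sinks above, assuming wordCounts is an aggregated streaming DataFrame (e.g. from the word-count example); the query name is illustrative:

```scala
// Console sink, complete mode: reprint the full aggregate on each trigger
val consoleQuery = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Memory sink: results land in an in-memory table named by queryName,
// which can then be queried interactively with ordinary SQL
val memoryQuery = wordCounts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("word_counts")
  .start()

spark.sql("SELECT * FROM word_counts ORDER BY count DESC").show()
```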

  • CheckPointing

    In case of failure, recover the progress and state of the previous query and continue where it left off.

    Configure a checkpoint location in the writeStream method of DataStreamWriter.

    Must be configured for the Parquet sink / file sink.
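A sketch of wiring the checkpoint location into writeStream for a Parquet (file) sink, assuming words is a non-aggregated streaming Dataset; both paths are illustrative:

```scala
// File sinks run in append mode; checkpointLocation is mandatory here
val query = words.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/data/words")                     // output directory
  .option("checkpointLocation", "/data/checkpoints") // progress + state for recovery
  .start()
```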

  • Unsupported Operations (yet)

    Sort, limit / first N rows, distinct on input streams

    Joins between two streaming Datasets
    Outer joins (full, left, right) between two streaming Datasets

    ds.count(): use ds.groupBy().count() instead
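The last point can be sketched concretely, assuming ds is a streaming Dataset: an immediate single-value count is not allowed, but a running count can be maintained as a one-row aggregate result table:

```scala
// Not allowed on a streaming Dataset: an eager, single-value action
// val n = ds.count()

// Allowed: a running count as a streaming aggregate
val runningCount = ds.groupBy().count()

val query = runningCount.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```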

  • Key Takeaways

    Structured Streaming is still experimental, but please try it out.

    Streaming events are gathered and appended to an infinite DataFrame series (the Unbounded Table), and queries run on top of that.

    Development is very similar to development against Spark's static DataFrame/Dataset APIs.

    Execute ad-hoc queries, run aggregates, update DBs, track session data, prepare dashboards, etc.

    readStream() - the schema of streaming DataFrames is checked only at run time, hence it is untyped.

    writeStream() - various output modes and output sinks are available; always remember when to use which type of output.

    Kafka, Kinesis, MLlib integrations, sessionization, and watermarks are upcoming features being developed in the open-source community.

    Structured Streaming is not recommended for production workloads at this point, even for file or socket streaming.

  • Thank You

    Spark code is available on my GitHub:

    Other Spark-related repositories:

    My blogs and learnings in Spark: