Apache Spark Streaming


  • Bartosz.Jankiewicz@gmail.com, Scalapolis 2016

    Make yourself a scalable pipeline with Apache Spark

  • Google Dataflow, 2014

    The future of data processing is unbounded data. Though bounded data will always have an important and useful place, it is semantically subsumed by its unbounded counterpart.

  • Jaikumar Vijayan, eWeek 2015

    Analyst firms like Forrester expect demand for streaming analytics services and technologies to grow in the next few years as more organisations try to extract value from the huge volumes of data being generated these days from transactions, Web clickstreams, mobile applications and cloud services.

    http://www.eweek.com/cp/bio/Jaikumar-Vijayan/

  • Integrate user activity information

    Enable near-real-time analytics

    Scale to millions of visits per day

    Respond to rapidly emerging requirements

    Enable data-science techniques on top of collected data

    Do the above with reasonable cost

  • Canonical architecture

    [Diagram: ingestion sources — web, sensor, audit-event, micro-service — feeding the pipeline]

  • Apache Spark

    Started in 2009

    Developed in Scala with Akka

    Polyglot: Currently supports Scala, Java, Python and R

    The largest Big Data community as of 2015

  • Spark use-cases

    Data integration and ETL

    Interactive analytics

    Machine learning and advanced analytics

  • Apache Spark

    Scalable

    Fast

    Elegant programming model

    Fault tolerant

    Scalable: scalable by design; scales to hundreds of nodes; proven in production by many companies

  • Apache Spark

    Scalable

    Fast

    Elegant programming model

    Fault tolerant

    Fast: you can optimise for both latency and throughput; reduced hardware appetite due to various optimisations; further improvements added with Structured Streaming in Spark 2.0

  • Apache Spark

    Scalable

    Fast

    Elegant programming model

    Fault tolerant

    Programming model: functional paradigm; easy to run, easy to test; polyglot (R, Scala, Python, Java); batch and streaming APIs are very similar (see the sketch below); REPL, a.k.a. the Spark shell
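    To illustrate how similar the batch and streaming APIs are, here is a word count in both flavours (a sketch; the input file name and the lines stream are placeholders):

    // Batch: count words in an RDD
    val batchCounts = sc.textFile("input.txt")
      .flatMap(_.split(" "))
      .countByValue()               // Map[String, Long] returned to the driver

    // Streaming: the same pipeline on a DStream, near-identical code
    val streamCounts = lines        // lines: DStream[String], a placeholder source
      .flatMap(_.split(" "))
      .countByValue()               // DStream of (word, count) pairs per batch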

  • Apache Spark

    Scalable

    Fast

    Elegant programming model

    Fault tolerant

    Fault tolerance: data is distributed and replicated; seamlessly recovers from node failure; zero-data-loss guarantees thanks to the write-ahead log

  • Runtime model

    [Diagram: a Driver Program (your code plus the SparkContext) dispatches work to Executors #1–#4, which hold the data partitions p1–p6]

    The sketch below makes the model concrete.
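    A minimal driver program (the local master and partition count are assumptions for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    // The driver builds a SparkContext, which schedules tasks on executors;
    // each RDD is split into partitions that executors process in parallel.
    val conf = new SparkConf().setAppName("runtime-model-demo").setMaster("local[4]")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 1000, numSlices = 6) // six partitions, like p1..p6
    println(rdd.getNumPartitions)                      // prints 6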

  • RDD - Resilient Distributed Dataset

    val textFile = sc.textFile("hdfs://...")

    [Diagram: the driver issues the read; Executors #1–#4 load partitions from Data nodes #1–#4]

  • val rdd: RDD[String] = sc.textFile("hdfs://...")

    val wordsRDD = rdd.flatMap(line => line.split(" "))

    val lengthHistogram = wordsRDD.groupBy(word => word.length).collect()

    // saveAsTextFile replaces the slide's saveAsHadoopFile, which is only
    // available on key-value RDDs
    val aWords = wordsRDD.filter(word => word.startsWith("a"))
    aWords.saveAsTextFile("hdfs://...")

    Meet the DAG

    [Diagram: transformations link RDDs A–F into a directed acyclic graph of dependencies, which Spark uses to schedule stages]

    You can print the lineage behind that graph, as sketched below.
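    A quick way to inspect the DAG Spark has recorded for an RDD (reusing wordsRDD from the previous slide):

    // Each transformation adds a node to the lineage; toDebugString renders it,
    // e.g. a MapPartitionsRDD chain leading back to the HadoopRDD source.
    println(wordsRDD.toDebugString)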

  • DStream

    A series of small, deterministic batch jobs

    Spark chops the live stream into batches

    Each micro-batch produces a result

    [Timeline: one RDD (RDD1–RDD6) per batch interval over seconds 1–6]

  • Streaming program

    val dstream: DStream[String] = ??? // source elided on the slide

    val wordsStream = dstream
      .flatMap(line => line.split(" "))
      .transform(_.map(_.toUpperCase))
      .countByValue()

    wordsStream.print()
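    The surrounding setup the slide omits, as a minimal sketch (local master, socket source and port are assumptions; the 10-second batch duration matches the advice in the final thoughts):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // A simple text source to stand in for the elided one above
    val dstream = ssc.socketTextStream("localhost", 9999)

    ssc.start()            // begin consuming the stream
    ssc.awaitTermination() // block until the job is stopped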

  • It's not a free lunch

    The abstractions are leaking

    You need to control the level of parallelism

    You need to understand the impact of transformations

    Don't materialise whole partitions in the foreachPartition operation (see the sketch below)
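    What partition materialisation looks like in practice (a sketch):

    // Risky: toList pulls the entire partition onto the executor's heap at once
    rdd.foreachPartition { records =>
      val all = records.toList
      all.foreach(println)
    }

    // Better: consume the iterator lazily, one record at a time
    rdd.foreachPartition { records =>
      records.foreach(println)
    }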

  • Performance factors

    Network operations

    Data locality

    Total number of cores

    How much you can chunk your work

    Memory usage and GC

    Serialization

  • Level of parallelism

    Number of receivers aligned with number of executors

    Number of threads aligned with number of cores and nature of operations - blocking or non-blocking

    Your data needs to be chunked to make use of your hardware; a common pattern is sketched below
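    A common way to line these up (a sketch; the source, receiver count and core count are assumptions):

    // One receiver per executor, unioned into a single DStream
    val streams = (1 to 4).map(_ => ssc.socketTextStream("localhost", 9999))
    val unified = ssc.union(streams)

    // Re-chunk the records so every core has work to do
    val repartitioned = unified.repartition(16)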

  • Stateful transformations

    Stateful DStream operators can have infinite lineages

    That leads to high failure-recovery time

    Spark solves that problem with checkpointing (see the sketch below)

    val actions: DStream[(String, UserAction)] = ???
    val hotCategories = actions.mapWithState(StateSpec.function(stateFunction _))
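    What the state function referenced above might look like (a sketch; the per-category counting logic is an assumption, the deck doesn't show it):

    import org.apache.spark.streaming.State

    // Hypothetical state function: keep a running count of actions per category
    def stateFunction(category: String, action: Option[UserAction], state: State[Long]): (String, Long) = {
      val count = state.getOption.getOrElse(0L) + 1
      state.update(count)
      (category, count)
    }

    // Stateful operators need a checkpoint directory to truncate the lineage
    ssc.checkpoint("/tmp/checkpoints")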

  • Monitoring

    Spark Web UI

    Metrics:

    Console

    Ganglia Sink

    Graphite Sink (works great with Grafana; a sample config is sketched below)

    JMX

    REST API
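    A minimal conf/metrics.properties sketch for the Graphite sink (host, port and prefix are placeholders):

    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.example.com
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds
    *.sink.graphite.prefix=spark-pipeline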

  • Types of sources

    Basic sources:

    Sockets, HDFS, Akka actors

    Advanced sources:

    Kafka, Kinesis, Flume, MQTT

    Custom sources:

    Receiver interface (a sketch follows)
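    A minimal custom source built on the Receiver interface (the generated payload is a stand-in for a real feed):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class CounterReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      override def onStart(): Unit = {
        // Receive on a separate thread; onStart must not block
        new Thread("counter-receiver") {
          override def run(): Unit = {
            var i = 0L
            while (!isStopped()) {
              store(s"event-$i") // hand each record to Spark
              i += 1
              Thread.sleep(100)
            }
          }
        }.start()
      }
      override def onStop(): Unit = () // the polling thread checks isStopped()
    }

    // Usage: val events = ssc.receiverStream(new CounterReceiver)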

  • Apache Kafka

    Greasing the wheels for big data

    Incredibly fast message bus

    Distributed and fault tolerant

    Highly scalable

    Strong order guarantees

    Easy to replicate across multiple regions

    [Diagram: a Producer writing to Brokers 1 and 2, a Consumer reading from them]

  • Spark + Kafka

    Native integration through the direct-stream API (see the sketch below)

    Offset information is stored in write-ahead logs

    A restart of the Spark driver reloads the offsets that weren't yet processed

    Needs to be explicitly enabled
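    Creating a direct stream, as a minimal sketch against the spark-streaming-kafka-0-10 integration (broker address, group id and topic name are placeholders):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group",
      "auto.offset.reset"  -> "latest"
    )

    val kafkaStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams)
    )

    // Each element is a ConsumerRecord; extract the payload
    val values = kafkaStream.map(_.value)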

  • Storage considerations

    HDFS works well for large, batch workloads

    HBase works well for random reads and writes

    HDFS is well suited for analytical queries

    HBase is well suited for interaction with web pages and certain types of range queries

    It pays off to persist all data in raw format

  • Lessons learnt

  • Architecture

    [Diagram: production architecture, with a web source feeding the pipeline]

  • Final thoughts

    Start with a reasonably large batch duration (~10 seconds)

    Adapt your level of parallelism

    Use Kryo for faster serialisation (as sketched below)

    Don't even start without good monitoring

    Find bottlenecks using the Spark UI and monitoring

    The issues are usually in the environment surrounding Spark
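    Enabling Kryo, as a minimal sketch (registering UserAction is an example; register whatever classes your pipeline ships between nodes):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("pipeline")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[UserAction]))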

  • ?

  • The End

    Bartosz Jankiewicz

    @oborygen

    bartosz.jankiewicz@gmail.com


  • References

    http://spark.apache.org/docs/latest/streaming-programming-guide.html

    https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details

    http://milinda.pathirage.org/kappa-architecture.com/

    http://lambda-architecture.net

    http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/