Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

62
Spark Streaming and Friends Chris Fregly Global Big Data Conference Sept 2014 Kinesi s Streami ng

description

Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming and Approximations Lambda Architecture

Transcript of Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Page 1: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Streamingand Friends

Chris FreglyGlobal Big Data Conference

Sept 2014

Kinesis

Streaming

Page 2: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Who am I?Former Netflix’er:netflix.github.io

Spark Contributor:github.com/apache/spark

Founder:fluxcapacitor.com

Author:effectivespark.comsparkinaction.com

Page 3: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Streaming-Kinesis Jira

Page 4: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Quick Poll

• Hadoop, Hive, Pig?

• Spark, Spark streaming?

• EMR, Redshift?

• Flume, Kafka, Kinesis, Storm?

• Lambda Architecture?

• Bloom Filters, HyperLogLog?

Page 5: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

“Streaming”

Kinesis

Streaming

Video Streaming

Piping

Big DataStreaming

Page 6: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Agenda• Spark, Spark Streaming Overview • Use Cases• API and Libraries• Execution Model• Fault Tolerance• Cluster Deployment• Monitoring• Scaling and Tuning• Lambda Architecture• Approximations

Page 7: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Overview (1/2)

• Berkeley AMPLab ~2009

• Part of Berkeley Data Analytics Stack (BDAS, aka “badass”)

Page 8: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Overview (2/2)

• Based on Microsoft Dryad paper ~2007

• Written in Scala• Supports Java, Python, SQL, and R• In-memory when possible, not

required• Improved efficiency over MapReduce– 100x in-memory, 2-10x on-disk

• Compatible with Hadoop– File formats, SerDes, and UDFs

Page 9: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Use Cases

• Ad hoc, exploratory, interactive analytics

• Real-time + Batch Analytics– Lambda Architecture

• Real-time Machine Learning• Real-time Graph Processing• Approximate, Time-bound Queries

Page 10: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Explosion of Specialized Systems

Page 11: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Unified Spark Libraries

• Spark SQL (Data Processing)• Spark Streaming (Streaming)• MLlib (Machine Learning)• GraphX (Graph Processing)• BlinkDB (Approximate Queries)• Statistics (Correlations, Sampling, etc)• Others– Shark (Hive on Spark)– Spork (Pig on Spark)

Page 12: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Unified Benefits

• Advancements in higher-level libraries pushed down into core and vice-versa

• Examples– Spark Streaming: GC and memory

management improvements – Spark GraphX: IndexedRDD for random,

hashed access within a partition versus scanning entire partition

Page 13: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark API

Page 14: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Resilient Distributed Dataset (RDD)

• Core Spark abstraction• Represents partitions

across the cluster nodes• Enables parallel processing

on data sets• Partitions can be in-memory or

on-disk• Immutable, recomputable,

fault tolerant• Contains transformation lineage on data

set

Page 15: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

RDD Lineage

Page 16: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark API Overview

• Richer, more expressive than MapReduce

• Native support for Java, Scala, Python, SQL, and R (mostly)

• Unified API across all libraries• Operations = Transformations +

Actions

Page 17: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Transformations

Page 18: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Actions

Page 19: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Execution Model

Page 20: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Execution Model Overview

• Parallel, distributed• DAG-based• Lazy evaluation• Allows optimizations– Reduce disk I/O– Reduce shuffle I/O– Parallel execution– Task pipelining

• Data locality and rack awareness• Worker node fault tolerance using

RDD lineage graphs per partition

Page 21: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Execution Optimizations

Page 22: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Cluster Deployment

Page 23: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Cluster Deployment

Page 24: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Master High Availability

• Multiple Master Nodes• ZooKeeper maintains current Master• Existing applications and workers will

be notified of new Master election• New applications and workers need

to explicitly specify current Master• Alternatives (Not recommended)– Local filesystem – NFS Mount

Page 25: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Streaming

Page 26: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Streaming Overview

• Low latency, high throughput, fault-tolerance (mostly)

• Long-running Spark application• Supports Flume, Kafka, Twitter,

Kinesis, Socket, File, etc.• Graceful shutdown, in-flight message

draining• Uses Spark Core, DAG Execution

Model, and Fault Tolerance

Page 27: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Streaming Use Cases

• ETL on streaming data during ingestion• Anomaly, malware, and fraud detection• Operational dashboards• Lambda architecture– Unified batch and streaming– ie. Different machine learning models for

different time frames

• Predictive maintenance– Sensors

• NLP analysis– Twitter firehose

Page 28: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Discretized Stream (DStream)• Core Spark Streaming abstraction• Micro-batches of RDDs• Operations similar to RDD• Fault tolerance using DStream/RDD lineage

Page 29: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Streaming API

Page 30: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Streaming API Overview

• Rich, expressive API similar to core• Operations– Transformations– Actions

• Window and State Operations• Requires checkpointing to snip long-

running DStream lineage• Register DStream as a Spark SQL

table for querying!

Page 31: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

DStream Transformations

Page 32: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

DStream Actions

Page 33: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Window and State DStream Operations

Page 34: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

DStream Example

Page 35: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Streaming Cluster Deployment

Page 36: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Streaming Cluster Deployment

Page 37: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Scaling Receivers

Page 38: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Scaling Processors

Page 39: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Streaming+

Kinesis

Page 40: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Streaming + Kinesis Architecture

Page 41: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Throughput and Pricing

Page 42: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Demo!

Kinesis

Streaming

Scala: …/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scalaJava: …/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java

https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/…

Page 43: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Streaming Fault Tolerance

Page 44: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Fault Tolerance

• Points of Failure– Receiver– Driver–Worker/Processor

• Solutions– Data Replication– Secondary/Backup Nodes– Checkpoints

Page 45: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Streaming Receiver Failure

• Use a backup receiver• Use multiple receivers pulling from

multiple shards– Use checkpoint-enabled, sharded streaming

source (ie. Kafka and Kinesis)

• Data is replicated to 2 nodes immediately upon ingestion

• Possible data loss• Possible at-least once• Use buffered sources (ie. Kafka and

Kinesis)

Page 46: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Streaming Driver Failure

• Use a backup Driver – Use DStream metadata checkpoint info

to recover

• Single point of failure – interrupts stream processing

• Streaming Driver is a long-running Spark application– Schedules long-running stream receivers

• State and Window RDD checkpoints help avoid data loss (mostly)

Page 47: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Stream Worker/Processor Failure

• No problem!• DStream RDD partitions will be

recalculated from lineage

Page 48: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Types of Checkpoints

Spark1. Spark checkpointing of

StreamingContext DStreams and metadata

2. Lineage of state and window DStream operations

Kinesis3. Kinesis Client Library (KCL) checkpoints

current position within shard– Checkpoint info is stored in DynamoDB per

Kinesis application keyed by shard

Page 49: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Streaming Monitoring and Tuning

Page 50: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Monitoring

• Monitor driver, receiver, worker nodes, and streams

• Alert upon failure or unusually high latency

• Spark Web UI– Streaming tab

• Ganglia, CloudWatch• StreamingListener callback

Page 51: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Web UI

Page 52: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Tuning• Batch interval

– High: reduce overhead of submitting new tasks for each batch

– Low: keeps latencies low– Sweet spot: DStream job time (scheduling + processing)

is steady and less than batch interval• Checkpoint interval

– High: reduce load on checkpoint overhead– Low: reduce amount of data loss on failure– Recommendation: 5-10x sliding window interval

• Use DStream.repartition() to increase parallelism of processing DStream jobs across cluster

• Use spark.streaming.unpersist=true to let the Streaming Framework figure out when to unpersist

• Use CMS GC for consistent processing times

Page 53: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Lambda Architecture

Page 54: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Lambda Architecture Overview

• Batch Layer– Immutable,

Batch read,Append-only write

– Source of truth– ie. HDFS

• Speed Layer– Mutable,

Random read/write– Most complex– Recent data only– ie. Cassandra

• Serving Layer– Immutable,

Random read, Batch write

– ie. ElephantDB

Page 55: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark + AWS + Lambda

Page 56: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark + AWS + Lambda + ML

Page 57: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Approximations

Page 58: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Approximation Overview

• Required for scaling• Speed up analysis of large datasets• Reduce size of working dataset• Data is messy• Collection of data is messy• Exact isn’t always necessary• “Approximate is the new Exact”

Page 59: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Some Approximation Methods

• Approximate time-bound queries– BlinkDB

• Bernouilli and Poisson Sampling– RDD: sample(), RDD.takeSample()

• HyperLogLogPairRDD: countApproxDistinctByKey()

• Count-min Sketch• Spark Streaming and Twitter

Algebird• Bloom Filters– Everywhere!

Page 61: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Spark Statistics Library

• Correlations– Dependence between 2 random variables– Pearson, Spearman

• Hypothesis Testing– Measure of statistical significance– Chi-squared test

• Stratified Sampling– Sample separately from different sub-populations– Bernoulli and Poisson sampling– With and without replacement

• Random data generator– Uniform, standard normal, and Poisson distribution

Page 62: Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximations Lambda Architecture

Summary

• Spark, Spark Streaming Overview • Use Cases• API and Libraries• Execution Model• Fault Tolerance• Cluster Deployment• Monitoring• Scaling and Tuning• Lambda Architecture• Approximations

Oct 2014 MEAP Early Access http://sparkinaction.com