Spark streaming State of the Union - Strata San Jose 2015

55
Spark Streaming State of the Union and Beyond Tathagata “TD” Das @tathadas Feb 19, 2015

Transcript of Spark streaming State of the Union - Strata San Jose 2015

Page 1: Spark streaming State of the Union - Strata San Jose 2015

Spark Streaming State of the Union and Beyond Tathagata “TD” Das @tathadas

Feb 19, 2015

Page 2: Spark streaming State of the Union - Strata San Jose 2015

Who am I?

Project Management Committee (PMC) member of Spark Lead developer of Spark Streaming Formerly in AMPLab, UC Berkeley Software developer at Databricks

Page 3: Spark streaming State of the Union - Strata San Jose 2015

Founded by the creators of Spark in 2013 Largest organization contributing to Spark End-to-end hosted service, Databricks Cloud

What is Databricks?

Page 4: Spark streaming State of the Union - Strata San Jose 2015

What is Spark Streaming?

Page 5: Spark streaming State of the Union - Strata San Jose 2015

Spark Streaming

Scalable, fault-tolerant stream processing system

File systems

Databases

Dashboards

Flume Kinesis

HDFS/S3

Kafka

Twitter

Streaming

High-level API

joins, windows, … often 5x less code

Fault-tolerant

Exactly-once semantics, even for stateful ops

Integration

Integrate with MLlib, SQL, DataFrames, GraphX

Page 6: Spark streaming State of the Union - Strata San Jose 2015

What can you use it for?

6

Real-time fraud detection in transactions

React to anomalies in sensors in real-time

Cat videos in tweets as soon as they go viral

Page 7: Spark streaming State of the Union - Strata San Jose 2015

How does it work?

Data streams are chopped up into batches Each batch is processed in Spark

Results pushed out in batches

7

data streams

rece

iver

s Streaming

batches results

Page 8: Spark streaming State of the Union - Strata San Jose 2015

Streaming Word Count

val lines = context.socketTextStream(“localhost”, 9999)

val words = lines.flatMap(_.split(" "))

val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

wordCounts.print()

ssc.start()

8

print some counts on screen

count the words

split lines into words

create DStream from data over socket

start processing the stream

Page 9: Spark streaming State of the Union - Strata San Jose 2015

Word Count

9

object NetworkWordCount { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName("NetworkWordCount") val context = new StreamingContext(sparkConf, Seconds(1)) val lines = context.socketTextStream(“localhost”, 9999) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() } }

Page 10: Spark streaming State of the Union - Strata San Jose 2015

Word Count

10

object NetworkWordCount { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName("NetworkWordCount") val context = new StreamingContext(sparkConf, Seconds(1)) val lines = context.socketTextStream(“localhost”, 9999) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print() ssc.start() ssc.awaitTermination() } }

Spark Streaming

public class WordCountTopology { public static class SplitSentence extends ShellBolt implements IRichBolt { public SplitSentence() { super("python", "splitsentence.py"); } @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("word")); } @Override public Map<String, Object> getComponentConfiguration() { return null; } } public static class WordCount extends BaseBasicBolt { Map<String, Integer> counts = new HashMap<String, Integer>(); @Override public void execute(Tuple tuple, BasicOutputCollector collector) { String word = tuple.getString(0); Integer count = counts.get(word); if (count == null) count = 0; count++; counts.put(word, count); collector.emit(new Values(word, count)); } @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("word", "count")); } }

Storm

public static void main(String[] args) throws Exception { TopologyBuilder builder = new TopologyBuilder(); builder.setSpout("spout", new RandomSentenceSpout(), 5); builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout"); builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word")); Config conf = new Config(); conf.setDebug(true); if (args != null && args.length > 0) { conf.setNumWorkers(3); StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology()); } else { conf.setMaxTaskParallelism(3); LocalCluster cluster = new LocalCluster(); cluster.submitTopology("word-count", conf, builder.createTopology()); Thread.sleep(10000); cluster.shutdown(); } } }

Page 11: Spark streaming State of the Union - Strata San Jose 2015

Languages

Can natively use

Can use any other language by using pipe()

11

Page 12: Spark streaming State of the Union - Strata San Jose 2015

Integrates with Spark Ecosystem

12

Spark Core

Spark Streaming

Spark SQL MLlib GraphX

Page 13: Spark streaming State of the Union - Strata San Jose 2015

Combine batch and streaming processing

Join data streams with static data sets // Create data set from Hadoop file!val dataset = sparkContext.hadoopFile(“file”) // Join each batch in stream with the dataset kafkaStream.transform { batchRDD => batchRDD.join(dataset) .filter( ... ) }

13

Spark Core

Spark Streaming

Spark SQL MLlib GraphX

Page 14: Spark streaming State of the Union - Strata San Jose 2015

Combine machine learning with streaming

Learn models offline, apply them online // Learn model offline val model = KMeans.train(dataset, ...) // Apply model online on stream kafkaStream.map { event => model.predict(event.feature) }

14

Spark Core

Spark Streaming

Spark SQL MLlib GraphX

Page 15: Spark streaming State of the Union - Strata San Jose 2015

Combine SQL with streaming

Interactively query streaming data with SQL // Register each batch in stream as table kafkaStream.map { batchRDD => batchRDD.registerTempTable("latestEvents") } // Interactively query table sqlContext.sql("select * from latestEvents")

15

Spark Core

Spark Streaming

Spark SQL MLlib GraphX

Page 16: Spark streaming State of the Union - Strata San Jose 2015

History

16

Late 2011 – research idea AMPLab, UC Berkeley

We need to make Spark

faster

Okay...umm, how??!?!

Page 17: Spark streaming State of the Union - Strata San Jose 2015

History

17

Q2 2012 – prototype Rewrote large parts of Spark core Smallest job - 900 ms à <50 ms

Q3 2012 Spark core improvements open sourced in Spark 0.6

Feb 2013 – Alpha release 7.7k lines, merged in 7 days

Released with Spark 0.7

Late 2011 – idea AMPLab, UC Berkeley

Page 18: Spark streaming State of the Union - Strata San Jose 2015

History

18

Late 2011 – idea AMPLab, UC Berkeley

Q2 2012 – prototype Rewrote large parts of Spark core Smallest job - 900 ms à <50 ms

Q3 2012 Spark core improvements open sourced in Spark 0.6

Feb 2013 – Alpha release 7.7k lines, merged in 7 days

Released with Spark 0.7

Jan 2014 – Stable release Graduation with Spark 0.9

Page 19: Spark streaming State of the Union - Strata San Jose 2015

Current state of Spark Streaming

Page 20: Spark streaming State of the Union - Strata San Jose 2015

Adoption

20

Roadmap

Development

Page 21: Spark streaming State of the Union - Strata San Jose 2015

21

What have we added in the last year?

Page 22: Spark streaming State of the Union - Strata San Jose 2015

Python API

Core functionality in Spark 1.2, with sockets and files as sources

Kafka support coming in Spark 1.3

Other sources coming in future

22

lines = ssc.socketTextStream(“localhost", 9999)) counts = lines.flatMap(lambda line: line.split(" ")) \ .map(lambda word: (word, 1)) \ .reduceByKey(lambda a, b: a+b) counts.pprint()

Page 23: Spark streaming State of the Union - Strata San Jose 2015

Streaming MLlib algorithms

val model = new StreamingKMeans() .setK(args(3).toInt) .setDecayFactor(1.0) .setRandomCenters(args(4).toInt, 0.0) // Apply model to DStreams

model.trainOn(trainingDStream) model.predictOnValues(testDStream.map { lp => (lp.label, lp.features) } ).print()

23

Continuous learning and prediction on streaming data StreamingLinearRegression in Spark 1.1 StreamingKMeans in Spark 1.2 https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html

Page 24: Spark streaming State of the Union - Strata San Jose 2015

Other library additions

Amazon Kinesis integration [ Spark 1.1] More fault-tolerant Flume integration [Spark 1.1] New Kafka API for more native integration [Spark 1.3]

24

Page 25: Spark streaming State of the Union - Strata San Jose 2015

System Infrastructure

Automated driver fault-tolerance [Spark 1.0] Graceful shutdown [Spark 1.0] Write Ahead Logs for zero data loss [Spark 1.2]

25

Page 26: Spark streaming State of the Union - Strata San Jose 2015

Contributors to Streaming

26

0

10

20

30

40

Spark 0.9 Spark 1.0 Spark 1.1 Spark 1.2

Page 27: Spark streaming State of the Union - Strata San Jose 2015

Contributors - Full Picture

27

0

30

60

90

120

Spark 0.9 Spark 1.0 Spark 1.1 Spark 1.2

Streaming

Core + Streaming (w/o SQL, MLlib,…)

All contributions to core Spark directly improve Spark Streaming

Page 28: Spark streaming State of the Union - Strata San Jose 2015

Spark Packages

More contributions from the community in spark-packages

Alternate Kafka receiver

Apache Camel receiver

Cassandra examples

http://spark-packages.org/

28

Page 29: Spark streaming State of the Union - Strata San Jose 2015

Who is using Spark Streaming?

Page 30: Spark streaming State of the Union - Strata San Jose 2015

Spark Summit 2014 Survey

30

40% of Spark users were using Spark Streaming in production or prototyping Another 39% were evaluating it

Not using 21%

Evaluating 39%

Prototyping 31%

Production 9%

Page 31: Spark streaming State of the Union - Strata San Jose 2015

31

Page 32: Spark streaming State of the Union - Strata San Jose 2015

32

80+ known

deployments

Page 33: Spark streaming State of the Union - Strata San Jose 2015

Intel China builds big data solutions for large enterprises Multiple streaming applications for different businesses

Real-time risk analysis for a top online payment company Real-time deal and flow metric reporting for a top online shopping company

Page 34: Spark streaming State of the Union - Strata San Jose 2015

Complicated stream processing SQL queries on streams Join streams with large historical datasets

> 1TB/day passing through Spark Streaming

YARN

Spark Streaming

Kafka

RocketMQ HBase

Page 35: Spark streaming State of the Union - Strata San Jose 2015

One of the largest publishing and education company, wants to accelerate their push into digital learning Needed to combine student activities and domain events to continuously update the learning model of each student Earlier implementation in Storm, but now moved on to Spark Streaming

Page 36: Spark streaming State of the Union - Strata San Jose 2015

YARN

Spark Streaming Kafka

Cassandra

Chose Spark Streaming, because Spark together combines batch, streaming, machine learning, and graph processing

Apache Blur

More information: http://dbricks.co/1BnFZZ8

Page 37: Spark streaming State of the Union - Strata San Jose 2015

Leading advertising automation company with an exchange platform for in-feed ads Process clickstream data for optimizing real-time bidding for ads

Mesos+Marathon

Spark Streaming

Kinesis MySQL Redis

RabbitMQ SQS

Page 38: Spark streaming State of the Union - Strata San Jose 2015

http://techblog.netflix.com/2015/02/whats-trending-on-netflix.html

http://goo.gl/mJNf8X

Page 39: Spark streaming State of the Union - Strata San Jose 2015

Neuroscience @ Freeman Lab, Janelia Farm

Spark Streaming and MLlib to analyze neural activities

Laser microscope scans Zebrafish brainà Spark Streaming à interactive visualization à laser ZAP to kill neurons!

http://www.jeremyfreeman.net/share/talks/spark-summit-2014/

Page 40: Spark streaming State of the Union - Strata San Jose 2015

Neuroscience @ Freeman Lab, Janelia Farm

Streaming machine learning algorithms on time series data of every neuron 2TB/hour and increasing with brain size 80 HPC nodes

Page 41: Spark streaming State of the Union - Strata San Jose 2015

Why are they adopting Spark Streaming?

Easy, high-level API

Unified API across batch and streaming

Integration with Spark SQL and MLlib

Ease of operations

41

Page 42: Spark streaming State of the Union - Strata San Jose 2015

What’s coming next?

Page 43: Spark streaming State of the Union - Strata San Jose 2015

Beyond Spark 1.3

Libraries Streaming machine learning algorithms

A/B testing Online Latent Dirichlet Allocation (LDA) More streaming linear algorithms

Streaming + SQL, Streaming + DataFrames

43

Page 44: Spark streaming State of the Union - Strata San Jose 2015

Beyond Spark 1.3

Operational Ease Better flow control Elastic scaling Cross-version upgradability Improved support for non-Hadoop environments

44

Page 45: Spark streaming State of the Union - Strata San Jose 2015

Beyond Spark 1.3

Performance Higher throughput, especially of stateful operations Lower latencies

Easy deployment of streaming apps in Databricks Cloud!

45

Page 46: Spark streaming State of the Union - Strata San Jose 2015

You can help!

Roadmaps are heavily driven by community feedback We have listened to community demands over the last year

Write Ahead Logs for zero data loss New Kafka integration for stronger semantics

Let us know what do you want to see in Spark Streaming

Spark user mailing list, tweet it to me @tathadas

46

Page 47: Spark streaming State of the Union - Strata San Jose 2015

Takeaways

Spark Streaming is scalable, fault-tolerant stream processing system with high-level API and rich set of libraries

Over 80+ deployments in the industry

More libraries and operational ease in the roadmap

47

Page 48: Spark streaming State of the Union - Strata San Jose 2015

48

Backup slides

Page 49: Spark streaming State of the Union - Strata San Jose 2015

Typesafe survey of Spark users

2136 developers, data scientists, and other tech professionals

http://java.dzone.com/articles/apache-spark-survey-typesafe-0

Page 50: Spark streaming State of the Union - Strata San Jose 2015

Typesafe survey of Spark users

65% of Spark users are interested in Spark Streaming

Page 51: Spark streaming State of the Union - Strata San Jose 2015

Typesafe survey of Spark users

2/3 of Spark users want to process event streams

Page 52: Spark streaming State of the Union - Strata San Jose 2015

52

More usecases

Page 53: Spark streaming State of the Union - Strata San Jose 2015

•  Big data solution provider for enterprises •  Multiple applications for different businesses

-  Monitoring +optimizing online services of Tier-1 bank -  Fraudulent transaction detection for Tier-2 bank

•  Kafka à SS à Cassandra, MongoDB •  Built their own Stratio Streaming platform on

Spark Streaming, Kafka, Cassandra, MongoDB

Page 54: Spark streaming State of the Union - Strata San Jose 2015

•  Provides data analytics solutions for Communication Service Providers -  4 of 5 top mobile ops, 3 of 4 top internet backbone providers -  Processes >50% of all US mobile traffic

•  Multiple applications for different businesses -  Real-time anomaly detection in cell tower traffic -  Real-time call quality optimizations

•  Kafka à SS

http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark

Page 55: Spark streaming State of the Union - Strata San Jose 2015

•  Runs claims processing applications for healthcare providers

http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims

•  Predictive models can look for claims that are likely to be held up for approval

•  Spark Streaming allows model scoring in seconds instead of hours