Spark Streaming into context

Spark Streaming into context David Martinez Rego 20th of October 2016


Page 1: Spark Streaming into context

Spark Streaming into context

David Martinez Rego

20th of October 2016

Page 2: Spark Streaming into context

About me

• PhD in ML (2013): predictive maintenance of windmills

• Lived in London since then

• Postdoc @ UCL

• Teaching and Mentoring @ UCL internships inside financial institutions

• Consulting on Data analytics

• Early Startup

Page 3: Spark Streaming into context

Plethora of options?

Page 4: Spark Streaming into context

Wishlist

• Easy to compose complex pipelines

• Easy scaling out

• Interoperable with a large ecosystem

• Low latency and high throughput

• Monitoring

Page 5: Spark Streaming into context

Plethora of options?

Page 8: Spark Streaming into context

Flume

• Its mechanism of scaling to different machines is managed in an ad hoc way

• Good for simple, custom data gathering from external sources, dropping the data at the perimeter for further processing.

Page 9: Spark Streaming into context

Plethora of options?

Page 10: Spark Streaming into context

Plethora of options?

Page 11: Spark Streaming into context

Plethora of options?

Page 12: Spark Streaming into context

Plethora of options?

Page 13: Spark Streaming into context

Lessons learnt

• Each project added some good ideas when they were most needed

• Eventually, all platforms have absorbed the best ideas from their peers

• It seems that we have a winner, for now?

Page 14: Spark Streaming into context

Time view

Pipelining Composition

one at a time spouts and bolts

RDD

one at a time spouts and bolts

Page 15: Spark Streaming into context

Storm basic model

Spout

Spout

Bolt

Bolt

Bolt

Bolt

Topology

[Diagram: the spouts and bolts above are connected by edges labelled "s.g."]
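To make the model on this slide more concrete, here is a minimal single-process sketch in Python. It is not Storm's API: the names `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical, and the word-count task is only an assumed example. It shows a spout emitting tuples one at a time, bolts transforming them, and a topology wiring them together.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout: the source of the stream, emitting one tuple at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt: consumes each sentence and emits one lower-cased word per tuple."""
    for sentence in stream:
        for word in sentence.split():
            yield word.lower()


def count_bolt(stream: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps a running count and emits the updated count for each word."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(sentences: Iterable[str]) -> Dict[str, int]:
    """Topology: wires spout -> split bolt -> count bolt and drains the stream."""
    latest: Dict[str, int] = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        latest[word] = count
    return latest
```

In a real cluster each spout and bolt would run as many parallel tasks, and a grouping would decide which task receives each tuple; the sketch leaves that out.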

Page 16: Spark Streaming into context

Guarantees and fault tolerance

ACK, Anchoring

Page 17: Spark Streaming into context

Anchoring

ACK

Guarantees and fault tolerance
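The following sketch illustrates how anchoring and acking can work together. Storm's documentation describes acker tasks that track each tuple tree with an XOR of tuple ids; the code below mimics that idea in plain Python and is not Storm's API. The class `AckTracker`, its methods, and the seeded id generator are all hypothetical.

```python
import random
from typing import Dict


class AckTracker:
    """Tracks one tuple tree per spout message with a single XOR checksum."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)      # seeded only to keep the demo repeatable
        self._pending: Dict[str, int] = {}   # message id -> XOR of outstanding tuple ids

    def _new_id(self) -> int:
        # Random 64-bit id; "or 1" avoids the degenerate id 0.
        return self._rng.getrandbits(64) or 1

    def emit_root(self, message_id: str) -> int:
        """The spout emits a new tuple and starts tracking its tree."""
        tuple_id = self._new_id()
        self._pending[message_id] = tuple_id
        return tuple_id

    def anchor(self, message_id: str) -> int:
        """A bolt emits a child tuple anchored to the tree of message_id."""
        tuple_id = self._new_id()
        self._pending[message_id] ^= tuple_id
        return tuple_id

    def ack(self, message_id: str, tuple_id: int) -> bool:
        """A bolt acks a tuple; returns True once the whole tree is processed."""
        self._pending[message_id] ^= tuple_id
        if self._pending[message_id] == 0:
            del self._pending[message_id]
            return True
        return False

    def pending(self) -> int:
        """Number of tuple trees that are not fully acked yet."""
        return len(self._pending)
```

Each id is XORed in once when the tuple is emitted and once when it is acked, so the checksum returns to zero only when every tuple in the tree has been acked. A tree that stays pending past a timeout would be replayed from the spout.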

Page 18: Spark Streaming into context

Spout

Page 19: Spark Streaming into context

Bolt

Page 20: Spark Streaming into context

Topology

Page 21: Spark Streaming into context

Storm basic model

Spout

Spout

Bolt

Bolt

Bolt

Bolt

Topology

[Diagram: the spouts and bolts above are connected by edges labelled "s.g."]

Page 22: Spark Streaming into context

Lambda architecture

Page 23: Spark Streaming into context

Time view

Pipelining Composition

one at a time spouts and bolts

RDD

one at a time system, stream, stream task

Page 24: Spark Streaming into context

Samza

Page 25: Spark Streaming into context

Samza

Page 26: Spark Streaming into context

Samza

Page 27: Spark Streaming into context

Samza

Page 28: Spark Streaming into context

Kappa architecture

Page 29: Spark Streaming into context

Time view

Pipelining Composition

one at a time source, spouts, bolts and ack

RDD

one at a time system, stream, stream task

Page 30: Spark Streaming into context

RDD

Page 31: Spark Streaming into context

Microbatch

Page 32: Spark Streaming into context

Init + connect to source

pipeline

computation + state mgmt.
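The micro-batch idea can be sketched without Spark. The code below is a pure-Python stand-in, not the Spark Streaming API, and it assumes the three labels on this slide name the usual stages of such a program: connect to a source, define the pipeline, and run the computation while managing state. The function names and the word-count state are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def micro_batches(events: Iterable[Tuple[float, str]],
                  batch_interval: float) -> List[List[str]]:
    """Group (arrival_time, value) events into batches of batch_interval seconds.

    Batch k holds the events with k * interval <= arrival_time < (k + 1) * interval.
    Intervals with no events are skipped in this sketch.
    """
    batches: Dict[int, List[str]] = defaultdict(list)
    for arrival_time, value in events:
        batches[int(arrival_time // batch_interval)].append(value)
    return [batches[k] for k in sorted(batches)]


def run(events: Iterable[Tuple[float, str]],
        batch_interval: float) -> List[Dict[str, int]]:
    """Apply the pipeline to each batch and carry state across batches."""
    state: Dict[str, int] = {}            # state management: running counts
    snapshots: List[Dict[str, int]] = []
    for batch in micro_batches(events, batch_interval):  # source, cut into batches
        for word in batch:                               # pipeline applied per batch
            state[word] = state.get(word, 0) + 1
        snapshots.append(dict(state))                    # result after this batch
    return snapshots
```

Results only appear once a batch closes, so no result can be fresher than the batch interval.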

Page 33: Spark Streaming into context

Time view

Pipelining Composition

one at a time source, spouts, bolts and ack

RDD

one at a time system, stream, stream task

Page 34: Spark Streaming into context

Much better, but still…

• Introduces problems:

1. Still no full equivalence between batch and streaming

2. Out-of-order management and early reporting have to be coded

3. Custom windowing code needs to be mixed with business logic

4. Micro-batches impose a lower limit on latency

Page 35: Spark Streaming into context

Spark: batch and streaming

Page 36: Spark Streaming into context

Spark: batch and streaming

Page 37: Spark Streaming into context

Lambda architecture?

Page 38: Spark Streaming into context

Out of order

Latency is unpredictable

Page 39: Spark Streaming into context

Our aim

Page 40: Spark Streaming into context

Final Spark (1)

Page 41: Spark Streaming into context

Final Spark (2)

Page 42: Spark Streaming into context

Batch vs. Streaming

Data Streaming

Page 43: Spark Streaming into context

Data Batch

Batch vs. Streaming

Page 44: Spark Streaming into context

Data Batch

Batch vs. Streaming

A batch pipeline IS a streaming pipeline applied to a finite stream!
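A small Python sketch of this claim: one pipeline, written once against an iterator, is run unchanged on a finite collection (the batch case) and on an unbounded generator (the streaming case). The names `running_mean` and `first_n` are hypothetical, and the running mean is only an assumed example computation.

```python
import itertools
from typing import Iterable, Iterator, List


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Streaming pipeline: emit the mean of all values seen so far."""
    total = 0.0
    for n, x in enumerate(stream, start=1):
        total += x
        yield total / n


def first_n(stream: Iterator[float], n: int) -> List[float]:
    """Take the first n results from a possibly endless stream."""
    return list(itertools.islice(stream, n))


# Batch: the stream is finite, so the pipeline simply runs to the end.
batch_result = list(running_mean([2, 4, 6]))

# Streaming: the source never ends, so we only observe a prefix of the results.
stream_result = first_n(running_mean(itertools.count(2, 2)), 3)
```

Both calls produce the same values for the same prefix of input; the only difference is whether the source ever finishes.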

Page 45: Spark Streaming into context

Event time + Processing time

Processing time

Event time Business logic

+

Page 46: Spark Streaming into context

Event time + Processing time

Processing time Event time

Business logic

+
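The difference between the two notions of time shows up as soon as events arrive out of order. The sketch below counts the same events in tumbling windows, once keyed by event time and once by processing time. The `Event` type, the `window_counts` helper, and the sample timestamps are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, NamedTuple


class Event(NamedTuple):
    event_time: int       # when the event happened (seconds)
    processing_time: int  # when the system observed it (seconds)


def window_counts(events: Iterable[Event], window_size: int,
                  key: Callable[[Event], int]) -> Dict[int, int]:
    """Count events per tumbling window; windows are identified by their start."""
    counts: Counter = Counter()
    for event in events:
        window_start = (key(event) // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)


events = [
    Event(event_time=1, processing_time=2),
    Event(event_time=12, processing_time=13),
    Event(event_time=8, processing_time=14),   # late: happened before the previous one
    Event(event_time=15, processing_time=16),
]

by_event_time = window_counts(events, 10, key=lambda e: e.event_time)
by_processing_time = window_counts(events, 10, key=lambda e: e.processing_time)
```

With these sample events, the late event is counted in the first window under event time but in the second window under processing time, so the two views disagree about what happened when.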

Page 47: Spark Streaming into context

Plethora of options?

Page 48: Spark Streaming into context

Beam/Dataflow

Page 49: Spark Streaming into context

Beam/Dataflow

Page 50: Spark Streaming into context

Beam/Dataflow

Page 51: Spark Streaming into context

Apache Beam

Streaming API

Execution engine

Page 52: Spark Streaming into context

Apache Beam

Streaming API

!

!

!

Execution engine

http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html

Page 53: Spark Streaming into context

Apache Beam

Kostas Tzoumas, Data artisans

Tyler Akidau, Beam PMC

Page 54: Spark Streaming into context

Other considerations

Maturity ? -

Ecosystem - -

Community -

Ops - -

Page 55: Spark Streaming into context

Other considerations

• Flow of the experiment:

• Read an event from Kafka.

• Deserialize the JSON string.

• Filter out irrelevant events.

• Take a projection of the relevant fields

• Join each event with its associated campaign (from Redis).

• Take a windowed count of events per campaign and store each window in Redis along with a last updated timestamp (with late events).
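The steps above can be sketched in plain Python with in-memory stand-ins: a list of JSON strings plays the role of the Kafka topic and dictionaries play the role of Redis. The field names (`event_type`, `ad_id`, `event_time`), the "view" filter, and the 10-second window are assumptions made for illustration and are not stated on the slide.

```python
import json
from typing import Dict, Iterable, Tuple

WINDOW_MS = 10_000  # assumed window length in milliseconds

Store = Dict[Tuple[str, int], Dict[str, int]]


def process(raw_events: Iterable[str], ad_to_campaign: Dict[str, str],
            store: Store, now_ms: int) -> Store:
    """Run the experiment flow over raw JSON events and update the store."""
    for raw in raw_events:                           # 1. read an event (Kafka stand-in)
        event = json.loads(raw)                      # 2. deserialize the JSON string
        if event.get("event_type") != "view":        # 3. filter out irrelevant events
            continue
        ad_id = event["ad_id"]                       # 4. project the relevant fields
        event_time = int(event["event_time"])
        campaign = ad_to_campaign.get(ad_id)         # 5. join with campaign (Redis stand-in)
        if campaign is None:
            continue
        # 6. windowed count keyed by event time, so a late event still
        #    updates the window it belongs to.
        window_start = event_time - event_time % WINDOW_MS
        entry = store.setdefault((campaign, window_start),
                                 {"count": 0, "last_updated": now_ms})
        entry["count"] += 1
        entry["last_updated"] = now_ms
    return store
```

Keying windows by event time is one way to handle the late events mentioned in the last step; the last-updated timestamp then records when a window was most recently touched.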

Page 56: Spark Streaming into context

Resources

• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

• https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison

• http://data-artisans.com/why-apache-beam/#more-710

• http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/

• http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html

• https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

Page 57: Spark Streaming into context

Thanks for listening!