Spark Streaming into context
David Martinez Rego, 20th of October 2016
About me
• PhD in ML, 2013: predictive maintenance of windmills
• Lived in London since then
• Postdoc @ UCL
• Teaching and Mentoring @ UCL internships inside financial institutions
• Consulting on Data analytics
• Early Startup
Plethora of options?
Wishlist
• Easy to compose complex pipelines
• Easy scaling out
• Interoperable with a large ecosystem
• Low latency and high throughput
• Monitoring
Flume
• Its mechanism of scaling to different machines is managed in an ad hoc way
• Good for simple, custom data gathering from external sources, dropping the data at the perimeter for further processing
Lessons learnt
• Each project added some good ideas when they were most needed
• Eventually, all platforms have absorbed the best ideas from peers
• It seems that we have a winner, for now?
Time view and pipelining/composition, by platform:
• Storm: one at a time; spouts and bolts
• Spark: RDD
Storm basic model
[Figure: a Topology with two Spouts and four Bolts; connections labelled s.g.]
Guarantees and fault tolerance
[Figure labels: Anchoring (ANCH), ACK; Spout, Bolt, Topology]
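The anchoring and ACK labels above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Storm's API: the class `AckTracker`, its method names and the tuple ids are all hypothetical. It assumes that each spout message roots a tree of tuples, that bolts anchor the tuples they emit to that tree, and that the message counts as fully processed only once every tuple in the tree has been acked; a failure marks the root for replay.

```python
class AckTracker:
    """Toy tracker of tuple trees (hypothetical, NOT Storm's API)."""

    def __init__(self):
        self.pending = {}  # root id -> set of tuple ids not yet acked
        self.done = []     # root ids whose whole tree was acked
        self.replay = []   # root ids the spout should emit again

    def emit_root(self, root_id):
        # The spout emits a new message; it is the root of its own tree.
        self.pending[root_id] = {root_id}

    def anchor(self, root_id, child_id):
        # A bolt emits child_id anchored to the tree of root_id.
        if root_id in self.pending:
            self.pending[root_id].add(child_id)

    def ack(self, root_id, tuple_id):
        # A bolt finished with tuple_id; the tree completes when it is empty.
        tree = self.pending.get(root_id)
        if tree is None:
            return
        tree.discard(tuple_id)
        if not tree:
            del self.pending[root_id]
            self.done.append(root_id)

    def fail(self, root_id):
        # Any failure in the tree sends the root message back for replay.
        if self.pending.pop(root_id, None) is not None:
            self.replay.append(root_id)


tracker = AckTracker()

# Message m1: the bolt anchors a child before acking its input.
tracker.emit_root("m1")
tracker.anchor("m1", "m1-a")
tracker.ack("m1", "m1")
tracker.ack("m1", "m1-a")

# Message m2: a downstream bolt fails, so the whole tree is replayed.
tracker.emit_root("m2")
tracker.anchor("m2", "m2-a")
tracker.ack("m2", "m2")
tracker.fail("m2")
```

The order matters in this sketch: anchoring the child before acking the parent keeps the tree non-empty while work is still in flight.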
Lambda architecture
Time view and pipelining/composition, by platform:
• Storm: one at a time; spouts and bolts
• Spark: RDD
• Samza: one at a time; system, stream, stream task
Samza
Kappa architecture
Time view and pipelining/composition, by platform:
• Storm: one at a time; source, spouts, bolts and ack
• Spark: RDD
• Samza: one at a time; system, stream, stream task
[Figure labels: RDD; Microbatch; Init + connect to source; pipeline; computation + state mgmt.]
Much better, but still…
• Introduces problems:
1. Still no full equivalence between batch and streaming
2. Out-of-order management and early reporting have to be coded by hand
3. Custom windowing code has to be mixed with business logic
4. Micro-batches impose a lower limit on latency
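Point 4 can be made concrete with a little arithmetic. The sketch below is a simplified model, not Spark code: it assumes fixed batch intervals aligned at zero, zero processing time, and that an event is only handled when its batch closes. The function name `microbatch_latencies` and the sample arrival times are made up for illustration.

```python
def microbatch_latencies(arrivals_ms, batch_interval_ms):
    """Return (arrival, batch_end, latency) for each event, in milliseconds."""
    rows = []
    for t in arrivals_ms:
        # The event lands in batch [k * interval, (k + 1) * interval)
        # and waits until that batch closes before it is processed.
        batch_end = (t // batch_interval_ms + 1) * batch_interval_ms
        rows.append((t, batch_end, batch_end - t))
    return rows


rows = microbatch_latencies([100, 400, 900, 1000], batch_interval_ms=1000)
latencies = [latency for _, _, latency in rows]
mean_latency = sum(latencies) / len(latencies)
```

Under these assumptions an event waits anywhere up to a full interval, so shrinking the latency means shrinking the batch interval, which is the limit the slide refers to.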
Spark: batch and streaming
Lambda architecture?
Out of order
Latency is unpredictable
Our aim
Final Spark (1)
Final Spark (2)
Batch vs. Streaming
[Figure labels: Data; Streaming; Batch]
A batch pipeline IS a streaming pipeline applied to a finite stream!
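A small illustration of that claim, using plain Python generators rather than any streaming engine. The function `pipeline` and its logic (keep even numbers, square them, emit running totals) are invented for the example; the point is only that the same code runs unchanged on a finite list and on an unbounded source.

```python
from itertools import count, islice


def pipeline(events):
    """Streaming pipeline: keep even numbers, square them, emit running totals."""
    total = 0
    for x in events:
        if x % 2 == 0:
            total += x * x
            yield total


# Batch: the input is finite, so the last emitted value is the batch result.
batch_result = list(pipeline([1, 2, 3, 4]))[-1]

# Streaming: the input never ends, so we only look at the first two results.
stream_prefix = list(islice(pipeline(count(1)), 2))
```

Both runs see the same first four inputs, so the batch result equals the last value of the streaming prefix.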
Event time + Processing time
[Figure labels: Processing time; Event time; Business logic; +]
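To see why the two notions of time differ, here is a minimal sketch in plain Python. The sample events, the 10-second tumbling window and the helper `window_counts` are assumptions made for illustration; the second event is deliberately late, arriving 13 seconds after it happened.

```python
from collections import Counter

# (event_time, processing_time) in seconds; the second event arrives late.
events = [(1, 2), (8, 21), (12, 13), (14, 15)]
WINDOW = 10  # tumbling window size in seconds (illustrative)


def window_counts(times, size):
    """Count timestamps per tumbling window, keyed by window start."""
    return dict(Counter((t // size) * size for t in times))


by_event_time = window_counts((e for e, _ in events), WINDOW)
by_processing_time = window_counts((p for _, p in events), WINDOW)
```

Grouping by event time puts the late event back in the window where it happened; grouping by processing time counts it wherever it happened to arrive, so the two results disagree.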
Beam/Dataflow
Apache Beam
[Figure labels: Streaming API; Execution engine]
http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
Apache Beam
Kostas Tzoumas, Data artisans
Tyler Akidau, Beam PMC
Other considerations
[Comparison table with rows Maturity, Ecosystem, Community and Ops; cell values not recoverable]
Other considerations
• Flow of the experiment:
• Read an event from Kafka.
• Deserialize the JSON string.
• Filter out irrelevant events.
• Take a projection of the relevant fields.
• Join each event with its associated campaign (from Redis).
• Take a windowed count of events per campaign and store each window in Redis along with a last updated timestamp (with late events).
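The steps of the experiment can be sketched end to end in plain Python. Everything concrete here is an assumption: a list stands in for the Kafka topic, two dicts stand in for Redis, and the field names (`ad_id`, `event_type`, `event_time`), the "view" filter and the 10-second window are hypothetical choices, not taken from the slides.

```python
import json
from collections import defaultdict

WINDOW_MS = 10_000  # hypothetical window size

# Stand-ins: a list for the Kafka topic, dicts for the two Redis roles.
kafka_messages = [
    '{"ad_id": "a1", "event_type": "view", "event_time": 1000, "ip": "x"}',
    '{"ad_id": "a2", "event_type": "click", "event_time": 2000, "ip": "y"}',
    '{"ad_id": "a2", "event_type": "view", "event_time": 12000, "ip": "z"}',
    '{"ad_id": "a1", "event_type": "view", "event_time": 3000, "ip": "w"}',  # late
]
ad_to_campaign = {"a1": "c1", "a2": "c2"}
window_store = defaultdict(lambda: {"count": 0, "last_updated": None})


def process(raw, now_ms):
    event = json.loads(raw)                    # deserialize the JSON string
    if event["event_type"] != "view":          # filter out irrelevant events
        return
    ad_id = event["ad_id"]                     # projection of relevant fields
    event_time = event["event_time"]
    campaign = ad_to_campaign.get(ad_id)       # join with the campaign
    if campaign is None:
        return
    # Windowed count keyed by event time, so late events update old windows.
    window_start = (event_time // WINDOW_MS) * WINDOW_MS
    entry = window_store[(campaign, window_start)]
    entry["count"] += 1
    entry["last_updated"] = now_ms


for i, raw in enumerate(kafka_messages):       # read events one by one
    process(raw, now_ms=20_000 + i)
```

The last message happened at 3000 ms but is read after the 12000 ms event; because the window key comes from the event time, it still increments the first window of campaign c1 and refreshes that window's last updated timestamp.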
Resources
• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
• https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
• https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
• http://data-artisans.com/why-apache-beam/#more-710
• http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
• http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
• https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Thanks for listening!