Structured Streaming in - BI Consulting (biconsulting.hu/letoltes/2018budapestdata/toth...)

Budapest Data Forum, 2018 Structured Streaming in

Spark / Big Data / Cloud Computing Trainings

Building Data Infrastructures for Industry 4.0 & Online

Why Real-time?

Why Spark Streaming?

Why Real-time?

How to choose a streaming tool?

The Apache landscape

streams

Sometimes you just want to keep it simple

Remember this from 1 hour ago?

So, our fancy tools

streams

How to choose a fancy streaming tool?

Popularity

See the bigger picture

Throughput

source: https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html

*as the Spark folks measured it

Throughput

source: https://data-artisans.com/blog/curious-case-broken-benchmark-revisiting-apache-flink-vs-databricks-runtime

*as the Flink folks measured it

Developers!

Latency

Native Streaming

(event-based processing)

vs

Microbatching

streams

trident

https://www.theguardian.com/technology/2014/feb/05/why-google-engineers-designers

Structured Streaming

Pain points to solve

• Interoperability
  batch, interactive and real-time analytics

• Event time based processing
  event time instead of processing time

• End-to-end guarantees
  consistent data throughout the whole pipeline
  exactly-once processing

Unbounded Table

image credit: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
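
The unbounded-table idea can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not Spark's API: the function name `running_word_counts` and the sample batches are made up for illustration. Each batch stands for new rows appended to the input table, and the query result is refreshed after every batch.

```python
from collections import Counter


def running_word_counts(batches):
    """Yield the result table after each batch of newly appended rows.

    Hypothetical helper: it only mimics the "stream as an unbounded
    table" model and is not how Spark implements it.
    """
    counts = Counter()                  # state carried over between triggers
    for batch in batches:               # each batch = rows appended to the input table
        for line in batch:
            counts.update(line.split())
        yield dict(counts)              # full result table as of this trigger


# Two triggers' worth of input rows (assumed sample data).
batches = [["cat dog", "dog dog"], ["owl cat"]]
results = list(running_word_counts(batches))
```

The query itself is written as if it ran over one static table; only the runtime knows that the table keeps growing.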

Pain points to solve

• Interoperability
  batch, interactive and real-time analytics

• Event time based processing
  event time instead of processing time

• End-to-end guarantees
  consistent data throughout the whole pipeline
  exactly-once processing

Late data

Handling late data with Watermarking
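
A rough way to picture watermarking is the plain-Python sketch below. It is a simplification under stated assumptions, not Spark's implementation: the watermark is taken as the largest event time seen so far minus an allowed delay, windows are tumbling, and an event is dropped once its window ends at or before the watermark. The names `windowed_counts`, `WINDOW` and `DELAY` and the sample events are hypothetical. In Spark itself the delay is declared with `withWatermark` on a streaming DataFrame.

```python
from collections import defaultdict

WINDOW = 10  # tumbling window length in seconds (assumed value)
DELAY = 5    # how late an event may arrive, in seconds (assumed value)


def windowed_counts(events, window=WINDOW, delay=DELAY):
    """Count events per (window_start, key), dropping data that is too late.

    `events` is a list of (event_time, key) pairs in arrival order.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, key in events:
        watermark = max_event_time - delay
        window_start = event_time - event_time % window
        if window_start + window <= watermark:
            # The window is already considered final: ignore the late event.
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1
        max_event_time = max(max_event_time, event_time)
    return dict(counts), dropped


# Event times 8 and 9 both arrive out of order; only 9 is too late.
events = [(1, "a"), (12, "a"), (8, "a"), (27, "a"), (9, "a")]
counts, dropped = windowed_counts(events)
```

The event at time 8 is late but still within the allowed delay, so it updates its window. The event at time 9 arrives after the watermark has passed the end of its window and is discarded, which is what lets the engine forget old window state.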

Pain points to solve

• Interoperability
  batch, interactive and real-time analytics

• Event time based processing
  event time instead of processing time

• End-to-end guarantees
  consistent data throughout the whole pipeline
  exactly-once processing

The drama of Exactly-once processing (Act I)

Spark: give me data
Kafka: you were at the 10th line, there you go with the 11th.
Spark: got it, thanks! Consider line 11 done.
Spark: Hey Postgres, store the results please
OK!
Spark: give me data
Kafka: you were at the 11th line, there you go with the 12th.
...

The drama of Exactly-once processing (Act II)

Spark: give me data
Kafka: you were at the 12th line, there you go with the 13th.
Spark: got it, thanks! Consider line 13 done.
Spark: Hey Postgres, store the re.....
Claudius: Hey Spark, got thirsty? ;)
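
The failure in Act II can be replayed in a few lines of plain Python. This is a toy simulation under stated assumptions, not Spark, Kafka or Postgres code: `run`, `simulate`, `Crash` and the `crash_at` knob are hypothetical names, the source is a list of lines, the sink is a dict keyed by line number, and the "result" is just the upper-cased line. It contrasts committing the offset before the write (as in the dialogue) with one common remedy: write first with an idempotent upsert keyed by offset, then commit, and replay from the last committed offset after a restart.

```python
class Crash(Exception):
    """Simulated process failure between the two steps."""


def run(lines, committed, sink, commit_first, crash_at=None):
    """Process lines after `committed`; return the new committed offset."""
    offset = committed
    try:
        while offset < len(lines):
            nxt = offset + 1                  # 1-based line number
            result = lines[nxt - 1].upper()
            if commit_first:
                offset = nxt                  # "consider line N done"
                if crash_at == nxt:
                    raise Crash
                sink[nxt] = result            # never reached on a crash
            else:
                sink[nxt] = result            # idempotent upsert keyed by offset
                if crash_at == nxt:
                    raise Crash
                offset = nxt                  # commit only after the write
    except Crash:
        pass                                  # the process died; the offset survives
    return offset


def simulate(commit_first):
    """Crash once at line 2, restart from the committed offset, return the sink."""
    lines, sink = ["a", "b", "c"], {}
    committed = run(lines, 0, sink, commit_first, crash_at=2)
    run(lines, committed, sink, commit_first)  # restart after the crash
    return sink


lost = simulate(commit_first=True)    # line 2 is skipped after the restart
exact = simulate(commit_first=False)  # line 2 is replayed and overwritten
```

With commit-first ordering the crashed line is silently lost. With write-then-commit the line is processed twice, but because the write overwrites the same key, the sink ends up with each line exactly once.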

Demo

Summary

• Only use fancy tools if you need them ;)

• Structured Streaming

• Great Concept

• Access to core Spark functionalities

• Probably takes 1-2 years to make it feature-rich

Questions?

Zoltan Toth

+36 30 291 3599