Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming...

49
Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall lenadroid

Transcript of Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming...

Page 1: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Distributed systems for stream processing

Apache Kafka and Spark Structured Streaming

Alena Hall lenadroid

Page 2: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

ü Large-scaledataprocessingü DistributedSystemsü FunctionalProgrammingü DataScience&MachineLearning

Alena Hall - lenadroid

Page 3: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Natallia Dzenisenka

nata_dzen bit.ly/oscon-17

Page 4: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Ever-increasing

Data

lenadroid

Page 5: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Direct result of some action

lenadroid

Page 6: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Produced as a side effect

lenadroid

Page 7: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Continuous indicators

lenadroid

Page 8: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Reaction

urgent not-so-urgent flexible

lenadroid

Page 9: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Reaction

urgent not-so-urgent flexible

near-real-time~ seconds

real-time~ sub milliseconds

batch~minutes, hours, days, weeks

lenadroid

Page 10: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Event Ingestion Processing & Reaction

real-time micro-batch

batch

lenadroid

Page 11: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Are data workflows flexible enough?

lenadroid

Page 12: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Challenges

Simplicity. Scalability. Reliability

lenadroid

Page 13: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Meet Apache Kafka

lenadroid

Page 14: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Apache Kafka is an open-source stream-

processing software platform developed by the

Apache Software Foundation written in Scala

and Java.

lenadroid

Page 15: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Kafka Brokers

lenadroid

Page 16: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Inside of a Kafka Topic

0 1 2 3 4

0 1 2 3

80 1 2 3 4 5 6 7

lenadroid

Page 17: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Kafka Topic Partition

80 1 2 3 4 5 6 7

lenadroid

Page 18: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Kafka Producers and Consumers

lenadroid

Page 19: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Systems for stream processing

Kafka Streams

Storm

Spark

Flink

lenadroid

Page 20: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Meet Apache Spark

lenadroid

Page 21: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Apache Spark is a unified analytics engine for large-scale data

processing: batch, streaming, machine learning, graph

computation with access to data in hundreds of sources.

lenadroid

Page 22: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

ü Spark SQL and batch processing

ü Stream processing with Spark Streaming and Structured

Streaming

ü * Continuous processing

ü Machine Learning with Mllib

ü Graph computations with GraphX

* Experimental lenadroid

Page 23: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

How does Spark work?

lenadroid

Page 24: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Sparkapplication(Driver)

ClusterManager

task

task

task

task

task

task

task

task

task

lenadroid

Page 25: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Apache Kafka + Apache Spark

lenadroid

Page 26: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Existing infrastructure and resources

ü Kafka cluster (HDInsight or other)

ü Spark cluster (Azure Databricks workspace, or other)

ü Peered Kafka and Spark Virtual Networks

ü Sources of data: Twitter & Slack & Nomics APIs

lenadroid

Page 27: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Databricks: Interactive Environment

lenadroid

Page 28: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Processing crypto currency trading data

Example

lenadroid

Page 29: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

markets exchanges trades

ETH / BTC

BTC / USDT

Bitfinex

Binance

lenadroid

base quote

Page 30: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

markets exchanges trades

lenadroid

{"volume":"5","price":"3.0871","id":"123456","timestamp":"2018-07-17T17:00:00.00Z"

}

Page 31: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Indicators to watch and act on

ü Price spikes (all-time high, all-time low)

ü Significant changes in price or volume of trades

ü Profitability of potential trade at current moment

ü Price or volume of trades crossing given threshold during the past X minutes

ü Morelenadroid

Page 32: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Getting trades data from API

ü Market and exchange dataü Trades data for given market and base/quote currenciesü Sending data to Kafka

lenadroid

Page 33: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Processing trades

ü Consuming data coming from Kafka topicsü Watching relevant indicators

lenadroid

Page 34: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

More examples?

Processing streams of events from multiple sources with Apache Kafka and Spark

lenadroid

Page 35: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Data sources: external, internal, ...

• Big number of data sources

• Most of the data sources are independent

• Sources of data used for many processing tasks & end-goals

lenadroid

Page 36: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Feedback from Slack

ü Sending messages to Slack

lenadroid

Page 37: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Listener for new Slack messages

ü Messagesunderspecificchannelsü Focusedonaparticulartopicü SenttoaspecificKafkatopic

lenadroid

Page 38: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Receiving events in Kafka topic

ü SparkconsumerforKafkatopicsü SendingonlytopicrelatedmessagestoKafka

lenadroid

Page 39: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Sending Twitter feedback to Kafka

ü GettinglatesttweetsaboutspecifictopictoKafkaü ReceivingthoseeventsfromKafkainSpark

lenadroid

Page 40: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Analyzing feedback in real-time

ü Kafkaisreceivingeventsfrommanysourcesü SentimentanalysisonincomingKafkaeventsü Sentiment<=0.3à #negative-feedback forreviewü Sentiment>=0.9à #positive-feedback channel

lenadroid

Page 41: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Kafka + Spark = Reliable, scalable, durable event ingestion

and efficient stream processing

lenadroid

Page 42: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

lenadroid

Page 43: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

trigger(Trigger.Continuous("1 second"))

Low (~1 ms) end-to-end latency

At-least-once fault-tolerance guarantees

Not nearly all operations are supported yet

No automatic retries of failed tasks

Needs enough cluster power to operate

lenadroid

Page 44: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

EveryXseconds EveryXseconds EveryXseconds

WheneventIsatsource

Wheneventisprocessedtosink

Check-pointingepoch

lenadroid

Page 45: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

EveryXseconds EveryXseconds EveryXseconds

WheneventIsatsource

Wheneventisprocessedtosink

Check-pointingepoch

~1ms

lenadroid

Page 46: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

aka.ms/eventhubs-kafkalenadroid

Page 47: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Operator

Operators

lenadroid

Page 48: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

Thank you!

Apache Kafka: aka.ms/apache-kafka

Apache Spark: aka.ms/apache-sparkEvent stream processing architecture on Azure with Apache Kafka and Spark: aka.ms/kafka-spark-azure and aka.ms/oscon-18

Create HDInsight Kafka cluster using ARM: aka.ms/hdi-kafka-arm

Create Kafka topics in HDInsight: aka.ms/hdi-kafka-topic

lenadroid

Page 49: Distributed systems for stream processing 2018.pdf · Apache Kafka and Spark Structured Streaming Alena Hall lenadroid. ü Large-scale data processing ü Distributed Systems ü Functional

ü Works on Azure at ü Lives in Seattleü F# Software Foundation Board of Trusteesü Organizes @ML4ALLü Program Committee for Lambda World ü Has a channel: /c/AlenaHall

Alena Hall - lenadroid