A Tale of Two APIs: Using Spark Streaming In Production


Transcript of A Tale of Two APIs: Using Spark Streaming In Production

Page 1
Page 2

Gerard Maas
Señor SW Engineer, Computer Engineer

Scala Developer
Early Spark Adopter (v0.9)

Cassandra MVP (2015, 2016)

Stack Overflow Top Contributor (Spark, Spark Streaming, Scala)

Wannabe { IoT Maker, Drone crasher/tinkerer }

@maasg

https://github.com/maasg

https://www.linkedin.com/in/gerardmaas/

https://stackoverflow.com/users/764040/maasg

Page 3
Page 4

Streaming | Big Data

Page 5

100 TB | 5 MB

Page 6

100 TB | 5 MB/s

Page 7

∑ Stream = Dataset

∂ Dataset = Stream

Page 8

What is Spark and Why We Should Care

Streaming APIs in Spark
- Structured Streaming Overview
- Interactive Session 1
- Spark Streaming Overview
- Interactive Session 2

Spark Streaming [AND|OR|XOR] Structured Streaming

Page 9

Once upon a time...

Page 10

[Diagram: the Apache Spark stack. Apache Spark Core and Spark SQL at the base, Data Sources underneath, and on top: Spark MLLib, Spark Streaming, Structured Streaming, DataFrames, Datasets, GraphFrames.]

Page 11

[Same Apache Spark stack diagram as Page 10.]

Page 12

1 Structured Streaming

Page 13

[Diagram: Structured Streaming. Sources (Kafka, Sockets, HDFS/S3, Custom) feed a streaming DataFrame; a continuous Query writes results to sinks (Kafka, Files, foreach, console, memory), governed by an OutputMode.]
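In code, that anatomy reads as source -> streaming DataFrame -> query -> sink. A minimal sketch, not from the slides, assuming a local SparkSession named spark and using the built-in socket source and console sink for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("anatomy")
  .master("local[*]")
  .getOrCreate()

// Source: every line received on the socket becomes a row of the unbounded DataFrame.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

// Query: plain DataFrame operations, evaluated incrementally on each trigger.
val counts = lines.groupBy("value").count()

// Sink + OutputMode: "complete" re-emits the full aggregate on every trigger.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()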

Page 14

[Diagram, Demo Scenario: a SensorData Producer (Local Process) feeds a Structured Streaming job running in a Spark Notebook on the Fast Data Platform.]

Page 15

1 Structured Streaming

HIGHLIGHTS

Page 16

val rawData = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", sourceTopic)
  .option("startingOffsets", "latest")
  .load()

Sources

Page 17

Operations

...
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
val jsonValues = rawValues.select(from_json($"value", schema) as "record")
val sensorData = jsonValues.select("record.*").as[SensorData]
...

Page 18

Event Time

...
val movingAverage = sensorData
  .withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
  .withWatermark("timestamp", "30 seconds")
  .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
  .agg(avg($"temp"))
...

Page 19

Sinks

...
val visualizationQuery = sensorData.writeStream
  .queryName("visualization") // this query name will be the SQL table name
  .outputMode("update")
  .format("memory")
  .start()

...
val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter")
  .outputMode("append")
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint")
  .start()

Page 20

Use Cases
● Streaming ETL
● Stream aggregations, windows
● Event-time oriented analytics
● Join Streams with Fixed Datasets (see the sketch below)
● Apply Machine Learning Models
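For the stream-static join case, a hedged sketch building on the sensorData stream above; the reference path, table, and join column are assumptions for illustration:

// Static reference Dataset, read once (path and schema are assumed).
val deviceInfo = sparkSession.read.parquet("/data/device-info")

// Stream-static join: each micro-batch of the stream is joined against the fixed table.
val enriched = sensorData.join(deviceInfo, "id")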

Page 21

2 Spark Streaming

Page 22

[Diagram: Spark Streaming. Sources (Kafka, Flume, Kinesis, Twitter, Sockets, HDFS/S3, Custom) feed Spark Streaming, which runs on Apache Spark (Spark SQL, Spark ML, ...) and writes out to Databases, HDFS, API Servers, and Streams.]

Page 23

[Diagram: a DStream[T] is a sequence of RDD[T], one per batch interval (t0, t1, t2, t3, ..., ti, ti+1). A transformation T -> U turns each RDD[T] into an RDD[U]; actions are applied to every resulting RDD.]

Page 24

API: Transformations

map, flatMap, filter
count, reduce, countByValue, reduceByKey
union, join, cogroup
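A small sketch of these operators over a DStream of text lines; the lines stream is an assumption, standing in for any of the sources above:

// Assuming lines: DStream[String] from any receiver.
val words  = lines.flatMap(_.split(" "))   // flatMap: one line -> many words
val pairs  = words.map(word => (word, 1))  // map into key/value pairs
val counts = pairs.reduceByKey(_ + _)      // reduceByKey: per-batch word counts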

Page 25

API: Transformations

mapWithState
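mapWithState maintains per-key state across micro-batches. A minimal sketch of a running count, reusing the pairs: DStream[(String, Int)] from the previous sketch:

import org.apache.spark.streaming.{State, StateSpec}

// Mapping function: receives the key, the new value (if any) and the mutable state.
def updateCount(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  val sum = state.getOption.getOrElse(0) + value.getOrElse(0)
  state.update(sum) // persist the running count for this key
  (key, sum)
}

val runningCounts = pairs.mapWithState(StateSpec.function(updateCount _))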

Page 26

API: Transformations

transform

val iotDstream = MQTTUtils.createStream(...)
val devicePriority = sparkContext.cassandraTable(...)
val prioritizedDStream = iotDstream.transform{ rdd =>
  rdd.map(d => (d.id, d)).join(devicePriority)
}

Page 27

Actions

print:
-------------------------------------------
Time: 1459875469000 ms
-------------------------------------------
data1
data2

saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles

foreachRDD *

Page 28

Actions (continued)

foreachRDD * -> Spark SQL, DataFrames, GraphFrames, any API
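That escape hatch is what makes foreachRDD so powerful: inside it, each micro-batch RDD can be lifted into a DataFrame and handed to any batch API. A sketch, reusing the words stream from the earlier sketch:

import org.apache.spark.sql.SparkSession

words.foreachRDD { rdd =>
  // Re-obtain the session on the driver; this body runs once per micro-batch.
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._

  val df = rdd.toDF("word")          // the micro-batch as a DataFrame
  df.createOrReplaceTempView("words")
  spark.sql("SELECT word, count(*) AS total FROM words GROUP BY word").show()
}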

Page 29

[Same Demo Scenario diagram as Page 14.]

Page 30

2 Spark Streaming

HIGHLIGHTS

Page 31

import org.apache.spark.streaming.StreamingContext

val streamingContext = new StreamingContext(sparkContext, interval)

Streaming Context
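One production gotcha worth spelling out: the context only does work once it is explicitly started, and no DStreams can be declared after that point. A sketch of the usual lifecycle:

// Declare all DStream transformations and outputs first, then:
streamingContext.start()             // start receiving and processing data
streamingContext.awaitTermination()  // block the driver until stopped or failed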

Page 32

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> kafkaBootstrapServer,
  "group.id" -> "sensor-tracker-group",
  "auto.offset.reset" -> "largest",
  "enable.auto.commit" -> (false: java.lang.Boolean).toString
)

val topics = Set(topic)

@transient val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)

Source
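Note that this is the Kafka 0.8 direct API (StringDecoder, metadata.broker.list). With the newer spark-streaming-kafka-0-10 connector, the equivalent source would look roughly like this (a sketch; the parameter values mirror the slide):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val kafkaParams010 = Map[String, Object](
  "bootstrap.servers" -> kafkaBootstrapServer,
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "sensor-tracker-group",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream010 = KafkaUtils.createDirectStream[String, String](
  streamingContext, PreferConsistent, Subscribe[String, String](Set(topic), kafkaParams010))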

Page 33

import spark.implicits._

val sensorDataStream = stream.transform{ rdd =>
  val jsonData = rdd.map{ case (k, v) => v }
  val ds = sparkSession.createDataset(jsonData)
  val jsonDF = spark.read.json(ds)
  val sensorDataDS = jsonDF.as[SensorData]
  sensorDataDS.rdd
}

Transformations

Page 34

val model = new M2Model()

model.trainOn(inputData)

val scoredDStream = model.predictOnValues(inputData)

Model
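M2Model here stands in for MLlib's streaming models, which all follow this trainOn/predictOnValues shape. A concrete, hedged sketch with StreamingKMeans; the k, the vector dimension, and the two input streams are assumptions:

import org.apache.spark.mllib.clustering.StreamingKMeans

// Assuming trainingData: DStream[Vector] and testData: DStream[(Long, Vector)].
val kmeans = new StreamingKMeans()
  .setK(3)                   // number of clusters
  .setDecayFactor(1.0)       // how quickly older data is forgotten
  .setRandomCenters(2, 0.0)  // vector dimension, initial center weight

kmeans.trainOn(trainingData)             // update the centers on every micro-batch
kmeans.predictOnValues(testData).print() // score a keyed stream as it arrives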

Page 35

suspects.foreachRDD{ rdd =>
  val sample = rdd.take(20).map(_.toString)
  val total = s"total found: ${rdd.count}"
  outputBox(total +: sample)
}

Output

Page 36

Use Cases

● Stream-stream joins (see the sketch below)
● Complex state management (local + cluster state)
● Streaming Machine Learning
  ○ Learn
  ○ Score
● Join Streams with Updatable Datasets
● [-] Event-time oriented analytics
● [-] Continuous processing
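In this API a stream-stream join is a per-micro-batch pair join: the RDDs of the two DStreams for the same batch interval are joined by key. A sketch, assuming two keyed streams (the names and types are illustrative):

import org.apache.spark.streaming.dstream.DStream

// Assuming temperatures: DStream[(String, Double)] and
// locations: DStream[(String, String)], both keyed by sensor id.
val joined: DStream[(String, (Double, String))] = temperatures.join(locations)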

Page 37

Structured Streaming +

Page 38

Spark Streaming + Structured Streaming

val parse: Dataset[String] => Dataset[Record] = ???
val process: Dataset[Record] => Dataset[Result] = ???
val serialize: Dataset[Result] => Dataset[String] = ???

Structured Streaming:

val kafkaStream = spark.readStream…
val f = parse andThen process andThen serialize
val result = f(kafkaStream)
result.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", writeTopic)
  .option("checkpointLocation", checkpointLocation)
  .start()

Spark Streaming:

val dstream = KafkaUtils.createDirectStream(...)
dstream.foreachRDD{ rdd =>
  val ds = sparkSession.createDataset(rdd)
  val f = parse andThen process andThen serialize
  val result = f(ds)
  result.write.format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("topic", writeTopic)
    .option("checkpointLocation", checkpointLocation)
    .save()
}

Page 39

Streaming Pipelines (example)

[Diagram: a Structured Streaming pipeline with stages Keyword Extraction, Keyword Relevance, Similarity, and DB Storage.]

Page 40

Structured Streaming

New Project?

80% | 20%

Page 41

lightbend.com/fast-data-platform

Page 42

Features

1. One-click component installations

2. Automatic dependency checks

3. One-click access to install logs

4. Real-time cluster visualization

5. Access to consolidated production logs

Benefits:

1. Easy to get started

2. Ready access to all components

3. Increased developer velocity

Fast Data Platform Manager, for Managing Running Clusters

Page 43

lightbend.com/learn

Page 44

If you’re serious about having end-to-end monitoring for your Fast Data and streaming applications,

let’s chat!

SET UP A 20-MIN DEMO