A Tale of Two APIs: Using Spark Streaming In Production
Gerard Maas, Señor SW Engineer, Computer Engineer
Scala Developer, Early Spark Adopter (v0.9)
Cassandra MVP (2015, 2016)
Stack Overflow Top Contributor (Spark, Spark Streaming, Scala)
Wannabe { IoT Maker | Drone crasher/tinkerer }
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg
Streaming | Big Data
100TB vs. 5MB/s
∑ Stream = Dataset
∂ Dataset = Stream
● What is Spark and Why We Should Care
● Streaming APIs in Spark
  ○ Structured Streaming Overview
  ○ Interactive Session 1
  ○ Spark Streaming Overview
  ○ Interactive Session 2
● Spark Streaming [AND|OR|XOR] Structured Streaming
Once upon a time...
[Figure: the Apache Spark stack: Apache Spark Core at the base; Spark SQL with DataFrames and Datasets on top of it; Spark MLLib, Spark Streaming, Structured Streaming, and GraphFrames as libraries above; all fed by Data Sources]
1 Structured Streaming
[Figure: the Structured Streaming model: sources (Kafka, Sockets, HDFS/S3, Custom) produce a streaming DataFrame; a Query writes it to sinks (Kafka, Files, foreachSink, console, memory) under a chosen OutputMode]
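As a minimal sketch of that source-to-sink model, using the built-in "rate" test source and the console sink (spark is an active SparkSession, assumed to be in scope):

val rateStream = spark.readStream
  .format("rate")                       // test source: emits (timestamp, value) rows
  .option("rowsPerSecond", "10")
  .load()

val consoleQuery = rateStream.writeStream
  .format("console")                    // prints each micro-batch to stdout
  .outputMode("append")
  .start()

consoleQuery.awaitTermination()

Any source/sink pair in the figure follows this same readStream/writeStream pattern.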
[Figure: Demo Scenario: a SensorData Producer running as a local process feeds a Structured Streaming job in a Spark Notebook on the Fast Data Platform]
1 Structured Streaming
HIGHLIGHTS
Sources

val rawData = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", sourceTopic)
  .option("startingOffsets", "latest")
  .load()
Operations

…
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
val jsonValues = rawValues.select(from_json($"value", schema) as "record")
val sensorData = jsonValues.select("record.*").as[SensorData]
…
Event Time

…
val movingAverage = sensorData
  .withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
  .withWatermark("timestamp", "30 seconds")
  .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
  .agg(avg($"temp"))
…
Sinks

…
val visualizationQuery = sensorData.writeStream
  .queryName("visualization")  // this query name will be the SQL table name
  .outputMode("update")
  .format("memory")
  .start()
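While that query runs, the memory sink is queryable with plain Spark SQL. A quick usage sketch:

sparkSession.sql("SELECT * FROM visualization").show()  // table name comes from queryName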
…
val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter")  // this query name will be the table name
  .outputMode("append")
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint")
  .start()
Use Cases
● Streaming ETL
● Stream aggregations, windows
● Event-time oriented analytics
● Join streams with fixed datasets
● Apply machine learning models
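Joining a stream with a fixed dataset is a one-liner in this API. A sketch, assuming the sensorData stream from above and a hypothetical static deviceInfo table with an id column:

val deviceInfo = sparkSession.read.parquet("/data/devices")  // hypothetical static dataset
val enriched = sensorData.join(deviceInfo, "id")             // stream-static join on device id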
2 Spark Streaming
Spark Streaming
[Figure: Spark Streaming architecture: sources (Kafka, Flume, Kinesis, Sockets, HDFS/S3, Custom) feed Apache Spark, where DStreams interoperate with Spark SQL, Spark ML, and the other libraries; results flow to Databases, HDFS, API Servers, and downstream Streams]
[Figure: the DStream model: a DStream[T] is a sequence of RDD[T]s, one per batch interval (t0, t1, t2, …, ti, ti+1); a transformation T -> U maps each RDD[T] to a corresponding RDD[U], and actions run on each batch]
API: Transformations
● map, flatMap, filter
● count, reduce, countByValue, reduceByKey
● union, join, cogroup
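A quick sketch of those core transformations, assuming a hypothetical lines: DStream[String] (e.g. from socketTextStream):

val words = lines.flatMap(_.split(" "))                  // DStream[String] of words
val longWords = words.filter(_.length > 3)
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // per-batch word counts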
API: Transformations
mapWithState …
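mapWithState keeps arbitrary per-key state across batches. A minimal sketch, assuming a hypothetical readings: DStream[(String, Int)] of (sensor id, reading) pairs:

import org.apache.spark.streaming.{State, StateSpec}

// running total per sensor id, carried across batches
def updateTotal(id: String, reading: Option[Int], state: State[Long]): (String, Long) = {
  val total = state.getOption.getOrElse(0L) + reading.getOrElse(0)
  state.update(total)
  (id, total)
}

val totals = readings.mapWithState(StateSpec.function(updateTotal _))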
API: Transformations
transform
val iotDstream = MQTTUtils.createStream(...)
val devicePriority = sparkContext.cassandraTable(...)
val prioritizedDStream = iotDstream.transform { rdd =>
  rdd.map(d => (d.id, d)).join(devicePriority)
}
Actions
print:
-------------------------------------------
Time: 1459875469000 ms
-------------------------------------------
data1
data2
saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles
foreachRDD *: Spark SQL, DataFrames, GraphFrames, any API
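foreachRDD is the escape hatch into the full batch engine. A sketch of using Spark SQL inside it, with hypothetical names (dstream: DStream[String], sparkSession in scope):

dstream.foreachRDD { rdd =>
  import sparkSession.implicits._
  val df = rdd.toDF("value")                  // lift the batch RDD into a DataFrame
  df.createOrReplaceTempView("batch")
  sparkSession.sql("SELECT count(*) FROM batch").show()
}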
[Figure: Demo Scenario: the same SensorData Producer setup as before]
2 Spark Streaming
HIGHLIGHTS
Streaming Context

import org.apache.spark.streaming.{Seconds, StreamingContext}
val interval = Seconds(2)  // batch interval; the value here is an assumption
val streamingContext = new StreamingContext(sparkContext, interval)
Source

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> kafkaBootstrapServer,
  "group.id" -> "sensor-tracker-group",
  "auto.offset.reset" -> "largest",
  "enable.auto.commit" -> (false: java.lang.Boolean).toString
)

val topics = Set(topic)
@transient val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)
Transformations

import spark.implicits._
val sensorDataStream = stream.transform { rdd =>
  val jsonData = rdd.map { case (k, v) => v }   // keep only the message value
  val ds = sparkSession.createDataset(jsonData)
  val jsonDF = spark.read.json(ds)              // parse JSON with the SQL reader, per batch
  val sensorDataDS = jsonDF.as[SensorData]
  sensorDataDS.rdd
}
Model

val model = new M2Model()
…
model.trainOn(inputData)
…
val scoredDStream = model.predictOnValues(inputData)
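That trainOn/predictOnValues shape matches Spark MLlib's streaming algorithms. A sketch using the stock StreamingKMeans instead of the talk's M2Model, with hypothetical stream names:

import org.apache.spark.mllib.clustering.StreamingKMeans

val kmeans = new StreamingKMeans()
  .setK(3)                        // number of clusters
  .setDecayFactor(0.9)            // gradually forget old batches
  .setRandomCenters(2, 0.0)       // 2-dimensional feature vectors

kmeans.trainOn(trainingStream)                      // trainingStream: DStream[Vector]
val scored = kmeans.predictOnValues(keyedStream)    // keyedStream: DStream[(K, Vector)]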
Output

suspects.foreachRDD { rdd =>
  val sample = rdd.take(20).map(_.toString)
  val total = s"total found: ${rdd.count}"
  outputBox(total +: sample)
}
Use Cases
● Stream-stream joins
● Complex state management (local + cluster state)
● Streaming machine learning
  ○ Learn
  ○ Score
● Join streams with updatable datasets
● [-] Event-time oriented analytics
● [-] Continuous processing
Spark Streaming + Structured Streaming
val parse: Dataset[String] => Dataset[Record] = ???
val process: Dataset[Record] => Dataset[Result] = ???
val serialize: Dataset[Result] => Dataset[String] = ???
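The ??? placeholders are left open in the talk; purely as an illustration, here is one hypothetical way to fill the parse stage for JSON input (Record, its fields, and the schema derivation are all assumptions):

import org.apache.spark.sql.{Dataset, Encoders}
import org.apache.spark.sql.functions.from_json

case class Record(id: String, temp: Double)           // hypothetical record shape
val schema = Encoders.product[Record].schema

val parse: Dataset[String] => Dataset[Record] = raw => {
  import raw.sparkSession.implicits._
  raw.select(from_json($"value", schema) as "record") // a Dataset[String]'s column is "value"
    .select("record.*")
    .as[Record]
}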
val kafkaStream = spark.readStream…
val f = parse andThen process andThen serialize
val result = f(kafkaStream)
result.writeStream
.format("kafka")
.option("kafka.bootstrap.servers",bootstrapServers)
.option("topic", writeTopic)
.option("checkpointLocation", checkpointLocation)
.start()
val dstream = KafkaUtils.createDirectStream(...)
dstream.foreachRDD { rdd =>   // foreachRDD (not map): operate on each batch RDD
  val ds = sparkSession.createDataset(rdd)
  val f = parse andThen process andThen serialize
  val result = f(ds)
  result.write.format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("topic", writeTopic)
    .option("checkpointLocation", checkpointLocation)
    .save()
}
Structured Streaming | Spark Streaming: the same parse, process, and serialize stages are reused unchanged in both APIs.
Streaming Pipelines (example)
[Figure: example pipeline: Structured Streaming stages connecting Keyword Extraction, Keyword Relevance, Similarity, and DB Storage]
New Project?
80% / 20%
lightbend.com/fast-data-platform
Features:
1. One-click component installations
2. Automatic dependency checks
3. One-click access to install logs
4. Real-time cluster visualization
5. Access to consolidated production logs
Benefits:
1. Easy to get started
2. Ready access to all components
3. Increased developer velocity
Fast Data Platform Manager, for Managing Running Clusters
lightbend.com/learn
If you’re serious about having end-to-end monitoring for your Fast Data and streaming applications,
let’s chat!
SET UP A 20-MIN DEMO