Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos
Transcript of Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos
Modernizing Infrastructures for Fast Data Spark, Kafka, Cassandra, Reactive Platform and Mesos
by Dean Wampler, Ph.D. (@deanwampler)
Outline
•Reactive Enterprise Architectures: The Lightbend Perspective
•Big Data and the Emergence of Apache Spark
•An Architecture for Fast Data
2
Reactive Enterprise Applications:The Lightbend Perspective
Reactive Manifesto
The Lightbend Reactive Platform
7
Online Services
IoT
Retail
Education
Technology
Social
Media
Finance
7
Big Data and the Emergence of Apache Spark
Distributed compute frameworks: MapReduce
9
•Distribution computation over that data.
Hadoop
YARN
HDFS
MRjob#1
MRjob#2
Flume Sqoop
DBs
SlaveNode
DiskDiskDiskDiskDisk
NodeMgr
DataNode
Master
ResourceManager
NameNode
Hadoop Strengths
•Lowest CapEx system for Big Data.
•Excellent for ingesting and integrating diverse datasets.
•Flexible: from classic analytics (aggregations and data warehousing) to machine learning.
11
Hadoop Weaknesses
•Complex administration.
•YARN can’t manage all distributed services.
•MapReduce:
•Has poor performance.
•A difficult programming model.
•Doesn’t support stream processing.
12
Why Apache Spark?
YARN
HDFS
MRjob#1
MRjob#2
Flume Sqoop
DBs
SlaveNode
DiskDiskDiskDiskDisk
NodeMgr
DataNode
Master
ResourceManager
NameNode
Sparkjob#1
Sparkjob#2
Hadoop 2013:Embrace Spark
Spark vs. MapReduce Performance
15
100x better for many algorithms.
Spark: Major Performance Improvements
16
Sort 100TB
One of the Fastest Growing OS Projects
17
Modules
18
SparkStreaming(~RealTime)
MLlib(MachineLearning)
SQL/DataFrames(StructuredData)
GraphX(Graphs)
SparkRDD(Core)
The Core - Resilient Distributed Datasets
19
SparkStreaming(~RealTime)
MLlib(MachineLearning)
SQL/DataFrames(StructuredData)
GraphX(Graphs)
SparkRDD(Core)
Cluster
Node Node Node Node
RDDPartition 1
RDDPartition 2
RDDPartition 3
RDDPartition 4
“Inverted Index” in Spark
20
sparkContext.textFile("/path/to/input") .map { line => val array = line.split(",", 2) (array(0), array(1)) }.flatMap { case (id, contents) => toWords(contents).map(w=>((w,id),1)) }.reduceByKey { (count1, count2) => count1 + count2 }.map { case ((word, path), n) => (word, (path, n))} .groupByKey .map { case (word, list) => (word, sortByCount(list)) }.saveAsTextFile("/path/to/output")
reduceByKey
flatMap
textFile
map
map
groupByKey
map
saveAsTextFile
SQL queries and a “DataFrame” DSL
21
SparkStreaming(~RealTime)
MLlib(MachineLearning)
SQL/DataFrames(StructuredData)
GraphX(Graphs)
SparkRDD(Core)
•For data with a fixed schema...
•Write SQL queries (currently a subset of HiveQL).
•Use equivalent Python-inspired DataFrame API.
Use SQL or the Idiomatic DataFrame API
22
# SQL: sqlContext.sql(""" SELECT state, age, COUNT(*) AS cnt FROM people GROUP BY state, age ORDER BY cnt DESC, state ASC, age ASC """)
// DataFrame (Scala): people.state($"state", $"age") .groupBy($"state", $"age").count() .orderBy($"count".desc, $"state".asc, $"age".asc)
Spark Streaming: “Mini-batch” Processing
23
SparkStreaming(~RealTime)
MLlib(MachineLearning)
SQL/DataFrames(StructuredData)
GraphX(Graphs)
SparkRDD(Core)
DStream
RDD #2
Even
t
Even
t
Even
t
Even
t
Even
t
…
Windows(2 batches)
t0
RDD #1
Even
t
Even
t
RDD #3
Even
t
Even
t
Even
t
t1 =t0 + ∆
t2 =t0 + 2∆
t3 =t0 + 3∆
Streaming Inverted Index
24
val kafkaBrokers = "host1:port1,host2:port2,..." val kafkaTopics = Set("topic1", "topic2", ...)
val sparkConf = new SparkConf().setAppName("...") val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with kafkaBrokers and kafkaTopics val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBrokers) val messages = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder]( ssc, kafkaParams, kafkaTopics)
messages.flatMap {case (topic,text) => toWords(text).map(w=>((w,topic),1L))} .reduceByKey (_ + _) .map {case ((word, topic), n) => (word, (path, n))} .groupByKey .map {case (word, list) => (word, sortByCount(list))} .saveAsTextFiles("/path/to/output")
25
val sparkConf = new SparkConf().setAppName("...") val ssc = new StreamingContext(sparkConf, Seconds(2))
// Create direct kafka stream with kafkaBrokers and kafkaTopics val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBrokers) val messages = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder]( ssc, kafkaParams, kafkaTopics)
messages.flatMap {case (topic,text) => toWords(text).map(w=>((w,topic),1L))} .reduceByKey (_ + _) .map {case ((word, topic), n) => (word, (path, n))} .groupByKey .map {case (word, list) => (word, sortByCount(list))} .saveAsTextFiles("/path/to/output")
ssc.start() ssc.awaitTermination()
An Architecture for Fast Data
•Update a search engine in real time as web page or documents change.
•Train a SPAM filter with every email.
•Detect anomalies as they happen through processing of logs and monitoring data.
Fast as in Streaming. Why?
Mesos,YARNonBareMetal,Cloud
HDFS,S3,CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/RESTInternet
ReacHveServices
LogsandOtherFiles
Actors
Cluster …Persist
AkkaStreams
WebServices
Mesos,YARNonBareMetal,Cloud
HDFS,S3,CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/RESTInternet
ReacHveServices
LogsandOtherFiles
Actors
Cluster …Persist
AkkaStreams
WebServices
CoreofSpark,Kafka,andCassandra
Mesos,YARNonBareMetal,Cloud
HDFS,S3,CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/RESTInternet
ReacHveServices
LogsandOtherFiles
Actors
Cluster …Persist
AkkaStreams
WebServices
“SMACK”Stack
Mesos,YARNonBareMetal,Cloud
HDFS,S3,CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/RESTInternet
ReacHveServices
LogsandOtherFiles
Actors
Cluster …Persist
AkkaStreams
WebServices
DataSources
Mesos,YARNonBareMetal,Cloud
HDFS,S3,CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/RESTInternet
ReacHveServices
LogsandOtherFiles
Actors
Cluster …Persist
AkkaStreams
WebServices
Event
Event
Event
Event
Event
Event
Producer Consumer
boundedqueue
feedbackfeedback
Reactive Streams
Mesos,YARNonBareMetal,Cloud
HDFS,S3,CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/RESTInternet
ReacHveServices
LogsandOtherFiles
Actors
Cluster …Persist
AkkaStreams
WebServices
LightbendReactivePlatform
Mesos,YARNonBareMetal,Cloud
HDFS,S3,CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/RESTInternet
ReacHveServices
LogsandOtherFiles
Actors
Cluster …Persist
AkkaStreams
WebServices
KafkaforStreamStorage
Mesos,YARNonBareMetal,Cloud
HDFS,S3,CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/RESTInternet
ReacHveServices
LogsandOtherFiles
Actors
Cluster …Persist
AkkaStreams
WebServices
Service 1
Log & Other Files
Internet
Services
Service 2
Service 3
Services
Services
N * M links ConsumersProducers
Mesos,YARNonBareMetal,Cloud
HDFS,S3,CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/RESTInternet
ReacHveServices
LogsandOtherFiles
Actors
Cluster …Persist
AkkaStreams
WebServices
Service 1
Log & Other Files
Internet
Services
Service 2
Service 3
Services
Services
N + M links ConsumersProducers
Mesos,YARNonBareMetal,Cloud
HDFS,S3,CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/RESTInternet
ReacHveServices
LogsandOtherFiles
Actors
Cluster …Persist
AkkaStreams
WebServices
MinibatchProcessing
Mesos,YARNonBareMetal,Cloud
HDFS,S3,CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/RESTInternet
ReacHveServices
LogsandOtherFiles
Actors
Cluster …Persist
AkkaStreams
WebServices
ShortandLong-termStorage
Mesos,YARNonBareMetal,Cloud
HDFS,S3,CFSv2SQL/NoSQL
Core
Streaming SQL
MLlib GraphX
Fast Data Architecture
HTTP/RESTInternet
ReacHveServices
LogsandOtherFiles
Actors
Cluster …Persist
AkkaStreams
WebServices
Infrastructure
•Next Steps
•Learn - Fast Data: Big Data Evolved
•Watch - Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization
•Review - Spark success stories by Lightbend clients