Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

41
Modernizing Infrastructures for Fast Data Spark, Kafka, Cassandra, Reactive Platform and Mesos by Dean Wampler, Ph.D. (@deanwampler)

Transcript of Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Page 1: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Modernizing Infrastructures for Fast Data Spark, Kafka, Cassandra, Reactive Platform and Mesos

by Dean Wampler, Ph.D. (@deanwampler)

Page 2: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Outline

•Reactive Enterprise Architectures: The Lightbend Perspective

•Big Data and the Emergence of Apache Spark

•An Architecture for Fast Data

2

Page 3: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Reactive Enterprise Applications:The Lightbend Perspective

Page 4: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Reactive Manifesto

Page 5: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

The Lightbend Reactive Platform

Page 6: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos
Page 7: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

7

Online Services

IoT

Retail

Education

Technology

Social

Media

Finance

7

Page 8: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Big Data and the Emergence of Apache Spark

Page 9: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Distributed compute frameworks: MapReduce

9

•Distribution computation over that data.

Page 10: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Hadoop

YARN

HDFS

MRjob#1

MRjob#2

Flume Sqoop

DBs

SlaveNode

DiskDiskDiskDiskDisk

NodeMgr

DataNode

Master

ResourceManager

NameNode

Page 11: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Hadoop Strengths

•Lowest CapEx system for Big Data.

•Excellent for ingesting and integrating diverse datasets.

•Flexible: from classic analytics (aggregations and data warehousing) to machine learning.

11

Page 12: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Hadoop Weaknesses

•Complex administration.

•YARN can’t manage all distributed services.

•MapReduce:

•Has poor performance.

•A difficult programming model.

•Doesn’t support stream processing.

12

Page 13: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Why Apache Spark?

Page 14: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

YARN

HDFS

MRjob#1

MRjob#2

Flume Sqoop

DBs

SlaveNode

DiskDiskDiskDiskDisk

NodeMgr

DataNode

Master

ResourceManager

NameNode

Sparkjob#1

Sparkjob#2

Hadoop 2013:Embrace Spark

Page 15: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Spark vs. MapReduce Performance

15

100x better for many algorithms.

Page 16: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Spark: Major Performance Improvements

16

Sort 100TB

Page 17: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

One of the Fastest Growing OS Projects

17

Page 18: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Modules

18

SparkStreaming(~RealTime)

MLlib(MachineLearning)

SQL/DataFrames(StructuredData)

GraphX(Graphs)

SparkRDD(Core)

Page 19: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

The Core - Resilient Distributed Datasets

19

SparkStreaming(~RealTime)

MLlib(MachineLearning)

SQL/DataFrames(StructuredData)

GraphX(Graphs)

SparkRDD(Core)

Cluster

Node Node Node Node

RDDPartition 1

RDDPartition 2

RDDPartition 3

RDDPartition 4

Page 20: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

“Inverted Index” in Spark

20

sparkContext.textFile("/path/to/input") .map { line => val array = line.split(",", 2) (array(0), array(1)) }.flatMap { case (id, contents) => toWords(contents).map(w=>((w,id),1)) }.reduceByKey { (count1, count2) => count1 + count2 }.map { case ((word, path), n) => (word, (path, n))} .groupByKey .map { case (word, list) => (word, sortByCount(list)) }.saveAsTextFile("/path/to/output")

reduceByKey

flatMap

textFile

map

map

groupByKey

map

saveAsTextFile

Page 21: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

SQL queries and a “DataFrame” DSL

21

SparkStreaming(~RealTime)

MLlib(MachineLearning)

SQL/DataFrames(StructuredData)

GraphX(Graphs)

SparkRDD(Core)

•For data with a fixed schema...

•Write SQL queries (currently a subset of HiveQL).

•Use equivalent Python-inspired DataFrame API.

Page 22: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Use SQL or the Idiomatic DataFrame API

22

# SQL: sqlContext.sql(""" SELECT state, age, COUNT(*) AS cnt FROM people GROUP BY state, age ORDER BY cnt DESC, state ASC, age ASC """)

// DataFrame (Scala): people.state($"state", $"age") .groupBy($"state", $"age").count() .orderBy($"count".desc, $"state".asc, $"age".asc)

Page 23: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Spark Streaming: “Mini-batch” Processing

23

SparkStreaming(~RealTime)

MLlib(MachineLearning)

SQL/DataFrames(StructuredData)

GraphX(Graphs)

SparkRDD(Core)

DStream

RDD #2

Even

t

Even

t

Even

t

Even

t

Even

t

Windows(2 batches)

t0

RDD #1

Even

t

Even

t

RDD #3

Even

t

Even

t

Even

t

t1 =t0 + ∆

t2 =t0 + 2∆

t3 =t0 + 3∆

Page 24: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Streaming Inverted Index

24

val kafkaBrokers = "host1:port1,host2:port2,..." val kafkaTopics = Set("topic1", "topic2", ...)

val sparkConf = new SparkConf().setAppName("...") val ssc = new StreamingContext(sparkConf, Seconds(2))

// Create direct kafka stream with kafkaBrokers and kafkaTopics val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBrokers) val messages = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder]( ssc, kafkaParams, kafkaTopics)

messages.flatMap {case (topic,text) => toWords(text).map(w=>((w,topic),1L))} .reduceByKey (_ + _) .map {case ((word, topic), n) => (word, (path, n))} .groupByKey .map {case (word, list) => (word, sortByCount(list))} .saveAsTextFiles("/path/to/output")

Page 25: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

25

val sparkConf = new SparkConf().setAppName("...") val ssc = new StreamingContext(sparkConf, Seconds(2))

// Create direct kafka stream with kafkaBrokers and kafkaTopics val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBrokers) val messages = KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder]( ssc, kafkaParams, kafkaTopics)

messages.flatMap {case (topic,text) => toWords(text).map(w=>((w,topic),1L))} .reduceByKey (_ + _) .map {case ((word, topic), n) => (word, (path, n))} .groupByKey .map {case (word, list) => (word, sortByCount(list))} .saveAsTextFiles("/path/to/output")

ssc.start() ssc.awaitTermination()

Page 26: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

An Architecture for Fast Data

Page 27: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

•Update a search engine in real time as web page or documents change.

•Train a SPAM filter with every email.

•Detect anomalies as they happen through processing of logs and monitoring data.

Fast as in Streaming. Why?

Page 28: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Mesos,YARNonBareMetal,Cloud

HDFS,S3,CFSv2SQL/NoSQL

Core

Streaming SQL

MLlib GraphX

Fast Data Architecture

HTTP/RESTInternet

ReacHveServices

LogsandOtherFiles

Actors

Cluster …Persist

AkkaStreams

WebServices

Page 29: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Mesos,YARNonBareMetal,Cloud

HDFS,S3,CFSv2SQL/NoSQL

Core

Streaming SQL

MLlib GraphX

Fast Data Architecture

HTTP/RESTInternet

ReacHveServices

LogsandOtherFiles

Actors

Cluster …Persist

AkkaStreams

WebServices

CoreofSpark,Kafka,andCassandra

Page 30: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Mesos,YARNonBareMetal,Cloud

HDFS,S3,CFSv2SQL/NoSQL

Core

Streaming SQL

MLlib GraphX

Fast Data Architecture

HTTP/RESTInternet

ReacHveServices

LogsandOtherFiles

Actors

Cluster …Persist

AkkaStreams

WebServices

“SMACK”Stack

Page 31: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Mesos,YARNonBareMetal,Cloud

HDFS,S3,CFSv2SQL/NoSQL

Core

Streaming SQL

MLlib GraphX

Fast Data Architecture

HTTP/RESTInternet

ReacHveServices

LogsandOtherFiles

Actors

Cluster …Persist

AkkaStreams

WebServices

DataSources

Page 32: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Mesos,YARNonBareMetal,Cloud

HDFS,S3,CFSv2SQL/NoSQL

Core

Streaming SQL

MLlib GraphX

Fast Data Architecture

HTTP/RESTInternet

ReacHveServices

LogsandOtherFiles

Actors

Cluster …Persist

AkkaStreams

WebServices

Event

Event

Event

Event

Event

Event

Producer Consumer

boundedqueue

feedbackfeedback

Reactive Streams

Page 33: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Mesos,YARNonBareMetal,Cloud

HDFS,S3,CFSv2SQL/NoSQL

Core

Streaming SQL

MLlib GraphX

Fast Data Architecture

HTTP/RESTInternet

ReacHveServices

LogsandOtherFiles

Actors

Cluster …Persist

AkkaStreams

WebServices

LightbendReactivePlatform

Page 34: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Mesos,YARNonBareMetal,Cloud

HDFS,S3,CFSv2SQL/NoSQL

Core

Streaming SQL

MLlib GraphX

Fast Data Architecture

HTTP/RESTInternet

ReacHveServices

LogsandOtherFiles

Actors

Cluster …Persist

AkkaStreams

WebServices

KafkaforStreamStorage

Page 35: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Mesos,YARNonBareMetal,Cloud

HDFS,S3,CFSv2SQL/NoSQL

Core

Streaming SQL

MLlib GraphX

Fast Data Architecture

HTTP/RESTInternet

ReacHveServices

LogsandOtherFiles

Actors

Cluster …Persist

AkkaStreams

WebServices

Service 1

Log & Other Files

Internet

Services

Service 2

Service 3

Services

Services

N * M links ConsumersProducers

Page 36: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Mesos,YARNonBareMetal,Cloud

HDFS,S3,CFSv2SQL/NoSQL

Core

Streaming SQL

MLlib GraphX

Fast Data Architecture

HTTP/RESTInternet

ReacHveServices

LogsandOtherFiles

Actors

Cluster …Persist

AkkaStreams

WebServices

Service 1

Log & Other Files

Internet

Services

Service 2

Service 3

Services

Services

N + M links ConsumersProducers

Page 37: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Mesos,YARNonBareMetal,Cloud

HDFS,S3,CFSv2SQL/NoSQL

Core

Streaming SQL

MLlib GraphX

Fast Data Architecture

HTTP/RESTInternet

ReacHveServices

LogsandOtherFiles

Actors

Cluster …Persist

AkkaStreams

WebServices

MinibatchProcessing

Page 38: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Mesos,YARNonBareMetal,Cloud

HDFS,S3,CFSv2SQL/NoSQL

Core

Streaming SQL

MLlib GraphX

Fast Data Architecture

HTTP/RESTInternet

ReacHveServices

LogsandOtherFiles

Actors

Cluster …Persist

AkkaStreams

WebServices

ShortandLong-termStorage

Page 39: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

Mesos,YARNonBareMetal,Cloud

HDFS,S3,CFSv2SQL/NoSQL

Core

Streaming SQL

MLlib GraphX

Fast Data Architecture

HTTP/RESTInternet

ReacHveServices

LogsandOtherFiles

Actors

Cluster …Persist

AkkaStreams

WebServices

Infrastructure

Page 40: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos

•Next Steps

•Learn - Fast Data: Big Data Evolved

•Watch - Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization

•Review - Spark success stories by Lightbend clients

Page 41: Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, Reactive Platform and Mesos