Apache Cassandra and Spark for simple, distributed, near real-time ...

49
Data at the Speed of your Users Apache Cassandra and Spark for simple, distributed, near real-time stream processing. GOTO Copenhagen 2014

Transcript of Apache Cassandra and Spark for simple, distributed, near real-time ...

Page 1: Apache Cassandra and Spark for simple, distributed, near real-time ...

Data at the Speed of your UsersApache Cassandra and Spark for simple, distributed, near real-time stream processing.

GOTO Copenhagen 2014

Page 2: Apache Cassandra and Spark for simple, distributed, near real-time ...

Rustam Aliyev

Solution Architect at . !

! @rstml

Page 3: Apache Cassandra and Spark for simple, distributed, near real-time ...

Big Data?

Photo: Flickr / Watches En Masse

Page 4: Apache Cassandra and Spark for simple, distributed, near real-time ...

" Volume

# Variety

$ Velocity

Page 5: Apache Cassandra and Spark for simple, distributed, near real-time ...

Velocity = Near Real Time

Page 6: Apache Cassandra and Spark for simple, distributed, near real-time ...

Near Real Time?

Page 7: Apache Cassandra and Spark for simple, distributed, near real-time ...

0.5 sec ≤ ≤ 60 secNear Real Time

Page 8: Apache Cassandra and Spark for simple, distributed, near real-time ...

Use Cases

Photo: Flickr / Swiss Army / Jim Pennucci

Page 9: Apache Cassandra and Spark for simple, distributed, near real-time ...

Web Analytics Dynamic Pricing

Recommendation Fraud Detection

Page 10: Apache Cassandra and Spark for simple, distributed, near real-time ...

Architecture

Photo: Ilkin Kangarli / Baku Haydar Aliyev Center

Page 11: Apache Cassandra and Spark for simple, distributed, near real-time ...

Architecture Goals

Low Latency

High Availability

Horizontal Scalability

Simplicity

Page 12: Apache Cassandra and Spark for simple, distributed, near real-time ...

Stream Processing

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%Collection Processing Storing Delivery

Page 13: Apache Cassandra and Spark for simple, distributed, near real-time ...

Stream Processing

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%Collection!

!

Spark

!

CassandraDelivery

Page 14: Apache Cassandra and Spark for simple, distributed, near real-time ...

Cassandra Distributed Database

Photo: Flickr / Hypostyle Hall / Jorge Láscar

Page 15: Apache Cassandra and Spark for simple, distributed, near real-time ...

Data Model

Page 16: Apache Cassandra and Spark for simple, distributed, near real-time ...

Partition

Cell 1 Cell 2 …Cell 3Partition Key

Page 17: Apache Cassandra and Spark for simple, distributed, near real-time ...

Partition

os: Android

storage: 32GB

version: 4.4

weight: 130g

sort order on disk

Nexus 5

Page 18: Apache Cassandra and Spark for simple, distributed, near real-time ...

Table

os: Android

storage: 32GB

version: 4.4

weight: 130gNexus 5

os: iOS

storage: 64GB

version: 8.0

weight: 129giPhone 6

os: memory: version: weight: Other

Page 19: Apache Cassandra and Spark for simple, distributed, near real-time ...

Distribution

Page 20: Apache Cassandra and Spark for simple, distributed, near real-time ...

0000

8000

4000C000

2000

6000

E000

A000

3D97Nexus 5

Page 21: Apache Cassandra and Spark for simple, distributed, near real-time ...

0000

8000

4000C000

2000

6000

E000

A000

9C4FiPhone 6

3D97

Page 22: Apache Cassandra and Spark for simple, distributed, near real-time ...

Replication

Page 23: Apache Cassandra and Spark for simple, distributed, near real-time ...

0000

8000

4000C000

2000

6000

E000

A000

3D97

9C4F

1 replica

Page 24: Apache Cassandra and Spark for simple, distributed, near real-time ...

0000

8000

4000C000

2000

6000

E000

A000

3D97

9C4F

9C4F

3D97

2 replicas

Page 25: Apache Cassandra and Spark for simple, distributed, near real-time ...

Spark Distributed Data Processing Engine

Photo: Flickr / Sparklers / Alexandra Compo / CreativeCommons

Page 26: Apache Cassandra and Spark for simple, distributed, near real-time ...

Fast

In-memory

Page 27: Apache Cassandra and Spark for simple, distributed, near real-time ...

Logistic Regression

Run

ning

Tim

e (s

)

1000

2000

3000

4000

Number of Iterations

1 5 10 20 30

SparkHadoop

Page 28: Apache Cassandra and Spark for simple, distributed, near real-time ...

Easy

Page 29: Apache Cassandra and Spark for simple, distributed, near real-time ...

map !

reduce

Page 30: Apache Cassandra and Spark for simple, distributed, near real-time ...

map filter groupBy sort union join leftOuterJoin rightOuterJoin

reduce count fold reduceByKey groupByKey cogroup cross zip

sample take first partitionBy mapWith pipe save ...

Page 31: Apache Cassandra and Spark for simple, distributed, near real-time ...

RDD Resilient Distributed Datasets

Node 1

Node 2

Node 3Node 1

Node 2

Node 3

Page 32: Apache Cassandra and Spark for simple, distributed, near real-time ...

Operator DAG

groupBy

join

filtermap

Disk RDD Memory RDD

Page 33: Apache Cassandra and Spark for simple, distributed, near real-time ...

Spark Streaming

Micro-batching

Page 34: Apache Cassandra and Spark for simple, distributed, near real-time ...

RDD

DStreamData Stream

Page 35: Apache Cassandra and Spark for simple, distributed, near real-time ...

Spark + Cassandra

DataStax Spark Cassandra Connector

Page 36: Apache Cassandra and Spark for simple, distributed, near real-time ...

https://github.com/datastax/spark-cassandra-connector

Page 37: Apache Cassandra and Spark for simple, distributed, near real-time ...

M

M

M

Cassandra

Spark Worker

Spark Master & Worker

Page 38: Apache Cassandra and Spark for simple, distributed, near real-time ...

Demo!

! Twitter Analytics

Page 39: Apache Cassandra and Spark for simple, distributed, near real-time ...

Cassandra Data Model

Page 40: Apache Cassandra and Spark for simple, distributed, near real-time ...

ALL: 7139

2014-09-21: 220

2014-09-20: 309

2014-09-19: 129

sort order

#hashtag

Page 41: Apache Cassandra and Spark for simple, distributed, near real-time ...

CREATE  TABLE  hashtags  (          hashtag  text,          interval  text,          mentions  counter,          PRIMARY  KEY((hashtag),  interval)  )  WITH  CLUSTERING  ORDER  BY  (interval  DESC);  

Page 42: Apache Cassandra and Spark for simple, distributed, near real-time ...

Processing Data Stream

Page 43: Apache Cassandra and Spark for simple, distributed, near real-time ...

import  com.datastax.spark.connector.streaming._  !val  sc  =  new  SparkConf()      .setMaster("spark://127.0.0.1:7077")      .setAppName("Twitter-­‐Demo")      .setJars("demo-­‐assembly-­‐1.0.jar"))      .set("spark.cassandra.connection.host",  "127.0.0.1")  !val  ssc  =  new  StreamingContext(sc,  Seconds(2))  !val  stream  =  TwitterUtils.      createStream(ssc,  None,  Nil,  storageLevel  =  StorageLevel.MEMORY_ONLY_SER_2)  !val  hashTags  =  stream.flatMap(tweet  =>      tweet.getText.toLowerCase.split("  ").      filter(tags.contains(Seq("#iphone",  "#android"))))  !val  tagCounts  =  hashTags.map((_,  1)).reduceByKey(_  +  _)  !val  tagCountsAll  =  tagCounts.map{     case  (tag,  mentions)  =>  (tag,  mentions,  "ALL")  }  !tagCountsAll.saveToCassandra(  

Page 44: Apache Cassandra and Spark for simple, distributed, near real-time ...

import  com.datastax.spark.connector.streaming._  !val  sc  =  new  SparkConf()      .setMaster("spark://127.0.0.1:7077")      .setAppName("Twitter-­‐Demo")      .setJars("demo-­‐assembly-­‐1.0.jar"))      .set("spark.cassandra.connection.host",  "127.0.0.1")  !val  ssc  =  new  StreamingContext(sc,  Seconds(2))  !val  stream  =  TwitterUtils.      createStream(ssc,  None,  Nil,  storageLevel  =  StorageLevel.MEMORY_ONLY_SER_2)  !val  hashTags  =  stream.flatMap(tweet  =>      tweet.getText.toLowerCase.split("  ").      filter(tags.contains(Seq("#iphone",  "#android"))))  !val  tagCounts  =  hashTags.map((_,  1)).reduceByKey(_  +  _)  !val  tagCountsAll  =  tagCounts.map{     case  (tag,  mentions)  =>  (tag,  mentions,  "ALL")  }  !tagCountsAll.saveToCassandra(  

Page 45: Apache Cassandra and Spark for simple, distributed, near real-time ...

import  com.datastax.spark.connector.streaming._  !val  sc  =  new  SparkConf()      .setMaster("spark://127.0.0.1:7077")      .setAppName("Twitter-­‐Demo")      .setJars("demo-­‐assembly-­‐1.0.jar"))      .set("spark.cassandra.connection.host",  "127.0.0.1")  !val  ssc  =  new  StreamingContext(sc,  Seconds(2))  !val  stream  =  TwitterUtils.      createStream(ssc,  None,  Nil,  storageLevel  =  StorageLevel.MEMORY_ONLY_SER_2)  !val  hashTags  =  stream.flatMap(tweet  =>      tweet.getText.toLowerCase.split("  ").      filter(tags.contains(Seq("#iphone",  "#android"))))  !val  tagCounts  =  hashTags.map((_,  1)).reduceByKey(_  +  _)  !val  tagCountsAll  =  tagCounts.map{     case  (tag,  mentions)  =>  (tag,  mentions,  "ALL")  }  !tagCountsAll.saveToCassandra(  

Page 46: Apache Cassandra and Spark for simple, distributed, near real-time ...

!val  ssc  =  new  StreamingContext(sc,  Seconds(2))  !val  stream  =  TwitterUtils.      createStream(ssc,  None,  Nil,  storageLevel  =  StorageLevel.MEMORY_ONLY_SER_2)  !val  hashTags  =  stream.flatMap(tweet  =>      tweet.getText.toLowerCase.split("  ").      filter(tags.contains(Seq("#iphone",  "#android"))))  !val  tagCounts  =  hashTags.map((_,  1)).reduceByKey(_  +  _)  !val  tagCountsAll  =  tagCounts.map{     case  (tag,  mentions)  =>  (tag,  mentions,  "ALL")  }  !tagCountsAll.saveToCassandra(     "demo_ks",  "hashtags",  Seq("hashtag",  "mentions",  "interval"))  !ssc.start()  ssc.awaitTermination()  

Page 47: Apache Cassandra and Spark for simple, distributed, near real-time ...

!val  ssc  =  new  StreamingContext(sc,  Seconds(2))  !val  stream  =  TwitterUtils.      createStream(ssc,  None,  Nil,  storageLevel  =  StorageLevel.MEMORY_ONLY_SER_2)  !val  hashTags  =  stream.flatMap(tweet  =>      tweet.getText.toLowerCase.split("  ").      filter(tags.contains(Seq("#iphone",  "#android"))))  !val  tagCounts  =  hashTags.map((_,  1)).reduceByKey(_  +  _)  !val  tagCountsByDay  =  tagCounts.map{     case  (tag,  mentions)  =>  (tag,  mentions,  DateTime.now.toString("yyyyMMdd"))  }  !tagCountsByDay.saveToCassandra(     "demo_ks",  "hashtags",  Seq("hashtag",  "mentions",  "interval"))  !ssc.start()  ssc.awaitTermination()  

Page 48: Apache Cassandra and Spark for simple, distributed, near real-time ...

!val  ssc  =  new  StreamingContext(sc,  Seconds(2))  !val  stream  =  TwitterUtils.      createStream(ssc,  None,  Nil,  storageLevel  =  StorageLevel.MEMORY_ONLY_SER_2)  !val  hashTags  =  stream.flatMap(tweet  =>      tweet.getText.toLowerCase.split("  ").      filter(tags.contains(Seq("#iphone",  "#android"))))  !val  tagCounts  =  hashTags.map((_,  1)).reduceByKey(_  +  _)  !val  tagCountsAll  =  tagCounts.map{     case  (tag,  mentions)  =>  (tag,  mentions,  "ALL")  }  !tagCountsAll.saveToCassandra(     "demo_ks",  "hashtags",  Seq("hashtag",  "mentions",  "interval"))  !ssc.start()  ssc.awaitTermination()  

Page 49: Apache Cassandra and Spark for simple, distributed, near real-time ...

Questions ?