
Data at the Speed of your Users
Apache Cassandra and Spark for simple, distributed, near real-time stream processing
GOTO Copenhagen 2014

Rustam Aliyev
Solution Architect
@rstml

Big Data?
Photo: Flickr / Watches En Masse

" Volume
# Variety
$ Velocity

Velocity = Near Real Time

Near Real Time?

0.5 sec ≤ Near Real Time ≤ 60 sec

Use Cases
Photo: Flickr / Swiss Army / Jim Pennucci

Web Analytics
Dynamic Pricing
Recommendation
Fraud Detection

Architecture
Photo: Ilkin Kangarli / Baku Haydar Aliyev Center

Architecture Goals
Low Latency
High Availability
Horizontal Scalability
Simplicity

Stream Processing
Collection → Processing → Storing → Delivery

Stream Processing
Collection → Spark (processing) → Cassandra (storing) → Delivery

Cassandra: Distributed Database
Photo: Flickr / Hypostyle Hall / Jorge Láscar

Data Model

Partition
Partition Key → Cell 1 | Cell 2 | Cell 3 | …

Partition
Nexus 5 (partition key)
os: Android
storage: 32GB
version: 4.4
weight: 130g
(cells kept in sort order on disk)

Table
Nexus 5 → os: Android | storage: 32GB | version: 4.4 | weight: 130g
iPhone 6 → os: iOS | storage: 64GB | version: 8.0 | weight: 129g
Other → os: | memory: | version: | weight: (values not shown)
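The table above could be declared in CQL roughly like this (a sketch only: the deck does not show this DDL, and the table and column names are assumptions):

```sql
-- Hypothetical DDL for the phone table above: the phone name is the
-- partition key; within a partition, cells are stored sorted by column name.
CREATE TABLE phones (
    name    text PRIMARY KEY,  -- partition key, e.g. 'Nexus 5'
    os      text,
    storage text,
    version text,
    weight  text
);
```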

Distribution

Token ring of 8 nodes (tokens 0000, 2000, 4000, 6000, 8000, A000, C000, E000).
Nexus 5 hashes to token 3D97; iPhone 6 hashes to token 9C4F — each row is placed on the node owning the matching token range.
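The ring above can be sketched in a few lines of plain Scala. This is an illustration only: a hypothetical 4-node ring, with MD5 standing in for Cassandra's Murmur3 partitioner.

```scala
import java.security.MessageDigest

object TokenRing {
  // Hypothetical 4-node ring: each node owns all tokens up to its boundary.
  val nodeBoundaries = Seq(0x4000, 0x8000, 0xC000, 0x10000)

  // Hash a partition key to a 16-bit token (MD5 stands in for Murmur3).
  def token(partitionKey: String): Int = {
    val digest = MessageDigest.getInstance("MD5")
      .digest(partitionKey.getBytes("UTF-8"))
    ((digest(0) & 0xFF) << 8) | (digest(1) & 0xFF)
  }

  // The node owning the first boundary above the token stores the row.
  def nodeFor(partitionKey: String): Int =
    nodeBoundaries.indexWhere(token(partitionKey) < _)

  def main(args: Array[String]): Unit = {
    val t = token("Nexus 5")
    println(f"Nexus 5 -> token $t%04X on node ${nodeFor("Nexus 5")}")
  }
}
```

The key point is that the token, not the node count, decides placement: the same key always hashes to the same token, so any node can route a request.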

Replication

Same ring (tokens 0000 … E000).
1 replica: tokens 3D97 and 9C4F each live on a single node.
2 replicas: each token is additionally copied to the next node on the ring, so 3D97 and 9C4F are each stored on two nodes.
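Replica placement can be sketched the same way (plain Scala, hypothetical 8-node ring). This mimics SimpleStrategy, which walks clockwise around the ring to place additional replicas:

```scala
object Replication {
  // Hypothetical ring of 8 nodes, each identified by the token
  // where its range starts.
  val ring = Seq(0x0000, 0x2000, 0x4000, 0x6000, 0x8000, 0xA000, 0xC000, 0xE000)

  // SimpleStrategy-style placement: the primary replica goes to the node
  // whose range covers the token; further replicas go to the next nodes
  // clockwise around the ring.
  def replicas(token: Int, replicationFactor: Int): Seq[Int] = {
    val primary = ring.lastIndexWhere(_ <= token)
    (0 until replicationFactor).map(i => ring((primary + i) % ring.size))
  }

  def main(args: Array[String]): Unit = {
    // Nodes holding token 3D97 at replication factor 2.
    println(replicas(0x3D97, 2).map(t => f"$t%04X"))
  }
}
```

With this placement, token 3D97 lands on the nodes starting at 2000 and 4000, matching the "2 replicas" picture above.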

Spark: Distributed Data Processing Engine
Photo: Flickr / Sparklers / Alexandra Compo / CreativeCommons

Fast
In-memory

Chart: Logistic Regression — running time (s) vs. number of iterations (1, 5, 10, 20, 30), Spark vs. Hadoop (y-axis 1000–4000 s). Hadoop's running time grows with the number of iterations; Spark's stays nearly flat.

Easy

map
reduce

map filter groupBy sort union join leftOuterJoin rightOuterJoin
reduce count fold reduceByKey groupByKey cogroup cross zip
sample take first partitionBy mapWith pipe save ...
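Because the RDD API mirrors Scala's collection API, the same operations can be tried on plain local collections. A sketch (not Spark itself): reduceByKey is approximated here with groupBy plus a per-key sum.

```scala
object RddOps {
  val words = Seq("spark", "cassandra", "spark", "streaming", "spark")

  // The map + reduceByKey pattern, on local collections:
  // RDD.reduceByKey merges values per key; groupBy + a per-key sum
  // computes the same result here.
  val counts: Map[String, Int] =
    words.map(w => (w, 1))
         .groupBy(_._1)
         .map { case (w, pairs) => (w, pairs.map(_._2).sum) }

  // filter reads identically on RDDs and on local collections.
  val hashTags = Seq("#iphone", "spark", "#android").filter(_.startsWith("#"))

  def main(args: Array[String]): Unit = {
    println(counts("spark"))  // 3
    println(hashTags)
  }
}
```

The difference is where the work runs: on an RDD these operators build a distributed job over partitioned data rather than executing eagerly in one JVM.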

RDD — Resilient Distributed Datasets
Dataset partitions are distributed across Node 1, Node 2 and Node 3.

Operator DAG
map → filter → groupBy → join
(Disk RDDs and Memory RDDs as nodes of the DAG)

Spark Streaming
Micro-batching

Data Stream → sliced into small RDDs → DStream (a sequence of RDDs)
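Micro-batching itself can be sketched with plain Scala. This slices by count for simplicity; Spark Streaming slices by time (e.g. Seconds(2)), producing one RDD per interval.

```scala
object MicroBatch {
  // A stand-in for a continuous event stream.
  val events = Iterator.from(1).take(10)

  // Micro-batching: cut the stream into fixed-size batches and treat each
  // batch as one unit of work (each batch plays the role of one RDD).
  val batches: List[Seq[Int]] = events.grouped(4).map(_.toSeq).toList

  def main(args: Array[String]): Unit = {
    batches.foreach(b => println(s"batch of ${b.size}: $b"))
  }
}
```

The trailing batch may be smaller than the rest, just as a time-sliced interval may contain fewer events.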

Spark + Cassandra
DataStax Spark Cassandra Connector

https://github.com/datastax/spark-cassandra-connector

Deployment: a Spark Worker is co-located with each Cassandra node; one node also runs the Spark Master alongside its Worker.

Demo!
Twitter Analytics

Cassandra Data Model

#hashtag (partition key)
ALL: 7139
2014-09-21: 220
2014-09-20: 309
2014-09-19: 129
(intervals in descending sort order)

CREATE TABLE hashtags (
    hashtag  text,
    interval text,
    mentions counter,
    PRIMARY KEY ((hashtag), interval)
) WITH CLUSTERING ORDER BY (interval DESC);
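Since mentions is a counter column, rows in this table are written by increment rather than insert. A sketch of the equivalent hand-written CQL (the values here are hypothetical):

```sql
-- Counters are written by increment, not INSERT:
UPDATE hashtags
   SET mentions = mentions + 220
 WHERE hashtag = '#iphone' AND interval = '2014-09-21';

-- Read one tag's time series; CLUSTERING ORDER BY (interval DESC)
-- returns the newest interval first:
SELECT interval, mentions FROM hashtags WHERE hashtag = '#iphone';
```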

Processing Data Stream

import com.datastax.spark.connector.streaming._

val sc = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
  .setAppName("Twitter-Demo")
  .setJars(Seq("demo-assembly-1.0.jar"))
  .set("spark.cassandra.connection.host", "127.0.0.1")

val ssc = new StreamingContext(sc, Seconds(2))

val stream = TwitterUtils.createStream(ssc, None, Nil,
  storageLevel = StorageLevel.MEMORY_ONLY_SER_2)

val hashTags = stream.flatMap(tweet =>
  tweet.getText.toLowerCase.split(" ")
    .filter(tag => Seq("#iphone", "#android").contains(tag)))

val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _)

val tagCountsAll = tagCounts.map {
  case (tag, mentions) => (tag, mentions, "ALL")
}

tagCountsAll.saveToCassandra(
  "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval"))

val ssc = new StreamingContext(sc, Seconds(2))

val stream = TwitterUtils.createStream(ssc, None, Nil,
  storageLevel = StorageLevel.MEMORY_ONLY_SER_2)

val hashTags = stream.flatMap(tweet =>
  tweet.getText.toLowerCase.split(" ")
    .filter(tag => Seq("#iphone", "#android").contains(tag)))

val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _)

val tagCountsAll = tagCounts.map {
  case (tag, mentions) => (tag, mentions, "ALL")
}

tagCountsAll.saveToCassandra(
  "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval"))

ssc.start()
ssc.awaitTermination()

val ssc = new StreamingContext(sc, Seconds(2))

val stream = TwitterUtils.createStream(ssc, None, Nil,
  storageLevel = StorageLevel.MEMORY_ONLY_SER_2)

val hashTags = stream.flatMap(tweet =>
  tweet.getText.toLowerCase.split(" ")
    .filter(tag => Seq("#iphone", "#android").contains(tag)))

val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _)

val tagCountsByDay = tagCounts.map {
  case (tag, mentions) => (tag, mentions, DateTime.now.toString("yyyyMMdd"))
}

tagCountsByDay.saveToCassandra(
  "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval"))

ssc.start()
ssc.awaitTermination()


Questions?