Real-time Analytics with Cassandra, Spark, and Shark

128
Real-time Analytics with Cassandra, Spark and Shark Tuesday, June 18, 13

description

"Real-time Analytics with Cassandra, Spark, and Shark", my talk at the Cassandra Summit 2013 conference in San Francisco

Transcript of Real-time Analytics with Cassandra, Spark, and Shark

Page 1: Real-time Analytics with Cassandra, Spark, and Shark

Real-time Analytics withCassandra, Spark and Shark

Tuesday, June 18, 13

Page 2: Real-time Analytics with Cassandra, Spark, and Shark

Who is this guy• Staff Engineer, Compute and Data Services, Ooyala• Building multiple web-scale real-time systems on top of C*, Kafka,

Storm, etc.• Scala/Akka guy• Very excited by open source, big data projects - share some today• @evanfchan

Tuesday, June 18, 13

Page 3: Real-time Analytics with Cassandra, Spark, and Shark

Agenda

Tuesday, June 18, 13

Page 4: Real-time Analytics with Cassandra, Spark, and Shark

Agenda• Ooyala and Cassandra

Tuesday, June 18, 13

Page 5: Real-time Analytics with Cassandra, Spark, and Shark

Agenda• Ooyala and Cassandra• What problem are we trying to solve?

Tuesday, June 18, 13

Page 6: Real-time Analytics with Cassandra, Spark, and Shark

Agenda• Ooyala and Cassandra• What problem are we trying to solve?• Spark and Shark

Tuesday, June 18, 13

Page 7: Real-time Analytics with Cassandra, Spark, and Shark

Agenda• Ooyala and Cassandra• What problem are we trying to solve?• Spark and Shark• Our Spark/Cassandra Architecture

Tuesday, June 18, 13

Page 8: Real-time Analytics with Cassandra, Spark, and Shark

Agenda• Ooyala and Cassandra• What problem are we trying to solve?• Spark and Shark• Our Spark/Cassandra Architecture• Demo

Tuesday, June 18, 13

Page 9: Real-time Analytics with Cassandra, Spark, and Shark

Cassandra at OoyalaWho is Ooyala, and how we use Cassandra

Tuesday, June 18, 13

Page 10: Real-time Analytics with Cassandra, Spark, and Shark

CONFIDENTIAL—DO NOT DISTRIBUTE

OOYALAPowering personalized video

experiences across all screens.

5

Tuesday, June 18, 13

Page 11: Real-time Analytics with Cassandra, Spark, and Shark

CONFIDENTIAL—DO NOT DISTRIBUTE 6CONFIDENTIAL—DO NOT DISTRIBUTE

Founded in 2007

Commercially launch in 2009

230+ employees in Silicon Valley, LA, NYC, London, Paris, Tokyo, Sydney & Guadalajara

Global footprint, 200M unique users,110+ countries, and more than 6,000 websites

Over 1 billion videos played per month and 2 billion analytic events per day

25% of U.S. online viewers watch video powered by Ooyala

COMPANY OVERVIEW

Tuesday, June 18, 13

Page 12: Real-time Analytics with Cassandra, Spark, and Shark

CONFIDENTIAL—DO NOT DISTRIBUTE 7

TRUSTED VIDEO PARTNER

STRATEGIC PARTNERS

CUSTOMERS

CONFIDENTIAL—DO NOT DISTRIBUTE

Tuesday, June 18, 13

Page 13: Real-time Analytics with Cassandra, Spark, and Shark

We are a large Cassandra user

Tuesday, June 18, 13

Page 14: Real-time Analytics with Cassandra, Spark, and Shark

We are a large Cassandra user• 11 clusters ranging in size from 3 to 36 nodes

Tuesday, June 18, 13

Page 15: Real-time Analytics with Cassandra, Spark, and Shark

We are a large Cassandra user• 11 clusters ranging in size from 3 to 36 nodes• Total of 28TB of data managed over ~85 nodes

Tuesday, June 18, 13

Page 16: Real-time Analytics with Cassandra, Spark, and Shark

We are a large Cassandra user• 11 clusters ranging in size from 3 to 36 nodes• Total of 28TB of data managed over ~85 nodes• Over 2 billion C* column writes per day

Tuesday, June 18, 13

Page 17: Real-time Analytics with Cassandra, Spark, and Shark

We are a large Cassandra user• 11 clusters ranging in size from 3 to 36 nodes• Total of 28TB of data managed over ~85 nodes• Over 2 billion C* column writes per day• Powers all of our analytics infrastructure

Tuesday, June 18, 13

Page 18: Real-time Analytics with Cassandra, Spark, and Shark

We are a large Cassandra user• 11 clusters ranging in size from 3 to 36 nodes• Total of 28TB of data managed over ~85 nodes• Over 2 billion C* column writes per day• Powers all of our analytics infrastructure• Much much bigger cluster coming soon

Tuesday, June 18, 13

Page 19: Real-time Analytics with Cassandra, Spark, and Shark

What problem are we trying to solve?Lots of data, complex queries, answered really quickly... but how??

Tuesday, June 18, 13

Page 20: Real-time Analytics with Cassandra, Spark, and Shark

From mountains of useless data...

Tuesday, June 18, 13

Page 21: Real-time Analytics with Cassandra, Spark, and Shark

To nuggets of truth...

Tuesday, June 18, 13

Page 22: Real-time Analytics with Cassandra, Spark, and Shark

To nuggets of truth...

•Quickly

Tuesday, June 18, 13

Page 23: Real-time Analytics with Cassandra, Spark, and Shark

To nuggets of truth...

•Quickly•Painlessly

Tuesday, June 18, 13

Page 24: Real-time Analytics with Cassandra, Spark, and Shark

To nuggets of truth...

•Quickly•Painlessly•At scale?

Tuesday, June 18, 13

Page 25: Real-time Analytics with Cassandra, Spark, and Shark

Today: Precomputed aggregates

Tuesday, June 18, 13

Page 26: Real-time Analytics with Cassandra, Spark, and Shark

Today: Precomputed aggregates• Video metrics computed along several high cardinality dimensions

Tuesday, June 18, 13

Page 27: Real-time Analytics with Cassandra, Spark, and Shark

Today: Precomputed aggregates• Video metrics computed along several high cardinality dimensions • Very fast lookups, but inflexible, and hard to change

Tuesday, June 18, 13

Page 28: Real-time Analytics with Cassandra, Spark, and Shark

Today: Precomputed aggregates• Video metrics computed along several high cardinality dimensions • Very fast lookups, but inflexible, and hard to change• Most computed aggregates are never read

Tuesday, June 18, 13

Page 29: Real-time Analytics with Cassandra, Spark, and Shark

Today: Precomputed aggregates• Video metrics computed along several high cardinality dimensions • Very fast lookups, but inflexible, and hard to change• Most computed aggregates are never read• What if we need more dynamic queries?

– Top content for mobile users in France– Engagement curves for users who watched recommendations– Data mining, trends, machine learning

Tuesday, June 18, 13

Page 30: Real-time Analytics with Cassandra, Spark, and Shark

The static - dynamic continuum

• Super fast lookups• Inflexible, wasteful• Best for 80% most

common queries

• Always compute results from raw data

• Flexible but slow

100% Precomputation 100% Dynamic

Tuesday, June 18, 13

Page 31: Real-time Analytics with Cassandra, Spark, and Shark

The static - dynamic continuum

• Super fast lookups• Inflexible, wasteful• Best for 80% most

common queries

• Always compute results from raw data

• Flexible but slow

100% Precomputation100% Dynamic

Tuesday, June 18, 13

Page 32: Real-time Analytics with Cassandra, Spark, and Shark

Where we want to be

Partly dynamic

• Pre-aggregate most common queries

• Flexible, fast dynamic queries

• Easily generate many materialized views

Tuesday, June 18, 13

Page 33: Real-time Analytics with Cassandra, Spark, and Shark

Industry Trends

Tuesday, June 18, 13

Page 34: Real-time Analytics with Cassandra, Spark, and Shark

Industry Trends• Fast execution frameworks

– Impala

Tuesday, June 18, 13

Page 35: Real-time Analytics with Cassandra, Spark, and Shark

Industry Trends• Fast execution frameworks

– Impala• In-memory databases

– VoltDB, Druid

Tuesday, June 18, 13

Page 36: Real-time Analytics with Cassandra, Spark, and Shark

Industry Trends• Fast execution frameworks

– Impala• In-memory databases

– VoltDB, Druid• Streaming and real-time

Tuesday, June 18, 13

Page 37: Real-time Analytics with Cassandra, Spark, and Shark

Industry Trends• Fast execution frameworks

– Impala• In-memory databases

– VoltDB, Druid• Streaming and real-time• Higher-level, productive data frameworks

– Cascading, Hive, Pig

Tuesday, June 18, 13

Page 38: Real-time Analytics with Cassandra, Spark, and Shark

Why Spark and Shark?“Lightning-fast in-memory cluster computing”

Tuesday, June 18, 13

Page 39: Real-time Analytics with Cassandra, Spark, and Shark

Introduction to Spark

Tuesday, June 18, 13

Page 40: Real-time Analytics with Cassandra, Spark, and Shark

Introduction to Spark

• In-memory distributed computing framework

Tuesday, June 18, 13

Page 41: Real-time Analytics with Cassandra, Spark, and Shark

Introduction to Spark

• In-memory distributed computing framework• Created by UC Berkeley AMP Lab in 2010

Tuesday, June 18, 13

Page 42: Real-time Analytics with Cassandra, Spark, and Shark

Introduction to Spark

• In-memory distributed computing framework• Created by UC Berkeley AMP Lab in 2010• Targeted problems that MR is bad at:

– Iterative algorithms (machine learning)– Interactive data mining

Tuesday, June 18, 13

Page 43: Real-time Analytics with Cassandra, Spark, and Shark

Introduction to Spark

• In-memory distributed computing framework• Created by UC Berkeley AMP Lab in 2010• Targeted problems that MR is bad at:

– Iterative algorithms (machine learning)– Interactive data mining

• More general purpose than Hadoop MR

Tuesday, June 18, 13

Page 44: Real-time Analytics with Cassandra, Spark, and Shark

Introduction to Spark

• In-memory distributed computing framework• Created by UC Berkeley AMP Lab in 2010• Targeted problems that MR is bad at:

– Iterative algorithms (machine learning)– Interactive data mining

• More general purpose than Hadoop MR• Active contributions from ~ 15 companies

Tuesday, June 18, 13

Page 45: Real-time Analytics with Cassandra, Spark, and Shark

HDFS

Map

Reduce

Map

Reduce

Tuesday, June 18, 13

Page 46: Real-time Analytics with Cassandra, Spark, and Shark

HDFS

Map

Reduce

Map

Reduce

Tuesday, June 18, 13

Page 47: Real-time Analytics with Cassandra, Spark, and Shark

HDFS

Map

Reduce

Map

Reduce

Data Source

map()

join()

Source 2

Tuesday, June 18, 13

Page 48: Real-time Analytics with Cassandra, Spark, and Shark

HDFS

Map

Reduce

Map

Reduce

Data Source

map()

join()

Source 2

cache()

Tuesday, June 18, 13

Page 49: Real-time Analytics with Cassandra, Spark, and Shark

HDFS

Map

Reduce

Map

Reduce

Data Source

map()

join()

Source 2

cache()

transform

Tuesday, June 18, 13

Page 50: Real-time Analytics with Cassandra, Spark, and Shark

Throughput: Memory is king

6-node C*/DSE 1.1.9 cluster,Spark 0.7.0

Tuesday, June 18, 13

Page 51: Real-time Analytics with Cassandra, Spark, and Shark

Throughput: Memory is king

0 37500 75000 112500 150000C*, cold cache

C*, warm cache

Spark RDD

6-node C*/DSE 1.1.9 cluster,Spark 0.7.0

Tuesday, June 18, 13

Page 52: Real-time Analytics with Cassandra, Spark, and Shark

Throughput: Memory is king

0 37500 75000 112500 150000C*, cold cache

C*, warm cache

Spark RDD

6-node C*/DSE 1.1.9 cluster,Spark 0.7.0

Tuesday, June 18, 13

Page 53: Real-time Analytics with Cassandra, Spark, and Shark

Throughput: Memory is king

0 37500 75000 112500 150000C*, cold cache

C*, warm cache

Spark RDD

6-node C*/DSE 1.1.9 cluster,Spark 0.7.0

Tuesday, June 18, 13

Page 54: Real-time Analytics with Cassandra, Spark, and Shark

Throughput: Memory is king

0 37500 75000 112500 150000C*, cold cache

C*, warm cache

Spark RDD

6-node C*/DSE 1.1.9 cluster,Spark 0.7.0

Tuesday, June 18, 13

Page 55: Real-time Analytics with Cassandra, Spark, and Shark

Developers love it

Tuesday, June 18, 13

Page 56: Real-time Analytics with Cassandra, Spark, and Shark

Developers love it

• “I wrote my first aggregation job in 30 minutes”

Tuesday, June 18, 13

Page 57: Real-time Analytics with Cassandra, Spark, and Shark

Developers love it

• “I wrote my first aggregation job in 30 minutes”• High level “distributed collections” API

Tuesday, June 18, 13

Page 58: Real-time Analytics with Cassandra, Spark, and Shark

Developers love it

• “I wrote my first aggregation job in 30 minutes”• High level “distributed collections” API• No Hadoop cruft

Tuesday, June 18, 13

Page 59: Real-time Analytics with Cassandra, Spark, and Shark

Developers love it

• “I wrote my first aggregation job in 30 minutes”• High level “distributed collections” API• No Hadoop cruft• Full power of Scala, Java, Python

Tuesday, June 18, 13

Page 60: Real-time Analytics with Cassandra, Spark, and Shark

Developers love it

• “I wrote my first aggregation job in 30 minutes”• High level “distributed collections” API• No Hadoop cruft• Full power of Scala, Java, Python• Interactive REPL shell

Tuesday, June 18, 13

Page 61: Real-time Analytics with Cassandra, Spark, and Shark

Developers love it

• “I wrote my first aggregation job in 30 minutes”• High level “distributed collections” API• No Hadoop cruft• Full power of Scala, Java, Python• Interactive REPL shell• EASY testing!!

Tuesday, June 18, 13

Page 62: Real-time Analytics with Cassandra, Spark, and Shark

Developers love it

• “I wrote my first aggregation job in 30 minutes”• High level “distributed collections” API• No Hadoop cruft• Full power of Scala, Java, Python• Interactive REPL shell• EASY testing!!• Low latency - quick development cycles

Tuesday, June 18, 13

Page 63: Real-time Analytics with Cassandra, Spark, and Shark

Spark word count example

1 package org.myorg; 2 3 import java.io.IOException; 4 import java.util.*; 5 6 import org.apache.hadoop.fs.Path; 7 import org.apache.hadoop.conf.*; 8 import org.apache.hadoop.io.*; 9 import org.apache.hadoop.mapreduce.*; 10 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 11 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 12 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 13 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 14 15 public class WordCount { 16 17 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { 18 private final static IntWritable one = new IntWritable(1); 19 private Text word = new Text(); 20 21 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { 22 String line = value.toString(); 23 StringTokenizer tokenizer = new StringTokenizer(line); 24 while (tokenizer.hasMoreTokens()) { 25 word.set(tokenizer.nextToken()); 26 context.write(word, one); 27 } 28 } 29 } 30 31 public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { 32 33 public void reduce(Text key, Iterable<IntWritable> values, Context context) 34 throws IOException, InterruptedException { 35 int sum = 0; 36 for (IntWritable val : values) { 37 sum += val.get(); 38 } 39 context.write(key, new IntWritable(sum)); 40 } 41 } 42 43 public static void main(String[] args) throws Exception { 44 Configuration conf = new Configuration(); 45 46 Job job = new Job(conf, "wordcount"); 47 48 job.setOutputKeyClass(Text.class); 49 job.setOutputValueClass(IntWritable.class); 50 51 job.setMapperClass(Map.class); 52 job.setReducerClass(Reduce.class); 53 54 job.setInputFormatClass(TextInputFormat.class); 55 job.setOutputFormatClass(TextOutputFormat.class); 56 57 FileInputFormat.addInputPath(job, new Path(args[0])); 58 FileOutputFormat.setOutputPath(job, new Path(args[1])); 59 60 job.waitForCompletion(true); 61 } 62 63 }

Tuesday, June 18, 13

Page 64: Real-time Analytics with Cassandra, Spark, and Shark

Spark word count example

file = spark.textFile("hdfs://...") file.flatMap(line => line.split(" "))    .map(word => (word, 1))    .reduceByKey(_ + _)

1 package org.myorg; 2 3 import java.io.IOException; 4 import java.util.*; 5 6 import org.apache.hadoop.fs.Path; 7 import org.apache.hadoop.conf.*; 8 import org.apache.hadoop.io.*; 9 import org.apache.hadoop.mapreduce.*; 10 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 11 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 12 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 13 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 14 15 public class WordCount { 16 17 public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { 18 private final static IntWritable one = new IntWritable(1); 19 private Text word = new Text(); 20 21 public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { 22 String line = value.toString(); 23 StringTokenizer tokenizer = new StringTokenizer(line); 24 while (tokenizer.hasMoreTokens()) { 25 word.set(tokenizer.nextToken()); 26 context.write(word, one); 27 } 28 } 29 } 30 31 public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { 32 33 public void reduce(Text key, Iterable<IntWritable> values, Context context) 34 throws IOException, InterruptedException { 35 int sum = 0; 36 for (IntWritable val : values) { 37 sum += val.get(); 38 } 39 context.write(key, new IntWritable(sum)); 40 } 41 } 42 43 public static void main(String[] args) throws Exception { 44 Configuration conf = new Configuration(); 45 46 Job job = new Job(conf, "wordcount"); 47 48 job.setOutputKeyClass(Text.class); 49 job.setOutputValueClass(IntWritable.class); 50 51 job.setMapperClass(Map.class); 52 job.setReducerClass(Reduce.class); 53 54 job.setInputFormatClass(TextInputFormat.class); 55 job.setOutputFormatClass(TextOutputFormat.class); 56 57 FileInputFormat.addInputPath(job, new Path(args[0])); 58 FileOutputFormat.setOutputPath(job, new Path(args[1])); 59 60 job.waitForCompletion(true); 61 } 62 63 }

Tuesday, June 18, 13

Page 65: Real-time Analytics with Cassandra, Spark, and Shark

The Spark Ecosystem

Spark

Tachyon - in-memory caching DFS

Tuesday, June 18, 13

Page 66: Real-time Analytics with Cassandra, Spark, and Shark

The Spark Ecosystem

Bagel - Pregel on Spark

Spark

Tachyon - in-memory caching DFS

Tuesday, June 18, 13

Page 67: Real-time Analytics with Cassandra, Spark, and Shark

The Spark Ecosystem

Bagel - Pregel on Spark HIVE on Spark

Spark

Tachyon - in-memory caching DFS

Tuesday, June 18, 13

Page 68: Real-time Analytics with Cassandra, Spark, and Shark

The Spark Ecosystem

Bagel - Pregel on Spark HIVE on Spark

Spark Streaming - discretized stream processing

Spark

Tachyon - in-memory caching DFS

Tuesday, June 18, 13

Page 69: Real-time Analytics with Cassandra, Spark, and Shark

Shark - HIVE on Spark

Tuesday, June 18, 13

Page 70: Real-time Analytics with Cassandra, Spark, and Shark

Shark - HIVE on Spark

• 100% HiveQL compatible

Tuesday, June 18, 13

Page 71: Real-time Analytics with Cassandra, Spark, and Shark

Shark - HIVE on Spark

• 100% HiveQL compatible• 10-100x faster than HIVE, answers in seconds

Tuesday, June 18, 13

Page 72: Real-time Analytics with Cassandra, Spark, and Shark

Shark - HIVE on Spark

• 100% HiveQL compatible• 10-100x faster than HIVE, answers in seconds• Reuse UDFs, SerDe’s, StorageHandlers

Tuesday, June 18, 13

Page 73: Real-time Analytics with Cassandra, Spark, and Shark

Shark - HIVE on Spark

• 100% HiveQL compatible• 10-100x faster than HIVE, answers in seconds• Reuse UDFs, SerDe’s, StorageHandlers• Can use DSE / CassandraFS for Metastore

Tuesday, June 18, 13

Page 74: Real-time Analytics with Cassandra, Spark, and Shark

Shark - HIVE on Spark

• 100% HiveQL compatible• 10-100x faster than HIVE, answers in seconds• Reuse UDFs, SerDe’s, StorageHandlers• Can use DSE / CassandraFS for Metastore• Easy Scala/Java integration via Spark - easier than

writing UDFs

Tuesday, June 18, 13

Page 75: Real-time Analytics with Cassandra, Spark, and Shark

Our new analytics architectureHow we integrate Cassandra and Spark/Shark

Tuesday, June 18, 13

Page 76: Real-time Analytics with Cassandra, Spark, and Shark

From raw events to fast queries

Raw Events

Raw Events

Raw Events

Tuesday, June 18, 13

Page 77: Real-time Analytics with Cassandra, Spark, and Shark

From raw events to fast queries

IngestionC*

event store

Raw Events

Raw Events

Raw Events

Tuesday, June 18, 13

Page 78: Real-time Analytics with Cassandra, Spark, and Shark

From raw events to fast queries

IngestionC*

event store

Raw Events

Raw Events

Raw Events Spark

Spark

Spark

View 1

View 2

View 3

Tuesday, June 18, 13

Page 79: Real-time Analytics with Cassandra, Spark, and Shark

From raw events to fast queries

IngestionC*

event store

Raw Events

Raw Events

Raw Events Spark

Spark

Spark

View 1

View 2

View 3

Spark Predefined queries

Tuesday, June 18, 13

Page 80: Real-time Analytics with Cassandra, Spark, and Shark

From raw events to fast queries

IngestionC*

event store

Raw Events

Raw Events

Raw Events Spark

Spark

Spark

View 1

View 2

View 3

Spark

Shark

Predefined queries

Ad-hoc HiveQL

Tuesday, June 18, 13

Page 81: Real-time Analytics with Cassandra, Spark, and Shark

Our Spark/Shark/Cassandra Stack

Node1

CassandraNode2

CassandraNode3

Cassandra

Tuesday, June 18, 13

Page 82: Real-time Analytics with Cassandra, Spark, and Shark

Our Spark/Shark/Cassandra Stack

Node1

Cassandra

InputFormat

SerDe

Node2

Cassandra

InputFormat

SerDe

Node3

Cassandra

InputFormat

SerDe

Tuesday, June 18, 13

Page 83: Real-time Analytics with Cassandra, Spark, and Shark

Our Spark/Shark/Cassandra Stack

Node1

Cassandra

InputFormat

SerDe

Spark Worker

Shark

Node2

Cassandra

InputFormat

SerDe

Spark Worker

Shark

Node3

Cassandra

InputFormat

SerDe

Spark Worker

Shark

Tuesday, June 18, 13

Page 84: Real-time Analytics with Cassandra, Spark, and Shark

Our Spark/Shark/Cassandra Stack

Node1

Cassandra

InputFormat

SerDe

Spark Worker

Shark

Node2

Cassandra

InputFormat

SerDe

Spark Worker

Shark

Node3

Cassandra

InputFormat

SerDe

Spark Worker

Shark

Spark Master

Tuesday, June 18, 13

Page 85: Real-time Analytics with Cassandra, Spark, and Shark

Our Spark/Shark/Cassandra Stack

Node1

Cassandra

InputFormat

SerDe

Spark Worker

Shark

Node2

Cassandra

InputFormat

SerDe

Spark Worker

Shark

Node3

Cassandra

InputFormat

SerDe

Spark Worker

Shark

Spark Master Job Server

Tuesday, June 18, 13

Page 86: Real-time Analytics with Cassandra, Spark, and Shark

Event Store Cassandra schema

t0 t1 t2 t3 t4

2013-04-05T00:00Z#id1

{event0: a0}

{event1: a1}

{event2: a2}

{event3: a3}

{event4: a4}

Event CF

Tuesday, June 18, 13

Page 87: Real-time Analytics with Cassandra, Spark, and Shark

Event Store Cassandra schema

t0 t1 t2 t3 t4

2013-04-05T00:00Z#id1

{event0: a0}

{event1: a1}

{event2: a2}

{event3: a3}

{event4: a4}

ipaddr:10.20.30.40:t1 videoId:45678:t1 providerId:500:t0

2013-04-05T00:00Z#id1

Event CF

EventAttr CF

Tuesday, June 18, 13

Page 88: Real-time Analytics with Cassandra, Spark, and Shark

Unpacking raw eventst0 t1

2013-04-05T00:00Z#id1

{video: 10, type:5}

{video: 11, type:1}

2013-04-05T00:00Z#id2

{video: 20, type:5}

{video: 25, type:9}

UserID Video Typeid1 10 5

Tuesday, June 18, 13

Page 89: Real-time Analytics with Cassandra, Spark, and Shark

Unpacking raw eventst0 t1

2013-04-05T00:00Z#id1

{video: 10, type:5}

{video: 11, type:1}

2013-04-05T00:00Z#id2

{video: 20, type:5}

{video: 25, type:9}

UserID Video Typeid1 10 5

id1 11 1

Tuesday, June 18, 13

Page 90: Real-time Analytics with Cassandra, Spark, and Shark

Unpacking raw eventst0 t1

2013-04-05T00:00Z#id1

{video: 10, type:5}

{video: 11, type:1}

2013-04-05T00:00Z#id2

{video: 20, type:5}

{video: 25, type:9}

UserID Video Typeid1 10 5

id1 11 1

id2 20 5

Tuesday, June 18, 13

Page 91: Real-time Analytics with Cassandra, Spark, and Shark

Unpacking raw eventst0 t1

2013-04-05T00:00Z#id1

{video: 10, type:5}

{video: 11, type:1}

2013-04-05T00:00Z#id2

{video: 20, type:5}

{video: 25, type:9}

UserID Video Typeid1 10 5

id1 11 1

id2 20 5

id2 25 9

Tuesday, June 18, 13

Page 92: Real-time Analytics with Cassandra, Spark, and Shark

Tips for InputFormat Development

Tuesday, June 18, 13

Page 93: Real-time Analytics with Cassandra, Spark, and Shark

Tips for InputFormat Development• Know which target platforms you are developing for

– Which API to write against? New? Old? Both?

Tuesday, June 18, 13

Page 94: Real-time Analytics with Cassandra, Spark, and Shark

Tips for InputFormat Development• Know which target platforms you are developing for

– Which API to write against? New? Old? Both?• Be prepared to spend time tuning your split computation

– Low latency jobs require fast splits

Tuesday, June 18, 13

Page 95: Real-time Analytics with Cassandra, Spark, and Shark

Tips for InputFormat Development• Know which target platforms you are developing for

– Which API to write against? New? Old? Both?• Be prepared to spend time tuning your split computation

– Low latency jobs require fast splits• Consider sorting row keys by token for data locality

Tuesday, June 18, 13

Page 96: Real-time Analytics with Cassandra, Spark, and Shark

Tips for InputFormat Development• Know which target platforms you are developing for

– Which API to write against? New? Old? Both?• Be prepared to spend time tuning your split computation

– Low latency jobs require fast splits• Consider sorting row keys by token for data locality• Implement predicate pushdown for HIVE SerDe’s

– Use your indexes to reduce size of dataset

Tuesday, June 18, 13

Page 97: Real-time Analytics with Cassandra, Spark, and Shark

Example: OLAP processing

t0

2013-04-05T00:00Z#id1

{video: 10, type:5}

2013-04-05T00:00Z#id2

{video: 20, type:5}

C* events

Tuesday, June 18, 13

Page 98: Real-time Analytics with Cassandra, Spark, and Shark

Example: OLAP processing

t0

2013-04-05T00:00Z#id1

{video: 10, type:5}

2013-04-05T00:00Z#id2

{video: 20, type:5}

C* events

OLAP Aggregates

OLAP Aggregates

OLAP Aggregates

Cached Materialized Views

Spark

Spark

Spark

Tuesday, June 18, 13

Page 99: Real-time Analytics with Cassandra, Spark, and Shark

Example: OLAP processing

t0

2013-04-05T00:00Z#id1

{video: 10, type:5}

2013-04-05T00:00Z#id2

{video: 20, type:5}

C* events

OLAP Aggregates

OLAP Aggregates

OLAP Aggregates

Cached Materialized Views

Spark

Spark

Spark

Union

Tuesday, June 18, 13

Page 100: Real-time Analytics with Cassandra, Spark, and Shark

Example: OLAP processing

t0

2013-04-05T00:00Z#id1

{video: 10, type:5}

2013-04-05T00:00Z#id2

{video: 20, type:5}

C* events

OLAP Aggregates

OLAP Aggregates

OLAP Aggregates

Cached Materialized Views

Spark

Spark

Spark

Union

Query 1: Plays by Provider

Tuesday, June 18, 13

Page 101: Real-time Analytics with Cassandra, Spark, and Shark

Example: OLAP processing

t0

2013-04-05T00:00Z#id1

{video: 10, type:5}

2013-04-05T00:00Z#id2

{video: 20, type:5}

C* events

OLAP Aggregates

OLAP Aggregates

OLAP Aggregates

Cached Materialized Views

Spark

Spark

Spark

Union

Query 1: Plays by Provider

Query 2: Top content for mobile

Tuesday, June 18, 13

Page 102: Real-time Analytics with Cassandra, Spark, and Shark

Performance numbers

6-node C*/DSE 1.1.9 cluster,Spark 0.7.0

Tuesday, June 18, 13

Page 103: Real-time Analytics with Cassandra, Spark, and Shark

Performance numbers

Spark: C* -> OLAP aggregatescold cache, 1.4 million events

130 seconds

C* -> OLAP aggregateswarmed cache

20-30 seconds

OLAP aggregate query via Spark(56k records)

60 ms

6-node C*/DSE 1.1.9 cluster,Spark 0.7.0

Tuesday, June 18, 13

Page 104: Real-time Analytics with Cassandra, Spark, and Shark

Performance numbers

Spark: C* -> OLAP aggregatescold cache, 1.4 million events

130 seconds

C* -> OLAP aggregateswarmed cache

20-30 seconds

OLAP aggregate query via Spark(56k records)

60 ms

6-node C*/DSE 1.1.9 cluster,Spark 0.7.0

Tuesday, June 18, 13

Page 105: Real-time Analytics with Cassandra, Spark, and Shark

Performance numbers

Spark: C* -> OLAP aggregatescold cache, 1.4 million events

130 seconds

C* -> OLAP aggregateswarmed cache

20-30 seconds

OLAP aggregate query via Spark(56k records)

60 ms

6-node C*/DSE 1.1.9 cluster,Spark 0.7.0

Tuesday, June 18, 13

Page 106: Real-time Analytics with Cassandra, Spark, and Shark

OLAP WorkFlow

Aggregation JobSparkExecutors

REST Job Server

Aggregate

Tuesday, June 18, 13

Page 107: Real-time Analytics with Cassandra, Spark, and Shark

OLAP WorkFlow

Aggregation JobSparkExecutors

Cassandra

REST Job Server

Aggregate

Tuesday, June 18, 13

Page 108: Real-time Analytics with Cassandra, Spark, and Shark

OLAP WorkFlow

DatasetAggregation JobSparkExecutors

Cassandra

REST Job Server

Aggregate

Tuesday, June 18, 13

Page 109: Real-time Analytics with Cassandra, Spark, and Shark

OLAP WorkFlow

DatasetAggregation Job Query JobSparkExecutors

Cassandra

REST Job Server

Aggregate Query

Tuesday, June 18, 13

Page 110: Real-time Analytics with Cassandra, Spark, and Shark

OLAP WorkFlow

DatasetAggregation Job Query JobSparkExecutors

Cassandra

REST Job Server

Aggregate Query

Result

Tuesday, June 18, 13

Page 111: Real-time Analytics with Cassandra, Spark, and Shark

OLAP WorkFlow

DatasetAggregation Job Query JobSparkExecutors

Cassandra

REST Job Server

Query Job

Aggregate Query

Result

Query

Result

Tuesday, June 18, 13

Page 112: Real-time Analytics with Cassandra, Spark, and Shark

Fault Tolerance

Tuesday, June 18, 13

Page 113: Real-time Analytics with Cassandra, Spark, and Shark

Fault Tolerance• Cached dataset lives in Java Heap only - what if process dies?

Tuesday, June 18, 13

Page 114: Real-time Analytics with Cassandra, Spark, and Shark

Fault Tolerance• Cached dataset lives in Java Heap only - what if process dies?• Spark lineage - automatic recomputation from source, but this is

expensive!

Tuesday, June 18, 13

Page 115: Real-time Analytics with Cassandra, Spark, and Shark

Fault Tolerance• Cached dataset lives in Java Heap only - what if process dies?• Spark lineage - automatic recomputation from source, but this is

expensive!• Can also replicate cached dataset to survive single node failures

Tuesday, June 18, 13

Page 116: Real-time Analytics with Cassandra, Spark, and Shark

Fault Tolerance• Cached dataset lives in Java Heap only - what if process dies?• Spark lineage - automatic recomputation from source, but this is

expensive!• Can also replicate cached dataset to survive single node failures• Persist materialized views back to C*, then load into cache -- now

recovery path is much faster

Tuesday, June 18, 13

Page 117: Real-time Analytics with Cassandra, Spark, and Shark

Fault Tolerance• Cached dataset lives in Java Heap only - what if process dies?• Spark lineage - automatic recomputation from source, but this is

expensive!• Can also replicate cached dataset to survive single node failures• Persist materialized views back to C*, then load into cache -- now

recovery path is much faster• Persistence also enables multiple processes to hold cached dataset

Tuesday, June 18, 13

Page 118: Real-time Analytics with Cassandra, Spark, and Shark

Demo time

Tuesday, June 18, 13

Page 119: Real-time Analytics with Cassandra, Spark, and Shark

Shark Demo• Local shark node, 1 core, MBP• How to create a table from C* using our inputformat• Creating a cached Shark table• Running fast queries

Tuesday, June 18, 13

Page 120: Real-time Analytics with Cassandra, Spark, and Shark

Creating a Shark Table from InputFormat

Tuesday, June 18, 13

Page 121: Real-time Analytics with Cassandra, Spark, and Shark

Creating a cached table

Tuesday, June 18, 13

Page 122: Real-time Analytics with Cassandra, Spark, and Shark

Querying cached table

Tuesday, June 18, 13

Page 123: Real-time Analytics with Cassandra, Spark, and Shark

THANK YOU

Tuesday, June 18, 13

Page 124: Real-time Analytics with Cassandra, Spark, and Shark

THANK YOU• @evanfchan

Tuesday, June 18, 13

Page 125: Real-time Analytics with Cassandra, Spark, and Shark

THANK YOU• @evanfchan• [email protected]

Tuesday, June 18, 13

Page 126: Real-time Analytics with Cassandra, Spark, and Shark

THANK YOU• @evanfchan• [email protected]

Tuesday, June 18, 13

Page 127: Real-time Analytics with Cassandra, Spark, and Shark

THANK YOU• @evanfchan• [email protected]

• WE ARE HIRING!!

Tuesday, June 18, 13

Page 128: Real-time Analytics with Cassandra, Spark, and Shark

Spark: Under the hood

Map DatasetReduce Map

Driver Map DatasetReduce Map

Map DatasetReduce Map

One executor process per node

Driver

Tuesday, June 18, 13