Apache Cassandra and Spark: You got the lighter, let's spark the fire

Transcript
Page 1: Apache Cassandra and Spark. You got the lighter, let's spark the fire

©2013 DataStax Confidential. Do not distribute without consent.

@PatrickMcFadin

Patrick McFadin, Chief Evangelist for Apache Cassandra

Apache Cassandra and Spark: You got the lighter, let's spark the fire


Page 2

6 years. How’s it going?

Page 3

Cassandra 3.0 & 3.1

Spring and Fall

Page 4

Cassandra is…
• Shared nothing
• Masterless peer-to-peer
• Great scaling story
• Resilient to failure

Page 5

Cassandra for Applications

APACHE CASSANDRA

Page 6

A Data Ocean, Lake, or Pond

An In-Memory Database

A Key-Value Store

A magical database unicorn that farts rainbows

Page 7

Apache Spark

Page 8

Apache Spark
• 10x faster on disk, 100x faster in memory than Hadoop MR
• Works out of the box on EMR
• Fault-tolerant distributed datasets
• Batch, iterative, and streaming analysis
• In-memory and on-disk storage
• Integrates with most file and storage options

Up to 100× faster (2-10× on disk)

2-5× less code

Page 9

Spark Components

• Spark Core
• Spark SQL (structured data)
• Spark Streaming (real-time)
• MLlib (machine learning)
• GraphX (graph)

Page 10
Page 11

org.apache.spark.rdd.RDD
Resilient Distributed Dataset (RDD)

•Created through transformations on data (map, filter, …) or other RDDs

•Immutable

•Partitioned

•Reusable

Page 12

RDD Operations

•Transformations
  - Similar to the Scala collections API
  - Produce new RDDs
  - filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract

•Actions
  - Require materialization of the records to generate a value
  - collect: Array[T], count, fold, reduce, …
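As a sketch of the lazy-transformation / eager-action split (a hypothetical snippet; assumes a SparkContext `sc` like the one built in the connection example later in the deck):

```scala
// Transformations only build a lineage of new RDDs; nothing executes yet.
val lines  = sc.parallelize(Seq("spark and cassandra", "cassandra rocks"))
val words  = lines.flatMap(_.split(" "))  // transformation
val pairs  = words.map(w => (w, 1))       // transformation
val counts = pairs.reduceByKey(_ + _)     // transformation

// Actions materialize the records and return a value to the driver.
counts.collect()  // Array[(String, Int)]
counts.count()    // 4 distinct words: spark, and, cassandra, rocks
```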

Page 13

RDD Operations

(Diagram: transformations chain RDDs together; an action at the end of the chain produces a result.)

Page 14

Cassandra and Spark

Page 15

Cassandra & Spark: A Great Combo

DataStax spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector

•Both are easy to use
•Spark can help you bridge your Hadoop and Cassandra systems
•Use Spark libraries and caching on top of Cassandra-stored data
•Combine Spark Streaming with Cassandra storage

Page 16

Spark on Cassandra

•Server-side filters (WHERE clauses)
•Cross-table operations (JOIN, UNION, etc.)
•Data locality-aware (speed)
•Data transformation, aggregation, etc.
•Natural time series integration
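The server-side filter bullet can be sketched with the connector's `select` and `where` methods (hypothetical usage; assumes a configured SparkContext `sc`, a keyspace named "test", and the temperature table defined later in the deck, where year and month are clustering columns):

```scala
// The projection and predicate are pushed down to Cassandra, so only
// matching rows and the named columns cross the network.
val december = sc.cassandraTable("test", "temperature")
  .select("weather_station", "temperature")
  .where("year = ? AND month = ?", 2005, 12)

december.collect().foreach(println)
```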

Page 17

Apache Spark and Cassandra Open Source Stack

(Diagram: Spark running on top of Cassandra.)

Page 18

Spark Cassandra Connector

Page 19

Spark Cassandra Connector

•Cassandra tables exposed as Spark RDDs
•Read from and write to Cassandra
•Mapping of C* tables and rows to Scala objects
•All Cassandra types supported and converted to Scala types
•Server-side data selection
•Virtual nodes support
•Use with Scala or Java
•Compatible with Spark 1.1.0, Cassandra 2.1 & 2.0

Page 20

Type Mapping

CQL Type         Scala Type
ascii            String
bigint           Long
boolean          Boolean
counter          Long
decimal          BigDecimal, java.math.BigDecimal
double           Double
float            Float
inet             java.net.InetAddress
int              Int
list             Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map              Map, TreeMap, java.util.HashMap
set              Set, TreeSet, java.util.HashSet
text, varchar    String
timestamp        Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid         java.util.UUID
uuid             java.util.UUID
varint           BigInt, java.math.BigInteger

*Nullable values are represented as Option
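The nullable-to-Option row at the bottom shows up through the row accessors; a hypothetical sketch against the test.words table used in the data-access example:

```scala
val row = sc.cassandraTable("test", "words").first

row.getInt("count")           // Int; throws if the column is null
row.get[Option[Int]]("count") // Option[Int]; None if the column is null
```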

Page 21

Connecting to Cassandra

// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.spark.connector._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("spark.cassandra.connection.host", "192.168.123.10") // initial contact point
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")

val sc = new SparkContext(conf)

Page 22

Accessing Data

CREATE TABLE test.words (word text PRIMARY KEY, count int);

INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);

Accessing the table above as an RDD:

// Use the table as an RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.toArray.foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames // Stream(word, count)
rdd.size        // 2

val firstRow = rdd.first
// firstRow: CassandraRow = CassandraRow[word: bar, count: 30]

firstRow.getInt("count") // Int = 30

Page 23

Saving Data

val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", SomeColumns("word", "count"))

The RDD above saved to Cassandra:

SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50

(4 rows)

Page 24

Spark Cassandra Connector
https://github.com/datastax/spark-cassandra-connector

(Diagram: a Cassandra keyspace/table exposed in Spark as RDD[CassandraRow] and RDD[Tuples].)

Bundled and supported with DSE 4.5!

Page 25

The Spark Cassandra Connector uses the DataStax Java Driver to read from and write to C*.

•Each Spark executor maintains a connection to the C* cluster through the DataStax Java Driver
•The full token range is divided into sets of tokens (Tokens 1-1000, 1001-2000, …)
•RDDs are read into different splits based on those sets of tokens

Page 26

Co-locate Spark and C* for Best Performance

Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing.

(Diagram: four C* nodes, each running a Spark Worker, plus a Spark Master.)

Page 27

Analytics Workload Isolation

Mixed-load Cassandra cluster:
•Online app → Cassandra-only DC
•Analytical app → Cassandra + Spark DC

Page 28

Data Locality

Page 29

Example 1: Weather Station
• Weather station collects data
• Cassandra stores in sequence
• Application reads in sequence

Page 30

Data Model
• Weather Station Id and Time are unique
• Store as many as needed

CREATE TABLE temperature (
  weather_station text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  PRIMARY KEY (weather_station, year, month, day, hour)
);

INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);

INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 8, -5.1);

INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 9, -4.9);

INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 10, -5.3);
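The "application reads in sequence" point falls out of the clustering order; a sketch of a CQL read for one station-day (keyspace omitted, matching the slide):

```sql
-- Partition-key equality plus a clustering-column prefix:
-- rows come back ordered by (year, month, day, hour).
SELECT hour, temperature
FROM temperature
WHERE weather_station = '10010:99999'
  AND year = 2005 AND month = 12 AND day = 1;
```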

Page 31

Primary Key Relationship

PRIMARY KEY (weather_station, year, month, day, hour)

• Partition key: weather_station
• Clustering columns: year, month, day, hour

Partition '10010:99999' stores its rows in clustering order:

2005:12:1:7   -5.6
2005:12:1:8   -5.1
2005:12:1:9   -4.9
2005:12:1:10  -5.3

Page 36

Partition Keys

INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);
  '10010:99999' → Murmur3 hash → token = 7224631062609997448

INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
VALUES ('722266:13850', 2005, 12, 1, 7, -5.6);
  '722266:13850' → Murmur3 hash → token = -6804302034103043898

Consistent hash: a 64-bit number between -2^63 and 2^63 - 1.

Page 37

Partition Keys

INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);
  '10010:99999' → Murmur3 hash → token = 15

INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
VALUES ('722266:13850', 2005, 12, 1, 7, -5.6);
  '722266:13850' → Murmur3 hash → token = 77

For this example, let's make the tokens reasonable numbers.

Page 38

Writes & WAN Replication

DC1: RF=3
Node      Primary  Replica  Replica
10.0.0.1  00-25    76-100   51-75
10.0.0.2  26-50    00-25    76-100
10.0.0.3  51-75    26-50    00-25
10.0.0.4  76-100   51-75    26-50

DC2: RF=3
Node       Primary  Replica  Replica
10.10.0.1  00-25    76-100   51-75
10.10.0.2  26-50    00-25    76-100
10.10.0.3  51-75    26-50    00-25
10.10.0.4  76-100   51-75    26-50

A client inserts data with partition key token 15. The write lands on the DC1 replicas of range 00-25 (asynchronous local replication), then replicates to DC2 over the WAN (asynchronous WAN replication).

Page 39

Locality

DC1: RF=3
Node      Primary  Replica  Replica
10.0.0.1  00-25    76-100   51-75
10.0.0.2  26-50    00-25    76-100
10.0.0.3  51-75    26-50    00-25
10.0.0.4  76-100   51-75    26-50

DC2: RF=3
Node       Primary  Replica  Replica
10.10.0.1  00-25    76-100   51-75
10.10.0.2  26-50    00-25    76-100
10.10.0.3  51-75    26-50    00-25
10.10.0.4  76-100   51-75    26-50

A client in either DC asking for partition key token 15 can get the data from any local replica of range 00-25.

Page 40

Data Locality

weatherstation_id = '10010:99999' ?

1000-node cluster. You are here!

Page 41

Spark Reads on Cassandra

Awesome animation by DataStax's own Russell Spitzer

Page 42

Spark RDDs represent a large amount of data partitioned into chunks.

(Animation: RDD partitions 1-9 are distributed across Node 1 through Node 4.)

Page 45

Cassandra data is distributed by token range.

(Animation: a ring of tokens 0-999. Without vnodes, Node 1 through Node 4 each own one contiguous token range. With vnodes, each node owns many small ranges scattered around the ring.)

Page 51

The connector uses information on the node to make Spark partitions.

spark.cassandra.input.split.size 50
Reported density is 0.5

(Animation: Node 1 owns token ranges 0-50, 120-220, 300-500, and 780-830. The connector groups and subdivides the node's token ranges into Spark partitions 1-4 of roughly 50 CQL rows each; for example, 300-500 is split into 300-400 and 400-500.)
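Since split.size counts CQL rows and the reported density here is 0.5 rows per token, each 50-row split covers roughly 100 tokens. A sketch of setting these knobs (connector 1.1-era property names; the values are illustrative):

```scala
// Illustrative read-tuning for the Spark Cassandra Connector (1.1-era names).
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "192.168.123.10")
  .set("spark.cassandra.input.split.size", "50")     // target CQL rows per Spark partition
  .set("spark.cassandra.input.page.row.size", "50")  // CQL rows fetched per driver page
```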

Page 64

Data is retrieved using the DataStax Java Driver.

spark.cassandra.input.page.row.size 50

(Animation: each Spark partition is read from Node 1 with token-range queries, paging back 50 CQL rows at a time until each range is exhausted.)

SELECT * FROM keyspace.table WHERE token(pk) > 780 AND token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 AND token(pk) <= 50

Page 82

Co-locate Spark and C* for Best Performance

Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing.

Page 83

Weather Station Analysis
• Weather station collects data
• Cassandra stores in sequence
• Spark rolls up data into new tables

Windsor, California, July 1, 2014: High 73.4, Low 51.4

Page 84

Roll-up Table (SparkSQL example)

• Weather Station Id and Date are unique
• High and low temp for each day

CREATE TABLE daily_high_low (
  weatherstation text,
  date text,
  high_temp double,
  low_temp double,
  PRIMARY KEY ((weatherstation, date))
);

SparkSQL> INSERT INTO TABLE daily_high_low
        > SELECT weatherstation,
        >        to_date(year, month, day) date,
        >        max(temperature) high_temp,
        >        min(temperature) low_temp
        > FROM temperature
        > GROUP BY weatherstation, year, month, day;
OK
Time taken: 2.345 seconds
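The same roll-up can be sketched with the connector's RDD API instead of SparkSQL (a hypothetical equivalent; the keyspace name "test" is assumed, and the tables match the CQL above):

```scala
// Hypothetical RDD version of the daily high/low roll-up.
sc.cassandraTable("test", "temperature")
  .map { row =>
    val station = row.getString("weather_station")
    val y = row.getInt("year"); val m = row.getInt("month"); val d = row.getInt("day")
    val t = row.getDouble("temperature")
    // Key by (station, date); carry the temperature as both high and low candidate.
    ((station, f"$y-$m%02d-$d%02d"), (t, t))
  }
  .reduceByKey { case ((hi1, lo1), (hi2, lo2)) =>
    (math.max(hi1, hi2), math.min(lo1, lo2))
  }
  .map { case ((station, date), (hi, lo)) => (station, date, hi, lo) }
  .saveToCassandra("test", "daily_high_low",
    SomeColumns("weatherstation", "date", "high_temp", "low_temp"))
```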

Page 85

What Just Happened

• Data is read from the temperature table
• Transformed
• Inserted into the daily_high_low table

(Diagram: read data from the temperature table → transform → insert into daily_high_low.)

Page 86

Spark Streaming

Page 87

Spark versus Spark Streaming

Spark: zillions of bytes. Spark Streaming: gigabytes per second.

Page 88

Spark Streaming

(Diagram: streaming sources such as Kinesis and S3 feed Spark Streaming, alongside analytics and search workloads.)

Page 89

DStream: Micro Batches

μBatch (ordinary RDD) → μBatch (ordinary RDD) → μBatch (ordinary RDD)

Processing of a DStream = processing of its μBatches (RDDs)

• Continuous sequence of micro batches
• More complex processing models are possible with less effort
• Streaming computations as a series of deterministic batch computations on small time intervals

Page 90

Spark Streaming Example

// Initialization
val conf = new SparkConf(loadDefaults = true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("spark://127.0.0.1:7077")
val sc = new SparkContext(conf)

// CassandraRDD
val table: CassandraRDD[CassandraRow] = sc.cassandraTable("keyspace", "tweets")

// Stream initialization
val ssc = new StreamingContext(sc, Seconds(30))
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

// Transformations and action
stream.map(_._2).countByValue().saveToCassandra("demo", "wordcount")

ssc.start()
ssc.awaitTermination()

Page 91

Now what?

(Diagram: Spark jobs and Spark Streaming run against the Cassandra + Spark DC, isolated from the Cassandra-only DC serving the online app.)

Page 92

You can do this at home!

https://github.com/killrweather/killrweather

On your USB!

Page 93

Thank you!

Bring the questions

Follow me on Twitter: @PatrickMcFadin