Analytics with Spark and Cassandra

Transcript
Page 1: Analytics with Spark and Cassandra

@tlberglund

Spark and Cassandra

Page 2: Analytics with Spark and Cassandra


Page 3: Analytics with Spark and Cassandra

Cassandra

Page 4: Analytics with Spark and Cassandra

• You have too much data
• It changes a lot
• The system is always on

When?

Page 5: Analytics with Spark and Cassandra

• Distributed database
• Transactional/online
• Low latency, high velocity

What?

Page 6: Analytics with Spark and Cassandra

CREATE TABLE credit_card_transactions (
  card_number      VARCHAR,
  transaction_time TIMESTAMP,
  amount           DECIMAL,
  merchant_id      VARCHAR,
  country          VARCHAR,
  PRIMARY KEY (card_number, transaction_time)
) WITH CLUSTERING ORDER BY (transaction_time DESC);

Data Model

Page 7: Analytics with Spark and Cassandra

[Same CREATE TABLE statement as Page 6]

Partition Key

Page 8: Analytics with Spark and Cassandra

[Same CREATE TABLE statement as Page 6]

Partition Key

Page 9: Analytics with Spark and Cassandra

[Same CREATE TABLE statement as Page 6]

Clustering Column

Page 10: Analytics with Spark and Cassandra

[Same CREATE TABLE statement as Page 6]

Clustering Column

Page 11: Analytics with Spark and Cassandra

[Same CREATE TABLE statement as Page 6]

Primary Key

Page 12: Analytics with Spark and Cassandra

card_number  xctn_time  amount  country  merchant
5444…8389    13:58      $12.42  US       101
4532…1191    13:59      $50.38  US       101
5444…6027    13:59      $3.02   US       102
4532…2189    14:02      $12.42  US       198
5444…6027    14:03      €23.78  ES       352
5444…8389    14:03      $98.03  US       158

Data Model

Page 13: Analytics with Spark and Cassandra

[Same transaction table as Page 12]

Partition Key

Page 14: Analytics with Spark and Cassandra

[Same transaction table as Page 12]

Clustering Column

Page 15: Analytics with Spark and Cassandra

[Same transaction table as Page 12]

Primary Key

Page 16: Analytics with Spark and Cassandra

[Same transaction table as Page 12]

Partition

Page 17: Analytics with Spark and Cassandra

[Same transaction table as Page 12]

Partition

Page 18: Analytics with Spark and Cassandra

[Same transaction table as Page 12]

Partition

Page 19: Analytics with Spark and Cassandra

[Same transaction table as Page 12]

Partition

Page 20: Analytics with Spark and Cassandra

[Hash-ring diagram: token ranges at 0000, 2000, 4000, 6000, 8000, A000, C000, E000; the row (5444…8389, 14:03, $98.03, US, 158) is fed to a hash function f(x), which yields token 5D93]

Partitioning

Page 21: Analytics with Spark and Cassandra

[Same hash-ring diagram as Page 20: the row’s token 5D93 places it on the ring]

Partitioning

Page 22: Analytics with Spark and Cassandra

[Same hash-ring diagram as Page 20: partition key 5444…8389 hashes via f(x) to token 5D93, which falls in one node’s range]

Partitioning

Page 23: Analytics with Spark and Cassandra

[Same transaction table as Page 12]

Partition

Partitioning

Page 24: Analytics with Spark and Cassandra

• Clumps of related data
• Always on one server
• Accessed by consistent hashing

Partitions
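The bullets above can be sketched in plain Scala. This is a toy consistent-hashing ring, not Cassandra's actual partitioner: the node names, the four token ranges, and the 16-bit MD5-derived tokens are all invented for illustration (Cassandra's Murmur3Partitioner uses 64-bit tokens).

```scala
import java.security.MessageDigest

// Toy consistent-hashing ring: each node owns a token range, and a row's
// partition key hashes to a token that falls into exactly one range.
object ConsistentHashSketch {
  // Hypothetical 4-node ring; each node owns tokens below its upper bound.
  val ring: Seq[(Int, String)] =
    Seq(0x4000 -> "node-A", 0x8000 -> "node-B", 0xC000 -> "node-C", 0x10000 -> "node-D")

  // Derive a 16-bit token from the partition key (illustration only).
  def token(partitionKey: String): Int = {
    val md5 = MessageDigest.getInstance("MD5").digest(partitionKey.getBytes("UTF-8"))
    ((md5(0) & 0xFF) << 8) | (md5(1) & 0xFF)
  }

  // The first node whose upper bound exceeds the token owns the row.
  def nodeFor(partitionKey: String): String =
    ring.find { case (upper, _) => token(partitionKey) < upper }.get._2

  def main(args: Array[String]): Unit = {
    val key = "5444567606868389"
    // Every row with the same card_number hashes to the same token,
    // which is why a whole partition always lives on one server.
    assert(nodeFor(key) == nodeFor(key))
    println(s"token=${token(key).toHexString} -> ${nodeFor(key)}")
  }
}
```

Adding or removing a node only reassigns the adjacent token range, which is what makes consistent hashing attractive for an always-on cluster.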

Page 25: Analytics with Spark and Cassandra

• Design for low-latency, high-throughput reads and writes
• Inflexible ad-hoc queries
• Very little aggregation
• Very little secondary indexing

Consequences

Page 26: Analytics with Spark and Cassandra

Spark

Page 27: Analytics with Spark and Cassandra

• You have too much data
• You don’t know the questions
• You don’t like MapReduce
• You’re tired of Hadoop lol

When do I use it?

Page 28: Analytics with Spark and Cassandra

• Distributed, functional
• “In-memory”
• “Real-time”
• Next-generation MapReduce

What are they saying?

Page 29: Analytics with Spark and Cassandra

Paradigm:
• Store collections in Resilient Distributed Datasets, or “RDDs”
• Transform RDDs into other RDDs

Page 30: Analytics with Spark and Cassandra

Paradigm:
• Store collections in Resilient Distributed Datasets, or “RDDs”
• Transform RDDs into other RDDs
• Aggregate RDDs with “actions”

Page 31: Analytics with Spark and Cassandra

Paradigm:
• Store collections in Resilient Distributed Datasets, or “RDDs”
• Transform RDDs into other RDDs
• Aggregate RDDs with “actions”
• Keep track of a graph of transformations and actions

Page 32: Analytics with Spark and Cassandra

Paradigm:
• Store collections in Resilient Distributed Datasets, or “RDDs”
• Transform RDDs into other RDDs
• Aggregate RDDs with “actions”
• Keep track of a graph of transformations and actions
• Distribute all of this storage and computation

Page 33: Analytics with Spark and Cassandra

Paradigm:
• Store collections in Resilient Distributed Datasets, or “RDDs”
• Transform RDDs into other RDDs
• Aggregate RDDs with “actions”
• Keep track of a graph of transformations and actions
• Distribute all of this storage and computation
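The paradigm above can be mimicked without Spark, as a sketch using Scala's `LazyList` (2.13+): transformations only record work in a pipeline (the "graph"), and nothing runs until an action forces evaluation.

```scala
// Plain-Scala analogy for the RDD paradigm: map/filter build a lazy
// pipeline; the sum at the end is the "action" that triggers evaluation.
object LazyPipeline {
  def main(args: Array[String]): Unit = {
    var evaluated = 0

    val source  = LazyList.from(1).take(6)                   // like an RDD from a source
    val doubled = source.map { x => evaluated += 1; x * 2 }  // transformation: deferred
    val bigOnes = doubled.filter(_ > 4)                      // transformation: still deferred

    assert(evaluated == 0)   // nothing has executed yet

    val total = bigOnes.sum  // "action": forces the whole pipeline
    assert(total == 36)      // 6 + 8 + 10 + 12
    assert(evaluated == 6)   // the map ran exactly once per element
  }
}
```

The real thing adds the crucial extra step: the recorded graph is shipped out and executed across many machines, not one.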

Page 34: Analytics with Spark and Cassandra

Architecture

[Diagram: the Spark Client runs the Driver, which holds a SparkContext]

Page 35: Analytics with Spark and Cassandra

Architecture

[Diagram: the SparkContext in the Spark Client creates a Job]

Page 36: Analytics with Spark and Cassandra

Architecture

[Diagram: the SparkContext submits the Job to the Cluster Manager]

Page 37: Analytics with Spark and Cassandra

Architecture

[Diagram: the Cluster Manager splits the Job into Tasks]

Page 38: Analytics with Spark and Cassandra

Architecture

[Diagram: the Cluster Manager distributes the Job’s Tasks across the cluster]

Page 39: Analytics with Spark and Cassandra

[Diagram: a Worker Node runs an Executor holding Tasks and a Cache]

Page 40: Analytics with Spark and Cassandra

[Diagram: a Worker Node runs multiple Executors, each holding Tasks and a Cache of RDD partitions]

Page 41: Analytics with Spark and Cassandra

What’s an RDD?

Page 42: Analytics with Spark and Cassandra

What’s an RDD?

This is

Page 43: Analytics with Spark and Cassandra

Bigger than a computer

What’s an RDD?

Page 44: Analytics with Spark and Cassandra

Read from an input source?

What’s an RDD?

Page 45: Analytics with Spark and Cassandra

The output of a pure function?

What’s an RDD?

Page 46: Analytics with Spark and Cassandra

Immutable

What’s an RDD?

Page 47: Analytics with Spark and Cassandra

Typed

What’s an RDD?

Page 48: Analytics with Spark and Cassandra

Ordered

What’s an RDD?

Page 49: Analytics with Spark and Cassandra

Functions comprise a graph

What’s an RDD?

Page 50: Analytics with Spark and Cassandra

Lazily Evaluated

What’s an RDD?

Page 51: Analytics with Spark and Cassandra

Collection of Things

What’s an RDD?

Page 52: Analytics with Spark and Cassandra

Distributed, how?

Page 53: Analytics with Spark and Cassandra

Distributed, how?

5444 5676 0686 8389: List(xn-1, xn-2, xn-3)

4532 4569 7030 1191: List(xn-4, xn-5, xn-6)

5444 5607 6517 6027: List(xn-7, xn-8, xn-9)

4532 4577 0122 2189: List(xn-10, xn-11, xn-12)

Hash these
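That grouping can be sketched in plain Scala with `groupBy`; the card numbers come from the slide, and the xn-k transaction IDs stand in for real rows.

```scala
// Rows keyed by partition key: all transactions for one card_number
// end up in the same bucket, mirroring how a partition groups rows.
object GroupByCard {
  def main(args: Array[String]): Unit = {
    val xctns = Seq(
      ("5444 5676 0686 8389", "xn-1"), ("5444 5676 0686 8389", "xn-2"),
      ("4532 4569 7030 1191", "xn-4"),
      ("5444 5607 6517 6027", "xn-7")
    )
    // One bucket per distinct card_number, preserving row order.
    val byCard: Map[String, Seq[String]] =
      xctns.groupBy(_._1).map { case (card, rows) => card -> rows.map(_._2) }

    assert(byCard("5444 5676 0686 8389") == Seq("xn-1", "xn-2"))
    assert(byCard.size == 3)
  }
}
```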

Page 54: Analytics with Spark and Cassandra

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "10.0.3.74")
val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
val transactions = sc // cassandra stuff
  .partitionBy(new HashPartitioner(25))
  .persist()

Page 55: Analytics with Spark and Cassandra

Cassandra Connector

Page 56: Analytics with Spark and Cassandra

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import com.datastax.spark.connector._

object ScalaProgram {
  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
    val rdd = sc.textFile("/actual/path/raven.txt")
  }
}

From a Text File
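For contrast, outside Spark entirely: reading lines from a local file in plain Scala. The file contents here are invented, and unlike `sc.textFile` this reads eagerly on one machine rather than lazily across a cluster.

```scala
import java.nio.file.Files
import scala.io.Source

// Read a local text file into a collection of lines: a single-machine,
// eager cousin of sc.textFile. A temp file stands in for raven.txt.
object TextFileSketch {
  def main(args: Array[String]): Unit = {
    val path = Files.createTempFile("raven", ".txt")
    Files.write(path, "Once upon a midnight dreary,\nwhile I pondered, weak and weary".getBytes("UTF-8"))

    val src   = Source.fromFile(path.toFile)
    val lines = try src.getLines().toList finally src.close()

    assert(lines.length == 2)
    assert(lines.head.startsWith("Once upon"))
  }
}
```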

Page 57: Analytics with Spark and Cassandra

CREATE TABLE credit_card_transactions (
  card_number      VARCHAR,
  transaction_time TIMESTAMP,
  amount           DECIMAL,
  merchant_id      VARCHAR,
  country          VARCHAR,
  PRIMARY KEY (card_number, transaction_time)
) WITH CLUSTERING ORDER BY (transaction_time DESC);

Data Model

Page 58: Analytics with Spark and Cassandra

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
val xctns = sc.cassandraTable("payments", "credit_card_transactions")

From a Table

Page 59: Analytics with Spark and Cassandra

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
val xctns = sc.cassandraTable("payments", "credit_card_transactions")
  .select("card_number", "country")

Projection

Page 60: Analytics with Spark and Cassandra

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
val countries = xctns.map(x => (x.getString("card_number"), x.getString("country")))

Making k / v pairs

Page 61: Analytics with Spark and Cassandra

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
val cCount = xctns.reduce(
  (card, country) => ??? // cool analysis redacted!
)

Count up Countries
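The slide redacts the actual analysis, so here is one hypothetical shape it could take, in plain Scala: counting transactions per (card, country) pair, the local analogue of mapping to (key, 1) pairs and reducing by key. `groupMapReduce` (Scala 2.13+) plays the role of Spark's `reduceByKey`.

```scala
// Count occurrences of each (card_number, country) pair: the kind of
// aggregate a fraud check might start from.
object CountCountries {
  def main(args: Array[String]): Unit = {
    val countries = Seq( // (card_number, country) pairs from the map step
      ("5444…8389", "US"), ("5444…8389", "US"),
      ("5444…6027", "US"), ("5444…6027", "ES")
    )
    // Group identical pairs, map each to 1, sum the 1s per group.
    val counts: Map[(String, String), Int] =
      countries.groupMapReduce(identity)(_ => 1)(_ + _)

    assert(counts(("5444…8389", "US")) == 2)
    assert(counts(("5444…6027", "ES")) == 1)
  }
}
```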

Page 62: Analytics with Spark and Cassandra

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
val countries = xctns.map(x => (x._1, x._2))

Make k / v pairs

Page 63: Analytics with Spark and Cassandra

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "music", conf)
cCount.saveToCassandra("payments", "frequent_countries")

Persisting to a Table

Page 64: Analytics with Spark and Cassandra
Page 65: Analytics with Spark and Cassandra

Thank you!

@tlberglund