Download - Spark and Cassandra - GOTO Conference...Apache Cassandra • Masterless architecture with read/write anywhere design. • Continuous availability with no single point of failure. •

Transcript
Page 1

Spark and Cassandra

Artem Aliev, DataStax

Solving classical data analytic task by using modern distributed databases

Page 3

©2014 DataStax Confidential. Do not distribute without consent.

Artem Aliev, Software Developer

Solving classical data analytic task by using modern distributed databases

Spark and Cassandra

Page 4

Agenda: Lambda Architecture

http://lambda-architecture.net

Page 5

Apache Cassandra

• Masterless architecture with read/write anywhere design.
• Continuous availability with no single point of failure.
• Multi-data center and cloud availability zone support.
• Linear scale performance with online capacity expansion.
• CQL – SQL-like language.

[Diagram: clusters of increasing size, labelled 100,000 txns/sec, 200,000 txns/sec and 400,000 txns/sec as nodes are added]

Apache Cassandra™ is a massively scalable NoSQL OLTP database.

Page 6

Cassandra Architecture Overview

• Cassandra was designed with the understanding that system/hardware failures can and do occur
• Peer-to-peer, distributed system
• All nodes the same
• Multi data centre support out of the box
• Configurable replication factor
• Configurable data consistency per request
• Active-Active replication architecture

[Diagram: two rings of five nodes each, DC: USA and DC: EU, with Node 1 marked as 1st copy, Node 2 as 2nd copy and Node 3 as 3rd copy]
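The replica labels in the diagram can be illustrated with a small sketch. It assumes the simplest placement scheme, in which the first copy goes to the node that owns the key's token and further copies go to the next nodes clockwise on the ring. The node names, the single token per node and the use of MD5 as a hash are assumptions made for illustration; this is not Cassandra's actual partitioner or replication strategy code.

```python
import hashlib
from bisect import bisect_left


def token(value):
    # MD5 is only a stand-in hash for this sketch.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes):
    # Sorted (token, node) pairs; one token per node keeps the sketch small.
    return sorted((token(node), node) for node in nodes)


def replicas(ring, partition_key, replication_factor):
    """Return the owner of the key's token plus the next nodes clockwise."""
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token(partition_key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def quorum(replication_factor):
    # A majority of the replicas.
    return replication_factor // 2 + 1


ring = build_ring(["node1", "node2", "node3", "node4", "node5"])
print(replicas(ring, "Mighty Mutts", 3))  # three distinct nodes
print(quorum(3))  # 2
```

With a replication factor of 3, a request that waits for a quorum needs answers from 2 replicas, which is one way to read "configurable data consistency per request".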

Page 7

Cassandra Query Language

CREATE TABLE sporty_league (
  team_name varchar,
  player_name varchar,
  jersey int,
  PRIMARY KEY (team_name, player_name)
);

SELECT * FROM sporty_league WHERE team_name = 'Mighty Mutts' AND player_name = 'Lucky';

INSERT INTO sporty_league (team_name, player_name, jersey) VALUES ('Mighty Mutts', 'Felix', 90);

Page 8

Apache Spark

• Distributed computing framework
• Created at the UC Berkeley AMPLab in 2009
• Open source since 2010; Apache project since 2013
• Solves problems Hadoop is bad at
  - Iterative algorithms
  - Interactive machine learning
• More general purpose than MapReduce
• Streaming!
• In memory!

Page 9

Fast

[Chart: running time (s), 0 to 4000, against number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]

Hadoop: 110 sec / iteration

Spark: first iteration 80 sec, further iterations 1 sec

* Logistic Regression Performance
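As a rough worked example of what these per-iteration figures imply, the sketch below adds them up for the iteration counts on the chart's x-axis. It assumes the quoted figures hold uniformly for every iteration, which the slide does not state; the function names are made up for illustration.

```python
HADOOP_SEC_PER_ITERATION = 110
SPARK_FIRST_ITERATION_SEC = 80
SPARK_LATER_ITERATION_SEC = 1


def hadoop_total(iterations):
    # Every iteration pays the full cost again.
    return HADOOP_SEC_PER_ITERATION * iterations


def spark_total(iterations):
    # The first iteration loads the data; later ones reuse it from memory.
    if iterations <= 0:
        return 0
    return SPARK_FIRST_ITERATION_SEC + SPARK_LATER_ITERATION_SEC * (iterations - 1)


for n in (1, 5, 10, 20, 30):
    print(n, hadoop_total(n), spark_total(n))
```

Under that assumption, 30 iterations come to 3300 seconds for Hadoop and 109 seconds for Spark, which is consistent with the chart's 0 to 4000 second axis.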

Page 10

Simple

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
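To make the three steps of the Spark version concrete, here is a plain-Python sketch of the same pipeline on a two-line input. The helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins written for this example; they run on ordinary lists on a single machine and are not Spark's distributed implementation.

```python
from itertools import chain


def flat_map(func, items):
    # Apply func to every item and flatten the resulting lists into one.
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    # Combine all values that share a key using func.
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


lines = ["to be or not", "to be"]
words = flat_map(lambda line: line.split(" "), lines)
pairs = [(word, 1) for word in words]
counts = reduce_by_key(lambda a, b: a + b, pairs)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```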

Page 11

A Quick Comparison to Hadoop

[Diagram: on the Hadoop side, HDFS with alternating map() and reduce() steps; on the Spark side, Data Source 1 and Data Source 2 feeding map(), join(), cache() and transform() steps]

Page 12

API

map, reduce

Page 13

API

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin

reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip

sample, take, first, partitionBy, mapWith, pipe, save, ...

Page 14

API

* Resilient Distributed Datasets (RDD)
  * Collections of objects spread across a cluster, stored in RAM or on disk
  * Built through parallel transformations
  * Automatically rebuilt on failure
* Operations
  * Transformations (e.g. map, filter, groupBy)
  * Actions (e.g. count, collect, save)
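The split between transformations and actions can be sketched in a few lines of plain Python. The class below is a hypothetical single-machine toy, not Spark's RDD: transformations only record a step in a lineage, and nothing runs until an action is called. Because the lineage is kept, the result can be recomputed from the source at any time, which is the idea behind "automatically rebuilt on failure".

```python
class LazyDataset:
    """Records a chain of transformations and runs it only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source        # callable returning an iterable of input records
        self._steps = tuple(steps)   # lineage: ordered (kind, function) pairs

    # Transformations: return a new dataset, do no work yet.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _compute(self):
        # Replay the lineage from the source.
        data = iter(self._source())
        for kind, func in self._steps:
            data = map(func, data) if kind == "map" else filter(func, data)
        return data

    # Actions: trigger the computation.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []


def double(x):
    calls.append(x)  # lets us observe when the function really runs
    return x * 2


ds = LazyDataset(lambda: range(5)).map(double).filter(lambda x: x > 2)
print(calls)         # [] because no action has run yet
print(ds.collect())  # [4, 6, 8]
print(ds.count())    # 3, recomputed from the lineage
```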

Page 15

Operator Graph: Optimization and Fault Tolerance

[Diagram: RDDs A to F connected by map, groupBy, filter and join operators and grouped into Stage 1, Stage 2 and Stage 3; the legend distinguishes RDDs from cached partitions]

Page 16

Why Spark on Cassandra?

* Data model independent queries
* Cross-table operations (JOIN, UNION, etc.)
* Complex analytics (e.g. machine learning)
* Data transformation, aggregation, etc.
* Stream processing

Page 17

How to Spark on Cassandra?

* DataStax Cassandra Spark driver
  * Open source: https://github.com/datastax/cassandra-driver-spark
* Compatible with
  * Spark 0.9+
  * Cassandra 2.0+
  * DataStax Enterprise 4.5+

Page 18

Cassandra Spark Driver

* Cassandra tables exposed as Spark RDDs
* Read from and write to Cassandra
* Mapping of C* tables and rows to Scala objects
* All Cassandra types supported and converted to Scala types
* Server side data selection
* Spark Streaming support
* Scala and Java support

Page 19

Connecting to Cassandra

// Spark core classes used below
import org.apache.spark.{SparkConf, SparkContext}

// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.driver.spark._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.123.10") // initial contact
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)

Page 20

Accessing Data

CREATE TABLE test.words (word text PRIMARY KEY, count int);

INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);

* Accessing table above as RDD:

// Use table as RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.toArray.foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames // Stream(word, count)
rdd.size        // 2

val firstRow = rdd.first // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
firstRow.getInt("count") // Int = 30

Page 21

Saving Data

val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", Seq("word", "count"))

* RDD above saved to Cassandra:

SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50

(4 rows)

Page 22

Type Mapping

CQL Type          Scala Type
ascii             String
bigint            Long
boolean           Boolean
counter           Long
decimal           BigDecimal, java.math.BigDecimal
double            Double
float             Float
inet              java.net.InetAddress
int               Int
list              Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map               Map, TreeMap, java.util.HashMap
set               Set, TreeSet, java.util.HashSet
text, varchar     String
timestamp         Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid          java.util.UUID
uuid              java.util.UUID
varint            BigInt, java.math.BigInteger
*nullable values  Option

Page 23

Mapping Rows to Objects

CREATE TABLE test.cars (
  id text PRIMARY KEY,
  model text,
  fuel_type text,
  year int
);

case class Vehicle(
  id: String,
  model: String,
  fuelType: String,
  year: Int
)

sc.cassandraTable[Vehicle]("test", "cars").toArray
// Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
//       Vehicle(MT8787, Hyundai x35, Diesel, 2011))

* Mapping rows to Scala case classes
* CQL underscore case column mapped to Scala camel case property
* Custom mapping functions (see docs)
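The underscore-to-camel-case convention mentioned above (for example `fuel_type` to `fuelType`) can be illustrated with a short sketch. The helper names are hypothetical and this shows only the naming rule, not the driver's own column mapper.

```python
def to_camel_case(column_name):
    # fuel_type -> fuelType; names without underscores are returned unchanged.
    head, *tail = column_name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def map_row(row):
    # Rename every column of a row (a dict here) to its camel case property name.
    return {to_camel_case(column): value for column, value in row.items()}


row = {"id": "KF334L", "model": "Ford Mondeo", "fuel_type": "Petrol", "year": 2009}
print(map_row(row))
# {'id': 'KF334L', 'model': 'Ford Mondeo', 'fuelType': 'Petrol', 'year': 2009}
```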

Page 24

Server Side Data Selection

* Reduce the amount of data transferred
* Selecting columns
* Selecting rows (by clustering columns and/or secondary indexes)

sc.cassandraTable("test", "users").select("username").toArray.foreach(println)
// CassandraRow{username: john}
// CassandraRow{username: tom}

sc.cassandraTable("test", "cars").select("model").where("color = ?", "black").toArray.foreach(println)
// CassandraRow{model: Ford Mondeo}
// CassandraRow{model: Hyundai x35}

Page 25

Spark SQL

[Diagram: Spark SQL, Streaming, ML and Graph modules on top of Spark (General execution engine), on top of Cassandra/HDFS; label: Compatible]

Page 26

Spark SQL

* SQL query engine on top of Spark
* Hive compatible (JDBC, UDFs, types, metadata, etc.)
* Support for in-memory processing
* Pushdown of predicates to Cassandra when possible

Page 27

Spark SQL Example

import com.datastax.spark.connector._

// Connect to the Spark cluster
val conf = new SparkConf(true)...
val sc = new SparkContext(conf)

// Create Cassandra SQL context
val cc = new CassandraSQLContext(sc)

// Execute SQL query
val df = cc.sql("SELECT * FROM keyspace.table WHERE ...")

Page 28

Spark Streaming

[Diagram: Spark SQL, Streaming, ML and Graph modules on top of Spark (General execution engine), on top of Cassandra/HDFS]

Page 29

Spark Streaming

* Micro batching
* Each batch represented as RDD
* Fault tolerant
* Exactly-once processing
* Unified stream and batch processing framework

[Diagram labels: Data Stream, DStream, RDD]
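Micro batching can be sketched without Spark: incoming events are grouped into fixed-length time windows, and each window is then processed as one small batch, the role an RDD plays in a DStream. The function below is a hypothetical offline illustration that works on a list of already collected `(timestamp, line)` pairs; a real streaming system emits each batch as its window closes.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into fixed-length batches keyed by batch start time."""
    batches = defaultdict(list)
    for timestamp, value in events:
        start = int(timestamp // batch_seconds) * batch_seconds
        batches[start].append(value)
    return [(start, batches[start]) for start in sorted(batches)]


events = [(0.2, "foo bar"), (0.7, "foo"), (1.1, "bar"), (2.5, "foo")]

# Run the same word count on every one-second batch.
for start, lines in micro_batches(events, 1):
    counts = {}
    for line in lines:
        for word in line.split(" "):
            counts[word] = counts.get(word, 0) + 1
    print(start, counts)
# 0 {'foo': 2, 'bar': 1}
# 1 {'bar': 1}
# 2 {'foo': 1}
```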

Page 30

Streaming Example

import com.datastax.spark.connector.streaming._

// Spark connection options
val conf = new SparkConf(true)...

// streaming with 1 second batch window
val ssc = new StreamingContext(conf, Seconds(1))

// stream input
val lines = ssc.socketTextStream(serverIP, serverPort)

// count words
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// stream output
wordCounts.saveToCassandra("test", "words")

// start processing
ssc.start()
ssc.awaitTermination()

Page 31

Spark MLlib

[Diagram: Spark SQL, Streaming, ML and Graph modules on top of Spark (General execution engine), on top of Cassandra/HDFS]

Page 32

Spark MLlib

* Classification (SVM, LogisticRegression, NaiveBayes, RandomForest…)
* Clustering (K-means, LDA, PIC)
* Linear Regression
* Collaborative Filtering
* Dimensionality reduction (SVD, PCA)
* Frequent pattern mining
* Word2Vec

Page 33

Spark MLlib Example

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.NaiveBayes

// Spark connection options
val conf = new SparkConf(true)...

// read data
val data = sc.cassandraTable[Iris]("test", "iris")

// convert data to LabeledPoint
val parsedData = data.map { i =>
  LabeledPoint(class2id(i.species), Array(i.petal_l, i.petal_w, i.sepal_l, i.sepal_w))
}

// train the model
val model = NaiveBayes.train(parsedData)

// predict
model.predict(Array(5, 1.5, 6.4, 3.2))

http://www.datastax.com/dev/blog/interactive-advanced-analytic-with-dse-and-spark-mllib

Page 34

APIs

• Scala
  - Native
• Python
  - Popular
• Java
  - Ugly
• R
  - No closures

Page 35

Questions?
