Big data analytics with Spark & Cassandra


Page 1: Big data analytics with Spark & Cassandra

Big Data Analytics with Spark & Cassandra_

JUG Stuttgart 01/2016

Matthias Niehoff

Page 2: Big data analytics with Spark & Cassandra

• Cassandra

• Spark

• Spark & Cassandra

• Spark Applications

• Spark Streaming

• Spark SQL

• Spark MLLib

Agenda_

2

Page 3: Big data analytics with Spark & Cassandra

Cassandra

3

Page 4: Big data analytics with Spark & Cassandra

•Distributed database

•Highly Available

•Linear Scalable

•Multi Datacenter Support

•No Single Point Of Failure

•CQL Query Language • Similar to SQL • No joins or aggregates

• Eventual consistency ("tunable consistency")

Cassandra_

4

Page 5: Big data analytics with Spark & Cassandra

Distributed Data Storage_

5

[Diagram: ring of four nodes (Node 1 to Node 4), each responsible for one token range: 1-25, 26-50, 51-75 and 76-0.]

Page 6: Big data analytics with Spark & Cassandra

CQL - A Query Language With Limitations_

6

SELECT * FROM performer WHERE name = 'ACDC' —> ok

SELECT * FROM performer WHERE name = 'ACDC' AND country = 'Australia' —> not ok

SELECT country, COUNT(*) AS quantity FROM artists GROUP BY country ORDER BY quantity DESC —> not supported

performer table: name (PK), genre, country

Page 7: Big data analytics with Spark & Cassandra

Spark

7

Page 8: Big data analytics with Spark & Cassandra

•Open Source & Apache project since 2010

•Data processing Framework • Batch processing • Stream processing

What Is Apache Spark_

8

Page 9: Big data analytics with Spark & Cassandra

•Fast • up to 100 times faster than Hadoop • a lot of in-memory processing • linearly scalable by adding more nodes

• Easy • Scala, Java and Python APIs • clean code (e.g. with lambdas in Java 8) • extensive API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count

• Fault-Tolerant • easily reproducible

Why Use Spark_

9

Page 10: Big data analytics with Spark & Cassandra

•RDDs – Resilient Distributed Datasets • Read-only description of a collection of objects • Distributed across the cluster (in memory or on disk) • Defined through transformations • Allows automatic rebuild on failure

•Operations • Transformations (map,filter,reduce...) —> new RDD • Actions (count, collect, save)

•Only actions start processing! (see the sketch below)

Easily Reproducible?_

10
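Lazy evaluation can be seen directly in the shell. A minimal sketch (not from the original slides, the data is made up): map and filter only describe new RDDs; nothing is computed until the count action runs.

// Transformations only build a lineage of RDD descriptions
val numbers = sc.parallelize(1 to 1000)   // RDD[Int], nothing computed yet
val squares = numbers.map(n => n * n)     // transformation: new RDD, still lazy
val large   = squares.filter(_ > 100)     // transformation: new RDD, still lazy

// Only the action triggers the actual computation
println(large.count())                    // runs map + filter, returns 990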

Page 11: Big data analytics with Spark & Cassandra

•Partitions • Describes the partitions (e.g. one per Cassandra partition)

•Dependencies • Dependencies on parent RDDs

•Compute • The function to compute the RDD's partitions

•(Optional) Partitioner • How is the data partitioned? (Hash, Range, ..)

•(Optional) Preferred Locations • Where to get the data (e.g. list of Cassandra node IPs; see the sketch below)

Properties Of An RDD_

11
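These properties can be inspected on any RDD through the public API. A small sketch (not from the original slides), assuming a SparkContext sc; the sample data is made up:

val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)), numSlices = 4).reduceByKey(_ + _)

println(rdd.partitions.length)   // number of partitions
println(rdd.dependencies)        // dependencies on the parent RDD(s)
println(rdd.partitioner)         // Some(HashPartitioner) after reduceByKey
println(rdd.preferredLocations(rdd.partitions(0)))  // preferred locations (empty for a local collection)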

Page 12: Big data analytics with Spark & Cassandra

RDD Example_

12

scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09

scala> linesWithSpark.count()
res0: Long = 126

Page 13: Big data analytics with Spark & Cassandra

Reproduce RDDs Using A Tree_

13

[Diagram: RDD lineage tree. A data source is transformed via map(..), filter(..), union(..) and sample(..) into rdd1 to rdd6; count() actions produce the values val1, val2 and val3; cache() marks an intermediate RDD for reuse.]

Page 14: Big data analytics with Spark & Cassandra

•Transformations • map, flatMap • sample, filter, distinct • union, intersection, cartesian

•Actions • reduce • count • collect, first, take • saveAsTextFile • foreach

Spark Transformations & Actions_

14

Page 15: Big data analytics with Spark & Cassandra

Run Spark In A Cluster_

15

Page 16: Big data analytics with Spark & Cassandra

•Memory • A lot of data in memory • More memory —> Less disk IO —> Faster processing • Minimum 8 GB / Node

•Network • Communication between Driver, Cluster Manager & Worker • Important for reduce operations • 10 Gigabit LAN or better

•CPU • Less communication between threads • Good to parallelize • Minimum 8 – 16 Cores / Node

What About Hardware?_

16

Page 17: Big data analytics with Spark & Cassandra

•Master Web UI (8080)

How To Monitor? (1/3)_

17

Page 18: Big data analytics with Spark & Cassandra

•Worker Web UI (8081)

How To Monitor? (2/3)_

18

Page 19: Big data analytics with Spark & Cassandra

•Application Web UI (4040)

How To Monitor? (3/3)_

19

Page 20: Big data analytics with Spark & Cassandra

([atomic, collection, object], [atomic, collection, object])

val fluege = List(("Thomas", "Berlin"), ("Mark", "Paris"), ("Thomas", "Madrid"))

val pairRDD = sc.parallelize(fluege)

pairRDD.filter(_._1 == "Thomas").collect.foreach(t => println(t._1 + " flog nach " + t._2))

Pair RDDs_

20

key (not unique), value

Page 21: Big data analytics with Spark & Cassandra

• Parallelization! • keys are used for partitioning

• pairs with different keys are distributed across the cluster

• Efficient processing of • aggregate by key

• group by key

• sort by key

• joins and unions based on keys (see the sketch below)

Why Use Pair RDDs_

21
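A small sketch (not from the original slides) of these key-based operations on made-up (country, artist) pairs; names and values are only illustrative:

val artists = sc.parallelize(Seq(
  ("Australia", "ACDC"), ("Australia", "Airbourne"), ("France", "Daft Punk")))

// aggregate by key: number of artists per country
artists.mapValues(_ => 1).reduceByKey(_ + _)   // ("Australia", 2), ("France", 1)

// group by key: all artists per country
artists.groupByKey()

// sort by key
artists.sortByKey()

// join with another pair RDD on the key
val capitals = sc.parallelize(Seq(("Australia", "Canberra"), ("France", "Paris")))
artists.join(capitals)                         // (country, (artist, capital))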

Page 22: Big data analytics with Spark & Cassandra

RDD Dependencies_

22

"Narrow" (pipeline-able)

map, filter, union

join on co-partitioned data

Page 23: Big data analytics with Spark & Cassandra

RDD Dependencies_

23

"Wide" (shuffle)

groupBy on non-partitioned data, join on non-co-partitioned data (see the sketch below)
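Whether a dependency is narrow or wide shows up in the lineage. A minimal sketch (not from the original slides, the data is made up); toDebugString reveals that map/filter stay in one stage while reduceByKey introduces a ShuffledRDD:

val words = sc.parallelize(Seq("spark", "cassandra", "spark"))

// narrow: each output partition depends on exactly one parent partition
val pairs = words.map(w => (w, 1)).filter(_._1.nonEmpty)

// wide: every parent partition may contribute to each output partition -> shuffle
val counts = pairs.reduceByKey(_ + _)

println(counts.toDebugString)   // lineage: ShuffledRDD <- MapPartitionsRDD <- ParallelCollectionRDD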

Page 24: Big data analytics with Spark & Cassandra

Spark Demo

24

Page 25: Big data analytics with Spark & Cassandra

Spark & Cassandra

25

Page 26: Big data analytics with Spark & Cassandra

Use Spark And Cassandra In A Cluster_

26

[Diagram: the Spark client and Spark driver connect to the Spark master; a Spark worker node (Spark WN) runs next to each Cassandra (C*) node in the cluster.]

Page 27: Big data analytics with Spark & Cassandra

Two Datacenter - Two Purposes_

27

[Diagram: two datacenters. DC1 (Online) holds Cassandra (C*) nodes serving online traffic; DC2 (Analytics) holds C* replicas, each with a Spark worker node (Spark WN), plus the Spark master.]

Page 28: Big data analytics with Spark & Cassandra

•Spark Cassandra Connector by Datastax • https://github.com/datastax/spark-cassandra-connector

• Cassandra tables as Spark RDD (read & write)

• Mapping of C* tables and rows onto Java/Scala objects

• Server-side filtering ("where")

• Compatible with • Spark ≥ 0.9 • Cassandra ≥ 2.0

•Clone & Compile with SBT or download at Maven Central

Connecting Spark With Cassandra_

28

Page 29: Big data analytics with Spark & Cassandra

• Start the shell:
bin/spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0.jar --conf spark.cassandra.connection.host=localhost

• Import the Cassandra classes:
scala> import com.datastax.spark.connector._

Use The Connector In The Shell_

29

Page 30: Big data analytics with Spark & Cassandra

• Read the complete table:
val movies = sc.cassandraTable("movie", "movies") // returns CassandraRDD[CassandraRow]

• Read selected columns:
val movies = sc.cassandraTable("movie", "movies").select("title", "year")

• Filter rows:
val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'")

• Access columns in the result set:
movies.collect.foreach(r => println(r.get[String]("title")))

Read A Cassandra Table_

30

Page 31: Big data analytics with Spark & Cassandra

Read As Tuple

val movies = sc.cassandraTable[(String, Int)]("movie", "movies").select("title", "year")

val movies = sc.cassandraTable("movie", "movies").select("title", "year").as((_: String, _: Int))

// both result in a CassandraRDD[(String, Int)]

Read A Cassandra Table_

31

Page 32: Big data analytics with Spark & Cassandra

Read As Case Class

case class Movie(title: String, year: Int)

sc.cassandraTable[Movie]("movie", "movies").select("title", "year")

sc.cassandraTable("movie", "movies").select("title", "year").as(Movie)

Read A Cassandra Table_

32

Page 33: Big data analytics with Spark & Cassandra

•Every RDD can be saved

• Using Tuples

val tuples = sc.parallelize(Seq(("Hobbit", 2012), ("96 Hours", 2008)))
tuples.saveToCassandra("movie", "movies", SomeColumns("title", "year"))

• Using Case Classes

case class Movie(title: String, year: Int)
val objects = sc.parallelize(Seq(Movie("Hobbit", 2012), Movie("96 Hours", 2008)))
objects.saveToCassandra("movie", "movies")

Write Table_

33

Page 34: Big data analytics with Spark & Cassandra

// Load and format as a pair RDD
val pairRDD = sc.cassandraTable("movie", "director").map(r => (r.getString("country"), r))

// Directors per country, sorted
pairRDD.mapValues(v => 1).reduceByKey(_ + _).sortBy(-_._2).collect.foreach(println)

// or, unsorted
pairRDD.countByKey().foreach(println)

// All countries
pairRDD.keys

Pair RDDs With Cassandra_

34

director table: name text (K), country text

Page 35: Big data analytics with Spark & Cassandra

• Joins can be expensive as they may require shuffling

val directors = sc.cassandraTable(..).map(r => (r.getString("name"), r))

val movies = sc.cassandraTable(..).map(r => (r.getString("director"), r))

movies.join(directors) // RDD[(String, (CassandraRow, CassandraRow))]

Pair RDDs With Cassandra - Join

35

director table: name text (K), country text

movie table: title text (K), director text

Page 36: Big data analytics with Spark & Cassandra

•Automatically on read

•Not automatically on write • Non-shuffling Spark operations -> writes are local • Shuffling Spark operations:

• Fan-out writes to Cassandra • repartitionByCassandraReplica("keyspace", "table") before write

• Joins with data locality

Using Data Locality With Cassandra_

36

sc.cassandraTable[CassandraRow](KEYSPACE, A)
  .repartitionByCassandraReplica(KEYSPACE, B)
  .joinWithCassandraTable[CassandraRow](KEYSPACE, B)
  .on(SomeColumns("id"))

Page 37: Big data analytics with Spark & Cassandra

•cassandraCount() • Uses a Cassandra query • vs. loading the table into memory and counting it

•spanBy(), spanByKey() • group data by Cassandra partition key • does not need shuffling • should be preferred over groupBy/groupByKey

CREATE TABLE events (year int, month int, ts timestamp, data varchar, PRIMARY KEY (year, month, ts));

sc.cassandraTable("test", "events").spanBy(row => (row.getInt("year"), row.getInt("month")))

sc.cassandraTable("test", "events").keyBy(row => (row.getInt("year"), row.getInt("month"))).spanByKey

Further Transformations & Actions_

37

Page 38: Big data analytics with Spark & Cassandra

Spark & Cassandra Demo

38

Page 39: Big data analytics with Spark & Cassandra

Create an Application

39

Page 40: Big data analytics with Spark & Cassandra

•Normal Scala Application

•SBT as build tool

•source in src/main/scala-2.10

•assembly.sbt in root and project directory

•build.sbt in root directory

• sbt assembly to build

Scala Application_

40

libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.3.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" % "provided"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.3.1" % "provided"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.3.1" % "provided"
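For completeness, a minimal application skeleton that fits this build setup (not from the original slides); the object name, keyspace and table are placeholders, and master/host would normally be passed via spark-submit:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object MovieCountJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MovieCountJob")
      .set("spark.cassandra.connection.host", "127.0.0.1")

    val sc = new SparkContext(conf)

    // count rows of a (hypothetical) movie.movies table on the server side
    val count = sc.cassandraTable("movie", "movies").cassandraCount()
    println(s"movies: $count")

    sc.stop()
  }
}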

Page 41: Big data analytics with Spark & Cassandra

•Normal Java Application

•Java 8!

•MVN as build tool

•source in src/main/java

•in pom.xml • dependencies (spark-core, spark-streaming, spark-mllib,

spark-cassandra-connector) • assembly-plugin or shade-plugin

• mvn clean install to build

Java Application_

41

Page 42: Big data analytics with Spark & Cassandra

•Special classes for Java

SparkConf conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("Java")
    .set("spark.cassandra.connection.host", "127.0.0.1");

JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1L));

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));

rdd.filter(e -> e % 2 == 0).foreach(System.out::println);

Java Specials_

42

Page 43: Big data analytics with Spark & Cassandra

•Special classes for Java

import static com.datastax.spark.connector.japi.CassandraJavaUtil.*;

CassandraTableScanJavaRDD<CassandraRow> rows =
    javaFunctions(sc.sparkContext()).cassandraTable("keyspace", "table");

CassandraTableScanJavaRDD<Entity> entities =
    javaFunctions(sc.sparkContext()).cassandraTable("keyspace", "table", mapRowTo(Entity.class));

javaFunctions(someRDD).writerBuilder("keyspace", "table", mapToRow(Entity.class)).saveToCassandra();

Java Specials - Cassandra_

43

Page 44: Big data analytics with Spark & Cassandra

Spark SQL

44

Page 45: Big data analytics with Spark & Cassandra

• SQL queries with Spark (SQL & HiveQL) • On structured data • On DataFrames • Every result of Spark SQL is a DataFrame • All operations of the generic RDDs available

• Supports (even on non-primary-key columns) • Joins • Union • Group By • Having • Order By

Spark SQL_

45

Page 46: Big data analytics with Spark & Cassandra

val sqlContext = new SQLContext(sc)
val persons = sqlContext.jsonFile(path)

// Show the schema
persons.printSchema()

persons.registerTempTable("persons")

val adults = sqlContext.sql("SELECT name FROM persons WHERE age > 18")
adults.collect.foreach(println)

Spark SQL - JSON Example_

46

{"name":"Michael"}{"name":"Jan","age":30}{"name":"Tim","age":17}

Page 47: Big data analytics with Spark & Cassandra

val csc = new CassandraSQLContext(sc)

csc.setKeyspace("musicdb")

val result = csc.sql("SELECT country, COUNT(*) as anzahl " +
  "FROM artists GROUP BY country " +
  "ORDER BY anzahl DESC")

result.collect.foreach(println)

Spark SQL - Cassandra Example_

47

Page 48: Big data analytics with Spark & Cassandra

Spark SQL Demo

48

Page 49: Big data analytics with Spark & Cassandra

Spark Streaming

49

Page 50: Big data analytics with Spark & Cassandra

• Real Time Processing using micro batches

• Supported sources: TCP, S3, Kafka, Twitter,..

• Data as Discretized Stream (DStream)

• Same programming model as for batches

• All Operations of the GenericRDD & SQL & MLLib

• Stateful Operations & Sliding Windows

Stream Processing With Spark Streaming_

50

Page 51: Big data analytics with Spark & Cassandra

import org.apache.spark.streaming._

val ssc = new StreamingContext(sc, Seconds(1))

val stream = ssc.socketTextStream("127.0.0.1", 9999)

stream.map(x => 1).reduce(_ + _).print()

ssc.start()

// await manual termination or error
ssc.awaitTermination()

// manual termination
ssc.stop()

Spark Streaming - Example_

51

Page 52: Big data analytics with Spark & Cassandra

•Maintain State for each key in a DStream: updateStateByKey

Spark Streaming - Stateful Operations_

52

def updateAlbumCount(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  val newCount = runningCount.getOrElse(0) + newValues.size
  Some(newCount)
}

val countStream = stream.updateStateByKey[Int](updateAlbumCount _)

stream is a DStream of pair RDDs
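Sliding windows (mentioned on the streaming overview slide) work on the same pair DStream. A minimal sketch (not from the original slides), assuming stream carries (album, 1) pairs as above:

// counts per album over the last 60 seconds, recomputed every 10 seconds
val windowedCounts = stream
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))

windowedCounts.print()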

Page 53: Big data analytics with Spark & Cassandra

•One Receiver -> One Node • Start more receivers and union them

val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()

• Received data will be split up into blocks • 1 block => 1 task • blocks = batchSize / blockInterval

• Repartition data to distribute over cluster

Spark Streaming - Parallelism_

53

Page 54: Big data analytics with Spark & Cassandra

Spark Streaming Demo

54

Page 55: Big data analytics with Spark & Cassandra

Spark MLLib

55

Page 56: Big data analytics with Spark & Cassandra

• Fully integrated in Spark • Scalable • Scala, Java & Python APIs • Use with Spark Streaming & Spark SQL

• Packages various algorithms for machine learning

• Includes • Clustering • Classification • Prediction • Collaborative Filtering

• Still under development • performance, algorithms

Spark MLLib_

56

Page 57: Big data analytics with Spark & Cassandra

MLLib Example - Clustering_

57

[Diagram: scatter plot with the axes age and income; a set of data points on the left is turned into meaningful clusters on the right.]

Page 58: Big data analytics with Spark & Cassandra

// Load and parse the data
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into 2 classes using KMeans with 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)

// Evaluate the clustering by computing the Sum of Squared Errors
val SSE = clusters.computeCost(parsedData)
println("Sum of Squared Errors = " + SSE)

MLLib Example - Clustering (using KMeans)_

58

Page 59: Big data analytics with Spark & Cassandra

MLLib Example - Classification_

59

Page 60: Big data analytics with Spark & Cassandra

MLLib Example - Classification_

60

Page 61: Big data analytics with Spark & Cassandra

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run the training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

MLLib Example - Classification (Linear SVM)_

61

Page 62: Big data analytics with Spark & Cassandra

// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)

MLLib Example - Classification (Linear SVM)_

62

Page 63: Big data analytics with Spark & Cassandra

MLLib Example - Collaborative Filtering_

63

Page 64: Big data analytics with Spark & Cassandra

// Load and parse the data (userid, itemid, rating)
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)

MLLib Example - Collaborative Filtering using ALS_

64

Page 65: Big data analytics with Spark & Cassandra

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) => (user, product) }

val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}

val ratesAndPredictions = ratings.map {
  case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)

val MSE = ratesAndPredictions.map {
  case ((user, product), (r1, r2)) => val err = (r1 - r2); err * err
}.mean()

println("Mean Squared Error = " + MSE)

MLLib Example - Collaborative Filtering using ALS_

65

Page 66: Big data analytics with Spark & Cassandra

Use Cases

66

Page 67: Big data analytics with Spark & Cassandra

• In particular for huge amounts of external data

• Support for CSV, TSV, XML, JSON and other formats

Use Cases for Spark and Cassandra_

67

Data Loading

case class User(id: java.util.UUID, name: String)

val users = sc.textFile("users.csv")
  .repartition(2 * sc.defaultParallelism)
  .map(line => line.split(",") match {
    case Array(id, name) => User(java.util.UUID.fromString(id), name)
  })

users.saveToCassandra("keyspace", "users")

Page 68: Big data analytics with Spark & Cassandra

Validate consistency in a Cassandra database

• syntactic • Uniqueness (only relevant for columns not in the PK) • Referential integrity • Integrity of duplicated data

• semantic • Business or application constraints • e.g. at least one genre per movie, a maximum of 10 tags per blog post (a uniqueness check is sketched below)

Use Cases for Spark and Cassandra_

68

Validation & Normalization
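As an illustration of a syntactic check, the sketch below (not from the original slides) looks for duplicate values in a non-primary-key column; the users table and its email column are hypothetical:

import com.datastax.spark.connector._

// find email addresses that occur more than once (uniqueness violation)
val duplicateEmails = sc.cassandraTable("keyspace", "users")
  .map(row => (row.getString("email"), 1))
  .reduceByKey(_ + _)
  .filter { case (_, occurrences) => occurrences > 1 }

duplicateEmails.collect.foreach { case (email, n) => println(s"$email appears $n times") }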

Page 69: Big data analytics with Spark & Cassandra

• Modelling, Mining, Transforming, ....

• Use Cases • Recommendation • Fraud Detection • Link Analysis (Social Networks, Web) • Advertising • Data Stream Analytics ( Spark Streaming) • Machine Learning ( Spark ML)

Use Cases for Spark and Cassandra_

69

Analyses (Joins, Transformations,..)

Page 70: Big data analytics with Spark & Cassandra

• Changes on existing tables • New table required when changing the primary key • Otherwise changes can be performed in place

• Creating new tables • data derived from existing tables • Support new queries

• Use the Cassandra Connector in Spark (see the sketch below)

Use Cases for Spark and Cassandra_

70

Schema Migration
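A sketch of such a migration with the Cassandra connector (not from the original slides): read the existing table, reshape the rows and write them into a new table with a different primary key. Keyspace, table and column names are made up:

import com.datastax.spark.connector._

// old table movies_by_title (title PK) -> new table movies_by_year (year PK, title clustering)
sc.cassandraTable("movie", "movies_by_title")
  .map(row => (row.getInt("year"), row.getString("title"), row.getString("director")))
  .saveToCassandra("movie", "movies_by_year", SomeColumns("year", "title", "director"))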

Page 71: Big data analytics with Spark & Cassandra

Thank you for your attention!

71

Page 72: Big data analytics with Spark & Cassandra

Questions?

Matthias Niehoff, IT-Consultant


codecentric AG Zeppelinstraße 2 76185 Karlsruhe, Germany

mobile: +49 (0) 172.1702676 [email protected]

www.codecentric.de blog.codecentric.de

matthiasniehoff