Analytics with Cassandra & Spark


  • Big Data Analytics with Cassandra & Spark_

    Matthias Niehoff

  • Cassandra

    2

  • Distributed database: highly available, linearly scalable, multi-datacenter support, no single point of failure. CQL query language: similar to SQL, but without joins and aggregates.

    Eventual consistency, tunable consistency

    Cassandra_

    3

  • Distributed Data Storage_

    4

    [Diagram: ring of four nodes, each owning a token range; Node 1: 1-25, Node 2: 26-50, Node 3: 51-75, Node 4: 76-0 (wrapping around the ring)]

  • CQL - Querying Language With Limitations_

    5

    SELECT * FROM performer WHERE name = 'ACDC'  > ok

    SELECT * FROM performer WHERE name = 'ACDC' AND country = 'Australia'  > not ok

    SELECT country, COUNT(*) AS quantity FROM artists GROUP BY country ORDER BY quantity DESC  > not supported

    Table performer: name (PK), genre, country

  • Spark

    6

  • Open Source & Apache project since 2010 Data processing Framework Batch processing Stream processing

    What Is Apache Spark_

    7

  • Fast: up to 100 times faster than Hadoop, a lot of in-memory processing, linearly scalable by adding more nodes

    Easy: Scala, Java and Python APIs, clean code (e.g. with lambdas in Java 8), expanded API: map, reduce, filter, groupBy, sort, union, join,

    reduceByKey, groupByKey, sample, take, first, count

    Fault-tolerant: RDDs are easily reproducible

    Why Use Spark_

    8

  • RDDs (Resilient Distributed Datasets): read-only descriptions of a collection of objects, distributed among the cluster (in memory or on disk), defined through transformations, which allows an automatic rebuild on failure

    Operations: transformations (map, filter, reduce, ...) > new RDD; actions (count, collect, save)

    Only actions start processing!

    Easily Reproducible?_

    9

  • RDD Example_

    10

    scala> val textFile = sc.textFile("README.md")
    textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

    scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
    linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09

    scala> linesWithSpark.count()
    res0: Long = 126

  • Reproduce RDDs Using A DAG_

    11

    [Diagram: a data source feeds rdd1; transformations such as map(..), filter(..), sample(..), union(..) and cache() derive rdd2 through rdd6; the count() actions produce the values val1, val2 and val3]
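
    A minimal sketch of how such a lineage looks in code (hypothetical file and RDD names; only the final count() triggers execution, the earlier calls merely extend the DAG):

    val source = sc.textFile("data.txt")           // data source
    val rdd1 = source.map(_.toLowerCase)           // transformation: describes a new RDD, nothing runs yet
    val rdd2 = rdd1.filter(_.contains("spark"))    // still only extends the DAG
    val cached = rdd2.cache()                      // mark for reuse once materialized
    val val1 = cached.count()                      // action: now the lineage is actually executed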

  • Run Spark In A Cluster_

    12

  • ([atomic, collection, object], [atomic, collection, object])

    val fluege = List(("Thomas","Berlin"), ("Mark","Paris"), ("Thomas","Madrid"))

    val pairRDD = sc.parallelize(fluege)

    pairRDD.filter(_._1 == "Thomas").collect.foreach(t => println(t._1 + " flew to " + t._2))

    Pair RDDs_

    13

    first element: key (not unique), second element: value

  • Parallelization! Keys are used for partitioning; pairs with different keys are distributed across the cluster.

    Efficient processing of: aggregate by key, group by key, sort by key, joins and union based on keys (a short sketch follows below)

    Why Use Pair RDDs_

    14
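
    A minimal sketch of key-based processing on a pair RDD (hypothetical data; reduceByKey combines values per key within each partition before shuffling the partial results):

    val flights = sc.parallelize(List(("Thomas","Berlin"), ("Mark","Paris"), ("Thomas","Madrid")))
    val perPerson = flights.mapValues(_ => 1).reduceByKey(_ + _)   // aggregate by key
    val grouped = flights.groupByKey()                             // group by key
    val sorted = perPerson.sortByKey()                             // sort by key
    perPerson.collect.foreach(println)                             // e.g. (Thomas,2), (Mark,1)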

  • Spark & Cassandra

    15

  • Use Spark And Cassandra In A Cluster_

    16

    [Diagram: a client submits a job to the Spark driver; the cluster runs a Spark master and four Spark worker nodes, each co-located with a Cassandra (C*) node]

  • Two Datacenter - Two Purposes_

    17

    [Diagram: two datacenters of the same Cassandra cluster; DC1 - Online handles the operational workload, DC2 - Analytics co-locates the Spark master and Spark worker nodes with its Cassandra (C*) nodes]

  • Spark Cassandra Connector by DataStax https://github.com/datastax/spark-cassandra-connector

    Exposes Cassandra tables as Spark RDDs (read & write), maps C* tables and rows onto Java/Scala objects, supports server-side filtering (where). Compatible with Spark from 0.9 and Cassandra from 2.0.

    Clone & compile with SBT, or download from Maven Central (see the dependency sketch below)

    Connecting Spark With Cassandra_

    18
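
    One way to pull the connector from Maven Central is an sbt library dependency; a sketch (the version is an assumption here, chosen to match the 1.3.0 assembly jar used in the shell example below, and must match your Spark version):

    // build.sbt
    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.3.0"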

  • Start the shell: bin/spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0.jar --conf spark.cassandra.connection.host=localhost --driver-memory 2g --executor-memory 3g

    Import the Cassandra classes: scala> import com.datastax.spark.connector._

    Use The Connector In The Shell_

    19

  • Read the complete table: val movies = sc.cassandraTable("movie","movies") // returns CassandraRDD[CassandraRow]

    Read selected columns: val movies = sc.cassandraTable("movie","movies").select("title","year")

    Filter rows: val movies = sc.cassandraTable("movie","movies").where("title = 'Die Hard'")

    Access columns in the result set: movies.collect.foreach(r => println(r.get[String]("title")))

    Read A Cassandra Table_

    20

  • Read As Tuple

    val movies = sc.cassandraTable[(String, Int)]("movie","movies").select("title","year")

    val movies = sc.cassandraTable("movie","movies").select("title","year").as((_: String, _: Int))

    // both result in a CassandraRDD[(String, Int)]

    Read A Cassandra Table_

    21

  • Read As Case Class

    case class Movie(title: String, year: Int)

    sc.cassandraTable[Movie]("movie","movies").select("title","year")

    sc.cassandraTable("movie","movies").select("title","year").as(Movie)

    Read A Cassandra Table_

    22

  • Every RDD can be saved. Using tuples:

    val tuples = sc.parallelize(Seq(("Hobbit",2012),("96 Hours",2008)))
    tuples.saveToCassandra("movie","movies",SomeColumns("title","year"))

    Using case classes:

    case class Movie(title: String, year: Int)
    val objects = sc.parallelize(Seq(Movie("Hobbit",2012),Movie("96 Hours",2008)))
    objects.saveToCassandra("movie","movies")

    Write Table_

    23

  • // Load and format as a pair RDD
    val pairRDD = sc.cassandraTable("movie","director").map(r => (r.getString("country"), r))

    // directors per country, sorted
    pairRDD.mapValues(v => 1).reduceByKey(_ + _).sortBy(-_._2).collect.foreach(println)

    // or, unsorted
    pairRDD.countByKey().foreach(println)

    // all countries
    pairRDD.keys

    Pair RDDs With Cassandra_

    24

    Table director: name text (K), country text

  • Joins can be expensive, as they may require shuffling.

    val directors = sc.cassandraTable(..).map(r => (r.getString("name"), r))

    val movies = sc.cassandraTable(..).map(r => (r.getString("director"), r))

    movies.join(directors) // RDD[(String, (CassandraRow, CassandraRow))]

    Pair RDDs With Cassandra - Join

    25

    Table director: name text (K), country text

    Table movie: title text (K), director text

  • Data locality is used automatically on read, but not automatically on write. Spark operations without shuffling -> writes stay local. Spark operations with shuffling -> writes fan out to Cassandra; call repartitionByCassandraReplica(keyspace, table) before the write.

    Joins with data locality:

    Using Data Locality With Cassandra_

    26

    sc.cassandraTable[CassandraRow](KEYSPACE, A)
      .repartitionByCassandraReplica(KEYSPACE, B)
      .joinWithCassandraTable[CassandraRow](KEYSPACE, B)
      .on(SomeColumns("id"))

  • cassandraCount(): pushes the count down to a Cassandra query instead of loading the table into memory and counting there (usage sketch below)

    spanBy(), spanByKey(): group data by Cassandra partition key, need no shuffling, should be preferred over groupBy/groupByKey

    CREATE TABLE events (year int, month int, ts timestamp, data varchar, PRIMARY KEY (year,month,ts));

    sc.cassandraTable("test","events").spanBy(row=>(row.getInt("year"),row.getInt("month")))sc.cassandraTable("test","events").keyBy(row=>(row.getInt("year"),row.getInt("month"))).spanByKey

    Further Transformations & Actions_

    27
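
    A minimal usage sketch for cassandraCount(), reusing the events table from above (the count is executed by Cassandra rather than by collecting rows into Spark):

    val n = sc.cassandraTable("test","events").cassandraCount()   // Long, computed server-side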

  • Spark SQL

    28

  • SQL queries with Spark (SQL & HiveQL), on structured data, on DataFrames. Every result of Spark SQL is a DataFrame, and all operations of the generic RDDs are available.

    Supports (even on non-primary-key columns): joins, union, group by, having, order by

    Spark SQL_

    29

  • val sqlContext = new SQLContext(sc)
    val persons = sqlContext.jsonFile(path)

    // show the schema
    persons.printSchema()

    persons.registerTempTable("persons")

    val adults = sqlContext.sql("SELECT name FROM persons WHERE age > 18")
    adults.collect.foreach(println)

    Spark SQL - JSON Example_

    30

    {"name":"Michael"}{"name":"Jan","age":30}{"name":"Tim","age":17}

  • val csc = new CassandraSQLContext(sc)

    csc.setKeyspace("musicdb")

    val result = csc.sql("SELECT country, COUNT(*) AS anzahl " +
      "FROM artists GROUP BY country " +
      "ORDER BY anzahl DESC")

    result.collect.foreach(println)

    Spark SQL - Cassandra Example_

    31

  • Spark Streaming

    32

  • Real-time processing using micro-batches. Supported sources: TCP, S3, Kafka, Twitter, ... Data is represented as a Discretized Stream (DStream) with the same programming model as for batches: all operations of the generic RDDs, SQL & MLlib, plus stateful operations & sliding windows (see the windowing sketch after the example below).

    Stream Processing With Spark Streaming_

    33

  • import org.apache.spark.streaming._

    val ssc = new StreamingContext(sc, Seconds(1))

    val stream = ssc.socketTextStream("127.0.0.1", 9999)

    stream.map(x => 1).reduce(_ + _).print()

    ssc.start()

    // await manual termination or error
    ssc.awaitTermination()

    // manual termination
    ssc.stop()

    Spark Streaming - Example_

    34
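
    A minimal sketch of the sliding-window operations mentioned above, building on the stream from the example (the window sizes are assumptions): count the events of the last 10 seconds, re-evaluated every 2 seconds.

    val counted = stream.map(x => 1).reduceByWindow(_ + _, Seconds(10), Seconds(2))
    counted.print()   // one count per 2-second slide, covering a 10-second window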

  • Spark MLLib

    35

  • Fully integrated in Spark: scalable, with Scala, Java & Python APIs, usable together with Spark Streaming & Spark SQL.

    Packages various algorithms for machine learning, including clustering, classification, prediction and collaborative filtering.

    Still under development (performance, algorithms)

    Spark MLLib_

    36

  • MLLib Example - Collaborative Filtering_

    37

  • // load and parse the data (userid, itemid, rating)
    val data = sc.textFile("data/mllib/als/test.data")
    val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>