Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day

Transcript of Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day

  1. codecentric AG, Matthias Niehoff: Big Data Analytics with Cassandra, Spark & MLLib
  2. AGENDA: Spark Basics, In A Cluster, Cassandra Spark Connector, Use Cases, Spark Streaming, Spark SQL, Spark MLLib, Live Demo
  3. CQL: A QUERYING LANGUAGE WITH LIMITATIONS (18.06.2015)
     Table performer: name text (K), style text, country text, type text
     SELECT * FROM performer WHERE name = 'ACDC'                            -- ok
     SELECT * FROM performer WHERE name = 'ACDC' AND country = 'Australia'  -- not ok
     SELECT country, COUNT(*) AS quantity FROM artists GROUP BY country ORDER BY quantity DESC  -- not possible
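The aggregation the last query asks for (performers per country, sorted by count) is exactly the kind of work Spark adds on top of CQL. As a plain-Scala sketch of those semantics, using made-up sample rows (the data is hypothetical, not from the slides):

```scala
// Hypothetical sample rows from the performer table: (name, country)
val performers = List(
  ("ACDC", "Australia"),
  ("Airbourne", "Australia"),
  ("Rammstein", "Germany")
)

// Equivalent of: SELECT country, COUNT(*) AS quantity FROM artists
//                GROUP BY country ORDER BY quantity DESC
val quantities = performers
  .groupBy { case (_, country) => country }             // group rows by country
  .map { case (country, rows) => (country, rows.size) } // count per group
  .toList
  .sortBy { case (_, quantity) => -quantity }           // descending by count

// quantities == List(("Australia", 2), ("Germany", 1))
```

On a cluster, Spark distributes exactly this group-and-count across nodes, which is what the following slides build up to.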
  4. WHAT IS APACHE SPARK?
     Open source, an Apache project since 2010
     Data processing framework: batch processing and stream processing
  5. WHY USE SPARK?
     Fast: up to 100 times faster than Hadoop, a lot of in-memory processing, scales out across nodes
     Easy: Scala, Java and Python APIs; clean code (e.g. with lambdas in Java 8); rich API: map, reduce, filter, groupBy, sort, union, join, reduceByKey, groupByKey, sample, take, first, count, ...
     Fault-tolerant: easily reproducible
  6. EASILY REPRODUCIBLE?
     RDDs (Resilient Distributed Datasets): read-only collections of objects, distributed across the cluster (in memory or on disk), defined through transformations, automatically rebuilt on failure
     Operations: transformations (map, filter, reduce, ...) yield a new RDD; actions (count, collect, save) return values
     Only actions start processing!
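The "only actions start processing" point can be mimicked locally with a plain Scala `Iterator`, which is lazy in the same way; this is only an analogy for RDD laziness, no Spark involved:

```scala
var evaluated = 0 // counts how often the "transformation" actually runs

// Building the pipeline runs nothing yet -- like an RDD transformation
val pipeline = Iterator(1, 2, 3, 4).map { x => evaluated += 1; x * 2 }

assert(evaluated == 0) // still lazy: no element has been touched

// Consuming the iterator is the "action" that triggers the work
val total = pipeline.sum
assert(evaluated == 4 && total == 20)
```

In Spark the same deferral is what lets the scheduler fuse transformations and rebuild lost partitions from the lineage.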
  7. RDD EXAMPLE
     scala> val textFile = sc.textFile("")
     textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
     scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
     linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
     scala> linesWithSpark.count()
     res0: Long = 126
  8. REPRODUCE RDD USING A TREE
     [Diagram: lineage tree from a data source through map(..), filter(..), union(..), sample(..) and cache() into rdd1-rdd6, with count() actions producing val1-val3]
  9. SPARK ON CASSANDRA
     Spark Cassandra Connector by DataStax
     Cassandra tables as Spark RDDs (read & write)
     Mapping of C* tables and rows onto Java/Scala objects
     Server-side filtering (where)
     Compatible with Spark 0.9 and Cassandra 2.0
     Clone & compile with SBT, or download from Maven Central
  10. USE SPARK CONNECTOR
      Start the Spark shell:
      bin/spark-shell --jars ~/path/to/jar/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar --conf --driver-memory 2g --executor-memory 3g
      Import the Cassandra classes:
      scala> import com.datastax.spark.connector._
  11. READ TABLE
      Read a complete table:
      val movies = sc.cassandraTable("movie", "movies")
      // Return type: CassandraRDD[CassandraRow]
      Read selected columns:
      val movies = sc.cassandraTable("movie", "movies").select("title","year")
      Filter rows:
      val movies = sc.cassandraTable("movie", "movies").where("title = 'Die Hard'")
  12. READ TABLE
      Access columns in the result set (also: getString, getStringOption, getOrElse, getMap, ...):
      movies.collect.foreach(r => println(r.get[String]("title")))
      As a Scala tuple:
      sc.cassandraTable[(String,Int)]("movie","movies").select("title","year")
      or
      sc.cassandraTable("movie","movies").select("title","year").as((_: String, _: Int))
      // Return type: CassandraRDD[(String,Int)]
  13. READ TABLE: CASE CLASS
      case class Movie(name: String, year: Int)
      sc.cassandraTable[Movie]("movie","movies").select("title","year")
      or
      sc.cassandraTable("movie","movies").select("title","year").as(Movie)
      // Return type: CassandraRDD[Movie]
  14. WRITE TABLE
      Every RDD can be saved (not only CassandraRDDs)
      Tuple:
      val tuples = sc.parallelize(Seq(("Hobbit",2012), ("96 Hours",2008)))
      tuples.saveToCassandra("movie","movies", SomeColumns("title","year"))
      Case class:
      case class Movie(title: String, year: Int)
      val objects = sc.parallelize(Seq(Movie("Hobbit",2012), Movie("96 Hours",2008)))
      objects.saveToCassandra("movie","movies")
  15. PAIR RDDS
      ([atomic, collection, object], [atomic, collection, object]) -- the key need not be unique
      val fluege = List(("Thomas", "Berlin"), ("Mark", "Paris"), ("Thomas", "Madrid"))
      val pairRDD = sc.parallelize(fluege)
      pairRDD.filter(_._1 == "Thomas")
        .collect
        .foreach(t => println(t._1 + " flog nach " + t._2)) // German: "flew to"
  16. WHY USE PAIR RDDS?
      Parallelization! Keys are used for partitioning; pairs with different keys are distributed across the cluster
      Efficient processing of: aggregate by key, group by key, sort by key, joins and unions based on keys
  17. SPECIAL OPS FOR PAIR RDDS
      keys, values
      mapValues(func), flatMapValues(func)
      lookup(key), collectAsMap(), countByKey()
      reduceByKey(func), foldByKey(zeroValue)(func)
      groupByKey(), cogroup(otherDataset)
      sortByKey([ascending])
      join(otherDataset), leftOuterJoin(otherDataset), rightOuterJoin(otherDataset)
      union(otherDataset), subtractByKey(otherDataset)
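The semantics of reduceByKey can be sketched with plain Scala collections; this is only the local meaning of the operation (Spark additionally pre-aggregates per partition before shuffling):

```scala
// A minimal local reduceByKey -- a sketch of the semantics only
def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
  pairs
    .groupBy(_._1)                                      // group pairs by key
    .map { case (k, vs) => (k, vs.map(_._2).reduce(f)) } // fold the values

val visits = List(("Thomas", 1), ("Mark", 1), ("Thomas", 1))
val counts = reduceByKey(visits)(_ + _)
// counts == Map("Thomas" -> 2, "Mark" -> 1)
```

Because `f` is applied pairwise, it should be associative, which is also why Spark can run it before and after the shuffle.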
  18. CASSANDRA EXAMPLE
      Table director: name text (K), country text
      val pairRDD = sc.cassandraTable("movie","director")
        .map(r => (r.getString("country"),r))
      // Directors per country, sorted
      pairRDD.mapValues(v => 1).reduceByKey(_+_)
        .sortBy(-_._2).collect.foreach(println)
      // or, unsorted
      pairRDD.countByKey().foreach(println)
      // All countries
      pairRDD.keys
  19. CASSANDRA EXAMPLE
      Tables: director (name text K, country text), movie (title text K, director text); one director has many movies
      pairRDD.groupByKey()
      // RDD[(String, Iterable[CassandraRow])]
      val directors = sc.cassandraTable(..).map(r => (r.getString("name"),r))
      val movies = sc.cassandraTable(..).map(r => (r.getString("director"),r))
      directors.cogroup(movies)
      // RDD[(String, (Iterable[CassandraRow], Iterable[CassandraRow]))]
  20. CASSANDRA EXAMPLE: JOINS
      Tables: director (name text K, country text), movie (title text K, director text); one movie has one director
      Joins can be expensive: partitions for the same key in different tables may live on different nodes, which requires shuffling
      val directors = sc.cassandraTable(..).map(r => (r.getString("name"),r))
      val movies = sc.cassandraTable(..).map(r => (r.getString("director"),r))
      movies.join(directors)
      // RDD[(String, (CassandraRow, CassandraRow))]
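What join produces, one pair (key, (left, right)) per matching key combination, can be checked locally. A sketch with made-up data (names and titles are hypothetical):

```scala
// Hypothetical directors and movies, both keyed by director name
val directors = List(("Nolan", "UK"), ("Scott", "UK"))
val movies    = List(("Nolan", "Memento"), ("Nolan", "Inception"))

// Inner join: one output pair per matching key combination --
// which is why skewed keys make cluster joins expensive
val joined = for {
  (mk, title)   <- movies
  (dk, country) <- directors
  if mk == dk
} yield (mk, (title, country))

// joined == List(("Nolan", ("Memento", "UK")), ("Nolan", ("Inception", "UK")))
```

"Scott" has no movies here, so he disappears from the inner join; that is what leftOuterJoin/rightOuterJoin on the previous slide exist to avoid.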
  21. USE CASES
      Data Loading, Validation & Normalization, Analyses (joins, transformations, ...), Schema Migration, Data Conversion
  22. DATA LOADING
      In particular for huge amounts of external data; support for CSV, TSV, XML, JSON and other formats
      Example:
      case class User(id: java.util.UUID, name: String)
      val users = sc.textFile("users.csv")
        .repartition(2*sc.defaultParallelism)
        .map(line => line.split(",") match {
          case Array(id,name) => User(java.util.UUID.fromString(id), name)})
      users.saveToCassandra("keyspace","users")
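The parsing step in that job is easy to test in isolation, without Spark: the same pattern match on the split line, extracted into a plain function (the two-column CSV layout is the slide's assumption, the sample row below is made up):

```scala
case class User(id: java.util.UUID, name: String)

// Same match as in the Spark job, as a plain function so it can be
// unit-tested before running the job against the real file
def parseUser(line: String): User = line.split(",") match {
  case Array(id, name) => User(java.util.UUID.fromString(id), name)
}

val u = parseUser("123e4567-e89b-12d3-a456-426614174000,Alice")
assert(u.name == "Alice")
```

Note the match is partial: a malformed line throws a MatchError, which in the real job fails the task, so a production loader would add a fallback case.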
  23. VALIDATION
      Validate consistency in a Cassandra database
      Syntactic: uniqueness (only relevant for columns not in the PK), referential integrity, integrity of duplicates
      Semantic: business or application constraints, e.g. at least one genre per movie, at most 10 tags per blog post
  24. ANALYSES
      Modelling, mining, transforming, ...
      Use cases: recommendation, fraud detection, link analysis (social networks, web), advertising, data stream analytics (Spark Streaming), machine learning (Spark ML)
  25. SCHEMA MIGRATION & DATA CONVERSION
      Changes on existing tables: a new table is required when changing the primary key; otherwise changes can be performed in place
      Creating new tables: data derived from existing tables, to support new queries
      Use the CassandraConnector in Spark:
      val cc = CassandraConnector(sc.getConf)
      cc.withSessionDo { session =>
        session.execute("ALTER TABLE movie.movies ADD year timestamp")
      }
  26. STREAM PROCESSING
      ... with Spark Streaming
      Real-time processing
      Supported sources: TCP, HDFS, S3, Kafka, Twitter, ...
      Data as Discretized Streams (DStreams)
      All operations of the generic RDD
      Stateful operations
      Sliding windows
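The sliding-windows bullet can be pictured with plain Scala's `sliding`, assuming a window of 3 batches that slides by 1 batch. Spark's windows are time-based (window length and slide interval as durations), so this only mirrors the shape of the computation:

```scala
// Pretend each number is the count from one 1-second micro-batch
val batchCounts = List(4, 7, 1, 3, 5)

// Window of 3 batches, sliding by 1 batch: aggregate over each window,
// conceptually like DStream.reduceByWindow over 3s every 1s
val windowSums = batchCounts.sliding(3, 1).map(_.sum).toList
// windowSums == List(12, 11, 9)
```

Consecutive windows overlap, which is why Spark also offers an inverse-function variant that subtracts the batch leaving the window instead of re-summing it.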
  27. SPARK STREAMING EXAMPLE
      val ssc = new StreamingContext(sc, Seconds(1))
      val stream = ssc.socketTextStream("", 9999)
      stream.map(_ => 1).reduce(_ + _).print()
      ssc.start()
      // await manual termination or error
      ssc.awaitTermination()
      // manual termination
      ssc.stop()
  28. SPARK SQL
      SQL queries with Spark (SQL & HiveQL)
      On structured data, on SchemaRDDs
      Every Spark SQL result is again a SchemaRDD
      All operations of the generic RDDs are available
      Supported, also on non-PK columns: joins, union, group by, having, order by
  29. SPARK SQL: CASSANDRA EXAMPLE
      Table performer: name text (K), style text, country text, type text
      val csc = new CassandraSQLContext(sc)
      csc.setKeyspace("musicdb")
      val result = csc.sql("SELECT country, COUNT(*) as anzahl " +
        "FROM artists GROUP BY country " +
        "ORDER BY anzahl DESC")
      result.collect.foreach(println)
  30. SPARK SQL: RDD / JSON EXAMPLE
      Input (one JSON object per line):
      {"name":"Michael"}
      {"name":"Jan", "alter":30}
      {"name":"Tim", "alter":17}
      val sqlContext = new SQLContext(sc)
      val personen = sqlContext.jsonFile(path)
      // Show the schema
      personen.printSchema()
      personen.registerTempTable("personen")
      val erwachsene = sqlContext.sql("SELECT name FROM personen WHERE alter > 18")
      erwachsene.collect.foreach(println)