Big Data Analytics with Spark

  • Date posted: 21-Aug-2015
  • Category: Technology

Transcript of Big Data Analytics with Spark

  1. Sorry for the Delay. There were some technical difficulties, so we are giving folks a few more minutes to join. Again, sorry for the delay. 2014 DataStax, All Rights Reserved. Company Confidential.
  2. Webinar Housekeeping: All attendees are placed on mute. Input questions at any time using the online interface.
  3. Big Data Analytics with Cassandra and Spark. Brian Hess, Sr. Product Manager for Analytics, DataStax.
  5. Willie Sutton: Bank robber in the 1930s-1950s. FBI Most Wanted List, 1950. Captured in 1952.
  6. Willie Sutton, when asked "Why do you rob banks?": "Because that's where the money is."
  7. Motivating Use Case: Internet of Things. (Diagram label: Your System.)
  9. Motivating Use Case: Internet of Things. (Diagram labels: Your System, FAULT.)
  10. Cassandra; Spark; Spark + Cassandra.
  11. Apache Cassandra: Distributed NoSQL database (BigTable meets Dynamo). All nodes are equal. Always on. Linear scale-out, and a lot of it: more data, more transactions. Multi-datacenter: geographic or workload. Cassandra Query Language: SQL-like. (Chart labels: 100,000 txns/sec, 200,000 txns/sec, 400,000 txns/sec.)
  12. How Cassandra Works: Writes. "It's 72."
  14. How Cassandra Works: Writes. "Done."
  16. Tunable Consistency: Relax the Consistency in ACID. It isn't always needed and isn't guaranteed anyway (in distributed DBs). Reads may not get the most up-to-date data, but almost always will. All data is replicated: replication is set in the schema, and data is distributed to nodes by token range. Options: QUORUM, ONE, ALL. Can ensure reads get the most up-to-date value, e.g. by reading and writing at QUORUM.
  17. How Cassandra Works: Tunable Consistency. "You got it. I'll make sure everyone gets it." "You got it. A majority got it. The rest will." "You got it. One guy got it. The rest will." "You got it. Everyone has it."
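
The claim that reading and writing at QUORUM returns the most up-to-date value follows from simple replica arithmetic: a read overlaps the latest acknowledged write whenever the read and write replica counts together exceed the replication factor. The sketch below is plain Scala, not driver code; the names `replicasRequired` and `readsSeeLatestWrite` are hypothetical and exist only for illustration.

```scala
// Illustrative model of the three consistency levels named on the slide.
sealed trait ConsistencyLevel
case object One extends ConsistencyLevel
case object Quorum extends ConsistencyLevel
case object All extends ConsistencyLevel

// Number of replicas that must acknowledge an operation at a given level.
def replicasRequired(level: ConsistencyLevel, replicationFactor: Int): Int = level match {
  case One    => 1
  case Quorum => replicationFactor / 2 + 1 // a majority of the replicas
  case All    => replicationFactor
}

// A read is guaranteed to touch at least one replica holding the latest
// acknowledged write when (read replicas + write replicas) > replication factor.
def readsSeeLatestWrite(write: ConsistencyLevel, read: ConsistencyLevel, replicationFactor: Int): Boolean =
  replicasRequired(write, replicationFactor) + replicasRequired(read, replicationFactor) > replicationFactor
```

With a replication factor of 3, QUORUM needs 2 replicas, so QUORUM writes plus QUORUM reads (2 + 2 > 3) always overlap, while ONE plus ONE (1 + 1) does not.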
  18. How Cassandra Works: Query. SELECT user_id FROM users WHERE name = 'PBCupFan';
  19. How Cassandra Works: Query. "Sure thing, let me get that for you."
  20. How Cassandra Works: Query. "What do you guys have for PBCup?"
  21. How Cassandra Works: Query. "Here's what I have." "Here's what I have."
  22. How Cassandra Works: Query. "Let me resolve any conflicts."
  23. How Cassandra Works: Query. "Here ya go!" Result: user_id 1234 (1 rows).
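
The "let me resolve any conflicts" step can be sketched as follows. Cassandra reconciles differing replica answers by keeping the value with the newest write timestamp (last write wins). The code is a plain-Scala illustration under that assumption; `ReplicaResponse`, `resolve`, and the stale value 1111 are hypothetical, and only the winning user_id 1234 comes from the slides.

```scala
// One replica's answer to the coordinator, tagged with the write timestamp
// of the value it holds (microseconds, as an illustrative unit).
case class ReplicaResponse(replica: String, userId: Int, writeTimestampMicros: Long)

// Last-write-wins: the coordinator keeps the response with the newest timestamp.
def resolve(responses: Seq[ReplicaResponse]): Option[ReplicaResponse] =
  if (responses.isEmpty) None
  else Some(responses.maxBy(_.writeTimestampMicros))

// Two replicas disagree; the one holding the newer write wins.
val responses = Seq(
  ReplicaResponse("replica-a", 1234, writeTimestampMicros = 200L),
  ReplicaResponse("replica-b", 1111, writeTimestampMicros = 100L) // stale copy
)
val winner = resolve(responses)
```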
  24. Cassandra for Internet of Things: It's all about scaling.
  27. Cassandra: Always on (no downtime). Linear scalability for writes or reads, and for data size. Terrific choice for Internet of Things, Web, Mobile, etc.: British Gas, Nike, etc.; thermostats, manufacturing, oil/gas, etc. It's where the data is!
  28. Cassandra Limitations (it actually can't do everything): No aggregations: optimized for lookups and writes; no GROUP BYs; no windowed aggregates. No joins: data model to avoid them. Must select by partition key: there are secondary indexes, but they are an antipattern. Not optimized for full-table scans.
  29. Apache Spark: Distributed computing framework. Generalized DAG execution. Easy abstraction for datasets. Integrated SQL queries, streaming, and machine learning library.
  30. Spark Components: Spark Core Engine; Spark SQL; Spark Streaming; MLlib; GraphX; Spark R.
  31. Spark Components.
  32. Spark Provides a Simple and Efficient Framework for Distributed Computations. Node roles: 2. In-memory caching: Yes! Generic DAG execution: Yes! Great abstraction for datasets? DataFrame! (previously Resilient Distributed Dataset (RDD)). (Diagram labels: Spark Master, Spark Worker, Spark Executor, Spark Partition, DataFrame (or RDD).)
  33. Spark Provides a Simple and Efficient Framework for Distributed Computations. Spark Master: assigns cluster resources to applications. Spark Worker: manages executors running on a machine. Spark Executor: started by the Worker; the workhorse of the Spark application. (Diagram labels: Spark Master, Spark Worker, Spark Executor, Spark Partition, DataFrame (or RDD).)
  36. RDDs Can Be Generated from a Variety of Sources: text files; parallelized collections.
  38. Spark on Cassandra: Spark Core Engine; Spark SQL; Spark Streaming; MLlib; GraphX; Spark R; Cassandra; DataStax Spark-Cassandra Connector.
  39. Spark Cassandra Connector uses the DataStax Java Driver to read from and write to Cassandra. Each Executor maintains a connection to the C* cluster. RDDs are read into different splits based on sets of tokens (e.g. tokens 1-1000, tokens 1001-2000, ...) taken from the full C* token range.
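
The idea of carving the full token range into per-split sub-ranges can be sketched in a few lines of plain Scala. This is a deliberate simplification using the small token numbers from the slide; the helper `splitTokenRange` is hypothetical, and the connector's actual splitting logic is more involved than fixed-width ranges.

```scala
// Cut the inclusive range [first, last] into contiguous sub-ranges of at most
// tokensPerSplit tokens each; every sub-range would back one Spark partition.
def splitTokenRange(first: Long, last: Long, tokensPerSplit: Long): Seq[(Long, Long)] = {
  require(tokensPerSplit > 0, "tokensPerSplit must be positive")
  (first to last by tokensPerSplit).map { start =>
    (start, math.min(start + tokensPerSplit - 1, last))
  }
}

// Tokens 1-1000, 1001-2000, 2001-3000, as on the slide's diagram.
val splits = splitTokenRange(1L, 3000L, 1000L)
```

Each executor would then read only the rows whose tokens fall inside its assigned sub-ranges, which is what lets the reads proceed in parallel.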
  41. Co-locate Spark and C* for Best Performance: Run Cassandra and Spark on the same nodes. Local reads/writes. Increased performance.
  42. Things You Can't Do in Cassandra: Using SparkSQL.
      JOINs:
      csc.sql("SELECT t.sensor_id, t.temp, m.location FROM ks.temperatures t JOIN ks.metadata m ON t.sensor_id = m.sensor_id WHERE t.sensor_id = 12345")
      Aggregates:
      csc.sql("SELECT sensor_id, year, month, MAX(temp) mtemp FROM ks.temperatures GROUP BY sensor_id, year, month")
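
To make the GROUP BY query concrete, the same aggregation can be written against an in-memory Scala collection. This is only a sketch of what the query computes, not how SparkSQL executes it; the `Reading` case class, `maxTempByMonth`, and the sample rows are hypothetical.

```scala
// One row of a temperatures table, reduced to the columns the query uses.
case class Reading(sensorId: Int, year: Int, month: Int, temp: Double)

// Equivalent of: SELECT sensor_id, year, month, MAX(temp) ... GROUP BY sensor_id, year, month
def maxTempByMonth(readings: Seq[Reading]): Map[(Int, Int, Int), Double] =
  readings
    .groupBy(r => (r.sensorId, r.year, r.month))
    .map { case (key, group) => key -> group.map(_.temp).max }

// Made-up sample rows for illustration.
val sample = Seq(
  Reading(12345, 2015, 7, 98.5),
  Reading(12345, 2015, 7, 101.2),
  Reading(12345, 2015, 8, 95.0)
)
val maxima = maxTempByMonth(sample)
```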
  43. Things You Can't Do in Cassandra: External Data.
      JOIN with HDFS data:
      import com.datastax.spark.connector._
      // 2014 temperatures from HDFS, keyed by (sensor_id, year, month)
      val temp2014 = sc.textFile("webhdfs://myhadoop/data/temp2014.csv").
        map(x => x.split(",")).
        map(x => ((x(0).toInt, x(1).toInt, x(2).toInt), x(3).toDouble))
      // 2015 temperatures from Cassandra, keyed the same way
      val temp2015 = sc.cassandraTable("ks", "temperatures").
        map(x => ((x.getInt("sensor_id"), x.getInt("year"), x.getInt("month")), x.getDouble("avgTemp")))
      // keep the keys where the Cassandra value exceeds the HDFS value
      val hotter = temp2015.join(temp2014).filter(x => x._2._1 > x._2._2)
      Non-partition-key predicates:
      csc.sql("SELECT * FROM ks.temperatures WHERE temp > 100")
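
The join-then-filter pattern on this slide can be tried without a cluster by using plain Scala maps in place of pair RDDs. This is a sketch only: `hotterReadings` and the sample values are hypothetical, and the key is assumed to be (sensor_id, month) so that readings from two different years can pair up.

```scala
// (sensor_id, month): an assumed key that lets two different years line up.
type Key = (Int, Int)

// Inner join on the key, keeping only entries where the current value is
// higher than the previous one. Result values are (current, previous).
def hotterReadings(current: Map[Key, Double], previous: Map[Key, Double]): Map[Key, (Double, Double)] =
  current.collect {
    case (key, now) if previous.get(key).exists(now > _) => key -> (now, previous(key))
  }

// Made-up sample data for illustration.
val temps2015 = Map((12345, 7) -> 31.5, (12345, 8) -> 29.0, (67890, 7) -> 25.0)
val temps2014 = Map((12345, 7) -> 30.0, (12345, 8) -> 30.0)
val hotterSample = hotterReadings(temps2015, temps2014)
```

Keys present on only one side, such as sensor 67890 here, drop out, which matches the inner-join behaviour of `join` on pair RDDs.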
  44. Tools: ODBC and JDBC tools via SparkSQL (Tableau, Pentaho, R, etc.). Apache Zeppelin (incubating): a web-based notebook that enables interactive data analytics.
  45. Quick Word on Spark Streaming and Cassandra: Very good combination. Simple, powerful, useful, scalable, etc. (Diagram label: Receiver.)
  46. Quick Word on Spark Streaming and Cassandra:
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import com.datastax.spark.connector.streaming._
      // Spark connection options
      val conf = new SparkConf(true)...
      // streaming with 1 second batch window
      val ssc = new StreamingContext(conf, Seconds(1))
      // stream input
      val lines = ssc.socketTextStream(serverIP, serverPort)
      // count words
      val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
      // stream output
      wordC