Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassandra and Apache Spark

  • Date posted: 21-Jan-2018

Transcript of Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassandra and Apache Spark

  1. Copyright 2015 Glassbeam Inc. Ad Hoc Analytics on Internet of Complex Things with Spark and Cassandra. Mohammed Guller, September 2015
  2. Let's Take a Quick Poll
      • Familiar with IoT
      • Data modelling experience in C*
      • Familiar with Spark
      • Hands-on experience with Spark
  3. About Me
      • Principal Architect at Glassbeam
      • Author of an upcoming book, Big Data Analytics with Spark
      • Founded two startups
      • Passionate about building new products, big data analytics, and machine learning
      • Berkeley graduate
      • LinkedIn: www.linkedin.com/in/mohammedguller
      • Twitter: @MohammedGuller
  4. Internet of Things (IoT): a network of objects embedded with software for collecting and exchanging data over the Internet
  5. Internet of Complex Things (IoCT)
      • Data center devices: server, storage, controller
      • Medical devices: X-ray, MRI scan, CT scan
      • Manufacturing systems
      • Cars
      • Electric vehicle chargers
      • Other complex devices
      Glassbeam's target market is focused on driving operational & business analytics value for connected product companies in the Industrial IoT market (IT & Networks, Medical & Health Care, EV Chargers & Smart Grid).
  6. Glassbeam: Advanced and Predictive Analytics for Connected Product Companies
      • IT & Networks
      • Medical & Healthcare
      • EV Chargers & Smart Grid
      • Industrial & Mfg
      • Transportation
      Glassbeam's target market is focused on driving operational & business analytics value for connected product companies in the Industrial IoT market.
  7. Analytics on Operational Data: Operational Data to Powerful Insights
  8. High-level Architecture
      • Layers: Data Ingestion, Data Transformation, Data Stores, Middleware, Applications
      • Components: Logs (streams/docs), SPL Library, SCALAR, INFOSERVER, LogVault, Explorer, Workbench, Standard Apps, Custom Apps, Rules & Alerts, DirectAccess, Glassbeam Studio, Cloud Enablement & Automation, Event Processing & Rules Engine
      • Data stores: Amazon S3 (raw logs), Cassandra (processed data), Solr Cloud (index)
      • Analytics and machine learning: Spark SQL, Spark Streaming, MLlib
      End-to-end cloud-based architecture built on modern technologies to handle any machine, any data, any cloud.
      * SPL (Semiotic Parsing Language) and SCALAR are patent pending technology inventions of Glassbeam
  9. Key Properties of IoCT Data
      • Volume: terabytes of data
      • Variety: multi-structured data
      • Velocity: fast-paced batch data, streaming data
  10. Why We Chose C*
      • Volume: economically scale from gigabytes to terabytes of data
      • Variety: store multi-structured data
      • Velocity: fast ingest of new data, quick reload of old data
      • Linear scalability, dynamic schema, fast writes
  11. Modeling Data in C*
      • Different from modeling data in an RDBMS
      • Queries drive table and primary key definitions
      • Primary key definition limits the kind of queries you can run
      • C* does not support joins
  12. A Simple Table for Storing Event Data in C*
      CREATE TABLE event (
        sys_id text,
        dt timestamp,
        ts timestamp,
        severity text,
        module text,
        message text,
        PRIMARY KEY ((sys_id, dt), ts)
      ) WITH CLUSTERING ORDER BY (ts DESC);
  13. Another Table to Filter Events by Severity
      CREATE TABLE event_by_severity (
        sys_id text,
        dt timestamp,
        ts timestamp,
        severity text,
        module text,
        message text,
        PRIMARY KEY ((sys_id, dt), severity, ts)
      ) WITH CLUSTERING ORDER BY (severity ASC, ts DESC);
  14. Yet Another Table to Filter Events by Module
      CREATE TABLE event_by_module (
        sys_id text,
        dt timestamp,
        ts timestamp,
        severity text,
        module text,
        message text,
        PRIMARY KEY ((sys_id, dt), module, ts)
      ) WITH CLUSTERING ORDER BY (module ASC, ts DESC);
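
The three table definitions above can be pictured with a small in-memory model. The sketch below is plain Scala, not Cassandra driver code, and every name in it (`Event`, `QueryTables`, `sliceBySeverity`) is hypothetical. It assumes a partition behaves like a list of rows kept sorted by the clustering columns, which is a simplification, and it shows why filtering by severity becomes a contiguous slice only when severity leads the clustering order.

```scala
// Hypothetical in-memory model of the tables above; plain Scala, not Cassandra code.
final case class Event(sysId: String, dt: String, ts: Long,
                       severity: String, module: String, message: String)

object QueryTables {
  // All three tables share the partition key (sys_id, dt).
  type PartitionKey = (String, String)
  type Table = Map[PartitionKey, Seq[Event]]

  // Group rows by partition key, then keep each partition sorted by its clustering key.
  private def build[K: Ordering](events: Seq[Event])(clustering: Event => K): Table =
    events.groupBy(e => (e.sysId, e.dt)).map { case (key, rows) => key -> rows.sortBy(clustering) }

  // event: clustered by ts DESC (negating ts is a shortcut for descending order).
  def eventTable(events: Seq[Event]): Table = build(events)(e => -e.ts)

  // event_by_severity: clustered by severity ASC, then ts DESC.
  def eventBySeverity(events: Seq[Event]): Table = build(events)(e => (e.severity, -e.ts))

  // With severity leading the clustering order, one severity is a contiguous run of rows.
  def sliceBySeverity(table: Table, key: PartitionKey, severity: String): Seq[Event] =
    table.getOrElse(key, Seq.empty)
      .dropWhile(_.severity != severity)
      .takeWhile(_.severity == severity)
}
```

In this model the same events are stored once per query pattern; a table for filtering by module would be built the same way with `e => (e.module, -e.ts)` as the clustering key.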
  15. Ad Hoc Analytics with C*: an oxymoron. All queries must be known upfront.
  16. Workaround Possible but Intractable
      • Columns: Sys_id, Model, Age, OS, City, State, Country
      • Query-specific tables: sys_by_model, sys_by_os, sys_by_age, sys_by_state, sys_by_state_age, sys_by_age_state, sys_by_model_age, sys_by_age_model, sys_by_age_model_state, sys_by_model_state_age, sys_by_model_state_os
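
To get a feel for why this workaround does not scale, the following rough sketch counts how many query-specific tables could be needed. It assumes, as an upper bound, that every ordered combination of the six non-key attributes (Model, Age, OS, City, State, Country) gets its own table, since the slide lists both sys_by_state_age and sys_by_age_state. The object and method names are hypothetical.

```scala
object TableExplosion {
  // Ordered selections of k columns out of n: n * (n - 1) * ... * (n - k + 1).
  def arrangements(n: Int, k: Int): BigInt =
    (0 until k).map(i => BigInt(n - i)).product

  // Upper bound on query-specific tables: sum over every non-empty selection size.
  def totalTables(n: Int): BigInt =
    (1 to n).map(k => arrangements(n, k)).sum
}
```

Under that assumption, `TableExplosion.totalTables(6)` evaluates to 1956. Real workloads would need far fewer tables, but the count still grows factorially with the number of filterable columns.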
  17. Other Barriers to Ad Hoc Queries
      • No aggregation
      • No group by
      • No joins
  18. What Do I Do Now?
  19. (no text on this slide)
  20. Spark
      • Fast and general-purpose cluster computing framework for processing large datasets
      • API in Scala, Java, Python, SQL, and R
  21. Integrated Libraries for a Variety of Tasks: Spark Core, Spark SQL, GraphX, Spark Streaming, MLlib & Spark ML
  22. One Minor Problem!
      • Spark does not have built-in support for C*
      • Built-in support for HDFS, S3, and JDBC-compliant databases
  23. Spark Cassandra Connector
      • Open source library for integrating Spark with C*
      • Enables a Spark application to process data in C* just like data from the built-in data sources
  24. Spark with C* Enables Ad Hoc Analytics
      • CQL limitations no longer apply
      • Query data using SQL/HiveQL
      • Filter on any column
      • Aggregations
      • Group by
  25. Ad Hoc Analytics in Spark Shell
  26. Launch the Spark Shell
      /path/to/spark/bin/spark-shell --master spark://host:7077 --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0
  27. Create a DataFrame
      val events = sqlContext.read
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "test", "table" -> "event"))
        .load()
  28. Fire Queries
      events.cache()
      events.select("ts", "module", "message").where($"severity" === "ERROR").show
      events.select("ts", "severity", "message").where($"module" === "m1").show
      events.select("ts", "message").where($"severity" === "ERROR" && $"module" === "m1").show
      events.groupBy("severity").count()
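
For readers without a Spark cluster at hand, the semantics of the queries above can be approximated on ordinary Scala collections. This is only an illustrative stand-in, not Spark code: `LogRow` and `AdHocQueries` are hypothetical names, and the rows live in local memory rather than in a DataFrame loaded from C*.

```scala
// Hypothetical local stand-in for rows of the event table.
final case class LogRow(ts: Long, severity: String, module: String, message: String)

object AdHocQueries {
  // Mirrors: select("ts", "module", "message").where(severity === "ERROR")
  def errors(rows: Seq[LogRow]): Seq[(Long, String, String)] =
    rows.filter(_.severity == "ERROR").map(r => (r.ts, r.module, r.message))

  // Mirrors: select("ts", "message").where(severity === "ERROR" && module === <module>)
  def errorsInModule(rows: Seq[LogRow], module: String): Seq[(Long, String)] =
    rows.filter(r => r.severity == "ERROR" && r.module == module).map(r => (r.ts, r.message))

  // Mirrors: groupBy("severity").count()
  def countBySeverity(rows: Seq[LogRow]): Map[String, Int] =
    rows.groupBy(_.severity).map { case (sev, rs) => sev -> rs.size }
}
```

None of these filters depend on how the primary key was defined, which is the point of the slide: any column can be used in a predicate or a grouping.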
  29. Spark SQL JDBC/ODBC Server
      • Analyze data in C* with just SQL/HiveQL
      • Command line shell: Beeline
      • Graphical SQL client: Squirrel
      • Data visualization applications: Tableau, ZoomData, Qlik
  30. Ad Hoc Analytics with Spark SQL JDBC/ODBC Server
  31. Start the Spark SQL JDBC Server
      /path/to/spark/sbin/start-thriftserver.sh --master spark://hostname:7077 --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0
  32. Launch Beeline From a Terminal
      /path/to/spark/bin/beeline
  33. Connect to the Spark SQL JDBC Server
      beeline> !connect jdbc:hive2://localhost:10000
  34. Create a Temporary Table
      CREATE TEMPORARY TABLE event
      USING org.apache.spark.sql.cassandra
      OPTIONS (
        keyspace "test",
        table "event"
      );
  35. Query Data with SQL/HiveQL
      CACHE TABLE event;
      SELECT severity, count(1) as total FROM event GROUP BY severity;
      SELECT module, severity, count(1) FROM event GROUP BY module, severity;
  36. Caveats
      • Latency
      • Spark query may require an expensive table scan that reads every row
      • Disk I/O is slow
  37. Reduce the Impact of Slow Disk I/O
      • Cache tables
      • Replace HDD with SSD
      • Add more nodes
  38. Recommendations
      • Known queries requiring sub-second response time: query C* directly, create query-specific tables, pre-aggregate data
      • Ad hoc queries: Spark
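
The "pre-aggregate data" recommendation can be sketched as follows. This is one possible reading, not the presenter's implementation: it assumes a hypothetical summary keyed by (sys_id, dt, severity) whose counters are updated as events are ingested, so that a known query becomes a single key lookup instead of a scan. The in-memory `Map` stands in for a summary table in C*.

```scala
object PreAggregation {
  // Key of a hypothetical summary table: (sys_id, dt, severity).
  type Key = (String, String, String)

  // Write path: bump the counter for the event's key as the event is ingested.
  def record(counts: Map[Key, Long], sysId: String, dt: String, severity: String): Map[Key, Long] = {
    val key = (sysId, dt, severity)
    counts.updated(key, counts.getOrElse(key, 0L) + 1L)
  }

  // Read path: a single key lookup rather than a pass over the raw events.
  def errorCount(counts: Map[Key, Long], sysId: String, dt: String): Long =
    counts.getOrElse((sysId, dt, "ERROR"), 0L)
}
```

The trade-off is the usual one: the aggregate must be chosen in advance, so this suits the known, latency-sensitive queries, while Spark covers the questions nobody anticipated.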
  39. (no text on this slide)