Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassandra and Apache Spark

© Copyright 2015 Glassbeam Inc. Ad Hoc Analytics on Internet of Complex Things with Spark and Cassandra Mohammed Guller September 2015

Transcript of Glassbeam: Ad-hoc Analytics on Internet of Complex Things with Apache Cassandra and Apache Spark

Page 1

Ad Hoc Analytics on Internet of Complex Things with Spark and Cassandra

Mohammed Guller

September 2015

Page 2

Let’s Take a Quick Poll

Familiar with IoT

Data modelling experience in C*

Familiar with Spark

Hands-on experience with Spark

Page 3

About Me

Principal Architect at Glassbeam

Author of an upcoming book

– “Big Data Analytics with Spark”

Founded two startups

Passionate about building new products, big data analytics, and Machine Learning

Berkeley Graduate

LinkedIn: www.linkedin.com/in/mohammedguller

Twitter: @MohammedGuller

Page 4

Internet of Things (IoT)

Network of objects embedded with software for collecting and exchanging data over the Internet

Page 5

Internet of Complex Things (IoCT)


Data Center Devices

– Server, storage, controller

Medical Devices

– X-Ray, MRI scan, CT scan

Manufacturing Systems

Cars

Electric Vehicle Chargers

Other Complex Devices

Glassbeam target market is focused on driving operational & business analytics value for connected product companies in Industrial IoT market

IT & Networks, Medical & HealthCare, Transportation, EV Chargers & Smart Grid, Industrial & Mfg

Page 6

IT & Networks

Medical & Healthcare

EV Chargers & Smart Grid

Industrial & Mfg

Transportation

Glassbeam


Advanced and Predictive Analytics for Connected Product Companies

Page 7

Analytics on Operational Data

Operational Data to Powerful Insights

Page 8

High-level Architecture

[Architecture diagram. Layers: Data Ingestion, Data Transformation, Data Stores, Middleware, Applications. Labels: Logs (Streams/docs); SPL Library; SCALAR; INFOSERVER; LogVault; Explorer; Workbench; Standard Apps; Custom Apps; Rules & Alerts; Direct Access; Glassbeam Studio; Cloud Enablement & Automation; Amazon S3 (Raw logs); Cassandra (Processed Data); SolrCloud (Index); Analytics and Machine Learning (Spark SQL, Spark Streaming, MLlib); Event Processing & Rules Engine]

End-to-end cloud-based architecture built on modern technologies to handle any machine, any data, any cloud

* SPL (Semiotic Parsing Language) and SCALAR are patent pending technology inventions of Glassbeam

Page 9

Key Properties of IoCT Data

Volume: Terabytes of Data

Variety: Multi-structured Data

Velocity: Fast Paced Batch Data, Streaming Data

Page 10

Why We Chose C*

Volume: Economically Scale from Gigabytes to Terabytes of Data (Linear Scalability)

Variety: Store Multi-structured Data (Dynamic Schema)

Velocity: Fast Ingest of New Data, Quick Reload of Old Data (Fast Writes)

Page 11

Modeling Data in C*

Different from Modeling Data in RDBMS

Queries Drive Table and Primary Key Definitions

– Primary Key Definition Limits the Kind of Queries You Can Run

– C* Does Not Support Joins

Page 12

A Simple Table for Storing Event Data in C*

CREATE TABLE event (
    sys_id   text,
    dt       timestamp,
    ts       timestamp,
    severity text,
    module   text,
    message  text,
    PRIMARY KEY ((sys_id, dt), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
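To make the effect of this primary key concrete, here is a minimal Python sketch (not Cassandra code) that mimics the layout: rows are grouped by the partition key (sys_id, dt) and kept newest-first by ts within each partition. The class EventTable, its methods, and the sample values are hypothetical and exist only for illustration.

```python
from collections import defaultdict


class EventTable:
    """Toy in-memory stand-in for the `event` table (illustrative only)."""

    def __init__(self):
        # Partition key (sys_id, dt) -> list of rows in that partition.
        self._partitions = defaultdict(list)

    def insert(self, sys_id, dt, ts, severity, module, message):
        rows = self._partitions[(sys_id, dt)]
        rows.append({"ts": ts, "severity": severity,
                     "module": module, "message": message})
        # Mimic CLUSTERING ORDER BY (ts DESC): newest row first.
        rows.sort(key=lambda r: r["ts"], reverse=True)

    def latest(self, sys_id, dt, limit):
        # A read needs the full partition key and touches one partition.
        return self._partitions.get((sys_id, dt), [])[:limit]


table = EventTable()
table.insert("sys-1", "2015-09-01", 100, "INFO", "m1", "boot")
table.insert("sys-1", "2015-09-01", 300, "ERROR", "m2", "disk failure")
table.insert("sys-1", "2015-09-01", 200, "WARN", "m1", "high temperature")
table.insert("sys-2", "2015-09-01", 150, "ERROR", "m1", "fan failure")

newest = table.latest("sys-1", "2015-09-01", limit=2)
print([row["message"] for row in newest])
```

In this model, "the latest events for one system on one day" is a cheap lookup, while "all ERROR events across systems" has no matching key and would need every partition to be visited.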

Page 13

Another Table to Filter Events by Severity

CREATE TABLE event_by_severity (
    sys_id   text,
    dt       timestamp,
    ts       timestamp,
    severity text,
    module   text,
    message  text,
    PRIMARY KEY ((sys_id, dt), severity, ts)
) WITH CLUSTERING ORDER BY (severity ASC, ts DESC);

Page 14

Yet Another Table to Filter Events by Module

CREATE TABLE event_by_module (
    sys_id   text,
    dt       timestamp,
    ts       timestamp,
    severity text,
    module   text,
    message  text,
    PRIMARY KEY ((sys_id, dt), module, ts)
) WITH CLUSTERING ORDER BY (module ASC, ts DESC);
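The reason each new filter needs its own table can be sketched with a small checker. It applies a deliberately simplified rule, stated here as an assumption: only equality filters are considered, with no secondary indexes and no ALLOW FILTERING, so every partition key column must be filtered and any remaining filter columns must form a leading prefix of the clustering columns. The function names and the TABLES dictionary are hypothetical.

```python
def can_serve(partition_key, clustering, filter_columns):
    """Return True if equality filters on filter_columns fit the primary key.

    Simplified rule (assumption): all partition key columns are filtered,
    and the other filtered columns are a leading prefix of the clustering
    columns.
    """
    filters = set(filter_columns)
    if not set(partition_key) <= filters:
        return False
    rest = filters - set(partition_key)
    return rest == set(clustering[:len(rest)])


# Primary keys of the three tables: (partition key, clustering columns).
TABLES = {
    "event": (("sys_id", "dt"), ("ts",)),
    "event_by_severity": (("sys_id", "dt"), ("severity", "ts")),
    "event_by_module": (("sys_id", "dt"), ("module", "ts")),
}


def tables_for(filter_columns):
    """List the tables whose primary key can serve the given filters."""
    return [name for name, (pk, ck) in TABLES.items()
            if can_serve(pk, ck, filter_columns)]


print(tables_for(["sys_id", "dt", "severity"]))
print(tables_for(["sys_id", "dt", "module"]))
print(tables_for(["sys_id", "dt", "severity", "module"]))
print(tables_for(["severity"]))
```

Under this rule, filtering on severity and module together, or on severity alone across all systems, matches none of the three tables, so each such query would need yet another table.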

Page 15

Ad Hoc Analytics with C*

Oxymoron

All Queries Must Be Known Upfront

Page 16

Workaround Possible but Intractable

Columns: sys_id, Model, Age, OS, City, State, Country

• sys_by_model

• sys_by_os

• sys_by_age

• sys_by_state

• sys_by_state_age

• sys_by_age_state

• sys_by_model_age

• sys_by_age_model

• sys_by_age_model_state

• sys_by_model_state_age

• sys_by_model_state_os
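A rough count shows why this workaround does not scale. The sketch below assumes one table per ordered, non-empty choice of filter columns, which is an assumption based on sys_by_state_age and sys_by_age_state appearing as separate tables in the list. The function name is hypothetical.

```python
from math import perm


def query_tables_needed(num_columns):
    """Count ordered, non-empty selections of filter columns.

    Assumption: each ordering of each column subset needs its own table.
    """
    return sum(perm(num_columns, k) for k in range(1, num_columns + 1))


# Six filterable columns: Model, Age, OS, City, State, Country.
print(query_tables_needed(3))  # 15
print(query_tables_needed(6))  # 1956
```

Even if many orderings are never queried in practice, the count grows faster than exponentially with the number of columns, and every table must be written to on each insert.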

Page 17

Other Barriers to Ad Hoc Queries

No Aggregation

No Group By

No Joins

Page 18

What Do I Do Now?

Page 19

Page 20

Spark

Fast and General-purpose Cluster Computing Framework for Processing Large Datasets

API in Scala, Java, Python, SQL, and R

Page 21

Integrated Libraries for a Variety of Tasks

Spark Core

Spark SQL

GraphX

Spark Streaming

MLlib & Spark ML

Page 22

One Minor Problem!

Spark Does not Have Built-in Support for C*

Built-in Support for HDFS, S3 and JDBC-compliant Databases

Page 23

Spark Cassandra Connector

Open Source Library for Integrating Spark with C*

Enables a Spark Application to Process Data in C* Just Like Data from the Built-in Data Sources

Page 24

Spark with C*

Enables Ad Hoc Analytics

CQL Limitations No Longer Apply

Query Data Using SQL/HiveQL

– Filter on Any Column

– Aggregations

– Group By

Page 25

Ad Hoc Analytics in Spark Shell

Page 26

Launch the Spark Shell

/path/to/spark/bin/spark-shell \
    --master spark://host:7077 \
    --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0

Page 27

Create a DataFrame

val events = sqlContext.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map(
        "keyspace" -> "test",
        "table" -> "event"))
    .load()

Page 28

Fire Queries

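// Cache the DataFrame so repeated queries do not re-read C*
events.cache()

// Filter first, then project, so the filter column is still in scope
events.where($"severity" === "ERROR").select("ts", "module", "message").show
events.where($"module" === "m1").select("ts", "severity", "message").show
events.where($"severity" === "ERROR" && $"module" === "m1").select("ts", "message").show

// Aggregate: number of events per severity
events.groupBy("severity").count().show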

Page 29

Spark SQL JDBC/ODBC Server

Analyze data in C* with just SQL/HiveQL

Command Line Shell – Beeline

Graphical SQL Client – Squirrel

Data Visualization Applications

– Tableau

– ZoomData

– Qlik

Page 30

Ad Hoc Analytics with Spark SQL JDBC/ODBC Server

Page 31

Start the Spark SQL JDBC Server

/path/to/spark/sbin/start-thriftserver.sh \
    --master spark://hostname:7077 \
    --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0

Page 32

Launch Beeline From a Terminal

/path/to/spark/bin/beeline

Page 33

Connect to the Spark SQL JDBC Server

beeline> !connect jdbc:hive2://localhost:10000

Page 34

Create a Temporary Table

0: jdbc:hive2://localhost:10000> CREATE TEMPORARY TABLE event
. . . . . . . . . . . . . . . .> USING org.apache.spark.sql.cassandra
. . . . . . . . . . . . . . . .> OPTIONS (
. . . . . . . . . . . . . . . .>   keyspace "test",
. . . . . . . . . . . . . . . .>   table "event"
. . . . . . . . . . . . . . . .> );

Page 35

Query Data with SQL/HiveQL

...> CACHE TABLE event;
...> SELECT severity, count(1) as total FROM event GROUP BY severity;
...> SELECT module, severity, count(1) FROM event GROUP BY module, severity;
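To see the shape of the results these two GROUP BY queries return without a cluster, the sketch below runs them against an in-memory SQLite table through Python's standard sqlite3 module. SQLite is only a stand-in for the Spark SQL JDBC server here, the sample rows are made up, and an ORDER BY is added purely so the printed output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (sys_id TEXT, dt TEXT, ts INTEGER, "
    "severity TEXT, module TEXT, message TEXT)")

# Hypothetical sample events.
rows = [
    ("sys-1", "2015-09-01", 100, "INFO", "m1", "boot"),
    ("sys-1", "2015-09-01", 200, "WARN", "m1", "high temperature"),
    ("sys-1", "2015-09-01", 300, "ERROR", "m2", "disk failure"),
    ("sys-2", "2015-09-01", 150, "ERROR", "m1", "fan failure"),
]
conn.executemany("INSERT INTO event VALUES (?, ?, ?, ?, ?, ?)", rows)

# Same aggregations as in the Beeline session.
by_severity = conn.execute(
    "SELECT severity, count(1) AS total FROM event "
    "GROUP BY severity ORDER BY severity").fetchall()
by_module_severity = conn.execute(
    "SELECT module, severity, count(1) FROM event "
    "GROUP BY module, severity ORDER BY module, severity").fetchall()

print(by_severity)
print(by_module_severity)
conn.close()
```

Neither query filters on the partition key, which is why they can be expressed in SQL over Spark but not as a plain CQL query against the event table.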

Page 36

Caveats

Latency

Spark Query May Require Expensive Table Scan

– Reads Every Row

– Disk I/O Is Slow
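The cost difference can be illustrated by counting rows read. This is a toy model under stated assumptions: partitions are plain Python lists keyed by (sys_id, dt), the data is synthetic, and the row counts stand in for disk I/O. All function names are hypothetical.

```python
def build_partitions(num_systems, rows_per_partition):
    """Build toy partitions keyed by (sys_id, dt); all values are made up."""
    partitions = {}
    for s in range(num_systems):
        key = (f"sys-{s}", "2015-09-01")
        partitions[key] = [
            {"ts": t, "severity": "ERROR" if t % 10 == 0 else "INFO"}
            for t in range(rows_per_partition)
        ]
    return partitions


def full_scan(partitions, predicate):
    """Ad hoc filter: every row of every partition is read."""
    rows_read, matches = 0, []
    for rows in partitions.values():
        for row in rows:
            rows_read += 1
            if predicate(row):
                matches.append(row)
    return matches, rows_read


def partition_read(partitions, key):
    """Key-based read: only the rows of one partition are read."""
    rows = partitions.get(key, [])
    return rows, len(rows)


partitions = build_partitions(num_systems=100, rows_per_partition=1000)
errors, scanned = full_scan(partitions, lambda r: r["severity"] == "ERROR")
_, touched = partition_read(partitions, ("sys-7", "2015-09-01"))
print(scanned, touched, len(errors))
```

In this model the ad hoc filter reads 100 times as many rows as the key-based read, and the gap widens as more systems and days are stored.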

Page 37

Reduce the Impact of Slow Disk I/O

Cache Tables

Replace HDD with SSD

Add More Nodes

Page 38

Recommendations

Known Queries Requiring Sub-second Response Time

– Query C* Directly

– Create Query Specific Tables

– Pre-aggregate Data

Ad Hoc Queries

– Spark
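As a rough illustration of the "Pre-aggregate Data" recommendation, the sketch below keeps a running count per (sys_id, dt, severity) at ingest time, so a known query becomes a single lookup rather than a scan. The class SeverityCounts and its methods are hypothetical; in a real deployment the counts would live in a query-specific C* table rather than in process memory.

```python
from collections import Counter


class SeverityCounts:
    """Keeps per-(sys_id, dt, severity) event counts up to date at ingest."""

    def __init__(self):
        self._counts = Counter()

    def ingest(self, sys_id, dt, severity):
        # Update the aggregate as each event arrives.
        self._counts[(sys_id, dt, severity)] += 1

    def count(self, sys_id, dt, severity):
        # Known query: one key lookup, no scan over raw events.
        return self._counts[(sys_id, dt, severity)]


counts = SeverityCounts()
for severity in ["INFO", "ERROR", "ERROR", "WARN"]:
    counts.ingest("sys-1", "2015-09-01", severity)

print(counts.count("sys-1", "2015-09-01", "ERROR"))
print(counts.count("sys-1", "2015-09-02", "ERROR"))
```

The trade-off is the same as with query-specific tables: the aggregate answers only the question it was built for, which is why ad hoc questions are left to Spark.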
