Apache Cassandra & Apache Spark for Time Series Data


Page 1: Apache cassandra & apache spark for time series data

Apache Cassandra & Apache Spark for Time Series Data

Patrick McFadin
Chief Evangelist for Apache Cassandra, DataStax

@PatrickMcFadin

©2013 DataStax Confidential. Do not distribute without consent.

Page 2: Apache cassandra & apache spark for time series data

Cassandra for Applications


Page 3: Apache cassandra & apache spark for time series data

Cassandra is…

• Shared nothing
• Masterless peer-to-peer
• Based on Dynamo

Page 4: Apache cassandra & apache spark for time series data

Scaling

• Add nodes to scale
• Millions Ops/s

[Chart: throughput (ops/sec) for Cassandra, HBase, Redis, and MySQL]

Page 5: Apache cassandra & apache spark for time series data

Uptime

• Built to replicate
• Resilient to failure
• Always on

NONE

Page 6: Apache cassandra & apache spark for time series data

Replication

DC1 (RF=3)

Node        Token range
10.0.0.1    00-25
10.0.0.2    26-50
10.0.0.3    51-75
10.0.0.4    76-100

DC2 (RF=3)

Node        Token range
10.10.0.1   00-25
10.10.0.2   26-50
10.10.0.3   51-75
10.10.0.4   76-100

Client: Insert Data

• Asynchronous Local Replication
• Asynchronous WAN Replication
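
The ring on this slide can be approximated in a few lines of Scala. This is a simplified sketch, not Cassandra's actual placement logic: it assumes the 0-100 token space drawn on the slide, a toy hash, and a rule where the node owning the token plus the next RF-1 nodes clockwise hold the replicas. Real clusters use a far larger token space and configurable replication strategies. The names `RingSketch`, `Node`, `token`, and `replicas` are hypothetical.

```scala
object RingSketch {
  // Each node owns an inclusive token range, as drawn on the slide (tokens 0..100).
  final case class Node(address: String, low: Int, high: Int)

  val dc1: Vector[Node] = Vector(
    Node("10.0.0.1", 0, 25),
    Node("10.0.0.2", 26, 50),
    Node("10.0.0.3", 51, 75),
    Node("10.0.0.4", 76, 100)
  )

  // Toy hash (an assumption for illustration): maps a partition key to a token in 0..100.
  def token(partitionKey: String): Int =
    java.lang.Math.floorMod(partitionKey.hashCode, 101)

  // The owner of the token plus the next (rf - 1) nodes clockwise hold the replicas.
  def replicas(ring: Vector[Node], tok: Int, rf: Int): Vector[String] = {
    val owner = ring.indexWhere(n => tok >= n.low && tok <= n.high)
    require(owner >= 0, s"token $tok is outside the ring")
    Vector.tabulate(math.min(rf, ring.size))(i => ring((owner + i) % ring.size).address)
  }
}
```

With RF=3, a token of 30 lands on 10.0.0.2 and is also copied to 10.0.0.3 and 10.0.0.4; a token of 80 wraps around the ring to 10.0.0.4, 10.0.0.1, and 10.0.0.2.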

Page 7: Apache cassandra & apache spark for time series data

Data Model

• Familiar syntax
• Collections
• PRIMARY KEY for uniqueness

CREATE TABLE videos (
    videoid uuid,
    userid uuid,
    name varchar,
    description varchar,
    location text,
    location_type int,
    preview_thumbnails map<text,text>,
    tags set<varchar>,
    added_date timestamp,
    PRIMARY KEY (videoid)
);

Page 8: Apache cassandra & apache spark for time series data

Data Model - User Defined Types

• Complex data in one place
• No multi-gets (multi-partitions)
• Nesting!

CREATE TYPE address (
    street text,
    city text,
    zip_code int,
    country text,
    cross_streets set<text>
);

Page 9: Apache cassandra & apache spark for time series data

Data Model - Updated

• Now video_metadata is embedded in videos

CREATE TYPE video_metadata (
    height int,
    width int,
    video_bit_rate set<text>,
    encoding text
);

CREATE TABLE videos (
    videoid uuid,
    userid uuid,
    name varchar,
    description varchar,
    location text,
    location_type int,
    preview_thumbnails map<text,text>,
    tags set<varchar>,
    metadata set<frozen<video_metadata>>,
    added_date timestamp,
    PRIMARY KEY (videoid)
);

Page 10: Apache cassandra & apache spark for time series data

Data Model - Storing JSON

{
  "productId": 2,
  "name": "Kitchen Table",
  "price": 249.99,
  "description": "Rectangular table with oak finish",
  "dimensions": {
    "units": "inches",
    "length": 50.0,
    "width": 66.0,
    "height": 32
  },
  "categories": {
    "Home Furnishings": {
      "catalogPage": 45,
      "url": "/home/furnishings"
    },
    "Kitchen Furnishings": {
      "catalogPage": 108,
      "url": "/kitchen/furnishings"
    }
  }
}

CREATE TYPE dimensions (
    units text,
    length float,
    width float,
    height float
);

CREATE TYPE category (
    catalogPage int,
    url text
);

CREATE TABLE product (
    productId int,
    name text,
    price float,
    description text,
    dimensions frozen<dimensions>,
    categories map<text, frozen<category>>,
    PRIMARY KEY (productId)
);

Page 11: Apache cassandra & apache spark for time series data

Why…

Cassandra for Time Series?

Spark as a great addition to Cassandra?

Page 12: Apache cassandra & apache spark for time series data

Example 1: Weather Station

• Weather station collects data
• Cassandra stores in sequence
• Application reads in sequence

Page 13: Apache cassandra & apache spark for time series data

Use case

• Store data per weather station
• Store time series in order: first to last

Needed Queries

• Get all data for one weather station
• Get data for a single date and time
• Get data for a range of dates and times

Data Model to support queries

Page 14: Apache cassandra & apache spark for time series data

Data Model

• Weather Station Id and Time are unique
• Store as many as needed

CREATE TABLE temperature (
    weather_station text,
    year int,
    month int,
    day int,
    hour int,
    temperature double,
    PRIMARY KEY ((weather_station),year,month,day,hour)
);

INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,7,-5.6);

INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,8,-5.1);

INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,9,-4.9);

INSERT INTO temperature(weather_station,year,month,day,hour,temperature)
VALUES ('10010:99999',2005,12,1,10,-5.3);

Page 15: Apache cassandra & apache spark for time series data

Storage Model - Logical View

SELECT weather_station,hour,temperature
FROM temperature
WHERE weather_station='10010:99999'
  AND year = 2005 AND month = 12 AND day = 1;

weather_station | hour         | temperature
10010:99999     | 2005:12:1:7  | -5.6
10010:99999     | 2005:12:1:8  | -5.1
10010:99999     | 2005:12:1:9  | -4.9
10010:99999     | 2005:12:1:10 | -5.3

Page 16: Apache cassandra & apache spark for time series data

Storage Model - Disk Layout

Merged, Sorted and Stored Sequentially

10010:99999 | 2005:12:1:7 | 2005:12:1:8 | 2005:12:1:9 | 2005:12:1:10 | 2005:12:1:11 | 2005:12:1:12
            | -5.6        | -5.1        | -4.9        | -5.3         | -4.9         | -5.4

SELECT weather_station,hour,temperature
FROM temperature
WHERE weather_station='10010:99999'
  AND year = 2005 AND month = 12 AND day = 1;

Page 17: Apache cassandra & apache spark for time series data

Primary key relationship

PRIMARY KEY (weather_station,year,month,day,hour)

Page 18: Apache cassandra & apache spark for time series data

Primary key relationship

PRIMARY KEY (weather_station,year,month,day,hour)

Partition Key: weather_station

Page 19: Apache cassandra & apache spark for time series data

Primary key relationship

PRIMARY KEY (weather_station,year,month,day,hour)

Partition Key: weather_station
Clustering Columns: year, month, day, hour

Page 20: Apache cassandra & apache spark for time series data

Primary key relationship

PRIMARY KEY (weather_station,year,month,day,hour)

Partition Key: weather_station
Clustering Columns: year, month, day, hour

10010:99999

Page 21: Apache cassandra & apache spark for time series data

Primary key relationship

PRIMARY KEY (weather_station,year,month,day,hour)

Partition Key: weather_station
Clustering Columns: year, month, day, hour

10010:99999 | 2005:12:1:7 | 2005:12:1:8 | 2005:12:1:9 | 2005:12:1:10
            | -5.6        | -5.1        | -4.9        | -5.3

Page 22: Apache cassandra & apache spark for time series data

Data Locality

weather_station='10010:99999' ?

1000 Node Cluster

You are here!

Page 23: Apache cassandra & apache spark for time series data

Query patterns

• Range queries
• "Slice" operation on disk

SELECT weather_station,hour,temperature
FROM temperature
WHERE weather_station='10010:99999'
  AND year = 2005 AND month = 12 AND day = 1
  AND hour >= 7 AND hour <= 10;

Single seek on disk

Partition key for locality

10010:99999 | 2005:12:1:7 | 2005:12:1:8 | 2005:12:1:9 | 2005:12:1:10 | 2005:12:1:11 | 2005:12:1:12
            | -5.6        | -5.1        | -4.9        | -5.3         | -4.9         | -5.4

Page 24: Apache cassandra & apache spark for time series data

Query patterns

• Range queries
• "Slice" operation on disk

Programmers like this

SELECT weather_station,hour,temperature
FROM temperature
WHERE weather_station='10010:99999'
  AND year = 2005 AND month = 12 AND day = 1
  AND hour >= 7 AND hour <= 10;

Sorted by event_time

weather_station | hour         | temperature
10010:99999     | 2005:12:1:7  | -5.6
10010:99999     | 2005:12:1:8  | -5.1
10010:99999     | 2005:12:1:9  | -4.9
10010:99999     | 2005:12:1:10 | -5.3
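
The "slice" idea can be mimicked with an ordinary sorted map. The sketch below is only an analogy for how clustering columns keep rows ordered inside one partition, not Cassandra's storage engine. It assumes a single partition for station 10010:99999 keyed by (year, month, day, hour); `SliceSketch` and `hoursBetween` are hypothetical names.

```scala
import scala.collection.immutable.TreeMap

object SliceSketch {
  // Clustering key: (year, month, day, hour). Tuples are ordered lexicographically.
  type Clustering = (Int, Int, Int, Int)

  // One partition, written out of order; the TreeMap keeps it sorted by clustering key.
  val partition: TreeMap[Clustering, Double] = TreeMap(
    (2005, 12, 1, 10) -> -5.3,
    (2005, 12, 1, 7)  -> -5.6,
    (2005, 12, 1, 12) -> -5.4,
    (2005, 12, 1, 9)  -> -4.9,
    (2005, 12, 1, 8)  -> -5.1,
    (2005, 12, 1, 11) -> -4.9
  )

  // Contiguous slice of one day: hours fromHour..toHour, both inclusive.
  def hoursBetween(p: TreeMap[Clustering, Double],
                   year: Int, month: Int, day: Int,
                   fromHour: Int, toHour: Int): Vector[(Clustering, Double)] =
    p.range((year, month, day, fromHour), (year, month, day, toHour + 1)).toVector
}
```

Calling `SliceSketch.hoursBetween(SliceSketch.partition, 2005, 12, 1, 7, 10)` returns the four readings for hours 7 through 10 in time order, which mirrors the `hour >= 7 AND hour <= 10` query: one lookup of the start key, then a sequential read.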

Page 25: Apache cassandra & apache spark for time series data

Apache Spark

Page 26: Apache cassandra & apache spark for time series data

Apache Spark

• 10x faster on disk, 100x faster in memory than Hadoop MR
• Works out of the box on EMR
• Fault Tolerant Distributed Datasets
• Batch, iterative and streaming analysis
• In Memory Storage and Disk
• Integrates with Most File and Storage Options

Up to 100× faster (2-10× on disk)

2-5× less code

Page 27: Apache cassandra & apache spark for time series data

Spark Components

• Spark Core
• Spark SQL: structured
• Spark Streaming: real-time
• MLlib: machine learning
• GraphX: graph

Page 28: Apache cassandra & apache spark for time series data
Page 29: Apache cassandra & apache spark for time series data

Resilient Distributed Dataset (RDD)

org.apache.spark.rdd.RDD

• Created through transformations on data (map, filter, ...) or other RDDs
• Immutable
• Partitioned
• Reusable

Page 30: Apache cassandra & apache spark for time series data

RDD Operations

• Transformations - Similar to the Scala collections API
    • Produce new RDDs
    • filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract
• Actions
    • Require materialization of the records to generate a value
    • collect: Array[T], count, fold, reduce, ...
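
Because the slide compares transformations to the Scala collections API, a local collection view makes a reasonable stand-in for the transformation/action split. This is only an analogy under that assumption: views are single-machine and have none of an RDD's partitioning or fault tolerance. `LazySketch`, `evaluated`, and `run` are hypothetical names.

```scala
object LazySketch {
  // Counts how many elements the map function has actually processed.
  var evaluated = 0

  // "Transformations": chaining map and filter on a view only describes the work.
  val pipeline = (1 to 5).view
    .map { x => evaluated += 1; x * 2 }
    .filter(_ > 4)

  // "Action": materializing the view forces every element through the pipeline.
  def run(): Vector[Int] = pipeline.toVector
}
```

Before `run()` is called, `evaluated` is still 0; the first call returns `Vector(6, 8, 10)` and leaves `evaluated` at 5. Nothing is cached, so a second call repeats the work.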

Page 31: Apache cassandra & apache spark for time series data

RDD Operations

[Diagram: Transformation, Action]

Page 32: Apache cassandra & apache spark for time series data

Collections and Files To RDD

scala> val distData = sc.parallelize(Seq(1,2,3,4,5))
distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e

val distFile: RDD[String] = sc.textFile("directory/*.txt")
val distFile = sc.textFile("hdfs://namenode:9000/path/file")
val distFile = sc.sequenceFile("hdfs://namenode:9000/path/file")

Page 33: Apache cassandra & apache spark for time series data

Spark and Cassandra

Page 34: Apache cassandra & apache spark for time series data

Spark on Cassandra

• Server-Side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Data locality-aware (speed)
• Data transformation, aggregation, etc.
• Natural Time Series Integration

Page 35: Apache cassandra & apache spark for time series data

Spark Cassandra Connector

• Loads data from Cassandra to Spark
• Writes data from Spark to Cassandra
• Implicit Type Conversions and Object Mapping
• Implemented in Scala (offers a Java API)
• Open Source
• Exposes Cassandra Tables as Spark RDDs + Spark DStreams

Page 36: Apache cassandra & apache spark for time series data

Spark Cassandra Connector

[Diagram components: Cassandra (C*) nodes, Spark Executor, C* Java (Soon Scala) Driver, Spark-Cassandra Connector, User Application]

https://github.com/datastax/spark-cassandra-connector

Page 37: Apache cassandra & apache spark for time series data

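Spark Cassandra Example

Initialization

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder
import com.datastax.spark.connector._            // sc.cassandraTable, saveToCassandra
import com.datastax.spark.connector.streaming._  // saveToCassandra on DStreams

val conf = new SparkConf(loadDefaults = true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .setMaster("spark://127.0.0.1:7077")

val sc = new SparkContext(conf)

CassandraRDD

val table: CassandraRDD[CassandraRow] = sc.cassandraTable("keyspace", "tweets")

Stream Initialization

// kafka.kafkaParams and topic are defined elsewhere in the application
val ssc = new StreamingContext(sc, Seconds(30))
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafka.kafkaParams, Map(topic -> 1), StorageLevel.MEMORY_ONLY)

Transformations and Action

stream.map(_._2).countByValue().saveToCassandra("demo", "wordcount")
ssc.start()
ssc.awaitTermination()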

Page 38: Apache cassandra & apache spark for time series data

Weather Station Analysis

• Weather station collects data
• Cassandra stores in sequence
• Spark rolls up data into new tables

Windsor California July 1, 2014

High: 73.4F Low : 51.4F

Page 39: Apache cassandra & apache spark for time series data

Roll-up table

CREATE TABLE daily_aggregate_temperature (
    wsid text,
    year int,
    month int,
    day int,
    high double,
    low double,
    PRIMARY KEY ((wsid), year, month, day)
);

• Weather Station Id (wsid) is unique
• High and low temp for each day

Page 40: Apache cassandra & apache spark for time series data

Setup connection

def main(args: Array[String]): Unit = {

  // the setMaster("local") lets us run & test the job right in our IDE
  val conf = new SparkConf(true)
    .set("spark.cassandra.connection.host", "127.0.0.1")
    .setMaster("local")

  // "local" here is the master, meaning we don't explicitly have a spark master set up
  val sc = new SparkContext("local", "weather", conf)

  val connector = CassandraConnector(conf)

  val cc = new CassandraSQLContext(sc)
  cc.setKeyspace("isd_weather_data")

Page 41: Apache cassandra & apache spark for time series data

Get data and aggregate

// Create SparkSQL statement
val aggregationSql =
  "SELECT wsid, year, month, day, max(temperature) high, min(temperature) low " +
  "FROM raw_weather_data " +
  "WHERE month = 6 " +
  "GROUP BY wsid, year, month, day;"

val srdd: SchemaRDD = cc.sql(aggregationSql);

val resultSet = srdd.map(row => (
  new daily_aggregate_temperature(
    row.getString(0), row.getInt(1), row.getInt(2), row.getInt(3),
    row.getDouble(4), row.getDouble(5))))
  .collect()

// Case class to store row data
case class daily_aggregate_temperature (wsid: String, year: Int, month: Int, day: Int, high: Double, low: Double)
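
To see what the GROUP BY computes without a cluster, the same roll-up can be written against a plain Scala collection. This is a local sketch, not the Spark job: it assumes each raw reading carries an `hour` and a `temperature` field (the slides do not show the full raw_weather_data schema), and `RollupSketch`, `RawReading`, `DailyAggregate`, and `rollup` are hypothetical names.

```scala
object RollupSketch {
  // Assumed shape of one raw reading; the real raw_weather_data table may have more columns.
  final case class RawReading(wsid: String, year: Int, month: Int, day: Int,
                              hour: Int, temperature: Double)

  // Mirrors the columns of daily_aggregate_temperature.
  final case class DailyAggregate(wsid: String, year: Int, month: Int, day: Int,
                                  high: Double, low: Double)

  // Group by (wsid, year, month, day), then take max and min temperature per group.
  def rollup(readings: Seq[RawReading]): Seq[DailyAggregate] =
    readings
      .groupBy(r => (r.wsid, r.year, r.month, r.day))
      .map { case ((wsid, year, month, day), rs) =>
        val temps = rs.map(_.temperature)
        DailyAggregate(wsid, year, month, day, temps.max, temps.min)
      }
      .toSeq
      .sortBy(a => (a.wsid, a.year, a.month, a.day))
}
```

Feeding in the four readings from the earlier temperature example (hours 7 to 10 on 2005-12-01) yields a single row with high -4.9 and low -5.6.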

Page 42: Apache cassandra & apache spark for time series data

Store back into Cassandra

connector.withSessionDo(session => {
  // Create a single prepared statement
  val prepared = session.prepare(insertStatement)
  val bound = prepared.bind

  // Iterate over result set and bind variables
  for (row <- resultSet) {
    bound.setString("wsid", row.wsid)
    bound.setInt("year", row.year)
    bound.setInt("month", row.month)
    bound.setInt("day", row.day)
    bound.setDouble("high", row.high)
    bound.setDouble("low", row.low)
    // Insert new row in database
    session.execute(bound)
  }
})

Page 43: Apache cassandra & apache spark for time series data

Result

SELECT wsid, year, month, day, high, low
FROM daily_aggregate_temperature
WHERE wsid = '725300:94846' AND year=2012 AND month=9 ;

 wsid         | year | month | day | high | low
--------------+------+-------+-----+------+------
 725300:94846 | 2012 |     9 |  30 | 18.9 | 10.6
 725300:94846 | 2012 |     9 |  29 | 25.6 |  9.4
 725300:94846 | 2012 |     9 |  28 | 19.4 | 11.7
 725300:94846 | 2012 |     9 |  27 | 17.8 |  7.8
 725300:94846 | 2012 |     9 |  26 | 22.2 | 13.3
 725300:94846 | 2012 |     9 |  25 |   25 | 11.1
 725300:94846 | 2012 |     9 |  24 | 21.1 |  4.4
 725300:94846 | 2012 |     9 |  23 | 15.6 |    5
 725300:94846 | 2012 |     9 |  22 |   15 |  7.2
 725300:94846 | 2012 |     9 |  21 | 18.3 |  9.4
 725300:94846 | 2012 |     9 |  20 | 21.7 | 11.7
 725300:94846 | 2012 |     9 |  19 | 22.8 |  5.6
 725300:94846 | 2012 |     9 |  18 | 17.2 |  9.4
 725300:94846 | 2012 |     9 |  17 |   25 | 12.8
 725300:94846 | 2012 |     9 |  16 |   25 | 10.6
 725300:94846 | 2012 |     9 |  15 | 26.1 | 11.1
 725300:94846 | 2012 |     9 |  14 | 23.9 | 11.1
 725300:94846 | 2012 |     9 |  13 | 26.7 | 13.3
 725300:94846 | 2012 |     9 |  12 | 29.4 | 17.2
 725300:94846 | 2012 |     9 |  11 | 28.3 | 11.7
 725300:94846 | 2012 |     9 |  10 | 23.9 | 12.2
 725300:94846 | 2012 |     9 |   9 | 21.7 | 12.8
 725300:94846 | 2012 |     9 |   8 | 22.2 | 12.8
 725300:94846 | 2012 |     9 |   7 | 25.6 | 18.9
 725300:94846 | 2012 |     9 |   6 |   30 | 20.6
 725300:94846 | 2012 |     9 |   5 |   30 | 17.8
 725300:94846 | 2012 |     9 |   4 | 32.2 | 21.7
 725300:94846 | 2012 |     9 |   3 | 30.6 | 21.7
 725300:94846 | 2012 |     9 |   2 | 27.2 | 21.7
 725300:94846 | 2012 |     9 |   1 | 27.2 | 21.7

Page 44: Apache cassandra & apache spark for time series data

What just happened?

• Data is read from raw_weather_data table
• Transformed
• Inserted into the daily_aggregate_temperature table

Table: raw_weather_data → Read data from table → Transform → Insert data into table → Table: daily_aggregate_temperature

Page 45: Apache cassandra & apache spark for time series data

Weather Station Stream Analysis

• Weather station collects data
• Data processed in stream
• Data stored in Cassandra

Windsor California Today

Rainfall total: 1.2cm

High: 73.4F Low : 51.4F

Page 46: Apache cassandra & apache spark for time series data

Spark Versus Spark Streaming

Spark: zillions of bytes
Spark Streaming: gigabytes per second

Page 47: Apache cassandra & apache spark for time series data

Spark Streaming

[Diagram: Spark Streaming; labels include Kinesis and S3]

Page 48: Apache cassandra & apache spark for time series data

DStream - Micro Batches

• Continuous sequence of micro batches
• More complex processing models are possible with less effort
• Streaming computations as a series of deterministic batch computations on small time intervals

[Diagram: DStream = μBatch (ordinary RDD), μBatch (ordinary RDD), μBatch (ordinary RDD)]

Processing of DStream = Processing of μBatches, RDDs
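
The micro-batch idea can be illustrated without Spark by cutting a list of timestamped events into fixed intervals and running an ordinary batch computation on each one. This sketch assumes non-negative millisecond timestamps and an in-memory list rather than a live stream; `MicroBatchSketch`, `Event`, `microBatches`, and `highLowPerBatch` are hypothetical names, not Spark Streaming APIs.

```scala
object MicroBatchSketch {
  final case class Event(timestampMs: Long, temperature: Double)

  // Assign each event to the interval containing it, keyed by the interval's start time.
  def microBatches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / intervalMs * intervalMs)
      .toSeq
      .sortBy(_._1)

  // Run the same deterministic batch computation (high and low) on every micro batch.
  def highLowPerBatch(events: Seq[Event], intervalMs: Long): Seq[(Long, Double, Double)] =
    microBatches(events, intervalMs).map { case (start, batch) =>
      val temps = batch.map(_.temperature)
      (start, temps.max, temps.min)
    }
}
```

With a 5000 ms interval, events at 0, 1000, and 4999 ms fall into the first batch and events at 5000 and 9000 ms into the second; each batch is then processed exactly like a small, ordinary collection.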

Page 49: Apache cassandra & apache spark for time series data

Spark Streaming Reduce Example

val sc = new SparkContext(..)
val ssc = new StreamingContext(sc, Seconds(5))

val stream = TwitterUtils.createStream(ssc, auth, filters, StorageLevel.MEMORY_ONLY_SER_2)

val transform = (cruft: String) =>
  Pattern.findAllIn(cruft).flatMap(_.stripPrefix("#"))

/** Note that Cassandra is doing the sorting for you here. */
stream.flatMap(_.getText.toLowerCase.split("""\s+"""))
  .map(transform)
  .countByValueAndWindow(Seconds(5), Seconds(5))
  .transform((rdd, time) => rdd.map { case (term, count) => (term, count, now(time)) })
  .saveToCassandra(keyspace, suspicious, SomeColumns("suspicious", "count", "timestamp"))

Even Machine Learning!

Page 50: Apache cassandra & apache spark for time series data

Temperature High/Low Stream

[Diagram components: Weather Stations, Receive API, Apache Kafka (Producer, Consumer), TemperatureActor (×3)]

Page 51: Apache cassandra & apache spark for time series data

You can do this at home!

https://github.com/killrweather/killrweather

Page 52: Apache cassandra & apache spark for time series data

Databricks & DataStax

Apache Spark is packaged as part of DataStax Enterprise Analytics 4.5

Databricks & DataStax have partnered for Apache Spark engineering and support

http://www.datastax.com/

Page 53: Apache cassandra & apache spark for time series data

Resources

• Spark Cassandra Connector https://github.com/datastax/spark-cassandra-connector
• Apache Cassandra http://cassandra.apache.org
• Apache Spark http://spark.apache.org
• Apache Kafka http://kafka.apache.org
• Akka http://akka.io

Page 54: Apache cassandra & apache spark for time series data
Page 55: Apache cassandra & apache spark for time series data

FREE tickets to our Annual Cassandra Summit Europe taking place in London in early December (3rd and 4th). The 4th is a full conference day with free admission to all attendees and will feature presentations by companies like ING, Credit Suisse, Target, UBS, The Noble Group as well as other top Cassandra experts in the world. There will be content for those entirely new to Cassandra all the way to the most seasoned Cassandra veteran, spanning development, architecture, and operations as well as how to integrate Cassandra with analytics and search technologies like Apache Spark and Apache Solr.

December 3rd is a paid training day. If you are interested in getting a discount on paid training, please speak with Diego - [email protected]

Page 56: Apache cassandra & apache spark for time series data

Munich Cassandra Users

Join your local Cassandra meetup group:

http://www.meetup.com/Munchen-Cassandra-Users/

© 2014 DataStax, All Rights Reserved.

Page 57: Apache cassandra & apache spark for time series data

Thank you

Follow me on Twitter for more updates: @PatrickMcFadin