Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

©2013 DataStax Confidential. Do not distribute without consent.
@PatrickMcFadin
Patrick McFadin, Chief Evangelist
Data Modeling Time Series

description

You know you need Cassandra for its uptime and scaling, but what about that data model? Let's bridge that gap and get you building your game-changing app. We'll break down topics like storing objects and indexing for fast retrieval. You will see that by understanding a few things about Cassandra internals, you can put your data model in the spotlight. The goal of this talk is to get you comfortable working with data in Cassandra throughout the application lifecycle. What are you waiting for? The cameras are waiting!

Transcript of Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Page 1: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

©2013 DataStax Confidential. Do not distribute without consent.

@PatrickMcFadin

Patrick McFadin, Chief Evangelist

Data Modeling Time Series

Page 2: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Internet Of Things

• 15B devices by 2015
• 40B devices by 2020!

Page 3: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Why Cassandra for Time Series

• Scales
• Resilient
• Good data model
• Efficient Storage Model

What about that?

Page 4: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Example 1: Weather Station

• Weather station collects data
• Cassandra stores in sequence
• Application reads in sequence

Page 5: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Use case

• Store data per weather station
• Store time series in order: first to last

Needed Queries

• Get all data for one weather station
• Get data for a single date and time
• Get data for a range of dates and times

Data Model to support queries

Page 6: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Data Model

• Weather Station Id and Time are unique
• Store as many as needed

CREATE TABLE temperature (
    weatherstation_id text,
    event_time timestamp,
    temperature text,
    PRIMARY KEY (weatherstation_id, event_time)
);

INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES ('1234ABCD','2013-04-03 07:01:00','72F');

INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES ('1234ABCD','2013-04-03 07:02:00','73F');

INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES ('1234ABCD','2013-04-03 07:03:00','73F');

INSERT INTO temperature(weatherstation_id,event_time,temperature)
VALUES ('1234ABCD','2013-04-03 07:04:00','74F');

Page 7: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Storage Model - Logical View

SELECT weatherstation_id,event_time,temperature
FROM temperature
WHERE weatherstation_id='1234ABCD';

weatherstation_id   event_time            temperature
1234ABCD            2013-04-03 07:01:00   72F
1234ABCD            2013-04-03 07:02:00   73F
1234ABCD            2013-04-03 07:03:00   73F
1234ABCD            2013-04-03 07:04:00   74F

Page 8: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Storage Model - Disk Layout

SELECT weatherstation_id,event_time,temperature
FROM temperature
WHERE weatherstation_id='1234ABCD';

Merged, Sorted and Stored Sequentially

Row Key: 1234ABCD
    2013-04-03 07:01:00   72F
    2013-04-03 07:02:00   73F
    2013-04-03 07:03:00   73F
    2013-04-03 07:04:00   74F
    2013-04-03 07:05:00   74F
    2013-04-03 07:06:00   75F
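"Merged, Sorted and Stored Sequentially" can be pictured with a small Python sketch. It is only an illustration, not Cassandra's storage code: the two runs below are hypothetical, already-sorted batches of rows for partition 1234ABCD, and `merge_runs` is a made-up helper name.

```python
import heapq

# Two hypothetical sorted runs for partition '1234ABCD', e.g. written at different times.
older_run = [
    ("2013-04-03 07:01:00", "72F"),
    ("2013-04-03 07:03:00", "73F"),
    ("2013-04-03 07:05:00", "74F"),
]
newer_run = [
    ("2013-04-03 07:02:00", "73F"),
    ("2013-04-03 07:04:00", "74F"),
    ("2013-04-03 07:06:00", "75F"),
]


def merge_runs(*runs):
    """Merge already-sorted runs into one sequence ordered by event_time."""
    # heapq.merge streams the inputs without re-sorting them from scratch.
    return list(heapq.merge(*runs))


merged = merge_runs(older_run, newer_run)
for event_time, temperature in merged:
    print(event_time, temperature)
```

Because each run is already ordered by event_time, the merged result is sequential and can be read front to back.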

Page 9: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Query patterns

• Range queries
• “Slice” operation on disk

SELECT weatherstation_id,event_time,temperature
FROM temperature
WHERE weatherstation_id='1234ABCD'
AND event_time >= '2013-04-03 07:01:00'
AND event_time <= '2013-04-03 07:04:00';

Row Key: 1234ABCD
    2013-04-03 07:01:00   72F
    2013-04-03 07:02:00   73F
    2013-04-03 07:03:00   73F
    2013-04-03 07:04:00   74F
    2013-04-03 07:05:00   74F
    2013-04-03 07:06:00   75F

Single seek on disk
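The slice idea can be sketched in Python under one assumption: the rows of a partition sit in one list sorted by event_time. `slice_rows` is a hypothetical helper, and binary search stands in for the single seek. The inclusive bounds mirror the `>=` and `<=` in the query.

```python
from bisect import bisect_left, bisect_right

# Rows of partition '1234ABCD', kept sorted by event_time.
rows = [
    ("2013-04-03 07:01:00", "72F"),
    ("2013-04-03 07:02:00", "73F"),
    ("2013-04-03 07:03:00", "73F"),
    ("2013-04-03 07:04:00", "74F"),
    ("2013-04-03 07:05:00", "74F"),
    ("2013-04-03 07:06:00", "75F"),
]


def slice_rows(sorted_rows, start, end):
    """Return the contiguous run of rows with start <= event_time <= end."""
    keys = [event_time for event_time, _ in sorted_rows]
    lo = bisect_left(keys, start)   # first row at or after start
    hi = bisect_right(keys, end)    # one past the last row at or before end
    return sorted_rows[lo:hi]


result = slice_rows(rows, "2013-04-03 07:01:00", "2013-04-03 07:04:00")
for event_time, temperature in result:
    print(event_time, temperature)
```

The result is one contiguous block of the sorted data, which is why a range over the clustering column is cheap.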

Page 10: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Query patterns

• Range queries
• “Slice” operation on disk

SELECT weatherstation_id,event_time,temperature
FROM temperature
WHERE weatherstation_id='1234ABCD'
AND event_time >= '2013-04-03 07:01:00'
AND event_time <= '2013-04-03 07:04:00';

Programmers like this

weatherstation_id   event_time            temperature
1234ABCD            2013-04-03 07:01:00   72F
1234ABCD            2013-04-03 07:02:00   73F
1234ABCD            2013-04-03 07:03:00   73F
1234ABCD            2013-04-03 07:04:00   74F

Sorted by event_time

Page 11: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Additional help on the storage engine

Page 12: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

SSTable seeks

• Each read minimum 1 seek
• Cache and bloom filter help minimize

Total seek time = Disk Latency * number of seeks
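A worked instance of the formula, as a short Python sketch. The 10 ms latency is an assumed figure chosen only to make the arithmetic visible; it is not a number from the talk.

```python
def total_seek_time_ms(disk_latency_ms, number_of_seeks):
    """Total seek time = Disk Latency * number of seeks."""
    return disk_latency_ms * number_of_seeks


ASSUMED_LATENCY_MS = 10  # hypothetical disk latency, for illustration only

one_sstable = total_seek_time_ms(ASSUMED_LATENCY_MS, 1)
five_sstables = total_seek_time_ms(ASSUMED_LATENCY_MS, 5)
print(one_sstable, five_sstables)
```

Under that assumption, a read touching five SSTables spends five times as long seeking as a read served from one.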

Page 13: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

The key to speed

Use the first part of the primary key to get the node (data localization)

Minimize seeks for SSTables (Key Cache, Bloom Filter)

Find the data fast in the SSTable (Indexes)

Page 14: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Min/Max Value Hint

• New since 2.0
• Range index on primary key values per SSTable
• Minimizes seeks on range data

CASSANDRA-5514 if you are interested in details

SELECT event_time,temperature
FROM temperature
WHERE weatherstation_id='1234ABCD'
AND event_time > '2013-04-03 07:01:00'
AND event_time < '2013-04-03 07:04:00';

Row Key: 1234ABCD Min event_time: 2013-04-01 00:00:00 Max event_time: 2013-04-04 23:59:59 (This one)

Row Key: 1234ABCD Min event_time: 2013-04-05 00:00:00 Max event_time: 2013-04-09 23:59:59

Row Key: 1234ABCD Min event_time: 2013-03-27 00:00:00 Max event_time: 2013-03-31 23:59:59
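The pruning idea can be sketched as an interval-overlap check in Python. This is an illustration, not the CASSANDRA-5514 implementation: the names `sstable-1` to `sstable-3` and the helper `candidates` are hypothetical, and the min/max values are the three shown on the slide.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def ts(text):
    """Parse a timestamp in the format used on the slides."""
    return datetime.strptime(text, FMT)


# Min/max event_time per SSTable for row key 1234ABCD (names are made up).
sstables = [
    {"name": "sstable-1", "min": ts("2013-04-01 00:00:00"), "max": ts("2013-04-04 23:59:59")},
    {"name": "sstable-2", "min": ts("2013-04-05 00:00:00"), "max": ts("2013-04-09 23:59:59")},
    {"name": "sstable-3", "min": ts("2013-03-27 00:00:00"), "max": ts("2013-03-31 23:59:59")},
]


def candidates(tables, start, end):
    """Keep only SSTables whose [min, max] range can contain start < event_time < end."""
    return [t["name"] for t in tables if t["max"] > start and t["min"] < end]


hits = candidates(sstables, ts("2013-04-03 07:01:00"), ts("2013-04-03 07:04:00"))
print(hits)
```

Only the SSTable covering 2013-04-01 through 2013-04-04 overlaps the queried range, so the other two never need to be read.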

Page 15: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Ingestion models

• Apache Kafka
• Apache Flume
• Storm
• Custom Applications

Apache Kafka

Your totally killer application

Page 16: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Kafka + Storm

• Kafka provides reliable queuing
• Storm processes (rollups, counts)
• Cassandra stores at the same speed
• Storm lookup on Cassandra

Queue: Apache Kafka
Process: Apache Storm
Store: Cassandra

Page 17: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Flume

• Source accepts data
• Channel buffers data
• Sink processes and stores
• Popular for log processing

Application, Load Balancer, Syslog -> Source -> Channel -> Sink

Page 18: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Dealing with data at speed

• 1 million writes per second?
• 1 insert every microsecond
• Collisions?

• Primary Key determines node placement
• Random partitioning
• Special data type - TimeUUID

Your totally killer application

weatherstation_id='1234ABCD'

weatherstation_id='5678EFGH'
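The collision concern can be illustrated with a rough Python sketch. It rests on two assumptions: a plain dict stands in for a table keyed by its primary key, and Python's `uuid.uuid1()` stands in for a TimeUUID generator. Two writes with the same station and timestamp collapse into one row, while time-based UUID keys keep both.

```python
import uuid

# Keyed by (weatherstation_id, event_time): the second write replaces the first.
by_timestamp = {}
by_timestamp[("1234ABCD", "2013-04-03 07:01:00")] = "72F"
by_timestamp[("1234ABCD", "2013-04-03 07:01:00")] = "73F"

# Keyed by (weatherstation_id, time-based UUID): each write gets its own key.
by_timeuuid = {}
by_timeuuid[("1234ABCD", uuid.uuid1())] = "72F"
by_timeuuid[("1234ABCD", uuid.uuid1())] = "73F"

print(len(by_timestamp), len(by_timeuuid))
```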

Page 19: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

How does data replicate?

Page 20: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Partitioning

Primary key determines placement*

jim      age: 36   car: camaro   gender: M
carol    age: 37   car: subaru   gender: F
johnny   age: 12                 gender: M
suzy     age: 10                 gender: F

Page 21: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Key Hashing

PK       MD5 Hash
jim      5e02739678...
carol    a9a0198010...
johnny   f4eb27cea7...
suzy     78b421309e...

MD5* hash operation yields a 128-bit number for keys of any size.
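A minimal Python sketch of the hashing step, using the standard `hashlib` module. `md5_token` is a hypothetical helper, and it treats the digest as a plain unsigned 128-bit integer, which is a simplification and not necessarily how a Cassandra partitioner derives its token.

```python
import hashlib


def md5_token(key):
    """Hash a key of any length to a number in the range 0 .. 2**128 - 1."""
    digest = hashlib.md5(key.encode("utf-8")).digest()  # always 16 bytes
    return int.from_bytes(digest, "big")


tokens = {name: md5_token(name) for name in ("jim", "carol", "johnny", "suzy")}
for name, token in tokens.items():
    print(name, format(token, "032x"))
```

Whatever the key length, the output is a fixed-size 128-bit value, and the same key always produces the same value.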

Page 22: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Node A

Node D Node C

Node B

The Token Ring

Page 23: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

PK       MD5 Hash
jim      5e02739678...
carol    a9a0198010...
johnny   f4eb27cea7...
suzy     78b421309e...

Node   Start             End
A      0xc000000000..1   0x0000000000..0
B      0x0000000000..1   0x4000000000..0
C      0x4000000000..1   0x8000000000..0
D      0x8000000000..1   0xc000000000..0
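Looking up the owning node from this table can be sketched in Python. It assumes the elided boundaries split the 128-bit token space into four equal quarters with inclusive range ends. `node_for` and `token_with_leading_byte` are hypothetical helpers. Since the slide shows only hash prefixes, the sketch uses just the leading byte of each hash, which is enough to pick a quarter.

```python
from bisect import bisect_left

QUARTER = 1 << 126  # one quarter of the 2**128 token space (assumed boundaries)

# Inclusive end of each node's range: A ends at 0, B at 0x40.., C at 0x80.., D at 0xc0..
RANGE_ENDS = [0, QUARTER, 2 * QUARTER, 3 * QUARTER]
OWNERS = ["A", "B", "C", "D"]


def node_for(token):
    """Return the node whose (start, end] range contains the token."""
    idx = bisect_left(RANGE_ENDS, token)
    # Tokens above D's end wrap around the ring to node A.
    return OWNERS[idx % len(OWNERS)]


def token_with_leading_byte(byte):
    """Smallest 128-bit token starting with the given byte (illustration only)."""
    return byte << 120


placement = {
    "jim": node_for(token_with_leading_byte(0x5E)),
    "carol": node_for(token_with_leading_byte(0xA9)),
    "johnny": node_for(token_with_leading_byte(0xF4)),
    "suzy": node_for(token_with_leading_byte(0x78)),
}
print(placement)
```

Under these assumptions jim and suzy land on node C, carol on node D, and johnny wraps around to node A.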

Page 28: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Node A

Node D Node C

Node B

carol a9a0198010...

Replication

Page 30: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Node A

Node D Node C

Node B

carol a9a0198010...

Replication

Replication factor = 3

Consistency is a different topic for later
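One simple way to picture replication factor 3 is a Python sketch that copies the row to the owning node and then to the next nodes around the ring. The ring order A, B, C, D and the helper `replicas` are assumptions for illustration. Real placement depends on the replication strategy configured for the keyspace.

```python
RING = ["A", "B", "C", "D"]  # assumed ring order, by increasing token


def replicas(primary, replication_factor=3):
    """Primary node plus the next nodes around the ring, wrapping at the end."""
    start = RING.index(primary)
    return [RING[(start + i) % len(RING)] for i in range(replication_factor)]


# carol's hash (a9a0198010...) falls in node D's range in the token table.
carol_replicas = replicas("D")
print(carol_replicas)
```

With four nodes and three copies, every row lives on all but one node.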

Page 31: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

TimeUUID

• Also known as a Version 1 UUID
• Sortable
• Reversible

Timestamp to Microsecond + UUID = TimeUUID

04d580b0-9412-11e3-baa8-0800200c9a66 = Wednesday, February 12, 2014 6:18:06 PM GMT

http://www.famkruithof.net/uuid/uuidgen
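"Reversible" means the timestamp can be read back out of the UUID. Below is a small Python sketch using only the standard `uuid` and `datetime` modules. `timeuuid_to_datetime` is a hypothetical helper name. The constant is the offset between the UUID epoch (1582-10-15) and the Unix epoch, counted in 100-nanosecond units.

```python
import uuid
from datetime import datetime, timedelta, timezone

# 100-nanosecond intervals between 1582-10-15 and 1970-01-01.
GREGORIAN_TO_UNIX_100NS = 0x01B21DD213814000


def timeuuid_to_datetime(value):
    """Recover the embedded UTC timestamp from a version 1 UUID string."""
    parsed = uuid.UUID(value)
    if parsed.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    micros = (parsed.time - GREGORIAN_TO_UNIX_100NS) // 10
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=micros)


when = timeuuid_to_datetime("04d580b0-9412-11e3-baa8-0800200c9a66")
print(when.isoformat())
```

For the UUID on the slide this recovers February 12, 2014, 18:18:06 GMT, matching the date shown.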

Page 32: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Example 2: Financial Transactions

• Trading of stocks
• When did they happen?
• Massive speeds and volumes

“Sirca, a non-profit university consortium based in Sydney, is the world’s biggest broker of financial data, ingesting into its database 2 million pieces of information a second from every major trading exchange.”*

* http://www.theage.com.au/it-pro/business-it/help-poverty-theres-an-app-for-that-20140120-hv948.html

Page 33: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Use case

• Store data per symbol and date
• Store time series in reverse order: last to first
• Make sure every transaction is unique

Needed Queries

• Get all trades for symbol and day
• Get trade for a single date and time
• Get last 10 trades for symbol and date

Data Model to support queries

Page 34: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Data Model

• date is int of days since epoch
• timeuuid keeps it unique
• Reverse the times for later queries

CREATE TABLE stock_ticks (
    symbol text,
    date int,
    trade timeuuid,
    trade_details text,
    PRIMARY KEY ((symbol, date), trade)
) WITH CLUSTERING ORDER BY (trade DESC);

INSERT INTO stock_ticks(symbol, date, trade, trade_details)
VALUES ('NFLX',340,04d580b0-1431-1e33-baf8-0833200c98a6,'BUY:2000');

INSERT INTO stock_ticks(symbol, date, trade, trade_details)
VALUES ('NFLX',340,05d580b0-6472-1ef3-a3a8-0430200c9a66,'BUY:300');

INSERT INTO stock_ticks(symbol, date, trade, trade_details)
VALUES ('NFLX',340,02d580b0-9412-d223-55a8-0976200c9a25,'SELL:450');

INSERT INTO stock_ticks(symbol, date, trade, trade_details)
VALUES ('NFLX',340,08d580b0-4482-11e3-5fd3-3421200c9a65,'SELL:3000');
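The "days since epoch" bucket can be computed on the client before writing. The Python sketch below assumes "epoch" means the Unix epoch (1970-01-01), which the slide does not state. `days_since_epoch` and `partition_key` are hypothetical helpers.

```python
from datetime import date

EPOCH = date(1970, 1, 1)  # assumption: Unix epoch


def days_since_epoch(day):
    """Whole days between the epoch and the given calendar date."""
    return (day - EPOCH).days


def partition_key(symbol, day):
    """Composite partition key (symbol, date) as used by stock_ticks."""
    return (symbol, days_since_epoch(day))


print(partition_key("NFLX", date(2014, 2, 12)))
```

Bucketing by symbol and day keeps each partition bounded to one trading day instead of growing without limit.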

Page 35: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Storage Model - Logical View

SELECT trade,trade_details
FROM stock_ticks
WHERE symbol='NFLX' AND date=340;

symbol:date   trade                                  trade_details
NFLX:340      08d580b0-4482-11e3-5fd3-3421200c9a65   SELL:3000
NFLX:340      02d580b0-9412-d223-55a8-0976200c9a25   SELL:450
NFLX:340      05d580b0-6472-1ef3-a3a8-0430200c9a66   BUY:300
NFLX:340      04d580b0-1431-1e33-baf8-0833200c98a6   BUY:2000

Page 36: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Storage Model - Disk Layout

SELECT trade,trade_details
FROM stock_ticks
WHERE symbol='NFLX' AND date=340;

Order is from last trade to first

Row Key: NFLX:340
    08d580b0-4482-11e3-5fd3-3421200c9a65   SELL:3000
    02d580b0-9412-d223-55a8-0976200c9a25   SELL:450
    05d580b0-6472-1ef3-a3a8-0430200c9a66   BUY:300
    04d580b0-1431-1e33-baf8-0833200c98a6   BUY:2000

Page 37: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Query patterns

• Limit queries
• Get last X trades

SELECT trade,trade_details
FROM stock_ticks
WHERE symbol='NFLX' AND date=340
LIMIT 3;

Row Key: NFLX:340
    08d580b0-4482-11e3-5fd3-3421200c9a65   SELL:3000   (From here)
    02d580b0-9412-d223-55a8-0976200c9a25   SELL:450
    05d580b0-6472-1ef3-a3a8-0430200c9a66   BUY:300     (to here)
    04d580b0-1431-1e33-baf8-0833200c98a6   BUY:2000

Page 38: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Query patterns

• Limit queries
• Get last X trades

SELECT trade,trade_details
FROM stock_ticks
WHERE symbol='NFLX' AND date=340
LIMIT 3;

Reverse sorted by trade
Last 3 trades

symbol:date   trade                                  trade_details
NFLX:340      08d580b0-4482-11e3-5fd3-3421200c9a65   SELL:3000
NFLX:340      02d580b0-9412-d223-55a8-0976200c9a25   SELL:450
NFLX:340      05d580b0-6472-1ef3-a3a8-0430200c9a66   BUY:300
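Why reverse clustering order makes "last X trades" cheap can be shown with a small in-memory Python sketch. The `Partition` class is hypothetical, and the integer trade times 1 to 4 are stand-ins for the timeuuid values. Rows are kept newest-first, so a LIMIT is just a read from the front.

```python
from bisect import insort


class Partition:
    """Rows of one (symbol, date) partition, kept newest-first."""

    def __init__(self):
        # Store the negated time so ascending sort order means newest first.
        self._rows = []

    def insert(self, trade_time, details):
        insort(self._rows, (-trade_time, details))

    def last(self, limit):
        """Equivalent of SELECT ... LIMIT n on a DESC-clustered partition."""
        return [(-neg_time, details) for neg_time, details in self._rows[:limit]]


nflx_340 = Partition()
nflx_340.insert(1, "BUY:2000")
nflx_340.insert(2, "BUY:300")
nflx_340.insert(3, "SELL:450")
nflx_340.insert(4, "SELL:3000")

last_three = nflx_340.last(3)
print(last_three)
```

The sort cost is paid once at write time, so reading the most recent trades needs no sorting at query time.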

Page 39: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Way more examples

• 5 minute interviews
• Use cases
• Free training!

www.planetcassandra.org

Page 40: Cassandra Day SV 2014: Fundamentals of Apache Cassandra Data Modeling

Thank You!

Follow me for more updates all the time: @PatrickMcFadin