Cassandra Data Modelling with CQL (OSCON 2015)

D ATA M O D E L I N G C A S S A N D R A U S I N G C Q L 3

M I K E B I G L A N & E L I J A H H A M O V I T Z

A B O U T U S

Mike Biglan, M.S.

• Twenty Ideas, Inc – twentyideas.com

• Analytic Spot – spo.td

Elijah Hamovitz

• code.org

• Analytic Spot – spo.td

http://code.org

S I M P L I F I E D R E P R E S E N TAT I O N O F R E A L I T Y

A D ATA M O D E L M O D E L S D ATA

• Framework to store and organize data

• Models things, their differences, and relationships between them

• Things can be real or virtual

I D E A L D ATA M O D E L P R O P E R T I E S

• Easy to create

• Easy to interface with

• Quick, flexible querying

• Writing is direct and simple

• Easily understandable

• Scalable: can read, write, and store huge amount of data safely

S C A L E A W AY, S C A L E A W AY, S C A L E A W AY

• Availability: fault tolerance, redundancy, supports multiple data centers

• Consistency: strong or tunable

• Huge amounts of data (that won’t fit on single server)

• High-speed of incoming and/or accessed data

C O L U M N A R D ATA B A S E

• Document-Store (e.g. MongoDB) and Columnar (e.g. Cassandra, HBase,

Dynamo) are both “NoSQL”

• But modeling in Document-Store is quite different than Columnar

• Atomic unit of data storage

• Document-Store: document

• Relational Database: row

• Columnar: column

C A S S A N D R A C O L U M N : - N A M E - V A L U E ( O R T O M B S T O N E ) - T I M E S TA M P - T T L

C A S S A N D R A

Highly Available Distributed Columnar Datastore that’s:

• Near-linearly scalable

• Fault tolerant, no master

• Tunable consistency

• Performant, especially for writes – don’t read before write

G E T T I N G O N T H E S C A L E

Phase 1 Install Cassandra

Phase 2

Phase 3 Scale!

CQL != SQL

S Q L / C Q L S TAT E M E N T F L O W

• SQL execution is complex

• CQL execution is relatively simple, hence tiny subset of syntax

• Much of CQL Query complexity is which node(s) to fetch/write/confirm data from/to

• So denormalize!Relational Database Cassandra

SQL Statement CQL Statement

Syntax/Semantic Check

Query Plan & Optimization

Result

Data Store Data Store

Syntax/Semantic Check

Query Execution Query Execution

Result

C Q L ! = S Q L

SELECT <col1>, <col2>, … FROM <table> LEFT JOIN <table2>… WHERE <where-clause> GROUP BY <colx> HAVING … ORDER BY <order-clause>

S E V E R E LY L I M I T E D

S E V E R E LY L I M I T E D

• CQL syntax is small subset of SQL

mechanics.flite.com/blog/2013/11/05/breaking-down-the-cql-where-clause/

http://mechanics.flite.com/blog/2013/11/05/breaking-down-the-cql-where-clause/

S O W H Y T H E L I M I TAT I O N S ?

Thinking of Cassandra as a relational database, it’s hard to understand:

• what is easy

• what is hard

• what is impossible

“Language serves not only to express thought but to make possible thoughts which could not exist without it.”

— Bertrand Russell

T H E D I S T O R T I O N O F C Q L

• Broken mental model hinders optimal modeling

• CQL falsely implies a relational data model and the design patterns that go with it

• To model Cassandra well, know the underlying data structure

D ATA M O D E L I N G I N S Q L ( N O S H A R D I N G )

1. What are the Data?

2. What is the normalized data model?

… months pass …

3. How are the data going to be queried?

4. Optimize any slow areas and/or bottlenecks

• Add indexes, memcached/redis, sphinx/solr/elasticsearch, etc

D ATA M O D E L I N G I N C Q L

1. What are the data?

2. What read-queries are needed?

3. How to denormalize during writes?

• on initial write, or use external tools to make this sane

(Some) “premature” optimization is inherent and unavoidable

D ATA E C O S Y S T E M

To fully (and efficiently) enable everything SQL you are used to, must rely on the big(ish) data ecosystem:

• ElasticSearch, Solr, Sphinx

• Redis, Memcached

• Spark

• Spark Streaming or Storm (and Kafka)

C O M P L E X I T Y O F I N I T I A L D ATA M O D E L

• Modeling with Relational DB

• Items & their relationships

• Modeling with Cassandra

• Items & their relationships

• How/Where they are stored (sharding and hot spots)

• What data we want to read

• How (and how often) we write data into those models

C A N O P T I M I Z E L AT E R

T O M O D E L , O P E N T H E B L A C K B O X

• Goal of a good black box is you can do a lot without knowing much about what’s inside

• CQL DOES NOT allow you to ignore what’s inside Cassandra

I N T H E B E G I N N I N G : T H R I F T

• Around Cassandra version 0.8, Thrift started getting replaced with CQL

• Thrift too low-level, but the interface had a close mapping to the underlying Cassandra data structure

T H R I F T & C Q L T E R M I N O L O G Y

T H R I F T C Q L

C O L U M N FA M I LY TA B L E

R O W PA R T I T I O N

C O L U M N C E L L

[ C E L L N A M E C O M P O N E N T O R VA L U E ] C O L U M N

[ G R O U P O F C E L L S W I T H S H A R E D C O M P O N E N T P R E F I X E S ]

R O W

www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows

http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows

C Q L TA B L E

CREATE TABLE employees ( company text, name text, age int, role text, PRIMARY KEY ((company), name) );

V S H O W I T ’ S S T O R E D

employees = { "Foo, inc" : { "Fred:age" : 31, "Fred:role" : "coder", "Sara:age" : 39, "Sara:role" : "boss" }, "BarCo" : { "Bill:age" : 50, "Bill:role" : "SQL guru" "Jane:age" : 20, "Jane:role" : "hotshot", } }


W I D E PA R T I T I O N S ( F O R M E R LY W I D E “ R O W S ” )

www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 51) www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows

• Based on Thrift “rows”, so actually wide “partitions”

• Columns are the clustering key values with the column-name suffix

• Up to 2 billion (but don’t do this)

http://www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure

http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows

S E T S , M A P S , A N D L I S T S : O H M Y

• Sets/Maps/List still column-level storage

• Enabling Schemaless, but can result in long column names

www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 52-71)

http://www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure

PA R T I T I O N A N D C L U S T E R I N G K E Y

CQL Where-Clause variants

• None (kind of)

• key1 & key2

• key1 & key2 & key3

• key1 & key2 & key3 & key4

company | year | month | day | employee | reason ------------------------------------------------- Foo, Inc | 2014 | 11 | 27 | Fred | Thanksgiving Foo, Inc | 2014 | 12 | 25 | Fred | Christmas Foo, Inc | 2014 | 12 | 25 | Sara | Christmas Foo, Inc | 2014 | 12 | 26 | Sara | Boxing day

CREATE TABLE breaks ( company text, year int, month int, day int, employee text, reason text, PRIMARY KEY ((company, year), month, day, employee) );

C O M P O S I T E K E Y S

C O M P O S I T E K E Y S

CREATE TABLE breaks ( company text, year int, month int, day int, employee text, reason text, PRIMARY KEY ((company, year), month, day, employee) );

breaks = { "Foo, Inc:2014" : { "11" : { "27" : { "Fred:reason" : "Thanksgiving" } }, "12" : { "25" : { "Fred:reason" : "Christmas", "Sara:reason" : "Christmas" }, "26" : { "Sara:reason" : "Boxing day" } } } }

T H E N W H AT I S E A S Y ?

With a dictionary of ordered dictionaries:

• Grabbing the data (or subset) from a partition-key

• Getting a slice of data (uses linear

search) based on a partition-key

breaks = { "Foo, Inc:2014" : { "11" : { "27" : { "Fred:reason" : "Thanksgiving" } }, "12" : { "25" : { "Fred:reason" : "Christmas", "Sara:reason" : "Christmas" }, "26" : { "Sara:reason" : "Boxing day" } } } }

W H AT I S H A R D ? ( I . E . C O M M O N S Q L PAT T E R N S )

• Unique and Group by

• Ordered

• Inverted Index

G R O U P - B Y & C O U N T E R S

• Often group-by is used for counting

• Use counter columns or other tools (e.g. elasticsearch)

CREATE TABLE employee_break_counts ( company text, employee text, break_counts counter, PRIMARY KEY ((company), employee) );

O R D E R I N G O R I N V E R T E D - I N D E X

• Redundant table, but ordered by new “column”

• Depending on needs, this can store the order-field and lookup key OR some/all of the other data in that table

• If a read will generate more than a few subsequent child reads then some/all the other data should be included

CREATE TABLE employees_by_age ( company text, id int, age int, name text, role text, PRIMARY KEY ((company), age, id) );

C * M O D E L I N G A N T I - PAT T E R N S

C * G U I D E L I N E S S Q L G U I D E L I N E S

W R I T E S A R E C H E A P / FA S T M I N I M I Z E W R I T E S

S T O R A G E I S C H E A P M I N I M I Z E D U P L I C AT I O N O F D ATA

PA R T I T I O N S A R E I N H E R E N T S H A R D AT Y O U R O W N R I S K

S T R I C T C O M P O S I T E K E Y S F L E X I B L E S E C O N D A R Y I N D E X E S

S I M P L E Q U E R I E S C O M P L E X Q U E R I E S

C * M O D E L I N G PAT T E R N S

C * G U I D E L I N E S C * PAT T E R N S

W R I T E S A R E C H E A P / FA S T

D U P L I C AT E Y O U R D ATA

S T O R A G E I S C H E A P

PA R T I T I O N S A R E I N H E R E N T AV O I D H O T S P O T S

S T R I C T C O M P O S I T E K E Y S

D E S I G N TA B L E S A R O U N D Q U E R I E S

S I M P L E Q U E R I E S

Questions?

Mike Biglan [email protected]

@twentyideas

Elijah Hamovitz [email protected]

mailto:[email protected]

mailto:[email protected]

Cassandra Data Modelling with CQL (OSCON 2015)

Technology

Transcript of Cassandra Data Modelling with CQL (OSCON 2015)