Cassandra, Modeling and Availability at AMUG

Conceptual Modeling Differences From A RDBMS Matthew F. Dennis, DataStax // @mdennis Austin MySQL User Group January 11, 2012


A brief, high-level comparison of data modeling in relational databases and in Cassandra, followed by a brief description of how Cassandra achieves global availability.


Page 1: Cassandra, Modeling and Availability at AMUG

Conceptual Modeling Differences From A RDBMS

Matthew F. Dennis, DataStax // @mdennis

Austin MySQL User Group

January 11, 2012

Page 2: Cassandra, Modeling and Availability at AMUG

Cassandra Is Not Relational

Get out of the relational mindset when working with Cassandra (or really any NoSQL DB)

Page 3: Cassandra, Modeling and Availability at AMUG

Work Backwards From Queries

Think in terms of queries, not in terms of normalizing the data; in fact, you often want to denormalize (already common in the data warehousing world, even in RDBMS)

Page 4: Cassandra, Modeling and Availability at AMUG

OK great, but how do I do that?

Well, you need to know how Cassandra models data (its model follows Google Bigtable)

research.google.com/archive/bigtable-osdi06.pdf

Go Read It!

Page 5: Cassandra, Modeling and Availability at AMUG

In Cassandra:

➔data is organized into Keyspaces (usually one per app)

➔each Keyspace can have multiple Column Families

➔each Column Family can have many Rows

➔each Row has a Row Key and a variable number of Columns

➔each Column consists of a Name, Value and Timestamp
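A minimal sketch of that hierarchy as nested Python dicts, purely to make the nesting concrete. This is not a Cassandra API; the `insert` helper and the `amug` / `people` names are hypothetical, and the rows are borrowed from the example rows later in the deck.

```python
import time

# Hypothetical in-memory picture of the hierarchy (not a Cassandra API):
# keyspace -> column family -> row key -> column name -> (value, timestamp)
cluster = {}

def insert(keyspace, column_family, row_key, name, value, timestamp=None):
    """Store one column; every column carries a name, a value and a timestamp."""
    if timestamp is None:
        timestamp = time.time_ns() // 1000  # microseconds since the epoch
    row = (cluster.setdefault(keyspace, {})
                  .setdefault(column_family, {})
                  .setdefault(row_key, {}))
    row[name] = (value, timestamp)

insert("amug", "people", "thepaul", "office", "Austin")
insert("amug", "people", "thepaul", "OS", "OSX")
insert("amug", "people", "thobbs", "office", "Austin")

# Rows in the same column family need not have the same set of columns.
columns_per_row = {rk: sorted(cols) for rk, cols in cluster["amug"]["people"].items()}
```

Note that `thobbs` simply has no `OS` column; nothing like a NULL placeholder is stored for it.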

Page 6: Cassandra, Modeling and Availability at AMUG

In Cassandra, Keyspaces:

➔are similar in concept to a “database” in some RDBMSs

➔are stored in separate directories on disk

➔are usually one-to-one with applications

➔are usually the administrative unit for things related to ops

➔contain multiple column families

Page 7: Cassandra, Modeling and Availability at AMUG

In Cassandra, In Keyspaces, Column Families:

➔are similar in concept to a “table” in most RDBMSs

➔are stored in separate files on disk (many per CF)

➔are usually approximately one-to-one with query type

➔are usually the administrative unit for things related to your data

➔can contain many (~billion* per node) rows

* for a good sized node (you can always add nodes)

Page 8: Cassandra, Modeling and Availability at AMUG

In Cassandra, In Keyspaces, In Column Families ...

Page 9: Cassandra, Modeling and Availability at AMUG

Rows

thepaul office: Austin OS: OSX twitter: thepaul0

mdennis office: UA OS: Linux twitter: mdennis

thobbs office: Austin twitter: tylhobbs

Row Keys

Page 10: Cassandra, Modeling and Availability at AMUG

thepaul office: Austin OS: OSX twitter: thepaul0

mdennis office: UA OS: Linux twitter: mdennis

thobbs office: Austin twitter: tylhobbs

Columns

Page 11: Cassandra, Modeling and Availability at AMUG

Column Names

thepaul office: Austin OS: OSX twitter: thepaul0

mdennis office: UA OS: Linux twitter: mdennis

thobbs office: Austin twitter: tylhobbs

Page 12: Cassandra, Modeling and Availability at AMUG

Column Values

thepaul office: Austin OS: OSX twitter: thepaul0

mdennis office: UA OS: Linux twitter: mdennis

thobbs office: Austin twitter: tylhobbs

Page 13: Cassandra, Modeling and Availability at AMUG

thepaul office: Austin OS: OSX twitter: thepaul0

mdennis office: UA OS: Linux twitter: mdennis

thobbs office: Austin twitter: tylhobbs

Rows Are Randomly Ordered (if using the RandomPartitioner)

Page 14: Cassandra, Modeling and Availability at AMUG

thepaul office: Austin OS: OSX twitter: thepaul0

mdennis office: UA OS: Linux twitter: mdennis

thobbs office: Austin twitter: tylhobbs

Columns Are Ordered by Name (by a configurable comparator)

Page 15: Cassandra, Modeling and Availability at AMUG

Columns are ordered because doing so allows very efficient implementations of useful and common operations (e.g. merge joins)

Page 16: Cassandra, Modeling and Availability at AMUG

In particular, within a row I can find given columns by name very quickly (ordered names => O(log n) binary search).
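As a rough illustration of why sorted names make point lookups cheap, here is a sketch using Python's standard `bisect` module. The `find_column` helper and the two parallel lists are assumptions for illustration; they say nothing about Cassandra's actual on-disk format.

```python
from bisect import bisect_left

def find_column(names, values, wanted):
    """Binary search over names kept in sorted order: O(log n) comparisons."""
    i = bisect_left(names, wanted)
    if i < len(names) and names[i] == wanted:
        return values[i]
    return None  # no such column in this row

# One row's columns, with names already in comparator order.
names = ["OS", "office", "twitter"]
values = ["Linux", "UA", "mdennis"]
```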

Page 17: Cassandra, Modeling and Availability at AMUG

More importantly, I can query for a slice between a start and end

RK ts0 ts1 ... ... tsM ... ... ... ... tsN ... ... ... ... ...

start end

Row Key
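The slice in the diagram can be sketched the same way: two binary searches find the start and end positions, and everything between them is contiguous. The `get_slice` helper is hypothetical, the integers stand in for real timestamps or TimeUUIDs, and the prices are made-up sample values.

```python
from bisect import bisect_left, bisect_right

def get_slice(names, values, start, end):
    """Return the (name, value) pairs with start <= name <= end."""
    lo = bisect_left(names, start)   # first column at or after start
    hi = bisect_right(names, end)    # one past the last column at or before end
    return list(zip(names[lo:hi], values[lo:hi]))

# Column names are timestamps in sorted order; values are illustrative prices.
timestamps = [0, 1, 2, 5, 9, 12]
prices = [25.20, 25.25, 25.31, 25.28, 25.40, 25.37]

window = get_slice(timestamps, prices, 2, 9)
```

The cost is two O(log n) searches plus the size of the result, regardless of how many columns sit outside the window.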

Page 18: Cassandra, Modeling and Availability at AMUG

Why does that matter?

Because columns within a row aren't static!

Page 19: Cassandra, Modeling and Availability at AMUG

INTC ts0: $25.20 ts1: $25.25 ...

AMR ts0: $6.20 ts9: $0.26 ...

CRDS ts0: $1.05 ts5: $6.82 ...

Columns Are Ordered by Name (in this case by a TimeUUID Comparator)

The Column Name Can Be Part of Your Data

Page 20: Cassandra, Modeling and Availability at AMUG

Turns Out That Pattern Comes Up A Lot

➔stock ticks

➔event logs

➔ad clicks/views

➔sensor records

➔access/error logs

➔plane/truck/person/“entity” locations

➔…

Page 21: Cassandra, Modeling and Availability at AMUG

OK, but I can do that in SQL

Not efficiently at scale, at least not easily ...

Page 22: Cassandra, Modeling and Availability at AMUG

ticker timestamp bid ask ...

AMR ts0 ... ... ...

... ... ... ... ...

CRDS ts0 ... ... ...

... ... ... ... ...

... ts0 ... ... ...

AMR ts1 ... ... ...

... ... ... ... ...

... ... ... ... ...

… ts1 ... ... ...

AMR ts2 ... ... ...

... ts2 ... ... ...

Data I Care About

How it Looks In a RDBMS

Page 23: Cassandra, Modeling and Availability at AMUG

ticker timestamp bid ask ...

AMR ts0 ... ... ...

AMR ts1 ... ... ...

AMR ts2 ... ... ...

... ts2 ... ... ...

Disk Seeks

How it Looks In a RDBMS

Larger Than Your Page Size

Larger Than Your Page Size

Page 24: Cassandra, Modeling and Availability at AMUG

OK, but what about ...

➔PostgreSQL Cluster Command?

➔MySQL Cluster Indexes?

➔Oracle Index Organized Tables?

➔SQLServer Clustered Index?

Page 25: Cassandra, Modeling and Availability at AMUG

OK, but what about ...

➔PostgreSQL Cluster Using?

➔MySQL [InnoDB] Cluster Indexes?

➔Oracle Index Organized Table?

➔SQLServer Clustered Index? (seriously, who uses SQLServer?!)

Meh ...

Page 26: Cassandra, Modeling and Availability at AMUG

The on-disk management of that clustering results in tons of IO …

In the case of PostgreSQL:

➔clustering is a one time operation (implies you must periodically rewrite the entire table)

➔new data is *not* written in clustered order (which is often the data you care most about)

Page 27: Cassandra, Modeling and Availability at AMUG

OK, so just partition the tables ...

Page 28: Cassandra, Modeling and Availability at AMUG

Not a bad idea, except in MySQL there is a limit of 1024 partitions, and generally fewer if using NDB

(you should probably still do it if using MySQL though)

http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html

Page 29: Cassandra, Modeling and Availability at AMUG

OK fine, I agree storing data that is queried together on disk together is a good thing, but what's that have to do with modeling?

RK ts0 ts1 ... ... tsM ... ... ... ... tsN ... ... ... ... ...

Read Precisely My Data *

Seek To Here

* more on some caveats later

Page 30: Cassandra, Modeling and Availability at AMUG

Well, that's what is meant by “work backwards from your queries” or “think in terms of queries”

(NB: this concept, in general, applies to RDBMS at scale as well; it is not specific to Cassandra)

Page 31: Cassandra, Modeling and Availability at AMUG

An Example From Fraud Detection

To calculate risk it is common to need to know all the emails, destinations, origins, devices, locations, phone numbers, et cetera ever used for the account in question

Page 32: Cassandra, Modeling and Availability at AMUG

id name ...

1 guy ...

2 gal ...

... ... ...

id email ...

100 guy@ ...

200 gal@ ...

... ... ...

id dest ...

15 USA ...

25 Finland ...

... ... ...

id device ...

1000 0xdead ...

2000 0xb33f ...

... ... ...

id origin ...

150 USA ...

250 Nigeria ...

... ... ...

In a normalized model that usually translates to a table for each type of entity being tracked

Page 33: Cassandra, Modeling and Availability at AMUG

The problem is that at scale that also means a disk seek for each one …

(even for perfect IOT et al if across multiple tables)

➔Previous emails? That's a seek …

➔Previous devices? That's a seek …

➔Previous destinations? That's a seek …

Page 34: Cassandra, Modeling and Availability at AMUG

But In Cassandra I Store The Data I Query Together On Disk Together

(remember, column names need not be static)

acctY ... ... ... ... ... ... ...

acctX dest21 dev2 dev7 email3 email9 orig4 ...

acctZ ... ... ... ... ... ... ...

Data I Care About

email:[email protected] = dateEmailWasLastUsed

Column Name = Column Value
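A sketch of that layout: one row per account, with the entity type and the entity itself packed into the column name and the last-used date as the value. The `AccountRow` class, the `kind:item` naming scheme and the sample addresses and dates are all hypothetical; the point is only that every entity for an account lives in one sorted, contiguous run.

```python
from bisect import bisect_left, insort

class AccountRow:
    """One wide row: column name = 'kind:item', column value = date last used."""

    def __init__(self):
        self.names = []    # column names, kept sorted
        self.values = {}   # column name -> value

    def record(self, kind, item, last_used):
        name = f"{kind}:{item}"
        if name not in self.values:
            insort(self.names, name)
        self.values[name] = last_used

    def by_kind(self, kind):
        """Slice out one entity type, e.g. every email ever used."""
        prefix = kind + ":"
        lo = bisect_left(self.names, prefix)
        hi = bisect_left(self.names, prefix + "\uffff")
        return [(n, self.values[n]) for n in self.names[lo:hi]]

    def everything(self):
        """Everything ever used for the account: one sequential read of the row."""
        return [(n, self.values[n]) for n in self.names]

acct = AccountRow()
acct.record("email", "guy@example.com", "2012-01-10")
acct.record("dev", "0xdead", "2012-01-09")
acct.record("dest", "USA", "2012-01-02")
acct.record("email", "gal@example.com", "2011-12-25")
```

Compared with the normalized layout, the risk calculation reads one row instead of joining one table per entity type.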

Page 35: Cassandra, Modeling and Availability at AMUG

Don't treat Cassandra (or any DB) as a black box

➔Understand how your DBs (and data structures) work

➔Understand the building blocks they provide

➔Understand the work complexity (“big O”) of queries

➔For data sets > memory, goal is to minimize seeks *

* on a related note, SSDs are awesome

Page 36: Cassandra, Modeling and Availability at AMUG

Q? (then brief intermission)

Page 37: Cassandra, Modeling and Availability at AMUG

Availability Has Many Levels

➔ Component Failure (disk)

➔ Machine Failure (NIC, CPU, power supply)

➔ Site Failure (UPS, power grid, tornado)

➔ Political Failure (war, coup)

Page 38: Cassandra, Modeling and Availability at AMUG

The Common Theme In The Solutions?

Replication

Page 39: Cassandra, Modeling and Availability at AMUG

Replication In Cassandra Follows The Dynamo Model *

http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

Read It!

Page 40: Cassandra, Modeling and Availability at AMUG

t3 t1

t2

t0

Every Node Has A Token

0 - 2^127

t0 < t1 < t2 < t3 < 2^127

Page 41: Cassandra, Modeling and Availability at AMUG

t3 t1

t2

t0

Row Key Determines Node(s)

MD5(RK) => T

t3 < T < 2^127

Page 42: Cassandra, Modeling and Availability at AMUG

t3 t1

t2

t0

Row Key Determines Node

MD5(RK) => T

t3 < T < 2^127

First Replica

Page 43: Cassandra, Modeling and Availability at AMUG

t3 t1

t2

t0

Walk The Ring To Find Subsequent Replicas *

MD5(RK) => T

t3 < T < 2^127

First Replica

Second Replica

* by default
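The placement described on the last few slides can be sketched as follows. This assumes the simple default of walking the ring clockwise; the `token_for` and `replicas_for` helpers are hypothetical, and reducing the MD5 digest modulo 2^127 is a stand-in for the partitioner's exact arithmetic.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 127

def token_for(row_key: bytes) -> int:
    """MD5(RK) => T, reduced onto the ring (assumption: simple modulo)."""
    return int.from_bytes(hashlib.md5(row_key).digest(), "big") % RING_SIZE

def replicas_for(token, node_tokens, replication_factor):
    """First replica is the first node at or after the token; walk the ring for the rest."""
    ring = sorted(node_tokens)
    first = bisect_left(ring, token) % len(ring)  # past the highest token wraps to t0
    count = min(replication_factor, len(ring))
    return [ring[(first + i) % len(ring)] for i in range(count)]

# Four evenly spaced nodes: t0 < t1 < t2 < t3 < 2^127
t0, t1, t2, t3 = 0, 2 ** 125, 2 ** 126, 3 * 2 ** 125

# A token with t3 < T < 2^127 wraps around: t0 is first replica, t1 second.
wrapped = replicas_for(t3 + 1, [t0, t1, t2, t3], 2)
```

Any node can run this calculation locally, which is part of why no master is needed to route a request.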

Page 44: Cassandra, Modeling and Availability at AMUG

t3 t1

t2

t0

Writes Happen In Parallel To All Replicas

First Replica

Second Replica

client

RK= ...

Coordinator (not a master)

RK= ...

RK= ...

Page 45: Cassandra, Modeling and Availability at AMUG

t3 t1

t2

t0

Some Or All Replicas Respond

First Replica

Second Replica

client

RK= ...

Coordinator Waits For Ack(s) From Destination Node(s)

“ok”

“ok”

X

Page 46: Cassandra, Modeling and Availability at AMUG

t3 t1

t2

t0

The Coordinator Responds To Client

First Replica

Second Replica

client

“ok”

Coordinator Waits For Ack(s) From Destination Node(s)

“ok”

“ok”

X

Page 47: Cassandra, Modeling and Availability at AMUG

What Nodes Can Be A Coordinator?

The coordinator for any given read or write is really just whatever node the client connected to for that request

any node for any request at any time

Page 48: Cassandra, Modeling and Availability at AMUG

How Many Replicas Does The Coordinator Wait For?

➔configurable, per query

➔ONE / QUORUM are the most common (more on this in a moment)
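A small sketch of the arithmetic behind those two levels, covering only ONE and QUORUM as named above. The helper names are hypothetical. A read is guaranteed to touch at least one replica that acknowledged the latest write when the read count plus the write count exceeds the number of replicas.

```python
def required_acks(consistency_level: str, replication_factor: int) -> int:
    """How many replica responses the coordinator waits for."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority
    raise ValueError(f"level not covered by this sketch: {consistency_level}")

def reads_overlap_writes(write_cl: str, read_cl: str, replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > N)."""
    w = required_acks(write_cl, replication_factor)
    r = required_acks(read_cl, replication_factor)
    return r + w > replication_factor
```

With three replicas, QUORUM means two acks, so the write still succeeds with one replica down; ONE/ONE gives 1 + 1 = 2, which does not exceed 3, so a read may land on a replica that has not yet seen the write.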

Page 49: Cassandra, Modeling and Availability at AMUG

Writing At CL.One

t3 t1

t2

t0

First Replica

Second Replica

client

Third Replica

Wait For At Least One Node (eventually all nodes get updates)

X

Page 50: Cassandra, Modeling and Availability at AMUG

Writing At CL.One

t3 t1

t2

t0

First Replica

Second Replica

client

Third Replica

Wait For At Least One Node (eventually all nodes get updates)

X

“ok”

“ok”

Page 51: Cassandra, Modeling and Availability at AMUG

Reading At CL.One

t3 t1

t2

t0

First Replica

Second Replica

client

Third Replica

Wait For At Least One Node (so you might read stale data)

X

Page 52: Cassandra, Modeling and Availability at AMUG

Reading At CL.One

t3 t1

t2

t0

First Replica

Second Replica

client

Third Replica

Wait For At Least One Node (so you might read stale data)

X

“old”

“old”

Page 53: Cassandra, Modeling and Availability at AMUG

Writing At CL.Quorum

t3 t1

t2

t0

First Replica

Second Replica

client

Third Replica

Wait For Majority Of Nodes (eventually all nodes get updates)

X

Page 54: Cassandra, Modeling and Availability at AMUG

Writing At CL.Quorum

t3 t1

t2

t0

First Replica

Second Replica

client

Third Replica

Wait For Majority Of Nodes (eventually all nodes get updates)

X

“ok”

“ok”

“ok”

Page 55: Cassandra, Modeling and Availability at AMUG

Reading At CL.Quorum

t3 t1

t2

t0

First Replica

Second Replica

client

Third Replica

Wait For Majority Of Nodes (majority => overlap => consistent)

X

Page 56: Cassandra, Modeling and Availability at AMUG

Reading At CL.Quorum

t3 t1

t2

t0

First Replica

Second Replica

client

Third Replica

Wait For Majority Of Nodes (majority => overlap => consistent)

X

“ok”

“ok”

“old”

Coordinator chooses the client response based on the client-supplied per-column timestamp

Page 57: Cassandra, Modeling and Availability at AMUG

Reading At CL.Quorum

t3 t1

t2

t0

First Replica

Second Replica

client

Third Replica

Read Repair Updates Stale Nodes

X

“current”

Already Has Response
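The last two slides can be sketched together: the coordinator picks the response with the newest timestamp, and any contacted replica that returned something older is brought up to date. The helper names and the in-memory `replicas` dict are hypothetical, and the repair here is done inline for brevity, which is an assumption of the sketch rather than a statement about how Cassandra schedules it.

```python
def resolve(responses):
    """responses: {replica name: (value, timestamp)}; the newest timestamp wins."""
    winner = max(responses.values(), key=lambda vt: vt[1])
    stale = sorted(name for name, vt in responses.items() if vt[1] < winner[1])
    return winner, stale

def read_with_repair(replicas, contacted):
    """Answer from the freshest contacted replica, then update the stale ones."""
    responses = {name: replicas[name] for name in contacted}
    winner, stale = resolve(responses)
    for name in stale:
        replicas[name] = winner  # read repair: push the current column to the stale node
    return winner[0]

# Three replicas of one column; "second" missed the latest write.
replicas = {
    "first": ("current", 20),
    "second": ("old", 10),
    "third": ("current", 20),
}
answer = read_with_repair(replicas, ["first", "second"])
```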

Page 58: Cassandra, Modeling and Availability at AMUG

t3

t0

On A Side Note, A Lost Response

“ok”

X

Page 59: Cassandra, Modeling and Availability at AMUG

t3

t0

Is The Same As A Lost Request

RK = ...

X

* In Regards To Meeting Consistency

Page 60: Cassandra, Modeling and Availability at AMUG

t3

t0

Which Is The Same As A Failed/Slow Node

RK = ...

X

* In Regards To Meeting Consistency

Page 61: Cassandra, Modeling and Availability at AMUG

In fact, it is actually impossible for the originator to reliably distinguish between the three

Page 62: Cassandra, Modeling and Availability at AMUG

One More Important Piece:

writes are idempotent *

* except with the counter API, but if you want that it can be done

Page 63: Cassandra, Modeling and Availability at AMUG

Why is that important?

It means we can replay/retry writes, even late and/or out of order, and get the same results

➔After/during node failures

➔After/during network partitions

➔After/during upgrades
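A sketch of why replay is safe, assuming each column resolves to the write with the highest timestamp. The `apply_write` helper is hypothetical, and breaking timestamp ties by comparing values is an assumption made here so the outcome is deterministic. Every ordering of the same writes, including a duplicated one, converges to the same row.

```python
from itertools import permutations

def apply_write(row, name, value, timestamp):
    """Keep whichever version of the column has the higher (timestamp, value)."""
    current = row.get(name)
    if current is None or (timestamp, value) > (current[1], current[0]):
        row[name] = (value, timestamp)

# Two conflicting updates to one column, plus a retry of the second one.
writes = [
    ("office", "Austin", 1),
    ("office", "UA", 2),
    ("office", "UA", 2),  # duplicate delivery of the same write
]

outcomes = set()
for order in permutations(writes):
    row = {}
    for write in order:
        apply_write(row, *write)
    outcomes.add(tuple(sorted(row.items())))
```

`outcomes` ends up with a single element, so a late, repeated or reordered write cannot change the final state.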

Page 64: Cassandra, Modeling and Availability at AMUG

In other words, you can concurrently issue conflicting updates to two different nodes while those nodes have no communication between them

Page 65: Cassandra, Modeling and Availability at AMUG

Which is important because ...

Page 66: Cassandra, Modeling and Availability at AMUG

Availability Has Many Levels

➔ Component Failure (disk)

➔ Machine Failure (NIC, CPU, power supply)

➔ Site Failure (UPS, power grid, tornado)

➔ Political Failure (war, coup)

Page 67: Cassandra, Modeling and Availability at AMUG

If you care about global availability you must serve reads and writes from multiple data centers

There is no way around this

Page 68: Cassandra, Modeling and Availability at AMUG

Q?

Conceptual Modeling Differences From A RDBMS

Matthew F. Dennis, DataStax // @mdennis

Page 69: Cassandra, Modeling and Availability at AMUG

A Brief Rant On Query Planners, Garbage Collectors, Virtual Memory, Automatic Transmissions and Data Structures