Cassandra, Modeling and Availability at AMUG
Conceptual Modeling Differences From A RDBMS
Matthew F. Dennis, DataStax // @mdennis
Austin MySQL User Group
January 11, 2012
Cassandra Is Not Relational
Get out of the relational mindset when working with Cassandra (or really any NoSQL DB)
Work Backwards From Queries
Think in terms of queries, not in terms of normalizing the data; in fact, you often want to denormalize (already common in the data warehousing world, even in RDBMS)
OK great, but how do I do that?
Well, you need to know how Cassandra models data (its data model follows Google Bigtable)
research.google.com/archive/bigtable-osdi06.pdf
Go Read It!
In Cassandra:
➔data is organized into Keyspaces (usually one per app)
➔each Keyspace can have multiple Column Families
➔each Column Family can have many Rows
➔each Row has a Row Key and a variable number of Columns
➔each Column consists of a Name, Value and Timestamp
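The nesting above can be sketched with plain Python containers. This is only an illustration of the shape of the model, not how Cassandra stores anything; the names Column, Row, keyspace and insert are made up for this sketch, and the "users" column family is a hypothetical example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Column:
    name: str
    value: str
    timestamp: int  # supplied by the client on write


@dataclass
class Row:
    key: str
    # Each row carries its own, variable set of columns.
    columns: Dict[str, Column] = field(default_factory=dict)


# keyspace -> column family name -> row key -> Row
keyspace: Dict[str, Dict[str, Row]] = {"users": {}}


def insert(cf: str, row_key: str, name: str, value: str, timestamp: int) -> None:
    """Create the row on first use, then add or overwrite one column in it."""
    row = keyspace[cf].setdefault(row_key, Row(row_key))
    row.columns[name] = Column(name, value, timestamp)


insert("users", "mdennis", "office", "UA", 1)
insert("users", "mdennis", "OS", "Linux", 1)
insert("users", "thobbs", "office", "Austin", 1)
```

Note that the two rows end up with different columns; nothing forces thobbs to have an OS column.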
In Cassandra, Keyspaces:
➔are similar in concept to a “database” in some RDBMSs
➔are stored in separate directories on disk
➔are usually one-to-one with applications
➔are usually the administrative unit for things related to ops
➔contain multiple column families
In Cassandra, In Keyspaces, Column Families:
➔are similar in concept to a “table” in most RDBMSs
➔are stored in separate files on disk (many per CF)
➔are usually approximately one-to-one with query type
➔are usually the administrative unit for things related to your data
➔can contain many (~billion* per node) rows
* for a good sized node (you can always add nodes)
In Cassandra, In Keyspaces, In Column Families ...
thepaul   office: Austin   OS: OSX     twitter: thepaul0
mdennis   office: UA       OS: Linux   twitter: mdennis
thobbs    office: Austin               twitter: tylhobbs
Rows: each line above is one row
Row Keys: thepaul, mdennis, thobbs
Columns: each name: value pair within a row
Column Names: office, OS, twitter
Column Values: Austin, UA, OSX, Linux, thepaul0, mdennis, tylhobbs
Rows Are Randomly Ordered (if using the RandomPartitioner)
Columns Are Ordered by Name (by a configurable comparator)
Columns are ordered because doing so allows very efficient implementations of useful and common operations (e.g. merge joins)
In particular, within a row I can find given columns by name very quickly (ordered names => O(log n) binary search).
More importantly, I can query for a slice between a start and end
(diagram: one row, identified by its Row Key RK, with columns ts0, ts1, ... tsM ... tsN ... in sorted order; a slice is marked by a start and an end column)
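A minimal sketch of why ordering helps, assuming a single row held in memory as two parallel sorted lists. SortedRow is a made-up name, the integer column names stand in for timestamps, and the prices are illustrative; the point is only that both the lookup and the slice boundaries are found by binary search.

```python
import bisect


class SortedRow:
    """One row whose columns are kept sorted by column name."""

    def __init__(self):
        self.names = []   # sorted column names
        self.values = []  # values[i] belongs to names[i]

    def put(self, name, value):
        i = bisect.bisect_left(self.names, name)
        if i < len(self.names) and self.names[i] == name:
            self.values[i] = value          # overwrite existing column
        else:
            self.names.insert(i, name)      # keep both lists in name order
            self.values.insert(i, value)

    def get(self, name):
        # Binary search over the sorted names: O(log n) comparisons.
        i = bisect.bisect_left(self.names, name)
        if i < len(self.names) and self.names[i] == name:
            return self.values[i]
        return None

    def slice(self, start, end):
        # Find both ends by binary search, then read one contiguous run.
        lo = bisect.bisect_left(self.names, start)
        hi = bisect.bisect_right(self.names, end)
        return list(zip(self.names[lo:hi], self.values[lo:hi]))


row = SortedRow()
for ts, price in [(3, "25.30"), (1, "25.20"), (2, "25.25"), (5, "25.10")]:
    row.put(ts, price)

in_range = row.slice(2, 4)  # every column named between 2 and 4, inclusive
```

The slice touches only the columns between start and end, however many other columns the row holds.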
Why does that matter?
Because columns within a row aren't static!
INTC ts0: $25.20 ts1: $25.25 ...
AMR ts0: $6.20 ts9: $0.26 ...
CRDS ts0: $1.05 ts5: $6.82 ...
Columns Are Ordered by Name (in this case by a TimeUUID Comparator)
The Column Name Can Be Part of Your Data
Turns Out That Pattern Comes Up A Lot
➔stock ticks
➔event logs
➔ad clicks/views
➔sensor records
➔access/error logs
➔plane/truck/person/“entity” locations
➔…
OK, but I can do that in SQL
Not efficiently at scale, at least not easily ...
ticker timestamp bid ask ...
AMR ts0 ... ... ...
... ... ... ... ...
CRDS ts0 ... ... ...
... ... ... ... ...
... ts0 ... ... ...
AMR ts1 ... ... ...
... ... ... ... ...
... ... ... ... ...
… ts1 ... ... ...
AMR ts2 ... ... ...
... ts2 ... ... ...
Data I Care About
How it Looks In a RDBMS
ticker timestamp bid ask ...
AMR ts0 ... ... ...
AMR ts1 ... ... ...
AMR ts2 ... ... ...
... ts2 ... ... ...
Disk Seeks
How it Looks In a RDBMS
Larger Than Your Page Size
OK, but what about ...
➔PostgreSQL CLUSTER command (CLUSTER ... USING)?
➔MySQL [InnoDB] clustered indexes?
➔Oracle Index Organized Tables?
➔SQLServer Clustered Index? (seriously, who uses SQLServer?!)
Meh ...
The on-disk management of that clustering results in tons of IO …
In the case of PostgreSQL:
➔clustering is a one time operation (implies you must periodically rewrite the entire table)
➔new data is *not* written in clustered order (which is often the data you care most about)
OK, so just partition the tables ...
Not a bad idea, except in MySQL there is a limit of 1024 partitions, and generally fewer if using NDB
(you should probably still do it if using MySQL though)
http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html
OK fine, I agree storing data that is queried together on disk together is a good thing but
what's that have to do with modeling?
RK ts0 ts1 ... ... tsM ... ... ... ... tsN ... ... ... ... ...
Read Precisely My Data *
Seek To Here
* more on some caveats later
Well, that's what is meant by “work backwards from your queries” or “think in terms of queries”
(NB: this concept, in general, applies to RDBMS at scale as well; it is not specific to Cassandra)
An Example From Fraud Detection
To calculate risk it is common to need to know all the emails, destinations, origins, devices, locations, phone
numbers, et cetera ever used for the account in question
id name ...
1 guy ...
2 gal ...
... ... ...
id email ...
100 guy@ ...
200 gal@ ...
... ... ...
id dest ...
15 USA ...
25 Finland ...
... ... ...
id device ...
1000 0xdead ...
2000 0xb33f ...
... ... ...
id origin ...
150 USA ...
250 Nigeria ...
... ... ...
In a normalized model that usually translates to a table for each type of entity being tracked
The problem is that at scale that also means a disk seek for each one …
(even with perfect index-organized tables (IOT) et al., if the data is spread across multiple tables)
➔Previous emails? That's a seek …
➔Previous devices? That's a seek …
➔Previous destinations? That's a seek ...
But In Cassandra I Store The Data I Query Together On Disk Together
(remember, column names need not be static)
acctY ... ... ... ... ... ... ...
acctX dest21 dev2 dev7 email9 orig4 ...
acctZ ... ... ... ... ... ... ...
Data I Care About
email:[email protected] = dateEmailWasLastUsed
Column Name Column Value
email3
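A sketch of that wide-row layout, assuming column names of the form "type:value" and column values holding the date last used. The account contents, the example.com addresses, the dates and the helper columns_with_prefix are all made up for illustration; only the idea of putting the data in the column name comes from the slides.

```python
import bisect

# One wide row per account, columns kept sorted by name.
rows = {
    "acctX": sorted([
        ("dest:Finland", "2011-12-01"),
        ("dev:0xdead", "2011-11-20"),
        ("email:guy@example.com", "2011-12-15"),
        ("email:old@example.com", "2010-03-02"),
        ("orig:USA", "2011-12-01"),
    ]),
}


def columns_with_prefix(account, prefix):
    """Return the contiguous run of columns whose name starts with prefix."""
    cols = rows[account]
    names = [name for name, _ in cols]
    lo = bisect.bisect_left(names, prefix)
    # "\uffff" sorts after any character expected in a column name here.
    hi = bisect.bisect_left(names, prefix + "\uffff")
    return cols[lo:hi]


everything = rows["acctX"]                        # one row read, all entity types
emails = columns_with_prefix("acctX", "email:")   # or just one slice of it
```

Everything ever used for the account comes back from a single row, instead of one lookup per entity table.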
Don't treat Cassandra (or any DB) as a black box
➔Understand how your DBs (and data structures) work
➔Understand the building blocks they provide
➔Understand the work complexity (“big O”) of queries
➔For data sets > memory, goal is to minimize seeks *
* on a related note, SSDs are awesome
Q? (then brief intermission)
Availability Has Many Levels
➔ Component Failure (disk)
➔ Machine Failure (NIC, cpu, power supply)
➔ Site Failure (UPS, power grid, tornado)
➔ Political Failure (war, coup)
The Common Theme In The Solutions?
Replication
Replication In Cassandra Follows The Dynamo Model *
http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
Read It!
Every Node Has A Token
0 - 2^127
(ring diagram: four nodes placed around the ring at tokens t0, t1, t2 and t3)
t0 < t1 < t2 < t3 < 2^127
Row Key Determines Node(s)
MD5(RK) => T
t3 < T < 2^127
(ring diagram: the node that owns T is marked First Replica)
Walk The Ring To Find Subsequent Replicas *
MD5(RK) => T
t3 < T < 2^127
(ring diagram: the next node around the ring from the First Replica is marked Second Replica)
* by default
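The placement walk can be sketched as follows. This is a simplification under stated assumptions: the token is taken as the MD5 digest modulo 2^127 (the real RandomPartitioner derives it differently in detail), the first replica is assumed to be the first node whose token is >= T with wrap-around past the largest token, and the node names and tokens are hypothetical.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127


def token_for(row_key: str) -> int:
    """MD5(RK) => T, reduced into the 0 .. 2^127 token range (simplified)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def replicas_for_token(t, node_tokens, replication_factor):
    """Pick the first replica for token t, then walk the ring for the rest."""
    ring = sorted((tok, node) for node, tok in node_tokens.items())
    tokens = [tok for tok, _ in ring]
    # First node whose token is >= t; past the last token, wrap to the start.
    start = bisect.bisect_left(tokens, t) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def replicas_for(row_key, node_tokens, replication_factor):
    return replicas_for_token(token_for(row_key), node_tokens, replication_factor)


# Hypothetical four-node ring with evenly spaced tokens t0 < t1 < t2 < t3.
nodes = {"n0": 0, "n1": 2 ** 125, "n2": 2 ** 126, "n3": 3 * 2 ** 125}
owners = replicas_for("mdennis", nodes, 2)
```

Any node can run this computation locally, so no lookup service is needed to find out where a row lives.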
Writes Happen In Parallel To All Replicas
(ring diagram: the client sends RK= ... to a Coordinator (not a master), which forwards RK= ... to the First Replica and the Second Replica)
Some Or All Replicas Respond
Coordinator Waits For Ack(s) From Destination Node(s)
(ring diagram: the client has sent RK= ... to the coordinator; replicas reply “ok” to the coordinator, and an X marks a failure)
The Coordinator Responds To Client
Coordinator Waits For Ack(s) From Destination Node(s)
(ring diagram: replicas reply “ok” to the coordinator, an X marks a failure, and the coordinator returns “ok” to the client)
What Nodes Can Be A Coordinator?
The coordinator for any given read or write is really just whatever node the client connected to for that request: any node, for any request, at any time
How Many Replicas Does The Coordinator Wait For?
➔configurable, per query
➔ONE / QUORUM are the most common (more on this in a moment)
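As a rough sketch of what "how many replicas to wait for" means, assuming only the two levels named above and a known replication factor. The function names are made up for this sketch and are not a driver API.

```python
def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Number of replica responses the coordinator waits for."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority of the replicas
    raise ValueError(f"level not covered by this sketch: {consistency_level}")


def request_succeeds(consistency_level: str, replication_factor: int,
                     acks_received: int) -> bool:
    """True once enough replicas have answered for the chosen level."""
    return acks_received >= required_acks(consistency_level, replication_factor)


# With three replicas and one of them down, both levels can still be met.
ok_at_one = request_succeeds("ONE", 3, acks_received=2)
ok_at_quorum = request_succeeds("QUORUM", 3, acks_received=2)
```

Because the level is chosen per query, the same application can write some data at ONE and other data at QUORUM.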
Writing At CL.One
Wait For At Least One Node (eventually all nodes get updates)
(ring diagram: the client writes to the First, Second and Third Replica; an X marks a failure; two “ok” acks come back)
Reading At CL.One
Wait For At Least One Node (so you might read stale data)
(ring diagram: the client reads from the First, Second and Third Replica; an X marks a failure; the value that comes back is “old”)
Writing At CL.Quorum
Wait For Majority Of Nodes (eventually all nodes get updates)
(ring diagram: the client writes to the First, Second and Third Replica; an X marks a failure; three “ok” messages are shown)
Reading At CL.Quorum
Wait For Majority Of Nodes (majority => overlap => consistent)
(ring diagram: the client reads from the First, Second and Third Replica; an X marks a failure; the responses shown are “ok”, “ok” and “old”)
The coordinator chooses the client response based on the client-supplied, per-column timestamps (TS)
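The "majority => overlap" step can be checked by brute force. The sketch below enumerates every set of replicas that could have acknowledged a write and every set that could answer a read, and tests whether they always share at least one replica. The replica names and the helper always_overlap are made up for this check.

```python
from itertools import combinations


def always_overlap(replicas, write_acks, read_acks):
    """True if every possible write set shares a replica with every read set."""
    return all(
        set(w) & set(r)
        for w in combinations(replicas, write_acks)
        for r in combinations(replicas, read_acks)
    )


replicas = ["first", "second", "third"]
quorum = len(replicas) // 2 + 1  # 2 of 3

quorum_is_safe = always_overlap(replicas, quorum, quorum)
one_is_safe = always_overlap(replicas, 1, 1)
```

With quorum writes and quorum reads at least one replica in every read has seen the latest write, and its newer timestamp wins. At ONE/ONE the read can land on a replica the write has not reached yet, which is the stale read described earlier.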
Reading At CL.Quorum
Read Repair Updates Stale Nodes
(ring diagram: First, Second and Third Replica; an X marks a failure; the stale node is sent “current”; the client Already Has Response)
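A minimal sketch of the two steps together, assuming each replica is a dict of key -> (value, timestamp) and that only the replicas contacted for this read are repaired (a simplification). The function read and the sample data are hypothetical.

```python
def read(replicas, key, contacted):
    """Return the newest value among the contacted replicas, repairing stale ones."""
    responses = {name: replicas[name].get(key) for name in contacted}
    present = [vt for vt in responses.values() if vt is not None]
    if not present:
        return None
    # Pick the response with the highest client-supplied timestamp.
    newest = max(present, key=lambda vt: vt[1])
    # Read repair: push the newest version to any contacted replica that
    # is missing the column or holds an older timestamp.
    for name, current in responses.items():
        if current is None or current[1] < newest[1]:
            replicas[name][key] = newest
    return newest[0]


replicas = {
    "first": {"k": ("current", 2)},
    "second": {"k": ("old", 1)},
    "third": {"k": ("old", 1)},
}
answer = read(replicas, "k", contacted=["first", "second"])
```

The client gets its answer from the newest response; the repair of the stale replica is a side effect of the read.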
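On A Side Note, A Lost Response
(diagram: the “ok” travelling between the nodes at t3 and t0 is lost, marked X)
Is The Same As A Lost Request *
(diagram: the request RK = ... between the nodes at t3 and t0 is lost, marked X)
Which Is The Same As A Failed/Slow Node *
(diagram: the request RK = ... is sent, but the receiving node is marked X)
* In Regards To Meeting Consistency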
In fact, it is actually impossible for the originator to reliably distinguish between the three
One More Important Piece:
writes are idempotent *
* except with the counter API, but if you want that it can be done
Why is that important?
It means we can replay/retry writes, even late and/or out of order, and get the same results
➔After/during node failures
➔After/during network partitions
➔After/during upgrades
In other words you can concurrently issue conflicting updates to two different nodes while
those nodes have no communication between them
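One way to see this is a last-write-wins merge on the client-supplied timestamp. The sketch below replays the same writes, including a duplicated retry, in every possible order and checks that the stored result never changes. The tie-break on value and the names apply_write and replay are assumptions of this sketch.

```python
from itertools import permutations


def apply_write(store, write):
    """Keep the write with the highest timestamp; ties broken by value."""
    key, value, ts = write
    current = store.get(key)
    if current is None or (ts, value) > (current[1], current[0]):
        store[key] = (value, ts)


def replay(writes):
    store = {}
    for w in writes:
        apply_write(store, w)
    return store


# Three conflicting updates to one column, plus a retry of one of them.
writes = [("k", "a", 1), ("k", "b", 3), ("k", "c", 2), ("k", "b", 3)]

# Every ordering of the same writes ends in the same state.
outcomes = {tuple(sorted(replay(p).items())) for p in permutations(writes)}
```

Since the order and the number of deliveries do not matter, a node that was down or partitioned can simply be sent the writes it missed later.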
Which is important because ...
Availability Has Many Levels
➔ Component Failure (disk)
➔ Machine Failure (NIC, cpu, power supply)
➔ Site Failure (UPS, power grid, tornado)
➔ Political Failure (war, coup)
If you care about global availability you must serve reads and writes from multiple data centers
There is no way around this
Q?
A Brief Rant On Query Planners, Garbage Collectors, Virtual Memory, Automatic Transmissions and Data Structures