The top five questions to ask about NoSQL. JONATHAN ELLIS at Big Data Spain 2012

Five questions for your NoSQL solution Jonathan Ellis CTO, DataStax Project Chair, Apache Cassandra


Session presented at Big Data Spain 2012 Conference 16th Nov 2012 ETSI Telecomunicacion UPM Madrid www.bigdataspain.org More info: http://www.bigdataspain.org/es-2012/conference/top-five-questions-about-nosql/jonathan-ellis

Transcript

Page 1

Five questions for your NoSQL solution!
Jonathan Ellis
CTO, DataStax
Project Chair, Apache Cassandra

Page 2

©2012 DataStax

How do I model my application?

Page 3

Popular options

• Key/value

• Tabular

• Document

• Graph?

Page 4

Schema is your friend

{
  "id": "e451dd42-ece3-11e1-a0a3-34159e154f4c",
  "name": "jbellis",
  "state": "TX",
  "birthdate": "1/1/1976",
  "email_addresses": ["jbellis@gmail", "[email protected]"]
}

Page 5

SQL can be your friend too

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date date
);

CREATE INDEX ON users(state);

SELECT * FROM users
WHERE state='Texas' AND birth_date > '1950-01-01';

Page 6

Collections

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date date
);

CREATE TABLE users_addresses (
  user_id uuid REFERENCES users,
  email text
);

SELECT *
FROM users NATURAL JOIN users_addresses;

Page 7

Collections

[The users / users_addresses schema and NATURAL JOIN from the previous slide, crossed out with an X]

Page 8

Collections

CREATE TABLE users (
  id uuid PRIMARY KEY,
  name text,
  state text,
  birth_date date,
  email_addresses set<text>
);

UPDATE users
SET email_addresses = email_addresses + {'[email protected]', '[email protected]'};

Page 9

Joins don’t scale

• No joins

• No subqueries

• No aggregation functions* or GROUP BY

• ORDER BY?

Page 10

SELECT * FROM tweets
WHERE user_id IN (SELECT follower FROM followers WHERE user_id = 'driftx')

[Diagram: followers ? tweets]

Page 11

Clustering in Cassandra

CREATE TABLE timeline (
  user_id uuid,
  tweet_id timeuuid,
  tweet_author uuid,
  tweet_body text,
  PRIMARY KEY (user_id, tweet_id)
);

user_id  tweet_id    _author   _body
jbellis  3290f9da..  rbranson  lorem
jbellis  3895411a..  tjake     ipsum
...      ...         ...
driftx   3290f9da..  rbranson  lorem
driftx   71b46a84..  yzhang    dolor
...      ...         ...
yukim    3290f9da..  rbranson  lorem
yukim    e451dd42..  tjake     amet
...      ...         ...

Page 12

Clustering in Cassandra

SELECT * FROM timeline
WHERE user_id = 'driftx';
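
A minimal Python sketch of why the timeline query needs no join. It assumes a toy in-memory model: the `Timeline` class, `post_tweet`, and the fan-out-on-write step are hypothetical illustrations, not Cassandra APIs. Rows that share a partition key (`user_id`) are stored together and kept sorted by the clustering column (`tweet_id`), so reading one user's timeline is a single ordered lookup.

```python
import bisect
from collections import defaultdict


class Timeline:
    """Toy model of PRIMARY KEY (user_id, tweet_id)."""

    def __init__(self):
        # partition key -> list of (tweet_id, author, body), sorted by tweet_id
        self._partitions = defaultdict(list)

    def insert(self, user_id, tweet_id, author, body):
        # Keep each partition ordered by the clustering column.
        bisect.insort(self._partitions[user_id], (tweet_id, author, body))

    def select(self, user_id):
        # One partition read, already in clustering order; no join needed.
        return list(self._partitions.get(user_id, []))


def post_tweet(timeline, followers, tweet_id, author, body):
    # Assumption: denormalize at write time, one row per follower's timeline.
    for follower in followers:
        timeline.insert(follower, tweet_id, author, body)


timeline = Timeline()
post_tweet(timeline, ["jbellis", "driftx", "yukim"], "3290f9da", "rbranson", "lorem")
post_tweet(timeline, ["driftx"], "71b46a84", "yzhang", "dolor")
driftx_rows = timeline.select("driftx")
```

The same tweet is written once per follower, which trades extra writes for a read that touches a single partition.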

Page 13

How does it perform?

Page 14

VLDB benchmark

Page 15

Locking

Page 16

Efficiency

Page 17

UPDATE users
SET email_addresses = email_addresses + {...}
WHERE user_id = 'jbellis';

Page 18

Durability

Page 19

Log-structured storage engine

write(k1, c1:v1)

Memory: Memtable

Hard drive: Commit log

Page 20

write(k1, c1:v1)

Memory (Memtable):
k1 c1:v1

Hard drive (Commit log):
k1 c1:v1

Page 21

write(k1, c2:v2)

Memory (Memtable):
k1 c1:v1 c2:v2

Hard drive (Commit log):
k1 c1:v1
k1 c2:v2

Page 22

write(k2, c1:v1 c2:v2)

Memory (Memtable):
k1 c1:v1 c2:v2
k2 c1:v1 c2:v2

Hard drive (Commit log):
k1 c1:v1
k1 c2:v2
k2 c1:v1 c2:v2

Page 23

write(k1, c1:v4 c3:v3)

Memory (Memtable):
k1 c1:v4 c2:v2 c3:v3
k2 c1:v1 c2:v2

Hard drive (Commit log):
k1 c1:v1
k1 c2:v2
k2 c1:v1 c2:v2
k1 c1:v4 c3:v3

Page 24

flush (Memory to Hard drive)

SSTable:
k1 c1:v4 c2:v2 c3:v3
k2 c1:v1 c2:v2

index / BF

cleanup
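
The write path walked through on the preceding slides can be sketched in Python as follows. This is a simplified model under stated assumptions, not Cassandra's implementation: the `LogStructuredStore` class and its method names are hypothetical, both "disk" structures are plain in-memory lists, and the index / BF step is left out.

```python
class LogStructuredStore:
    """Toy commit log + memtable + SSTable model (all names hypothetical)."""

    def __init__(self):
        self.commit_log = []  # append-only; stands in for the on-disk commit log
        self.memtable = {}    # row key -> {column: value}, held in memory
        self.sstables = []    # immutable flushed tables, oldest first

    def write(self, key, columns):
        # Durability: append to the commit log (sequential, never in place).
        self.commit_log.append((key, dict(columns)))
        # Then merge the columns into the in-memory row.
        self.memtable.setdefault(key, {}).update(columns)

    def flush(self):
        # Write the memtable out as one sorted, immutable SSTable.
        sstable = {key: dict(self.memtable[key]) for key in sorted(self.memtable)}
        self.sstables.append(sstable)
        # Cleanup: flushed data no longer needs the memtable or its log entries.
        self.memtable = {}
        self.commit_log = []

    def read(self, key):
        # Merge oldest to newest so later writes win.
        merged = {}
        for sstable in self.sstables:
            merged.update(sstable.get(key, {}))
        merged.update(self.memtable.get(key, {}))
        return merged


store = LogStructuredStore()
store.write("k1", {"c1": "v1"})
store.write("k1", {"c2": "v2"})
store.write("k2", {"c1": "v1", "c2": "v2"})
store.write("k1", {"c1": "v4", "c3": "v3"})
log_length_before_flush = len(store.commit_log)
store.flush()
```

Every write is either an append or an in-memory update, and a flush writes a whole new table, so nothing on disk is modified in place.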

Page 25

No random writes

Page 26

The gory details

Page 27

Larger than memory datasets

Page 28

How does it handle failure?

Page 29

Classic partitioning with SPOF

client -> router -> partition 1, partition 2, partition 3, partition 4

Page 30

Availability

• “High availability implies that a single fault will not bring down your system. Not ‘we’ll recover quickly.’” -- Ben Coverston: DataStax

• “The biggest problem with failover is that you're almost never using it until it really hurts. It's like backups that you never test.” -- Rick Branson: Instagram

Page 31

Fully distributed, no SPOF

[Diagram: client and cluster nodes; partition labels p1, p1, p1, p3, p6]

Page 32

Multiple datacenters

Page 34

Self-healing

1. Client -> Coordinator: request
2. Coordinator -> Replica: internal request
3. Replica -> Coordinator: internal response
4. Coordinator -> Client: response

Page 36

Self-healing

1. Client -> Coordinator: request
2. Coordinator -> Replica: internal request
   (replica fails)
4. Coordinator -> Client: timeout response

Page 37

Self-healing

1. Client -> Coordinator: request
2. Coordinator -> Replica: internal request
   X (replica fails)
4. Coordinator -> Client: timeout response

Page 38

Self-healing

1. Client -> Coordinator: request
2. Coordinator -> Replica: internal request
   (replica fails)
3. Coordinator: hint
4. Coordinator -> Client: timeout response

Page 39

Self-healing

1. Client -> Coordinator: request
2. Coordinator -> Replica: internal request
   X (replica fails)
3. Coordinator: hint
4. Coordinator -> Client: timeout response
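
A rough Python sketch of the hint step shown in the self-healing slides, assuming a single coordinator and a single replica. The `Replica` and `Coordinator` classes, the `replay_hints` method, and the string return values are hypothetical names for illustration; replaying hints once the replica is back is how hinted handoff is generally described, but the details here are simplified.

```python
class Replica:
    def __init__(self):
        self.up = True
        self.data = {}

    def apply(self, key, value):
        # A failed replica never answers; modeled here as a timeout.
        if not self.up:
            raise TimeoutError("replica did not respond")
        self.data[key] = value


class Coordinator:
    def __init__(self, replica):
        self.replica = replica
        self.hints = []  # writes waiting for the replica to come back

    def write(self, key, value):
        try:
            self.replica.apply(key, value)  # internal request
            return "ok"
        except TimeoutError:
            self.hints.append((key, value))  # store a hint for later delivery
            return "timeout"                 # timeout response to the client

    def replay_hints(self):
        # Deliver stored writes once the replica is reachable again.
        while self.hints and self.replica.up:
            key, value = self.hints.pop(0)
            self.replica.apply(key, value)


replica = Replica()
coordinator = Coordinator(replica)
replica.up = False
first_response = coordinator.write("k1", "v1")
replica.up = True
coordinator.replay_hints()
```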

Page 40

Other healing modes

• AntiEntropyService

• Read repair

Page 41

Dynamic snitch (dealing with partial failure)

Client -> Coordinator -> replicas:

40% busy

90% busy

30% busy
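
The idea on this slide can be illustrated with a very small Python sketch: the coordinator routes to whichever replica currently looks least loaded. The replica names and the `pick_replica` helper are hypothetical, and the busy percentages from the slide stand in for whatever score the real dynamic snitch computes, which is not modeled here.

```python
def pick_replica(busy_by_replica):
    """Return the name of the replica with the lowest busy fraction."""
    return min(busy_by_replica, key=busy_by_replica.get)


# Busy fractions taken from the slide; replica names are made up.
load = {"replica_a": 0.40, "replica_b": 0.90, "replica_c": 0.30}
chosen = pick_replica(load)
```

A replica that is slow but not down (the 90% busy node) is simply avoided, which is what distinguishes partial failure handling from plain failure detection.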

Page 42

How does it scale?

Page 43

VLDB benchmark

Page 44

Scaling antipatterns

• Metadata servers

• Router bottlenecks

• Overloading existing nodes when adding capacity

Page 45

How flexible is it?

Page 47

Data model: Realtime

Portfolios
user     stock  shares
jbellis  GOOG   80
jbellis  LNKD   20
yukim    AMZN   100

StockHist
stock  date        price
GOOG   2011-01-01  $8.23
GOOG   2011-01-02  $6.14
GOOG   2011-01-03  $7.78

LiveStocks
stock  last
GOOG   $95.52
AAPL   $186.10
AMZN   $112.98

Page 48

Data model: Analytics

HistLoss
            worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93

Page 49

Data model: Analytics

10dayreturns
stock  rdate       return
GOOG   2011-07-25  $8.23
GOOG   2011-07-24  $6.14
GOOG   2011-07-23  $7.78
AAPL   2011-07-25  $15.32
AAPL   2011-07-24  $12.68

INSERT OVERWRITE TABLE 10dayreturns
SELECT a.stock, b.date as rdate, b.price - a.price
FROM StockHist a JOIN StockHist b
ON (a.stock = b.stock AND date_add(a.date, 10) = b.date);

Page 50

Data model: Analytics

portfolio_returns
portfolio   rdate       preturn
Portfolio1  2011-07-25  $118.21
Portfolio1  2011-07-24  $60.78
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-07-25  $2143.92
Portfolio3  2011-07-24  -$10.19

INSERT OVERWRITE TABLE portfolio_returns
SELECT portfolio, rdate, SUM(b.return)
FROM portfolios a JOIN 10dayreturns b
ON (a.stock = b.stock)
GROUP BY portfolio, rdate;

Page 51

Data model: Analytics

INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
  SELECT portfolio, min(preturn) as minp
  FROM portfolio_returns
  GROUP BY portfolio
) a JOIN portfolio_returns b
ON (a.portfolio = b.portfolio and a.minp = b.preturn);

HistLoss
            worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93
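
To make the three analytics queries easier to follow, here is a plain-Python sketch of the same pipeline: 10-day returns, then per-portfolio returns, then the worst date per portfolio. The function names and the sample prices are made up for illustration (integers are used to keep the arithmetic exact), and like the queries it sums returns per stock without weighting by shares. Unlike the final join, it keeps only the first date when two dates tie for the minimum.

```python
from datetime import date, timedelta


def ten_day_returns(stock_hist):
    # Self-join StockHist on (stock, date + 10 days).
    prices = {(stock, day): price for stock, day, price in stock_hist}
    returns = []
    for (stock, day), price in prices.items():
        later = day + timedelta(days=10)
        if (stock, later) in prices:
            returns.append((stock, later, prices[(stock, later)] - price))
    return returns


def portfolio_returns(portfolios, returns):
    # Join portfolios to returns on stock, then SUM grouped by (portfolio, rdate).
    totals = {}
    for portfolio, stock in portfolios:
        for r_stock, rdate, ret in returns:
            if r_stock == stock:
                totals[(portfolio, rdate)] = totals.get((portfolio, rdate), 0) + ret
    return totals


def hist_loss(totals):
    # For each portfolio keep the date with the minimum return.
    worst = {}
    for (portfolio, rdate), preturn in totals.items():
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return worst


# Made-up sample data, not taken from the slides.
stock_hist = [
    ("GOOG", date(2011, 1, 1), 10), ("GOOG", date(2011, 1, 11), 7),
    ("GOOG", date(2011, 1, 2), 5), ("GOOG", date(2011, 1, 12), 9),
    ("AAPL", date(2011, 1, 1), 20), ("AAPL", date(2011, 1, 11), 22),
]
portfolios = [("Portfolio1", "GOOG"), ("Portfolio1", "AAPL"), ("Portfolio2", "GOOG")]
worst = hist_loss(portfolio_returns(portfolios, ten_day_returns(stock_hist)))
```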

Page 53

Some Cassandra users