Apache Cassandra @Geneva JUG 2013.02.26

Apache Cassandra http://cassandra.apache.org Benoit Perroud Software Engineer @Verisign & Apache Committer Geneva JUG, 26.02.2013


Page 1: Apache Cassandra @Geneva JUG 2013.02.26

Apache Cassandra

http://cassandra.apache.org

Benoit Perroud

Software Engineer @Verisign & Apache Committer

Geneva JUG, 26.02.2013

Page 2: Apache Cassandra @Geneva JUG 2013.02.26

Agenda

• NoSQL Quick Overview

• Apache Cassandra Fundamentals

– Design principles

– Data & Query Model

• Real-Life Use Cases

– Illustrated in CQL3

• What's new in 1.2

• Conclusion

• Q & A

Page 3: Apache Cassandra @Geneva JUG 2013.02.26

NoSQL

• [Wikipedia] NoSQL is a term used to designate database management systems that differ from classic relational database management systems (RDBMS) in some way. These data stores may not require fixed table schemas, usually avoid join operations, do not attempt to provide ACID properties and typically scale horizontally.

• Pioneers: Google Bigtable, Amazon Dynamo, etc.

Page 4: Apache Cassandra @Geneva JUG 2013.02.26

Scalability

• [Wikipedia] Scalability is a desirable property of a system, a network, or a process, which indicates its ability to either handle growing amounts of work in a graceful manner or to be readily enlarged.

• Scalability in two dimensions:

– Scale up → scale vertically (increase RAM in an existing node)

– Scale out → scale horizontally (add a node to the cluster)

• In summary: handle load and peaks.

Page 5: Apache Cassandra @Geneva JUG 2013.02.26

Availability

• [Wikipedia] Availability refers to the ability of the users to access and use the system. If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.

• In summary: minimize downtime.

Page 6: Apache Cassandra @Geneva JUG 2013.02.26

CAP Theorem

• Consistency: all nodes see the same data at the same time

• Availability: node failures do not prevent survivors from continuing to operate

• Partition Tolerance: the system continues to operate despite arbitrary message loss

• According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three.

Page 7: Apache Cassandra @Geneva JUG 2013.02.26

NoSQL Promises

• Scale horizontally

– Double computational power or storage by doubling the size of the cluster. Shrinking the cluster should work the same way (tight provisioning)

– Add nodes to the cluster in constant time

• High availability

– No, few, or controlled single points of failure (SPoF)

• On commodity hardware

– 32 cores, 64 GB RAM, 12x2 TB HDD IS commodity hardware

• Let's see how Cassandra achieves all of these

Page 8: Apache Cassandra @Geneva JUG 2013.02.26

Apache Cassandra

• Apache Cassandra could be summarized as a scalable, distributed, sparse and eventually consistent hash map. But it's actually way more.

• Originally developed at Facebook and open-sourced in 2008, entered the Apache (ASF) Incubator in 2009, version 1.0 in 2011, version 1.2 in early 2013

• Inspired by Amazon Dynamo and Google Bigtable

• Version at the time of this talk: 1.2.2

• Under active development by several companies: DataStax, Acunu, Netflix, Twitter, Rackspace, …

Page 9: Apache Cassandra @Geneva JUG 2013.02.26

Apache Cassandra is a scalable, distributed, sparse, eventually consistent hash map

• Gossip protocol (spreading states like a rumor)

• Consistent hashing

– Each node is responsible for a key range and its replica sets

• Replication factor (RF) to achieve persistence

• No single point of failure

• Key space is 128 bits wide (on the order of 2^128 possible values)

[Figure: token ring covering 100% of the key space (0 to 100). Each new node takes half of the key range of the most loaded node: the first split is at 50, the next nodes land at 25 and 75, then at 12, 37, 62 and 87. More on this later with vnodes.]
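The ring above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Ring` class, the node names and the toy [0, 100) key space are hypothetical and mirror the slide's figure, not Cassandra's actual implementation.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: a key belongs to the first node whose token is >= the key's token."""

    def __init__(self, node_tokens):
        # node_tokens: mapping of node name -> token in the toy [0, 100) key space
        self.entries = sorted((token, node) for node, token in node_tokens.items())
        self.tokens = [token for token, _ in self.entries]

    @staticmethod
    def token_for(key):
        # Hash the key, then fold the digest into the toy key space.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % 100

    def replicas_for_token(self, token, rf):
        # Owner first, then the next rf - 1 nodes clockwise (wrapping around the ring).
        start = bisect.bisect_left(self.tokens, token)
        count = min(rf, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(count)]

    def replicas(self, key, rf):
        return self.replicas_for_token(self.token_for(key), rf)


ring = Ring({"n1": 0, "n2": 25, "n3": 50, "n4": 75})
print(ring.replicas_for_token(30, rf=2))  # owner of token 30 and its successor
print(ring.replicas("some-key", rf=2))
```

Adding a node only moves the keys between its token and its predecessor's token, which is why the slide can split the most loaded range without reshuffling the rest of the ring.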

Page 10: Apache Cassandra @Geneva JUG 2013.02.26

Apache Cassandra is a scalable, distributed, sparse, eventually consistent hash map

• Schemaless

– A schema (metadata) may be defined for convenience

– Column names are stored with every row

• [Wikipedia] A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
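To make the Bloom filter definition concrete, here is a minimal sketch. The class name, the sizes and the use of SHA-256 with a seed prefix are illustrative assumptions; Cassandra's own filter is implemented differently.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "maybe present", False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))
```

A "definitely absent" answer is what lets a read skip an SSTable without touching the disk.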

Page 11: Apache Cassandra @Geneva JUG 2013.02.26

Apache Cassandra is a scalable, distributed, sparse, eventually consistent hash map

• [Wikipedia] A quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system. A quorum-based technique is implemented to enforce consistent operation in a distributed system.

• Quorum: W + R > N

– N: number of replicas, R: number of nodes read, W: number of nodes written.

– Condition met when:

• R = 1, W = N

• R = N, W = 1

• R = W = ⌊N/2⌋ + 1 (a majority of the replicas)
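The quorum condition is simple arithmetic and can be checked directly. The helper names below are hypothetical; the formulas are the ones from the slide.

```python
def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def overlaps(n, r, w):
    """True when every read set intersects every write set (W + R > N)."""
    return r + w > n


for n in (1, 2, 3, 4, 5):
    q = quorum(n)
    print(f"N={n}: quorum={q}, R=W=quorum overlaps: {overlaps(n, q, q)}")
```

With N = 3, reading and writing at quorum (R = W = 2) satisfies the condition, while R = W = 1 does not: a read at ONE may hit the single replica that has not yet received the write.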

Page 12: Apache Cassandra @Geneva JUG 2013.02.26

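Apache Cassandra is a scalable, distributed, sparse, eventually consistent hash map

• Key space [0,99], previously put(22, 1, t1)

• Replication factor 2

• Consistency: ONE

[Figure: five-node ring with tokens 0, 20, 40, 60 and 80. A client sends Put(22, 2, t2) to a coordinator node; the figure labels the coordinator, the owner and the replica, with one of the writes marked as an async put(22, 2, t2).]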

Page 13: Apache Cassandra @Geneva JUG 2013.02.26

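Apache Cassandra is a scalable, distributed, sparse, eventually consistent hash map

• Key space [0,99], previously put(13, 1, t1)

• Replication factor 3

• Consistency: QUORUM (R = 2, W = 2)

[Figure: five-node ring with tokens 0, 20, 40, 60 and 80. Put(13, 2, t2) reaches two replicas. A later quorum read gets Read(13) = 2, t2 from one replica and Read(13) = 1, t1 from another; the newer timestamp wins and a read repair updates the stale replica.]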
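The quorum read with read repair can be sketched as follows. Replicas are modelled as plain dictionaries mapping a key to a `(value, timestamp)` pair; the function name and this representation are assumptions made for the example, not Cassandra's API.

```python
def read_with_repair(replicas, key, r):
    """Read key from the first r replicas, return the newest version, repair stale copies."""
    contacted = replicas[:r]
    answers = [replica[key] for replica in contacted if key in replica]
    if not answers:
        return None
    # The version with the highest timestamp wins.
    newest = max(answers, key=lambda version: version[1])
    for replica in contacted:
        if replica.get(key) != newest:
            replica[key] = newest  # read repair: push the newest version back
    return newest


# RF = 3: two replicas received put(13, 2, t2), one still holds put(13, 1, t1).
fresh = {13: (2, 2)}
stale = {13: (1, 1)}
other = {13: (2, 2)}

print(read_with_repair([fresh, stale, other], 13, r=2))
print(stale[13])
```

Because W = 2 and R = 2 overlap on at least one replica, the read always sees the value written at t2, and the stale replica is brought up to date as a side effect.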

Page 14: Apache Cassandra @Geneva JUG 2013.02.26

Apache Cassandra is a scalable, distributed, sparse, eventually consistent hash map

• Can be seen as a multilevel map: Map of SortedMap of Objects

• Keyspace > ColumnFamily > row > column name = value

– # use keyspace1;

– # set ColumnFamily1['key1']['columnName1'] = 'value1';

– # get ColumnFamily1['key1']['columnName1'];

Page 15: Apache Cassandra @Geneva JUG 2013.02.26

Data Model : Keyspace

• Equivalent to a database name in the SQL world

• Defines the replication factor and the network topology

– The network topology can span multiple datacenters

– The replication factor can be defined per datacenter

Page 16: Apache Cassandra @Geneva JUG 2013.02.26

Data Model : Column Family

• Equivalent to a table name in the SQL world

– Term may change in upcoming releases to stop confusing users

• Defines

– Type of the keys

– Column name comparator

– Additional metadata (types of certain known columns)

Page 17: Apache Cassandra @Geneva JUG 2013.02.26

Data Model : Row

• Defined by its key.

– Eventually stored on a node and its replicas

• Keys are typed

• Two partitioner strategies map keys onto the key space

– Random partitioner

• md5(key) or murmur3(key), evenly distributes keys across nodes

– Byte Ordered partitioner

• Keeps key order while iterating through the keys, may lead to hot spots
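The difference between the two partitioner strategies shows up as soon as keys are sorted by token. The sketch below is a simplification: the function names are hypothetical, and the real random partitioner post-processes the MD5 digest differently.

```python
import hashlib


def random_token(key):
    # Hash-based token: neighbouring keys end up far apart on the ring.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def byte_ordered_token(key):
    # Order-preserving token: the raw bytes of the key.
    return key.encode("utf-8")


keys = ["a", "b", "c", "d", "e", "f"]
print(sorted(keys, key=random_token))        # scattered order, even load
print(sorted(keys, key=byte_ordered_token))  # key order preserved, range scans possible
```

With the hash-based token a key range query is meaningless, since adjacent keys live on unrelated nodes; with the order-preserving token, sequential keys pile up on the same node, which is where hot spots come from.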

Page 18: Apache Cassandra @Geneva JUG 2013.02.26

Data Model : Column Name

• Could be seen as a column in the SQL world

• Does not have to be declared

– If declared, the corresponding values are typed and can carry a secondary index

• Ordered

• Column names are often used as values

[Figure: one row of a column family, with labels for the row key, the column names and the values. Cell contents shown: Event1; 24.04.2012, 07:00, 08:00; 239, 255.]

Page 19: Apache Cassandra @Geneva JUG 2013.02.26

Data Model : Value

• Can be typed, seen as array of bytes otherwise

• Existing types include

– Bytes

– Strings (ASCII or UTF-8 strings)

– Integer, Long, Float, Double, Decimal

– UUID, dates

– Counters (of long)

• Can expire

• No foreign keys (!)


Page 20: Apache Cassandra @Geneva JUG 2013.02.26

Write path

1. Write to commit log

2. Update MemTable

3. Acknowledge the client

4. When MemTable reaches a threshold, flush to disk as SSTable

[Figure: in memory, one MemTable per column family (CF1, CF2, …, CFn). On disk, the commit log plus, for each column family, several SSTables, each made of a Bloom filter, an index and data.]
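The four steps of the write path can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: the `ColumnFamilyStore` name, the list used as a commit log and the entry-count flush threshold are not Cassandra's real structures.

```python
class ColumnFamilyStore:
    """Toy write path: commit log, then MemTable, then flush to an immutable SSTable."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only, stands in for the on-disk commit log
        self.memtable = {}     # key -> (value, timestamp)
        self.sstables = []     # immutable, key-sorted snapshots of flushed MemTables
        self.flush_threshold = flush_threshold

    def write(self, key, value, timestamp):
        # 1. Append to the commit log so the write survives a crash.
        self.commit_log.append((key, value, timestamp))
        # 2. Update the MemTable, keeping the newest version of the key.
        current = self.memtable.get(key)
        if current is None or timestamp >= current[1]:
            self.memtable[key] = (value, timestamp)
        # 4. Flush once the MemTable is full (done inline here for simplicity).
        if len(self.memtable) >= self.flush_threshold:
            self.flush()
        # 3. Acknowledge the client.
        return True

    def flush(self):
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}


store = ColumnFamilyStore(flush_threshold=2)
store.write("k1", "v1", timestamp=1)
store.write("k2", "v2", timestamp=2)
print(store.sstables)
```

No read or random disk seek happens on this path, only a sequential append and a memory update, which is why writes are cheap.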

Page 21: Apache Cassandra @Geneva JUG 2013.02.26

Read path

• Versions of the same column can exist in several places at the same time

– In the MemTable

– In the MemTable being flushed

– In one or multiple SSTables

• All versions are read, then resolved / merged using their timestamps

– Key and row caches

– Bloom filters allow skipping unnecessary SSTables

– SSTables are indexed

– Compaction keeps things reasonable

[Figure: in memory, one MemTable per column family (CF1, CF2, …, CFn). On disk, the commit log plus, for each column family, several SSTables, each made of a Bloom filter, an index and data.]
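Resolving the versions found on the read path comes down to keeping the one with the highest timestamp. In this sketch the MemTable and the SSTables are plain dictionaries of `key -> (value, timestamp)`; the representation and the function name are assumptions for the example.

```python
def read(key, memtable, sstables):
    """Collect every version of key and return the one with the newest timestamp."""
    versions = []
    if key in memtable:
        versions.append(memtable[key])
    for sstable in sstables:
        # A Bloom filter check would normally rule out most SSTables before this lookup.
        if key in sstable:
            versions.append(sstable[key])
    if not versions:
        return None
    return max(versions, key=lambda version: version[1])


memtable = {"k1": ("newest", 30)}
sstables = [{"k1": ("oldest", 10)}, {"k1": ("older", 20), "k2": ("only", 5)}]
print(read("k1", memtable, sstables))
print(read("k2", memtable, sstables))
```

The more SSTables hold a version of the key, the more places a read has to look, which is the cost that compaction keeps in check.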

Page 22: Apache Cassandra @Geneva JUG 2013.02.26

Compaction

• Runs regularly as a background operation

• Merges SSTables together

• Removes expired and deleted values

• Has an impact on general I/O availability (and thus performance)

– This is where most of the tuning happens

– Can be throttled

• Two types of compaction

– Size-tiered

• Lower I/O consumption: suits write-heavy workloads

– Leveled

• Guarantees reads from fewer SSTables: suits read-heavy workloads

• See http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra for complete details.
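The merge step of compaction can be sketched as follows. This is a deliberate simplification: the `TOMBSTONE` marker and the function name are hypothetical, and real compaction keeps deletion markers for a grace period and picks the SSTables to merge according to the strategy (size-tiered or leveled).

```python
TOMBSTONE = object()  # hypothetical marker standing for a deleted value


def compact(sstables):
    """Merge several SSTables into one, keeping the newest version of each key."""
    merged = {}
    for sstable in sstables:
        for key, (value, timestamp) in sstable.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    # Drop keys whose newest version is a deletion, and keep the output sorted by key.
    return {
        key: version
        for key, version in sorted(merged.items())
        if version[0] is not TOMBSTONE
    }


old = {"a": ("a1", 1), "b": ("b1", 1), "c": ("c1", 1)}
new = {"a": ("a2", 2), "b": (TOMBSTONE, 2)}
print(compact([old, new]))
```

After the merge, a read for any of these keys touches one SSTable instead of two.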

Page 23: Apache Cassandra @Geneva JUG 2013.02.26

Query Model

• Thrift API

– CLI

– Higher level third-party libraries

• Hector

• Pycassa

• Phpyandra

• Astyanax

• Helenus

• CQL (Cassandra Query Language)

– CQL3 newly released with C* 1.2

Page 24: Apache Cassandra @Geneva JUG 2013.02.26

Query Model

• Cassandra is more than a key-value store.

– Get

– Put

– Delete

– Update

– But also various range queries

• Key range

• Column range (slice)

– Secondary indexes

Page 25: Apache Cassandra @Geneva JUG 2013.02.26

Query Model : Get

• Get single key

– Give me key 'a'

• Get multiple keys

– Give me rows for keys 'a', 'c', 'd' and 'f'

[Figure: sparse table with column names '1' to '5', ordered by the column name comparator. Rows appear in RandomPartitioner order: 'c' (8, 9, 10, 11), 'e' (12, 13, 14), 'f' (15, 16, 17), 'a' (18), 'b' (19, 20, 20), 'd' (22, 23, 24, 25, 26).]

Page 26: Apache Cassandra @Geneva JUG 2013.02.26

Query Model : Get Range

• Range

– Query for a range of keys

• Give me all rows with keys between 'c' and 'f'.

• Mind the partitioner.

[Figure: sparse table with column names '1' to '5'. Rows in RandomPartitioner order: 'c' (8, 9, 10, 11), 'e' (12, 13, 14), 'f' (15, 16, 17), 'a' (18), 'b' (19, 20, 20), 'd' (22, 23, 24, 25, 26).]

Page 27: Apache Cassandra @Geneva JUG 2013.02.26

Query Model : Get Slice

• Slice

– Query for a slice of columns

• For key 'c', give me all columns between '3' and '5'

• For key 'd', give me all columns between '3' and '5'

[Figure: sparse table with column names '1' to '5'. Rows in RandomPartitioner order: 'c' (8, 9, 10, 11), 'e' (12, 13, 14), 'f' (15, 16, 17), 'a' (18), 'b' (19, 20, 20), 'd' (22, 23, 24, 25, 26).]

Page 28: Apache Cassandra @Geneva JUG 2013.02.26

Query Model : Get Range Slice

• Range and Slice can be combined: rangeSliceQuery

– For keys between 'b' and 'd', give me columns between '2' and '4'

[Figure: sparse table with column names '1' to '5'. Rows in key order: 'a' (8, 9, 10, 11), 'b' (12, 13, 14), 'c' (15, 16, 17), 'd' (18), 'e' (19, 20, 20), 'f' (22, 23, 24, 25, 26).]
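A range slice can be sketched on a nested dictionary. Two assumptions to keep in mind: the column positions of the values in the slide's table are not recoverable from the transcript, so the sparse layout below is illustrative, and the key range only makes sense with an order-preserving partitioner. The function name is hypothetical.

```python
rows = {
    "a": {"1": 8, "2": 9, "3": 10, "4": 11},
    "b": {"2": 12, "3": 13, "4": 14},
    "c": {"1": 15, "3": 16, "5": 17},
    "d": {"5": 18},
    "e": {"1": 19, "2": 20, "4": 20},
    "f": {"1": 22, "2": 23, "3": 24, "4": 25, "5": 26},
}


def range_slice(rows, key_start, key_end, col_start, col_end):
    """For keys in [key_start, key_end], return the columns in [col_start, col_end]."""
    return {
        key: {
            name: value
            for name, value in sorted(columns.items())
            if col_start <= name <= col_end
        }
        for key, columns in sorted(rows.items())
        if key_start <= key <= key_end
    }


print(range_slice(rows, "b", "d", "2", "4"))
```

A row whose key is in range but which has no column in the slice (here 'd') comes back with an empty column set, so callers should be prepared to skip it.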

Page 29: Apache Cassandra @Geneva JUG 2013.02.26

Query Model : Secondary Index

• Secondary Index

– Give me all rows where the value for column '2' is '12'

[Figure: sparse table with column names '1' to '5'. Rows in key order: 'a' (8, 9, 10, 11), 'b' (12, 13, 14), 'c' (15, 16, 17), 'd' (18), 'e' (19, 20, 20), 'f' (22, 23, 24, 25, 26).]

Page 30: Apache Cassandra @Geneva JUG 2013.02.26

Real Life Use Case : Doodle Clone

• Live demo: http://doodle.noisette.ch

• Data model

– Polls { id, label, [choices], email, limit, [ subscribers ] }

• Id generation

– TimeUUID is your friend

• Avoid super column families

– Use composite columns, or CQL3

• Subscriber's name uniqueness per poll?

– Cassandra anti-pattern (read after write)

• Limit to n subscribers per option?

– Cassandra anti-pattern (read after write)
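A TimeUUID is a version 1 UUID, which embeds its creation time, so ids can be generated on any client without coordination. The sketch below uses Python's standard `uuid` module; the helper name is hypothetical, and the sample id is the one used in the CQL statements of this talk.

```python
import uuid
from datetime import datetime, timezone

# Number of 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(value):
    """Extract the embedded creation time of a version 1 UUID."""
    seconds = (value.time - UUID_EPOCH_OFFSET) / 1e7
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


poll_id = uuid.uuid1()  # new time-based id, generated client-side
sample = uuid.UUID("eba080a0-8011-11e2-9e96-0800200c9a66")

print(poll_id, poll_id.version)
print(timeuuid_to_datetime(sample))
```

Because the timestamp is part of the id, sorting TimeUUIDs by time also gives a creation order for free.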

Page 31: Apache Cassandra @Geneva JUG 2013.02.26

Real Life Use Case : Doodle Clone

CREATE KEYSPACE Doodle
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE doodle;

CREATE TABLE Polls (
  id uuid,
  label text,
  choices list<text>,
  email text,
  maxChoices int,
  subscribers list<text>,
  PRIMARY KEY (id)
) WITH compaction = { 'class' : 'LeveledCompactionStrategy' }
  AND read_repair_chance = 0.0;

INSERT INTO Polls (id, label, email, choices)
  VALUES (eba080a0-8011-11e2-9e96-0800200c9a66, 'Test poll1', '[email protected]',
          ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']);

UPDATE Polls SET subscribers = subscribers + [ 'Benoit' ]
  WHERE id = eba080a0-8011-11e2-9e96-0800200c9a66;

UPDATE Polls SET subscribers = subscribers + [ 'Maxime', 'Nicolas' ]
  WHERE id = eba080a0-8011-11e2-9e96-0800200c9a66;

DELETE subscribers[0] FROM Polls WHERE id = eba080a0-8011-11e2-9e96-0800200c9a66;

Page 32: Apache Cassandra @Geneva JUG 2013.02.26

Real Life Use Case : Heavy Writes

• Cassandra is a really good fit when the read / write ratio is close to 0

– Event logging / redo logs

– Time series

• Best practice: write data in its raw format AND in aggregated forms at the same time

• But this needs compaction tuning

– {min,max}_compaction_threshold

– memtable_flush_writers

– … no magic solution here, only a pragmatic approach

• Change the configuration on one node, and measure the difference (load, latency, …)

Page 33: Apache Cassandra @Geneva JUG 2013.02.26

Real Life Use Case : Counters

• Cassandra >= 0.8 (CASSANDRA-1072)

CREATE TABLE Events (id uuid, count counter, PRIMARY KEY (id));

UPDATE Events SET count = count + 1 WHERE id = 95b64d72-8014-11e2-9e96-0800200c9a66;

• Example

– Query per entity: number of hits for 'entity1' between 18:30:00 and 19:00:00

counterCF['entity1'][2012-06-14 18:30:00]

counterCF['entity1'][2012-06-14 18:30:05]

counterCF['entity1'][2012-06-14 18:30:10]

counterCF['entity2'][2012-06-14 18:30:05]

– Query per date range: all entities being hit between 18:30:00 and 19:00:00 (! needs complete date enumeration)

counterCF[2012-06-14 18:30:00]['entity1']

counterCF[2012-06-14 18:30:00]['entity2']

counterCF[2012-06-14 18:30:00]['entity3']

counterCF[2012-06-14 18:30:05]['entity1']
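When the time bucket is the row key, a date-range query has to enumerate every bucket in the range and fetch each row by key. A minimal sketch of that enumeration follows; the function name is hypothetical and the 5-second step is inferred from the example keys above.

```python
from datetime import datetime, timedelta


def bucket_keys(start, end, step_seconds=5):
    """List every time-bucket row key in [start, end)."""
    keys = []
    current = start
    while current < end:
        keys.append(current.strftime("%Y-%m-%d %H:%M:%S"))
        current += timedelta(seconds=step_seconds)
    return keys


keys = bucket_keys(datetime(2012, 6, 14, 18, 30, 0), datetime(2012, 6, 14, 19, 0, 0))
print(len(keys), keys[0], keys[-1])
```

Half an hour at 5-second granularity already means 360 row lookups, so the bucket size is a trade-off between counter precision and query fan-out.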

Page 34: Apache Cassandra @Geneva JUG 2013.02.26

Real Life Use Case : Bulk Loading

• Data is transformed (e.g. using MapReduce)

• Data is bulk loaded

– ColumnFamilyOutputFormat (< v1.1)

• Not real bulk loading

– BulkOutputFormat (>= v1.1)

• SSTables are generated during the transformation, then streamed

• Prefer Leveled Compaction Strategy

– Reduces read latency

– Size sstable_size_in_mb to your data


Page 36: Apache Cassandra @Geneva JUG 2013.02.26

Real Life Use Case : λ Architecture

• Enabling real-time queries to end-users

– “Hybrid Approach to Enable Real-Time Queries to End-Users”,

Software Developer Journal February 2013


Page 37: Apache Cassandra @Geneva JUG 2013.02.26

What's New in 1.2

• CQL3

– http://cassandra.apache.org/doc/cql3/CQL.html

• Virtual Nodes (vnodes)

• Atomic batches

• Murmur3Partitioner

• Off-heap SSTable metadata

• Query tracing

• … a lot more …

Illustrations credits to Datastax, http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes

Page 38: Apache Cassandra @Geneva JUG 2013.02.26

Conclusion

• Cassandra is not a general purpose solution

• But Cassandra does a really good job when used for the right workloads

– Really good scalability

• Netflix's 1M writes/s on AWS: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

– Low operational cost

• Admin friendly, no SPoF, vnodes, snapshots, …

– Advanced data and query model

Page 39: Apache Cassandra @Geneva JUG 2013.02.26

Thanks for your attention

• Questions?

[email protected]

@killerwhile

• No? Cool … Apéro
