NOSQL and Cassandra

Introduction to NOSQL And Cassandra @rantav @outbrain

Transcript of NOSQL and Cassandra

Page 1: NOSQL and Cassandra

Introduction to NOSQL and Cassandra

@rantav @outbrain

Page 2: NOSQL and Cassandra

SQL is good

• Rich language
• Easy to use and integrate
• Rich toolset
• Many vendors

• The promise: ACID
  o Atomicity
  o Consistency
  o Isolation
  o Durability

Page 3: NOSQL and Cassandra

SQL Rules

Page 4: NOSQL and Cassandra

BUT

Page 5: NOSQL and Cassandra

The Challenge: Modern web apps

• Internet-scale data size
• High read-write rates
• Frequent schema changes
• "Social" apps - not banks
  o They don't need the same level of ACID

SCALING

Page 6: NOSQL and Cassandra

Scaling Solutions - Replication

Scales Reads

Page 7: NOSQL and Cassandra

Scaling Solutions - Sharding

Scales also Writes

Page 8: NOSQL and Cassandra

Brewer's CAP Theorem: 

You can only choose two

Page 9: NOSQL and Cassandra

CAP

Page 10: NOSQL and Cassandra

Availability + Partition Tolerance (no Consistency)

Page 11: NOSQL and Cassandra

Existing NOSQL Solutions

Page 12: NOSQL and Cassandra

Taxonomy of NOSQL data stores

• Document Oriented
  o CouchDB, MongoDB, Lotus Notes, SimpleDB, Orient
• Key-Value
  o Voldemort, Dynamo, Riak (sort of), Redis, Tokyo
• Column
  o Cassandra, HBase, BigTable
• Graph Databases
  o Neo4J, FlockDB, DEX, AllegroGraph

http://en.wikipedia.org/wiki/NoSQL

Page 13: NOSQL and Cassandra

 

• Developed at Facebook

• Follows the BigTable Data Model - column oriented

• Follows the Dynamo Eventual Consistency model

• Open-sourced at Apache

• Implemented in Java

Page 14: NOSQL and Cassandra

N/R/W

• N - Number of replicas (nodes) for any data item

• W - Number of nodes a write operation blocks on

• R - Number of nodes a read operation blocks on

CONSISTENCY DOWN TO EARTH

Page 15: NOSQL and Cassandra

N/R/W - Typical Values

• W=1 => Block until first node written successfully
• W=N => Block until all nodes written successfully
• W=0 => Async writes

• R=1 => Block until the first node returns an answer
• R=N => Block until all nodes return an answer
• R=0 => Doesn't make sense

• QUORUM:
  o R = N/2+1
  o W = N/2+1
  o => Fully consistent
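Why QUORUM reads and writes are consistent: when R + W > N, every read set overlaps every write set in at least one replica. A minimal sketch of that arithmetic follows; the class and method names are illustrative assumptions, not part of any Cassandra API.

```java
// Sketch only: QuorumSketch and its methods are hypothetical names for illustration.
final class QuorumSketch {

    /** Quorum size for N replicas: floor(N/2) + 1. */
    static int quorum(int n) {
        return n / 2 + 1;
    }

    /** A read is guaranteed to touch at least one up-to-date replica when R + W > N. */
    static boolean readsSeeLatestWrite(int n, int r, int w) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);
        System.out.println("N=" + n + " quorum=" + q);
        System.out.println("QUORUM reads + QUORUM writes overlap: " + readsSeeLatestWrite(n, q, q));
        System.out.println("R=1, W=1 overlap: " + readsSeeLatestWrite(n, 1, 1));
    }
}
```

With N=3 the quorum is 2, so R=2 and W=2 overlap, while R=1 and W=1 do not and give only eventual consistency.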

Page 16: NOSQL and Cassandra

Data Model - Forget SQL

Do you know SQL?

Page 17: NOSQL and Cassandra

Data Model - Vocabulary

• Keyspace – like namespace for unique keys.

• Column Family – very much like a table… but not quite.

• Key – a key that represents a row (of columns)

• Column – representation of a value with:
  o Column name
  o Value
  o Timestamp

• Super Column – a column that holds a list of columns inside

Page 18: NOSQL and Cassandra

Data Model - Columns

struct Column {
   1: required binary name,
   2: optional binary value,
   3: optional i64 timestamp,
   4: optional i32 ttl,
}

JSON-ish notation:
{
  "name":      "emailAddress",
  "value":     "[email protected]",
  "timestamp": 123456789
}

Page 19: NOSQL and Cassandra

Data Model - Column Family

• Similar to SQL tables
• Has many columns
• Has many rows

Page 20: NOSQL and Cassandra

Data Model - Rows

• Primary key for objects
• All keys are arbitrary length binaries

Users:                                 CF
    ran:                               ROW
        emailAddress: [email protected],     COLUMN
        webSite: http://bar.com        COLUMN
    f.rat:                             ROW
        emailAddress: [email protected]  COLUMN
Stats:                                 CF
    ran:                               ROW
        visits: 243                    COLUMN

Page 21: NOSQL and Cassandra

Data Model - Songs example

Songs:
    Meir Ariel:
        Shir Keev: 6:13
        Tikva: 4:11
        Erol: 6:17
        Suetz: 5:30
        Dr Hitchakmut: 3:30
    Mashina:
        Rakevet Layla: 3:02
        Optikai: 5:40

Page 22: NOSQL and Cassandra

Data Model - Super Columns

• Columns whose values are lists of columns

Page 23: NOSQL and Cassandra

Data Model - Super Columns

Songs:
    Meir Ariel:
        Shirey Hag:
            Shir Keev: 6:13
            Tikva: 4:11
            Erol: 6:17
        Vegluy Eynaim:
            Suetz: 5:30
            Dr Hitchakmut: 3:30
    Mashina:
        ...
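One way to picture the data model is as nested sorted maps: a column family maps a row key to a sorted map of columns, and a super column family adds one more level. The sketch below models the Songs examples that way. It is an assumption-laden illustration: the class and helper names are made up, strings stand in for Cassandra's binary keys and names, and timestamps are left out.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch only: names are illustrative, not Cassandra or Hector APIs.
final class DataModelSketch {

    /** Column family: row key -> (column name -> value), columns kept sorted by name. */
    static Map<String, TreeMap<String, String>> songs() {
        Map<String, TreeMap<String, String>> cf = new TreeMap<>();
        put(cf, "Meir Ariel", "Shir Keev", "6:13");
        put(cf, "Meir Ariel", "Tikva", "4:11");
        put(cf, "Meir Ariel", "Erol", "6:17");
        put(cf, "Meir Ariel", "Suetz", "5:30");
        put(cf, "Meir Ariel", "Dr Hitchakmut", "3:30");
        put(cf, "Mashina", "Rakevet Layla", "3:02");
        put(cf, "Mashina", "Optikai", "5:40");
        return cf;
    }

    /** Super column family: row key -> super column -> (column name -> value). */
    static Map<String, TreeMap<String, TreeMap<String, String>>> songsByAlbum() {
        Map<String, TreeMap<String, TreeMap<String, String>>> scf = new TreeMap<>();
        putSuper(scf, "Meir Ariel", "Shirey Hag", "Shir Keev", "6:13");
        putSuper(scf, "Meir Ariel", "Shirey Hag", "Tikva", "4:11");
        putSuper(scf, "Meir Ariel", "Shirey Hag", "Erol", "6:17");
        putSuper(scf, "Meir Ariel", "Vegluy Eynaim", "Suetz", "5:30");
        putSuper(scf, "Meir Ariel", "Vegluy Eynaim", "Dr Hitchakmut", "3:30");
        return scf;
    }

    private static void put(Map<String, TreeMap<String, String>> cf,
                            String row, String column, String value) {
        cf.computeIfAbsent(row, k -> new TreeMap<>()).put(column, value);
    }

    private static void putSuper(Map<String, TreeMap<String, TreeMap<String, String>>> scf,
                                 String row, String superColumn, String column, String value) {
        scf.computeIfAbsent(row, k -> new TreeMap<>())
           .computeIfAbsent(superColumn, k -> new TreeMap<>())
           .put(column, value);
    }

    public static void main(String[] args) {
        System.out.println(songs().get("Meir Ariel"));
        System.out.println(songsByAlbum().get("Meir Ariel").get("Shirey Hag"));
    }
}
```

Reading a single column is then two lookups (row, column) in a column family and three (row, super column, column) in a super column family.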

Page 24: NOSQL and Cassandra

The API - Read

get
get_slice
get_count
multiget
multiget_slice
get_range_slices
get_indexed_slices

Page 25: NOSQL and Cassandra

The True API

get(keyspace, key, column_path, consistency)
get_slice(ks, key, column_parent, predicate, consistency)
multiget(ks, keys, column_path, consistency)
multiget_slice(ks, keys, column_parent, predicate, consistency)
...

Page 26: NOSQL and Cassandra

The API - Write

insert
add
remove
remove_counter
batch_mutate

Page 27: NOSQL and Cassandra

The API - Meta

describe_schema_versions
describe_keyspaces
describe_cluster_name
describe_version
describe_ring
describe_partitioner
describe_snitch

Page 28: NOSQL and Cassandra

The API - DDL

system_add_column_family
system_drop_column_family
system_add_keyspace
system_drop_keyspace
system_update_keyspace
system_update_column_family

Page 29: NOSQL and Cassandra

The API - CQL

execute_cql_query

cqlsh> SELECT key, state FROM users;

cqlsh> INSERT INTO users (key, full_name, birth_date, state) VALUES ('bsanderson', 'Brandon Sanderson', 1975, 'UT');

Page 30: NOSQL and Cassandra

Consistency Model

• N - per keyspace
• R - per read request
• W - per write request

Page 31: NOSQL and Cassandra

Consistency Model

Cassandra defines:

enum ConsistencyLevel {
    ONE,
    QUORUM,
    LOCAL_QUORUM,
    EACH_QUORUM,
    ALL,
    ANY,
    TWO,
    THREE,
}

Page 32: NOSQL and Cassandra

Java Code

// Open a Thrift connection to a local node.
TTransport tr = new TSocket("localhost", 9160);
TProtocol proto = new TBinaryProtocol(tr);
Cassandra.Client client = new Cassandra.Client(proto);
tr.open();

String key_user_id = "1";

// The timestamp is used to order conflicting writes to the same column.
long timestamp = System.currentTimeMillis();
client.insert("Keyspace1",
              key_user_id,
              new ColumnPath("Standard1",
                             null,
                             "name".getBytes("UTF-8")),
              "Chris Goffinet".getBytes("UTF-8"),
              timestamp,
              ConsistencyLevel.ONE);

Page 33: NOSQL and Cassandra

Java Client - Hector
http://github.com/rantav/hector

• The de-facto Java client for Cassandra
• Encapsulates Thrift
• Adds JMX (monitoring)
• Connection pooling
• Failover
• Open-sourced on GitHub, with a growing community of developers and users

Page 34: NOSQL and Cassandra

Java Client - Hector - cont

  /**
   * Insert a new value keyed by key
   *
   * @param key   Key for the value
   * @param value the String value to insert
   */
  public void insert(final String key, final String value) {
    Mutator m = createMutator(keyspaceOperator);
    m.insert(key,
             CF_NAME,
             createColumn(COLUMN_NAME, value));
  }

Page 35: NOSQL and Cassandra

Java Client - Hector - cont

  /**
   * Get a string value.
   *
   * @return The string value; null if no value exists for the given key.
   */
  public String get(final String key) throws HectorException {
    ColumnQuery<String, String> q =
        createColumnQuery(keyspaceOperator, serializer, serializer);
    Result<HColumn<String, String>> r = q.setKey(key).
        setName(COLUMN_NAME).
        setColumnFamily(CF_NAME).
        execute();
    HColumn<String, String> c = r.get();
    return c == null ? null : c.getValue();
  }

Page 36: NOSQL and Cassandra

Extra

If you're not snoring yet...

Page 37: NOSQL and Cassandra

Sorting

Columns are sorted by their type
• BytesType
• UTF8Type
• AsciiType
• LongType
• LexicalUUIDType
• TimeUUIDType

Rows are sorted by their Partitioner
• RandomPartitioner
• OrderPreservingPartitioner
• CollatingOrderPreservingPartitioner
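The column type matters because it decides the order in which columns come back from a slice. The sketch below contrasts two orderings on the same column names. It is only an approximation: the two comparators are hypothetical stand-ins for UTF8Type and LongType, and they compare Java strings and parsed decimal numbers rather than the raw bytes Cassandra actually stores.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Sketch only: these comparators imitate, but are not, Cassandra's comparator types.
final class ComparatorSketch {

    /** Stand-in for a text ordering such as UTF8Type (exact for ASCII names). */
    static final Comparator<String> TEXT_ORDER = Comparator.naturalOrder();

    /** Stand-in for LongType: names are compared as numbers, not as text. */
    static final Comparator<String> LONG_ORDER =
            (a, b) -> Long.compare(Long.parseLong(a), Long.parseLong(b));

    /** Builds a row whose columns are kept sorted by the given comparator. */
    static TreeMap<String, String> row(Comparator<String> order, String... columnNames) {
        TreeMap<String, String> row = new TreeMap<>(order);
        for (String name : columnNames) {
            row.put(name, "");
        }
        return row;
    }

    public static void main(String[] args) {
        System.out.println(row(TEXT_ORDER, "9", "10", "100").keySet()); // text order
        System.out.println(row(LONG_ORDER, "9", "10", "100").keySet()); // numeric order
    }
}
```

Under the text ordering "10" sorts before "9"; under the numeric ordering 9 comes first, which is usually what a time-series or counter-style row wants.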

Page 38: NOSQL and Cassandra

Thrift

Cross-language protocol
Compiles to: C++, Java, PHP, Ruby, Erlang, Perl, ...

struct UserProfile {
    1: i32    uid,
    2: string name,
    3: string blurb
}

service UserStorage {
    void         store(1: UserProfile user),
    UserProfile  retrieve(1: i32 uid)
}

Page 39: NOSQL and Cassandra

Thrift

Generating sources:

thrift --gen java cassandra.thrift

thrift --gen py cassandra.thrift

Page 40: NOSQL and Cassandra

Internals

Page 41: NOSQL and Cassandra

Agenda

• Background and history
• Architectural Layers
• Transport: Thrift
• Write Path (and sstables, memtables)
• Read Path
• Compactions
• Bloom Filters
• Gossip
• Deletions
• More...

Page 42: NOSQL and Cassandra

Required Reading ;-)

BigTable http://labs.google.com/papers/bigtable.html

Dynamo http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

Page 43: NOSQL and Cassandra

From Dynamo:

• Symmetric p2p architecture
• Gossip based discovery and error detection
• Distributed key-value store
  o Pluggable partitioning
  o Pluggable topology discovery
• Eventually consistent, tunable per operation

Page 44: NOSQL and Cassandra

From BigTable

• Column-oriented sparse array
• SSTable disk storage
  o Append-only commit log
  o Memtable (buffering and sorting)
  o Immutable sstable files
  o Compactions
  o High write performance

Page 45: NOSQL and Cassandra

Architecture Layers

Cluster Management
• Messaging service
• Gossip
• Failure detection
• Cluster state
• Partitioner
• Replication

Single Host
• Commit log
• Memtable
• SSTable
• Indexes
• Compaction

Page 46: NOSQL and Cassandra

Write Path

Page 47: NOSQL and Cassandra

Writing

Page 48: NOSQL and Cassandra

Writing

Page 49: NOSQL and Cassandra

Writing

Page 50: NOSQL and Cassandra

Writing

Page 51: NOSQL and Cassandra

Memtables

• In-memory representation of recently written data
• When the table is full, it's sorted and then flushed to disk -> sstable

Page 52: NOSQL and Cassandra

SSTables

Sorted Strings Tables
• Immutable
• On-disk
• Sorted by a string key
• In-memory index of elements
• Binary search (in memory) to find element location
• Bloom filter to reduce number of unneeded binary searches.
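The memtable-to-sstable flow can be sketched in a few lines. This is a toy, single-node model under stated assumptions: all names are made up, the commit log, Bloom filter and disk I/O are left out, a sorted map stands in for the memtable (so it stays sorted rather than being sorted at flush time), and the newest table wins on reads, whereas Cassandra resolves conflicts by column timestamp.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Sketch only: a toy model of the write path, not Cassandra's implementation.
final class WritePathSketch {

    /** Immutable snapshot of a memtable, sorted by key and searched with binary search. */
    static final class SSTable {
        private final List<String> keys;
        private final List<String> values;

        SSTable(TreeMap<String, String> memtable) {
            // keySet() and values() of a TreeMap iterate in the same (sorted) key order.
            keys = Collections.unmodifiableList(new ArrayList<>(memtable.keySet()));
            values = Collections.unmodifiableList(new ArrayList<>(memtable.values()));
        }

        String get(String key) {
            int i = Collections.binarySearch(keys, key);
            return i >= 0 ? values.get(i) : null;
        }
    }

    private final int memtableLimit;
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<SSTable> sstables = new ArrayList<>(); // oldest first

    WritePathSketch(int memtableLimit) {
        this.memtableLimit = memtableLimit;
    }

    /** Writes only touch memory; a full memtable is flushed to a new sstable. */
    void write(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= memtableLimit) {
            sstables.add(new SSTable(memtable));
            memtable = new TreeMap<>();
        }
    }

    /** Reads check the memtable first, then the sstables from newest to oldest. */
    String read(String key) {
        String value = memtable.get(key);
        for (int i = sstables.size() - 1; value == null && i >= 0; i--) {
            value = sstables.get(i).get(key);
        }
        return value;
    }

    int sstableCount() {
        return sstables.size();
    }
}
```

Note that a write never reads or seeks, while a read may have to consult several sstables.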

Page 53: NOSQL and Cassandra

Write Properties

• No locks in the critical path
• Always available to writes, even if there are failures.
• No reads
• No seeks
• Fast
• Atomic within a row

Page 54: NOSQL and Cassandra

Read Path

Page 55: NOSQL and Cassandra

Reads

Page 56: NOSQL and Cassandra

Reading

Page 57: NOSQL and Cassandra

Reading

Page 58: NOSQL and Cassandra

Reading

Page 59: NOSQL and Cassandra

Reading

Page 60: NOSQL and Cassandra

Bloom Filters

• Space efficient probabilistic data structure
• Test whether an element is a member of a set
• Allow false positive, but not false negative
• k hash functions
• Union and intersection are implemented as bitwise OR, AND
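A minimal Bloom filter sketch showing the properties above: k bit positions per element, no false negatives, and union as a bitwise OR. The class is hypothetical, and the k positions are derived from String.hashCode() by simple double hashing, which is an assumption made for brevity and not the hash scheme Cassandra uses.

```java
import java.util.BitSet;

// Sketch only: an illustrative Bloom filter, not Cassandra's implementation.
final class BloomFilterSketch {
    private final BitSet bits;
    private final int size; // number of bits
    private final int k;    // number of hash functions

    BloomFilterSketch(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    /** i-th bit position for an item: h1 + i * h2, reduced into [0, size). */
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x5bd1e995; // cheap second hash (assumption)
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < k; i++) {
            bits.set(index(item, i));
        }
    }

    /** false means "definitely absent"; true means "possibly present". */
    boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    /** Union of two filters with the same size and k: bitwise OR of their bits. */
    BloomFilterSketch union(BloomFilterSketch other) {
        if (other.size != size || other.k != k) {
            throw new IllegalArgumentException("filters must share size and k");
        }
        BloomFilterSketch result = new BloomFilterSketch(size, k);
        result.bits.or(bits);
        result.bits.or(other.bits);
        return result;
    }
}
```

On the read path, a false from mightContain lets a node skip an sstable entirely; a true still requires the binary search, because it may be a false positive.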

Page 61: NOSQL and Cassandra

Read Properties

• Read multiple SSTables
• Slower than writes (but still fast)
• Seeks can be mitigated with more RAM
• Uses probabilistic bloom filters to reduce lookups.
• Extensive optional caching
  o Key Cache
  o Row Cache
  o Excellent monitoring

Page 62: NOSQL and Cassandra

Compactions

 

Page 63: NOSQL and Cassandra

Compactions

• Merge keys
• Combine columns
• Discard tombstones
• Use bloom filters bitwise OR operation

Page 64: NOSQL and Cassandra

Gossip

• p2p
• Enables seamless node addition.
• Rebalancing of keys
• Fast detection of nodes that go down.
• Every node knows about all others - no master.

Page 65: NOSQL and Cassandra

Deletions

• Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction
• Read repair complicates things a little
• Eventually consistent complicates things more
• Solution: configurable delay before tombstone GC, after which tombstones are not repaired
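Compaction and tombstone GC can be sketched together: merge several sstables, keep the newest version of each key, and drop tombstones only once they are older than the configured delay. Everything here is an illustrative assumption: the Cell type, the use of a null value as the tombstone marker, and the gcBefore parameter (the timestamp before which tombstones may be discarded) are not Cassandra's actual types.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: a toy compaction, not Cassandra's implementation.
final class CompactionSketch {

    /** A value with its write timestamp; a null value marks a tombstone. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    /**
     * Merges sstables into one sorted table. For each key the cell with the
     * highest timestamp wins; tombstones older than gcBefore are discarded.
     */
    static TreeMap<String, Cell> compact(List<Map<String, Cell>> sstables, long gcBefore) {
        TreeMap<String, Cell> merged = new TreeMap<>();
        for (Map<String, Cell> sstable : sstables) {
            for (Map.Entry<String, Cell> e : sstable.entrySet()) {
                merged.merge(e.getKey(), e.getValue(),
                        (a, b) -> a.timestamp >= b.timestamp ? a : b);
            }
        }
        // Recent tombstones must survive so they can still suppress stale replicas.
        merged.values().removeIf(c -> c.isTombstone() && c.timestamp < gcBefore);
        return merged;
    }
}
```

Keeping recent tombstones is what stops read repair from resurrecting data that a lagging replica has not yet seen deleted.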

Page 66: NOSQL and Cassandra

Extra Long list of subjects

• SEDA (Staged Event-Driven Architecture)
• Anti Entropy and Merkle Trees
• Hinted Handoff
• Read repair

Page 67: NOSQL and Cassandra

SEDA

• Mutate
• Stream
• Gossip
• Response
• Anti Entropy
• Load Balance
• Migration

Page 68: NOSQL and Cassandra

Anti Entropy and Merkle Trees

Page 69: NOSQL and Cassandra

Hinted Handoff

Page 70: NOSQL and Cassandra

References

• http://horicky.blogspot.com/2009/11/nosql-patterns.html
• http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
• http://labs.google.com/papers/bigtable.html
• http://bret.appspot.com/entry/how-friendfeed-uses-mysql
• http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
• http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
• http://wiki.apache.org/cassandra/DataModel
• http://incubator.apache.org/thrift/
• http://www.eecs.harvard.edu/~mdw/papers/quals-seda.pdf