Introduction to NoSQL and Cassandra

Post on 15-Jan-2015

Intro to NoSQL, Cassandra, and Hector that I gave at Globant Laminar in Buenos Aires, Argentina, on Dec 13th, 2012.

Introduction to NoSQL and Apache Cassandra

Patricio Echagüe patricioe@gmail.com

@patricioe

About me

Present: Relateiq (Data Processing and Scalability)

Hector committer

Past: DataStax (The Cassandra Company)

Cassandra/Hadoop distribution (former Brisk)

Cassandra FS

CQL connection pool

Cassandra contributions

Trends: “NoSQL”

[Chart: “NoSQL” trend, 2011 to 2012]

What is “NoSQL” ?

Systems able to store and retrieve large quantities of data with little or no information about the relationships between the data items.

Generally they don't have a SQL-like language for data manipulation, and their schema is more relaxed than in traditional RDBMS systems.

Full ACID is often not guaranteed.

Brewer's CAP theorem

Consistency: all replicas agree on the same value

Availability: always get an answer from a replica

Partition Tolerance: the system works even if replicas can't talk to each other

You can have 2 of these

CAP Classification

[Diagram: Consistency, Availability, Partitioning]

Types

- Relational
- Key-Value stores
- Columnar (column-oriented)
- Graph databases
- Document

What's eventual consistency?

It is a promise that eventually, in the absence of new writes, all replicas that are responsible for a data item will agree on the same version

How eventual is eventual?

Write to 1 replica and read from 1 replica, out of a total of 3.

How eventual is eventual?

Write to 2 replicas and read from 2 replicas, out of a total of 3.

Why is it good?

Because, by contacting fewer replicas, read and write operations complete more quickly, lowering latency.

What is Apache Cassandra?

Cassandra is a distributed, fault-tolerant, scalable, column-oriented data store with tunable consistency.

Cassandra has C A P, but C is tunable.

Key Concepts

Multi-Master, Multi-DC

Linearly scalable

Integrated Caching

Performs well with Larger-than-memory Datasets

Tunable consistency

Idempotent (client clock)

Schema Optional

No ACID transactions, No Locking

Generally complements other systems (not intended to be one-size-fits-all)

You should always use the right tool for the right job

Speaking Cassandra

Data Model

“4-Dimensional Hash Table”

A Keyspace contains a collection of Column Families (and controls replication)

A Column Family contains Rows

A Row has a key, and each row has columns (no need to define the columns beforehand)

Each column has a name, a value, and a timestamp (TTL is optional)
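
The “4-dimensional hash table” picture can be sketched with nested maps. This is only an illustrative in-memory model under my own assumptions, not Cassandra's storage format or API; the class and method names (`DataModelSketch`, `put`, `get`) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model only: keyspace -> column family -> row key -> column name -> column.
// All names are hypothetical; this is not how Cassandra stores data on disk.
class DataModelSketch {
    // A column is addressed by its name (the map key) and carries a value and a timestamp.
    static final class Column {
        final String value;
        final long timestamp;

        Column(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Map<String, Map<String, TreeMap<String, Column>>>> keyspaces =
        new HashMap<>();

    // Columns are created on the fly; rows do not need to share the same columns.
    void put(String keyspace, String columnFamily, String rowKey,
             String columnName, String value, long timestamp) {
        keyspaces
            .computeIfAbsent(keyspace, k -> new HashMap<>())
            .computeIfAbsent(columnFamily, k -> new HashMap<>())
            .computeIfAbsent(rowKey, k -> new TreeMap<>())   // columns kept sorted by name
            .put(columnName, new Column(value, timestamp));
    }

    // Returns the column value, or null if the row or column does not exist.
    String get(String keyspace, String columnFamily, String rowKey, String columnName) {
        Map<String, Map<String, TreeMap<String, Column>>> ks = keyspaces.get(keyspace);
        if (ks == null) return null;
        Map<String, TreeMap<String, Column>> cf = ks.get(columnFamily);
        if (cf == null) return null;
        TreeMap<String, Column> row = cf.get(rowKey);
        if (row == null) return null;
        Column column = row.get(columnName);
        return column == null ? null : column.value;
    }
}
```

The four lookups in `get` correspond to the four dimensions: keyspace, column family, row key, and column name.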

Data Model – (RDBMS)

Keyspace (Schema)

Column Family (CF) (table)

Row (row)

Column (column*) → may not be present in all rows

Data Model – Column Family

Static Column Family
- Model my object data

Dynamic Column Family
- Precalculated / Prematerialized query results

Nothing stopping you from mixing them!

Data Model – Static Column Family

Data Model – Dynamic CF

stats for a specific date

Data Model – Dynamic CF

- Timeline of tweets by a user
- Timeline of tweets by all of the people a user is following
- List of comments sorted by score
- List of friends grouped by state
- Metrics for a time bucket

Let's store “foo”

But what if that node is down?

Let's store “foo” in 3 nodes. This is the Replication Factor (N).

[Ring diagram: “Foo” stored on one node, then on three nodes]

Now we need to know which nodes the key was written to so we can read it later

The Initial Token specifies the upper value of the key range each node is responsible for

[Ring diagram: node #1 <= 'd', #2 <= 'k', #3 <= 'p', #4 <= 'u', #5 <= 'z'; keys run from a to z, and node #2 covers 'e f g h i j k']
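
The ownership rule from the ring above can be sketched as a lookup in a sorted map. This is a toy illustration that uses the first letter of the key as its token, as the slide does; `TokenRingSketch` and its methods are hypothetical names, and real clusters use partitioner-generated tokens. It assumes non-empty keys.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy ring: each node is registered under its initial token (a letter, as on the slide).
class TokenRingSketch {
    private final TreeMap<Character, String> ring = new TreeMap<>();

    void addNode(char initialToken, String nodeName) {
        ring.put(initialToken, nodeName);
    }

    // The owner is the node with the smallest token >= the key's first letter.
    // Keys beyond the largest token wrap around to the first node of the ring.
    String ownerOf(String key) {
        char token = Character.toLowerCase(key.charAt(0));
        Map.Entry<Character, String> entry = ring.ceilingEntry(token);
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }

    // Builds the five-node ring shown on the slide.
    static TokenRingSketch slideRing() {
        TokenRingSketch r = new TokenRingSketch();
        r.addNode('d', "#1");
        r.addNode('k', "#2");
        r.addNode('p', "#3");
        r.addNode('u', "#4");
        r.addNode('z', "#5");
        return r;
    }
}
```

With this ring, a key starting with 'f' falls in the range owned by node #2, which matches the 'e f g h i j k' range in the diagram.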

Gossip is the protocol Cassandra uses to exchange information with the other nodes in the cluster (a.k.a. the Ring)

For example, which nodes own the key “foo”

[Ring diagram: a client sends Read 'foo' to the ring; nodes #1 <= 'd', #2 <= 'k', #3 <= 'p', #4 <= 'u', #5 <= 'z']

A Partitioner is used to transform the key. “foo1” and “foo2” may end up on different nodes

The most commonly used is the Random Partitioner

“foo1” → md5(“foo1”) → “A99A0B....”
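
A minimal sketch of the idea, assuming the token is simply the MD5 digest of the key read as a non-negative integer. This is a simplification and may not match the exact token Cassandra's Random Partitioner computes; `Md5TokenSketch` is a hypothetical name.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Md5TokenSketch {
    // Hashes the key with MD5 and interprets the 16-byte digest as a non-negative integer.
    static BigInteger token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on standard Java platforms.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Because the hash spreads similar keys far apart, “foo1” and “foo2” get unrelated tokens and can land on different nodes.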

[Ring diagram: nodes #1 to #5, with 'foo1' and 'foo2' placed on different nodes]

A Replica Placement Strategy determines which nodes contain replicas

Simple Strategy places them clockwise around the ring

[Ring diagram: nodes #1 to #5, with three copies of 'foo1']
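
Clockwise placement can be sketched as a walk around a list of nodes in token order. This is an illustration of the idea only, not Cassandra's implementation; `SimplePlacementSketch` and its parameters are hypothetical, and the index of the owning node is assumed to be known already.

```java
import java.util.ArrayList;
import java.util.List;

class SimplePlacementSketch {
    // Starting at the node that owns the key, walk the ring clockwise
    // until replicationFactor nodes (or the whole ring) have been collected.
    static List<String> replicas(List<String> ringInTokenOrder, int ownerIndex, int replicationFactor) {
        int count = Math.min(replicationFactor, ringInTokenOrder.size());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // The modulo wraps from the last node back to the first.
            result.add(ringInTokenOrder.get((ownerIndex + i) % ringInTokenOrder.size()));
        }
        return result;
    }
}
```

For a five-node ring and a key owned by node #4, a replication factor of 3 selects #4, #5, and then wraps around to #1.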

A Replica Placement Strategy determines which nodes contain replicas

Network Topology Strategy places them in different DCs

[Diagram: two rings of nodes #1 to #5, one per DC; the first holds three copies of 'foo1' and the second holds one (DC1:3 DC2:1)]

Consistency Level determines how many replicas to contact

CL = 1

CL = QUORUM

[Ring diagrams: a client and nodes #1 to #5, with three replicas of 'foo1']

Consistency For Writes

ANY

ONE

TWO

THREE

QUORUM

LOCAL_QUORUM

EACH_QUORUM

ALL

Consistency For Reads

ONE

TWO

THREE

QUORUM

LOCAL_QUORUM

EACH_QUORUM

ALL

Consistency in Math Terms

Cassandra guarantees strong consistency if

R + W > N

(nodes_written + nodes_read) > replication_factor
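
The rule above, together with the usual majority definition of a quorum, can be sketched in a few lines. `ConsistencySketch` and its method names are hypothetical helpers written for this transcript, not part of any client API.

```java
class ConsistencySketch {
    // A quorum is a majority of the replicas: floor(N / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // With R + W > N, every read set overlaps every write set in at least one replica.
    static boolean isStronglyConsistent(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }
}
```

Applied to the earlier “How eventual is eventual?” slides with N = 3: writing to 1 and reading from 1 gives 1 + 1 = 2, which is not greater than 3, so reads may be stale. Writing to 2 and reading from 2 (QUORUM for both) gives 4 > 3, so a read always reaches at least one replica that has the latest write.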

Back to the example..

But what if node #3 is down?

[Ring diagram: the client writes 'foo1'; node #3 is down and a hint for 'foo1' is stored]

The coordinator node will store a hint and will replay that mutation when the down node comes back up.

This is known as Hinted Handoff

Node #5 will replay the hint to node #3 when it comes back online

And what if node #5 dies before sending the hints to node #3?

If using Quorum, node #4 will request 'foo' from all the replicas

If the results received do not match, a Read Repair process is performed in the background

And the missing or out-of-date value is pushed to the out-of-date node, #3 in this case

[Ring diagram: node #3 answers '' while another replica answers 'foo'; since 'foo' != '', node #3 is repaired]
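
The comparison step of Read Repair can be sketched as follows. This is a simplified model under my own assumptions: each replica answers with a value and a timestamp, and a missing value is modeled as an empty string with timestamp 0. `ReadRepairSketch` and its members are hypothetical names, not Cassandra internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ReadRepairSketch {
    // One replica's answer for a column.
    static final class Versioned {
        final String value;
        final long timestamp;

        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Picks the response with the highest timestamp (null if there are no responses).
    static Versioned newest(Map<String, Versioned> responsesByNode) {
        Versioned best = null;
        for (Versioned v : responsesByNode.values()) {
            if (best == null || v.timestamp > best.timestamp) {
                best = v;
            }
        }
        return best;
    }

    // Lists the nodes whose response is older than the newest one;
    // these are the nodes the newest value would be pushed to.
    static List<String> nodesToRepair(Map<String, Versioned> responsesByNode) {
        Versioned best = newest(responsesByNode);
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Versioned> e : responsesByNode.entrySet()) {
            if (e.getValue().timestamp < best.timestamp) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }
}
```

In the scenario above, node #3 would be the only entry returned by `nodesToRepair`, since it missed the write while it was down.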

The last feature to achieve consistency is the Anti Entropy Service (AES)

It should be run periodically as part of cluster maintenance, or after a node has been down

Recap Consistency Features

Read Repair

Anti Entropy Service (AES)

Hinted Handoff

Scaling

[Ring diagram: nodes with tokens “e”, “j”, “o”, “t”, “z”; a new node “?” joins and ends up with token “g”]

Nodetool move ?

Want 2x performance?!

Add 2x nodes

'No downtime' included!

[Ring diagram: the ring grows from 5 nodes (“e”, “j”, “o”, “t”, “z”) to 10 nodes (adding “b”, “g”, “l”, “q”, “v”)]

With RF = 3 we could lose

[Ring diagram: the 10-node ring with three nodes marked X; a second build adds a fourth node marked “X ?”]

Vs others

[Diagram: the same ten tokens (z, t, e, o, j, g, l, q, v, b)]

Recap

- Replication Factor
- Tokens
- Gossip
- Partitioner
- Replica Placement
- Consistency
- Hinted Handoff
- Read Repair
- AES
- Clustering

Performance

Reads on par with writes

Scalability

Internals

Read and Write path

Storage - SSTable

- SSTables are sorted

- Immutable (“Merge on read”)

- Newest timestamp wins
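
“Merge on read” with “newest timestamp wins” can be sketched by modeling each SSTable as a sorted map that is never modified after it is written. This is an in-memory illustration under that assumption, not the on-disk format; `MergeOnReadSketch` and `Cell` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MergeOnReadSketch {
    // A stored column value together with its write timestamp.
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Each SSTable maps column name -> cell, sorted by column name.
    // A read merges all SSTables and keeps, per column, the cell with the newest timestamp.
    static TreeMap<String, Cell> read(List<TreeMap<String, Cell>> sstables) {
        TreeMap<String, Cell> merged = new TreeMap<>();
        for (TreeMap<String, Cell> sstable : sstables) {
            for (Map.Entry<String, Cell> e : sstable.entrySet()) {
                merged.merge(e.getKey(), e.getValue(),
                    (current, candidate) -> candidate.timestamp > current.timestamp ? candidate : current);
            }
        }
        return merged;
    }
}
```

The inputs are never changed: an update is just another cell in a newer SSTable, and the read decides which one is current.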

Storage – Compaction

Merges SSTables together into larger SSTables

Removes Tombstones

Rebuilds primary and secondary indexes

Storage – Compaction

Two types:

- Size-tiered compaction

- Leveled compaction

Storage – Compaction

Size-tiered compaction

- Performance not guaranteed
- A row may be spread across many SSTables
- Waste of space
- Good for write-heavy ops
- Rows are written once
- 100% more space than SSTables

Storage – Compaction

Leveled compaction

- Grouped into levels
- No overlapping within a level
- Each level is ten times as large
- 90% of reads satisfied with 1 SSTable
- Twice as much I/O

Recap

- SSTable
- Memtable
- Row Cache
- Compaction

SSDs and caching

Before - 48 Cassandra on m2.4xlarge. 36 EVcache on m2.xlarge

After - 12 Cassandra on hi1.4xlarge

API Operations

Five general categories

Retrieving

Write/Update/Remove (all the same op!), Increment counters

Meta Information

Schema Manipulation

CQL Execution

Insertion/Deletion => Mutation

Again: Every mutation is an insert!
- Merge on read
- SSTables are immutable
- Highest timestamp wins

CQL

INSERT INTO Hollywood.NerdMovies (user_uuid, fan) VALUES ('cfd66ccc-d857-4e90-b1e5-df98a3d40cd6', 'johndoe') USING CONSISTENCY LOCAL_QUORUM AND TTL 86400;

Hadoop

Using a Client

- Hector

http://hector-client.org

- Astyanax

https://github.com/Netflix/astyanax

- Pelops

https://github.com/s7/scale7-pelops

Using a Client → Hector

- Most popular Java client

- In use at very large installations

- A number of tools and utilities built on top

- Very active community

- MIT Licensed

Features

- High Level API

- Failover behavior

- High-performance connection pool

- JMX counters for management

- Discoverability of new nodes

- Automatic retry of downed hosts

- Suspension of nodes after several timeouts

- Load Balancing: Configurable and extensible

- Locking (Beta)

Hector's Architecture

vs JDBC

Hector is operation-oriented

Whereas

JDBC is connection-oriented

API Abstractions

Thrift

Mutator

Templates

ColumnFamilyTemplate

Familiar, type-safe approach

- based on template-method design pattern

- generic: ColumnFamilyTemplate<K,N>

(K is the key type, N the column name type)

ColumnFamilyTemplate template = new ThriftColumnFamilyTemplate(keyspaceName, columnFamilyName, StringSerializer.get(), StringSerializer.get());

*** (no generics for clarity)

ColumnFamilyTemplate

new ThriftColumnFamilyTemplate(keyspaceName,
    columnFamilyName,
    StringSerializer.get(),   // Key Format
    StringSerializer.get());  // Column Name Format

Column Name Format
- Cassandra calls this a “comparator”
- Remember: defines column order in on-disk format

ColumnFamilyTemplate

// Type parameters: <Key Format, Column Name Format>
ColumnFamilyResult<String, String> res = cft.queryColumns("patricioe");

String value = res.getString("email");

Date startDate = res.getDate("DateOfBirth");

ColumnFamilyTemplate

ColumnFamilyUpdater updater = template.createUpdater("pato");

updater.setString("companyName", "Relateiq");
updater.addKey("sabina");
updater.setString("companyName", "Globant");

template.update(updater);

Inserting data with ColumnFamilyUpdater

ColumnFamilyTemplate

template.deleteColumn("zznate", "notNeededStuff");
template.deleteColumn("zznate", "somethingElse");
template.deleteColumn("patricioe", "aDifferentColumnName");
...
template.deleteRow("someuser");

template.executeBatch();

Deleting Data with ColumnFamilyTemplate

Integrating with existing patterns

Hector Object Mapper -> Apache Gora: https://github.com/hector-client/hector/tree/master/object-mapper

Hector JPA*: https://github.com/riptano/hector-jpa

Spring IOC

CQL: JDBC Driver and Pool in 1.0!

JdbcTemplate FTW!

Development Resources

Hector Documentation (http://hector-client.org)

Cassandra Unit: https://github.com/jsevellec/cassandra-unit

Cassandra Maven Plugin: http://mojo.codehaus.org/cassandra-maven-plugin/

CCM localhost cassandra cluster: https://github.com/pcmanus/ccm

OpsCenter: http://www.datastax.com/products/opscenter

Cassandra AMIs: https://github.com/riptano/CassandraClusterAMI

Want to contribute?

git clone git@github.com:hector-client/hector.git

Summary

- Take advantage of strengths
- Idempotence and asynchronicity are your friends
- If it's not in the API, you are probably doing it wrong
- Seek death is still possible if you model incorrectly
- Try Denormalizing (append-only model ?)

Patricio Echagüe patricioe@gmail.com

@patricioe

Credits

Nate McCall

Aaron Morton (http://thelastpickle.com)

Datastax (http://www.datastax.com)

http://www.slideshare.net/mikiobraun/cassandra-an-introduction

Additional Resources

DataStax Documentation: http://www.datastax.com/docs

Apache Cassandra project wiki: http://wiki.apache.org/cassandra/

“The Dynamo Paper”: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

P. Helland. “Building on Quicksand”: http://arxiv.org/pdf/0909.1788

P. Helland. “Life Beyond Distributed Transactions”: http://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf

S. Anand. “Netflix's Transition to High-Availability Storage Systems”: http://media.amazonwebservices.com/Netflix_Transition_to_a_Key_v3.pdf

“The Megastore Paper”: http://research.google.com/pubs/archive/36971.pdf