Cassandra internals



Page 1: Cassandra internals

Phoenix Cassandra Users Meetup

January 26th, 2015

Narasimhan Sampath

Choice Hotels International

Cassandra Internals

Page 2: Cassandra internals

What is Cassandra?

SEDA

Data Placement, Replication and Partition Aware Drivers

Read and Write Path

Merkle Trees, SSTables, Read Repair and Compaction

Single and Multi-threaded Operations

Demo

Agenda

Page 3: Cassandra internals

Cassandra is a decentralized, distributed database

No master or slave nodes

No single point of failure

Peer-to-peer architecture

Read / write to any available node

Replication and data redundancy built into the architecture

Data is eventually consistent across all cluster nodes

Linearly (and massively) scalable

Multiple data center support built in – a single cluster can span geo locations

Adding or removing nodes / data centers is easy and does not require downtime

Data redistribution / rebalancing is seamless and non-blocking

Runs on commodity hardware

Hardware failure is expected and factored into the architecture

Internal architecture more complex than non-distributed databases

Cassandra

Page 4: Cassandra internals

Automatic Sharding (partitioning)

Total data to be managed by the cluster is (ideally) divided equally among the cluster nodes

Each node is responsible for a subset of the data

Copies of that subset are stored on other nodes for high availability and redundancy

Data placement design determines node balancing (token assignment, adding and removing nodes)

Data synchronization within the decentralized cluster is complex, but the implementation is mostly hidden from users

Availability and Partition Tolerance are given precedence over Consistency (CAP – data is eventually consistent)

Consistency (all nodes see the same data at the same time)

Availability (a guarantee that every request receives a response about whether it succeeded or failed)

Partition tolerance (the system continues to operate despite a part of the system failing)

Brewer’s CAP theorem (For further reading)

Staged Event-Driven Architecture – a framework for achieving high concurrency under load

Uses events, messages and queues to process tasks

Decouples the request and response from the worker threads

Cassandra

Page 5: Cassandra internals

Ring – Visual representation of data managed by Cassandra

Node – Individual machine in the ring

Data Center – A collection of related nodes

Cluster – Collection of (geographically separated) data centers

Commitlog – The equivalent of a transaction log file for Durability

Memtable – In Memory structures to store data (per column family)

Keyspace – Container for application data (Analogous to schema)

Table – Structure that holds data in rows and columns

SSTable – An immutable file (for each table) on disk to which data structures in memory are dumped periodically

Cassandra Terminology

Page 6: Cassandra internals

Gossip – Peer to Peer protocol to discover and share location and state information on nodes

Token – A number used to assign a range of data to a node within a datacenter

Partitioner – A Hashing function for deriving the token

Replication Factor – determines the number of copies of data

Snitch – Informs Cassandra about the network topology

Replica – Copies of data on different nodes for redundancy and fault tolerance

Replication Factor – total number of copies on the cluster

Terminology

Page 7: Cassandra internals

Cassandra is linearly (horizontally) and massively scalable

Just add or remove cluster nodes as load increases or decreases

There is no down time required for this

SEDA – Staged Event-Driven Architecture helps maintain consistent throughput under load

Core Strength - Scalability

Page 8: Cassandra internals

Quantifying Massive

Page 9: Cassandra internals

Avoids the pitfalls of Client Server based design

Eliminates storage bottlenecks

No single data repository

Redundancy built in

All nodes participate (whether they have the requested data or not)

Shared nothing

Transparently add / remove nodes as necessary without downtime

Comes with a trade-off – eventual consistency (CAP)

Newer Staged Event Driven Architecture

How does it Scale?

Page 10: Cassandra internals

Legacy systems typically use thread based concurrency models

Programming traditional multi-threaded applications is hard

Distributed multi-threaded applications are even harder

Leads to severe scalability bottlenecks

A new thread or process is usually created for each request

There is a maximum number of threads a system can support

Challenges with the thread execution model

Deadlocks

Livelocks (wastes CPU cycles)

Starvation (waiting for resources)

Overheads – Context switching, synchronization and data movement

Request and response typically handled by the same thread

Sequential execution

Legacy Systems

Page 12: Cassandra internals

Event Driven Architecture

Evolution of Event Driven Architecture (EDA)

This consists of a set of loosely coupled software components and services

An event is something that an application can act upon

A hotel booking event

A check-in event

A listener can pick up a check-in event and act on it

In-room entertainment system displays a personalized greeting

Partners may get notified and can send personalized offers (spa / massage / restaurant discounts)

This is much more scalable than thread based concurrency models

Page 13: Cassandra internals

SEDA is an Architectural approach

An application is broken down into a set of logical stages

These stages are loosely coupled and connected via queues

Decouples event and thread scheduling from DB Engine logic

Prevents resources from being overcommitted under high load

Enables modularity and code reuse

SEDA Explained
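
The following is a minimal, illustrative Python sketch of the stage idea (not Cassandra's actual implementation, which is Java): a bounded queue drained by a small worker pool, so the thread that accepts a request is never the thread that does the work. The stage names and handlers are invented for illustration.

import queue
import threading

class Stage:
    """A SEDA-style stage: a named, bounded event queue drained by a small
    pool of worker threads. Callers enqueue an event and return immediately;
    the workers process events asynchronously."""

    def __init__(self, name, handler, workers=2, capacity=128):
        self.name = name
        self.handler = handler                       # callable applied to each event
        self.events = queue.Queue(maxsize=capacity)  # bounded queue = back-pressure
        for _ in range(workers):
            threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, event):
        # Blocks when the stage is saturated instead of spawning more threads,
        # which is how SEDA trades latency for sustained throughput under load.
        self.events.put(event)

    def _drain(self):
        while True:
            event = self.events.get()
            self.handler(event)
            self.events.task_done()

# Wiring two hypothetical stages together: a "mutation" stage hands its result
# to a "response" stage, decoupling the request thread from the worker threads.
response_stage = Stage("response", lambda e: print("replied:", e))
mutation_stage = Stage("mutation", lambda e: response_stage.submit(e.upper()))
mutation_stage.submit("write key=42")
mutation_stage.events.join()
response_stage.events.join()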

Page 14: Cassandra internals

Understanding Stage (SEDA)

Page 15: Cassandra internals

Understanding Stage

Page 16: Cassandra internals

SEDA enables massive concurrency

No thread deadlocks, livelocks or starvation to worry about (for the most part)

Thread Scheduling and Resource Management abstracted

Supports self tuning / resource allocation / management

Easier to debug and monitor application performance at scale

Distributed debugging / tracing easier

Graceful degradation under excessive load

Maintains throughput at the expense of latency

Why SEDA matters

Page 17: Cassandra internals

Examples of Stages

Page 19: Cassandra internals

Facebook’s DC

Page 20: Cassandra internals

Why is data placement important?

Page 21: Cassandra internals

Each Cassandra node has a listen and a broadcast IP address

Snitch maps IP address to Racks and Data Centers

Gossip uses this information to help Cassandra build a node location map

Snitch helps Cassandra with replica placement

Helps Cassandra minimize cross data center latency

Role of Snitch

Page 22: Cassandra internals

Once built and configured, a cluster is ready to store data

Each node owns a token range

Token ranges can be manually assigned in the YAML file

Or Cassandra can manage token assignment – a concept called vnodes

A Keyspace needs to be created with replication options

CREATE KEYSPACE "Choice"
  WITH REPLICATION =
  {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};

Cassandra Schema objects are replicated globally to all nodes

This enables each node in the cluster to act as a coordinator node

Data Placement
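
As a usage sketch, the same keyspace could be created from a client application with the DataStax Python driver (cassandra-driver); the contact-point addresses below are placeholders and exact driver options vary by version.

from cassandra.cluster import Cluster

# Connect to any node; that node acts as the coordinator for our requests.
cluster = Cluster(['10.0.0.1', '10.0.0.2'])   # placeholder contact points
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS "Choice"
    WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2}
""")

# Schema changes such as this are replicated to every node in the cluster.
cluster.shutdown()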

Page 23: Cassandra internals

Data gets replicated as defined in the Keyspace

Within a data center, the Murmur3 hash of the partition key decides which node owns the data

Replication Strategy determines which nodes contain replicas

Simple Strategy – Replicas are placed on succeeding nodes

Network Topology – Walks the ring clockwise and places each copy on the first node on successive racks

Asymmetric replica groupings are possible (DR / Analytics etc.)

Data Placement
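
The toy Python sketch below illustrates the placement idea: a partitioner token maps the key to an owning node, and SimpleStrategy picks the remaining replicas by walking the ring clockwise. Python's built-in hash() stands in for Murmur3, and the three node tokens are made up.

from bisect import bisect_left

def token(partition_key):
    # Stand-in for Cassandra's Murmur3 partitioner (which yields a 64-bit token).
    return hash(partition_key)

# A toy three-node ring with one token per node (no vnodes).
ring = [(-6_000_000_000_000_000_000, 'node1'),
        (0,                          'node2'),
        ( 6_000_000_000_000_000_000, 'node3')]
tokens = [t for t, _ in ring]

def replicas(partition_key, rf=2):
    """The owner is the first node whose token is >= the key's token (wrapping
    around the ring); SimpleStrategy places the remaining rf-1 copies on the
    next distinct nodes walking clockwise."""
    owner = bisect_left(tokens, token(partition_key)) % len(ring)
    return [ring[(owner + i) % len(ring)][1] for i in range(rf)]

print(replicas('Finance'))   # e.g. ['node2', 'node3'] - depends on hash()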

Page 24: Cassandra internals

empID  empName  deptID  deptName         hiredate
22     Sam      12      Finance          1/22/1996
33     Scott    18      Human Resources  12/8/2006
44     Walter   24      Shipping         11/20/2009
55     Bianca   30      Marketing        1/1/2015

Data Placement

Partition        Sample Hash
Finance          -2245462676723220000
Human Resources   7723358927203680000
Shipping         -6723372854036780000
Marketing         1168604627387940000

Page 25: Cassandra internals

Data Placement

Page 26: Cassandra internals

Data Access

Cassandra’s location-independent architecture means a user can connect to any node of the cluster, which then acts as the coordinator node

Schemas get replicated globally – even to nodes that do not contain a copy of the data

Cassandra offers tunable consistency – an extension of eventual consistency

Clients determine how consistent the data should be

They can choose between high availability (CL ONE) and high safety (CL ALL), among other options

Further reading

Requests go through stages – the thread that received the initial request inserts it into a queue and then waits for the next user request

Partition-aware drivers help route traffic straight to a replica for the requested partition (see the driver sketch below)

Hinted Hand-offs – store and forward write requests
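
The sketch below shows, with the DataStax Python driver, what tunable consistency and partition-aware routing look like on the client side: TokenAwarePolicy routes each request to a replica for the partition, and the per-statement consistency level trades availability (ONE) against safety (ALL). The employees table, its columns and the contact point are assumptions borrowed from the sample data earlier in the deck; exact driver options vary by version.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.query import SimpleStatement

# Token-aware routing: send each request straight to a replica for the partition.
cluster = Cluster(
    ['10.0.0.1'],  # placeholder contact point
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc='dc1')),
)
session = cluster.connect()

# High safety: every replica must acknowledge before success is reported.
write_all = SimpleStatement(
    'INSERT INTO "Choice".employees (deptname, empid, empname) VALUES (%s, %s, %s)',
    consistency_level=ConsistencyLevel.ALL)

# High availability: a single replica acknowledgement is enough.
read_one = SimpleStatement(
    'SELECT empname FROM "Choice".employees WHERE deptname = %s',
    consistency_level=ConsistencyLevel.ONE)

session.execute(write_all, ('Finance', 22, 'Sam'))
for row in session.execute(read_one, ('Finance',)):
    print(row.empname)
cluster.shutdown()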

Page 27: Cassandra internals

Data Access

Page 28: Cassandra internals

Memtables

Commitlog

SSTables

Tombstones

Compaction

Repair

Reads and Writes

Page 29: Cassandra internals

Reads & Writes

Page 30: Cassandra internals

Reads and Writes

Page 31: Cassandra internals

Write Process

Write requests are written to a MemTable

When the MemTable is full, its contents are queued to be flushed to disk

Writes are also simultaneously persisted on disk to a CommitLog file

This helps achieve durable writes

CommitLog entries are purged after MemTable is flushed to disk

MemTables and SSTables are created on a per table basis

Tunable consistency determines how many MemTables and CommitLogs the row has to be written to

SSTables are immutable and cannot be modified once written

Compaction consolidates SSTables and removes tombstones

Size-Tiered Compaction

Leveled Compaction

Repair is a process that synchronizes copies located on different nodes

Uses Merkle trees to make this more efficient
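
A toy Python sketch of the write path just described – commit log append for durability, memtable update in memory, flush to an immutable SSTable when the memtable fills up. File names, thresholds and data layout are invented; real SSTables are far more elaborate.

import json

class ToyTable:
    """Illustrative write path: commit log first, memtable second,
    immutable SSTable on flush."""

    def __init__(self, name, flush_threshold=3):
        self.memtable = {}                       # partition key -> row (in memory)
        self.sstables = []                       # immutable snapshots ("on disk")
        self.commitlog = open(name + '.commitlog', 'a')
        self.flush_threshold = flush_threshold

    def write(self, key, row):
        # 1. Durability: append the mutation to the commit log and sync it.
        self.commitlog.write(json.dumps({key: row}) + '\n')
        self.commitlog.flush()
        # 2. Apply the mutation to the in-memory memtable.
        self.memtable[key] = row
        # 3. When the memtable is "full", flush it to an immutable SSTable.
        if len(self.memtable) >= self.flush_threshold:
            self.sstables.append(dict(self.memtable))   # never modified again
            self.memtable.clear()
            # Real Cassandra can now recycle the matching commit log segments.

t = ToyTable('employees')
t.write('Finance', {'empid': 22, 'empname': 'Sam'})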

Page 32: Cassandra internals

Write Path

Page 33: Cassandra internals

Is a feature that enables high write availability

Can be enabled / disabled in the YAML file

When a replica node is down

A hint is stored in the coordinator node

Hints are stored for three hours (default)

Hinted writes do not count towards CL

Hint replay is throttled so it has limited impact on system performance

Hinted Hand-off
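
A minimal store-and-forward sketch of the hinted hand-off idea: the coordinator keeps a hint for a down replica and replays it when the node returns, unless the hint has aged past the window (3 hours by default). The class and method names are invented for illustration.

import time
from collections import defaultdict

HINT_WINDOW_SECONDS = 3 * 3600   # default hint window: 3 hours

class Coordinator:
    """Store-and-forward sketch: when a replica is down, the coordinator keeps
    the mutation as a hint and replays it once the replica comes back, unless
    the hint has aged past the hint window."""

    def __init__(self):
        self.hints = defaultdict(list)   # down replica -> [(timestamp, mutation)]

    def write(self, replica, mutation, replica_up):
        if replica_up:
            self._send(replica, mutation)
        else:
            self.hints[replica].append((time.time(), mutation))

    def on_replica_up(self, replica):
        for ts, mutation in self.hints.pop(replica, []):
            if time.time() - ts <= HINT_WINDOW_SECONDS:
                self._send(replica, mutation)   # replay the stored write
            # else: the hint is dropped; only repair can fix this replica now

    def _send(self, replica, mutation):
        print('applying %r on %s' % (mutation, replica))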

Page 34: Cassandra internals

Read Path

A row of data will likely exist in multiple locations

Unflushed MemTable

Un-compacted and compacted SSTables

Tunable consistency determines how many nodes have to respond

Cassandra does not rewrite the entire row to a new file on update

No read-before-write

Updated / new columns exist in the new file

Unmodified columns exist in the old file

The timestamped version of the row can be different in each location

All these must be retrieved, reconstructed and processed based on timestamp

Uses Bloom filters to make key lookups more efficient
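
A toy sketch of that reconciliation step: row fragments are gathered from the memtable and every SSTable, and for each column the newest timestamped value wins (last-write-wins). The column -> (value, timestamp) layout is a simplification of the real storage format.

def read_row(key, memtable, sstables):
    """Merge row fragments for a partition key from the memtable and every
    SSTable, keeping the newest value of each column (last-write-wins).
    Fragments are dicts of column -> (value, write_timestamp)."""
    merged = {}
    for source in [*sstables, memtable]:   # order does not matter; timestamps decide
        fragment = source.get(key, {})
        for column, (value, ts) in fragment.items():
            if column not in merged or ts > merged[column][1]:
                merged[column] = (value, ts)
    return {column: value for column, (value, ts) in merged.items()}

sstable_old = {'22': {'empname': ('Sam', 100), 'deptname': ('Finance', 100)}}
sstable_new = {'22': {'deptname': ('Accounting', 200)}}   # only the updated column
memtable    = {}
print(read_row('22', memtable, [sstable_old, sstable_new]))
# {'empname': 'Sam', 'deptname': 'Accounting'}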

Page 35: Cassandra internals

Row fragments may exist in multiple SSTables

May exist in Memtable as well

Bloom filters speed lookups

Read Path

Page 36: Cassandra internals

A Bloom filter is a probabilistic bit-vector data structure

It supports two operations – test and add

Cassandra uses Bloom filters to reduce Disk I/O during key lookup

Each SSTable has a bloom filter associated with it

A Bloom filter is used to test if an element is a part of a set

False positives are possible, but false negatives are not

This means a key is “possibly in the set” or “definitely not in the set”

Check out JasonDavies.com for a cool interactive demo

http://www.jasondavies.com/bloomfilter/

http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html

Bloom Filters
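
A small, self-contained Python sketch of the test/add behaviour described above; the sizes and hash scheme are arbitrary, chosen only to show why false positives are possible while false negatives are not.

import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions set/check k bits in a bit vector.
    'add' can only set bits, so a lookup can return a false positive but
    never a false negative."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(('%d:%s' % (i, key)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add('Finance')
print(bf.might_contain('Finance'))    # True ("possibly in set")
print(bf.might_contain('Shipping'))   # almost certainly False ("definitely not in set")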

Page 37: Cassandra internals

Deletes are handled differently when compared to traditional RDBMS

Data to be deleted is marked using Tombstones (using a write operation)

Actual removal takes place later during compaction

Run Repair on each node within 10 days (default)

Repair removes inconsistencies between replicas

Inconsistencies happen because nodes can be down for longer than the hinted hand-off window, thereby missing deletes / updates

Distributed deletes are hard in a peer-to-peer system that has no SPOF

Deletes

Page 38: Cassandra internals

Distributed Systems are eventually consistent

Only a small number of nodes have to respond for a successful (delete) operation

As the delete command propagates through the system, some nodes may be unavailable

The commands are stored (as hinted hand-offs) and will be delivered when the downed node comes online

The delete command may be “lost” if the downed node does not come back within the hinted hand-off window (default 3 hours)

Why are Distributed Deletes hard?

Page 39: Cassandra internals

Cassandra does not support in-place updates

Updates are implemented as new inserts with a later timestamp – the older values are superseded rather than modified in place

Updated values are written to a new file

Unmodified columns of the original row exist in old file

Compaction consolidates all values and writes row to new file

Updates

Page 40: Cassandra internals

Cassandra does not perform in-place updates or deletes

Instead the new data is written to a new SSTable file

Cassandra marks data to be deleted using markers called Tombstones

Tombstones exist for the time period defined by GC_GRACE_SECONDS

Compaction merges the data in each SSTable by partition key

Evicts tombstones, removes deleted data and consolidates the SSTables into a single SSTable

Old SSTables are deleted as soon as existing reads complete

Compaction
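
A toy compaction sketch consistent with the bullets above: input SSTables are merged per partition key with last-write-wins per column, then tombstones older than gc_grace_seconds are evicted. Timestamps are plain seconds here (Cassandra actually uses microseconds) and the data layout is invented.

import time

TOMBSTONE = object()          # sentinel marking a deleted column
GC_GRACE_SECONDS = 864000     # Cassandra's default: 10 days

def compact(sstables, now=None):
    """Merge SSTables per partition key: keep the newest (value, timestamp) for
    each column, then evict tombstones older than gc_grace_seconds."""
    now = now if now is not None else time.time()
    merged = {}
    for sstable in sstables:
        for key, columns in sstable.items():
            row = merged.setdefault(key, {})
            for column, (value, ts) in columns.items():
                if column not in row or ts > row[column][1]:
                    row[column] = (value, ts)      # last write wins
    compacted = {}
    for key, row in merged.items():
        kept = {c: v for c, v in row.items()
                if not (v[0] is TOMBSTONE and now - v[1] > GC_GRACE_SECONDS)}
        if kept:
            compacted[key] = kept                  # empty partitions disappear
    return compacted

old = {'22': {'empname': ('Sam', 100.0)}}
new = {'22': {'empname': (TOMBSTONE, 200.0)}}      # a delete, i.e. a tombstone write
print(compact([old, new], now=200.0 + GC_GRACE_SECONDS + 1))   # {} - fully purged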

Page 41: Cassandra internals

Compaction

Page 42: Cassandra internals

Read Repair and Node Repair

Read Repair synchronizes data requested in a read operation

Node repair synchronizes all data (for a range) in a node with all replicas

Node repair needs to be scheduled to run at least once within the GC_GRACE_SECONDS Window (default 10 days)

Repair

Page 43: Cassandra internals

There are two stages to the repair process

Build a Merkle Tree

Each replica will compare the differences between the trees

Once the comparison completes, the differences are streamed over

Streams are written to new SSTables

Repair is a resource intensive operation

Read up on Advanced Repair techniques

Repair Process

Page 44: Cassandra internals

The distributed, decentralized nature of Cassandra requires repair operations

Repair involves comparing all data elements in each replica and updating the data

This happens asynchronously and in the background

Cassandra uses Merkle trees to detect data inconsistencies more quickly and to minimize the data transferred between nodes

Merkle Tree is an inverted hash tree structure

Used to compare data stored in different nodes

Partial branches of tree can be compared

Minimizes repair time and traffic between nodes

Merkle Trees
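
A toy sketch of the comparison: each replica builds a hash tree over its ranges, and when the roots differ only the ranges whose hashes disagree need to be streamed. It assumes a power-of-two number of leaf ranges for brevity.

import hashlib

def merkle_tree(leaf_ranges):
    """Build the tree bottom-up; returns a list of levels, leaves first,
    root last. Assumes a power-of-two number of leaf ranges."""
    level = [hashlib.sha256(repr(r).encode()).hexdigest() for r in leaf_ranges]
    levels = [level]
    while len(level) > 1:
        level = [hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def ranges_to_repair(tree_a, tree_b):
    """If the roots match the replicas are in sync; otherwise return the leaf
    ranges whose hashes disagree - only those need to be streamed."""
    if tree_a[-1] == tree_b[-1]:
        return []
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]

replica1 = merkle_tree(['rowA', 'rowB', 'rowC', 'rowD'])
replica2 = merkle_tree(['rowA', 'rowX', 'rowC', 'rowD'])
print(ranges_to_repair(replica1, replica2))   # [1] -> only that range is streamed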

Page 45: Cassandra internals

Single threaded Operations

Some Examples of Single threaded operations:

Merkle Tree Comparison

Triggering Repair

Deleting files

Obsolete SSTables

Commitlog segments

Gossip

Hinted Handoff (default value = 1)

Message Streaming

Page 46: Cassandra internals

This demo is to help get a better understanding of:

Gossip

Replication

Data Manipulation (Inserts, Updates, Deletes)

Role of Memtable, CommitLog and Tombstones

Compaction

Demo

Page 47: Cassandra internals

Demo - Steps

Modify core cluster and table settings

Insert Data in one node

Verify Replication

Shut down one node

Continue DML operations

Start the downed node

Understand Outcome

Page 48: Cassandra internals

Let’s see it!

Demo Time

Page 49: Cassandra internals

Commands issued to Cassandra when one node was down

Demo commands

Page 50: Cassandra internals

Expected results

Actual Results

Results

Page 51: Cassandra internals

Demo Recap

What just happened?

Inserts disappeared

Updates rolled back

Deletes reappeared

What happened to Durability?

And this thing called eventual consistency?

Page 52: Cassandra internals

All nodes were up and running

Initial writes came in, got persisted and replicated

All nodes have received the data and are in sync.

MemTable flush, compaction and SSTable consolidation

This clears the memory and the commit log

None of the 3 nodes has any entries in the commit log for these rows

Data exists in SSTables, so the query returns data back to the user

What really happened?

Page 53: Cassandra internals

One node is brought down

The state is preserved on that node

Inserts / Updates and Deletes continue in other nodes

Replication and Synchronization happens

Consolidation and Compaction happens on the other 2 nodes

Every time this happens, the commit log is cleared and tombstones are evicted

gc_grace_seconds & hinted_handoff play a critical role for this demo to work

The 3rd node that was down is brought up and starts synchronizing

It still has the original state preserved and sends that copy to the other 2 nodes

The other 2 nodes receive the data and look for commit log entries and tombstones locally

When the nodes do not find the entries, they proceed to apply that change (as new data) and the system reverts to its earlier state

What really happened?

Page 54: Cassandra internals

http://www.Datastax.com

http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf

http://berb.github.io/diploma-thesis/original/052_threads.html

Choice Hotels is hiring!

Please contact Jeremiah Anderson for details.

[email protected]

References