Cassandra Introduction & Key Features

Meetup Vienna Cassandra Users

13th of January 2014

Definition

“Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.” [Cassandra: The Definitive Guide, Eben Hewitt, 2010]

13/01/2014 Cassandra Introduction & Key Features by Philipp Potisk

History

• Bigtable, 2006
• Dynamo, 2007
• Open source, 2008

Key Features

• Distributed and decentralized
• Elastic scalability
• High availability and fault tolerance
• Tunable consistency
• Column-oriented key-value store
• CQL – a SQL-like query interface
• High performance


Distributed and Decentralized

• Distributed: Capable of running on multiple machines

• Decentralized: No single point of failure

• No master-slave issues, thanks to the peer-to-peer architecture (nodes exchange state via the "gossip" protocol)
• A single Cassandra cluster may run across geographically dispersed data centers
• Read and write requests can be sent to any node


[Figure: one cluster of nodes 1 to 12 spanning Datacenter 1 and Datacenter 2]

Elastic Scalability

• Cassandra scales horizontally by adding more machines, each holding all or some of the data
• Adding nodes increases throughput linearly
• Increasing or decreasing the node count happens seamlessly

Scales linearly to terabytes and petabytes of data


[Figure: two clusters compared: performance throughput = N for the smaller cluster, performance throughput = N x 2 after doubling the node count]

Scaling Benchmark By Netflix*


“Cassandra scales linearly far beyond our current capacity requirements, and very rapid deployment automation makes it easy to manage. In particular, benchmarking in the cloud is fast, cheap and scalable, ...”

*http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Setup: 48, 96, 144 and 288 instances, driven by 10, 20, 30 and 60 clients respectively. Each client generated ~20,000 writes/s of 400 bytes each.
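As a back-of-the-envelope check on this setup, the sketch below computes the offered write load from the figures quoted above. It only does arithmetic on the slide's numbers and says nothing about the throughput Netflix actually measured; the function name `offered_load` is made up for illustration. Because clients grow in step with instances, the load each instance must absorb stays constant, which is what a linear-scaling test needs.

```python
# Offered write load in the benchmark setup quoted above.
# Computed from the slide's figures; these are NOT measured results.
INSTANCES = [48, 96, 144, 288]
CLIENTS = [10, 20, 30, 60]
WRITES_PER_CLIENT = 20_000   # ~20,000 writes/s per client
BYTES_PER_WRITE = 400

def offered_load(instances, clients):
    """Return (total writes/s, writes/s per instance, total MB/s)."""
    total = clients * WRITES_PER_CLIENT
    return total, total / instances, total * BYTES_PER_WRITE / 1e6

for n, c in zip(INSTANCES, CLIENTS):
    total, per_node, mb = offered_load(n, c)
    print(f"{n:>3} instances: {total:>9,} writes/s total, "
          f"{per_node:,.0f} per instance, {mb:,.0f} MB/s")
```

Every row works out to roughly 4,167 writes/s per instance, from 200,000 writes/s in total at 48 instances up to 1,200,000 writes/s at 288.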


High Availability and Fault Tolerance

• What is high availability?
  - Multiple networked computers operating in a cluster
  - A facility for recognizing node failures
  - Failing over requests to another part of the system

• Cassandra is highly available: no single point of failure, due to the peer-to-peer architecture
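The failover idea can be sketched in a few lines. This is a hypothetical toy, not Cassandra's failure detector or request routing: the `Node` class, the `up` flag and `read_with_failover` are all invented here to show a request moving on to another replica when one node is down.

```python
# Hypothetical sketch: NOT Cassandra's actual failure detection or routing.
class Node:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def read(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.data.get(key)

def read_with_failover(replicas, key):
    """Try each replica in turn and skip nodes that fail to respond."""
    for node in replicas:
        try:
            return node.name, node.read(key)
        except ConnectionError:
            continue  # node failure recognized: fail over to the next replica
    raise RuntimeError("no replica available")

# Three replicas hold the same value; the first one is down.
a, b, c = Node("n1", up=False), Node("n2"), Node("n3")
for n in (a, b, c):
    n.data["k"] = "v"
print(read_with_failover([a, b, c], "k"))  # ('n2', 'v')
```

Since every node here can serve the request, losing one of them does not make the data unavailable.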



Tunable Consistency

• Choose between strong and eventual consistency

• Adjustable for read- and write-operations separately

• Conflicts are resolved during reads, because the design favors write performance

Use case dependent level of consistency
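Resolving conflicts at read time can be pictured as "newest timestamp wins". The sketch below assumes each replica answers with a (timestamp, value) pair; the function name `resolve` and the sample values are illustrative assumptions, not part of any Cassandra API.

```python
def resolve(replica_responses):
    """Pick the response with the newest timestamp (last write wins)."""
    return max(replica_responses, key=lambda response: response[0])

# Two replicas already hold the newer write (t=2), one still has the old value (t=1).
responses = [(2, "new value"), (1, "old value"), (2, "new value")]
print(resolve(responses))  # (2, 'new value')
```

The write path never has to reconcile anything; the cost of picking a winner is paid only when data is read.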


[Figure: dial between "Available" and "Consistency", labelled TUNABLE]

When do we have strong consistency?


• Simple formula: (nodes_written + nodes_read) > replication_factor

• Ensures that a read always reflects the most recent write

• If not: weak consistency (eventually consistent)
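The formula can be checked by brute force. The sketch below, with function names made up for illustration, enumerates every possible set of written replicas and every possible set of read replicas and tests whether they always share at least one node. That overlap is what guarantees a read sees the latest write.

```python
from itertools import combinations

def is_strongly_consistent(nodes_written, nodes_read, replication_factor):
    """The rule from the slide: strong consistency iff NW + NR > RF."""
    return nodes_written + nodes_read > replication_factor

def read_always_sees_latest(nw, nr, rf):
    """Brute force: does every read set overlap every write set?"""
    replicas = range(rf)
    return all(set(w) & set(r)
               for w in combinations(replicas, nw)
               for r in combinations(replicas, nr))

print(is_strongly_consistent(2, 2, 3))   # True
print(read_always_sees_latest(2, 2, 3))  # True
print(read_always_sees_latest(1, 1, 3))  # False
```

With NW = 2, NR = 2 and RF = 3, any two replicas read must include at least one of the two replicas written. With NW = 1 and NR = 1 the read can miss the write entirely.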

[Figure: example with NW = 2, NR = 2, RF = 3; replicas of row "jsmith" hold versions t1 and t2]

Column-oriented Key-Value Store

• Data is stored in sparse multidimensional hash tables

• A row can have multiple columns, and rows need not all have the same number of columns

• Each row has a unique key, which also determines partitioning

• No relations!


Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
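A toy version of this nested map shows the sparse rows and the sorted columns. The row keys, column names and helper functions below are made up for illustration, and a plain dict stands in for the real storage engine.

```python
# Toy model of Map<RowKey, SortedMap<ColumnKey, ColumnValue>>.
# Row keys and column names are invented for illustration.
table = {}

def put(row_key, column_key, value):
    """Set one column of one row, creating the row if needed."""
    table.setdefault(row_key, {})[column_key] = value

def get_row(row_key):
    """Return the row's columns ordered by column key."""
    return sorted(table.get(row_key, {}).items())

put("jsmith", "name", "John Smith")
put("jsmith", "city", "Vienna")
put("adoe", "name", "A. Doe")   # sparse: this row has no "city" column

print(get_row("jsmith"))  # [('city', 'Vienna'), ('name', 'John Smith')]
print(get_row("adoe"))    # [('name', 'A. Doe')]
```

Each row carries only the columns that were actually written, so two rows in the same table can look quite different.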

Row Key1 | ColumnKey1   | ColumnKey2   | ColumnKey3   | …
         | ColumnValue1 | ColumnValue2 | ColumnValue3 | …
…        | …

Columns: stored sorted by column key/value
Rows: stored sorted by row key*

* Row keys (partition keys) should be hashed, in order to distribute data across the cluster evenly
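Why hashing helps can be seen in a small sketch. The node names, the choice of MD5 and the modulo placement are simplifying assumptions for illustration; a real cluster assigns token ranges on a ring rather than taking a modulo.

```python
import hashlib
from collections import Counter

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster

def token(row_key: str) -> int:
    """Hash the row key to a large integer token (MD5, for illustration only)."""
    return int.from_bytes(hashlib.md5(row_key.encode()).digest(), "big")

def node_for(row_key: str) -> str:
    # Simplification: modulo placement instead of token ranges on a ring.
    return NODES[token(row_key) % len(NODES)]

# Sequential keys would cluster together if placed in key order;
# hashing spreads them over all nodes.
counts = Counter(node_for(f"user{i:05d}") for i in range(10_000))
print(counts)
```

The counts per node come out close to one quarter of the keys each, even though the keys themselves are strictly sequential.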

CQL – An SQL-like query interface


• “CQL 3 is the default and primary interface into the Cassandra DBMS” *

• Familiar SQL-like syntax that maps to Cassandra's storage engine and simplifies data modelling

* http://www.datastax.com/documentation/cql/3.0/pdf/cql30.pdf

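-- Define a table; id is the primary (partition) key
CREATE TABLE songs (
    id uuid PRIMARY KEY,
    title text,
    album text,
    artist text,
    data blob,
    tags set<text>
);

-- Insert one row; uuid literals are written without quotes in CQL
-- (the value is elided on the slide)
INSERT INTO songs (id, title, artist, album, tags)
VALUES (
    a3e64f8f...,
    'La Grange',
    'ZZ Top',
    'Tres Hombres',
    {'cool', 'hot'}
);

-- Look the row up by its primary key
SELECT *
FROM songs
WHERE id = a3e64f8f...;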

“SQL-like” but NOT relational SQL

High Performance

• Optimized from the ground up for high throughput

• All disk writes are sequential, append only operations

• No reading before writing

• Cassandra's threading model is optimized for multiprocessor/multicore machines


Optimized for writing, but fast reads are possible as well
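The write-path bullets above can be pictured with a toy log-structured store. This is a minimal sketch under stated assumptions, not Cassandra's actual commit log, memtable or SSTable code: the class name, the in-memory list standing in for sequential disk writes and the logical clock are all invented for illustration.

```python
class AppendOnlyStore:
    """Toy log-structured store: writes append, reads reconcile."""

    def __init__(self):
        self.log = []     # stands in for sequential, append-only disk writes
        self.clock = 0    # logical timestamp, for illustration only

    def write(self, key, value):
        # No read-before-write: the previous value is never looked up.
        self.clock += 1
        self.log.append((self.clock, key, value))

    def read(self, key):
        # The work happens at read time: the newest entry for the key wins.
        versions = [(ts, v) for ts, k, v in self.log if k == key]
        return max(versions)[1] if versions else None

store = AppendOnlyStore()
store.write("title", "v1")
store.write("title", "v2")   # an update is just another append
print(store.read("title"))   # v2
print(len(store.log))        # 2
```

Writes stay cheap because they never touch existing data, and the read path pays for reconciling versions instead.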

Benchmark from 2011 (Cassandra 0.7.4)*


* NoSQL Benchmarking by CUBRID, http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/

Cassandra showed outstanding throughput in “INSERT-only” with 20,000 ops

Insert: Enter 50 million 1K-sized records
Read: Search key for a one hour period + optional update
Hardware: Nehalem 6 Core x 2 CPU, 16GB Memory

Benchmark from 2013 (Cassandra 1.1.6)*


* Benchmarking Top NoSQL Databases by End Point Corporation, http://www.datastax.com/wp-content/uploads/2013/02/WP-Benchmarking-Top-NoSQL-Databases.pdf
Yahoo! Cloud Serving Benchmark: https://github.com/brianfrankcooper/YCSB

When do we need these features?

• Large deployments
• Lots of writes, statistics, and analysis
• Geographical distribution
• Evolving applications


Who is using Cassandra?


eBay Data Infrastructure*

• 10+ clusters
• 100+ nodes
• > 250 TB provisioned (local HDD + shared SSD)
• > 9 billion writes/day
• > 5 billion reads/day

• Thousands of nodes
• The world's largest cluster with 2K+ nodes

• Thousands of nodes
• > 2K sharded logical hosts
• > 16K tables
• > 27K indexes
• > 140 billion SQLs/day
• > 5 PB provisioned

• Hundreds of nodes
• Persistent & in-memory
• > 40 billion SQLs/day

• Hundreds of nodes
• > 50 TB
• > 2 billion ops/day

Not replacing RDBMS but complementing!

*by Jay Patel, Cassandra Summit June 2013 San Francisco

Cassandra Use Case at eBay


Application/Use Case

• Time-series data and real-time insights

• Fraud detection & prevention

• Quality Click Pricing for affiliates

• Order & Shipment Tracking

• …

• Server metrics collection

• Taste graph-based next-gen recommendation system

• Social Signals on eBay Product & Item pages

Why Cassandra?

• Multi-Datacenter (active-active)

• No SPOF

• Easy to scale

• Write performance

• Distributed Counters

Cassandra/Hadoop Deployment


Summary

• History
• Key features of Cassandra
  - Distributed and decentralized
  - Elastic scalability
  - High availability and fault tolerance
  - Tunable consistency
  - Column-oriented key-value store
  - CQL interface
  - High performance
• eBay use case


Community portal: http://planetcassandra.org
Documentation: http://www.datastax.com/docs
Apache project: http://cassandra.apache.org
