Cassandra Introduction & Features

Cassandra Introduction & Key Features Meetup Vienna Cassandra Users 13th of January 2014 [email protected]

description

This presentation briefly describes the key features of Apache Cassandra. It was given at the Cassandra Users Meetup in Vienna in January 2014.

Transcript of Cassandra Introduction & Features

Page 1: Cassandra Introduction & Features

Cassandra Introduction & Key Features

Meetup Vienna Cassandra Users

13th of January 2014

[email protected]

Page 2: Cassandra Introduction & Features

Definition

Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web [The Definitive Guide, Eben Hewitt, 2010]

13/01/2014 Cassandra Introduction & Key Features by Philipp Potisk 2

Page 3: Cassandra Introduction & Features

History

Bigtable, 2006

Dynamo, 2007

Open source, 2008

Page 4: Cassandra Introduction & Features

Key Features

Cassandra

• Distributed and Decentralized

• Elastic Scalability

• High Availability and Fault Tolerance

• Tunable Consistency

• Column-oriented Key-Value Store

• CQL – An SQL-like query interface

• High Performance

Page 5: Cassandra Introduction & Features

Distributed and Decentralized

• Distributed: Capable of running on multiple machines

• Decentralized: No single point of failure

No master-slave issues, thanks to the peer-to-peer architecture (nodes communicate via the gossip protocol)

A single Cassandra cluster may run across geographically dispersed data centers

Read and write requests can be sent to any node

[Figure: one cluster of nodes 1 to 12 spanning Datacenter 1 and Datacenter 2]

Page 6: Cassandra Introduction & Features

Elastic Scalability

• Cassandra scales horizontally by adding more machines, each of which holds all or some of the data

• Adding nodes increases throughput linearly

• Increasing or decreasing the node count happens seamlessly

Scales linearly to terabytes and petabytes of data

[Figure: doubling the number of nodes in the ring doubles throughput (performance throughput = N vs. performance throughput = N x 2)]

Page 7: Cassandra Introduction & Features

Scaling Benchmark By Netflix*

"Cassandra scales linearly far beyond our current capacity requirements, and very rapid deployment automation makes it easy to manage. In particular, benchmarking in the cloud is fast, cheap and scalable, ..."

* http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Setup: 48, 96, 144 and 288 instances, with 10, 20, 30 and 60 clients respectively. Each client generated ~20,000 writes/s of 400 bytes each.
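The setup figures above can be turned into a rough offered-load calculation. This is only a sketch of the arithmetic, assuming every client really sustains about 20,000 writes/s and that load spreads evenly over the instances; the function names are made up for illustration, and the numbers are the load the clients generate, not throughput measured by Netflix.

```python
# Offered write load per benchmark run, using only the figures quoted above.
WRITES_PER_CLIENT = 20_000  # approximate writes/s generated by each client

# (instances, clients) pairs for the four benchmark runs
RUNS = [(48, 10), (96, 20), (144, 30), (288, 60)]


def offered_load(clients: int, writes_per_client: int = WRITES_PER_CLIENT) -> int:
    """Total writes/s that the clients try to push into the cluster."""
    return clients * writes_per_client


def load_per_instance(instances: int, clients: int) -> float:
    """Offered writes/s divided evenly over the instances (an assumption)."""
    return offered_load(clients) / instances


if __name__ == "__main__":
    for instances, clients in RUNS:
        total = offered_load(clients)
        per_node = load_per_instance(instances, clients)
        print(f"{instances:>3} instances: {total:>9,} writes/s offered, "
              f"{per_node:,.0f} per instance")
```

The ratio of clients to instances is the same in all four runs, so the offered load per instance stays constant while the cluster grows. That is the condition under which "linear scaling" means total throughput grows in proportion to the node count.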


Page 8: Cassandra Introduction & Features

High Availability and Fault Tolerance

• What does high availability require?

Multiple networked computers operating in a cluster

A facility for recognizing node failures

Failing over requests to another part of the system

• Cassandra is highly available

No single point of failure, due to the peer-to-peer architecture


Page 9: Cassandra Introduction & Features

Tunable Consistency

• Choose between strong and eventual consistency

• Adjustable separately for read and write operations

• Conflicts are resolved during reads, because the focus lies on write performance

The level of consistency depends on the use case
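Resolving conflicts at read time can be sketched as follows. This is a minimal illustration, assuming each replica returns its value together with a write timestamp and that the newest timestamp wins; the names `Versioned`, `resolve` and `stale_replicas` are hypothetical and not part of any Cassandra API.

```python
from typing import Dict, List, NamedTuple, Sequence


class Versioned(NamedTuple):
    value: str
    timestamp: int  # write time, e.g. microseconds since the epoch


def resolve(replies: Sequence[Versioned]) -> Versioned:
    """Return the reply with the highest write timestamp (last write wins)."""
    if not replies:
        raise ValueError("no replica replied")
    return max(replies, key=lambda reply: reply.timestamp)


def stale_replicas(replies: Dict[str, Versioned]) -> List[str]:
    """Names of replicas whose copy is older than the winning version."""
    winner = resolve(list(replies.values()))
    return sorted(name for name, reply in replies.items()
                  if reply.timestamp < winner.timestamp)


if __name__ == "__main__":
    replies = {
        "node1": Versioned("old address", 1),
        "node2": Versioned("new address", 2),
    }
    print(resolve(list(replies.values())).value)
    print(stale_replicas(replies))
```

Because the comparison happens on the read path, a write never has to check what other replicas currently hold, which keeps writes cheap.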

[Figure: a scale between "Available" and "Consistency", labelled TUNABLE]

Page 10: Cassandra Introduction & Features

When do we have strong consistency?


• Simple formula: (nodes_written + nodes_read) > replication_factor

• Ensures that a read always reflects the most recent write

• If not: weak consistency, i.e. eventually consistent
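The formula is easy to check in code. The sketch below uses the variable names from the slide; the `quorum` helper is an addition that is not on the slide and shows the usual majority size, floor(RF / 2) + 1, which satisfies the formula when used for both reads and writes.

```python
def is_strongly_consistent(nodes_written: int, nodes_read: int,
                           replication_factor: int) -> bool:
    """True if every read set must overlap every write set."""
    for count in (nodes_written, nodes_read):
        if not 1 <= count <= replication_factor:
            raise ValueError("node counts must be between 1 and replication_factor")
    return nodes_written + nodes_read > replication_factor


def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


if __name__ == "__main__":
    # Values from the slide's example: NW = 2, NR = 2, RF = 3
    print(is_strongly_consistent(2, 2, 3))
    # Writing and reading a single replica each is only eventually consistent
    print(is_strongly_consistent(1, 1, 3))
```

With NW = 2 and NR = 2 out of RF = 3, any two replicas read must include at least one of the two replicas written, so the read sees the latest write.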

[Figure: example for the key jsmith with NW: 2, NR: 2, RF: 3, showing replicas that hold the versions written at t1 and t2]

Page 11: Cassandra Introduction & Features

Column-oriented Key-Value Store

• Data is stored in sparse multidimensional hash tables

• A row can have multiple columns, and rows need not have the same number of columns

• Each row has a unique key, which also determines partitioning

• No relations!


Map<RowKey, SortedMap<ColumnKey, ColumnValue>>

Row Key1: ColumnKey1 → ColumnValue1, ColumnKey2 → ColumnValue2, ColumnKey3 → ColumnValue3, …

Columns are stored sorted by column key/value

Rows are stored sorted by row key*

* Row keys (partition keys) should be hashed, in order to distribute data across the cluster evenly
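The nested-map model and the hashing footnote can be illustrated with a small sketch. It assumes string keys and values and a fixed list of node names; `SparseTable` and `node_for` are hypothetical names, and the SHA-256-modulo placement is a deliberately simple stand-in, not Cassandra's actual partitioner or token ring.

```python
import hashlib
from typing import Dict, List, Tuple

Row = Dict[str, str]  # column key -> column value


class SparseTable:
    """Map<RowKey, SortedMap<ColumnKey, ColumnValue>> as nested dicts."""

    def __init__(self) -> None:
        self._rows: Dict[str, Row] = {}

    def put(self, row_key: str, column_key: str, value: str) -> None:
        # Rows are sparse: each row stores only the columns it actually has.
        self._rows.setdefault(row_key, {})[column_key] = value

    def columns(self, row_key: str) -> List[Tuple[str, str]]:
        """Columns of one row, sorted by column key."""
        return sorted(self._rows.get(row_key, {}).items())


def node_for(row_key: str, nodes: List[str]) -> str:
    """Pick a node from a hash of the row key (simplified placement)."""
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    token = int.from_bytes(digest, "big")
    return nodes[token % len(nodes)]


if __name__ == "__main__":
    table = SparseTable()
    table.put("jsmith", "email", "js@example.org")
    table.put("jsmith", "city", "Vienna")
    table.put("adoe", "email", "ad@example.org")  # fewer columns than jsmith
    print(table.columns("jsmith"))
    print(node_for("jsmith", ["node1", "node2", "node3"]))
```

Hashing the row key means that keys which sort next to each other, such as user names with a common prefix, still end up spread over different nodes.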

Page 12: Cassandra Introduction & Features

CQL – An SQL-like query interface


• “CQL 3 is the default and primary interface into the Cassandra DBMS” *

• Familiar SQL-like syntax that maps to Cassandra's storage engine and simplifies data modelling

* http://www.datastax.com/documentation/cql/3.0/pdf/cql30.pdf

CREATE TABLE songs (
  id uuid PRIMARY KEY,
  title text,
  album text,
  artist text,
  data blob,
  tags set<text>
);

-- The id is elided on the slide; CQL 3 uuid literals are written without quotes.
INSERT INTO songs (id, title, artist, album, tags)
VALUES (
  a3e64f8f...,
  'La Grange',
  'ZZ Top',
  'Tres Hombres',
  {'cool', 'hot'});

SELECT *
FROM songs
WHERE id = a3e64f8f...;

“SQL-like” but NOT relational SQL

Page 13: Cassandra Introduction & Features

High Performance

• Optimized from the ground up for high throughput

• All disk writes are sequential, append-only operations

• No read before write

• Cassandra's threading model is optimized for multiprocessor/multicore machines


Optimized for writing, but fast reads are possible as well
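The idea of sequential, append-only writes without a prior read can be sketched with a toy store. This is a heavily simplified model, assuming an in-memory log and tiny flush threshold; the class `AppendOnlyStore` is hypothetical, and the terms commit log, memtable and segment are borrowed for illustration and do not appear on the slides.

```python
from typing import Dict, List, Optional, Tuple


class AppendOnlyStore:
    def __init__(self, memtable_limit: int = 2) -> None:
        self.commit_log: List[Tuple[str, str]] = []  # sequential, append-only
        self.memtable: Dict[str, str] = {}           # newest values, in memory
        self.segments: List[Dict[str, str]] = []     # flushed, never modified
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        """Append to the log and update memory; old data is never read first."""
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        """Write the memtable out as a new key-sorted, immutable segment."""
        self.segments.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key: str) -> Optional[str]:
        """Check memory first, then segments from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            if key in segment:
                return segment[key]
        return None


if __name__ == "__main__":
    store = AppendOnlyStore(memtable_limit=2)
    store.write("a", "1")
    store.write("b", "2")  # memtable is full, so it is flushed
    store.write("a", "3")  # overwrites by appending, not by updating in place
    print(store.read("a"), store.read("b"))
```

An overwrite only appends a newer version, so the extra work of finding the latest value falls on the read path. This is why such a design favors writes while still allowing reasonably fast reads.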


Page 14: Cassandra Introduction & Features

Benchmark from 2011 (Cassandra 0.7.4)*


* NoSQL Benchmarking by CUBRID, http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/

Cassandra showed outstanding throughput in “INSERT-only” with 20,000 ops

Insert: Enter 50 million 1K-sized records

Read: Search key for a one hour period + optional update

Hardware: Nehalem 6 Core x 2 CPU, 16GB Memory

Page 15: Cassandra Introduction & Features

Benchmark from 2013 (Cassandra 1.1.6)*

* Benchmarking Top NoSQL Databases by End Point Corporation, http://www.datastax.com/wp-content/uploads/2013/02/WP-Benchmarking-Top-NoSQL-Databases.pdf

Yahoo! Cloud Serving Benchmark: https://github.com/brianfrankcooper/YCSB

Page 16: Cassandra Introduction & Features

When do we need these features?

Large Deployments

Lots of Writes, Statistics, and Analysis

Geographical Distribution

Evolving Applications


Page 17: Cassandra Introduction & Features

Who is using Cassandra?


Page 18: Cassandra Introduction & Features

eBay Data Infrastructure*

• 10+ clusters
• 100+ nodes
• > 250 TB provisioned (local HDD + shared SSD)
• > 9 billion writes/day
• > 5 billion reads/day

• Thousands of nodes
• The world's largest cluster with 2K+ nodes

• Thousands of nodes
• > 2K sharded logical hosts
• > 16K tables
• > 27K indexes
• > 140 billion SQLs/day
• > 5 PB provisioned

• Hundreds of nodes
• Persistent & in-memory
• > 40 billion SQLs/day

• Hundreds of nodes
• > 50 TB
• > 2 billion ops/day

Not replacing RDBMS but complementing!

* by Jay Patel, Cassandra Summit, June 2013, San Francisco

Page 19: Cassandra Introduction & Features

Cassandra Use Case at eBay

Application/Use Case

• Time-series data and real-time insights

• Fraud detection & prevention

• Quality Click Pricing for affiliates

• Order & Shipment Tracking

• …

• Server metrics collection

• Taste graph-based next-gen recommendation system

• Social Signals on eBay Product & Item pages

Why Cassandra?

• Multi-Datacenter (active-active)

• No SPOF

• Easy to scale

• Write performance

• Distributed Counters

Page 20: Cassandra Introduction & Features

Cassandra/Hadoop Deployment


Page 21: Cassandra Introduction & Features

Summary

• History

• Key features of Cassandra

• Distributed and Decentralized

• Elastic Scalability

• High Availability and Fault Tolerance

• Tunable Consistency

• Column-oriented key-value store

• CQL interface

• High Performance

• eBay Use Case

Community portal: http://planetcassandra.org

Documentation: http://www.datastax.com/docs

Apache project: http://cassandra.apache.org