About "Apache Cassandra"

APACHE CASSANDRA: Scalability, Performance and Fault Tolerance in Distributed Databases

Jihyun.An ([email protected])

18, June 2013


TABLE OF CONTENTS

Preface

Basic Concepts

P2P Architecture

Primitive Data Model & Architecture

Basic Operations

Fault Management

Consistency

Performance

Problem handling

TABLE OF CONTENTS (NEXT TIME)

Maintaining

Cluster Management

Node Management

Problem Handling

Tuning

Playing (for Development, Client stance)

Designing

Client

Thrift

Native

CQL

3rd party

Hector

OCM

Extension

Baas.io

Hadoop

PREFACE

OUR WORLD

Traditional DBMSs are very valuable

Storage (+ memory) and computational resources are cheaper than before

But we face new challenges

Big data

(near) Real time

Complex and varied requirements

Recommendation

Find FOAF

…

Event-driven triggering

User Session

…

OUR WORLD (CONT)

Complex applications combine different types of problems

Different languages -> more productive

ex: Functional language, Multiprocessing optimized language

Polyglot persistence layer

Performance vs Durability?

Reliability?

…

TRADITIONAL DBMS

Relational Model

Well-defined Schema

Access with Selection/Projection

Derived from Joining/Grouping/Aggregating(Counting..)

Small (refined) data

…

But

Painful data model changes

Hard to scale out

Ineffective in handling large volumes of data

Not designed with hardware characteristics in mind

…

TRADITIONAL DBMS (CONT)

Has many constraints for ACID

PK/FK & checking

Domain Type checking

.. checking checking

Lots of IO / Processing

OODBMS, ORDBMS

Good but .. more more checking / processing

Not well with Disk IO

NOSQL

Key-value store

Column : Cassandra, Hbase, Bigtable …

Others : Redis, Dynamo, Voldemort, Hazelcast …

Document oriented

MongoDB, CouchDB …

Graph store

Neo4j, Orient DB, BigOWL, FlockDB ..

NOSQL (CONT)

Benefits

Higher performance

Higher scalability

Flexible Datamodel

More effective for some case

Less administrative overhead

Drawbacks

Limited Transactions

Relaxed Consistency

Unconstrained data

Limited ad-hoc query capabilities

Limited administrative aid tools

CAP

Brewer’s theorem

We can pick two of

Consistency

Availability

Partition tolerance

[Figure: CAP triangle with corners A, C and P, showing three groups of systems]

Amazon Dynamo derivatives: Cassandra, Voldemort, CouchDB, Riak

Neo4j, Bigtable; Bigtable derivatives: MongoDB, HBase, Hypertable, Redis

Relational: MySQL, MSSQL, Postgres

Dynamo (architecture) + BigTable (data model) = Cassandra

(Apache) Cassandra is a free, open-source, highly scalable, distributed database system for managing large amounts of data

Written in Java

Running on JVM

References :

BigTable (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf)

Dynamo (http://web.archive.org/web/20120129154946/http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf)


DESIGN GOALS

Simple Key/Value(Column) store

Limited to storage

No support for anything (aggregating, grouping …) beyond basic operations (CRUD, range access)

But extensible

Hadoop (MR, HDFS, Pig, Hive ..)

ESP

Distributed Processing Interface (ex: BSP, MR)

Baas.io

…

DESIGN GOALS (CONT)

High Availability

Decentralized

Every node can serve as an access point

Replication and access to the replicas

Multi DC support

Eventual consistency

Less write complexity

Audit and repair on read

Possible tuning -> Trade offs between consistency, durability and latency

DESIGN GOALS (CONT)

Incremental scalability

Equal Member

Linear Scalability

Unlimited space

Write / read throughput increases linearly as nodes (members) are added

Low total cost

Minimize administrative work

Automatic partitioning

Flush / compaction

Data balancing / moving

Virtual nodes (since v1.2)

Mid-range nodes deliver good performance

Nodes working together provide high performance and huge capacity

FOUNDER & HISTORY

Founder

Avinash Lakshman (one of the authors of Amazon's Dynamo)

Prashant Malik ( Facebook Engineer )

Developer

About 50

History

Open sourced by Facebook in July 2008

Became an Apache Incubator project in March 2009

Graduated to a top-level project in Feb 2010

0.6 released (added support for integrated caching, and Apache Hadoop MapReduce) in Apr 2010

0.7 released (added secondary indexes and online schema change) in Jan 2011

0.8 released (added the Cassandra Query Language (CQL), self-tuning memtables, and support for zero-downtime upgrades) in Jun 2011

1.0 released (added integrated compression, leveled compaction, and improved read performance) in Oct 2011

1.1 released (added self-tuning caches, row-level isolation, and support for mixed ssd/spinning disk deployments) in Apr 2012

1.2 released (added clustering across virtual nodes, inter-node communication, atomic batches, and request tracing) in Jan 2013

PROMINENT USERS

User | Cluster size | Node count | Usage | Now

Facebook | >200 | ? | Inbox search | Abandoned, moved to HBase

Cisco WebEx | ? | ? | User feed, activity | OK

Netflix | ? | ? | Backend | OK

Formspring | ? (26 million accounts with 10 million responses per day) | ? | Social-graph data | OK

Others: Urban Airship, Rackspace, OpenX, Twitter (preparing to move to Cassandra)

BASIC CONCEPTS

P2P ARCHITECTURE

All nodes are the same (peers are equal)

No single point of failure / Decentralized

Compare with

mongoDB

broker structure (cubrid …)

Master / slave

…

P2P ARCHITECTURE

Drives linear scalability

References :

http://dev.kthcorp.com/2011/12/07/cassandra-on-aws-100-million-writ/

http://www.google.co.kr/url?sa=i&rct=j&q=&esrc=s&frm=1&source=images&cd=&cad=rja&docid=3YSDAgGnuMHm4M&tbnid=rpuahptcjv4gvM:&ved=0CAUQjRw&url=http://readwrite.com/2011/11/24/netflix-benchmarks-cassandra-o&ei=JfjAUabmMIiQkAX4loDIBQ&bvm=bv.47883778,d.dGI&psig=AFQjCNGBaG1NPmCzZ7tjSKwBgzwboyvxGA&ust=1371687139804572


http://www.google.co.kr/url?sa=i&rct=j&q=&esrc=s&frm=1&source=images&cd=&cad=rja&docid=bYZ2I3MeFYR8PM&tbnid=v93nfjfUKSBHVM:&ved=0CAUQjRw&url=http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html&ei=oPjAUfHfBsSmkwWV0oGQAQ&bvm=bv.47883778,d.dGI&psig=AFQjCNGBaG1NPmCzZ7tjSKwBgzwboyvxGA&ust=1371687139804572



PRIMITIVE DATA MODEL & ARCHITECTURE

COLUMN

Basic and primitive type (the smallest increment of data)

A tuple containing a name, a value and a timestamp

Timestamp is important

Provided by client

Determine the most recent one

On a collision, the DBMS chooses the latest one

[Figure: a column = Name | Value | Timestamp]
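The timestamp rule above can be sketched in a few lines. This is a minimal illustration, not Cassandra's code: the `Column` type and `resolve` function are hypothetical names, and keeping the first argument on a tie is an arbitrary choice made for this sketch.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int  # supplied by the client, e.g. microseconds since the epoch


def resolve(a: Column, b: Column) -> Column:
    """Return the version with the newer timestamp (a tie keeps `a` in this sketch)."""
    return b if b.timestamp > a.timestamp else a


old = Column("fullname", "smith", 1000)
new = Column("fullname", "mike", 2000)
winner = resolve(old, new)
print(winner.value)
```

Because the timestamp comes from the client, clients with skewed clocks can make an older write win, which is why the slide calls the timestamp important.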

COLUMN (CONT)

Types

Standard: A column has a name (UUID or UTF8 …)

Composite: A column has composite name (UUID+UTF8 …)

Expiring: TTL marked

Counter: Only has name and value, timestamp managed by server

Super: Used to manage wide rows, inferior to using composite columns (DO NOT USE, all sub-columns are serialized)

[Figure: column layouts. Counter column: Name | Value. Other columns: Name | Value | Timestamp]

COLUMN (CONT)

Types (CQL3 based)

Standard: Has a single-column primary key.

Composite: Has a primary key of more than one column, recommended for managing wide rows.

Expiring: Gets deleted during compaction.

Counter: Counts occurrences of an event.

Super: Used to manage wide rows, inferior to using composite columns (DO NOT USE, all sub-columns are serialized)

DDL : CREATE TABLE test (

user_id varchar,

article_id uuid,

content varchar,

PRIMARY KEY (user_id, article_id)

);

<Logical>

user_id | article_id | content

Smith | <uuid1> | Blah1..

Smith | <uuid2> | Blah2..

<Physical>

Row key Smith -> [{uuid1,content} | Blah1… | Timestamp] [{uuid2,content} | Blah2… | Timestamp]

SELECT user_id, article_id FROM test WHERE user_id = 'Smith' ORDER BY article_id DESC LIMIT 1;
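The logical-to-physical mapping on this slide can be sketched as below. It is an illustration only: `to_physical` is a hypothetical helper, plain strings stand in for the UUIDs, and the real storage engine encodes composite names in its own binary format.

```python
from collections import defaultdict


def to_physical(rows):
    """Group CQL rows by partition key; each cell gets a composite column name."""
    physical = defaultdict(dict)
    for row in rows:
        # composite name = (clustering column value, regular column name)
        physical[row["user_id"]][(row["article_id"], "content")] = row["content"]
    # columns inside one physical row are kept sorted by their composite name
    return {key: dict(sorted(cols.items())) for key, cols in physical.items()}


logical = [
    {"user_id": "Smith", "article_id": "uuid2", "content": "Blah2"},
    {"user_id": "Smith", "article_id": "uuid1", "content": "Blah1"},
]
physical = to_physical(logical)
print(physical)
```

Both logical rows end up in one wide physical row keyed by `Smith`, which is why ordering by `article_id` within that row is cheap.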

ROWS

A row contains a representative key (the row key) and a set of columns

A row key must be unique (usually UUID)

Supports up to 2 billion columns per (physical) row.

Columns are sorted by their name (Column’s Name indexed)

Primitive

Secondary Index

Direct Column Access

[Figure: a row = Row Key -> three columns, each Name | Value | Timestamp]

COLUMN FAMILY

Container for columns and rows

No fixed schema

Each row is uniquely identified by its row key

Each row can have a different set of columns

Rows are sorted by row key

Comparator / Validator

Static/Dynamic CF

If the column type is super column, the CF is called a “Super Column Family”

Like “Table” in Relational world

[Figure: a column family holding two rows; each Row Key points to its own set of Name | Value | Timestamp columns]

DISTRIBUTION

[Figure: six rows to be placed on Server 1, Server 2, Server 3 and Server 4. How to map?]

TOKEN RING

A node is an instance (typically the same as a server)

Tokens are used to map each row to a node

Tokens range from 0 to 2^127 - 1

Associated with a row key

Node

Assigned a unique token (ex: token 5 to Node 5)

Its range runs from the previous node's token to its own token

token 4 < Node 5's range <= token 5

[Figure: ring of Node 1 to Node 8, with Token 4 and Token 5 marking the bounds of Node 5's range]
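The ring lookup described above can be sketched as follows. This is a toy model under stated assumptions: `token_for` and `owner` are hypothetical names, the MD5-modulo hash only approximates what a random partitioner does, and the eight evenly spaced tokens are made up for the example.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127


def token_for(row_key: str) -> int:
    """Hash a row key to a token in [0, 2**127 - 1] (MD5-based, as a sketch)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def owner(token: int, node_tokens: list) -> int:
    """Index of the node whose range (previous token, own token] contains `token`."""
    i = bisect.bisect_left(node_tokens, token)
    return i % len(node_tokens)  # past the highest token, wrap to the first node


# eight evenly spaced, sorted tokens for a hypothetical 8-node cluster
node_tokens = [(i + 1) * RING_SIZE // 8 - 1 for i in range(8)]
print(owner(token_for("Smith"), node_tokens))
```

A token equal to a node's own token belongs to that node, and a token just above it belongs to the next node, matching "token 4 < Node 5's range <= token 5".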

PARTITIONING

[Figure: row keys pass through the partitioner to be placed on the ring]

Random Partitioners (MD5, Murmur3): default

Order Preserving Partitioner / Byte Ordered Partitioner

REPLICATION

Any node that receives a read/write request from a client acts as the coordinator node

The locator determines where the replicas are located

Replicas are used for

Consistency check

Repair

Ensure W + R > N for consistency

Local Cache (Row cache)

[Figure: ring of Node 1 to Node 8. The coordinator node asks the (Simple) locator, which locates the first replica (the original) and then the others]

Replication factor is 4 (N-1 copies are replicated)

The Simple locator treats strategy order as proximity
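A simple locator of this kind can be sketched as a clockwise walk around the ring. The function name `simple_replicas` and the zero-based node indexes are assumptions of this sketch; rack and data-center awareness are deliberately left out.

```python
def simple_replicas(primary: int, node_count: int, rf: int) -> list:
    """Start at the node owning the row and take `rf` distinct nodes clockwise."""
    rf = min(rf, node_count)  # cannot place more replicas than there are nodes
    return [(primary + step) % node_count for step in range(rf)]


# 8-node ring, replication factor 4, row owned by node index 6
print(simple_replicas(6, 8, 4))
```

The first entry is the original and the remaining N-1 entries are the replicas, wrapping past the end of the ring when needed.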

REPLICATION (CONT)

Multi DC support

Allows specifying how many replicas are kept in each DC

Within DC replicas are placed on different racks

Relies on snitch to place replicas

Strategy (provided from Snitch)

Simple (Single DC)

RackInferringSnitch

PropertyFileSnitch

Ec2Snitch

Ec2MultiRegionSnitch

[Figure: replicas placed across the entire cluster, spanning DC1 and DC2]

ADD / REMOVE NODE

Data transfer between nodes is called “Streaming”

If node 5 is added,

nodes 3, 4 and 1 (assuming RF is 2) are involved in streaming

If node 2 is removed,

node 3 (which has the next higher token and holds node 2's replica) serves instead

[Figure: a four-node ring (Node 1 to Node 4), the ring after adding Node 5, and the ring after removing Node 2]

VIRTUAL NODES

Supported since v1.2

Real time migration support?

Shuffle utility

One node has many tokens

=> one node has many ranges

[Figure: a cluster of Node 1 and Node 2 where the number of tokens per node is 4]

VIRTUAL NODES (CONT)

Less administrative work

Save cost

When adding/removing a node

many nodes cooperate

No need to determine the token

Shuffle to re-balance

Less time to change

Smart balancing

No need to balance

(The number of tokens should be sufficiently high)

[Figure: with 4 tokens per node, Node 3 is added to a cluster of Node 1 and Node 2]

KEYSPACE

A namespace for column families

Authorization

CF? yeah

Replication

Key oriented schema (Row Key -> Column Family -> Column):

{
  "row_key1": {
    "Users": {
      "emailAddress": {"name": "emailAddress", "value": "[email protected]"},
      "webSite": {"name": "webSite", "value": "http://bar.com"}
    },
    "Stats": {
      "visits": {"name": "visits", "value": "243"}
    }
  },
  "row_key2": {
    "Users": {
      "emailAddress": {"name": "emailAddress", "value": "[email protected]"},
      "twitter": {"name": "twitter", "value": "user2"}
    }
  }
}

CLUSTER

The total amount of data managed by the cluster is represented as a ring

Cluster of nodes

Has one or more keyspaces

Partitioning Strategy defined

Authentication

GOSSIP

Gossip protocol is used for cluster membership.

Failure detection on service level (Alive or Not)

Responsible

Every node in the system knows every other node’s status

Implemented as

Syn -> Ack -> Ack2

Information : status, load, bootstrapping

Basic status is Alive/Dead/Join

Runs every second

Status disseminated in O(logN) (N is the number of nodes)

Seed

PHI is used to judge whether a node is dead or alive within a time window

(PHI of 5 -> detection in 15~16 s)

Data structure

HeartBeat < Application Status < Endpoint Status < Endpoint StatusMap

[Figure: nodes N1 to N6 exchanging gossip messages]
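The PHI value above can be sketched as follows, assuming heartbeat gaps are exponentially distributed. The function names and the one-second mean are illustrative assumptions; a real detector derives the mean from a window of observed arrivals, so detection times will differ from this toy calculation.

```python
import math


def phi(seconds_since_last_heartbeat: float, mean_interval: float) -> float:
    """Suspicion level: -log10 of the chance that a live node stays silent this long,
    assuming exponentially distributed heartbeat gaps (an assumption of this sketch)."""
    return seconds_since_last_heartbeat / (mean_interval * math.log(10))


def is_suspected_dead(silence: float, mean_interval: float, threshold: float = 5.0) -> bool:
    """A node is convicted once its PHI value exceeds the configured threshold."""
    return phi(silence, mean_interval) > threshold


print(is_suspected_dead(2.0, 1.0))
print(is_suspected_dead(20.0, 1.0))
```

Because the mean adapts to what each node observes, a slow network raises the mean and therefore lengthens the silence tolerated before a node is marked dead.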

BASIC OPERATIONS

WRITE / UPDATE

CommitLog

Abstracted Mmaped Type

File & memory sync -> on system failure, this is your guardian angel ^^.

Java NIO

C-Heap used (=Native Heap)

Log data (a write followed by a delete? Both still exist in the log)

Segment Rolling structure

Memtable

In memory buffer and workspace

Sorted order by row key

When a threshold or a periodic point is reached, it is written to disk as a persistent table structure (SSTable)

WRITE / UPDATE (LOCAL LEVEL)

[Figure: local write path]

1. Write to CommitLog

Write : "1":{"name":"fullname","value":"smith"}

Write : "2":{"name":"fullname","value":"mike"}

Delete : "1"

Write : "3":{"name":"fullname","value":"osang"}

…

2. Write/Update to Memtable

Key | Name | Value

1 | fullname | smith

2 | fullname | mike

3 | fullname | Osang

… | … | …

3. Write to disk (flush): Memtable -> SSTable, SSTable, SSTable
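The three steps of the local write path can be sketched as a small in-memory model. Everything here is hypothetical (the `Node` class, a flush threshold counted in rows, lists standing in for files); it only shows the ordering: log first, memtable second, flush to an immutable sorted table third.

```python
class Node:
    def __init__(self, flush_threshold: int = 3):
        self.commit_log = []   # append-only record of every mutation
        self.memtable = {}     # in-memory buffer: latest value per row key
        self.sstables = []     # immutable, key-sorted tables "on disk"
        self.flush_threshold = flush_threshold

    def write(self, key: str, name: str, value: str) -> None:
        self.commit_log.append(("write", key, name, value))  # step 1
        self.memtable[key] = (name, value)                   # step 2
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                                     # step 3

    def flush(self) -> None:
        """Freeze the memtable into a sorted, immutable SSTable and start a new one."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}


node = Node(flush_threshold=2)
node.write("2", "fullname", "mike")
node.write("1", "fullname", "smith")
print(node.sstables)
```

Nothing in the write path reads from disk, which is the main reason writes are cheap in this design.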

SSTABLE

SSTable is Sorted String Table

Best for log structured DB

Store large numbers of key-value pairs

Immutable

Create with “Flush”

Merges by (major/minor) compaction

May hold one or more versions of a column (with different timestamps)

The most recent one is chosen

READ (LOCAL LEVEL)

[Figure: local read path. A read consults the Memtable (Key | Name | Value: 2 | fullname | mike; 3 | fullname | Osang; …) and the SSTables, each with its bloom filter (BF) and index (IDX)]

READ (CLUSTER LEVEL, +READ REPAIR)

[Figure: cluster-level read with three replicas: Replica (original, right), Replica (right), Replica (wrong)]

1. Read operation -> Coordinator -> Locator: data is transferred from the original/replica nodes (according to the consistency level)

2. Digest comparing: choose the right one (the most recent) if digests differ

3. Recover the wrong replica
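The coordinator's digest comparison and repair step can be sketched like this. The function names and the MD5 digest over `value:timestamp` are assumptions of this sketch, and the dictionary stands in for network calls to the replicas.

```python
import hashlib


def digest(value: str, timestamp: int) -> str:
    """Short fingerprint of a replica's answer (hypothetical format)."""
    return hashlib.md5(f"{value}:{timestamp}".encode("utf-8")).hexdigest()


def read_with_repair(replicas: dict) -> tuple:
    """`replicas` maps node name -> (value, timestamp).

    Returns the most recent version; if digests differ, stale replicas are overwritten.
    """
    digests = {digest(value, ts) for value, ts in replicas.values()}
    newest = max(replicas.values(), key=lambda version: version[1])
    if len(digests) > 1:  # at least one replica disagrees
        for node in replicas:
            replicas[node] = newest
    return newest


replicas = {"n1": ("mike", 200), "n2": ("mike", 200), "n3": ("smith", 100)}
print(read_with_repair(replicas))
```

Comparing digests instead of full values keeps the common case (all replicas agree) cheap on the network.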

DELETE

Adds a tombstone (a special type of column)

Garbage collected when compacting

GC grace seconds : 864000 (default 10 days)

Issue

If a failed node recovers after GCGraceSeconds, the deleted data can be resurrected

FAULT MANAGEMENT

DETECTION

Dynamic threshold for marking nodes

Accrual Detection Mechanism calculates a per-node threshold

Automatically takes into account network conditions, workload and other conditions that might affect the perceived heartbeat rate.

From 3rd party client

Hector

Failover

HINTED-HANDOFF

The coordinator stores a hint if the target node is down or fails to acknowledge the write

A hint consists of the target replica and the mutation (column object) to be replayed

Uses the Java heap (may move off-heap later)

Hints are only saved for a limited time (default, 1 hour) after a replica fails

When the failed node is alive again, the missed writes are streamed to it

REPAIR

Three repair methods are supported

CommitLog Replaying (by administrator)

Read Repair (realtime)

Anti-entropy Repair (by administrator)

READ REPAIR

Background work

Configured per CF

Chooses the most recently written value if the replicas are inconsistent, and replaces the stale ones with it.

ANTI-ENTROPY REPAIR

Ensure all data on a replica is made consistent

Merkle tree used

Tree of data block’s hashes

Verifies inconsistency

The repairing node requests Merkle hashes (for a piece of the CF) from the replicas and compares them; if they are inconsistent, it streams from a replica and does read repair

[Figure: Merkle tree over a CF: Block 1, Block 2, Block 3 … -> four leaf hashes -> two parent hashes -> one root hash]
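The hash tree above can be sketched with the standard library. This is an illustration, not the repair implementation: SHA-256, the duplication of the last hash on odd levels, and the helper names are choices made for this sketch, and both replicas are assumed to split the CF into the same number of blocks.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(blocks: list) -> list:
    """Return every level of the tree, leaf hashes first and the root level last."""
    level = [_h(block) for block in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # duplicate the last hash on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_blocks(mine: list, theirs: list) -> list:
    """Compare roots first; only if they differ, report which leaf blocks differ."""
    a, b = merkle_levels(mine), merkle_levels(theirs)
    if a[-1] == b[-1]:
        return []
    return [i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y]


replica_a = [b"block1", b"block2", b"block3", b"block4"]
replica_b = [b"block1", b"block2", b"stale", b"block4"]
print(differing_blocks(replica_a, replica_b))
```

Only the blocks reported as different need to be streamed, so two mostly identical replicas exchange hashes rather than data.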

CONSISTENCY

BASIC

Full ACID compliance in a distributed system is a bad idea (network, …)

Single row updates are atomic (including internal indexes), everything else is not

Relaxing consistency does not equal data corruption

Tunable Consistency

Speed vs precision

Each read and write operation decides (from the client) how consistent the requested data should be

CONDITION

Consistency is ensured if

(W + R) > N

W is the number of nodes written (successfully)

R is the number of nodes read

N is the replication factor
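The condition above can be checked by brute force for a small N. In this sketch (the function name `always_overlaps` is hypothetical) every possible set of W written replicas is compared with every possible set of R read replicas; the two always share a node exactly when W + R > N.

```python
from itertools import combinations


def always_overlaps(w: int, r: int, n: int) -> bool:
    """True if every set of w written replicas shares a node with every set of r read replicas."""
    nodes = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(nodes, w)
        for read in combinations(nodes, r)
    )


n = 3
for w in range(1, n + 1):
    for r in range(1, n + 1):
        print(w, r, always_overlaps(w, r, n))
```

The shared node is what guarantees that a read sees at least one copy of the latest successful write.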

CONDITION (CONT)

N is 3

Operations

1. Write 3

2. Write 5

3. Write 1

[Figure: worst-case replica contents after the three writes. W is 1 -> 3 5 1; W is 2 -> 1 5 1 or 3 1 1; all replicas written -> 1 1 1. Reads with R is 1, R is 2 or R is 3 return values drawn from these replicas]

(W+R)>N ensures that at least one copy of the latest value can be selected

This is eventual consistency

READ CONSISTENCY LEVELS

One

Two

Three

Quorum

Local Quorum

Each Quorum

All

Specifies how many replicas must respond before a result is returned to the client

Quorum : (Replication Factor / 2) + 1, rounded down to a whole number

Local Quorum / Each Quorum are used in multi-DC deployments

(If satisfied, return right away)

WRITE CONSISTENCY LEVELS

ANY

One

Two

Three

Quorum

Local Quorum

Each Quorum

All

Specifies how many replicas must succeed before an acknowledgement is returned to the client

Quorum : (Replication Factor / 2) + 1, rounded down to a whole number

Local Quorum / Each Quorum are used in multi-DC deployments

The ANY level also accepts a stored hint (hinted handoff) as success

(If satisfied, return right away)
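The quorum formula can be worked through for a few replication factors. This is plain arithmetic; the function name `quorum` is only for illustration.

```python
def quorum(replication_factor: int) -> int:
    """(Replication Factor / 2) + 1, rounded down to a whole number."""
    return replication_factor // 2 + 1


for rf in (1, 2, 3, 4, 5, 6):
    q = quorum(rf)
    # writing and reading at quorum always satisfies W + R > N
    print(rf, q, q + q > rf)
```

With a replication factor of 3, quorum is 2, so one replica can be down while quorum reads and writes still succeed and still overlap.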

PERFORMANCE

CACHE

Key/Row Cache can save their data to files

Key Cache

Accessed Frequently

Holds the location of keys (pointing to their columns)

In memory, on JVM heap

Row Cache

Optional

Holds all columns of the row

In memory, off-heap (since v1.1) or on the JVM heap

If you have huge columns, this can cause an OOME (OutOfMemoryError)

CACHE

Mmapped Disk Access

On a 64-bit JVM, used for data and index summary (default)

Provides virtual mmapped space in memory for SSTables

On C-Heap (native heap)

GC makes this act as a cache

Frequently accessed data lives for a long period; otherwise GC purges it

If the data exists in memory, it is returned (=cache)

(Problem) The C-Heap is garbage collected only when it is full

(Problem) Open SSTables are mapped, which means Cassandra may allocate up to the entire size of the open SSTables; if it cannot, a native OOME occurs

If you want efficient Key/Row/Mmapped access caches, add sufficient nodes to the cluster

BLOOM FILTERS

Each SSTable has this

Used to check if a requested row key exists in the SSTable before doing any (disk) seeks

Per row key, generate several hashes and mark the buckets for the key

Check each bucket for the key’s hashes; if any is empty, the key does not exist

False positives are possible, but false negatives are not

[Figure: Key 1 and Key 2 hashed by Hash A, Hash B and Hash C into buckets marked 1; different keys can produce the same hashes]
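The bucket-marking scheme above can be sketched as a tiny Bloom filter. The class, the bucket count, and the seeded-MD5 hashing are assumptions of this sketch, chosen for readability rather than speed.

```python
import hashlib


class BloomFilter:
    def __init__(self, size: int = 64, hashes: int = 3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _buckets(self, key: str):
        """Derive `hashes` bucket indexes for a key by seeding one hash function."""
        for seed in range(self.hashes):
            digest = hashlib.md5(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for bucket in self._buckets(key):
            self.bits[bucket] = True

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means present or a false positive."""
        return all(self.bits[bucket] for bucket in self._buckets(key))


bf = BloomFilter()
bf.add("key1")
print(bf.might_contain("key1"))
```

A key that was added always finds all of its buckets set, which is why false negatives cannot happen, while an absent key may collide with buckets set by other keys.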

INDEX

Primary Index

Per CF

The index of CF’s row key

Efficient access with Index summary (1 row key out of every 128 is sampled)

In memory, on JVM heap (planned to move off-heap)

[Figure: read path components: Read, BF, KeyCache, Index Summary, Primary Index, Offset Calculator, SSTable]
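The 1-in-128 sampling can be sketched as a two-step lookup: a binary search in the small in-memory summary, then a short scan of at most 128 entries in the full index. The helper names and the list standing in for the on-disk primary index are assumptions of this sketch.

```python
import bisect


def build_summary(sorted_keys: list, interval: int = 128) -> list:
    """Sample every `interval`-th key together with its position in the full index."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), interval)]


def lookup(sorted_keys: list, summary: list, key: str, interval: int = 128):
    """Return the position of `key` in the full index, or None if it is absent."""
    sampled = [k for k, _ in summary]
    slot = bisect.bisect_right(sampled, key) - 1  # last sampled key <= key
    if slot < 0:
        return None
    start = summary[slot][1]
    for pos in range(start, min(start + interval, len(sorted_keys))):
        if sorted_keys[pos] == key:
            return pos
    return None


keys = [f"k{i:05d}" for i in range(1000)]
summary = build_summary(keys)
print(len(summary), lookup(keys, summary, "k00300"))
```

The summary is 128 times smaller than the index, which is what makes it affordable to keep in memory.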

INDEX (CONT)

Secondary Index

For Column’s value(s)

Support composite type

Hidden CF

Implemented by the CF’s name index

Value is the CF’s name

Write/Update/Delete operation is atomic

Good for values shared by many rows

Conversely, indexing unique values performs poorly (-> use a dynamic CF for indexing)

COMPACTION

Combines data from SSTables

Merge row fragments

Rebuild primary and secondary indexes

Remove expired columns marked with a tombstone

Delete old SSTables when complete

“Minor” compactions merge only SSTables of similar size, “Major” compactions merge all SSTables in a given CF

Size-tiered compaction

Leveled compaction

Since v1.0

Based on LevelDB

Temporarily uses up to twice the space and causes a spike in disk IO.
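The merge step of compaction can be sketched as below. It is a simplified model: SSTables are dictionaries of `row key -> (value, timestamp)`, a `None` value stands for a tombstone, timestamps are in seconds, and the names `compact` and `TOMBSTONE` are hypothetical. The grace period is the 864000 seconds from the DELETE slide.

```python
GC_GRACE_SECONDS = 864000  # 10 days
TOMBSTONE = None           # marker value for a deleted column in this sketch


def compact(sstables: list, now: int) -> dict:
    """Merge SSTables: the newest timestamp wins; tombstones older than gc grace are dropped."""
    merged = {}
    for table in sstables:
        for key, (value, ts) in table.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return {
        key: (value, ts)
        for key, (value, ts) in sorted(merged.items())
        if not (value is TOMBSTONE and now - ts > GC_GRACE_SECONDS)
    }


older = {"1": ("smith", 100), "2": ("mike", 100)}
newer = {"1": (TOMBSTONE, 200), "3": ("osang", 200)}
print(compact([older, newer], now=200 + GC_GRACE_SECONDS + 1))
```

Until the grace period has passed the tombstone is kept in the merged table, so that a replica that missed the delete can still learn about it.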

ARCHITECTURE

Write : no race conditions, not bound by disk IO

Read : slower than write, but still fast (DHT, cache …)

Load balancing

Virtual Nodes

Replication

Multi-DC

BENCHMARK

References :

http://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&frm=1&source=web&cd=1&cad=rja&sqi=2&ved=0CCsQFjAA&url=http%3A%2F%2F68.180.206.246%2Ffiles%2Fycsb.pdf&ei=O_nAUYqlPI2okQWO-ICwCA&usg=AFQjCNGySLHho0zZ-eMsJIm4VjsoNEOyKw&sig2=6p45QMDvTN963EqbM8YpDg/

[Figure: Workload A (update heavy): (a) read operations, (b) update operations. Throughput in this (and all figures) represents total operations per second, including reads and writes.]

[Figure: Workload B (read heavy): (a) read operations, (b) update operations]

By YCSB (Yahoo Cloud Serving Benchmark)


BENCHMARK (CONT)

References :




Workload E—short scans.


Read performance as cluster size increases.

http://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&frm=1&source=web&cd=1&cad=rja&sqi=2&ved=0CCsQFjAA&url=http://68.180.206.246/files/ycsb.pdf&ei=O_nAUYqlPI2okQWO-ICwCA&usg=AFQjCNGySLHho0zZ-eMsJIm4VjsoNEOyKw&sig2=6p45QMDvTN963EqbM8YpDg/

BENCHMARK (CONT)

[Figure: Elastic speedup: time series showing impact of adding servers online.]


References :




http://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&frm=1&source=web&cd=1&cad=rja&sqi=2&ved=0CCsQFjAA&url=http://68.180.206.246/files/ycsb.pdf&ei=O_nAUYqlPI2okQWO-ICwCA&usg=AFQjCNGySLHho0zZ-eMsJIm4VjsoNEOyKw&sig2=6p45QMDvTN963EqbM8YpDg/

BENCHMARK (CONT) By NoSQLBenchmarking.com

References :

http://www.nosqlbenchmarking.com/2011/02/new-results-for-cassandra-0-7-2//

http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/

http://www.nosqlbenchmarking.com/wp-content/uploads/2011/02/new_cassandra_read_update.png


BENCHMARK (CONT) By Cubrid

References :



http://blog.cubrid.org/wp-content/uploads/2011/09/db-test-results.png


BENCHMARK (CONT) By VLDB

References :

http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf/

[Figure panels: Read latency | Write latency | Throughput (95% read, 5% write)]


BENCHMARK (LAST) By VLDB

References :

http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf/

[Figure panels: Throughput (50% read, 50% write) | Throughput (100% write)]


PROBLEM HANDLING

RESOURCE

Memory

Off-heap & Heap

OOME Problem

CPU

GC

Hashing

Compression / Compaction

Network Handling

Context Switching

Lazy Problem

IO

Bottleneck for everything

MEMORY

Heap (GC management)

Permanent (-XX:PermSize, -XX:MaxPermSize)

JVM Heap (-Xmx, -Xms, -Xmn)

C-Heap (=Native Heap)

OS Shared

Thread Stack (-Xss)

Objects that access with JNI

Off-Heap

OS Shared

GC managed by Cassandra

MEMORY (CONT)

Heap

Permanent

JVM Heap

Memtable

KeyCache

IndexSummary(move to Off-heap on next release)

Buffer

Transport

Socket

Disk

C-Heap

Thread Stack

File Memory Map (Virtual space)

Data / Index buffer (default)

CommitLog (v1.2)

Off-Heap (OS shared)

RowCache

BloomFilter

Index->CompressionMetaData->ChunkOffset

MEMORY (CONT)

Memtable

Managed

Total size (default 1/3 of the JVM heap; when reached, the largest memtable of a CF is flushed)

Emergency: heap usage above a fraction of the max after a full GC (CMS) -> flush the largest memtable (each time) -> prevents full GC / OOME

KeyCache

Managed

Total size (100M or 5% of the max)

Emergency: heap usage above a fraction of the max after a full GC (CMS) -> reduce the max cache size -> prevents full GC / OOME

RowCache/CommitLog

Managed

Total size (default disabled) -> prevents OOME

MEMORY (CONT)

Thread Stack

Not managed

But -Xss is set to 180k (default)

Check the serving type of the Thrift server (transport level, RPC server): sync, hsha, async (has bugs)

Set min/max threads for connections (default unlimited)

v1.2

MEMORY (CONT)

Transport buffer

Thrift

Supports many languages and cross-language calls

Provide server/client interface, serializing

Apache project, created by Facebook

Framed buffer (default max 16M, variable size)

4k, 16k, 32k, … 16M

Determine by client

Per connection

Adjust max frame buffer size (client, server)

Set min/max threads for connection (default unlimited)

v1.2

[Figure: a Client talking to the Data Service through Thrift]

MEMORY (LAST)

C-Heap/Off-Heap

OS shared -> other applications can cause problems

File Memory Map (Virtual space)

GC when Full GC

0 <= total size <= the size of opened SSTables

If it cannot be allocated? -> Native OOME

But

Generally only a limited part of each SSTable is accessed

GC frees space

Worst case? (If an OOME occurs)

yaml->disk_access_mode : standard (restart required)

Add sufficient nodes

yaml->disk_access_mode : auto (after the nodes have joined)

v1.2

CPU

GC

CMS

Marking phase : low thread priority -> but high usage rate (this is not a problem)

CMSInitiatingOccupancyFraction is 75 (default)

UseCMSInitiatingOccupancyOnly

Full GC

Frequency matters -> frequent full GCs may indicate a problem (eg: thrift transport buffer)

Add nodes, or analyze memory usage and adjust the configuration

Minor GC

It’s OK

Compaction

It is okay if it runs slowly

So lower its priority with “-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Dcassandra.compaction.priority=1”

Sustained high CPU load? -> time to add nodes

SWAPPING

Swapping causes big problems for real-time applications

IO block -> thread block -> Gossip/Compaction/Flush … delayed -> causes further problems

Disable swapping or keep it to a minimum

Disable the swap partition

Or enable JNA + kernel configuration

JNA : mlockall (keeps heap memory in physical memory)

Kernel

vm.swappiness=0 (but under memory pressure swapping is still possible)

vm.overcommit_memory=1

Or vm.overcommit_memory=2 (overcommit managed)

vm.overcommit_ratio=? (a percentage, eg 75)

Max memory = swap partition size + (ratio / 100) * physical memory size

Eg: 8G = 2G + 0.75*8G
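The memory limit formula above is simple arithmetic and can be checked directly. The function name `commit_limit_gb` is hypothetical, and the ratio is taken as a percentage, which is how the kernel setting is expressed.

```python
def commit_limit_gb(swap_gb: float, ram_gb: float, overcommit_ratio_percent: float) -> float:
    """Max memory = swap size + ratio * physical memory size (ratio given in percent)."""
    return swap_gb + ram_gb * overcommit_ratio_percent / 100


# the slide's example: 2G of swap, 8G of RAM, ratio 75
print(commit_limit_gb(2, 8, 75))
```

With these numbers the limit equals the physical memory size, so the JVM heap plus off-heap usage has to be planned against 8G in total.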

MONITORING

System Monitoring

CPU / Memory / Disk

Nagios, Ganglia, Cacti, Zabbix

Network Monitoring

Per Client

NfSen (network flow monitoring, see: http://nfsen.sourceforge.net/#mozTocId376385)

Cluster Monitoring / Maintaining

OpsCenter


CHECK THREAD

“top” command

“H” key command to show each thread separately

“P” key command to sort by CPU usage

Pick the PID of the thread with the heaviest usage

Convert the PID to hex (http://www.binaryhexconverter.com/decimal-to-hex-converter)

“jstack <Parent PID> > filename.log” command to save the Java stacks to a file

Search for the PID in hex (eg 313C)
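The decimal-to-hex step does not need a web converter. As a small sketch (the function name is hypothetical), the thread PID shown by "top" can be converted like this; thread dumps usually show the native thread id in lowercase hex, so a case-insensitive search is safest.

```python
def thread_id_to_hex(pid: int) -> str:
    """Convert a thread PID from "top" into the hex form to search for in a thread dump."""
    return format(pid, "x")


print(thread_id_to_hex(12604))
```

Here 12604 is simply the decimal value of the slide's example 313C.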


CHECK HEAP

Use a dump file produced by “jmap” or by an OOME

Use “jhat” or another tool to analyze it

Check [B (byte arrays) and the objects that reference them

For development, maintaining

Sorry..

I have just two days to write this presentation.

Next time I will write and speak to U.

See U next time

Question or Talk about anything with Cassandra

Thank you

If you have any problem or question for me, please contact my email.

[email protected]
