Page 1: About "Apache Cassandra"

APACHE CASSANDRA

Scalability, Performance and Fault Tolerance in Distributed Databases

Jihyun.An ([email protected])

18, June 2013

Page 2: About "Apache Cassandra"

TABLE OF CONTENTS

Preface

Basic Concepts

P2P Architecture

Primitive Data Model & Architecture

Basic Operations

Fault Management

Consistency

Performance

Problem handling

Page 3: About "Apache Cassandra"

TABLE OF CONTENTS (NEXT TIME)

Maintaining

Cluster Management

Node Management

Problem Handling

Tuning

Playing (for Development, Client stance)

Designing

Client

Thrift

Native

CQL

3rd party

Hector

OCM

Extension

Baas.io

Hadoop

Page 4: About "Apache Cassandra"

PREFACE

Page 5: About "Apache Cassandra"

OUR WORLD

Traditional DBMS is very valuable

Storage (+ memory) and computational resources are cheaper than before

But we face new challenges

Big data

(near) Real time

Complex and various requirement

Recommendation

Find FOAF

Event-driven triggering

User Session

Page 6: About "Apache Cassandra"

OUR WORLD (CONT)

Complex applications combine different types of problems

Different languages -> more productive

e.g. functional languages, multiprocessing-optimized languages

Polyglot persistence layer

Performance vs Durability?

Reliability?

Page 7: About "Apache Cassandra"

TRADITIONAL DBMS

Relational Model

Well-defined Schema

Access with Selection/Projection

Derived from Joining/Grouping/Aggregating(Counting..)

Small data (from refined)

But

Painful data model changes

Hard to scale out

Ineffective in handling large volumes of data

Not designed with hardware characteristics in mind

Page 8: About "Apache Cassandra"

TRADITIONAL DBMS (CONT)

Has many constraints for ACID

PK/FK & checking

Domain Type checking

.. checking checking

Lots of IO / Processing

OODBMS, ORDBMS

Good, but even more checking / processing

Do not work well with disk I/O

Page 9: About "Apache Cassandra"

NOSQL

Key-value store

Column: Cassandra, HBase, Bigtable …

Others: Redis, Dynamo, Voldemort, Hazelcast …

Document oriented

MongoDB, CouchDB …

Graph store

Neo4j, OrientDB, BigOWL, FlockDB …

Page 10: About "Apache Cassandra"

NOSQL (CONT)

Benefits

Higher performance

Higher scalability

Flexible Datamodel

More effective for some case

Less administrative overhead

Drawbacks

Limited Transactions

Relaxed Consistency

Unconstrained data

Limited ad-hoc query capabilities

Limited administrative aid tools

Page 11: About "Apache Cassandra"

CAP

Brewer’s theorem

We can pick two of

Consistency

Availability

Partition tolerance

(Diagram: CAP triangle with corners C, A and P; the systems below are placed along its sides)

Amazon Dynamo derivatives: Cassandra, Voldemort, CouchDB, Riak

Neo4j, Bigtable

Bigtable derivatives: MongoDB, HBase, Hypertable, Redis

Relational: MySQL, MSSQL, Postgres

Page 12: About "Apache Cassandra"

Dynamo (architecture) + BigTable (data model) -> Cassandra

(Apache) Cassandra is a free, open-source, highly scalable, distributed database system for managing large amounts of data

Written in JAVA

Running on JVM

References :

BigTable (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf)

Dynamo (http://web.archive.org/web/20120129154946/http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf)

Page 13: About "Apache Cassandra"

DESIGN GOALS

Simple Key/Value(Column) store

limited on storage

Supports nothing beyond basic operations (CRUD, range access): no aggregating, grouping …

But extensible

Hadoop (MR, HDFS, Pig, Hive ..)

ESP

Distributed Processing Interface (ex: BSP, MR)

Baas.io

Page 14: About "Apache Cassandra"

DESIGN GOALS (CONT)

High Availability

Decentralized

Any node can serve as the access point

Replication & Their access

Multi DC support

Eventual consistency

Less write complexity

Audit and repair when read

Possible tuning -> Trade offs between consistency, durability and latency

Page 15: About "Apache Cassandra"

DESIGN GOALS (CONT)

Incremental scalability

Equal Member

Linear Scalability

Unlimited space

Write / read throughput increases linearly as nodes (members) are added

Low total cost

Minimize administrative work

Automatic partitioning

Flush / compaction

Data balancing / moving

Virtual nodes (since v1.2)

Mid-range nodes give good performance

Working together, the nodes provide high performance and huge capacity

Page 16: About "Apache Cassandra"

FOUNDER & HISTORY

Founder

Avinash Lakshman (one of the authors of Amazon's Dynamo)

Prashant Malik ( Facebook Engineer )

Developer

About 50

History

Open sourced by Facebook in July 2008

Became an Apache Incubator project in March 2009

Graduated to a top-level project in Feb 2010

0.6 released (added support for integrated caching, and Apache Hadoop MapReduce) in Apr 2010

0.7 released (added secondary indexes and online schema change) in Jan 2011

0.8 released (added the Cassandra Query Language (CQL), self-tuning memtables, and support for zero-downtime upgrades) in Jun 2011

1.0 released (added integrated compression, leveled compaction, and improved read performance) in Oct 2011

1.1 released (added self-tuning caches, row-level isolation, and support for mixed ssd/spinning disk deployments) in Apr 2012

1.2 released (added clustering across virtual nodes, inter-node communication, atomic batches, and request tracing) in Jan 2013

Page 17: About "Apache Cassandra"

PROMINENT USERS

User | Cluster size | Node count | Usage | Now

Facebook | >200 | ? | Inbox search | Abandoned, moved to HBase

Cisco WebEx | ? | ? | User feed, activity | OK

Netflix | ? | ? | Backend | OK

Formspring | ? (26 million accounts with 10 million responses per day) | ? | Social-graph data | OK

Others: Urban Airship, Rackspace, OpenX, Twitter (preparing to move to Cassandra)

Page 18: About "Apache Cassandra"

BASIC CONCEPTS

Page 19: About "Apache Cassandra"

P2P ARCHITECTURE

All nodes are the same (equal peers)

No single point of failure / Decentralized

Compare with

mongoDB

broker structure (cubrid …)

Master / slave

Page 21: About "Apache Cassandra"

PRIMITIVE DATA MODEL & ARCHITECTURE

Page 22: About "Apache Cassandra"

COLUMN

Basic and primitive type (the smallest increment of data)

A tuple containing a name, a value and a timestamp

Timestamp is important

Provided by client

Determine the most recent one

On a conflict, the DBMS chooses the latest one

(Diagram: a column drawn as Name | Value | Timestamp)
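The timestamp rule above can be sketched as a last-write-wins comparison. This is a minimal illustration, not Cassandra's code: the `Column` class, the `resolve` function and the tie rule (keep the first argument) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int  # supplied by the client, e.g. microseconds since the epoch


def resolve(a: Column, b: Column) -> Column:
    """Return the version with the larger timestamp (last write wins)."""
    # Assumption of this sketch: on equal timestamps, `a` is kept.
    return b if b.timestamp > a.timestamp else a


old = Column("fullname", "smith", timestamp=100)
new = Column("fullname", "mike", timestamp=200)
winner = resolve(old, new)
```

Because the client supplies the timestamp, clients with skewed clocks can make an older write win, which is why the slide calls the timestamp important.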

Page 23: About "Apache Cassandra"

COLUMN (CONT)

Types

Standard: A column has a name (UUID or UTF8 …)

Composite: A column has composite name (UUID+UTF8 …)

Expiring: TTL marked

Counter: Only has name and value, timestamp managed by server

Super: Used to manage wide rows, inferior to using composite columns (DO NOT USE, all sub-columns serialized)

(Diagram: column layouts. Counter: Name, Value. Composite: Name + Name, Value, Timestamp. Standard: Name, Value, Timestamp.)

Page 24: About "Apache Cassandra"

COLUMN (CONT)

Types (CQL3 based)

Standard: Has one primary key.

Composite: Has more than one primary key, recommended for managing wide rows.

Expiring: Gets deleted during compaction.

Counter: Counts occurrences of an event.

Super: Used to manage wide rows, inferior to using composite columns (DO NOT USE, all sub-columns serialized)

DDL : CREATE TABLE test (

user_id varchar,

article_id uuid,

content varchar,

PRIMARY KEY (user_id, article_id)

);

<Logical>

user_id | article_id | content

Smith | <uuid1> | Blah1..

Smith | <uuid2> | Blah2..

<Physical>

Row key Smith: {uuid1,content} = Blah1… (Timestamp), {uuid2,content} = Blah2… (Timestamp)

SELECT user_id, article_id FROM test WHERE user_id = 'Smith' ORDER BY article_id DESC LIMIT 1;

Page 25: About "Apache Cassandra"

ROWS

A row consists of a representative (row) key and a set of columns

A row key must be unique (usually UUID)

Supports up to 2 billion columns per (physical) row.

Columns are sorted by their name (Column’s Name indexed)

Primitive

Secondary Index

Direct Column Access

(Diagram: a row is a Row Key followed by its columns, each a Name / Value / Timestamp tuple)

Page 26: About "Apache Cassandra"

COLUMN FAMILY

Container for columns and rows

No fixed schema

Each row is uniquely identified by its row key

Each row can have a different set of columns

Rows are sorted by row key

Comparator / Validator

Static/Dynamic CF

If the column type is super column, the CF is called a “Super Column Family”

Like “Table” in Relational world

(Diagram: a column family holding rows; each row has a Row Key and its own set of Name / Value / Timestamp columns)

Page 27: About "Apache Cassandra"

DISTRIBUTION

(Diagram: many rows to be spread across Server 1, Server 2, Server 3 and Server 4)

How to map?

Page 28: About "Apache Cassandra"

TOKEN RING

A node is an instance (typically the same as a server)

A token is used to map each row to a node

Ranges from 0 to 2^127 - 1

Associated with a row key

Node

Assigned a unique token (ex: token 5 to Node 5)

Its range runs from the previous node's token to its own token

token 4 < Node 5's range <= token 5

(Diagram: ring of Node 1 to Node 8; Node 5 owns the range between Token 4 and Token 5)
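The range rule (previous token < range <= own token) can be sketched as a binary search over the sorted node tokens. The function name `owner` and the small token values are made up for illustration; real tokens live in the 0 to 2^127 - 1 space.

```python
from bisect import bisect_left


def owner(ring, token):
    """ring: list of (token, node) pairs sorted by token.

    Returns the node whose range (previous token, own token] contains `token`.
    """
    tokens = [t for t, _ in ring]
    index = bisect_left(tokens, token)  # first node token >= the given token
    return ring[index % len(ring)][1]   # past the last token, wrap to the first node


ring = [(10, "Node 1"), (20, "Node 2"), (30, "Node 3"), (40, "Node 4")]
```

For example, `owner(ring, 25)` falls in Node 3's range, and a token above 40 wraps around to Node 1.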

Page 29: About "Apache Cassandra"

PARTITIONING

(Diagram: each row key is mapped onto the ring by a partitioner)

Random partitioners (MD5, Murmur3): default

Order Preserving Partitioner / Byte Ordered Partitioner
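A random partitioner can be sketched as hashing the row key and reducing the hash into the token space. This is an illustration only: `md5_token` is a hypothetical name, and Cassandra's own partitioners derive the token from the hash in their own way, so the numbers here will not match a real cluster.

```python
import hashlib

RING_SIZE = 2 ** 127  # token space of 0 .. 2^127 - 1, as on the token ring slide


def md5_token(row_key: bytes) -> int:
    """Map a row key to a token by hashing it with MD5 (sketch, not Cassandra's exact rule)."""
    digest = hashlib.md5(row_key).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


token_smith = md5_token(b"Smith")
token_jones = md5_token(b"Jones")
```

Hashing spreads neighbouring keys across the ring, which balances load but gives up ordered range scans over row keys; the order-preserving partitioners make the opposite trade.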

Page 30: About "Apache Cassandra"

REPLICATION

Any node that receives a client's read/write request acts as the coordinator node

The locator determines where the replicas are located

Replicas are used for

Consistency check

Repair

Ensure W + R > N for consistency

Local Cache (Row cache)

Replication factor is 4 (N-1 will be replicated)

The simple locator treats strategy order as proximity

(Diagram: ring of Node 1 to Node 8 with the coordinator node and the simple locator; step 1 locates the first node, which holds the original, and step 2 places the replicas)
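A minimal sketch of simple replica placement, assuming the replicas go to the nodes that follow the first (original) node in ring order. The function name `replicas` and the choice of starting node are illustrative, not taken from the slides.

```python
def replicas(ring_nodes, primary_index, replication_factor):
    """Return the node holding the original plus the following nodes in ring order."""
    n = len(ring_nodes)
    count = min(replication_factor, n)  # cannot place more copies than there are nodes
    return [ring_nodes[(primary_index + i) % n] for i in range(count)]


nodes = [f"Node {i}" for i in range(1, 9)]  # ring of Node 1 .. Node 8
placed = replicas(nodes, primary_index=6, replication_factor=4)
```

With a replication factor of 4 and the original on Node 7, the copies wrap around the ring to Node 8, Node 1 and Node 2.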

Page 31: About "Apache Cassandra"

REPLICATION (CONT)

Multi DC support

Allows specifying how many replicas go in each DC

Within a DC, replicas are placed on different racks

Relies on a snitch to place replicas

Strategy (provided from Snitch)

Simple (Single DC)

RackInferringSnitch

PropertyFileSnitch

EC2Snitch

EC2MultiRegionSnitch

(Diagram: the entire cluster spanning DC1 and DC2)

Page 32: About "Apache Cassandra"

ADD / REMOVE NODE

Data transfer between nodes is called “streaming”

If node 5 is added,

nodes 3, 4 and 1 (suppose RF is 2) are involved in streaming

If node 2 is removed,

node 3 (which has the next higher token and holds its replicas) serves instead

(Diagrams: the original ring of Node 1 to Node 4; the ring after adding Node 5; the ring after removing Node 2)

Page 33: About "Apache Cassandra"

VIRTUAL NODES

Supported since v1.2

Real time migration support?

Shuffle utility

One node has many tokens

=> one node has many ranges

(Diagram: a cluster of Node 1 and Node 2; the number of tokens per node is 4)

Page 34: About "Apache Cassandra"

VIRTUAL NODES (CONT)

Less administrative work

Saves cost

When adding/removing a node

many nodes cooperate

No need to determine the token

Shuffle to re-balance

Less time to change

Smart balancing

No need to balance

(The number of tokens should be sufficiently high)

(Diagram: a cluster of Node 1 and Node 2 with 4 tokens each, before and after adding Node 3)

Add node 3

Page 35: About "Apache Cassandra"

KEYSPACE

A namespace for column families

Authorization

CF? yeah

Replication

Key oriented schema (see right)

{
  "row_key1": {
    "Users": {
      "emailAddress": {"name": "emailAddress", "value": "[email protected]"},
      "webSite": {"name": "webSite", "value": "http://bar.com"}
    },
    "Stats": {
      "visits": {"name": "visits", "value": "243"}
    }
  },
  "row_key2": {
    "Users": {
      "emailAddress": {"name": "emailAddress", "value": "[email protected]"},
      "twitter": {"name": "twitter", "value": "user2"}
    }
  }
}

Row Key

Column Family

Column

Page 36: About "Apache Cassandra"

CLUSTER

Total amount of data managed by the cluster is represented as a ring

Cluster of nodes

Has multiple(or single) Keyspace

Partitioning Strategy defined

Authentication

Page 37: About "Apache Cassandra"

GOSSIP

Gossip protocol is used for cluster membership.

Failure detection on service level (Alive or Not)

Responsible

Every node in the system knows every other node’s status

Implemented as

Syn -> Ack -> Ack2

Information: status, load, bootstrapping

Basic status is Alive/Dead/Join

Runs every second

Status disseminated in O(log N) (N is the number of nodes)

Seed

PHI is used to judge dead or alive within a time window (5 -> detection in 15~16 s)

Data structure

HeartBeat < Application Status < Endpoint Status < Endpoint StatusMap

(Diagram: nodes N1 to N6 gossiping with each other)

Page 38: About "Apache Cassandra"

BASIC OPERATIONS

Page 39: About "Apache Cassandra"

WRITE / UPDATE

CommitLog

Abstracted mmapped type

File & memory sync -> your guardian angel on system failure

Java NIO

C-Heap used (= native heap)

Log data (write -> delete? the entry still exists)

Segment rolling structure

Memtable

In-memory buffer and workspace

Sorted by row key

When a threshold or a period point is reached, it is written to disk as a persistent table structure (SSTable)

Page 40: About "Apache Cassandra"

WRITE / UPDATE (LOCAL LEVEL)

1. Write to CommitLog

Write : "1":{"name":"fullname","value":"smith"}

Write : "2":{"name":"fullname","value":"mike"}

Delete : "1"

Write : "3":{"name":"fullname","value":"osang"}

…

2. Write/Update to Memtable

Key | Name | Value

1 | fullname | smith

2 | fullname | mike

3 | fullname | Osang

… | … | …

3. Write to disk (flush): Memtable -> SSTable, SSTable, SSTable
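The three steps of the local write path can be sketched as follows. This is a toy model under stated assumptions: the `Node` class, its `flush_threshold` (counted in rows rather than bytes or time) and the in-memory commit log are all hypothetical simplifications.

```python
class Node:
    """Toy local write path: commit log, then memtable, then flush to an SSTable."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only record of every mutation
        self.memtable = {}     # in-memory workspace, keyed by row key
        self.sstables = []     # immutable, sorted snapshots written by flush
        self.flush_threshold = flush_threshold  # assumption: threshold in rows

    def write(self, key, name, value):
        self.commit_log.append(("write", key, name, value))  # 1. write to commit log
        self.memtable[key] = (name, value)                   # 2. write/update memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                                     # 3. write to disk (flush)

    def flush(self):
        # An SSTable is sorted by row key and never modified afterwards.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}


node = Node(flush_threshold=3)
node.write("1", "fullname", "smith")
node.write("2", "fullname", "mike")
node.write("3", "fullname", "osang")
```

The third write reaches the threshold, so the memtable is flushed into one sorted SSTable and starts again empty. No existing file is rewritten on the write path, only appended to or newly created.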

Page 41: About "Apache Cassandra"

SSTABLE

SSTable is Sorted String Table

Best for log structured DB

Store large numbers of key-value pairs

Immutable

Created by “flush”

Merged by (major/minor) compaction

May hold one or more columns in different versions (timestamps)

The most recent one is chosen

Page 42: About "Apache Cassandra"

READ (LOCAL LEVEL)

Key | Name | Value

2 | fullname | mike

3 | fullname | Osang

… | … | …

(Diagram: a read consults the Memtable and the SSTables; each SSTable has its own bloom filter (BF) and index (IDX))
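A local read has to combine the memtable with every SSTable that may hold the key and keep the newest version. The sketch below assumes each store is a plain dict of key -> (value, timestamp); the bloom filter and index lookups are left out, and the function name `read` is illustrative.

```python
def read(key, memtable, sstables):
    """Return the newest value for `key` across the memtable and all SSTables, or None."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for sstable in sstables:
        # A real node would consult the SSTable's bloom filter and index first.
        if key in sstable:
            candidates.append(sstable[key])
    if not candidates:
        return None
    value, _timestamp = max(candidates, key=lambda vt: vt[1])
    return value


memtable = {"2": ("mike", 30)}
sstables = [
    {"2": ("michael", 10), "3": ("osang", 20)},
    {"3": ("Osang", 25)},
]
```

Here key "2" is answered from the memtable because its timestamp is newest, while key "3" is answered from the second SSTable.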

Page 43: About "Apache Cassandra"

READ (CLUSTER LEVEL, +READ REPAIR)

Read operation -> Coordinator -> Locator

1. Data is transferred from the original/replica nodes (according to the consistency level)

2. Digest comparing: choose the right one (the most recent) if digests differ

3. Recover

(Diagram: three replicas: Replica (original, right), Replica (right) and Replica (wrong))

Page 44: About "Apache Cassandra"

DELETE

Adds a tombstone (a special type of column)

Garbage collected when compacting

GC grace seconds : 864000 (default 10 days)

Issue

If a failed node recovers after GCGraceSeconds, the deleted data can be resurrected
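The grace period can be checked with simple arithmetic. The helper `purgeable` and its exact boundary handling are assumptions for illustration; only the 864000-second default comes from the slide.

```python
GC_GRACE_SECONDS = 864000  # 10 days * 24 h * 60 min * 60 s


def purgeable(tombstone_written_at, now, gc_grace=GC_GRACE_SECONDS):
    """True once the tombstone is older than the grace period (times in seconds)."""
    return now - tombstone_written_at > gc_grace


ten_days = 10 * 24 * 60 * 60
```

A replica that was down for longer than this window never saw the tombstone, and once the other replicas have purged it, a later repair can copy the stale value back. That is the resurrection issue noted above.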

Page 45: About "Apache Cassandra"

FAULT MANAGEMENT

Page 46: About "Apache Cassandra"

DETECTION

Dynamic threshold for marking nodes

Accrual Detection Mechanism calculates a per-node threshold

Automatically takes into account network conditions, workload and other conditions that might affect the perceived heartbeat rate.

From 3rd party client

Hector

Failover

Page 47: About "Apache Cassandra"

HINTED-HANDOFF

The coordinator stores a hint if the target node is down or fails to acknowledge the write

A hint consists of the target replica and the mutation (column object) to be replayed

Uses the Java heap (may move off-heap later)

Hints are only saved for a limited time (default 1 hour) after a replica fails

When the failed node is alive again, the missed writes are streamed to it

Page 48: About "Apache Cassandra"

REPAIR

Supports three methods

CommitLog Replaying (by administrator)

Read Repair (realtime)

Anti-entropy Repair (by administrator)

Page 49: About "Apache Cassandra"

READ REPAIR

Background work

Configured per CF

If the replicas are inconsistent, the most recently written value is chosen and the stale copies are replaced with it.

Page 50: About "Apache Cassandra"

ANTI-ENTROPY REPAIR

Ensures all data on a replica is made consistent

Merkle tree used

Tree of data block hashes

Verifies inconsistency

The repairing node requests Merkle hashes (for a piece of a CF) from the replicas and compares them; if they are inconsistent, it streams from a replica and does a read repair

(Diagram: a CF split into Block 1, Block 2, Block 3 …; each block is hashed and the hashes are combined pairwise up to a single root hash)
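A Merkle tree lets two replicas compare one root hash instead of all the data. The sketch below is illustrative only: it uses SHA-256 and byte-string blocks as stand-ins, and `differing_blocks` compares leaf hashes directly, whereas a real implementation would walk down the tree from the root.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks):
    """Hash each block, then hash pairs of hashes upward until one root remains."""
    level = [_h(b) for b in blocks]
    if not level:
        return _h(b"")
    while len(level) > 1:
        if len(level) % 2:            # odd count: duplicate the last hash
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def differing_blocks(a, b):
    """Indexes of blocks whose hashes differ between two replicas."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if _h(x) != _h(y)]


replica_a = [b"block1", b"block2", b"block3", b"block4"]
replica_b = [b"block1", b"block2", b"stale!", b"block4"]
```

If the two roots match, nothing needs to be streamed; if they differ, only the blocks reported by `differing_blocks` have to be exchanged.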

Page 51: About "Apache Cassandra"

CONSISTENCY

Page 52: About "Apache Cassandra"

BASIC

Full ACID compliance in a distributed system is a bad idea (network, …)

Single row updates are atomic (including internal indexes), everything else is not

Relaxing consistency does not equal data corruption

Tunable Consistency

Speed vs precision

Each read and write operation decides (from the client) how consistent the requested data should be

Page 53: About "Apache Cassandra"

CONDITION

Consistency is ensured if

(W + R) > N

W is the number of nodes written (successfully)

R is the number of nodes read

N is the replication factor
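Why the condition works: if W + R > N, every possible set of W written replicas shares at least one replica with every possible set of R read replicas. The brute-force check below is a small sketch of that argument; the function name `always_overlaps` is made up for the example.

```python
from itertools import combinations


def always_overlaps(n, w, r):
    """True if every choice of w written replicas intersects every choice of r read replicas."""
    replicas = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, w)
        for read in combinations(replicas, r)
    )


# Compare the brute-force result with the W + R > N rule for N = 3.
table = {(w, r): always_overlaps(3, w, r) for w in range(1, 4) for r in range(1, 4)}
```

For N = 3, writing to 2 and reading from 2 always overlaps, while writing to 1 and reading from 1 can miss the replica that holds the latest value.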

Page 54: About "Apache Cassandra"

CONDITION (CONT)

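N is 3

Operations

1. Write 3

2. Write 5

3. Write 1

(Diagram: what the three replicas can hold after the writes, and what a read returns)

Written (worst case): W is 1 -> 3 5 1; W is 2 -> 1 5 1 or 3 1 1; W is 3 -> 1 1 1

Read (possible case): R is 1 -> 3 or 5 or 1; R is 2 -> 1; R is 3 -> 1

(W+R)>N ensures that at least one copy of the latest value can be selected

This is eventual consistency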

Page 55: About "Apache Cassandra"

READ CONSISTENCY LEVELS

One

Two

Three

Quorum

Local Quorum

Each Quorum

All

Specifies how many replicas must respond before a result is returned to the client

Quorum : (Replication Factor / 2) + 1, rounded down to a whole number

Local Quorum / Each Quorum are used in multi-DC deployments

(If satisfied, return right away)
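The quorum formula with its rounding can be written with integer division. The function name `quorum` is illustrative.

```python
def quorum(replication_factor: int) -> int:
    """(replication factor / 2) + 1, with the division rounded down."""
    return replication_factor // 2 + 1


sizes = {rf: quorum(rf) for rf in (1, 2, 3, 4, 5, 6)}
```

A replication factor of 3 needs 2 replicas and a factor of 5 needs 3, so quorum reads combined with quorum writes always satisfy W + R > N.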

Page 56: About "Apache Cassandra"

WRITE CONSISTENCY LEVELS

ANY

One

Two

Three

Quorum

Local Quorum

Each Quorum

All

Specifies how many replicas must succeed before an acknowledgement is returned to the client

Quorum : (Replication Factor / 2) + 1, rounded down to a whole number

Local Quorum / Each Quorum are used in multi-DC deployments

The ANY level also accepts a stored hint (hinted handoff) as success

(If satisfied, return right away)

Page 57: About "Apache Cassandra"

PERFORMANCE

Page 58: About "Apache Cassandra"

CACHE

Key/Row Cache can save their data to files

Key Cache

Accessed Frequently

Hold the location of keys (indicating to columns)

In memory, on JVM heap

Row Cache

Optional

Holds all columns of the row

In memory, off-heap (since v1.1) or on the JVM heap

A huge column here can cause an OOME (OutOfMemoryError)

Page 59: About "Apache Cassandra"

CACHE

Mmapped Disk Access

On a 64-bit JVM, used for data and index summary (default)

Provides a virtual mmapped space in memory for SSTables

On C-Heap (native heap)

GC makes this work as a cache

Frequently accessed data lives for a long time; otherwise GC purges it

If the data exists in memory, return it (= cache)

(Problem) C-Heap is garbage collected only when it is full

(Problem) Open SSTables are mapped, which means Cassandra may try to allocate the entire size of the open SSTables; if it cannot, a native OOME occurs

For an efficient key/row/mmapped access cache, add sufficient nodes to the cluster

Page 60: About "Apache Cassandra"

BLOOM FILTERS

Each SSTable has one

Used to check whether a requested row key exists in the SSTable before doing any (disk) seeks

Per row key, several hashes are generated and the buckets for the key are marked

Each bucket for the key's hashes is checked; if any is empty, the key does not exist

False positives are possible, but false negatives are not

(Diagram: keys hashed by Hash A, Hash B and Hash C into buckets set to 1; different keys can produce the same hashes)
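The mark-and-check procedure above can be sketched as a small bloom filter. The class, its default sizes and the use of salted SHA-256 as the hash family are assumptions chosen for a short, self-contained example, not what Cassandra uses.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _buckets(self, key: str):
        # Derive several bucket positions per key by salting one hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for bucket in self._buckets(key):
            self.bits[bucket] = True

    def might_contain(self, key: str) -> bool:
        # Any empty bucket proves the key was never added (no false negatives).
        return all(self.bits[bucket] for bucket in self._buckets(key))


bloom = BloomFilter()
bloom.add("key1")
```

`might_contain` returning True only means "possibly present", because another key may have set the same buckets; returning False is definitive, so the disk seek can be skipped.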

Page 61: About "Apache Cassandra"

INDEX

Primary Index

Per CF

The index of CF’s row key

Efficient access with index summary (1 row key out of every 128 is sampled)

In memory, on JVM heap (to move off-heap later)

(Diagram: a read passes the bloom filter (BF) and the KeyCache, then uses the index summary and the primary index to calculate the offset into the SSTable)

Page 62: About "Apache Cassandra"

INDEX (CONT)

Secondary Index

For Column’s value(s)

Support composite type

Hidden CF

Implemented by CF’name index

Value is the CF’name

Write/Update/Delete operation is atomic

Good when many rows share the same value

Conversely, poor for indexing unique values (-> use a dynamic CF for indexing)

Page 63: About "Apache Cassandra"

COMPACTION

Combines data from SSTables

Merge row fragments

Rebuild primary and secondary indexes

Removes expired columns marked with a tombstone

Deletes old SSTables when complete

“Minor” compactions merge only SSTables of similar size; “major” compactions merge all SSTables in a given CF

Size-tiered compaction

Leveled compaction

Since v1.0

Based on LevelDB

Temporarily uses up to twice the disk space and causes a spike in disk I/O.
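What a compaction does to the data can be sketched as a merge that keeps the newest version of each key and drops deleted entries. In this toy model each SSTable is a dict of key -> (value, timestamp), a value of `None` stands for a tombstone, and tombstones are assumed to be past their grace period; all of these are simplifications.

```python
TOMBSTONE = None  # marker value used by this sketch for deleted columns


def compact(sstables):
    """Merge SSTables into one: newest timestamp wins, tombstoned keys are dropped."""
    merged = {}
    for sstable in sstables:
        for key, (value, timestamp) in sstable.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    # Keep the output sorted by key and remove entries whose newest version is a tombstone.
    return {k: merged[k] for k in sorted(merged) if merged[k][0] is not TOMBSTONE}


older = {"1": ("smith", 1), "2": ("mike", 2)}
newer = {"1": (TOMBSTONE, 3), "3": ("osang", 4)}
result = compact([older, newer])
```

The inputs stay untouched until the merged table is complete, which is why a compaction can temporarily need up to twice the disk space.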

Page 64: About "Apache Cassandra"

ARCHITECTURE

Write : no race conditions, not bound by disk I/O

Read : slower than writes, but still fast (DHT, cache …)

Load balancing

Virtual Nodes

Replication

Multi-DC

Page 65: About "Apache Cassandra"

BENCHMARK

References :

http://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&frm=1&source=web&cd=1&cad=rja&sqi=2&ved=0CCsQFjAA&url=http%3A%2F%2F68.180.206.246%2Ffiles%2Fycsb.pdf&ei=O_nAUYqlPI2okQWO-ICwCA&usg=AFQjCNGySLHho0zZ-eMsJIm4VjsoNEOyKw&sig2=6p45QMDvTN963EqbM8YpDg/

(Figure) Workload A, update heavy: (a) read operations, (b) update operations. Throughput in this (and all figures) represents total operations per second, including reads and writes.

(Figure) Workload B, read heavy: (a) read operations, (b) update operations

By YCSB (Yahoo Cloud Serving Benchmark)

Page 66: About "Apache Cassandra"

BENCHMARK (CONT)

References :

http://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&frm=1&source=web&cd=1&cad=rja&sqi=2&ved=0CCsQFjAA&url=http%3A%2F%2F68.180.206.246%2Ffiles%2Fycsb.pdf&ei=O_nAUYqlPI2okQWO-ICwCA&usg=AFQjCNGySLHho0zZ-eMsJIm4VjsoNEOyKw&sig2=6p45QMDvTN963EqbM8YpDg/

Workload E—short scans.

By YCSB (Yahoo Cloud Serving Benchmark)

Read performance as cluster size increases.

Page 67: About "Apache Cassandra"

BENCHMARK (CONT)

(Figure) Elastic speedup: time series showing impact of adding servers online.

By YCSB (Yahoo Cloud Serving Benchmark)

References :

http://www.google.co.kr/url?sa=t&rct=j&q=&esrc=s&frm=1&source=web&cd=1&cad=rja&sqi=2&ved=0CCsQFjAA&url=http%3A%2F%2F68.180.206.246%2Ffiles%2Fycsb.pdf&ei=O_nAUYqlPI2okQWO-ICwCA&usg=AFQjCNGySLHho0zZ-eMsJIm4VjsoNEOyKw&sig2=6p45QMDvTN963EqbM8YpDg/

Page 70: About "Apache Cassandra"

BENCHMARK (CONT)

By VLDB

References :

http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf/

(Figures) Read latency; write latency; throughput (95% read, 5% write)

Page 71: About "Apache Cassandra"

BENCHMARK (LAST) By VLDB

References :

http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf/

(Figures) Throughput (50% read, 50% write); throughput (100% write)

Page 72: About "Apache Cassandra"

PROBLEM HANDLING

Page 73: About "Apache Cassandra"

RESOURCE

Memory

Off-heap & Heap

OOME Problem

CPU

GC

Hashing

Compression / Compaction

Network Handling

Context Switching

Lazy Problem

IO

Bottleneck for everything

Page 74: About "Apache Cassandra"

MEMORY

Heap (GC management)

Permanent (-XX:PermSize, -XX:MaxPermSize)

JVM Heap (-Xmx, -Xms, -Xmn)

C-Heap (=Native Heap)

OS Shared

Thread Stack (-Xss)

Objects that access with JNI

Off-Heap

OS Shared

GC managed by Cassandra

Page 75: About "Apache Cassandra"

MEMORY (CONT)

Heap

Permanent

JVM Heap

Memtable

KeyCache

IndexSummary(move to Off-heap on next release)

Buffer

Transport

Socket

Disk

C-Heap

Thread Stack

File Memory Map (Virtual space)

Data / Index buffer (default)

CommitLog (v1.2)

Off-Heap (OS shared)

RowCache

BloomFilter

Index->CompressionMetaData->ChunkOffset

Page 76: About "Apache Cassandra"

MEMORY (CONT)

Memtable

Managed

total size (default 1/3 JVM heap, flush largest memtable for CF if reached)

Emergency, heap usage above the fraction of the max after full GC(CMS) -> flush largest memtable (each time) -> prevent full GC / OOME

KeyCache

Managed

total size (100M or 5% of the max)

Emergency, heap usage above the fraction of the max after full GC(CMS) -> reduce max cache size -> prevent full GC / OOME

RowCache/CommitLog

Managed

total size (default disabled) -> prevent OOME

Page 77: About "Apache Cassandra"

MEMORY (CONT)

Thread Stack

Not managed

But -Xss is set to 180k (default)

Check the Thrift (transport level, RPC server) server type (sync, hsha, async (has bugs))

Set min/max threads for connections (default unlimited)

v1.2

Page 78: About "Apache Cassandra"

MEMORY (CONT)

Transport buffer

Thrift

Support many languages and crossing

Provide server/client interface, serializing

Apache project, created by Facebook

Framed buffer (default max 16M, variable size)

4k, 16k, 32k, … 16M

Determine by client

Per connection

Adjust max frame buffer size (client, server)

Set min/max threads for connection (default unlimited)

v1.2

Data Service

Client

Data Service

Thrift

Page 79: About "Apache Cassandra"

MEMORY (LAST)

C-Heap/Off-Heap

OS shared -> other applications may cause problems

File Memory Map (Virtual space)

GC when Full GC

0 <= total size <= the size of opened SSTables

If cannot allocate? -> Native OOME

But

Generally access limited space of SSTable

GC make space

Worst case? (If OOME occur)

yaml->disk_access_mode : standard (restart required)

Add sufficient nodes

yaml->disk_access_mode : auto (after the new nodes have joined)

v1.2

Page 80: About "Apache Cassandra"

CPU

GC

CMS

Marking phase : low thread priority -> but high usage rate (it’s not a problem)

CMSInitiatingOccupancyFraction is 75 (default)

UseCMSInitiatingOccupancyOnly

Full GC

Frequency is important -> frequent full GC may indicate a problem (e.g. Thrift transport buffer)

Add nodes, or analyze memory usage and adjust the configuration

Minor GC

It’s OK

Compaction

If it runs slowly, that is okay

So lower its priority with “-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Dcassandra.compaction.priority=1”

Sustained high CPU load -> time to add nodes

Page 81: About "Apache Cassandra"

SWAPPING

Swapping causes big problems for real-time applications

I/O block -> thread block -> Gossip/Compaction/Flush … delayed -> causes other problems

Disable swapping or keep it to a minimum

Disable Swap partition

Or Enable JNA + Kernel Configuration

JNA : Mlockall (keep heap memory in physical memory)

Kernel

vm.swappiness=0 (but under memory pressure swapping is still possible)

vm.overcommit_memory=1

Or vm.overcommit_memory=2 (overcommit managed)

vm.overcommit_ratio=? (a percentage, e.g. 75)

Max memory = swap partition size + ratio*physical memory size

Eg: 8G = 2G + 0.75*8G
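The example above can be checked with a few lines of arithmetic. The kernel takes `vm.overcommit_ratio` as a percentage, so the 0.75 in the example corresponds to a setting of 75; the function name `commit_limit_gb` is made up for this sketch, which also ignores huge pages.

```python
def commit_limit_gb(swap_gb, physical_gb, overcommit_ratio_percent):
    """Max committable memory = swap size + ratio * physical memory (ratio given in percent)."""
    return swap_gb + physical_gb * overcommit_ratio_percent / 100


limit = commit_limit_gb(swap_gb=2, physical_gb=8, overcommit_ratio_percent=75)
```

With 2G of swap and 8G of RAM at a ratio of 75, the limit comes out at 8G, matching the slide.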

Page 82: About "Apache Cassandra"

MONITORING

System Monitoring

CPU / Memory / Disk

Nagios, Ganglia, Cacti, Zabbix

Network Monitoring

Per Client

NfSen (network flow monitoring, see: http://nfsen.sourceforge.net/#mozTocId376385)

Cluster Monitoring / Maintaining

OpsCenter

Page 83: About "Apache Cassandra"

CHECK THREAD

“top” command

“H” key to show individual threads

“P” key to sort by CPU usage

Pick the PID of the thread with the heaviest usage

Convert the PID to hex (http://www.binaryhexconverter.com/decimal-to-hex-converter)

“jstack <Parent PID> > filename.log” to save the Java stack to a file

Search for the PID in hex

313C
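The decimal-to-hex step does not need a web page. A small sketch follows; the thread ID 12604 is simply the decimal value of the 313C shown above, and `thread_id_to_hex` is an illustrative name.

```python
def thread_id_to_hex(thread_id: int) -> str:
    """Convert the decimal thread ID shown by top into lowercase hex."""
    return format(thread_id, "x")


hex_id = thread_id_to_hex(12604)
```

jstack prints native thread IDs in lowercase as `nid=0x...`, so searching the dump for `0x313c` finds the thread that top reported as 12604.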

Page 84: About "Apache Cassandra"

CHECK HEAP

Use a dump file from “jmap” or from an OOME

Use “jhat” or another tool to analyze it

Check [B (byte arrays) and the objects that reference them

Page 85: About "Apache Cassandra"

For development, maintaining

Sorry..

I have just two days to write this presentation.

Next time I will write and speak to U.

See U next time

Page 86: About "Apache Cassandra"

Question or Talk about anything with Cassandra

Page 87: About "Apache Cassandra"

Thank you

If you have any problems or questions for me, please contact me by email.

[email protected]