Cassandra TK 2014 - Large Nodes


Description

A discussion of running Cassandra with a large data load per node.

Transcript of Cassandra TK 2014 - Large Nodes

Page 1: Cassandra TK 2014 - Large Nodes

CASSANDRA TK 2014

LARGE NODES WITH CASSANDRA

Aaron Morton @aaronmorton


Co-Founder & Principal Consultant www.thelastpickle.com

Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License

Page 2: Cassandra TK 2014 - Large Nodes

About The Last Pickle.
Work with clients to deliver and improve Apache Cassandra based solutions.

Apache Cassandra Committer, DataStax MVP, Hector Maintainer, Apache Usergrid Committer.
Based in New Zealand & USA.

Page 3: Cassandra TK 2014 - Large Nodes

Large Node?

“Avoid storing more than 500GB per node” !

(Originally said about EC2 nodes.)

Page 4: Cassandra TK 2014 - Large Nodes

Large Node?

“You may have issues if you have over 1 Billion keys per node.”

Page 5: Cassandra TK 2014 - Large Nodes

Before version 1.2, large nodes had operational and performance concerns.

Page 6: Cassandra TK 2014 - Large Nodes

After version 1.2, large nodes have fewer operational and performance concerns.

Page 7: Cassandra TK 2014 - Large Nodes

Issues Pre 1.2
Work Arounds Pre 1.2
Improvements 1.2 to 2.1

Page 8: Cassandra TK 2014 - Large Nodes

Memory Management.
Some in-memory structures grow with the number of rows and the size of the data.

Page 9: Cassandra TK 2014 - Large Nodes

Bloom Filter.
Stores a bitset used to determine, with a certain probability, whether a key exists in an SSTable.

Size depends on the number of rows and bloom_filter_fp_chance.

Page 10: Cassandra TK 2014 - Large Nodes

Bloom Filter.
Allocates pages of 4096 longs in a long[][] array.
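As an illustration of how this structure grows with row count, here is a minimal Python sketch using the textbook Bloom filter sizing formula, rounded up to whole pages of 4096 longs as described above. It is an approximation and the helper name is my own; Cassandra's real allocation code differs in detail.

```python
import math

def bloom_filter_size_mb(rows, fp_chance):
    """Rough size of the Bloom filter covering `rows` keys.

    Uses the textbook formula m = -n * ln(p) / (ln 2)^2 bits and rounds
    up to whole pages of 4096 longs, mirroring the long[][] allocation
    described above. An approximation only.
    """
    bits = -rows * math.log(fp_chance) / (math.log(2) ** 2)
    longs = math.ceil(bits / 64)
    pages = math.ceil(longs / 4096)          # pages of 4096 longs
    return pages * 4096 * 8 / (1024 ** 2)    # bytes -> MB

# One billion keys at the default and at a relaxed fp chance.
print(bloom_filter_size_mb(1_000_000_000, 0.01))  # ~1,140 MB
print(bloom_filter_size_mb(1_000_000_000, 0.10))  # ~570 MB
```

These estimates line up with the chart on the next slide: around a gigabyte of filter for a billion keys at the default fp chance, roughly halved at 0.10.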

Page 11: Cassandra TK 2014 - Large Nodes

Bloom Filter Size.
[Chart: Bloom Filter size in MB (0 to 1,200) against millions of rows (1 to 1,000), plotted for bloom_filter_fp_chance of 0.01 and 0.10.]

Page 12: Cassandra TK 2014 - Large Nodes

Compression Metadata.
Stores a long offset into the compressed -Data.db file for each chunk_length_kb (default 64) of uncompressed data.

Size depends on the uncompressed data size.

Page 13: Cassandra TK 2014 - Large Nodes

Compression Metadata.
Allocates pages of 4096 longs in a long[][] array.
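A similar back-of-the-envelope estimate for the offset metadata: one 8-byte long per chunk of uncompressed data, rounded up to whole pages of 4096 longs. The function name is illustrative only.

```python
import math

def compression_metadata_mb(uncompressed_gb, chunk_length_kb=64):
    """Rough size of the compressed-chunk offset metadata for one table.

    One 8-byte long per chunk_length_kb of uncompressed data, rounded up
    to whole pages of 4096 longs as described above.
    """
    chunks = math.ceil(uncompressed_gb * 1024 * 1024 / chunk_length_kb)
    pages = math.ceil(chunks / 4096)
    return pages * 4096 * 8 / (1024 ** 2)

print(compression_metadata_mb(1_000))   # ~125 MB for 1,000 GB uncompressed
print(compression_metadata_mb(10_000))  # ~1,250 MB for 10,000 GB uncompressed
```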

Page 14: Cassandra TK 2014 - Large Nodes

Compression Metadata Size.
[Chart: Compression Metadata size in MB (0 to 1,400) against uncompressed size in GB (1 to 10,000), using the Snappy compressor.]

Page 15: Cassandra TK 2014 - Large Nodes

Index Samples.
Stores an offset into the -Index.db file for every index_interval (default 128) keys.

Size depends on the number of rows and the size of the keys.

Page 16: Cassandra TK 2014 - Large Nodes

Index Samples.
Allocates a long[] for offsets and a byte[][] for row keys.

(Version 1.2, using on-heap structures.)
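A rough estimate of the sample size, assuming one sample per index_interval keys and an average key size; JVM object overhead is ignored, so the real on-heap footprint is larger. The helper is a sketch, not Cassandra code.

```python
import math

def index_sample_size_mb(rows, avg_key_bytes=25, index_interval=128):
    """Rough size of the index samples for one table.

    One sample per index_interval keys: an 8-byte position offset plus a
    copy of the sampled row key. JVM object overhead is ignored, so the
    real on-heap footprint (pre 2.0) is noticeably larger.
    """
    samples = math.ceil(rows / index_interval)
    offset_bytes = samples * 8              # long[] of positions
    key_bytes = samples * avg_key_bytes     # byte[][] of sampled keys
    return (offset_bytes + key_bytes) / (1024 ** 2)

print(index_sample_size_mb(1_000_000_000))  # ~245 MB for one billion rows
```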

Page 17: Cassandra TK 2014 - Large Nodes

Index Sample Total Size.
[Chart: Index Sample total size in MB (0 to 300) against millions of rows (1 to 1,000), split into position offsets and keys (25 bytes long).]

Page 18: Cassandra TK 2014 - Large Nodes

Memory Management.
Larger heaps (above 8GB) take longer to GC.

A large working set results in frequent, prolonged GC.

Page 19: Cassandra TK 2014 - Large Nodes

Bootstrap.
The joining node requests data from one replica of each token range it will own.

Sending is throttled by stream_throughput_outbound_megabits_per_sec (default 200 megabits/s, roughly 25 MB/s).

Page 20: Cassandra TK 2014 - Large Nodes

Bootstrap.
With RF 3, only three nodes will send data to a bootstrapping node.

Maximum send rate is 75 MB/s (3 × 25 MB/s).

Page 21: Cassandra TK 2014 - Large Nodes

Moving Nodes.
Copy data from the existing node to the new node.

At 50 MB/s, transferring 100GB takes roughly 33 minutes.
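The arithmetic behind that figure, and how it scales with node size, can be sketched as follows (using 1,000 MB per GB for round numbers; a sustained, uninterrupted transfer rate is an assumption):

```python
def transfer_minutes(data_gb, throughput_mb_per_sec):
    """Minutes to copy data_gb at a sustained throughput, using 1,000 MB/GB."""
    return data_gb * 1000 / throughput_mb_per_sec / 60

print(transfer_minutes(100, 50))    # ~33 minutes, the figure above
print(transfer_minutes(2_000, 50))  # ~11 hours to move a 2 TB node
print(transfer_minutes(500, 75))    # ~1.9 hours to bootstrap 500 GB at RF 3
```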

Page 22: Cassandra TK 2014 - Large Nodes

Disk Management.
Need a multi-TB volume, or use multiple volumes.

Page 23: Cassandra TK 2014 - Large Nodes

Disk Management with RAID-0. Single disk failure results in total node failure.

Page 24: Cassandra TK 2014 - Large Nodes

Disk Management with RAID-10. Requires double the raw capacity.

Page 25: Cassandra TK 2014 - Large Nodes

Disk Management with Multiple Volumes.
Specified via data_file_directories.

Write load is not distributed.

A single disk failure will shut down the node.

Page 26: Cassandra TK 2014 - Large Nodes

Repair.
Compare data between nodes and exchange differences.

Page 27: Cassandra TK 2014 - Large Nodes

Comparing Data for Repair.
Calculate a Merkle Tree hash by reading all rows in a table (Validation Compaction).

A single comparator, throttled by compaction_throughput_mb_per_sec (default 16).

Page 28: Cassandra TK 2014 - Large Nodes

Comparing Data for Repair.
Time taken grows as the size of the data per node grows.
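A rough sense of that growth, assuming the validation read runs at exactly the default compaction throttle (real repairs work per table and per range, so this is only an estimate):

```python
def validation_minutes(data_gb, compaction_throughput_mb_per_sec=16):
    """Minutes to read data_gb for the Merkle tree at the compaction throttle."""
    return data_gb * 1000 / compaction_throughput_mb_per_sec / 60

print(validation_minutes(100))    # ~104 minutes (~1.7 hours) for 100 GB
print(validation_minutes(1_000))  # ~1,040 minutes (~17 hours) for 1 TB
```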

Page 29: Cassandra TK 2014 - Large Nodes

Exchanging Data for Repair.
Ranges of rows with differences are streamed.

Sending is throttled by stream_throughput_outbound_megabits_per_sec (default 200 megabits/s, roughly 25 MB/s).

Page 30: Cassandra TK 2014 - Large Nodes

Compaction. Requires free space to write new SSTables.

Page 31: Cassandra TK 2014 - Large Nodes

SizeTieredCompactionStrategy.
Groups SSTables by size; assumes no reduction in size.

In theory it requires 50% free space; in practice it can work past that point, though this is not recommended.

Page 32: Cassandra TK 2014 - Large Nodes

LeveledCompactionStrategy.
Groups SSTables by “level” and groups row fragments per level.

Requires approximately 25% free space.
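As a rough illustration of what those headroom rules mean for usable capacity (the 2 TB volume is an arbitrary example, and the free-space fractions are the figures quoted above):

```python
def max_data_gb(volume_gb, free_space_fraction):
    """Largest data load that still leaves the quoted compaction headroom."""
    return volume_gb * (1 - free_space_fraction)

volume = 2_000  # e.g. a 2 TB data volume
print(max_data_gb(volume, 0.50))  # 1,000 GB under SizeTiered's 50% rule
print(max_data_gb(volume, 0.25))  # 1,500 GB under Leveled's ~25% rule
```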

Page 33: Cassandra TK 2014 - Large Nodes

Issues Pre 1.2
Work Arounds Pre 1.2
Improvements 1.2 to 2.1

Page 34: Cassandra TK 2014 - Large Nodes

Memory Management Work Arounds.
Reduce Bloom Filter size by increasing bloom_filter_fp_chance from 0.01 to 0.1.

May increase read latency.

Page 35: Cassandra TK 2014 - Large Nodes

Memory Management Work Arounds.
Reduce Compression Metadata size by increasing chunk_length_kb.

May increase read latency.

Page 36: Cassandra TK 2014 - Large Nodes

Memory Management Work Arounds.
Reduce Index Samples size by increasing index_interval to 512.

May increase read latency.

Page 37: Cassandra TK 2014 - Large Nodes

Memory Management Work Arounds.
When necessary, use a 12GB MAX_HEAP_SIZE.

Keep HEAP_NEWSIZE “reasonable”, e.g. less than 1200MB.

Page 38: Cassandra TK 2014 - Large Nodes

Bootstrap Work Arounds.
Increase streaming throughput via nodetool setstreamthroughput whenever possible.

Page 39: Cassandra TK 2014 - Large Nodes

Moving Node Work Arounds.
Copy a nodetool snapshot while the original node is operational.

Copy only the delta once the original node is stopped.

Page 40: Cassandra TK 2014 - Large Nodes

Disk Management Work Arounds.
Use RAID-0 and over-provision nodes, anticipating failure.

Or use RAID-10 and accept the additional cost.

Page 41: Cassandra TK 2014 - Large Nodes

Repair Work Arounds.
Only repair if data is deleted; rely on the Consistency Level for distribution.

Run frequent, small repairs using token ranges.

Page 42: Cassandra TK 2014 - Large Nodes

Compaction Work Arounds.
Over-provision disk capacity when using SizeTieredCompactionStrategy.

Reduce min_compaction_threshold (default 4) and max_compaction_threshold (default 32) to reduce the number of SSTables per compaction.

Page 43: Cassandra TK 2014 - Large Nodes

Compaction Work Arounds.
Use LeveledCompactionStrategy where appropriate.

Page 44: Cassandra TK 2014 - Large Nodes

Issues Pre 1.2
Work Arounds Pre 1.2
Improvements 1.2 to 2.1

Page 45: Cassandra TK 2014 - Large Nodes

Memory Management Improvements.
Version 1.2 moved Bloom Filters and Compression Metadata off the JVM heap to native memory.

Version 2.0 moved Index Samples off the JVM heap.

Page 46: Cassandra TK 2014 - Large Nodes

Bootstrap Improvements.
Virtual Nodes increase the number of token ranges per node from 1 to 256.

A bootstrapping node can request data from 256 different nodes.
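To illustrate why this helps: if every source node streams at the default 25 MB/s throttle, the aggregate inbound rate scales with the number of sources, until the joining node's own disk or network becomes the limit. The 20-source figure below is purely illustrative.

```python
def aggregate_stream_mb_per_sec(source_nodes, per_node_throttle_mb=25):
    """Aggregate inbound rate if every source node hits its streaming throttle."""
    return source_nodes * per_node_throttle_mb

print(aggregate_stream_mb_per_sec(3))   # 75 MB/s: single-token node at RF 3
print(aggregate_stream_mb_per_sec(20))  # 500 MB/s: many vnode sources (illustrative)
```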

Page 47: Cassandra TK 2014 - Large Nodes

Disk Layout Improvements.
“JBOD” support distributes concurrent writes to multiple data_file_directories.

Page 48: Cassandra TK 2014 - Large Nodes

Disk Layout Improvements.
disk_failure_policy adds support for handling disk failure: ignore, stop, or best_effort.

Page 49: Cassandra TK 2014 - Large Nodes

Repair Improvements.
“Avoid repairing already-repaired data by default” (CASSANDRA-5351).

Scheduled for 2.1.

Page 50: Cassandra TK 2014 - Large Nodes

Compaction Improvements.
“Avoid allocating overly large bloom filters” (CASSANDRA-5906).

Included in 2.1.

Page 51: Cassandra TK 2014 - Large Nodes

Thanks.

Page 52: Cassandra TK 2014 - Large Nodes

Aaron Morton @aaronmorton

www.thelastpickle.com

Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License