Cassandra at Twitter

Transcript
Page 1: Cassandra at Twitter

Cassandra SF July 11th, 2011

Cassandra @ Twitter

Page 2: Cassandra at Twitter

Team

Chris Goffinet, Stu Hood, Ryan King, Oscar Moll, Alan Liang, Melvin Wang

@lennox @stuhood @rk @padauk9 @alan

Page 3: Cassandra at Twitter

Measuring ourselves

#prostyle

Page 4: Cassandra at Twitter

Measuring ourselves

‣ Hardware Platform

‣ Data Storage

‣ Latency and Throughput

‣ Operational Efficiency

‣ Capacity Planning

‣ Developer Integration

‣ Testing

Page 5: Cassandra at Twitter

Hardware Platform

‣ CPU Core Utilization

‣ Memory bandwidth and consumption

‣ Machine cost

‣ RAID

‣ Filesystems and I/O Schedulers

‣ IOPS

‣ Network bandwidth

‣ Kernel

Page 6: Cassandra at Twitter

‣ CPU Core Utilization

‣ Memory bandwidth and consumption

‣ Machine cost

‣ RAID

‣ Filesystems and I/O Schedulers

‣ IOPS

‣ Network bandwidth

‣ Kernel

Hardware Platform

Page 7: Cassandra at Twitter

‣ Ext4

‣ Data mode = Ordered

‣ Data mode = Writeback

‣ XFS

‣ RAID

‣ 0 and 10

‣ far side vs near side copies

‣ 128 vs 256 vs 512 stripe sizes

Filesystem configurations

Hardware Platform

Page 8: Cassandra at Twitter

I/O Schedulers

‣ CFQ vs Noop vs Deadline vs Anticipatory

‣ Workloads

‣ Timeseries

‣ 50/50

‣ Measure

‣ p90

‣ p99

‣ Average

‣ Max

Hardware Platform
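The percentile summary the schedulers are compared on can be sketched as below. This is an illustrative helper, not Twitter's benchmarking harness; the nearest-rank percentile method is one common choice.

```python
# Sketch (not Twitter's actual tooling): summarizing a latency sample
# into the p90/p99/average/max figures the scheduler slides compare.
def summarize(latencies_ms):
    """Return p90, p99, average, and max for a list of latencies (ms)."""
    ordered = sorted(latencies_ms)

    def pct(p):
        # nearest-rank percentile: index of the p-th percentile sample
        idx = max(0, round(p / 100.0 * len(ordered)) - 1)
        return ordered[idx]

    return {
        "p90": pct(90),
        "p99": pct(99),
        "avg": sum(ordered) / len(ordered),
        "max": ordered[-1],
    }

stats = summarize([2, 3, 5, 8, 13, 21, 34, 55, 89, 144])
```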

Page 9: Cassandra at Twitter

I/O Schedulers

Scheduler      p90    p99    Average   Max
cfq            73ms   210ms  11.72ms   4940ms
noop           47ms   167ms  9.12ms    4132ms
deadline       75ms   233ms  12.72ms   3718ms
anticipatory   76ms   214ms  12.37ms   5120ms

50/50 - Reads

Hardware Platform

Page 10: Cassandra at Twitter

I/O Schedulers

Scheduler      p90    p99    Average   Max
cfq            2ms    2ms    2.02ms    5927ms
noop           2ms    2ms    2.06ms    3475ms
deadline       2ms    2ms    2.13ms    3718ms
anticipatory   2ms    2ms    2.03ms    5119ms

50/50 - Writes

Hardware Platform

Page 11: Cassandra at Twitter
Page 12: Cassandra at Twitter
Page 13: Cassandra at Twitter

Measuring ourselves

‣ Hardware Platform

‣ Data Storage

‣ Latency and Throughput

‣ Operational Efficiency

‣ Capacity Planning

‣ Developer Integration

‣ Testing

Page 14: Cassandra at Twitter

‣ How efficient is our on-disk storage?

‣ Could we do compression?

‣ Do we have CPU to trade?

‣ How do we push for better?

‣ Is it worth it?

Data Storage

Page 15: Cassandra at Twitter

Old New

Easy to Implement

Checksumming

Varint Encoding

Delta Encoding

Type Specific Compression

Fixed Size Blocks

Data Storage

Page 16: Cassandra at Twitter

                           Old   New
Easy to Implement           X
Checksumming                      X
Varint Encoding                   X
Delta Encoding                    X
Type Specific Compression         X
Fixed Size Blocks                 X

Data Storage

Page 17: Cassandra at Twitter

How did we do?

Data Storage

Page 18: Cassandra at Twitter

‣ 1.5x?

‣ 2.5x?

‣ 3.5x?

Data Storage

Page 19: Cassandra at Twitter

7.03x

Data Storage

Page 20: Cassandra at Twitter

                 Rows    Columns   Size on disk     Bytes per column
Current Format   10000   250M      16,716,432,189   66.8
New Format       10000   250M      2,375,027,696    9.5

10,000 rows; 250M columns

Data Storage

Timeseries: LongType column names, CounterColumnType values
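The delta plus varint combination explains why timeseries columns compress so well: monotonically increasing LongType timestamps become tiny deltas, and small integers need few varint bytes. The sketch below is illustrative Python, not the actual Cassandra SSTable serialization code.

```python
# Illustrative sketch of delta + varint encoding (not the real
# SSTable code): 1s-apart millisecond timestamps shrink from 8
# fixed bytes each to ~2 varint bytes each after delta encoding.
def varint(n):
    """Encode a non-negative int as LEB128-style variable-length bytes."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def delta_varint_encode(values):
    """Encode an ascending sequence as varints of successive deltas."""
    prev = 0
    out = bytearray()
    for v in values:
        out += varint(v - prev)
        prev = v
    return bytes(out)

# 100 timestamps, one second apart, in milliseconds since epoch
ts = [1310400000000 + i * 1000 for i in range(100)]
fixed = 8 * len(ts)                      # 800 bytes as fixed 64-bit longs
compact = len(delta_varint_encode(ts))   # first delta is large, rest are 1000
```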

Page 21: Cassandra at Twitter

Data Storage

‣ compression

‣ type specific

‣ fine-grained corruption detection

‣ index promotion

‣ normalizing narrow and wide rows

‣ predictable performance

‣ no double-pass on compaction

‣ range and slice deletes

Page 22: Cassandra at Twitter

Measuring ourselves

‣ Hardware Platform

‣ Data Storage

‣ Latency and Throughput

‣ Operational Efficiency

‣ Capacity Planning

‣ Developer Integration

‣ Testing

Page 23: Cassandra at Twitter

‣ What are our issues?

‣ Compaction Performance?

‣ Caching?

‣ Too many disk seeks?

‣ Garbage Collection?

Latency and Throughput

Page 24: Cassandra at Twitter

‣ Compaction

Latency and Throughput

Page 25: Cassandra at Twitter

‣ Compaction

Latency and Throughput

Page 26: Cassandra at Twitter

‣ Multithread Compaction + Throttling

‣ Compact each bucket in parallel

‣ Throttle across all buckets

‣ Compaction running all the time

‣ CASSANDRA-2191

‣ CASSANDRA-2156

Latency and Throughput

Page 27: Cassandra at Twitter

‣ Measure latency

‣ p99

‣ p999

‣ No averages!

‣ Every customer has p99 and p999 targets we must hit

‣ 24x7 on-call rotation

Latency and Throughput

Page 28: Cassandra at Twitter

Latency and Throughput

‣ Caching?

‣ In-heap

‣ Off-heap

‣ Pluggable cache

‣ Memcache

Page 29: Cassandra at Twitter

Case Study: Tweet Button

Page 30: Cassandra at Twitter

‣ Growth was requiring entire dataset in memory. Why?

‣ How big is the active dataset within 24 hours?

‣ What happens when dataset outgrows memory?

‣ Could other storage solutions do better?

‣ What are we missing here?

Case Study: Tweet Button

Page 31: Cassandra at Twitter

‣ Key size: variable length (each one a URL)

‣ Implement hashing on keys

‣ Can we do better?

‣ But... the cache in Java isn’t very efficient...

‣ or is it?

Case Study: Tweet Button

Page 32: Cassandra at Twitter

Case Study: Tweet Button

‣ On-heap

‣ Requires us to scale the JVM heap with cache

‣ Off-heap

‣ Store pointers to data allocated out of the JVM

‣ Memcache

‣ Out of process

Page 33: Cassandra at Twitter

Case Study: Tweet Button

‣ On-heap

‣ Data + CLHM overhead (87GB)

‣ Off-heap

‣ CLHM overhead (67GB just the pointers!)

‣ Memcache

‣ Internal overhead + data (48GB!)

‣ * CLHM (Concurrent Linked HashMap)

Page 34: Cassandra at Twitter

Case Study: Tweet Button

[Diagram: memcache co-located with each Cassandra node]

‣ Co-locate memcache on each node

‣ Routing + Cache replication

‣ Write through LRU

‣ Rolling restarts do not cause degraded performance states
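The write-through LRU idea above can be sketched as follows. This is a single-process illustration of the policy, not Twitter's memcache integration; the `Store` and `WriteThroughLRU` names are hypothetical.

```python
from collections import OrderedDict

# Sketch of a write-through LRU: every write goes to both the backing
# store and the cache, so the cache never serves stale data, and a
# read miss falls back to the store and repopulates the cache.
class Store:
    """Stand-in for the durable backing store (e.g. Cassandra)."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

class WriteThroughLRU:
    def __init__(self, store, capacity):
        self.store = store
        self.capacity = capacity
        self.cache = OrderedDict()   # insertion order tracks recency

    def _admit(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)              # mark most recently used
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)       # evict least recently used

    def put(self, key, value):
        self.store.put(key, value)               # write through to storage
        self._admit(key, value)

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        value = self.store.get(key)              # miss: hit backing store
        if value is not None:
            self._admit(key, value)
        return value
```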

Page 35: Cassandra at Twitter

Case Study: Tweet Button

‣ In production today

‣ Stats

‣ 99th percentile before: 200ms - 800ms when data > memory

‣ 99th percentile now: 2.5ms

Page 36: Cassandra at Twitter

‣ New observability stack

‣ Replaces Ganglia

‣ Collect metrics for graphing in real-time

‣ Scale based on machine count + defined metrics

‣ Heavy write throughput requirements

‣ SLA Target

‣ All metrics written under 60 seconds

Case Study: Cuckoo

Page 37: Cassandra at Twitter

‣ 1.3 million writes/second

‣ 112 billion writes a day

‣ 3.2 gigabit/s over the network

‣ 492GB of new data per hour

‣ 140MB/s writes across cluster

‣ 70MB/s reads across cluster

Case Study: Cuckoo

Page 38: Cassandra at Twitter

‣ 36,000 writes/second

‣ persistently to disk on each node

‣ 36 nodes without RF (Replication Factor)

‣ Replication Factor = 3

‣ 30-35% cpu utilization

‣ FSync Commit Log every 10s

Case Study: Cuckoo

Page 39: Cassandra at Twitter

‣ Garbage Collection Challenge

‣ 30-60 second pauses multiple times per hour on each node

‣ Why?

‣ Heap fragmentation

Case Study: Cuckoo

Page 40: Cassandra at Twitter

[Chart: free_space vs max_chunk over time, showing heap fragmentation]

Case Study: Cuckoo

Page 41: Cassandra at Twitter

‣ Slab Allocation

‣ Fixed sized chunks (2MB)

‣ Copy byte[] into slabs using CAS (Compare & Swap)

‣ Largely reduced fragmentation

‣ CASSANDRA-2252

Case Study: Cuckoo
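The slab-allocation fix can be sketched as below. The real code (CASSANDRA-2252) is Java and claims space in each slab with an AtomicInteger compare-and-swap; this Python sketch models the same bump-the-offset bookkeeping single-threaded, with illustrative names.

```python
# Sketch of slab allocation: instead of many small byte[] allocations
# that fragment the heap, copy each value into a fixed-size slab,
# claiming space by bumping an offset (CAS-guarded in the Java code).
SLAB_SIZE = 2 * 1024 * 1024  # 2MB chunks, as on the slide

class Slab:
    def __init__(self):
        self.buf = bytearray(SLAB_SIZE)
        self.offset = 0

    def allocate(self, data):
        """Copy data into this slab; return its offset, or None if full."""
        if self.offset + len(data) > SLAB_SIZE:
            return None
        start = self.offset
        self.offset += len(data)   # the CAS-guarded bump in the Java code
        self.buf[start:start + len(data)] = data
        return start

class SlabAllocator:
    """Appends a fresh slab whenever the current one fills up.
    Assumes each value fits within a single slab."""
    def __init__(self):
        self.slabs = [Slab()]

    def allocate(self, data):
        off = self.slabs[-1].allocate(data)
        if off is None:            # current slab full: start a new one
            self.slabs.append(Slab())
            off = self.slabs[-1].allocate(data)
        return len(self.slabs) - 1, off
```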

Page 42: Cassandra at Twitter

                     No Slab         Slab
GC Pause Avg Time    30-60 seconds
Frequency of pause   Every hour

Case Study: Cuckoo

Page 43: Cassandra at Twitter

                     No Slab         Slab
GC Pause Avg Time    30-60 seconds   5 seconds
Frequency of pause   Every hour      3 days 10 hours

Case Study: Cuckoo

Page 44: Cassandra at Twitter

‣ Pluggable Compaction

‣ Custom strategy for retention support

‣ Used for our timeseries

‣ Drop SSTables after N days

‣ Make it easy to implement more interesting and intelligent compaction strategies

‣ SSTable Min/Max Timestamp

‣ Read time optimization

Case Study: Cuckoo
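The retention idea can be sketched with the SSTable min/max timestamps mentioned above: once an SSTable's newest cell is older than the retention window, the whole file can be dropped without row-by-row compaction work. The real strategy is a pluggable Java compaction strategy; the names below are illustrative.

```python
import time

# Sketch of timeseries retention via per-SSTable max timestamps:
# an SSTable whose newest data predates the cutoff is dropped whole.
RETENTION_DAYS = 30  # illustrative "N days" from the slide

def expired(sstable, now=None, retention_days=RETENTION_DAYS):
    """True if every cell in the SSTable is past the retention window."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    return sstable["max_timestamp"] < cutoff

def sstables_to_drop(sstables, now=None):
    """Return the SSTables that can be deleted outright."""
    return [s for s in sstables if expired(s, now)]
```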

Page 45: Cassandra at Twitter

Measuring ourselves

‣ Hardware Platform

‣ Data Storage

‣ Latency and Throughput

‣ Operational Efficiency

‣ Capacity Planning

‣ Developer Integration

‣ Testing

Page 46: Cassandra at Twitter

Operational Efficiency

‣ Automated infrastructure burn-in process

‣ Rack awareness to handle switch failures

‣ Grow clusters per rack, not per node

‣ Lower server RPC timeout (from 1s to 200ms)

‣ Fail fast

‣ Split out RPC timeouts by reads & writes

‣ CASSANDRA-2819

Page 47: Cassandra at Twitter

‣ Fault tolerance at the disk level

‣ Eject from cluster if RAID array fails

‣ CASSANDRA-2118

‣ No swap and dedicated commit log

‣ Multiple hard drive vendors

‣ 300+ nodes in production

‣ Run on cheap commodity hardware

‣ Design for failure

Operational Efficiency

Page 48: Cassandra at Twitter

‣ Bad memory that causes corruption

‣ Multiple disks dying on same hosts within hours

‣ Rack switch failures

‣ Memory allocation delays causing JVM to encounter higher latency GC collections (mlockall recommended)

‣ Stop the world pauses if traffic patterns change

Operational Efficiency

What failures do we see in production?

Page 49: Cassandra at Twitter

‣ Network cards sometimes negotiating down to 100Mbit

‣ Machines randomly die and never come back

‣ Disks auto-ejecting themselves from the RAID array

Operational Efficiency

What failures do we see in production?

Page 50: Cassandra at Twitter

Operational Efficiency

Deploy Process

[Diagram: a deploy driver takes builds from Hudson and Git and pushes them to every Cassandra node]

Page 51: Cassandra at Twitter

Operational Efficiency

Deploy Process

‣ Deploy to hundreds of nodes in under 20s

‣ Roll the cluster

‣ Disable Gossip on a node

‣ Check ring on all nodes to ensure ‘Down’ state

‣ Drain

‣ Restart
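The roll steps above can be sketched as a per-node loop. `nodetool disablegossip`, `nodetool ring`, and `nodetool drain` are real nodetool commands; the orchestration wrapper, node restart command, and `ring` output parsing below are illustrative assumptions, not Twitter's deploy tool.

```python
import subprocess

# Sketch of the rolling-restart steps from the slide, one node at a time.
def nodetool(host, command):
    """Run a nodetool subcommand against a host and return its output."""
    return subprocess.run(
        ["nodetool", "-h", host, command],
        capture_output=True, text=True, check=True,
    ).stdout

def host_marked_down(ring_output, host):
    """Scan `nodetool ring` output for the host's Down status
    (output format assumed; adjust for your Cassandra version)."""
    return any(host in line and "Down" in line
               for line in ring_output.splitlines())

def roll_node(host, peers):
    nodetool(host, "disablegossip")            # stop gossiping
    for peer in peers:                         # every peer should see Down
        assert host_marked_down(nodetool(peer, "ring"), host)
    nodetool(host, "drain")                    # flush memtables, stop writes
    # restart mechanism is deployment-specific; ssh + service is one option
    subprocess.run(["ssh", host, "sudo", "service", "cassandra", "restart"],
                   check=True)
```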

Page 52: Cassandra at Twitter

Measuring ourselves

‣ Hardware Platform

‣ Data Storage

‣ Latency and Throughput

‣ Operational Efficiency

‣ Capacity Planning

‣ Developer Integration

‣ Testing

Page 53: Cassandra at Twitter

Capacity Planning

‣ In-house capacity planning tool

‣ Collect input from sources:

‣ hardware platform (kernel, hw data)

‣ on-disk serialization overhead

‣ cost of read/write (seeks, index overhead)

‣ query cost (cpu, memory usage)

‣ requirements from customers

Page 54: Cassandra at Twitter

Capacity Planning

spec = {
    'read_qps': 500,
    'write_qps': 1000,
    'replication_factor': 3,
    'dataset_hot_percent': 0.05,
    'latency_95': 350.0,
    'latency_99': 250.0,
    'read_growth_percentage': 0.1,
    'write_growth_percentage': 0.1,
    ......
}

Input Example

Page 55: Cassandra at Twitter

Capacity Planning

90 days
  datasize: 14.49T
  page cache size: 962.89G
  number of disks: 68
  disk capacity: 15.22T
  iops: 6800.00/s
  replication_factor: 3
  servers: 51
  servers (w/o replication): 17
  read_ops: 2323
  write_ops: 991066
  servers: 57
  servers (w/o replication): 19
  read_ops: 2877
  write_ops: 1143171

Output Example
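The kind of arithmetic behind such a tool can be sketched from the input example: per-node throughput limits, replication factor, and growth rates yield a server count. The per-node capacity numbers below are made-up assumptions for illustration, not Twitter's measurements.

```python
import math

# Sketch: derive a server count from a capacity spec like the one above.
spec = {
    "read_qps": 500,
    "write_qps": 1000,
    "replication_factor": 3,
    "read_growth_percentage": 0.1,
    "write_growth_percentage": 0.1,
}

NODE_READ_QPS = 300    # assumed per-node read capacity (illustrative)
NODE_WRITE_QPS = 5000  # assumed per-node write capacity (illustrative)

def servers_needed(spec, horizon_periods=0):
    """Servers required after `horizon_periods` of compound growth."""
    reads = spec["read_qps"] * (1 + spec["read_growth_percentage"]) ** horizon_periods
    writes = spec["write_qps"] * (1 + spec["write_growth_percentage"]) ** horizon_periods
    base = max(math.ceil(reads / NODE_READ_QPS),    # reads or writes,
               math.ceil(writes / NODE_WRITE_QPS))  # whichever binds
    return base * spec["replication_factor"]        # scale out by RF
```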

Page 56: Cassandra at Twitter

Measuring ourselves

‣ Hardware Platform

‣ Data Storage

‣ Latency and Throughput

‣ Operational Efficiency

‣ Capacity Planning

‣ Developer Integration

‣ Testing

Page 57: Cassandra at Twitter

Developer Integration

‣ Cassie

‣ Light-weight Cassandra Client

‣ Cluster member auto discovery

‣ Uses Finagle (http://github.com/twitter/finagle)

‣ Scala + Java support

‣ Open sourcing

Page 58: Cassandra at Twitter

Measuring ourselves

‣ Hardware Platform

‣ Data Storage

‣ Latency and Throughput

‣ Operational Efficiency

‣ Capacity Planning

‣ Developer Integration

‣ Testing

Page 59: Cassandra at Twitter

Testing

‣ Distributed Testing Harness

‣ Open sourced to community

‣ Custom Internal Build of YCSB

‣ Performance Benchmarking

‣ Custom workloads such as timeseries

‣ Performance Framework

Page 60: Cassandra at Twitter

Performance Framework

‣ Custom framework that uses YCSB

‣ What we do:

‣ Collect as much data as possible

‣ Measure

‣ Do it again

‣ Generate reports per build

Page 61: Cassandra at Twitter

Performance Framework

‣ Read/Insert/Update Combinations: 30

‣ Request Targeting (per second): 8

‣ 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000

‣ Payload Sizes: 5

‣ 100, 500, 1000, 2000, 4000 bytes

‣ Single node vs cluster

Page 62: Cassandra at Twitter

Performance Framework

Total test combinations: 1,200
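The 1,200 figure is simply the product of the three dimensions on the previous slide, which a test matrix generator would enumerate like so:

```python
from itertools import product

# 30 read/insert/update mixes x 8 request rates x 5 payload sizes = 1,200
mixes = range(30)
rates = [500, 1000, 2000, 4000, 8000, 16000, 32000, 64000]
payloads = [100, 500, 1000, 2000, 4000]  # bytes

combinations = list(product(mixes, rates, payloads))
total = len(combinations)
```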

Page 63: Cassandra at Twitter

Summary

‣ Understand your hardware and operating system

‣ Rigorously exercise your entire stack

‣ Capacity plan with math not guesswork

‣ Measure everything, then do it again

‣ Invest in your storage technology

‣ Automate

‣ Expect everything to fail

Page 64: Cassandra at Twitter

We’re hiring

@jointheflock