C* Summit 2013: Cassandra at Instagram by Rick Branson

CASSANDRA AT INSTAGRAM Rick Branson, Infrastructure Engineer @rbranson 2013 Cassandra Summit #cassandra13 June 12, 2013 San Francisco, CA


Speaker: Rick Branson, Infrastructure Engineer at Instagram

Cassandra is a critical part of Instagram's large-scale site infrastructure that supports more than 100 million active users. This talk is a practical deep dive into data models, systems architecture, and challenges encountered during the implementation process.

Transcript of C* Summit 2013: Cassandra at Instagram by Rick Branson

Page 1: C* Summit 2013: Cassandra at Instagram by Rick Branson

CASSANDRA AT INSTAGRAM

Rick Branson, Infrastructure Engineer

@rbranson

2013 Cassandra Summit

#cassandra13

June 12, 2013

San Francisco, CA

Page 2: C* Summit 2013: Cassandra at Instagram by Rick Branson

September 2012

Redis fillin' up.

Page 3: C* Summit 2013: Cassandra at Instagram by Rick Branson
Page 4: C* Summit 2013: Cassandra at Instagram by Rick Branson

What sucks?

Page 5: C* Summit 2013: Cassandra at Instagram by Rick Branson

THE OBVIOUS: Memory is expensive.

Page 6: C* Summit 2013: Cassandra at Instagram by Rick Branson

LESS OBVIOUS: In-memory "degrades" poorly

Page 7: C* Summit 2013: Cassandra at Instagram by Rick Branson
Page 8: C* Summit 2013: Cassandra at Instagram by Rick Branson

• Flat namespace. What's in there?

• Heap fragmentation

• Single threaded

Page 9: C* Summit 2013: Cassandra at Instagram by Rick Branson

BGSAVE

Page 10: C* Summit 2013: Cassandra at Instagram by Rick Branson

The Data

• Boils down to centralized logging

• VERY high skew of writes to reads (1,000:1)

• Ever growing data set

• Durability highly valued

Page 11: C* Summit 2013: Cassandra at Instagram by Rick Branson

The Setup

• Cassandra 1.1

• 3 EC2 m1.xlarge (2-core, 15GB RAM)

• RAIDed ephemerals (1.6TB of SATA)

• RF=3

• 6GB Heap, 200MB NewSize

• HSHA

Page 12: C* Summit 2013: Cassandra at Instagram by Rick Branson

It worked. Mostly.

Page 13: C* Summit 2013: Cassandra at Instagram by Rick Branson

The horrible/cool thing about Chef...

Page 14: C* Summit 2013: Cassandra at Instagram by Rick Branson

commit a1489a34d2aa69316b010146ab5254895f7b9141
Author: Rick Branson
Date: Thu Oct 18 20:05:16 2012 -0700

Follow the rules for Cassandra listen_address so I don't burn a whole day fixing my retarded mistake

Page 15: C* Summit 2013: Cassandra at Instagram by Rick Branson
Page 16: C* Summit 2013: Cassandra at Instagram by Rick Branson

commit 41c96f3243a902dd6af4ea29ef6097351a16494a
Author: Rick Branson
Date: Tue Oct 30 17:12:00 2012 -0700

Use 256k JVM stack size for C* -- fixes a bug that got integrated with 1.1.6 packaging + Java 1.6.0_u34+

Page 17: C* Summit 2013: Cassandra at Instagram by Rick Branson

November 2012

Doubled to 6 nodes.

18,000 connections. Spread those more evenly.

Page 18: C* Summit 2013: Cassandra at Instagram by Rick Branson

commit 3f2e4f2e5da6fe99d7f3fc13c0da09b464b3a9e0
Author: Rick Branson
Date: Wed Nov 21 09:50:21 2012 -0800

Drop key cache size on C*UA cluster: was causing heap issues, and apparently 1GB is _WAY_ outside of the normal range of operation for nodes of this size.

Page 19: C* Summit 2013: Cassandra at Instagram by Rick Branson

commit 5926aa5ce69d48e5f2bb7c0d0e86b411645bc786
Author: Rick Branson
Date: Mon Dec 24 12:41:13 2012 -0800

Lower memtable sizes on C* UA cluster to make more room for compression metadata / bloom filters on heap

Page 20: C* Summit 2013: Cassandra at Instagram by Rick Branson
Page 21: C* Summit 2013: Cassandra at Instagram by Rick Branson

1.2.1.

It went well. Well... until...

Page 22: C* Summit 2013: Cassandra at Instagram by Rick Branson
Page 23: C* Summit 2013: Cassandra at Instagram by Rick Branson

commit 84982635d5c807840d625c22a8bd4407c1879eba
Author: Rick Branson
Date: Thu Jan 31 09:43:56 2013 -0800

Switch Cassandra from tokens to vnodes

commit e990acc5dc69468c8a96a848695fca56e79f8b83
Author: Rick Branson
Date: Sun Feb 10 20:26:32 2013 -0800

We aren't ready for vnodes yet guys

Page 24: C* Summit 2013: Cassandra at Instagram by Rick Branson

TAKEAWAY: Let stupid/enterprising, experienced operators that will submit patches take the first few bullets on brand-new major versions.

Page 25: C* Summit 2013: Cassandra at Instagram by Rick Branson

commit acb02daea57dca889c2aa45963754a271fa51566
Author: Rick Branson
Date: Sun Feb 10 20:36:34 2013 -0800

Doubled C* cluster

Page 26: C* Summit 2013: Cassandra at Instagram by Rick Branson

commit cc13a4c15ee0051bb7c4e3b13bd6ae56301ac670
Author: Rick Branson
Date: Thu Mar 14 16:23:18 2013 -0700

Subtract token from C*ua7 to replace the node

Page 27: C* Summit 2013: Cassandra at Instagram by Rick Branson

pycassa exceptions (last 6 months)

Page 28: C* Summit 2013: Cassandra at Instagram by Rick Branson
Page 29: C* Summit 2013: Cassandra at Instagram by Rick Branson

• 3.4TB

• Will try vnode migration again soon...

Page 30: C* Summit 2013: Cassandra at Instagram by Rick Branson

TAKEAWAY: Adopt a technology by understanding what it's best at and letting it do that first, then expand...

Page 31: C* Summit 2013: Cassandra at Instagram by Rick Branson
Page 32: C* Summit 2013: Cassandra at Instagram by Rick Branson

• Sharded Redis

• 32x68GB (m2.4xlarge)

• Space (memory) bound

• Resharding sucks

• Let's get some better availability...

Page 33: C* Summit 2013: Cassandra at Instagram by Rick Branson

user_id: [ activity, activity, ...]

Page 34: C* Summit 2013: Cassandra at Instagram by Rick Branson

user_id: [ activity, activity, ...]

Thrift Serialized Activity

Page 35: C* Summit 2013: Cassandra at Instagram by Rick Branson

Bound the Size

user_id: [ activity1, activity2, ... activity100, activity101, ...]

LTRIM <user_id> 0 99

Page 36: C* Summit 2013: Cassandra at Instagram by Rick Branson

Undo

user_id: [ activity1, activity2, activity3, ...]

LREM <user_id> 0 <activity2>

Page 37: C* Summit 2013: Cassandra at Instagram by Rick Branson

C* data model

user_id    TimeUUID1     TimeUUID2     ...    TimeUUID101

user_id    <activity>    <activity>    ...    <activity>

Page 38: C* Summit 2013: Cassandra at Instagram by Rick Branson

Bound the Size

user_id    TimeUUID1     TimeUUID2     ...    TimeUUID101

user_id    <activity>    <activity>    ...    <activity>

get(<user_id>)

delete(<user_id>, columns=[<TimeUUID101>, <TimeUUID102>, <TimeUUID103>, ...])

Page 39: C* Summit 2013: Cassandra at Instagram by Rick Branson

The great destroyer of systems shows up. Tombstones abound.

Page 40: C* Summit 2013: Cassandra at Instagram by Rick Branson

user_id    TimeUUID1     TimeUUID2     ...    TimeUUID101

user_id    <activity>    <activity>    ...    <activity>

user_id    timestamp1    timestamp2    ...    timestamp101

TimeUUID = timestamp

Page 41: C* Summit 2013: Cassandra at Instagram by Rick Branson

user_id    TimeUUID1     TimeUUID2     ...    TimeUUID101

user_id    <activity>    <activity>    ...    <activity>

user_id    timestamp1    timestamp2    ...    timestamp101

delete(<user_id>, timestamp=<timestamp101>)

Row Delete: Deletes any data on a row with a timestamp value equal to or less than the timestamp provided in the delete operation.
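The row-delete trim can be sketched in plain Python. This is only an in-memory model, not Cassandra client code: the row is a dict keyed by write timestamp, and `row_delete` and `trim_row` are hypothetical helper names. It assumes, as the slide does, that each column's write timestamp equals the timestamp inside its TimeUUID.

```python
from typing import Dict


def row_delete(row: Dict[int, str], timestamp: int) -> Dict[int, str]:
    """Model of a row-level delete: everything written at or before
    `timestamp` is shadowed, everything newer survives."""
    return {ts: value for ts, value in row.items() if ts > timestamp}


def trim_row(row: Dict[int, str], keep: int) -> Dict[int, str]:
    """Bound the row to its `keep` newest columns with ONE row delete,
    instead of one column tombstone per trimmed activity."""
    if len(row) <= keep:
        return dict(row)
    newest_first = sorted(row, reverse=True)
    cutoff = newest_first[keep]  # newest timestamp that must go
    return row_delete(row, cutoff)


# 101 activities, keep the newest 100 (hypothetical data).
row = {ts: f"activity{ts}" for ts in range(1, 102)}
trimmed = trim_row(row, keep=100)
print(len(trimmed), min(trimmed))
```

The point of the design is that the trim costs a single tombstone per row no matter how many old activities it drops.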

Page 42: C* Summit 2013: Cassandra at Instagram by Rick Branson

Optimizes Reads

SSTables: max_ts=100, max_ts=200, max_ts=300, max_ts=400, max_ts=500, max_ts=600, max_ts=700, max_ts=800

Contains row tombstone with timestamp 350

Safely ignored using in-memory metadata
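A rough sketch of the read optimization, under stated assumptions: the reader already knows the row tombstone's timestamp, and each SSTable exposes its maximum write timestamp as in-memory metadata. `SSTableMeta` and `sstables_to_read` are illustrative names, not Cassandra classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SSTableMeta:
    name: str
    max_ts: int  # newest write timestamp contained in this SSTable


def sstables_to_read(sstables: List[SSTableMeta],
                     row_tombstone_ts: int) -> List[SSTableMeta]:
    """Keep only SSTables that could hold data newer than the tombstone.
    Anything with max_ts below the tombstone is fully shadowed, so it can
    be skipped without touching disk."""
    return [s for s in sstables if s.max_ts >= row_tombstone_ts]


# The eight SSTables from the slide, with the tombstone at timestamp 350.
tables = [SSTableMeta(f"sstable-{i}", ts)
          for i, ts in enumerate(range(100, 900, 100), start=1)]
needed = sstables_to_read(tables, row_tombstone_ts=350)
print([s.max_ts for s in needed])
```

In this model the SSTables with max_ts of 100, 200, and 300 never need to be opened for the read.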

Page 43: C* Summit 2013: Cassandra at Instagram by Rick Branson

~10% of actions are undos.

Page 44: C* Summit 2013: Cassandra at Instagram by Rick Branson

Undo Support

user_id    TimeUUID1     TimeUUID2     ...    TimeUUID101

user_id    <activity>    <activity>    ...    <activity>

get(<user_id>)

delete(<user_id>, columns=[<TimeUUID2>])

Page 45: C* Summit 2013: Cassandra at Instagram by Rick Branson
Page 46: C* Summit 2013: Cassandra at Instagram by Rick Branson

get(<user_id>)

delete(<user_id>, columns=[<TimeUUID2>])

Simple Race Condition: The state of the row may have changed between these two operations.

💩

Page 47: C* Summit 2013: Cassandra at Instagram by Rick Branson

Diverging Replicas

Replicas: [A, B]   [A]   [A, B]

Writer 1, "like Z": insert B

Writer 2, undo "like Z": read [A]

Write results: OK, FAIL

Page 48: C* Summit 2013: Cassandra at Instagram by Rick Branson

SuperColumn = Old/Busted

AntiColumn = New/Hotness

user_id    (0, <TimeUUID>)    (1, <TimeUUID>)    (1, <TimeUUID>)

user_id    anti-column        activity           activity

"Anti-Column": Contains an MD5 hash of the activity data it is marking as deleted.

Page 49: C* Summit 2013: Cassandra at Instagram by Rick Branson

user_id    (0, <TimeUUID>)    (1, <TimeUUID>)    (1, <TimeUUID>)

user_id    anti-column        activity           activity

Composite Column: First component is zero for anti-columns, splitting the row into two independent lists, and ensuring the anti-columns always appear at the head.
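The anti-column idea can be modeled in a few lines of Python. This is a sketch of the data model only, assuming an in-memory dict in place of a Cassandra row and plain integers in place of TimeUUIDs; `add_activity`, `undo_activity`, and `visible_activities` are hypothetical names.

```python
import hashlib
from typing import Dict, List, Tuple

ANTI, ACTIVITY = 0, 1  # first component of the composite column name
Row = Dict[Tuple[int, int], bytes]


def add_activity(row: Row, time_id: int, activity: bytes) -> None:
    """Write-only insert: no read of the existing row is needed."""
    row[(ACTIVITY, time_id)] = activity


def undo_activity(row: Row, time_id: int, activity: bytes) -> None:
    """Write-only undo: store an anti-column holding the MD5 of the
    activity it cancels."""
    row[(ANTI, time_id)] = hashlib.md5(activity).digest()


def visible_activities(row: Row) -> List[bytes]:
    """Anti-columns sort first (component 0), so one ordered pass can
    collect the cancelled hashes before filtering the activities."""
    cancelled = set()
    result = []
    for (kind, _), value in sorted(row.items()):
        if kind == ANTI:
            cancelled.add(value)
        elif hashlib.md5(value).digest() not in cancelled:
            result.append(value)
    return result


# The undo can land before or after the activity it cancels.
in_order: Row = {}
add_activity(in_order, 1, b"like A")
add_activity(in_order, 2, b"like Z")
undo_activity(in_order, 3, b"like Z")

reordered: Row = {}
undo_activity(reordered, 3, b"like Z")
add_activity(reordered, 2, b"like Z")
add_activity(reordered, 1, b"like A")

print(visible_activities(in_order) == visible_activities(reordered))
```

Because both operations are blind writes, the result does not depend on which one a replica sees first.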

Page 50: C* Summit 2013: Cassandra at Instagram by Rick Branson

Diverging Replicas: Solved

Replicas: [A, B, C]   [A, C]   [A, B, C]

Writer 1, "like Z": insert B

Writer 2, undo "like Z": insert C

Write results: OK, FAIL, OK

Page 51: C* Summit 2013: Cassandra at Instagram by Rick Branson

TAKEAWAY: Read-before-write is a smell. Try to model data as a log of user "intent" rather than manhandling the data into place.

Page 52: C* Summit 2013: Cassandra at Instagram by Rick Branson

• Keep 30% "buffer" for trims.

• Undo without read. (thumbsup)

• Large lists suck for this. (thumbsdown)

• CASSANDRA-5527

Page 53: C* Summit 2013: Cassandra at Instagram by Rick Branson

Built in two days.

Experience pays.

Page 54: C* Summit 2013: Cassandra at Instagram by Rick Branson

Reusability is key to rapid rollout.

Great documentation eases concerns.

Page 55: C* Summit 2013: Cassandra at Instagram by Rick Branson

• C* 1.2.3

• vnodes, LeveledCompactionStrategy

• 12 hi1.4xlarge (8-core, 60GB, SSD)

• 3 AZs, RF=3, W=2, R=1

• 8GB heap, 800MB NewSize

Page 56: C* Summit 2013: Cassandra at Instagram by Rick Branson

Rollout

1. Dial up Double Writes

2. Test with "Shadow" Reads

3. Dial up "Real" Reads
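The three rollout steps could be wired up roughly as follows. Everything here is an assumption for illustration: the slides do not show how the dials were implemented, `InboxRouter` and `in_rollout` are made-up names, and plain dicts stand in for the old (Redis) and new (Cassandra) stores.

```python
import zlib
from typing import Dict, List

Store = Dict[int, List[str]]


def in_rollout(user_id: int, percent: int) -> bool:
    """Stable per-user bucketing: the same user is always in or out
    for a given dial setting."""
    return zlib.crc32(str(user_id).encode()) % 100 < percent


class InboxRouter:
    def __init__(self, old: Store, new: Store) -> None:
        self.old, self.new = old, new
        self.double_write_pct = 0   # step 1
        self.shadow_read_pct = 0    # step 2
        self.real_read_pct = 0      # step 3
        self.mismatches = 0         # shadow reads that disagreed

    def write(self, user_id: int, activity: str) -> None:
        self.old.setdefault(user_id, []).append(activity)
        if in_rollout(user_id, self.double_write_pct):
            self.new.setdefault(user_id, []).append(activity)

    def read(self, user_id: int) -> List[str]:
        if in_rollout(user_id, self.real_read_pct):
            return self.new.get(user_id, [])
        old_result = self.old.get(user_id, [])
        if in_rollout(user_id, self.shadow_read_pct):
            # Shadow read: compare, count differences, serve the old data.
            if self.new.get(user_id, []) != old_result:
                self.mismatches += 1
        return old_result
```

Raising each percentage from 0 to 100 in order corresponds to the three steps; shadow reads exercise the new store under real traffic without users ever seeing its answers.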

Page 57: C* Summit 2013: Cassandra at Instagram by Rick Branson

commit 1c3d99a9e337f9383b093009dba074b8ade20768
Author: Rick Branson
Date: Mon May 6 14:58:54 2013 -0700

Bump C* inbox heap size 8G -> 10G, seeing heap pressure

Page 58: C* Summit 2013: Cassandra at Instagram by Rick Branson

Bootstrapping sucked because compacting 10,000 SSTables takes forever.

sstable_size_in_mb: 5 => 25

Page 59: C* Summit 2013: Cassandra at Instagram by Rick Branson

Come in on Monday, one of the nodes was unable to flush and has built up 8,000+ commit log segments.

Page 60: C* Summit 2013: Cassandra at Instagram by Rick Branson

"Normal" Rebuild Process

1. /etc/init.d/cassandra stop

2. mv /data/cassandra /data/cassandra.old

3. /etc/init.d/cassandra start

Page 61: C* Summit 2013: Cassandra at Instagram by Rick Branson

For "non-vnode" clusters, best practice is to set the initial_token in cassandra.yaml.

Page 62: C* Summit 2013: Cassandra at Instagram by Rick Branson

For vnode clusters, multiple tokens are selected randomly when a node is bootstrapped.

Page 63: C* Summit 2013: Cassandra at Instagram by Rick Branson

IP address is effectively the "primary key" for nodes in a ring.

Page 64: C* Summit 2013: Cassandra at Instagram by Rick Branson

What had happened was.

1. Rebuilding node generated entirely new tokens and joined cluster.

2. Rest of cluster dropped the previously stored token data associated with the rebuilding node's IP address.

3. Token ranges shifted massively.

Page 65: C* Summit 2013: Cassandra at Instagram by Rick Branson

UPDATE COLUMN FAMILY InboxActivitiesByUserID WITH read_repair_chance = 1.0;

stats.inbox.empty

Page 66: C* Summit 2013: Cassandra at Instagram by Rick Branson

Kicked off "nodetool repair" and waited... and waited...

Page 67: C* Summit 2013: Cassandra at Instagram by Rick Branson

LeveledCompactionStrategy + vnodes = tragedy.

Page 68: C* Summit 2013: Cassandra at Instagram by Rick Branson

kill -3 <cassandra>

"AntiEntropyStage:1"
   java.lang.Thread.State: RUNNABLE
   <...>
   at io.sstable.SSTableReader.decodeKey(SSTableReader.java:1014)
   at io.sstable.SSTableReader.getPosition(SSTableReader.java:802)
   at io.sstable.SSTableReader.getPosition(SSTableReader.java:717)
   at io.sstable.SSTableReader.getPositionsForRanges(SSTableReader.java:664)
   at streaming.StreamOut.createPendingFiles(StreamOut.java:155)
   at streaming.StreamOut.transferSSTables(StreamOut.java:140)
   at streaming.StreamingRepairTask.initiateStreaming(StreamingRepairTask.java:133)
   at streaming.StreamingRepairTask.run(StreamingRepairTask.java:115)
   <...>

Every repair task was scanning every SSTable file to find ranges to repair.

Page 69: C* Summit 2013: Cassandra at Instagram by Rick Branson

Scan all the things.

• Standard Compaction: Only a few dozen SSTables.

• Non-VNodes: Repair is done once per token, and there is only one token.
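A back-of-the-envelope model of why the combination hurt: if every repair task scans every SSTable, the work is the product of the two counts. The inputs below are assumptions for illustration only: 36 stands in for "a few dozen", 256 is a hypothetical tokens-per-node setting, and 3264 is the SSTable count reported in the cfstats output later in the talk.

```python
def sstable_scans(repair_tasks: int, sstables: int) -> int:
    """Each repair task walks every SSTable once to find its ranges."""
    return repair_tasks * sstables


# One token, size-tiered ("standard") compaction: a few dozen SSTables.
single_token = sstable_scans(repair_tasks=1, sstables=36)

# Many tokens per node plus leveled compaction's thousands of small SSTables.
vnodes_leveled = sstable_scans(repair_tasks=256, sstables=3264)

print(single_token, vnodes_leveled)
```

Either factor alone is tolerable; it is the multiplication that makes repair crawl.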

Page 70: C* Summit 2013: Cassandra at Instagram by Rick Branson
Page 71: C* Summit 2013: Cassandra at Instagram by Rick Branson

~20X increase in repair performance.

Page 72: C* Summit 2013: Cassandra at Instagram by Rick Branson

TAKEAWAY: If you want to use VNodes and LeveledCompactionStrategy, wait until the 1.2.6 release when CASSANDRA-5569 is merged in.

Page 73: C* Summit 2013: Cassandra at Instagram by Rick Branson

Where were we?

It was a bad thing to not know data was inconsistent until we saw an increase in user-reported problems.

Page 74: C* Summit 2013: Cassandra at Instagram by Rick Branson

CASSANDRA-5618

$ nodetool netstats
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
Read Repair Statistics:
Attempted: 3192520
Mismatch (Blocking): 0
Mismatch (Background): 11584
Pool Name    Active    Pending    Completed
Commands     n/a       0          1837765727
Responses    n/a       1          1750784545

UPDATE COLUMN FAMILY InboxActivitiesByUserID WITH read_repair_chance = 0.01;

99.63% consistent
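The consistency figure appears to come straight from the read repair counters: the share of attempted read repairs that found no mismatch. A small worked check, assuming that is how the number was derived (`consistency_ratio` is an illustrative name):

```python
def consistency_ratio(attempted: int, blocking: int, background: int) -> float:
    """Fraction of attempted read repairs where replicas already agreed."""
    return 1 - (blocking + background) / attempted


# Counters from the nodetool netstats output on the slide.
ratio = consistency_ratio(attempted=3192520, blocking=0, background=11584)
print(f"{ratio:.4f}")
```

This works out to roughly 99.6%, in line with the slide's 99.63% up to rounding.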

Page 75: C* Summit 2013: Cassandra at Instagram by Rick Branson

TAKEAWAY: The way to rebuild a box in a vnode cluster is to build a brand new node, then remove the old one with "nodetool removenode."

Page 76: C* Summit 2013: Cassandra at Instagram by Rick Branson
Page 77: C* Summit 2013: Cassandra at Instagram by Rick Branson

Fetch & Deserialize Time (measured from app)

Mean vs P90 (ms), trough-to-peak

Page 78: C* Summit 2013: Cassandra at Instagram by Rick Branson

Column Family: InboxActivitiesByUserID
SSTable count: 3264
SSTables in each level: [1, 10, 105/100, 1053/1000, 2095, 0, 0]
Space used (live): 80114509324
Space used (total): 80444164726
Memtable Columns Count: 2315159
Memtable Data Size: 112197632
Memtable Switch Count: 1312
Read Count: 316192445
Read Latency: 1.982 ms.
Write Count: 1581610760
Write Latency: 0.031 ms.
Pending Tasks: 0
Bloom Filter False Positives: 481617
Bloom Filter False Ratio: 0.08558
Bloom Filter Space Used: 54723960
Compacted row minimum size: 25
Compacted row maximum size: 545791
Compacted row mean size: 3020

Page 79: C* Summit 2013: Cassandra at Instagram by Rick Branson

Thank you!

We're hiring!