Cassandra at Instagram (August 2013)

Transcript
Page 1

CASSANDRA AT INSTAGRAM
Rick Branson, Infrastructure Engineer
@rbranson

SF Cassandra Meetup
August 29, 2013

Disqus HQ

Page 2

September 2012: Redis fillin' up.

Page 3
Page 4

What sucks?

Page 5

THE OBVIOUS: Memory is expensive.

Page 6

LESS OBVIOUS: In-memory "degrades" poorly.

Page 7
Page 8

• Flat namespace. What's in there?

• Heap fragmentation

• Single threaded

Page 9

BGSAVE

Page 10

• Boils down to centralized logging

• VERY high skew of writes to reads (1,000:1)

• Ever growing data set

• Durability highly valued

• Dumb to store it in RAM, basically...

The Data

Page 11

• Cassandra 1.1

• 3 EC2 m1.xlarge (2-core, 15GB RAM)

• RAIDed ephemerals (1.6TB of SATA)

• RF=3

• 6GB Heap, 200MB NewSize

• HSHA

The Setup
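For reference, a hedged sketch of where those knobs live in a stock install: the heap numbers go in conf/cassandra-env.sh, and HSHA is the rpc_server_type in cassandra.yaml (values copied from the slide; everything else assumed stock):

    # conf/cassandra-env.sh
    MAX_HEAP_SIZE="6G"
    HEAP_NEWSIZE="200M"

    # conf/cassandra.yaml
    rpc_server_type: hsha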

Page 12

It worked. Mostly.

Page 13

The horrible/cool thing about Chef...

Page 14

commit a1489a34d2aa69316b010146ab5254895f7b9141
Author: Rick Branson
Date:   Thu Oct 18 20:05:16 2012 -0700

Follow the rules for Cassandra listen_address so I don't burn a whole day fixing my retarded mistake

Page 15
Page 16

commit 41c96f3243a902dd6af4ea29ef6097351a16494a
Author: Rick Branson
Date:   Tue Oct 30 17:12:00 2012 -0700

Use 256k JVM stack size for C* -- fixes a bug that got integrated with 1.1.6 packaging + Java 1.6.0_u34+

Page 17

November 2012: Doubled to 6 nodes.

18,000 connections. Spread those more evenly.

Page 18

commit 3f2e4f2e5da6fe99d7f3fc13c0da09b464b3a9e0
Author: Rick Branson
Date:   Wed Nov 21 09:50:21 2012 -0800

Drop key cache size on C*UA cluster: was causing heap issues, and apparently 1GB is _WAY_ outside of the normal range of operation for nodes of this size.

Page 19

commit 5926aa5ce69d48e5f2bb7c0d0e86b411645bc786
Author: Rick Branson
Date:   Mon Dec 24 12:41:13 2012 -0800

Lower memtable sizes on C* UA cluster to make more room for compression metadata / bloom filters on heap

Page 20
Page 21

1.2.1. It went well... until...

Page 22
Page 23

commit 84982635d5c807840d625c22a8bd4407c1879eba
Author: Rick Branson
Date:   Thu Jan 31 09:43:56 2013 -0800

Switch Cassandra from tokens to vnodes

commit e990acc5dc69468c8a96a848695fca56e79f8b83
Author: Rick Branson
Date:   Sun Feb 10 20:26:32 2013 -0800

We aren't ready for vnodes yet guys

Page 24

TAKEAWAY: Let enterprising, experienced operators who will submit patches take the first few bullets on brand-new major versions.

Page 25

commit acb02daea57dca889c2aa45963754a271fa51566
Author: Rick Branson
Date:   Sun Feb 10 20:36:34 2013 -0800

Doubled C* cluster

Page 26

commit cc13a4c15ee0051bb7c4e3b13bd6ae56301ac670
Author: Rick Branson
Date:   Thu Mar 14 16:23:18 2013 -0700

Subtract token from C*ua7 to replace the node

Page 27

pycassa exceptions (last 6 months)

Page 28
Page 29

• 3.4TB

• vnode migration still pending

Page 30

TAKEAWAY: Adopt a technology by understanding what it's best at and letting it do that first, then expand...

Page 31
Page 32

• Sharded master/slave Redis

• 32x68GB (m2.4xlarge)

• Space (memory) bound

• Resharding sucks

• Failover is manual, wakes us up at night

Page 33

user_id: [ activity, activity, ...]

Page 34

user_id: [ activity, activity, ...]

Thrift Serialized Activity

Page 35

Bound the Size

user_id: [ activity1, activity2, ..., activity100, activity101, ... ]

LTRIM <user_id> 0 99
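A minimal redis-py sketch of this bounded-list pattern (the key scheme and function names are illustrative, not from the deck; assumes redis-py 3.x):

    import redis

    r = redis.StrictRedis()

    def record_activity(user_id, serialized_activity):
        key = "activities:%d" % user_id     # hypothetical key scheme
        r.lpush(key, serialized_activity)   # newest activity goes to the head
        r.ltrim(key, 0, 99)                 # keep only the newest 100 entries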

Page 36

Undo

user_id: [ activity1, activity2, activity3, ...]

LREM <user_id> 0 <activity2>
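The undo path in the same sketch: LREM scans the list and removes elements that byte-match the serialized activity (count 0 means remove all matches; note that redis-py 3.x argument order is (name, count, value)):

    def undo_activity(user_id, serialized_activity):
        key = "activities:%d" % user_id
        # Remove every list element equal to this exact serialized value.
        r.lrem(key, 0, serialized_activity)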

Page 37

C* data model

user_id: TimeUUID1   TimeUUID2   ...  TimeUUID101
         <activity>  <activity>  ...  <activity>

Page 38

Bound the Size

user_id: TimeUUID1   TimeUUID2   ...  TimeUUID101
         <activity>  <activity>  ...  <activity>

get(<user_id>)
delete(<user_id>, columns=[<TimeUUID101>, <TimeUUID102>, <TimeUUID103>, ...])
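A hedged pycassa sketch of that trim: slice the row newest-first, then issue by-name deletes for everything past the cap (the keyspace name and slice size are assumptions for illustration; the column family name is from the deck):

    from pycassa import ConnectionPool, ColumnFamily

    pool = ConnectionPool('instagram')                   # hypothetical keyspace
    cf = ColumnFamily(pool, 'InboxActivitiesByUserID')

    def trim_activities(user_id, limit=100):
        # Newest-first slice; over-fetch so the overflow is visible.
        row = cf.get(user_id, column_count=limit * 2, column_reversed=True)
        overflow = list(row.keys())[limit:]              # TimeUUID101 onward
        if overflow:
            cf.remove(user_id, columns=overflow)         # by-name tombstones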

Page 39

The great destroyer of systems shows up. Tombstones abound.

Page 40

Column Delete

user_id: TimeUUID1   TimeUUID2   ...  TimeUUID2
         <activity>  <activity>  ...  [tombstone]
         timestamp1  timestamp2  ...  timestamp2

Cassandra internally stores deletes as tombstones, which mark data for a given column as deleted at-or-before a timestamp. The tombstone timestamp is >= the live column timestamp, so the live column will be hidden from queries and compacted away.

Page 41

TimeUUID = timestamp

user_id: TimeUUID1   TimeUUID2   ...  TimeUUID101
         <activity>  <activity>  ...  <activity>
         timestamp1  timestamp2  ...  timestamp101

To avoid tombstones, exploit that the timestamp embedded in our TimeUUID (the ordering) is the same as the column timestamp.
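A sketch of the trick using pycassa's UUID helpers: write each column with the TimeUUID's embedded time as the explicit column timestamp, so the ordering key and the timestamp are the same instant (column timestamps are microseconds):

    import time
    import pycassa.util

    def insert_activity(cf, user_id, serialized_activity):
        u = pycassa.util.convert_time_to_uuid(time.time())
        # Reuse the UUID's embedded time as the column timestamp.
        ts = int(pycassa.util.convert_uuid_to_time(u) * 1e6)
        cf.insert(user_id, {u: serialized_activity}, timestamp=ts)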

Page 42

Row Delete

user_id: TimeUUID1   TimeUUID2   ...  TimeUUID101
         <activity>  <activity>  ...  <activity>
         timestamp1  timestamp2  ...  timestamp101

delete(<user_id>, timestamp=<timestamp101>)

Cassandra can also store row tombstones, which delete all data in a row at-or-before the provided timestamp.
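With that invariant, the trim becomes a single row tombstone at the 101st-newest column's embedded timestamp, with no per-column deletes. A hedged sketch (assumes your pycassa version exposes the timestamp argument on remove):

    import pycassa.util

    def trim_activities(cf, user_id, limit=100):
        row = cf.get(user_id, column_count=limit + 1, column_reversed=True)
        if len(row) <= limit:
            return
        boundary = list(row.keys())[limit]   # the 101st-newest TimeUUID
        ts = int(pycassa.util.convert_uuid_to_time(boundary) * 1e6)
        # Row tombstone: everything written at-or-before ts is deleted.
        cf.remove(user_id, timestamp=ts)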

Page 43

Optimizes Reads

[Diagram: eight SSTables with max_ts = 100, 200, 300, 400, 500, 600, 700, 800. One contains a row tombstone with timestamp 350. The SSTables whose max_ts is at-or-before 350 (here 100, 200, 300) are safely ignored using in-memory metadata.]
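The skip itself is simple; a sketch of the idea using the slide's numbers:

    def sstables_to_read(sstables, row_tombstone_ts):
        # sstables: list of (name, max_ts). Any SSTable whose newest cell is
        # at-or-before the row tombstone holds only shadowed data for this row.
        return [s for s in sstables if s[1] > row_tombstone_ts]

    tables = [("sstable-%d" % ts, ts) for ts in range(100, 900, 100)]
    print(sstables_to_read(tables, 350))    # max_ts 100-300 are skipped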

Page 44

~10% of actions are undos.

Page 45

Undo Support

user_id: TimeUUID1   TimeUUID2   ...  TimeUUID101
         <activity>  <activity>  ...  <activity>

get(<user_id>)
delete(<user_id>, columns=[<TimeUUID2>])

Page 46
Page 47

get(<user_id>)
delete(<user_id>, columns=[<TimeUUID2>])

Simple Race Condition: The state of the row may have changed between these two operations.

💩

Page 48

Diverging Replicas

[Diagram: a "Like" write. The Writer inserts B; two replicas acknowledge OK and hold [A, B], but the write to the third replica FAILs, leaving it with only [A].]

Page 49

Diverging Replicas

[Diagram: an "Undo Like" read. The Writer reads and gets [A] from the diverged replica, while the other two replicas hold [A, B].]

The replica is missing B, so if a read is required to find B before deleting it, it's going to fail.

Page 50

SuperColumn = Old/Busted
AntiColumn = New/Hotness

user_id: (0, <TimeUUID>)  (1, <TimeUUID>)  (1, <TimeUUID>)
         anti-column      activity         activity

"Anti-Column": Borrowing from the idea of Cassandra's by-name tombstones, it contains an MD5 hash of the activity data "value" it marks as deleted.

Page 51

user_id: (0, <TimeUUID>)  (1, <TimeUUID>)  (1, <TimeUUID>)
         anti-column      activity         activity

Composite Column: The first component is zero for anti-columns, splitting the row into two independent lists and ensuring the anti-columns always appear at the head.
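A hedged pycassa sketch of the scheme, assuming the column family's comparator is CompositeType(IntegerType, TimeUUIDType): flag 0 marks an anti-column whose value is the MD5 of the activity bytes, flag 1 a live activity, and undo becomes a plain insert:

    import hashlib
    from pycassa import ConnectionPool, ColumnFamily

    pool = ConnectionPool('instagram')                  # hypothetical keyspace
    cf = ColumnFamily(pool, 'InboxActivitiesByUserID')

    ANTI, LIVE = 0, 1

    def add_activity(user_id, time_uuid, activity_bytes):
        cf.insert(user_id, {(LIVE, time_uuid): activity_bytes})

    def undo_activity(user_id, time_uuid, activity_bytes):
        # No read-before-write: just record "this value is deleted" at row head.
        digest = hashlib.md5(activity_bytes).digest()
        cf.insert(user_id, {(ANTI, time_uuid): digest})

At read time the client slices the row, collects the anti-column hashes from the head, and drops any live activity whose MD5 matches one of them.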

Page 52

Diverging Replicas: Solved

[Diagram: the same "Like" write. The Writer inserts B; two replicas acknowledge OK and hold [A, B], but the write to the third FAILs, leaving it with [A].]

Page 53

Diverging Replicas: Solved

[Diagram: an "Undo Like" write. The Writer inserts anti-column C, which succeeds everywhere: the replicas now hold [A, B, C] and [A, C].]

Instead of read-before-write, an anti-column is inserted to mark the activity as deleted.

Page 54

TAKEAWAY: Read-before-write is a smell. Try to model data as a log of user "intent" rather than manhandling the data into place.

Page 55

• Keep 30% "buffer" for trims.

• Undo without read. 👍

• Large lists suck for this. 👎

• CASSANDRA-5527

Page 56

Built in two days. Experience paid off.

Page 57

Reusability is key to rapid rollout. Great documentation eases concerns.

Page 58

• C* 1.2.3

• vnodes, LeveledCompactionStrategy

• 12 hi1.4xlarge (8-core, 60GB, 2T SSD)

• 3 AZs, RF=3, CL W=TWO R=ONE

• 8G heap, 800M NewSize

Initial Setup
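In pycassa terms, the consistency choice above looks roughly like this (the server list is a placeholder):

    from pycassa import ConnectionPool, ColumnFamily, ConsistencyLevel

    pool = ConnectionPool('instagram', server_list=['node1:9160'])
    cf = ColumnFamily(pool, 'InboxActivitiesByUserID',
                      write_consistency_level=ConsistencyLevel.TWO,
                      read_consistency_level=ConsistencyLevel.ONE)

With RF=3, W=TWO and R=ONE sum to exactly RF, so reads are fast and writes durable, but replica overlap (and thus read-your-writes) is not guaranteed.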

Page 59

1. Dial up Double Writes

2. Test with "Shadow" Reads

3. Dial up "Real" Reads

Rollout
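A sketch of how such a dial-up is commonly wired, using per-path percentage flags (all names are hypothetical and the storage calls are stand-ins):

    import random

    # Dials raised gradually over the rollout.
    ROLLOUT = {"double_write": 1.0, "shadow_read": 0.1, "real_read": 0.0}

    def dialed(name):
        return random.random() < ROLLOUT[name]

    def redis_read(user_id): return []            # stand-ins for real clients
    def redis_write(user_id, activity): pass
    def cassandra_read(user_id): return []
    def cassandra_write(user_id, activity): pass

    def write_activity(user_id, activity):
        redis_write(user_id, activity)            # 1. old path stays canonical
        if dialed("double_write"):
            cassandra_write(user_id, activity)    # ...while C* gets a copy

    def read_activities(user_id):
        if dialed("real_read"):
            return cassandra_read(user_id)        # 3. finally serve from C*
        if dialed("shadow_read"):
            cassandra_read(user_id)               # 2. exercise C*, discard result
        return redis_read(user_id)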

Page 60

commit 1c3d99a9e337f9383b093009dba074b8ade20768
Author: Rick Branson
Date:   Mon May 6 14:58:54 2013 -0700

Bump C* inbox heap size 8G -> 10G, seeing heap pressure

Page 61

Bootstrapping sucked because compacting 10,000 SSTables takes forever.

sstable_size_in_mb: 5 => 25

Page 62

Monitor Consistency

$ nodetool netstats
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
Read Repair Statistics:
Attempted: 3192520
Mismatch (Blocking): 0
Mismatch (Background): 11584
Pool Name  Active  Pending  Completed
Commands   n/a     0        1837765727
Responses   n/a    1        1750784545

UPDATE COLUMN FAMILY InboxActivitiesByUserID WITH read_repair_chance = 0.01;

99.63% consistent
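That figure falls straight out of the netstats counters above: background mismatches over attempted read repairs.

    attempted = 3192520
    mismatches = 11584      # Mismatch (Background) from nodetool netstats
    print("%.2f%% consistent" % (100.0 * (1 - float(mismatches) / attempted)))
    # -> 99.64% consistent (the slide shows 99.63%)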

Page 63

SSTable Size (again)

Saw lots of GC pressure related to buffer garbage. Eventually they landed on a new default in 1.2.9+ (160MB).

sstable_size_in_mb: 25 => 128

Page 64
Page 65

[Chart: Fetch & Deserialize Time (measured from app), mean vs. P90 in ms, trough-to-peak.]

Page 66

Space used (live): 180114509324
Space used (total): 180444164726
Memtable Columns Count: 2315159
Memtable Data Size: 112197632
Memtable Switch Count: 1312
Read Count: 316192445
Read Latency: 1.982 ms.
Write Count: 1581610760
Write Latency: 0.031 ms.
Pending Tasks: 0
Bloom Filter False Positives: 481617
Bloom Filter False Ratio: 0.08558
Bloom Filter Space Used: 54723960
Compacted row minimum size: 25
Compacted row maximum size: 545791
Compacted row mean size: 3020

Page 67

20K 200-column slice reads/sec

30K 1-column mutations/sec

30% CPU utilization

48K clients

Peak Stats

Page 68

Exciting Future Things

• Python Native Protocol Driver

• Read CPU Consumption Work

• Mass CQL Adoption

• Triggers

• CAS (for limited use cases)

Page 69

Next 6 Months...

• Node repair visibility & monitoring

• Objects & Associations Storage API on C* + memcache

• Migrate more from Redis

• New major use case

• Cassandra 2.0?

Page 70

We're hiring!