Cassandra at Instagram (August 2013)

CASSANDRA AT INSTAGRAM Rick Branson, Infrastructure Engineer @rbranson SF Cassandra Meetup August 29, 2013 Disqus HQ

description

A brief history of Instagram's adoption of the open source distributed database Apache Cassandra, along with details about its use case and implementation. This was presented at the San Francisco Cassandra Meetup at the Disqus HQ in August 2013.

Transcript of Cassandra at Instagram (August 2013)

Page 1: Cassandra at Instagram (August 2013)

CASSANDRA AT INSTAGRAM
Rick Branson, Infrastructure Engineer
@rbranson

SF Cassandra Meetup
August 29, 2013

Disqus HQ

Page 2: Cassandra at Instagram (August 2013)

September 2012: Redis fillin' up.

Page 3: Cassandra at Instagram (August 2013)
Page 4: Cassandra at Instagram (August 2013)

What sucks?

Page 5: Cassandra at Instagram (August 2013)

THE OBVIOUS: Memory is expensive.

Page 6: Cassandra at Instagram (August 2013)

LESS OBVIOUS: In-memory "degrades" poorly

Page 7: Cassandra at Instagram (August 2013)
Page 8: Cassandra at Instagram (August 2013)

• Flat namespace. What's in there?

• Heap fragmentation

• Single threaded

Page 9: Cassandra at Instagram (August 2013)

BGSAVE

Page 10: Cassandra at Instagram (August 2013)

The Data

• Boils down to centralized logging

• VERY high skew of writes to reads (1,000:1)

• Ever growing data set

• Durability highly valued

• Dumb to store it in RAM, basically...

Page 11: Cassandra at Instagram (August 2013)

The Setup

• Cassandra 1.1

• 3 EC2 m1.xlarge (2-core, 15GB RAM)

• RAIDed ephemerals (1.6TB of SATA)

• RF=3

• 6GB Heap, 200MB NewSize

• HSHA

Page 12: Cassandra at Instagram (August 2013)

It worked. Mostly.

Page 13: Cassandra at Instagram (August 2013)

The horrible (struck out) cool thing about Chef...

Page 14: Cassandra at Instagram (August 2013)

commit a1489a34d2aa69316b010146ab5254895f7b9141
Author: Rick Branson
Date: Thu Oct 18 20:05:16 2012 -0700

Follow the rules for Cassandra listen_address so I don't burn a whole day fixing my retarded mistake

Page 15: Cassandra at Instagram (August 2013)
Page 16: Cassandra at Instagram (August 2013)

commit 41c96f3243a902dd6af4ea29ef6097351a16494a
Author: Rick Branson
Date: Tue Oct 30 17:12:00 2012 -0700

Use 256k JVM stack size for C* -- fixes a bug that got integrated with 1.1.6 packaging + Java 1.6.0_u34+

Page 17: Cassandra at Instagram (August 2013)

November 2012: Doubled to 6 nodes.

18,000 connections. Spread those more evenly.

Page 18: Cassandra at Instagram (August 2013)

commit 3f2e4f2e5da6fe99d7f3fc13c0da09b464b3a9e0
Author: Rick Branson
Date: Wed Nov 21 09:50:21 2012 -0800

Drop key cache size on C*UA cluster: was causing heap issues, and apparently 1GB is _WAY_ outside of the normal range of operation for nodes of this size.

Page 19: Cassandra at Instagram (August 2013)

commit 5926aa5ce69d48e5f2bb7c0d0e86b411645bc786
Author: Rick Branson
Date: Mon Dec 24 12:41:13 2012 -0800

Lower memtable sizes on C* UA cluster to make more room for compression metadata / bloom filters on heap

Page 20: Cassandra at Instagram (August 2013)
Page 21: Cassandra at Instagram (August 2013)

1.2.1. It went well. Well... until...

Page 22: Cassandra at Instagram (August 2013)
Page 23: Cassandra at Instagram (August 2013)

commit 84982635d5c807840d625c22a8bd4407c1879eba
Author: Rick Branson
Date: Thu Jan 31 09:43:56 2013 -0800

Switch Cassandra from tokens to vnodes

commit e990acc5dc69468c8a96a848695fca56e79f8b83
Author: Rick Branson
Date: Sun Feb 10 20:26:32 2013 -0800

We aren't ready for vnodes yet guys

Page 24: Cassandra at Instagram (August 2013)

TAKEAWAY: Let stupid (struck out) enterprising, experienced operators who will submit patches take the first few bullets on brand-new major versions.

Page 25: Cassandra at Instagram (August 2013)

commit acb02daea57dca889c2aa45963754a271fa51566
Author: Rick Branson
Date: Sun Feb 10 20:36:34 2013 -0800

Doubled C* cluster

Page 26: Cassandra at Instagram (August 2013)

commit cc13a4c15ee0051bb7c4e3b13bd6ae56301ac670
Author: Rick Branson
Date: Thu Mar 14 16:23:18 2013 -0700

Subtract token from C*ua7 to replace the node

Page 27: Cassandra at Instagram (August 2013)

pycassa exceptions (last 6 months)

Page 28: Cassandra at Instagram (August 2013)
Page 29: Cassandra at Instagram (August 2013)

• 3.4TB

• vnode migration still pending

Page 30: Cassandra at Instagram (August 2013)

TAKEAWAY: Adopt a technology by understanding what it's best at and letting it do that first, then expand...

Page 31: Cassandra at Instagram (August 2013)
Page 32: Cassandra at Instagram (August 2013)

• Sharded master/slave Redis

• 32x68GB (m2.4xlarge)

• Space (memory) bound

• Resharding sucks

• Failover is manual, wakes us up at night

Page 33: Cassandra at Instagram (August 2013)

user_id: [ activity, activity, ...]

Page 34: Cassandra at Instagram (August 2013)

user_id: [ activity, activity, ...]

Thrift Serialized Activity

Page 35: Cassandra at Instagram (August 2013)

Bound the Size

user_id: [ activity1, activity2, ... activity100, activity101, ...]

LTRIM <user_id> 0 99

Page 36: Cassandra at Instagram (August 2013)

Undo

user_id: [ activity1, activity2, activity3, ...]

LREM <user_id> 0 <activity2>
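The two Redis list operations on these slides can be tried out without a server. What follows is a minimal sketch, assuming new activities are pushed to the head of the list (the slides do not show the push command). `FakeRedisLists`, `record_activity`, and `undo_activity` are hypothetical names; the class only imitates the semantics of LPUSH, LTRIM with non-negative indexes, and LREM with a count of 0.

```python
from collections import defaultdict


class FakeRedisLists:
    """In-memory stand-in for the Redis list commands used on the slides."""

    def __init__(self):
        self.lists = defaultdict(list)

    def lpush(self, key, value):
        # LPUSH: the newest activity goes to the head of the list.
        self.lists[key].insert(0, value)

    def ltrim(self, key, start, stop):
        # LTRIM keeps only the inclusive range [start, stop].
        self.lists[key] = self.lists[key][start:stop + 1]

    def lrem(self, key, count, value):
        # LREM with count == 0 removes every element equal to value.
        assert count == 0, "only count=0 is modelled here"
        before = len(self.lists[key])
        self.lists[key] = [v for v in self.lists[key] if v != value]
        return before - len(self.lists[key])


def record_activity(store, user_id, activity, limit=100):
    store.lpush(user_id, activity)
    store.ltrim(user_id, 0, limit - 1)  # LTRIM <user_id> 0 99


def undo_activity(store, user_id, activity):
    return store.lrem(user_id, 0, activity)  # LREM <user_id> 0 <activity>


store = FakeRedisLists()
for i in range(150):
    record_activity(store, "user:1", f"activity{i}")
removed = undo_activity(store, "user:1", "activity149")
inbox = store.lists["user:1"]
```

Both operations are a single command against a single key, which is the convenience the Cassandra data model on the following slides has to replace.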

Page 37: Cassandra at Instagram (August 2013)

C* data model

row key:        user_id
column names:   TimeUUID1   TimeUUID2   ...  TimeUUID101
column values:  <activity>  <activity>  ...  <activity>

Page 38: Cassandra at Instagram (August 2013)

Bound the Size

row key:        user_id
column names:   TimeUUID1   TimeUUID2   ...  TimeUUID101
column values:  <activity>  <activity>  ...  <activity>

get(<user_id>)
delete(<user_id>, columns=[<TimeUUID101>, <TimeUUID102>, <TimeUUID103>, ...])

Page 39: Cassandra at Instagram (August 2013)

The great destroyer of systems shows up. Tombstones abound.

Page 40: Cassandra at Instagram (August 2013)

Column Delete

row key:           user_id
column names:      TimeUUID1   TimeUUID2   ...  TimeUUID2
column values:     <activity>  <activity>  ...  [tombstone]
write timestamps:  timestamp1  timestamp2  ...  timestamp2

Cassandra internally stores deletes as tombstones, which mark data for a given column as deleted at-or-before a timestamp.

The tombstone timestamp is >= the live column timestamp, so the column will be hidden from queries and compacted away.
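The at-or-before rule can be written down as a small reconciliation function. This is a simplified sketch and not Cassandra's actual code: `Cell`, `reconcile`, and `read_column` are hypothetical names, and the model covers a single column name only. The newest timestamp wins, and on a tie the tombstone wins, which is what ">=" on the slide implies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    timestamp: int
    value: Optional[bytes]  # None marks a tombstone

    @property
    def is_tombstone(self):
        return self.value is None


def reconcile(a, b):
    """Pick the winner for one column: newest timestamp, tombstone on a tie."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.is_tombstone else b


def read_column(versions):
    """Return the visible value for a column, or None if it is deleted."""
    winner = versions[0]
    for cell in versions[1:]:
        winner = reconcile(winner, cell)
    return None if winner.is_tombstone else winner.value


# A delete issued with the same timestamp as the write hides the write.
hidden = read_column([Cell(100, b"like"), Cell(100, None)])
# A write with a newer timestamp than the tombstone is visible again.
visible = read_column([Cell(100, None), Cell(101, b"like again")])
```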

Page 41: Cassandra at Instagram (August 2013)

TimeUUID = timestamp

row key:           user_id
column names:      TimeUUID1   TimeUUID2   ...  TimeUUID101
column values:     <activity>  <activity>  ...  <activity>
write timestamps:  timestamp1  timestamp2  ...  timestamp101

To avoid tombstones, exploit that the timestamp embedded in our TimeUUID (ordering) is the same as the column timestamp.

Page 42: Cassandra at Instagram (August 2013)

Row Delete

row key:           user_id
column names:      TimeUUID1   TimeUUID2   ...  TimeUUID101
column values:     <activity>  <activity>  ...  <activity>
write timestamps:  timestamp1  timestamp2  ...  timestamp101

delete(<user_id>, timestamp=<timestamp101>)

Cassandra can also store row tombstones, which delete all data from a row at-or-before the timestamp provided.
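The trim-by-row-tombstone idea can be modelled in a few lines. This is a sketch under stated assumptions, not the production code: `timeuuid_at`, `Row`, and `trim` are hypothetical, the row lives in a plain dict, and the TimeUUID's embedded 100 ns tick count is used directly as the write timestamp (any conversion to the units a real client would send is skipped).

```python
import uuid


def timeuuid_at(ts, seq=0):
    """Build a version-1 UUID whose embedded timestamp is `ts` (test helper)."""
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | (1 << 12)
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             0x80, seq & 0xFF, 0x123456789ABC))


class Row:
    def __init__(self):
        self.columns = {}          # TimeUUID -> (write timestamp, value)
        self.row_tombstone = None  # timestamp of the newest row delete

    def insert(self, name, value):
        # The write timestamp is taken from the TimeUUID itself.
        self.columns[name] = (name.time, value)

    def delete_row(self, timestamp):
        if self.row_tombstone is None or timestamp > self.row_tombstone:
            self.row_tombstone = timestamp

    def live(self):
        # Columns written at or before the row tombstone are shadowed.
        cutoff = -1 if self.row_tombstone is None else self.row_tombstone
        names = [n for n, (ts, _) in self.columns.items() if ts > cutoff]
        return sorted(names, key=lambda n: n.time, reverse=True)  # newest first


def trim(row, keep=100):
    names = row.live()
    if len(names) > keep:
        # One row tombstone at the timestamp of the first column past the limit.
        row.delete_row(names[keep].time)


row = Row()
for ts in range(1000, 1150):
    row.insert(timeuuid_at(ts), b"activity")
trim(row, keep=100)
```

A single row-level delete replaces the list of per-column deletes from the earlier "Bound the Size" slide, so the trim leaves one tombstone instead of one per removed column.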

Page 43: Cassandra at Instagram (August 2013)

Optimizes Reads

SSTables, by max_ts: 100, 200, 300, 400, 500, 600, 700, 800

Contains row tombstone with timestamp 350

Safely ignored using in-memory metadata
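The read optimization on this slide amounts to a filter on per-SSTable metadata. The sketch below is illustrative only: `SSTableMeta` and `sstables_to_read` are hypothetical names, and the real read path does considerably more (it has to find the tombstone in the first place, consult bloom filters, and so on). It assumes the row tombstone's timestamp is already known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTableMeta:
    name: str
    max_ts: int  # newest write timestamp held in this SSTable


def sstables_to_read(sstables, row_tombstone_ts):
    """Keep only SSTables that may hold data newer than the row tombstone."""
    # Anything with max_ts at or before the tombstone holds only shadowed data.
    return [s for s in sstables if s.max_ts > row_tombstone_ts]


tables = [SSTableMeta(f"sstable-{ts}", ts) for ts in range(100, 900, 100)]
needed = sstables_to_read(tables, row_tombstone_ts=350)
skipped = [s for s in tables if s not in needed]
```

With the values from the slide, the SSTables with max_ts of 100, 200, and 300 never need to be touched for this row.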

Page 44: Cassandra at Instagram (August 2013)

~10% of actions are undos.

Page 45: Cassandra at Instagram (August 2013)

Undo Support

row key:        user_id
column names:   TimeUUID1   TimeUUID2   ...  TimeUUID101
column values:  <activity>  <activity>  ...  <activity>

get(<user_id>)
delete(<user_id>, columns=[<TimeUUID2>])

Page 46: Cassandra at Instagram (August 2013)
Page 47: Cassandra at Instagram (August 2013)

get(<user_id>)
delete(<user_id>, columns=[<TimeUUID2>])

Simple Race Condition

The state of the row may have changed between these two operations.

💩

Page 48: Cassandra at Instagram (August 2013)

Diverging Replicas

Like

Writer: insert B (OK)

Replica 1: [A, B]
Replica 2: [A] (FAIL)
Replica 3: [A, B]

Page 49: Cassandra at Instagram (August 2013)

Diverging Replicas

Undo Like

Writer: read [A]

Replica 1: [A, B]
Replica 2: [A]
Replica 3: [A, B]

The replica is missing B, so if a read is required to find B before deleting it, it's going to fail.

Page 50: Cassandra at Instagram (August 2013)

SuperColumn = Old/Busted
AntiColumn = New/Hotness

row key:        user_id
column names:   (0, <TimeUUID>)  (1, <TimeUUID>)  (1, <TimeUUID>)
column values:  anti-column      activity         activity

"Anti-Column"

Borrowing from the idea of Cassandra's by-name tombstones, an anti-column contains an MD5 hash of the activity data "value" it is marking as deleted.

Page 51: Cassandra at Instagram (August 2013)

Composite Column

row key:        user_id
column names:   (0, <TimeUUID>)  (1, <TimeUUID>)  (1, <TimeUUID>)
column values:  anti-column      activity         activity

First component is zero for anti-columns, splitting the row into two independent lists, and ensuring the anti-columns always appear at the head.
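One way to picture how a reader resolves such a row is the sketch below. It is an assumption-laden illustration rather than Instagram's code: `anti_column`, `activity_column`, and `read_inbox` are hypothetical helpers, plain integers stand in for TimeUUIDs, and the row is a Python list of (composite name, value) pairs.

```python
import hashlib

ANTI, ACTIVITY = 0, 1  # first component of the composite column name


def anti_column(ts, activity_value):
    # An anti-column's value is the MD5 of the activity value it cancels.
    return ((ANTI, ts), hashlib.md5(activity_value).digest())


def activity_column(ts, activity_value):
    return ((ACTIVITY, ts), activity_value)


def read_inbox(columns):
    """Resolve a row: drop activities whose MD5 matches an anti-column."""
    cancelled = set()
    live = []
    # Sorting by composite name puts every anti-column (0, ...) first,
    # so all cancellations are known before any activity is examined.
    for (kind, _ts), value in sorted(columns):
        if kind == ANTI:
            cancelled.add(value)
        elif hashlib.md5(value).digest() not in cancelled:
            live.append(value)
    return live


row = [
    activity_column(1, b"like:alice:photo1"),
    activity_column(2, b"like:bob:photo1"),
    anti_column(3, b"like:alice:photo1"),
]
inbox = read_inbox(row)
```

Because the anti-columns sort to the head of the row, a single pass over one slice is enough to filter out undone activities.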

Page 52: Cassandra at Instagram (August 2013)

Diverging Replicas: Solved

Like

Writer: insert B (OK)

Replica 1: [A, B]
Replica 2: [A] (FAIL)
Replica 3: [A, B]

Page 53: Cassandra at Instagram (August 2013)

Diverging Replicas: Solved

Undo Like

Writer: insert C (OK)

Replica 1: [A, B, C]
Replica 2: [A, C]
Replica 3: [A, B, C]

Instead of read-before-write, an anti-column is inserted to mark the activity as deleted.
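The scenario on the last two slides can be replayed with a toy replica model. This is only a sketch: `Replica` is a hypothetical class holding two sets, and the final step, where the missing activity B later reaches the lagging replica, is an assumption about how the replicas eventually catch up rather than something shown on the slides.

```python
import hashlib


def md5(value):
    return hashlib.md5(value).digest()


class Replica:
    def __init__(self):
        self.activities = set()
        self.anti = set()  # MD5 hashes of cancelled activity values

    def insert_activity(self, value):
        self.activities.add(value)

    def insert_anti(self, value):
        # Blind write: no need to know whether the activity is present.
        self.anti.add(md5(value))

    def visible(self):
        return {a for a in self.activities if md5(a) not in self.anti}


r1, r2, r3 = Replica(), Replica(), Replica()
for r in (r1, r2, r3):
    r.insert_activity(b"A")

# Like: the insert of B fails on r2.
r1.insert_activity(b"B")
r3.insert_activity(b"B")

# Undo Like: the anti-column (C on the slide) is written everywhere,
# including the replica that never saw B.
for r in (r1, r2, r3):
    r.insert_anti(b"B")

# Assumed later catch-up: B finally arrives on r2 and is already cancelled.
r2.insert_activity(b"B")

views = [r.visible() for r in (r1, r2, r3)]
```

All three replicas end up showing only A, regardless of the order in which B and its anti-column arrived.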

Page 54: Cassandra at Instagram (August 2013)

TAKEAWAY: Read-before-write is a smell. Try to model data as a log of user "intent" rather than manhandling the data into place.

Page 55: Cassandra at Instagram (August 2013)

• Keep 30% "buffer" for trims.

• Undo without read. (thumbsup)

• Large lists suck for this. (thumbsdown)

• CASSANDRA-5527

Page 56: Cassandra at Instagram (August 2013)

Built in two days. Experience paid off.

Page 57: Cassandra at Instagram (August 2013)

Reusability is key to rapid rollout. Great documentation eases concerns.

Page 58: Cassandra at Instagram (August 2013)

Initial Setup

• C* 1.2.3

• vnodes, LeveledCompactionStrategy

• 12 hi1.4xlarge (8-core, 60GB, 2T SSD)

• 3 AZs, RF=3, CL W=TWO R=ONE

• 8G heap, 800M NewSize

Page 59: Cassandra at Instagram (August 2013)

Rollout

1. Dial up Double Writes

2. Test with "Shadow" Reads

3. Dial up "Real" Reads
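The three rollout steps can be sketched as percentage dials in front of two backends. Everything named here is hypothetical (`DictStore`, `MigratingInbox`, and the three `*_pct` attributes); the slides name only the steps, not how the dials were implemented.

```python
import random


class DictStore:
    """Hypothetical stand-in for either backend (old or new)."""

    def __init__(self):
        self.data = {}

    def append(self, user_id, activity):
        self.data.setdefault(user_id, []).append(activity)

    def fetch(self, user_id):
        return list(self.data.get(user_id, []))


class MigratingInbox:
    def __init__(self, old, new, rng=random.random):
        self.old, self.new, self.rng = old, new, rng
        self.double_write_pct = 0.0  # step 1
        self.shadow_read_pct = 0.0   # step 2
        self.real_read_pct = 0.0     # step 3
        self.mismatches = 0

    def _roll(self, pct):
        return self.rng() < pct

    def write(self, user_id, activity):
        self.old.append(user_id, activity)
        if self._roll(self.double_write_pct):
            self.new.append(user_id, activity)

    def read(self, user_id):
        if self._roll(self.real_read_pct):
            return self.new.fetch(user_id)
        result = self.old.fetch(user_id)
        if self._roll(self.shadow_read_pct):
            # Shadow read: query the new store, compare, discard the result.
            if self.new.fetch(user_id) != result:
                self.mismatches += 1
        return result


inbox = MigratingInbox(DictStore(), DictStore())
inbox.write("u1", "a")        # before double writes: old store only
inbox.double_write_pct = 1.0
inbox.write("u1", "b")        # both stores
inbox.shadow_read_pct = 1.0
first = inbox.read("u1")      # served from old, compared against new
inbox.real_read_pct = 1.0
second = inbox.read("u1")     # served from new
```

In this toy run the shadow read reports a mismatch because activity "a" was written before double writes were switched on, which is the kind of gap shadow reads are meant to surface before real reads are dialed up.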

Page 60: Cassandra at Instagram (August 2013)

commit 1c3d99a9e337f9383b093009dba074b8ade20768
Author: Rick Branson
Date: Mon May 6 14:58:54 2013 -0700

Bump C* inbox heap size 8G -> 10G, seeing heap pressure

Page 61: Cassandra at Instagram (August 2013)

Bootstrapping sucked because compacting 10,000 SSTables takes forever.

sstable_size_in_mb: 5 => 25

Page 62: Cassandra at Instagram (August 2013)

Monitor Consistency

$ nodetool netstats
Mode: NORMAL
Not sending any streams.
Not receiving any streams.
Read Repair Statistics:
Attempted: 3192520
Mismatch (Blocking): 0
Mismatch (Background): 11584
Pool Name    Active   Pending   Completed
Commands     n/a      0         1837765727
Responses    n/a      1         1750784545

UPDATE COLUMN FAMILY InboxActivitiesByUserID
WITH read_repair_chance = 0.01;

99.63% consistent
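The consistency figure appears to come straight from the read repair counters in the `nodetool netstats` output. A quick check of the arithmetic, assuming the percentage is one minus the share of attempted read repairs that found a mismatch:

```python
# Counters copied from the nodetool netstats output on this slide.
attempted = 3192520
mismatch_blocking = 0
mismatch_background = 11584

mismatch_ratio = (mismatch_blocking + mismatch_background) / attempted
consistent_pct = 100.0 * (1.0 - mismatch_ratio)

print(f"{consistent_pct:.3f}% of sampled reads found the replicas in agreement")
```

The result lands between 99.63% and 99.64%, in line with the figure on the slide. With read_repair_chance set to 0.01, only about 1% of reads are sampled this way, so the number is an estimate rather than an exact count.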

Page 63: Cassandra at Instagram (August 2013)

SSTable Size (again)

Saw lots of GC pressure related to buffer garbage. Eventually they landed on a new default in 1.2.9+ (160MB).

sstable_size_in_mb: 25 => 128

Page 64: Cassandra at Instagram (August 2013)
Page 65: Cassandra at Instagram (August 2013)

Fetch & Deserialize Time (measured from app)

Mean vs P90 (ms), trough-to-peak

Page 66: Cassandra at Instagram (August 2013)

Space used (live): 180114509324
Space used (total): 180444164726
Memtable Columns Count: 2315159
Memtable Data Size: 112197632
Memtable Switch Count: 1312
Read Count: 316192445
Read Latency: 1.982 ms.
Write Count: 1581610760
Write Latency: 0.031 ms.
Pending Tasks: 0
Bloom Filter False Positives: 481617
Bloom Filter False Ratio: 0.08558
Bloom Filter Space Used: 54723960
Compacted row minimum size: 25
Compacted row maximum size: 545791
Compacted row mean size: 3020

Page 67: Cassandra at Instagram (August 2013)

Peak Stats

20K 200-column slice reads/sec

30K 1-column mutations/sec

30% CPU utilization

48K clients

Page 68: Cassandra at Instagram (August 2013)

Exciting Future Things

• Python Native Protocol Driver

• Read CPU Consumption Work

• Mass CQL Adoption

• Triggers

• CAS (for limited use cases)

Page 69: Cassandra at Instagram (August 2013)

Next 6 Months...

• Node repair visibility & monitoring

• Objects & Associations Storage API on C* + memcache

• Migrate more from Redis

• New major use case

• Cassandra 2.0?

Page 70: Cassandra at Instagram (August 2013)

We're hiring!