Voldemort : Prototype to Production

A Journey to 1M Operations/Sec

Description

Discusses how Voldemort 'grew up' at LinkedIn.

Transcript of Voldemort : Prototype to Production

Page 1: Voldemort : Prototype to Production


Voldemort : Prototype to Production A Journey to 1M Operations/Sec

Page 2: Voldemort : Prototype to Production

Voldemort Intro

●  Amazon Dynamo style NoSQL k-v store (client sketch below)
○  get(k)
○  put(k,v)
○  getall(k1,k2,...)
○  delete(k)

●  Tunable Consistency
●  Highly Available
●  Automatic Partitioning
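To make the API bullets above concrete, here is a minimal sketch of the native Java client in use. It is not taken from the talk; the bootstrap URL and store name are placeholders for a real cluster and store.

```java
import java.util.Arrays;
import java.util.Map;

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.versioning.Versioned;

public class VoldemortClientSketch {
    public static void main(String[] args) {
        // Bootstrap URL and store name are placeholders.
        SocketStoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("test");

        client.put("k1", "v1");                           // put(k,v)
        Versioned<String> v1 = client.get("k1");          // get(k) -> value plus vector clock
        Map<String, Versioned<String>> some =
                client.getAll(Arrays.asList("k1", "k2")); // getall(k1,k2,...)
        client.delete("k1");                              // delete(k)

        System.out.println(v1.getValue() + " / " + some.size());
        factory.close();
    }
}
```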

Page 3: Voldemort : Prototype to Production

Voldemort Intro

●  Pluggable Storage
○  BDB-JE - Primary OLTP store
○  Read Only - Reliable serving layer for Hadoop datasets
○  MySQL - Good ol' MySQL without native replication
○  InMemory - Backed by a Java ConcurrentHashMap

●  Clients
○  Native Java Client
○  REST Coordinator Service

●  Open source
●  More at project-voldemort.com

Page 4: Voldemort : Prototype to Production

Agenda

○  High Level Overview
○  Usage At LinkedIn
○  Storage Layer
○  Cluster Expansion

Page 5: Voldemort : Prototype to Production

Architecture

[Architecture diagram: a Native Java Client issues get()/put()/getall() directly against the cluster, while the Coordinator Service hosts several client-service instances for REST access. Servers 1-4 each run a client service on top of a bdb storage engine, and keys map to partitions and server preference lists, e.g. "k1" -> p1 -> s1,s2 and "k2" -> p2 -> s3,s4.]

Page 6: Voldemort : Prototype to Production

Consistent Hashing

▪  Consistent Hashing idea
▪  Divide the key space into partitions
–  Partitions: A, B, C, ..., H
–  hash(key) mod # partitions = pkey (toy sketch below)
▪  Randomly map partitions to servers
▪  Locate servers from keys
–  K1 => A => S1
–  K2 => C => S3
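A toy sketch of the hash-then-mod step above. Voldemort's real routing hashes serialized keys with its own hash function, so the FNV-style mixing here is purely illustrative.

```java
public class PartitionLookupSketch {

    /** Toy version of "hash(key) mod # partitions = pkey". */
    public static int partitionForKey(byte[] key, int numPartitions) {
        int h = 0x811c9dc5;          // FNV-1a style mixing, for illustration only
        for (byte b : key) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        // keep the result non-negative before mapping it to a partition id
        return ((h % numPartitions) + numPartitions) % numPartitions;
    }
}
```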

[Ring diagram: partitions A-H laid out on the hash ring and assigned to servers S1-S4; K1 hashes into partition A owned by S1, K2 into partition C owned by S3.]


Page 7: Voldemort : Prototype to Production

Consistent Hashing with Replication

▪  Replication factor (RF)
–  How many replicas to have
▪  Replica selection
–  Find the primary partition
–  Walk the ring to create the preference list (sketch below)
▪  Find RF-1 additional servers
▪  Skip servers already in the list
▪  Examples: RF = 3
–  K1: S1, S2, S3
–  K2: S3, S1, S4
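A toy ring walk matching the steps above: start at the key's primary partition, walk clockwise, and collect RF distinct servers, skipping servers already in the list. This is a sketch of the idea only, not Voldemort's routing code; partitionToServer is a hypothetical array mapping ring position to the owning server id.

```java
import java.util.ArrayList;
import java.util.List;

public class PreferenceListSketch {

    /** Collect RF distinct servers by walking the ring from the primary partition. */
    public static List<Integer> preferenceList(int primaryPartition,
                                               int[] partitionToServer,
                                               int replicationFactor) {
        List<Integer> servers = new ArrayList<>();
        int numPartitions = partitionToServer.length;
        for (int i = 0; i < numPartitions && servers.size() < replicationFactor; i++) {
            int server = partitionToServer[(primaryPartition + i) % numPartitions];
            if (!servers.contains(server)) {   // skip servers already in the list
                servers.add(server);
            }
        }
        return servers;
    }
}
```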

[Ring diagram: the same ring of partitions A-H over servers S1-S4, now annotated with full preference lists: K1 -> [S1, S2, S3], K2 -> [S3, S1, S4].]


Page 8: Voldemort : Prototype to Production

Zone Aware Replication

▪  Servers divided into zones
▪  Zone = Data Center
▪  Per-zone replication factor (sketch below)
▪  Local zone vs. remote zones
–  Local zone (LZ) is where the client is
▪  Two-zone example:
–  LZ = 1
–  Zone 1: S1, S3; RF = 2
–  Zone 2: S2, S4; RF = 1
–  Preference lists:
▪  K1: Z1: S1, S3; Z2: S2
▪  K2: Z1: S3, S1; Z2: S4
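Extending the previous sketch to zone-aware replication: the same ring walk, but servers are bucketed by zone and each zone collects only its own replication factor. Again a hedged sketch with hypothetical inputs (serverToZone, a per-zone RF map), not the actual Voldemort routing strategy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ZonePreferenceListSketch {

    /** Per-zone preference lists: zone id -> ordered list of servers for the key. */
    public static Map<Integer, List<Integer>> zonePreferenceList(int primaryPartition,
                                                                 int[] partitionToServer,
                                                                 int[] serverToZone,
                                                                 Map<Integer, Integer> zoneReplicationFactor) {
        Map<Integer, List<Integer>> perZone = new LinkedHashMap<>();
        for (int zone : zoneReplicationFactor.keySet()) {
            perZone.put(zone, new ArrayList<Integer>());
        }
        int numPartitions = partitionToServer.length;
        for (int i = 0; i < numPartitions; i++) {
            int server = partitionToServer[(primaryPartition + i) % numPartitions];
            int zone = serverToZone[server];
            List<Integer> servers = perZone.get(zone);
            // fill each zone only up to that zone's own replication factor
            if (servers != null
                    && servers.size() < zoneReplicationFactor.get(zone)
                    && !servers.contains(server)) {
                servers.add(server);
            }
        }
        return perZone;
    }
}
```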

[Ring diagram: the same ring with servers tagged by zone; zone-aware preference lists: K1 -> [ Z1 [S1, S3], Z2 [S2] ], K2 -> [ Z1 [S3, S1], Z2 [S4] ].]


Page 9: Voldemort : Prototype to Production

Voldemort @ LinkedIn

385 Stores

238 R-O Stores

147 R-W Stores

3 Zones

14 Clusters

~200 TB

~750 Servers

Page 10: Voldemort : Prototype to Production

Voldemort @ LinkedIn

~1M Storage ops/s

22% R-O

78% R-W

Page 11: Voldemort : Prototype to Production

Voldemort @ LinkedIn

●  17% of all LinkedIn services embed a direct client
●  Fast (95th percentile < 20ms) for almost all clients

Page 12: Voldemort : Prototype to Production

Voldemort @ LinkedIn

●  Front Facing
○  Search (Recruiter + Site)
○  People You May Know
○  inShare
○  Media thumbnails
○  Notifications
○  Endorsements
○  Skills
○  Frequency capping Ads
○  Custom Segments
○  Who Viewed Your Profile
○  People You Want to Hire

●  Internal Services
○  Email cache
○  Email delivery stack
○  Recommendation Services
○  Personalization Services
○  Mobile Auth

●  Not exhaustive!

Page 13: Voldemort : Prototype to Production

Growth Since 2011

Page 14: Voldemort : Prototype to Production

Storage Layer

●  Berkeley DB Java Edition
○  Embedded
○  100% Java
○  ACID compliant
○  Log structured

●  Voldemort uses
○  Vanilla k-v APIs
○  Cursors for scans (sketch below)
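The "vanilla k-v APIs plus cursors for scans" usage maps onto the standard BDB-JE API roughly as in this minimal sketch; the environment directory and database name are placeholders.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class BdbJeSketch {
    public static void main(String[] args) {
        // The environment directory must already exist; path and names are placeholders.
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(true);
        Environment env = new Environment(new File("/tmp/bdb-env"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(true);
        Database db = env.openDatabase(null, "test-store", dbConfig);

        // Vanilla k-v API
        DatabaseEntry key = new DatabaseEntry("k1".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry("v1".getBytes(StandardCharsets.UTF_8));
        db.put(null, key, value);
        db.get(null, key, value, LockMode.DEFAULT);

        // Cursor-based scan, the shape used by retention / slop-pusher style jobs
        Cursor cursor = db.openCursor(null, null);
        DatabaseEntry k = new DatabaseEntry();
        DatabaseEntry v = new DatabaseEntry();
        while (cursor.getNext(k, v, LockMode.READ_UNCOMMITTED) == OperationStatus.SUCCESS) {
            // process one entry
        }
        cursor.close();

        db.close();
        env.close();
    }
}
```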

Page 15: Voldemort : Prototype to Production

Storage Layer Rewrite

Where We Wanted To Be

●  Predictable online performance
●  Scan jobs
○  Non-intrusive, fast
●  Elastic
○  Recover failed nodes in minutes
○  Add hardware overnight

Page 16: Voldemort : Prototype to Production

Storage Layer Rewrite

Where We Really Were

1.  GC Issues
a.  Unpredictable GC churn
b.  Scan jobs cause full GCs
2.  Slow Scans (even on SSDs)
a.  Daily retention job / slop pusher
b.  Not partition aware
3.  Memory Management
a.  Zero control over a single store's share
4.  Managing Multiple Versions
a.  Lock contention
b.  Additional bdb-delete() cost during put()
5.  Weaker Durability on Crash
a.  Dirty writes sitting in the heap

Page 17: Voldemort : Prototype to Production

Storage Layer Rewrite

BDB Cache on JVM

[Diagram: the BDB-JE B+tree with internal index nodes and leaf nodes; the BDB cache holding them lives on the JVM heap and spills to disk, and server threads share that heap with the BDB checkpointer and BDB cleaner.]

Page 18: Voldemort : Prototype to Production

Storage Layer Rewrite

Multi-Tenant Example

[Diagram: a single JVM heap holds the BDB cache containing the B+trees of stores A, B, C, and D; server threads plus a cleaner and a checkpointer per store all compete for the same heap.]

Page 19: Voldemort : Prototype to Production

Storage Layer Rewrite

Road To Recovery

●  Move data off heap
○  Only the index sits on heap
●  Cache control to reduce scan impact
●  Partition-aware storage
○  Range scans to the rescue
●  Dynamic cache partitioning
○  Control how much heap goes to a single store
●  SSD-aware optimizations
○  Checkpointing
○  Cache policy
●  Manage versions directly
○  Treat BDB as a plain k-v store

Page 20: Voldemort : Prototype to Production

Storage Layer Rewrite

Moving Data Off Heap

●  Much improved GC
○  Less memory churn
○  Fewer promotions
●  SSD-aware, hit-the-disk design
●  Stronger durability on crash
○  No more dirty writes sitting in a runaway heap

[Diagram: only the index lives on the JVM heap, leaf data lives on the SSD / page cache; a put(k,v) (1) goes through the heap-resident index and (2) writes a new leaf to the SSD / page cache, superseding the old leaf (config sketch below).]
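One BDB-JE knob that matches the "only the index sits on heap" idea is the per-database cache mode. The talk does not spell out the exact settings used, so treat this as an assumption-laden sketch built on CacheMode.EVICT_LN.

```java
import com.sleepycat.je.CacheMode;
import com.sleepycat.je.DatabaseConfig;

public class OffHeapLeafConfigSketch {

    /**
     * Sketch: evict leaf nodes (the record data) from the JVM-heap BDB cache as
     * soon as each operation completes, so only internal index nodes stay hot on
     * heap and data reads fall through to the SSD / OS page cache.
     */
    public static DatabaseConfig indexOnlyCacheConfig() {
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(true);
        dbConfig.setCacheMode(CacheMode.EVICT_LN);
        return dbConfig;
    }
}
```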

Page 21: Voldemort : Prototype to Production

Storage Layer Rewrite

Reducing Scan Impact

●  Scans cause massive cache pollution
○  Throttling is not an option
●  Exercise cursor-level control (sketch below)
●  Sustained scan rates of up to 30-40K ops/sec
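"Cursor level control" can be expressed in BDB-JE by giving the scan cursor its own cache mode, so a full-store scan does not displace the hot data that online traffic needs. The exact mechanism used in the rewrite is not detailed in the talk; this is a plausible sketch.

```java
import com.sleepycat.je.CacheMode;
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class ScanSketch {

    /** Full scan whose cursor evicts what it reads instead of polluting the cache. */
    public static long scanWithoutPollutingCache(Database db) {
        long count = 0;
        Cursor cursor = db.openCursor(null, null);
        try {
            cursor.setCacheMode(CacheMode.EVICT_BIN);  // scan reads don't displace hot data
            DatabaseEntry key = new DatabaseEntry();
            DatabaseEntry value = new DatabaseEntry();
            while (cursor.getNext(key, value, LockMode.READ_UNCOMMITTED) == OperationStatus.SUCCESS) {
                count++;  // process the entry (retention check, slop push, ...)
            }
        } finally {
            cursor.close();
        }
        return count;
    }
}
```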

Page 22: Voldemort : Prototype to Production

Storage Layer Rewrite

Managing Versions Directly

●  No more extra delete()
●  No more separate duplicate tree
○  Much improved locking performance
●  More compact storage

[Diagram: before, a key's versions V1 and V2 hung off BDB's duplicate subtree (BIN -> DIN -> DBIN); after, they are packed together as a single value (V1,V2) under one BIN entry (merge sketch below).]
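A sketch of managing versions directly: keep all concurrent versions of a key packed into one value and prune obsoleted ones on put, instead of relying on BDB's duplicate tree. Versioned and Occurred are Voldemort's versioning types; the merge logic shown is illustrative, not the exact store code.

```java
import java.util.ArrayList;
import java.util.List;

import voldemort.versioning.Occurred;
import voldemort.versioning.Versioned;

public class VersionMergeSketch {

    /**
     * Merge a new write into the list of versions stored under one key: drop
     * versions the new vector clock supersedes, keep concurrent ones, append the
     * new version. The whole list is written back under a single BDB key, so
     * there is no duplicate tree and no extra delete() during put().
     */
    public static List<Versioned<byte[]>> mergeVersions(List<Versioned<byte[]>> existing,
                                                        Versioned<byte[]> newVersion) {
        List<Versioned<byte[]>> merged = new ArrayList<Versioned<byte[]>>();
        for (Versioned<byte[]> candidate : existing) {
            if (newVersion.getVersion().compare(candidate.getVersion()) != Occurred.AFTER) {
                merged.add(candidate);  // not obsoleted by the new write
            }
        }
        merged.add(newVersion);
        return merged;
    }
}
```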

Page 23: Voldemort : Prototype to Production

Storage Layer Rewrite

SSD Aware Optimizations

●  Checkpoints on SSD
○  The age-old recovery time vs. performance tradeoff
●  Predictability
○  Level based policy
●  Streaming Writes
○  Turn off the checkpointer
●  BDB5 Support
○  Much better compaction
○  Much less index metadata

[Chart: checkpointer interval vs. recovery time (parameter sketch below)]
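The checkpointing bullets above map onto BDB-JE environment parameters; the sketch below names the relevant knobs, but the actual values LinkedIn tuned are not given in the talk, so the numbers are placeholders.

```java
import com.sleepycat.je.EnvironmentConfig;

public class CheckpointTuningSketch {

    /** Assumption-laden sketch of the recovery-time vs. performance knobs in BDB-JE. */
    public static EnvironmentConfig ssdTunedConfig(boolean bulkStreamingLoad) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(true);
        // Checkpoint more frequently: the extra I/O is absorbed by the SSD, and
        // crash-recovery time (log replay since the last checkpoint) shrinks.
        envConfig.setConfigParam(EnvironmentConfig.CHECKPOINTER_BYTES_INTERVAL,
                                 String.valueOf(64L * 1024 * 1024));   // placeholder value
        if (bulkStreamingLoad) {
            // "Turn off checkpointer" while streaming writes in; checkpoint manually after.
            envConfig.setConfigParam(EnvironmentConfig.ENV_RUN_CHECKPOINTER, "false");
        }
        return envConfig;
    }
}
```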

Page 24: Voldemort : Prototype to Production

Storage Layer Rewrite

Partition Aware Storage

[Diagram: each key is stored as (partition-id, "key") instead of just "key". Before: keys k1..k16 are scattered across B+tree subtrees in hash order. After: all of P0's keys and all of P1's keys sit in their own contiguous subtrees under the root, so a whole partition can be read with one range scan (key-layout sketch below).]
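A minimal sketch of the partition-aware key layout: prefix each key with its partition id so every partition occupies a contiguous key range in the B+tree. The 2-byte prefix width is an assumption for illustration.

```java
import java.nio.ByteBuffer;

public class PartitionAwareKeySketch {

    /** Storage key = 2-byte partition id prefix + original key bytes. */
    public static byte[] storageKey(short partitionId, byte[] key) {
        return ByteBuffer.allocate(2 + key.length)
                         .putShort(partitionId)
                         .put(key)
                         .array();
    }

    /** All keys of a partition start at this prefix, enabling a simple range scan. */
    public static byte[] partitionScanStart(short partitionId) {
        return ByteBuffer.allocate(2).putShort(partitionId).array();
    }
}
```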

Page 25: Voldemort : Prototype to Production

Storage Layer Rewrite

Speed Up

[Chart: percentage of partitions scanned]

●  Restore
○  1 day -> 1 hour
●  Rebalancing
○  ~1 week -> hours

Page 26: Voldemort : Prototype to Production

Storage Layer Rewrite

Dynamic Cache Partitioning

●  Control each store's share of the heap (sketch below)
○  Dynamically add/reduce memory
○  Currently used to isolate a bursty store
●  Improve the capacity model
○  More production validation?
○  Auto-tuning mechanisms?
●  Isolate at the JVM level?
○  Rethink the deployment model
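A hedged sketch of what "dynamically add/reduce memory" can look like with BDB-JE, assuming one environment per store (that layout is an assumption for illustration): the cache size is part of the mutable config and can be changed on a live environment.

```java
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentMutableConfig;

public class CachePartitioningSketch {

    /** Resize one store's BDB cache share at runtime, without a restart. */
    public static void resizeCache(Environment storeEnv, long newCacheBytes) {
        EnvironmentMutableConfig mutableConfig = storeEnv.getMutableConfig();
        mutableConfig.setCacheSize(newCacheBytes);
        storeEnv.setMutableConfig(mutableConfig);
    }
}
```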

Page 27: Voldemort : Prototype to Production

Storage Layer Rewrite

Wins In Production

[Charts: GC pauses reined in; storage latency way down]

Page 28: Voldemort : Prototype to Production

Cluster Expansion Rewrite

●  Basis of the scale-out philosophy
●  Cluster Expansion
○  Add servers to an existing cluster
●  Zero-downtime operation
●  Transparent to the client
○  Functionality
○  Mostly performance too

Page 29: Voldemort : Prototype to Production

Cluster Expansion Rewrite

Types Of Clusters

●  Zoned Read-Write
○  Zone = Data Center
●  Non-Zoned
○  Read-Write
○  Read-Only (Hadoop BuildAndPush)

Page 30: Voldemort : Prototype to Production

Expansion Example

[Diagram: before and after views of two zones. Before: Zone 1 and Zone 2 each run three servers holding primary partitions P1-P4 and their secondary replicas S1-S4. After: a new Server 4 is added in each zone and some partitions/replicas are reassigned onto it.]

Page 31: Voldemort : Prototype to Production

Expansion In Action 1: Change Cluster Topology

Cluster Expansion Rewrite

[Diagram: the Rebalance Controller (step 1) pushes the new cluster topology, which now includes the new server alongside Server 1 and Server 2 in each of Zone 1 and Zone 2.]

Page 32: Voldemort : Prototype to Production

Expansion In Action 2: Setup Proxy Bridges

Cluster Expansion Rewrite

[Diagram: after (1) the cluster topology change, (2) proxy bridges are set up from the new servers to the existing servers in each zone.]

Page 33: Voldemort : Prototype to Production

Expansion In Action 3: Client Picks Up New Topology

Cluster Expansion Rewrite

[Diagram: (1) the topology is changed, (2) requests landing on a new server are proxied to the old owner based on the old topology, (3) the client picks up the new topology.]

Page 34: Voldemort : Prototype to Production

Expansion In Action 4: Move Partitions

Cluster Expansion Rewrite

[Diagram: (1) the topology is changed, (2) proxy requests are served based on the old topology, (3) the client picks up the change, (4) partitions are moved, as local moves within a zone and cross-DC moves between Zone 1 and Zone 2.]

Page 35: Voldemort : Prototype to Production

Expansion In Action

Cluster Expansion Rewrite

[Diagram: the full sequence: (1) change the cluster topology, (2) set up proxy bridges and serve proxied requests based on the old topology, (3) the client picks up the new topology, (4) move partitions, locally and across DCs.]

Page 36: Voldemort : Prototype to Production

Problems

Cluster Expansion Rewrite

●  One Ring Spanning Data Centers
○  Cross-datacenter data moves/proxies
●  Not Safely Abortable
○  Additional cleanup/consolidation
●  Cannot Add New Data Centers
●  Opaque Planner Code
○  No special treatment of zones
●  Lack of Tools
○  Skew analysis
○  Repartitioning/balancing utility

Page 37: Voldemort : Prototype to Production

Redesign: Zone N-ary Philosophy

●  Given a partition P whose mapping has changed, compare the old and the new "nth replica of P in zone Z": the server holding the old nth replica is the donor, the server due to hold the new nth replica is the stealer, and the data move and proxy bridge run between them within that zone (toy sketch below).

[Diagram: Zone 1 and Zone 2, each with Servers 1-3 plus a newly added Server 4; for partition P1, the old nth replica in a zone donates its data to the new nth replica in the same zone.]
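A toy sketch of the zone n-ary pairing described above: given the ordered replica list of a partition within each zone for the old and the new topology (supplied here as hypothetical maps from zone id to server list), the old nth replica becomes the donor and the new nth replica the stealer, always within the same zone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ZoneNarySketch {

    /** One planned move: the old n-th replica donates partition data to the new n-th replica. */
    public static final class Move {
        public final int partition, zone, nary, donorServer, stealerServer;
        Move(int partition, int zone, int nary, int donorServer, int stealerServer) {
            this.partition = partition;
            this.zone = zone;
            this.nary = nary;
            this.donorServer = donorServer;
            this.stealerServer = stealerServer;
        }
    }

    /** Pair old and new n-th replicas per zone; donor and stealer never cross zones. */
    public static List<Move> planMoves(int partition,
                                       Map<Integer, List<Integer>> oldZoneReplicas,
                                       Map<Integer, List<Integer>> newZoneReplicas) {
        List<Move> moves = new ArrayList<Move>();
        for (Map.Entry<Integer, List<Integer>> entry : newZoneReplicas.entrySet()) {
            int zone = entry.getKey();
            List<Integer> oldReplicas = oldZoneReplicas.get(zone);
            List<Integer> newReplicas = entry.getValue();
            for (int n = 0; n < newReplicas.size(); n++) {
                if (oldReplicas == null || n >= oldReplicas.size()) {
                    continue;  // a brand new zone has no in-zone donor (zone expansion case)
                }
                int donor = oldReplicas.get(n);
                int stealer = newReplicas.get(n);
                if (donor != stealer) {
                    moves.add(new Move(partition, zone, n, donor, stealer));
                }
            }
        }
        return moves;
    }
}
```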

Page 38: Voldemort : Prototype to Production

Redesign: Advantages

Cluster Expansion Rewrite

●  Simple, yet powerful
●  Feasible alternative to breaking the ring
○  Expensive to rewrite all of DR
●  No more cross-datacenter moves
●  Aligns the proxy-bridge mechanism with the planner logic
●  The principle applied to
○  Abortable rebalances
○  Zone expansion

Page 39: Voldemort : Prototype to Production

Abortable Rebalance

Cluster Expansion Rewrite

●  Plans go wrong
●  Introducing proxy puts
○  Safely roll back to the old topology
●  Avoid data loss and ad-hoc repairs
●  Double write load during the rebalance

[Sequence diagram, sketched in code below: a put(k,v) arriving at the stealer triggers a proxy-get(k) to the donor returning the old value v_old, then a local-put(k, v_old) followed by a local-put(k, v) on the stealer, a success response to the caller, and finally a proxy-put(k, v) back to the donor.]
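A minimal sketch of the proxy-put flow in the sequence above, with a hypothetical Node interface standing in for the stealer's and donor's stores; a real implementation also has to merge vector clocks and typically issues the proxy-put asynchronously.

```java
public class ProxyPutSketch {

    /** Hypothetical view of a node's local store plus its proxy link. */
    interface Node {
        byte[] localGet(byte[] key);
        void localPut(byte[] key, byte[] value);
    }

    /**
     * A put handled by the stealer while its partition is still being rebalanced:
     * back-fill the donor's current value locally, apply the new write locally,
     * acknowledge, then proxy the write back to the donor so the old topology
     * stays complete and the rebalance can be safely aborted.
     */
    public static void putDuringRebalance(Node stealer, Node donor, byte[] key, byte[] value) {
        byte[] oldValue = donor.localGet(key);   // proxy-get(k) -> v_old
        if (oldValue != null) {
            stealer.localPut(key, oldValue);     // local-put(k, v_old)
        }
        stealer.localPut(key, value);            // local-put(k, v)
        // ... acknowledge success to the caller here ...
        donor.localPut(key, value);              // proxy-put(k, v): the double write load
    }
}
```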

Page 40: Voldemort : Prototype to Production

Zone Expansion

Cluster Expansion Rewrite

●  Builds upon the Zone N-ary idea
●  Fetch data from an existing zone
●  No proxy bridges
○  No donors in the same zone
●  Cannot read from the new zone until the expansion completes

Page 41: Voldemort : Prototype to Production

New Rebalance Utilities

Cluster Expansion Rewrite

●  PartitionAnalysis
○  Determine the skew of a cluster
●  Repartitioner
○  Improve partition balance
○  Greedy/random swapping
●  RebalancePlanner
○  Incorporates the Zone N-ary logic
○  Operational insights: storage overhead, probability the client will pick up new metadata
●  Rebalance Controller
○  Cleaner reimplementation based on the new planner/scheduler

Page 42: Voldemort : Prototype to Production

Wins In Production

Cluster Expansion Rewrite

●  7 zoned RW clusters expanded into a new zone
○  Hiccups resolved overnight
○  Abortability came in handy
●  Small details -> big difference
○  Proxy pause period
○  Accurate progress reporting
○  Proxy get/getall optimization