
Beyond Consistent Hashing and TCP: Vastly Scalable Load Balanced Storage Clustering

Alex Aizman, Caitlin Bestler

Nexenta Systems

2015 Storage Developer Conference. © Nexenta Systems, Inc. All Rights Reserved.

The Scale of Disruption

This presentation:
  Existing technology vs. massive scale
  There are, in fact, better alternatives…

                                 Conventional                New
  Data Distribution Mechanism    Consistent Hashing (CH)
  Storage Transport              Unicast (TCP)

Distribution & Replication Mechanism: Data & Metadata

TOE (data storage)

[Diagram: "Root of Everything", the evolution of data storage from the 1980s to 2015:
  Unstructured: Local → Distributed, split into Namespace (DLM, MDS) and Stripe (Consistent Hash)
  Structured: SQL (Local → Distributed) and NoSQL]

Increasing Decentralization

1980s:
  1. Allocate block or stripe
  2. Record the location in the parent's table
  3. Write…

2010s:
  1. Map name or chunk to target
  2. Let the target take care of the rest

Storage Targets

I/O pipeline(*): unstructured, distributed and striped

[Diagram: three levels of mapping on the storage targets, each keyed by a content hash (0xb13f7…, 0x85ac4…, 0xfe291…, 0x35d1e…):
  Name → map to metadata
  Metadata (references unnamed chunks) → map to locations → Read / Write / Replicate
  Data Stripes (Blocks, Chunks, Shards) → map to locations → Read / Write / Replicate]
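A minimal sketch of that indirection in C, with assumed types and a toy hash (names such as version_metadata_t and map_to_target are illustrative, not NexentaEdge's actual API):

    /* name -> metadata -> unnamed chunks -> locations */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct { uint8_t bytes[32]; } chunk_hash_t;   /* e.g. 0xb13f7... */

    typedef struct {
        chunk_hash_t chunks[64];   /* unnamed data chunks referenced by a named version */
        unsigned     n_chunks;
    } version_metadata_t;

    /* toy FNV-1a; a real system would use a cryptographic hash */
    static chunk_hash_t hash_bytes(const void *p, size_t len) {
        const uint8_t *b = p;
        uint64_t acc = 14695981039346656037ULL;
        chunk_hash_t h;
        for (size_t i = 0; i < len; i++)
            acc = (acc ^ b[i]) * 1099511628211ULL;
        memset(&h, 0, sizeof h);
        memcpy(h.bytes, &acc, sizeof acc);
        return h;
    }

    /* hash -> target holding (a replica of) the metadata or data chunk */
    static int map_to_target(chunk_hash_t h, int n_targets) {
        uint64_t v;
        memcpy(&v, h.bytes, sizeof v);
        return (int)(v % (uint64_t)n_targets);   /* simplest possible placement */
    }

    int main(void) {
        const char *name = "bucket/object";
        chunk_hash_t name_hash = hash_bytes(name, strlen(name));
        printf("metadata for %s lives on target %d\n",
               name, map_to_target(name_hash, 1024));
        return 0;
    }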

Quick Recap

                                 Conventional                New
  Data Distribution Mechanism    Consistent Hashing (CH)     Group Hashing (NGH)
  Storage Transport              Unicast (TCP)               Multicast (Replicast™)

Next:
  CH vs. NGH
  TCP/Unicast vs. Replicast
  Simulation: model, results, charts

Consistent Hashing

  OpenStack's object storage service Swift
  Amazon's storage system Dynamo
  Apache Cassandra
  Voldemort
  Riak
  GlusterFS
  Ceph*

Trademarks listed here and elsewhere in this presentation are property of their respective owners.

Ceph vs. Consistent Hashing

Input:
  CRUSH Map aka Cluster Map
  Placement Groups
  OSDs aka devices
  OSD weights
  Rules: data redundancy
  Replicas

Output:
  Pseudo-random uniform hash across failure domains – binomial(n, p)

Problems:
  Same as CH – see next slide

In hash notation:
  • Simplified CRUSH: HASH(node, replica)
  • Rendezvous hashing: HASH(node, replica)
  • Consistent hashing: HASH(replica)
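The HASH(node, replica) notation above is rendezvous (highest-random-weight) hashing: score every node against the chunk and keep the highest scorers. A minimal sketch, assuming a generic 64-bit mixer rather than anyone's production hash:

    #include <stdint.h>
    #include <stdio.h>

    #define N_NODES    16
    #define N_REPLICAS 3

    /* splitmix64 finalizer; stands in for any strong hash */
    static uint64_t mix64(uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    int main(void) {
        uint64_t chunk = 0xb13f7;                  /* content hash of a chunk */
        int      chosen[N_REPLICAS] = {-1, -1, -1};
        uint64_t best[N_REPLICAS]   = {0, 0, 0};   /* kept in descending order */

        /* score = HASH(node, chunk); keep the N_REPLICAS highest-scoring nodes */
        for (int node = 0; node < N_NODES; node++) {
            uint64_t score = mix64(chunk ^ mix64((uint64_t)node));
            for (int r = 0; r < N_REPLICAS; r++) {
                if (score > best[r]) {
                    for (int s = N_REPLICAS - 1; s > r; s--) {
                        best[s]   = best[s - 1];
                        chosen[s] = chosen[s - 1];
                    }
                    best[r]   = score;
                    chosen[r] = node;
                    break;
                }
            }
        }
        for (int r = 0; r < N_REPLICAS; r++)
            printf("replica %d -> node %d\n", r, chosen[r]);
        return 0;
    }

Removing or adding a node reshuffles only the chunks for which that node scores in the top N_REPLICAS, the same minimal-disruption property the ring-based variant is known for.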

Consistent Hashing: the good, the bad…

The good:
  Better than hash(chunk) % n-servers
  Even better when used with virtual nodes (partitions)

The bad:
  Balls-into-Bins: maximum load grows as log(n)/log(log(n)), producing congestion "ripples"
  Subject to binomial spikes

The ugly:
  Hashing determinism vs. load balancing: it is either/or
  Worse for larger clusters
  Much worse at high utilization
  Is too consistent…
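The "bad" above reproduces in a few lines: throw n chunks uniformly at n targets and the busiest target lands far above the mean. A minimal sketch, with rand() standing in for the uniform hash:

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024   /* n chunks onto n targets: the classic balls-into-bins setup */

    int main(void) {
        int load[N] = {0};
        int max = 0;

        srand(42);
        for (int i = 0; i < N; i++)
            load[rand() % N]++;          /* blind, deterministic placement */
        for (int i = 0; i < N; i++)
            if (load[i] > max)
                max = load[i];
        /* mean load is exactly 1; max load concentrates near log(n)/log(log(n)) */
        printf("mean load 1.00, max load %d\n", max);
        return 0;
    }

At high utilization that gap is the difference between an idle target and a congested one, and a deterministic hash has no way to steer new chunks around it.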

NGH: How it works

[Diagram: gateways (GW) negotiate with a group NGx of storage targets (Si, Sj, Sk, …) through the switch]

  Min(*) size of an NG = 6
  Simulation uses 9 targets per NG
  The gateway must select the 3 best bids

Why NGH

Admit: we simply don't know enough about the state on the edge. Accept it.

Main principles and motivations:
  Serial transmit: one chunk at a time
  Per-disk dedicated thread: one chunk at a time
  Edge-based load balancing as part of the datapath
  Edge-based reservation of link bandwidth: each storage target sends the gateway a BID (sketched below):

    begin = NOW + F(queue-depth, cache-size, link-bw)
    end   = begin + G(chunk-size)
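A sketch of the bid arithmetic in C; the slide leaves F() and G() abstract, so the bodies below are assumptions for illustration only:

    #include <stdint.h>
    #include <stdio.h>

    #define LINK_BPS 10000000000ULL                /* 10GbE */

    typedef struct { uint64_t begin, end; } bid_t; /* reserved window, ns */

    /* F(): earliest time this target could start receiving (assumed model:
       every queued chunk adds fixed latency, cached bytes must drain first) */
    static uint64_t F(uint64_t now_ns, unsigned queue_depth, uint64_t cached_bytes) {
        return now_ns + queue_depth * 1000
                      + cached_bytes * 8 * 1000000000ULL / LINK_BPS;
    }

    /* G(): time to receive chunk_size bytes at link speed */
    static uint64_t G(uint64_t chunk_size) {
        return chunk_size * 8 * 1000000000ULL / LINK_BPS;
    }

    int main(void) {
        bid_t bid;
        bid.begin = F(0, 4, 256 * 1024);        /* 4 queued chunks, 256KB cached */
        bid.end   = bid.begin + G(128 * 1024);  /* bidding on a 128KB chunk */
        printf("BID: [%llu, %llu] ns\n",
               (unsigned long long)bid.begin, (unsigned long long)bid.end);
        return 0;
    }

The gateway collects one such bid from every target in the group and schedules the rendezvous transfer into the 3 best windows.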

Storage Transport: Data & Control

Unicast vs. Persistent Redundancy

Typical: put(…) blocks until 3 copies get persistently stored on 3 storage servers (variations: RAIN, EE, 2 cached, etc.)
I/O pipeline: redundancy first, completion second

Daisy Chain Replication

[Diagram: gateway and storage servers connected through the switch; replicas flow down a daisy chain (put r1 → put (replicate) r1 → put r2 → put r3) while the servers also serve get r1]

Given a sustained get() rate G and a sustained single-replica put() rate P, daisy chaining leaves get() with

  G * 100 / (G + P)

percent of its standalone rate.
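For example (illustrative numbers): with G = 300 chunks/ms and P = 100 chunks/ms, daisy chaining leaves get() with 300 * 100 / (300 + 100) = 75% of its standalone rate.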

TCP: the good, the bad…

The good:
  Predominant, totally integrated and reliable

The bad:
  Was never designed for scale-out storage clusters

The ugly:
  Unicast – previous slide
  TCP incast – "is a catastrophic TCP throughput collapse…"
  TCP "outcast" – converged and hyper-converged clusters
  Time to converge >> RTO

However:
  DCB helps
  Custom/proprietary TCP kernels will help as well…

TCP Simulation

Each chunk put: 3 successive TCP transmissions
ACK for every received replica
ACK delivery time = chunk reception time + ½ RTT
Future-perfect TCP congestion protocol with instant convergence:
  N flows to one target, each instantly adjusting to use 1/Nth of the 10GbE bandwidth
TCP receive window and Tx buffer sizes unlimited
Zero per-chunk overhead

In short, simulating TCP as a quintessential, perfect Unicast.
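Under these assumptions the per-chunk cost reduces to simple arithmetic; a sketch, with the chunk size and flow count as example values:

    #include <stdio.h>

    #define LINK_GBPS 10.0   /* 10GbE */
    #define RTT_US     2.0   /* matches CLUSTER_TRIP_TIME of ~1us each way */

    /* time (us) from transmit start to ACK for one replica, with n_flows
       instantly and fairly sharing the target's link */
    static double replica_time_us(double chunk_bytes, int n_flows) {
        double bits_per_us = LINK_GBPS * 1000.0;   /* 10 Gb/s = 10000 bits/us */
        double xfer_us = chunk_bytes * 8.0 * n_flows / bits_per_us;
        return xfer_us + RTT_US / 2.0;             /* ACK lands half an RTT later */
    }

    int main(void) {
        /* one chunk put = 3 successive transmissions, each ACKed */
        double put_us = 3.0 * replica_time_us(128.0 * 1024, 8);
        printf("idealized 3-replica put, 128K chunk, 8 competing flows: %.1f us\n",
               put_us);
        return 0;
    }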

Replicast Simulation

The (simulated) penalty is on Replicast:
  +RTT per-chunk overhead
  Note: TCP is benchmarked with zero (0) overhead
  Replicast still wins…

Tire Kicking

https://github.com/Nexenta/nedge-sim

    ./build-ndebug
    ./do_benchmarks.sh

Example: ./test ngs 128 chunk_size 128 gateways 512 duration 100

Extensive comments inside
Detailed log tracking of each transaction
Logs for the cited runs can be reproduced at ff5fbda

Key tunables (default and comment):

    #define CLUSTER_TRIP_TIME 10000          // approximately 1us; RTT = 2us
    #define N_REPLICAS 3                     // replicas of each stored chunk
    #define MBS_SEC_PER_TARGET_DRIVE 400     // 400 MB/s (eMLC)

I/O Pipelines: CH/TCP and NGH/Replicast

NGH/Replicast: CHUNK PUT READY → REP CHUNK PUT REQUEST RECEIVED → REP CHUNK PUT RESPONSE RECEIVED → REP CHUNK PUT ACCEPT RECEIVED → REP RENDEZVOUS XFER RECEIVED → DISK WRITE START → DISK WRITE COMPLETION → REPLICA PUT ACK → CHUNK PUT ACK

CH/TCP: CHUNK PUT READY → TCP XMIT RECEIVED → TCP RECEPTION COMPLETE → TCP RECEPTION ACK → DISK WRITE START → DISK WRITE COMPLETION → REPLICA PUT ACK → CHUNK PUT ACK

Event loop executing tick by tick, where one tick = 1 bit-time at 10GbE speed (a sketch follows).
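A sketch of such a tick-driven loop; the event and handler types are illustrative, not nedge-sim's actual structures, and the loop jumps straight to the next scheduled tick (equivalent to idling through empty ones):

    #include <stdint.h>
    #include <stdio.h>

    /* one tick = 1 bit-time at 10GbE = 0.1 ns */
    typedef struct event {
        uint64_t tick;                    /* when to fire */
        void (*handler)(struct event *);
        struct event *next;               /* queue kept sorted by tick */
    } event_t;

    static event_t *queue;

    static void schedule(event_t *e) {
        event_t **p = &queue;
        while (*p && (*p)->tick <= e->tick)
            p = &(*p)->next;
        e->next = *p;
        *p = e;
    }

    static void chunk_put_ack(event_t *e) {
        printf("CHUNK PUT ACK at tick %llu (~%llu ns)\n",
               (unsigned long long)e->tick,
               (unsigned long long)(e->tick / 10));
    }

    int main(void) {
        event_t ack = { 10000, chunk_put_ack, NULL };  /* CLUSTER_TRIP_TIME, ~1us */
        schedule(&ack);
        while (queue) {                   /* fire events in tick order */
            event_t *e = queue;
            queue = e->next;
            e->handler(e);
        }
        return 0;
    }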

Simulation Results

[Charts, panels gw256c128K and gw64c4K: NGH-R vs. CH-T throughput in chunks/ms, and percentage of busy targets, over simulated intervals 1–23]

Simulation Results (cont-d)

[Charts, panels gw64c128K and gw512c128K: NGH-R vs. CH-T throughput in chunks/ms, and percentage of busy targets, over simulated intervals 1–23]

Simulation Results (last)

[Chart, panel gw256c4K: NGH-R vs. CH-T throughput in chunks/ms, and percentage of busy targets, over simulated intervals 1–23]

Conclusions

Distributed storage clusters require technology that can simultaneously optimize:
  available IOPS (budget) of the backend
  storage capacity
  network utilization

To address all of the above, the mechanism must load balance at runtime; Consistent Hashing amounts to blind selection.

Storage transport must be designed for scale-out and replication (3 replicas, etc.)

Hence, NGH/Replicast. Source is on GitHub (see Tire Kicking above).

What's Next

[Diagram: CH/Unicast, i.e. Consistent Hash + Best Effort over Unicast, evolving to NGH/Replicast, i.e. Group Hash + sub-Group Load Balancing over Multicast]

Q & A
