2015 Storage Developer Conference. © Nexenta Systems, Inc. All Rights Reserved.
Beyond Consistent Hashing and TCP: Vastly Scalable Load Balanced Storage Clustering
Alex Aizman, Caitlin Bestler
Nexenta Systems
The Scale of Disruption

This presentation: existing technology vs. massive scale. There are in fact better alternatives.

                                 Conventional              New
  Data Distribution Mechanism    Consistent Hashing (CH)   ?
  Storage Transport              Unicast (TCP)             ?
Distribution & Replication Mechanism: Data & Metadata
TOE (data storage): Root of Everything

[Taxonomy diagram, 1980s → 2015: Unstructured storage – local (namespace) vs. distributed (stripe, DLM, MDS, consistent hash); Structured storage – SQL (local and distributed), NoSQL, …]
Increasing Decentralization

1980s:
1. Allocate block or stripe
2. Record the location in the parent's table
3. Write…

2010s:
1. Map name or chunk to target
2. Let the target take care of the rest
Storage Targets

I/O pipeline(*): unstructured, distributed and striped

[Pipeline diagram: a name maps to metadata; metadata references unnamed chunks by content hash (0xb13f7…, 0x85ac4…, 0xfe291…, 0x35d1e…, …); data stripes (blocks, chunks, shards) and metadata each map to storage-target locations for read, write, and replicate.]
Quick Recap

                                 Conventional         New
  Data Distribution Mechanism    Consistent Hashing   Group Hashing (NGH)
  Storage Transport              TCP                  Multicast (Replicast™)

Next:
- CH vs. NGH
- TCP/Unicast vs. Replicast
- Simulation: model, results, charts
Consistent Hashing

Systems built on consistent hashing:
- OpenStack's Object Storage Service, Swift
- Amazon's storage system, Dynamo
- Apache Cassandra
- Voldemort
- Riak
- GlusterFS
- Ceph*
- …

Trademarks listed here and elsewhere in this presentation are property of their respective owners.
Ceph vs. Consistent Hashing

Input:
- CRUSH MAP aka Cluster Map
- Placement Groups
- OSDs aka devices
- OSD weights
- Rules: data redundancy, replicas

Output:
- Pseudo-random uniform hash across failure domains – binomial(n, p)

Problems:
- Same as CH – see next slide

The placement functions compared:
- Simplified CRUSH: HASH(node, replica)
- Rendezvous hashing: HASH(node, replica)
- Consistent hashing: HASH(replica)
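The distinction above can be made concrete with a minimal sketch (this is not Ceph's actual CRUSH code; the hash function, node names, and replica-salting below are illustrative assumptions):

```python
import hashlib

def h(*parts):
    # Stable 64-bit hash of the joined parts.
    d = hashlib.sha256("|".join(map(str, parts)).encode()).digest()
    return int.from_bytes(d[:8], "big")

def consistent_hash(chunk, nodes, replicas=3):
    # HASH(replica): only the chunk (plus a replica index) is hashed;
    # each hash lands on whichever node owns that arc of the ring.
    ring = sorted((h(n), n) for n in nodes)
    placed = []
    for r in range(replicas):
        x = h(chunk, r)
        owner = next((n for hv, n in ring if hv >= x), ring[0][1])
        if owner not in placed:
            placed.append(owner)
    return placed

def rendezvous_hash(chunk, nodes, replicas=3):
    # HASH(node, replica-key): every node is scored against the chunk;
    # the top scorers win (highest-random-weight placement).
    return sorted(nodes, key=lambda n: h(n, chunk), reverse=True)[:replicas]

nodes = [f"osd{i}" for i in range(8)]
print(rendezvous_hash("chunk-0xb13f7", nodes))
```

In both styles the placement is deterministic; rendezvous hashing additionally guarantees that removing a losing node never disturbs the winners.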
Consistent Hashing: the good, the bad…

The good:
- better than hash(chunk) % n-servers
- even better when used with virtual nodes (partitions)

The bad:
- Balls-into-bins: maximum load grows as log(n)/log log(n), causing congestion "ripples"
- Subject to binomial spikes

The ugly:
- Hashing determinism vs. load balancing: either/or
- Worse for larger clusters
- Much worse at high utilization
- It is too consistent…
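The balls-into-bins point is easy to reproduce. A small simulation (an illustration of the general phenomenon, not the deck's simulator) contrasts blind hash placement with a load-aware pick of the less-loaded of two candidates:

```python
import random
from collections import Counter

def max_load(n_chunks, n_targets, choices=1, seed=42):
    # choices=1 models blind (hash-determined) placement, as in
    # consistent hashing; choices=2 models picking the less-loaded
    # of two random candidate targets.
    rng = random.Random(seed)
    load = Counter()
    for _ in range(n_chunks):
        candidates = [rng.randrange(n_targets) for _ in range(choices)]
        load[min(candidates, key=lambda t: load[t])] += 1
    return max(load.values())

n = 10_000
print(max_load(n, n), max_load(n, n, choices=2))
```

With n chunks on n targets, blind placement leaves one target with roughly log(n)/log log(n) chunks, while even a two-way load-aware choice drops the maximum to roughly log log(n): the "ripple" the slide refers to.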
NGH: How it works

[Diagram: gateways (GW) reach negotiating groups (NGx) of storage targets (Si, Sj, Sk) through the switch. Min(*) size of an NG = 6; the simulation uses 9 targets per NG; the gateway must select the 3 best bids.]
Why NGH

Admit: we simply don't know enough about the state on the edge. Accept it.

Main principles and motivations:
- Serial transmit: one chunk at a time
- Per-disk dedicated thread: one chunk at a time
- Edge-based load balancing as part of the datapath
- Edge-based reservation of link bandwidth, Storage Target to Gateway:
    begin = NOW + F(queue-depth, cache-size, link bw)
    end = begin + G(chunk-size)
  The (begin, end) window is a BID.
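The bid-and-select step can be sketched as follows. The slide leaves F and G abstract, so the concrete forms below (a fixed per-queued-chunk delay for F, wire time at link speed for G, and the 25 µs cost) are stand-in assumptions, not the deck's actual functions:

```python
from dataclasses import dataclass

@dataclass
class Bid:
    target: str
    begin: float  # earliest time (us) the target can accept the rendezvous transfer
    end: float    # begin + time to receive the chunk

def make_bid(target, now, queue_depth, chunk_bytes,
             link_gbps=10.0, per_queued_chunk_us=25.0):
    # begin = NOW + F(queue-depth, ...): here F simply charges a fixed
    # cost per already-queued chunk (hypothetical stand-in).
    begin = now + queue_depth * per_queued_chunk_us
    # end = begin + G(chunk-size): wire time for the chunk at link speed.
    xfer_us = chunk_bytes * 8 / (link_gbps * 1000)
    return Bid(target, begin, begin + xfer_us)

def select_best(bids, n_replicas=3):
    # The gateway keeps the n earliest-completing bids from the group.
    return sorted(bids, key=lambda b: b.end)[:n_replicas]

queue_depths = [4, 0, 7, 2, 1, 9, 3, 5, 6]  # 9 targets in the negotiating group
bids = [make_bid(f"S{i}", 0.0, qd, 128 * 1024)
        for i, qd in enumerate(queue_depths)]
print([b.target for b in select_best(bids)])  # -> ['S1', 'S4', 'S3']
```

Because each bid reflects the target's own edge state (queue depth, cache, link), the three winners are the three least-loaded targets at that moment; that is the runtime load balancing consistent hashing cannot do.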
Storage Transport: Data & Control
Unicast vs. Persistent Redundancy

Typical: put(…) blocks until 3 copies get persistently stored on 3 storage servers (variations: RAIN, EE, 2 cached, etc.)

I/O pipeline: redundancy first, completion second.

Daisy Chain Replication:
[Diagram: Gateway, switch, storage servers; put r1/r2/r3, get r1, and daisy-chained put (replicate) r1 forwarded server-to-server.]

Given a sustained get() rate G, a sustained single-replica put() rate P, and daisy-chain replication, the new get() rate (%) is:

  G * 100 / (G + P)
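A quick check of that formula (the sample rates are illustrative, not measured):

```python
def new_get_rate_pct(get_rate, put_rate):
    # Share of throughput left for get() when single-replica put() traffic
    # (the daisy chain forwards replicas 2 and 3 target-to-target)
    # competes on the same links: G * 100 / (G + P).
    return get_rate * 100.0 / (get_rate + put_rate)

print(new_get_rate_pct(1000, 1000))  # equal G and P leave 50.0% for get()
```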
TCP: the good, the bad…

The good:
- Predominant, totally integrated, and reliable

The bad:
- Was never designed for scale-out storage clusters

The ugly:
- Unicast – previous slide
- TCP incast – "a catastrophic TCP throughput collapse…"
- TCP "outcast" – converged and hyper-converged clusters
- Time to converge >> RTO

However:
- DCB helps
- Custom/proprietary TCP kernels will help as well…
TCP Simulation

- Each chunk put: 3 successive TCP transmissions
- ACK every received replica
- ACK delivery time = chunk reception time + ½ * RTT
- Future-perfect TCP congestion protocol: instant convergence
  - N flows to one target, each instantly adjusting to use 1/Nth of the 10GbE bandwidth
- TCP receive window and Tx buffer sizes unlimited
- Zero per-chunk overhead

Simulating TCP as a quintessential perfect unicast.
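Those assumptions reduce a chunk put to simple arithmetic. A sketch of the idealized cost model (my condensation of the bullets above, assuming the gateway sends the three replicas back-to-back and the last ACK lands half an RTT after the last replica is received):

```python
def chunk_put_time_us(chunk_bytes, n_flows, rtt_us=2.0, replicas=3,
                      link_gbps=10.0):
    # Instant convergence: each of the N concurrent flows to a target
    # gets exactly 1/N of the 10GbE link; zero per-chunk overhead.
    per_flow_bits_per_us = link_gbps * 1000 / n_flows
    xmit_us = chunk_bytes * 8 / per_flow_bits_per_us  # one replica's wire time
    # Three successive transmissions, then the final ACK half an RTT later.
    return replicas * xmit_us + rtt_us / 2

print(chunk_put_time_us(128 * 1024, n_flows=1))
```

For a lone 128 KiB put this comes to roughly 315.6 µs; with contention, time scales linearly with the number of competing flows, which is exactly the best case any real TCP could hope for.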
Replicast Simulation

The (simulated) penalty is on Replicast:
- +RTT per-chunk overhead
- Note: TCP is benchmarked with zero (0) overhead
- Replicast still wins…
Tire Kicking

https://github.com/Nexenta/nedge-sim

  ./build-ndebug
  ./do_benchmarks.sh

Example:
  ./test ngs 128 chunk_size 128 gateways 512 duration 100

- Extensive comments inside
- Detailed log tracking of each transaction
- Logs for the cited runs can be reproduced at ff5fbda

Defaults and comments:
  #define CLUSTER_TRIP_TIME        10000  // approximately 1us; RTT = 2us
  #define N_REPLICAS                   3  // replicas of each stored chunk
  #define MBS_SEC_PER_TARGET_DRIVE   400  // 400MB/s (eMLC)
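The simulator counts time in ticks of 1 bit-time at 10GbE, which makes the defaults above easy to sanity-check (a small conversion derived from those definitions):

```python
TICKS_PER_SECOND = 10 * 10**9  # one simulator tick = 1 bit-time at 10GbE

def ticks_to_us(ticks):
    # Convert event-loop ticks to microseconds of simulated time.
    return ticks / (TICKS_PER_SECOND // 10**6)

print(ticks_to_us(10_000))      # CLUSTER_TRIP_TIME: 10000 ticks = 1.0 us
print(ticks_to_us(2 * 10_000))  # hence RTT = 2.0 us, as the comment says
```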
I/O Pipelines: CH/TCP and NGH/Replicast

NGH/Replicast chunk put:
CHUNK PUT READY → REP CHUNK PUT REQUEST RECEIVED → REP CHUNK PUT RESPONSE RECEIVED → REP CHUNK PUT ACCEPT RECEIVED → RENDEZVOUS XFER RECEIVED → DISK WRITE START → DISK WRITE COMPLETION → REPLICA PUT ACK → CHUNK PUT ACK

CH/TCP chunk put:
CHUNK PUT READY → TCP XMIT RECEIVED → TCP RECEPTION COMPLETE → TCP RECEPTION ACK → DISK WRITE START → DISK WRITE COMPLETION → REPLICA PUT ACK → CHUNK PUT ACK

Event loop executing tick by tick, where one tick = 1 bit-time at 10GbE speed.
Simulation Results

[Charts, configs gw256c128K and gw64c4K: throughput in chunks/ms for NGH-R vs. CH-T, and the corresponding % of busy targets, over the run (x-axis 1–23).]
Simulation Results (cont-d)

[Charts, configs gw64c128K and gw512c128K: throughput in chunks/ms for NGH-R vs. CH-T, and the corresponding % of busy targets, over the run (x-axis 1–23).]
Simulation Results (last)

[Chart, config gw256c4K: throughput in chunks/ms for NGH-R vs. CH-T, and the corresponding % of busy targets, over the run (x-axis 1–23).]
Conclusions

Distributed storage clusters require technology that can simultaneously optimize:
- available IOPS (budget) of the backend
- storage capacity
- network utilization

To address all of the above, the mechanism must load balance at runtime; Consistent Hashing = blind selection.

Storage transport must be designed for scale-out and replication (3 replicas, etc.).

Hence, NGH/Replicast. Source is on GitHub.
What's Next

- Unicast → Multicast
- Consistent Hash + Best Effort → Group Hash + sub-Group Load Balancing
- In short: CH/Unicast → NGH/Replicast