Amazon’s Key-Value Store: Dynamo
UCSB CS271 1
AMAZON’S KEY-VALUE STORE: DYNAMO
DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon's highly available key-value store. SOSP 2007
Adapted from Amazon’s Dynamo Presentation
Motivation
• Reliability at a massive scale
• Even the slightest outage has significant financial consequences
• High write availability
• Amazon’s platform: tens of thousands of servers and network components, geographically dispersed
• Provide persistent storage in spite of failures
• Sacrifice consistency to achieve performance, reliability, and scalability
Dynamo Design rationale
• Most services need only key-based access:
  – Best-seller lists, shopping carts, customer preferences, session management, sales rank, product catalog, and so on.
• The prevalent application design based on RDBMS technology would be catastrophic at this scale.
• Dynamo therefore provides a primary-key-only interface.
Dynamo Design Overview
• Data partitioning using consistent hashing
• Data replication
• Consistency via version vectors
• Replica synchronization via quorum protocol
• Gossip-based failure-detection and membership protocol
System Requirements
• Data & Query Model:
  – Read/write operations via primary key
  – No relational schema: use <key, value> object
  – Object size < 1 MB, typically.
• Consistency guarantees:
  – Weak
  – Only single-key updates
  – Not clear whether read-modify-write is isolated
• Efficiency:
  – SLAs are stated at the 99.9th percentile of operations
• Notes:
  – Commodity hardware
  – Minimal security measures, since it is for internal use
Service Level Agreements (SLA)
• For an application to deliver its functionality in a bounded time, every dependency in the platform needs to deliver its functionality with even tighter bounds.
• Example SLA: service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second.
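The example SLA above can be checked against measured latencies. The following is a minimal sketch, not Amazon's tooling: the function name `meets_sla`, its parameters, and the sample data are all assumptions made for illustration. It uses integer arithmetic for the 99.9% threshold to avoid floating-point rounding at the boundary.

```python
def meets_sla(latencies_ms, bound_ms=300, target_permille=999):
    """Return True if at least target_permille/1000 of requests finished within bound_ms."""
    within = sum(1 for t in latencies_ms if t <= bound_ms)
    # Integer comparison: within / total >= target_permille / 1000
    return within * 1000 >= target_permille * len(latencies_ms)


# Hypothetical measurements: one slow request in a thousand still meets the SLA,
# two slow requests in a thousand do not.
one_slow = [20.0] * 999 + [450.0]
two_slow = [20.0] * 998 + [450.0, 450.0]

print(meets_sla(one_slow))   # one outlier out of 1000
print(meets_sla(two_slow))   # two outliers out of 1000
print(sum(two_slow) / len(two_slow))  # the mean hides the outliers
```

The mean of the second sample stays far below 300 ms even though the SLA is violated, which is why the requirement is stated on a high percentile rather than on the average.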
System Interface
• Two basic operations:
  – Get(key):
    • Locates replicas
    • Returns the object + context (encodes metadata including version)
  – Put(key, context, object):
    • Writes the replicas to disk
    • Context: version (vector timestamp)
• Hash(key) → 128-bit identifier
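A toy, single-process sketch of this interface follows. The Dynamo paper describes applying MD5 to the key to obtain the 128-bit identifier; everything else here (the class `ToyStore`, the plain integer version standing in for a vector timestamp, the in-memory dict) is an assumption made for illustration and ignores replication entirely.

```python
import hashlib


def key_to_id(key: bytes) -> int:
    """Map an opaque key to a 128-bit identifier using MD5."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")


class ToyStore:
    """Hypothetical single-node stand-in for get/put with a context."""

    def __init__(self):
        self._data = {}  # 128-bit id -> (object, version)

    def get(self, key: bytes):
        obj, version = self._data.get(key_to_id(key), (None, 0))
        # The context carries the version the caller read.
        return obj, {"version": version}

    def put(self, key: bytes, context: dict, obj) -> None:
        # The caller passes back the context obtained from an earlier get.
        self._data[key_to_id(key)] = (obj, context["version"] + 1)


store = ToyStore()
_, ctx = store.get(b"cart:42")
store.put(b"cart:42", ctx, ["book"])
print(store.get(b"cart:42"))
```

The point of the context is that a put always names the version it was derived from, which is what lets the real system relate versions to each other.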
Partition Algorithm
• Consistent hashing: the output range of a hash function is treated as a fixed circular space or “ring”, a la Chord.
• “Virtual Nodes”: each node can be responsible for more than one virtual node (to deal with non-uniform data and load distribution)
Virtual Nodes
Advantages of using virtual nodes
• The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
• A real node’s load can be distributed across the ring, so that a hot spot does not land on a single node.
• If a node becomes unavailable the load handled by this node is evenly dispersed across the remaining available nodes.
• When a node becomes available again, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.
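Consistent hashing with virtual nodes can be sketched as follows. This is an illustrative toy, not Dynamo's implementation: the class `Ring`, the `node#i` token labels, the use of MD5 for token positions, and the vnode counts are all assumptions.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Position on the ring: a 128-bit hash of the label."""
    return int.from_bytes(hashlib.md5(label.encode()).digest(), "big")


class Ring:
    def __init__(self):
        self._tokens = []  # sorted ring positions of all virtual nodes
        self._owner = {}   # ring position -> physical node name

    def add_node(self, node: str, vnodes: int) -> None:
        # A node with more capacity can be given more virtual nodes.
        for i in range(vnodes):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self._tokens, pos)
            self._owner[pos] = node

    def remove_node(self, node: str) -> None:
        self._tokens = [p for p in self._tokens if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        if not self._tokens:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key's position, wrapping around.
        i = bisect.bisect_right(self._tokens, ring_position(key)) % len(self._tokens)
        return self._owner[self._tokens[i]]


ring = Ring()
ring.add_node("A", vnodes=8)
ring.add_node("B", vnodes=8)
ring.add_node("C", vnodes=16)  # hypothetical larger machine

keys = [f"key{i}" for i in range(200)]
before = {k: ring.lookup(k) for k in keys}
ring.remove_node("C")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "keys moved; all of them were owned by C")
```

Only keys that were owned by the departed node change owner, and because that node's virtual nodes were scattered around the ring, its keys are spread over the remaining nodes instead of falling on a single successor.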
Replication
• Each data item is replicated at N hosts.
• Preference list: the list of nodes responsible for storing a particular key.
• Some fine-tuning to account for virtual nodes
Replication
Replication
Preference Lists
• List of nodes responsible for storing a particular key.
• To account for node failures, the preference list contains more than N nodes.
• Because of virtual nodes, the preference list skips ring positions so that it contains only distinct physical nodes.
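One way to picture how a preference list is built: walk the ring clockwise from the key's position and collect physical nodes, skipping any virtual node whose physical node is already in the list. The sketch below is an assumption-laden illustration; the function name, the `extra` parameter for the additional fallback nodes, and the small integer ring positions are invented for readability.

```python
def preference_list(tokens, key_pos, n, extra=0):
    """tokens: list of (ring_position, physical_node) sorted by position.

    Returns up to n + extra distinct physical nodes, in clockwise order
    starting at the first token at or after key_pos.
    """
    start = next((i for i, (pos, _) in enumerate(tokens) if pos >= key_pos), 0)
    chosen = []
    for step in range(len(tokens)):
        node = tokens[(start + step) % len(tokens)][1]
        if node not in chosen:  # skip virtual nodes of an already chosen host
            chosen.append(node)
        if len(chosen) == n + extra:
            break
    return chosen


# Hypothetical ring: A and B each own two virtual nodes.
tokens = [(10, "A"), (20, "A"), (30, "B"), (40, "C"), (50, "B"), (60, "D")]

print(preference_list(tokens, key_pos=5, n=3))            # second A token is skipped
print(preference_list(tokens, key_pos=45, n=3))           # wraps around the ring
print(preference_list(tokens, key_pos=15, n=3, extra=1))  # one fallback node
```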
Data Versioning
• A put() call may return to its caller before the update has been applied at all the replicas
• A get() call may return many versions of the same object.
• Challenge: an object may have distinct versions
• Solution: use vector clocks in order to capture causality between different versions of the same object.
Vector Clock
• A vector clock is a list of (node, counter) pairs.
• Every version of every object is associated with one vector clock.
• If all the counters on the first object’s clock are less than or equal to the corresponding counters in the second clock, then the first is an ancestor of the second and can be forgotten.
• Otherwise the versions are divergent: the application reconciles them and collapses them into a single new version.
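The ancestor test can be sketched with clocks represented as dictionaries. This is a minimal illustration under stated assumptions: a node missing from a clock is treated as having counter 0, and the node names `Sx`, `Sy`, `Sz` and the helper names are chosen for the example.

```python
def descends(a: dict, b: dict) -> bool:
    """True if every counter in b is <= the matching counter in a (b is an ancestor of a, or equal)."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    if a == b:
        return "equal"
    if descends(b, a):
        return "ancestor"    # a is an ancestor of b and can be forgotten
    if descends(a, b):
        return "descendant"  # b is an ancestor of a
    return "concurrent"      # divergent versions, the application must reconcile


def merge(a: dict, b: dict) -> dict:
    """Element-wise maximum: the clock of a version that has seen both a and b."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


d1 = {"Sx": 1}
d2 = {"Sx": 2}
d3 = {"Sx": 2, "Sy": 1}  # written via Sy on top of d2
d4 = {"Sx": 2, "Sz": 1}  # written via Sz on top of d2

print(compare(d1, d2))  # d1 is superseded by d2
print(compare(d3, d4))  # neither clock dominates the other

# Reconciliation coordinated by Sx: merge both clocks, then bump Sx's counter.
d5 = merge(d3, d4)
d5["Sx"] += 1
print(d5)
```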
Vector clock example
Routing requests
• Route request through a generic load balancer that will select a node based on load information.
• Use a partition-aware client library that routes requests directly to relevant node.
• A gossip protocol propagates membership changes. Each node contacts a peer chosen at random every second and the two nodes reconcile their membership change histories.
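The gossip step can be simulated in a few lines. This is a simplified sketch, not Dynamo's protocol: each node's membership view is assumed to be a dict of `member -> (version, status)`, reconciliation keeps the entry with the higher version, and the node names, seed, and round cap are arbitrary.

```python
import random


def reconcile(view_a: dict, view_b: dict) -> None:
    """After reconciling, both views hold the higher-versioned entry for every member."""
    for member in view_a.keys() | view_b.keys():
        newest = max(view_a.get(member, (0, "unknown")),
                     view_b.get(member, (0, "unknown")))
        view_a[member] = newest
        view_b[member] = newest


def gossip_round(views: dict, rng: random.Random) -> None:
    """Every node contacts one peer chosen at random and reconciles with it."""
    for node in list(views):
        peer = rng.choice([n for n in views if n != node])
        reconcile(views[node], views[peer])


def rounds_until_converged(views: dict, rng: random.Random, max_rounds: int = 1000):
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, rng)
        first = next(iter(views.values()))
        if all(view == first for view in views.values()):
            return round_no
    return None


names = [f"n{i}" for i in range(8)]
# Initially each node knows only about itself.
views = {name: {name: (1, "up")} for name in names}
rounds = rounds_until_converged(views, random.Random(7))
print("converged after", rounds, "rounds")
```

Because each round lets information spread to nodes that can in turn spread it further, all views typically agree after a small number of rounds relative to the cluster size.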
Sloppy Quorum
• R and W are the minimum numbers of nodes that must participate in a successful read and write operation, respectively.
• Setting R + W > N yields a quorum-like system.
• In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency and availability.
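Both claims can be checked on small examples. The sketch below is illustrative only; the function names and the latency figures are assumptions. The first function enumerates every possible read set and write set to see whether they always share a replica, and the second models a coordinator that returns as soon as the required number of replicas have answered.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every read set of size r intersects every write set of size w."""
    replicas = range(n)
    return all(set(read_set) & set(write_set)
               for read_set in combinations(replicas, r)
               for write_set in combinations(replicas, w))


def operation_latency(replica_latencies_ms, quorum: int):
    """Latency seen by the caller: the slowest of the `quorum` fastest replies."""
    return sorted(replica_latencies_ms)[quorum - 1]


print(quorums_always_overlap(3, 2, 2))  # R + W > N
print(quorums_always_overlap(3, 1, 2))  # R + W = N

latencies = [12, 85, 30]  # hypothetical per-replica response times in ms
print(operation_latency(latencies, 2))  # waits for 2 of 3 replicas
print(operation_latency(latencies, 3))  # waits for all 3 replicas
```

With a quorum smaller than N, the slowest replica no longer sits on the critical path, which is the latency argument made above.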
Highlights of Dynamo
• High write availability
• Optimistic: vector clocks for resolution
• Consistent hashing (Chord) in controlled environment
• Quorums for relaxed consistency.
CASSANDRA (FACEBOOK)
Lakshman and Malik: Cassandra - A Decentralized Structured Storage System. LADIS 2009
Data Model
• Key-value store, more like Bigtable.
• Basically, a distributed multi-dimensional map indexed by a key.
• Value is structured into Columns, which are grouped into Column Families: simple and super (column family within a column family).
• An operation is atomic on a single row.
• API: insert, get and delete.
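The "multi-dimensional map" can be pictured as nested dictionaries. The sketch below is a single-process toy, not Cassandra's API: the class `ToyTable`, the method signatures, and the sample row are assumptions, and super column families are left out.

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical in-memory model: row key -> column family -> column -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def insert(self, key, family, column, value) -> None:
        self._rows[key][family][column] = value

    def get(self, key, family, column):
        # .get() avoids creating empty rows as a side effect of a read.
        return self._rows.get(key, {}).get(family, {}).get(column)

    def delete(self, key, family, column) -> None:
        self._rows.get(key, {}).get(family, {}).pop(column, None)


table = ToyTable()
table.insert("user:7", "profile", "name", "Ada")
table.insert("user:7", "profile", "city", "London")
print(table.get("user:7", "profile", "name"))
table.delete("user:7", "profile", "city")
print(table.get("user:7", "profile", "city"))
```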
System Architecture
• Like Dynamo (and Chord).
• Uses an order-preserving hash function on a fixed circular space. The node responsible for a key is called the coordinator.
• Non-uniform data distribution: keep track of data distribution and reorganize if necessary.
Replication
• Each item is replicated at N hosts.
• Replicas can be: Rack Unaware; Rack Aware (within a data center); Datacenter Aware.
• System has an elected leader.
• When a node joins the system, the leader assigns it a range of data items and replicas.
• Each node is aware of every other node in the system and the range they are responsible for.
Membership and Failure Detection
• Gossip-based mechanism to maintain cluster membership.
• A node determines which nodes are up and down using a failure detector.
• The Φ accrual failure detector returns a suspicion level, Φ, for each monitored node.
• If a node suspects A when Φ = 1, 2, or 3, then the likelihood of a mistake is 10%, 1%, and 0.1%, respectively.
• Every node maintains a sliding window of interarrival times of gossip messages from other nodes, uses it to estimate the distribution of interarrival times, and then calculates Φ. The distribution is approximated by an exponential distribution.
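Under the exponential approximation, the probability that the next gossip message arrives later than t seconds after the previous one is exp(-t / mean), and Φ is the negative base-10 logarithm of that probability, which gives Φ = t / (mean · ln 10). The sketch below applies that formula; the class name, the window size, and the timestamps are assumptions made for illustration.

```python
import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window_size: int = 1000):
        # Sliding window of interarrival times (window size is arbitrary here).
        self._intervals = deque(maxlen=window_size)
        self._last_arrival = None

    def heartbeat(self, now: float) -> None:
        """Record the arrival of a gossip message at time `now` (seconds)."""
        if self._last_arrival is not None:
            self._intervals.append(now - self._last_arrival)
        self._last_arrival = now

    def phi(self, now: float) -> float:
        """Suspicion level: -log10 of P(next message arrives later than the time already waited)."""
        if not self._intervals:
            return 0.0
        mean = sum(self._intervals) / len(self._intervals)
        elapsed = now - self._last_arrival
        if mean <= 0:
            return math.inf if elapsed > 0 else 0.0
        # exp(-elapsed / mean) is the exponential tail; -log10 of it simplifies to:
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):  # messages arriving once per second
    detector.heartbeat(t)

# After waiting mean * ln(10) seconds, the chance of wrongly suspecting is 10%.
print(detector.phi(3.0 + math.log(10)))
print(detector.phi(3.0 + 2 * math.log(10)))
```

Each additional unit of Φ corresponds to a tenfold drop in the probability that the suspicion is a mistake, matching the 10%, 1%, 0.1% progression above.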
Operations
• Use quorums: R and W
• If R+W > N, a read overlaps the latest write and returns the latest value; with R+W ≤ N it may not.
  – Read operations return the value with the highest timestamp among the replicas contacted, so they may return older versions
  – Read Repair: with every read, send the newest version to any out-of-date replicas.
  – Anti-Entropy: compute a Merkle tree to catch any out-of-sync data (expensive)
• Each write: first into a persistent commit log, then into an in-memory data structure.
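The highest-timestamp read and the read-repair step can be sketched together. This is a toy model, not Cassandra's code: replicas are assumed to be plain dicts of `key -> (timestamp, value)`, the coordinator simply contacts the first `r` replicas, and the function name is invented.

```python
def read_with_repair(replicas, key, r):
    """Read `key` from the first r replicas, return the newest value,
    and push that version to any contacted replica that is behind."""
    contacted = replicas[:r]
    replies = [rep[key] for rep in contacted if key in rep]
    if not replies:
        return None
    newest = max(replies, key=lambda cell: cell[0])  # highest timestamp wins
    for rep in contacted:
        if rep.get(key, (float("-inf"), None))[0] < newest[0]:
            rep[key] = newest  # read repair
    return newest[1]


replicas = [{"k": (1, "old")}, {"k": (2, "new")}, {}]
print(read_with_repair(replicas, "k", r=2))
print(replicas[0])  # the stale replica has been repaired
print(replicas[2])  # not contacted, so still out of date

stale = [{"k": (1, "old")}, {"k": (2, "new")}]
print(read_with_repair(stale, "k", r=1))  # a small read set can miss the newest version
```

The last call shows why a read that contacts too few replicas can return an older version, and why a background anti-entropy pass is still needed for replicas that reads never touch.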