1
Structured P2P overlay
2
Outline
Introduction
Chord
  I. Stoica, R. Morris, and D. Karger, “Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications”, SIGCOMM 2001
CAN
  S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, “A Scalable Content-Addressable Network”, SIGCOMM 2001
YAPPERS
  P. Ganesan, Q. Sun, and H. Garcia-Molina, “YAPPERS: A peer-to-peer lookup service over arbitrary topology”, INFOCOM 2003
Pastry
  A. Rowstron and P. Druschel, “Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems”, 18th IFIP/ACM Int’l Conf. on Distributed Systems Platforms
3
Introduction: P2P-Overlay Networks
+ distributes costs: bandwidth, maintenance
– additional costs: traffic overhead
[Figure: peers form an overlay network on top of the underlying TCP/IP network]
4
Distributed Hash Tables (DHTs)
- DHT: peers contain the buckets of the hash table
- A DHT is a structured overlay that offers extreme scalability and a hash-table-like lookup interface
- Each peer is responsible for a set of hashed IDs
[Figure: peers A, B, and C; keys are mapped to peers by hashing, e.g. hash(42), hash(23)]
5
Principle: Index references are placed on the peers responsible for their hashed identifiers. Node A (provider) indexes object at responsible peer B.
Structured Overlay Networks: Example
1. Publish: node A publishes a link to its object at the responsible peer; the object reference is routed to B.
2. Routing / lookup: node C looking for the object sends a query to the network; the query is routed to the responsible node B.
3. P2P communication: node B replies to C with the contact information of A, and C gets the link to the object.
6
Peer-Neighborhood
Routing table (who do I know):
  Peer A: Peer-ID, IP address
  Peer B: Peer-ID, IP address
  Peer C: Peer-ID, IP address
Content (what do I have):
  File “abc”
  File “xyz”
  ...
Distributed index (what do I know):
  File “foo”: IP address
  File “bar”: IP address
  ...
[Figure: peers A, B, and C connected in the overlay]
7
Two important issues
Load balancing
Preserving neighbor-table consistency
8
Chord
A Scalable Peer-to-peer Lookup Service for Internet Applications
Prepared by Ali Yildiz
9
What is Chord?
In short: a peer-to-peer lookup system.
- Given a key (data item), it maps the key onto a node (peer).
- Uses consistent hashing to assign keys to nodes.
- Solves the problem of locating a key in a collection of distributed nodes.
- Maintains routing information as nodes join and leave the system.
10
What is Chord? - Addressed Problems
Load balance: a distributed hash function spreads keys evenly over the nodes
Decentralization: Chord is fully distributed; no node is more important than any other, which improves robustness
Scalability: lookup cost grows logarithmically with the number of nodes, so even very large systems are feasible
Availability: Chord automatically adjusts its internal tables so that the node responsible for a key can always be found
11
What is Chord? - Example Application
The highest layer provides a file-like interface to the user, including user-friendly naming and authentication.
This file system maps operations to lower-level block operations.
The block storage layer uses Chord to identify the node responsible for storing a block, and then talks to the block storage server on that node.
[Figure: layered architecture; the client runs File System over Block Store over Chord, and each server runs Block Store over Chord]
12
Consistent Hashing
A consistent hash function assigns each node and key an m-bit identifier.
- SHA-1 is used as a base hash function.
- A node’s identifier is defined by hashing the node’s IP address.
- A key identifier is produced by hashing the key (Chord doesn’t define this; it depends on the application).
  ID(node) = hash(IP, Port)
  ID(key) = hash(key)
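To make the identifier scheme concrete, here is a minimal sketch (not from the slides) of how m-bit identifiers could be derived with SHA-1; the small m = 8 value and the helper name chord_id are illustrative assumptions.

```python
import hashlib

M = 8  # identifier bits for illustration; the paper uses m = 160 (SHA-1)

def chord_id(data: str, m: int = M) -> int:
    """Hash a string (IP:port or key) to an m-bit identifier on the Chord ring."""
    digest = hashlib.sha1(data.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

node_id = chord_id("10.0.0.1:8000")  # ID(node) = hash(IP, Port)
key_id = chord_id("my-file.txt")     # ID(key)  = hash(key)
print(node_id, key_id)
```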
13
Consistent Hashing
- In an m-bit identifier space, there are 2^m identifiers.
- Identifiers are ordered on an identifier circle modulo 2^m; this identifier ring is called the Chord ring.
- Key k is assigned to the first node whose identifier is equal to or follows (the identifier of) k in the identifier space.
- This node is the successor node of key k, denoted by successor(k).
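A minimal sketch of successor(k), assuming a global view of the sorted node identifiers (real Chord computes this without global knowledge); the asserted values match the 3-bit example on the next slide.

```python
from bisect import bisect_left

def successor(k: int, node_ids: list, m: int) -> int:
    """Return the first node ID equal to or following k on the ring modulo 2^m."""
    ids = sorted(node_ids)
    i = bisect_left(ids, k % 2 ** m)
    return ids[i] if i < len(ids) else ids[0]  # wrap around the circle

# Values from the m = 3 example on the next slide (nodes 0, 1, 3):
assert successor(1, [0, 1, 3], 3) == 1
assert successor(2, [0, 1, 3], 3) == 3
assert successor(6, [0, 1, 3], 3) == 0
```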
14
Consistent Hashing - Successor Nodes
[Figure: identifier circle for m = 3 with identifiers 0–7, nodes 0, 1, 3, and keys 1, 2, 6]
successor(1) = 1
successor(2) = 3
successor(6) = 0
16
Consistent Hashing – Join and Departure
When a node n joins the network, certain keys previously assigned to n’s successor now become assigned to n.
When node n leaves the network, all of its assigned keys are reassigned to n’s successor.
17
Consistent Hashing – Node Join
[Figure: identifier circle showing a node join; keys previously assigned to the new node’s successor move to the new node]
18
Consistent Hashing – Node Dep.
[Figure: identifier circle showing a node departure; the departing node’s keys are reassigned to its successor]
20
A Simple Key Lookup
A very small amount of routing information suffices to implement consistent hashing in a distributed environment.
If each node knows only how to contact its current successor node on the identifier circle, all nodes can be visited in linear order.
Queries for a given identifier can be passed around the circle via these successor pointers until they encounter the node responsible for the key.
21
A Simple Key Lookup
Pseudocode for finding the successor:
// ask node n to find the successor of id
n.find_successor(id)
  if (id ∈ (n, successor])
    return successor;
  else
    // forward the query around the circle
    return successor.find_successor(id);
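A runnable sketch of the same linear lookup (not part of the slides); the Node class and the 3-node ring are illustrative assumptions, and the circular-interval test is the detail the pseudocode leaves implicit.

```python
M = 3  # identifier bits for the small example ring

def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b] modulo 2^M."""
    x, a, b = x % 2 ** M, a % 2 ** M, b % 2 ** M
    return (a < x <= b) if a < b else (x > a or x <= b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self  # fixed up when the ring is built

    def find_successor(self, ident):
        if in_half_open(ident, self.id, self.successor.id):
            return self.successor
        # forward the query around the circle
        return self.successor.find_successor(ident)

# Build the ring used in the slides: 0 -> 1 -> 3 -> 0
n0, n1, n3 = Node(0), Node(1), Node(3)
n0.successor, n1.successor, n3.successor = n1, n3, n0
print(n0.find_successor(6).id)  # -> 0
```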
22
A Simple Key Lookup
The path taken by a query from node 8 for key 54:
23
Scalable Key Location
To accelerate lookups, Chord maintains additional routing information: finger tables.
- Each node n maintains a routing table with up to m entries (m is the number of bits in identifiers), called the finger table.
- The ith entry in the table at node n contains the identity of the first node s that succeeds n by at least 2^(i-1) on the identifier circle: s = successor(n + 2^(i-1)).
- s is called the ith finger of node n, denoted by n.finger(i).
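A small sketch (not from the slides) that computes these entries for the m = 3 example ring, again assuming a global view of the node identifiers; its output matches the finger tables on the next slide.

```python
from bisect import bisect_left

def successor(k, ids, m):
    ids = sorted(ids)
    i = bisect_left(ids, k % 2 ** m)
    return ids[i] if i < len(ids) else ids[0]

def finger_table(n, ids, m):
    """Entry i (1-based) is (start, successor(start)) with start = n + 2^(i-1) mod 2^m."""
    return [((n + 2 ** (i - 1)) % 2 ** m,
             successor((n + 2 ** (i - 1)) % 2 ** m, ids, m))
            for i in range(1, m + 1)]

print(finger_table(0, [0, 1, 3], 3))  # [(1, 1), (2, 3), (4, 0)]
print(finger_table(1, [0, 1, 3], 3))  # [(2, 3), (3, 3), (5, 0)]
print(finger_table(3, [0, 1, 3], 3))  # [(4, 0), (5, 0), (7, 0)]
```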
24
Scalable Key Location – Finger Tables
[Figure: identifier circle for m = 3 with nodes 0, 1, 3; each finger-table start is n + 2^(i-1) mod 2^m]
finger table of node 0 (keys: 6):
  start 1 (0+2^0) → succ. 1
  start 2 (0+2^1) → succ. 3
  start 4 (0+2^2) → succ. 0
finger table of node 1 (keys: 1):
  start 2 (1+2^0) → succ. 3
  start 3 (1+2^1) → succ. 3
  start 5 (1+2^2) → succ. 0
finger table of node 3 (keys: 2):
  start 4 (3+2^0) → succ. 0
  start 5 (3+2^1) → succ. 0
  start 7 (3+2^2) → succ. 0
25
Scalable Key Location – Finger Tables
A finger table entry includes both the Chord identifier and the IP address (and port number) of the relevant node.
The first finger of n is the immediate successor of n on the circle.
26
Scalable Key Location – Example query
The path taken by a query for key 54 starting at node 8:
27
Scalable Key Location – A characteristic
Since each node has finger entries at power of two intervals around the identifier circle, each node can forward a query at least halfway along the remaining distance between the node and the target identifier. From this intuition follows a theorem:
Theorem: With high probability, the number of nodes that must be contacted to find a successor in an N-node network is O(log N).
28
Node Joins and Stabilizations
The most important thing is the successor pointer.
If the successor pointer is kept up to date, which is sufficient to guarantee correctness of lookups, then the finger table can always be verified and repaired.
Each node runs a “stabilization” protocol periodically in the background to update its successor pointer and finger table.
29
Node Joins and Stabilizations
The “stabilization” protocol consists of 6 functions:
- create()
- join()
- stabilize()
- notify()
- fix_fingers()
- check_predecessor()
30
Node Joins – join()
When node n first starts, it calls n.join(n’), where n’ is any known Chord node.
The join() function asks n’ to find the immediate successor of n.
join() does not make the rest of the network aware of n.
31
Node Joins – join()
// create a new Chord ring.
n.create()
  predecessor = nil;
  successor = n;

// join a Chord ring containing node n’.
n.join(n’)
  predecessor = nil;
  successor = n’.find_successor(n);
33
Node Joins – stabilize()
stabilize() notifies node n’s successor of n’s existence, giving the successor the chance to change its predecessor to n.
The successor does this only if it knows of no closer predecessor than n.
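A sketch of stabilize() and notify() matching the description above (written in Python here; the interval helper and class layout are assumptions, the logic follows the prose).

```python
M = 3  # identifier bits, as in the running example

def in_open(x, a, b):
    """True if x lies in the circular open interval (a, b) modulo 2^M."""
    x, a, b = x % 2 ** M, a % 2 ** M, b % 2 ** M
    return (a < x < b) if a < b else (x > a or x < b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

    def stabilize(self):
        # ask the successor for its predecessor; adopt it if it lies between us
        x = self.successor.predecessor
        if x is not None and in_open(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, other):
        # 'other' believes it might be our predecessor
        if self.predecessor is None or in_open(other.id, self.predecessor.id, self.id):
            self.predecessor = other
```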
34
Node Joins – Join and Stabilization
[Figure: node n joins between np and ns, where succ(np) = ns and pred(ns) = np]
- n joins: predecessor = nil; n acquires ns as its successor via some n’
- n runs stabilize: n notifies ns that it is its new predecessor; ns acquires n as its predecessor
- np runs stabilize: np asks ns for its predecessor (now n); np acquires n as its successor; np notifies n; n acquires np as its predecessor
- all predecessor and successor pointers are now correct
- fingers still need to be fixed, but old fingers will still work
35
Node Joins – fix_fingers()
Each node periodically calls fix_fingers() to make sure its finger table entries are correct.
- It is how new nodes initialize their finger tables.
- It is how existing nodes incorporate new nodes into their finger tables.
36
Node Joins – fix_fingers()
// called periodically. refreshes finger table entries.
n.fix_fingers()
  next = next + 1;
  if (next > m)
    next = 1;
  finger[next] = find_successor(n + 2^(next-1));

// checks whether predecessor has failed.
n.check_predecessor()
  if (predecessor has failed)
    predecessor = nil;
37
Node Failures
A key step in failure recovery is maintaining correct successor pointers.
- To help achieve this, each node maintains a successor list of its r nearest successors on the ring.
- If node n notices that its successor has failed, it replaces it with the first live entry in the list.
- Successor lists are stabilized as follows: node n reconciles its list with its successor s by copying s’s successor list, removing its last entry, and prepending s to it.
- If node n notices that its successor has failed, it replaces it with the first live entry in its successor list and reconciles its successor list with its new successor.
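A minimal sketch (not the paper's code) of the successor-list maintenance just described; the attribute names and the r = 3 list length are assumptions.

```python
R = 3  # number of nearest successors to remember (the slide calls this r)

def reconcile_successor_list(node, succ):
    """Copy the successor's list, drop its last entry, and prepend the successor."""
    node.successor_list = ([succ] + succ.successor_list[:-1])[:R]

def handle_successor_failure(node, is_alive):
    """Replace a failed successor with the first live entry in the list."""
    for candidate in node.successor_list:
        if is_alive(candidate):
            node.successor = candidate
            reconcile_successor_list(node, candidate)
            return candidate
    return None  # all r successors failed at once (unlikely for reasonable r)
```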
38
Chord – The Math
Every node is responsible for about K/N keys (N nodes, K keys)
When a node joins or leaves an N-node network, only O(K/N) keys change hands (and only to and from joining or leaving node)
Lookups need O(log N) messages
To re-establish routing invariants and finger tables after a node joins or leaves, only O(log² N) messages are required
39
Sylvia Ratnasamy, Paul Francis, Mark Handley,
Richard Karp, Scott Shenker
A Scalable, Content-Addressable Network
ACIRI, U.C. Berkeley, Tahoe Networks
40
Content-Addressable Network(CAN)
CAN: Internet-scale hash table
Interface:
- insert(key, value)
- value = retrieve(key)
Properties:
- scalable
- operationally simple
- good performance
41
CAN: basic idea
[Figure: an Internet-scale hash table of (K,V) pairs, spread across many nodes]
42
CAN: basic idea
insert(K1,V1)
43
CAN: basic idea
insert(K1,V1)
44
CAN: basic idea
(K1,V1)
45
CAN: basic idea
retrieve (K1)
46
CAN: solution
- virtual Cartesian coordinate space
- the entire space is partitioned amongst all the nodes; every node “owns” a zone in the overall space
- abstraction: CAN stores data at “points” in the space and routes from one “point” to another
- point = the node that owns the enclosing zone
47
CAN: simple example
1
48
CAN: simple example
1 2
49
CAN: simple example
1
2
3
50
CAN: simple example
1
2
3
4
51
CAN: simple example
52
CAN: simple example
I
53
CAN: simple example
node I::insert(K,V)
I
54
CAN: simple example
node I::insert(K,V)
(1) a = hx(K)
55
CAN: simple example
node I::insert(K,V)
(1) a = hx(K)
    b = hy(K)
56
CAN: simple example
node I::insert(K,V)
(1) a = hx(K)
    b = hy(K)
(2) route(K,V) -> (a,b)
57
CAN: simple example
node I::insert(K,V)
(1) a = hx(K)
    b = hy(K)
(2) route(K,V) -> (a,b)
(3) the node owning point (a,b) stores (K,V)
58
CAN: simple example
node J::retrieve(K)
(1) a = hx(K)
    b = hy(K)
(2) route “retrieve(K)” to (a,b); the node owning (a,b) returns (K,V)
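A small sketch (not from the slides) of the key-to-point mapping: two independent hashes play the roles of hx and hy; the SHA-1 salting scheme and the [0, 1) coordinate range are assumptions.

```python
import hashlib

def _h(key: str, salt: str) -> float:
    """Hash 'key' into [0, 1) using SHA-1 with a per-axis salt."""
    digest = hashlib.sha1((salt + key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def can_point(key: str):
    a = _h(key, "x")  # a = hx(K)
    b = _h(key, "y")  # b = hy(K)
    return (a, b)

print(can_point("K1"))  # the node whose zone contains this point stores (K1, V1)
```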
59
Data stored in the CAN is addressed by name (i.e. key), not location (i.e. IP address)
CAN
60
CAN: routing table
61
CAN: routing
[Figure: a message routed hop by hop through neighboring zones between points (a,b) and (x,y)]
62
A node only maintains state for its immediate neighboring nodes
CAN: routing
64
CAN: node insertion
I
new node
1) discover some node “I” already in CAN
65
CAN: node insertion
2) pick random point in space
I
(p,q)
new node
66
CAN: node insertion
(p,q)
3) I routes to (p,q), discovers node J
I
J
new node
67
CAN: node insertion
new node   J
4) split J’s zone in half… the new node owns one half
68
Inserting a new node affects only a single other node and its immediate neighbors
CAN: node insertion
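A sketch (not from the slides) of the zone split, assuming axis-aligned box zones; the CAN paper splits along dimensions following a fixed ordering, while here the longer side is split for simplicity.

```python
def split_zone(zone):
    """Split an axis-aligned zone ((xmin, ymin), (xmax, ymax)) in half.

    Simplification: split the longer side; real CAN cycles through
    the dimensions in a fixed order instead.
    """
    (xmin, ymin), (xmax, ymax) = zone
    if xmax - xmin >= ymax - ymin:            # split along x
        xmid = (xmin + xmax) / 2
        return ((xmin, ymin), (xmid, ymax)), ((xmid, ymin), (xmax, ymax))
    ymid = (ymin + ymax) / 2                  # split along y
    return ((xmin, ymin), (xmax, ymid)), ((xmin, ymid), (xmax, ymax))

# J keeps one half, the new node takes the other; only J and its immediate
# neighbors have to learn about the new zone boundary.
j_half, new_half = split_zone(((0.0, 0.0), (1.0, 0.5)))
print(j_half, new_half)
```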
69
CAN: node failures
Need to repair the space:
- recover the database: soft-state updates; use replication, rebuild the database from replicas
- repair routing: takeover algorithm
70
CAN: takeover algorithm
Simple failures:
- know your neighbor’s neighbors
- when a node fails, one of its neighbors takes over its zone
More complex failure modes:
- simultaneous failure of multiple adjacent nodes
- scoped flooding to discover neighbors
- hopefully, a rare event
71
Only the failed node’s immediate neighbors are required for recovery
CAN: node failures
72
Evaluation
Scalability
Low-latency
Load balancing
Robustness
73
CAN: scalability
For a uniformly partitioned space with n nodes and d dimensions:
- per node, the number of neighbors is 2d
- the average routing path is (d/4)·n^(1/d) hops
- simulations show that the above results hold in practice
Can scale the network without increasing per-node state
(compare Chord: log(n) neighbors with log(n) hops)
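For a rough feel (an illustrative calculation, not a figure from the paper): with d = 10 and n = 131,072 nodes, each node keeps 2d = 20 neighbors and the average path is about (10/4)·131072^(1/10) ≈ 8 hops.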
74
CAN: low-latency
Problem:
- latency stretch = (CAN routing delay) / (IP routing delay)
- application-level routing may lead to high stretch
Solution:
- increase the number of dimensions
- heuristics: RTT-weighted routing, multiple nodes per zone (peer nodes)
75
CAN: low-latency
[Figure: latency stretch vs. number of nodes (16K–131K), #dimensions = 2, with and without heuristics]
76
CAN: low-latency
[Figure: latency stretch vs. number of nodes (16K–131K), #dimensions = 10, with and without heuristics]
77
CAN: load balancing
Two pieces:
Dealing with hot-spots (popular (key,value) pairs):
- nodes cache recently requested entries
- an overloaded node replicates popular entries at its neighbors
Uniform coordinate space partitioning:
- uniformly spreads (key,value) entries
- uniformly spreads out routing load
78
Uniform Partitioning
Added check: at join time, pick a zone, check the neighboring zones, and split the largest of them.
79
Routing resilience
[Figure: a greedy route from source to destination through the coordinate space]
80
Routing resilience
81
Routing resilience
82
Routing resilience
83
Node X::route(D)
  if (X cannot make progress towards D)
    • check if any neighbor of X can make progress
    • if yes, forward the message to one such neighbor
Routing resilience
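A sketch of this forwarding rule (not from the slides), assuming each node object exposes its zone center and a neighbors list; greedy forwarding first, then the fallback described above.

```python
import math

def next_hop(node, destination):
    """Greedy CAN forwarding with a simple fallback when stuck.

    Assumes each node exposes .center (its zone's center point) and
    .neighbors (the nodes owning adjacent zones).
    """
    here = math.dist(node.center, destination)
    closer = [nb for nb in node.neighbors if math.dist(nb.center, destination) < here]
    if closer:
        # normal case: forward to the neighbor closest to the destination
        return min(closer, key=lambda nb: math.dist(nb.center, destination))
    # stuck: forward to any neighbor that can itself make progress
    for nb in node.neighbors:
        if any(math.dist(nn.center, destination) < here for nn in nb.neighbors):
            return nb
    return None  # routing failed
```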
84
Routing resilience
85
Routing resilience
[Figure: Pr(successful routing) vs. number of dimensions (2–10); CAN size = 16K nodes, Pr(node failure) = 0.25]
86
Routing resilience
[Figure: Pr(successful routing) vs. Pr(node failure) (0–0.75); CAN size = 16K nodes, #dimensions = 10]
87
Summary
CAN: an Internet-scale hash table
- Scalability: O(d) per-node state
- Low-latency routing: simple heuristics help a lot
- Robust: decentralized, can route around trouble
88
YAPPERS: A Peer-to-Peer Lookup Service over Arbitrary Topology
Qixiang Sun, Prasanna Ganesan, Hector Garcia-Molina
Stanford University
89
Problem
Where is X?
90
Problem (2)
1. Search
2. Node join/leave
3. Register/remove content
A
B
C
91
Background
Gnutella-style:
- join: anywhere in the overlay
- register: do nothing
- search: flood the overlay
92
Background (2)
Distributed hash table (DHT):
- join: a unique location in the overlay
- register: place a pointer at a unique node
- search: route towards the unique node
(e.g., Chord, CAN, ...)
93
Background (3)
Gnutella-style:
+ Simple
+ Local control
+ Robust
+ Arbitrary topology
– Inefficient
– Disturbs many nodes
DHT:
+ Efficient search
– Restricted overlay
– Difficulty with dynamism
94
Motivation
Best of both worlds: Gnutella’s local interactions with DHT-like efficiency
95
Partition Nodes
Given any overlay, first partition nodes into buckets (colors) based on hash of IP
96
Partition Nodes
Given any overlay, first partition nodes into buckets (colors) based on hash of IP
97
Partition Nodes (2)
Around each node, there is at least one node of each color.
May require backup color assignments.
98
Register Content
Partition the content space into buckets (colors) and register pointers at “nearby” nodes:
- register red content locally
- register blue content at a blue node
The nodes around Z form a small hash table!
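A simplified sketch of the idea (not the paper's code): nodes and content are colored by hashing, and a pointer is registered at a same-colored node in the immediate neighborhood; the hash choice, the IP-based node set, and the backup fallback shown here are assumptions.

```python
import hashlib

def color(s: str, num_colors: int) -> int:
    """Bucket (color) of a node IP or a content key."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % num_colors

def register(content_key, my_ip, neighborhood_ips, num_colors):
    """Register a pointer for content_key at a node of the key's color in IN."""
    c = color(content_key, num_colors)
    candidates = [ip for ip in [my_ip] + neighborhood_ips
                  if color(ip, num_colors) == c]
    if candidates:
        return candidates[0]  # a nearby node of the matching color
    # no node of that color nearby: fall back to a (simplified) backup assignment
    return min([my_ip] + neighborhood_ips,
               key=lambda ip: color(ip, num_colors))

print(register("some-file.mp3", "10.0.0.1", ["10.0.0.2", "10.0.0.3"], 8))
```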
99
Searching Content
Start at a “nearby” colored node, search other nodes of the same color.
[Figure: starting at a nearby node of the query’s color, the search hops between same-colored nodes U, V, W, X, Y, Z]
100
Searching Content (2)
There is effectively a smaller overlay for each color, searched with a Gnutella-style flood.
Fan-out = degree of the nodes in that smaller overlay
101
Recap
- node join: anywhere in the overlay
- register: content at nearby node(s) of the appropriate color
- search: start at a nearby node of the search color and then flood nodes of the same color
102
Design Issues
How to build a small hash table around each node, i.e., assign colors?
How to connect nodes of the same color?
103
Neighbors:
- IN: immediate neighborhood
- EN: extended neighborhood
- F(A): frontier of node A
104
Small-scale Hash Table
- Small = all nodes within h hops (e.g., h = 2), i.e., the immediate neighborhood IN
- Consistent across overlapping hash tables
- Stable when nodes enter/leave
[Figure: nodes A, B, C, X and their overlapping neighborhoods]
105
Small-scale Hash Table (2)
- Fixed number of buckets (colors)
- Determine a node’s bucket (color) from the hash value of its IP address
- Multiple nodes of the same color: pick any one of them to store the key
- No node of a color: backup assignment
106
Searching the Overlay
Find another node of the same color in a “nearby” hash table.
[Figure: all nodes within h hops of A; B and C are frontier nodes]
Need to track all nodes within 2h+1 hops.
107
Searching the Overlay (2)
For a color C and each frontier node v:
1. determine which nodes v might contact to search for color C
2. contact these nodes
Theorem: Regardless of the starting node, one can search all nodes of every color.
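A sketch (not the paper's code) of the neighborhood bookkeeping behind this: a BFS over the overlay graph finds the nodes within h hops and the frontier through which the same-color search continues; the adjacency-dict representation is an assumption.

```python
from collections import deque

def nodes_within(overlay, start, h):
    """BFS: map each node within h hops of start to its hop distance."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue
        for v in overlay[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def frontier(overlay, start, h):
    """Nodes exactly h hops away, through which the same-color search continues."""
    return {v for v, d in nodes_within(overlay, start, h).items() if d == h}

overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
print(nodes_within(overlay, "A", 2))  # {'A': 0, 'B': 1, 'C': 1, 'D': 2}
print(frontier(overlay, "A", 2))      # {'D'}
```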
108
Recap
Hybrid approach:
- around each node, act like a hash table
- flood the relevant nodes in the entire network
What do we gain?
- respect the original overlay
- efficient search for popular data
- avoid disturbing nodes unnecessarily
109
Brief Evaluation
Using a 24,702-node Gnutella snapshot as the underlying overlay.
We study:
- the number of nodes contacted per query when searching the entire network
- the trade-off of using our hybrid approach when flooding the entire network
110
Nodes Searched per Query
[Figure: fraction of nodes searched vs. number of buckets (colors), 0–50; the fraction is limited by the number of nodes “nearby”]
111
Trade-off
Fan-out = degree of each colored node when flooding “nearby” nodes of the same color
Average fan-out: vanilla 835; with heuristics 82
• good at searching nearby nodes quickly
• bad at searching the entire network
112
Overloading a Node
A node may have many colors even if it has a large neighborhood.
113
Enhancements
Prune fringe nodes (low-connectivity nodes).
Biased backup node assignment:
- Node X can assign a backup color to node Y if and only if α·|IN(X)| > |IN(Y)|, where α controls the relative sizes of the neighborhoods
- Forbid a node with a small immediate neighborhood from assigning backup colors to a node with a large immediate neighborhood
114
Conclusion
Does YAPPERS work?
YES:
- respects the original overlay
- searches efficiently in a small area
- disturbs fewer nodes than Gnutella
- handles node arrival/departure better than a DHT
NO:
- large fan-out (vanilla flooding won’t work)
115
For More Information
A short position paper advocating locally-organized P2P systems: http://dbpubs.stanford.edu/pub/2002-60
Other P2P work at Stanford: http://www-db.stanford.edu/peers