1
Distributed Data Structures for a
Peer-to-peer system
Advisor: James Aspnes
Committee: Joan Feigenbaum, Arvind Krishnamurthy, Antony Rowstron [MSR, Cambridge, UK]
Gauri Shah
2
P2P system
• Very large number of peers (nodes).
• Peers store resources identified by keys.
• Peers are subject to crash failures.
• Question: how to locate resources efficiently?
[Figure: a network of peers, each storing resources identified by keys.]
3
A brief history
June 1999: Shawn Fanning starts Napster.
Dec. 1999: RIAA sues Napster for copyright infringement.
July 2001: Napster is shut down!
Napster clones: KaZaA, Gnutella, Morpheus, MojoNation, ...
Academic research: CAN, Chord, Pastry, Tapestry, skip graphs, ...
Distributed computing: SETI@home, folding@home, ...
4
Answer: Central server?
[Figure: clients sending "? x" queries for resource x to a central Napster server.]
• Central server bottleneck
• Wasted power at clients
• No fault tolerance
Using server farms?
7
What would we like?
• Data availability
  - Decentralization
  - Scalability
  - Load balancing
  - Fault-tolerance
• Network maintenance
  - Dynamic node addition/deletion
  - Repair mechanism
• Efficient searching
  - Incorporating proximity
  - Incorporating locality
8
Distributed Hash Tables
[Figure: resource keys are hashed to node IDs (HASH: resource -> key -> node ID). The nodes v1..v4 form a virtual overlay network of virtual links on top of the physical network; a virtual route v3 -> v1 -> v4 corresponds to a longer actual route through physical links.]
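A minimal runnable sketch of this idea (my own illustration, not any particular system's protocol; the 16-bit identifier space and the peer names are arbitrary): keys and node names are hashed into one identifier space, and each key lives at the first node whose ID follows the key's hash on the ring.

import hashlib
from bisect import bisect_left

def hash_id(name, bits=16):
    """Hash a key or node name into the identifier space [0, 2^bits)."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

class ToyDHT:
    """Each key is stored on the first node whose ID follows hash(key)
    on the identifier ring (consistent hashing)."""
    def __init__(self, node_names):
        self.ring = sorted(hash_id(name) for name in node_names)
        self.owner = {hash_id(name): name for name in node_names}

    def locate(self, key):
        h = hash_id(key)
        i = bisect_left(self.ring, h) % len(self.ring)   # wrap around the ring
        return self.owner[self.ring[i]]

dht = ToyDHT(["peer-a", "peer-b", "peer-c", "peer-d"])
print(dht.locate("song.mp3"))   # the same key always maps to the same peer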
9
Existing DHT systems
[Figure: CAN [RFHKS '01] partitions a d-dimensional coordinate space among nodes (shown with d=2); Chord [SMKKB '01] places node IDs on a ring with finger pointers; Pastry [RD '01] and Tapestry [ZKJ '01] route by matching the target ID digit by digit.]
O(log n) time per search, O(log n) space per node.
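As a concrete illustration of the O(log n) search bound, here is a simplified sketch of Chord-style finger routing (assumptions mine: plain integer IDs, a static node set, and only loose handling of ring wraparound; this is not the full Chord protocol):

from bisect import bisect_left

def successor(nodes, x):
    """First node ID >= x; falls back to the smallest ID past the end
    (loose stand-in for ring wraparound)."""
    i = bisect_left(nodes, x)
    return nodes[i] if i < len(nodes) else nodes[0]

def lookup(nodes, n, key, m=10):
    """Finger-style greedy lookup: finger i of node n points to
    successor(n + 2^i); each hop roughly halves the remaining distance,
    giving O(log n) hops."""
    hops = 0
    while True:
        fingers = [successor(nodes, n + (1 << i)) for i in range(m)]
        closer = [f for f in fingers if n < f <= key]
        if not closer:
            return successor(nodes, key), hops   # next node stores the key
        n, hops = max(closer), hops + 1

nodes = sorted([5, 90, 200, 350, 500, 720, 800, 950])
print(lookup(nodes, 5, 723))  # -> (800, 1)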
10
What does this give us?
• Data availability
  - Decentralization
  - Scalability
  - Load balancing
  - Fault-tolerance
• Network maintenance
  - Dynamic node addition/deletion
  - Repair mechanism
• Efficient searching
  - Incorporating proximity
  - Incorporating locality
11
Analytical model [Aspnes-Diamadi-Shah, PODC 2002]
Questions:
• Performance with failures?
• Optimal link distribution for greedy routing?
• Construction and dynamic maintenance?
12
Our approach (based on [Kleinberg 1999])
Simple metric space: the 1D line. Hash(key) = location in the metric space.
2 short-hop links: immediate neighbors.
k long-hop links: chosen from the inverse-distance distribution
    Pr[edge(u,v)] = (1/d(u,v)) / Σ_{v'} 1/d(u,v').
Greedy routing: forward the message to the neighbor closest to the target in the metric space.
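A small simulation sketch of this construction (my own illustration; an n-cycle stands in for the 1D line to avoid boundary cases, and n, k, and the trial count are arbitrary choices):

import random

def build_links(n, k):
    """Each node u gets 2 short-hop links (u±1) plus k long-hop links,
    each drawn with Pr[v] proportional to 1/d(u,v) on an n-cycle."""
    def ring_dist(a, b):
        return min((a - b) % n, (b - a) % n)
    links = []
    for u in range(n):
        others = [v for v in range(n) if v != u]
        weights = [1.0 / ring_dist(u, v) for v in others]
        long_hops = random.choices(others, weights=weights, k=k)
        links.append({(u - 1) % n, (u + 1) % n, *long_hops})
    return links, ring_dist

def greedy_route(links, ring_dist, src, dst):
    """Forward to the neighbor closest to the target; the short hops
    guarantee progress, the long hops provide the speed-up."""
    hops, u = 0, src
    while u != dst:
        u = min(links[u], key=lambda v: ring_dist(v, dst))
        hops += 1
    return hops

random.seed(0)
n, k = 1024, 5
links, dist = build_links(n, k)
routes = [greedy_route(links, dist, random.randrange(n), random.randrange(n))
          for _ in range(200)]
print(sum(routes) / len(routes))   # average hop count over random pairs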
13
Performance with failures
Each node has k ∈ [1..log n] long-hop links.
Without failures: routing time O((log² n)/k).
With failures: each node/link fails with probability p; routing time O((log² n)/((1-p)·k)).
14
Search with random failures
[Plot: fraction of failed searches vs. probability of node failure, comparing plain failure with random-reroute and backtrack strategies; n = 131072 nodes, log n = 17 links; non-faulty source and target.]
15
Lower bounds?
Is it possible to design a link distribution that beats the O(log² n) routing bound given by the 1/d distribution?
We give a lower bound on routing time as a function of the number of links per node.
16
Lower bounds
Random graph G: node x has k links on average, each chosen independently*; x also links to (x-1) and (x+1). Let the target be 0.
Expected time to reach 0 from a point chosen uniformly from 1..n:
Routing time: Ω((log² n)/(k log log n)).
* Assumes the probability of choosing links is symmetric about 0 and unimodal.
This Ω(log² n) is worse than the O(log n) achievable on a tree: the cost of assuming symmetry between nodes.
17
Heuristic for construction
A new node chooses its neighbors using the inverse-distance distribution and links to the live nodes closest to the chosen (possibly absent) ones. It also selects older nodes to point to it.
[Figure: new node y joins near older node x; legend: ideal link, initial link, adjusted link, new link, absent node.]
The same strategy is used for repairing broken links.
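A sketch of the link-adjustment step (my own rendering with arbitrary parameters: IDs live on a ring of size n, and `live` is the set of nodes currently present):

import random
from bisect import bisect_left

def join_links(live, u, k, n):
    """Join heuristic sketch: draw k ideal targets for node u from the
    inverse-distance distribution over the whole ID space, then round each
    (possibly absent) target to the nearest live node."""
    def ring_dist(a, b):
        return min((a - b) % n, (b - a) % n)
    candidates = [v for v in range(n) if v != u]
    weights = [1.0 / ring_dist(u, v) for v in candidates]
    ideal = random.choices(candidates, weights=weights, k=k)
    live_sorted = sorted(live - {u})
    def nearest_live(t):
        i = bisect_left(live_sorted, t)
        near = {live_sorted[(i - 1) % len(live_sorted)],
                live_sorted[i % len(live_sorted)]}
        return min(near, key=lambda v: ring_dist(v, t))
    return {nearest_live(t) for t in ideal}

random.seed(2)
live = set(random.sample(range(4096), 300))   # sparse set of live node IDs
u = min(live)
print(join_links(live, u, k=8, n=4096))       # adjusted long-hop neighbors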
18
Derived link distribution
[Log-log plot: probability of link vs. length of link, derived distribution against the ideal 1/d distribution; n = 16384 nodes, log n = 14 links.]
19
So far...
• Data availability
  - Decentralization
  - Scalability
  - Load balancing
  - Fault-tolerance
• Network maintenance
  - Dynamic node addition/deletion
  - Repair mechanism
• Efficient searching
  - Incorporating proximity
  - Incorporating locality
20
Disadvantages of DHTs
• No support for locality. A user who requests www.cnn.com is likely to next request www.cnn.com/weather; the system should use information from the first search to improve the performance of the second. DHTs cannot do this, as hashing destroys locality.
• No support for complex queries.
21
Skip list [Pugh ’90]
A data structure based on a linked list.
[Figure: elements A G J M R W in a sorted list between HEAD and TAIL at level 0; coin flips (1/0) promote A, J, M to level 1 and J to level 2.]
Each element is linked at the next higher level with probability 1/2.
22
Searching in a skip list
[Figure: search for key 'R' starting at HEAD on level 2, passing through J (level 1) and M (level 0), with -∞/+∞ sentinels; a comparison that overshoots fails and drops the search one level.]
Time for search: O(log m) on average. Number of pointers per element: O(1) on average.
[m = number of elements in the skip list]
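A compact runnable sketch of the structure and its top-down search (standard skip-list logic rather than code from the talk; the MAX_LEVEL cap is an arbitrary choice):

import random

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * (level + 1)   # next[i] = right neighbor at level i

class SkipList:
    MAX_LEVEL = 32   # arbitrary cap for the sketch

    def __init__(self):
        self.head = SkipNode(None, self.MAX_LEVEL)   # sentinel head

    def insert(self, key):
        level = 0
        while level < self.MAX_LEVEL and random.random() < 0.5:
            level += 1                      # promoted with probability 1/2
        node = SkipNode(key, level)
        cur = self.head
        for i in range(self.MAX_LEVEL, -1, -1):
            while cur.next[i] is not None and cur.next[i].key < key:
                cur = cur.next[i]
            if i <= level:                  # splice in at this level
                node.next[i] = cur.next[i]
                cur.next[i] = node

    def search(self, key):
        """Move right while undershooting, drop a level on overshoot;
        O(log m) comparisons on average."""
        cur = self.head
        for i in range(self.MAX_LEVEL, -1, -1):
            while cur.next[i] is not None and cur.next[i].key < key:
                cur = cur.next[i]
        nxt = cur.next[0]
        return nxt is not None and nxt.key == key

sl = SkipList()
for k in "AGJMRW":
    sl.insert(k)
print(sl.search("R"), sl.search("B"))  # -> True False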
23
Skip lists for P2P?
Advantages
• O(log m) expected search time.
• Retains locality.
• Supports dynamic additions/deletions.
Disadvantages
• Cannot reduce load on top-level elements.
• Cannot survive partitioning by failures.
Problem: lack of redundancy.
24
A skip graph [Aspnes-Shah, SODA 2003]
[Figure: elements A G J M R W with membership vectors 000, 100, 001, 011, 110, 101. Level 0 is one sorted list of all elements; level 1 splits it by the first bit (A J M vs. G W R); level 2 splits further by two-bit prefixes.]
Each element links at level i to the elements whose membership vectors match its own in a prefix of length i.
Average O(log m) pointers per element [m = number of resources].
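A centralized sketch of the structure and its search (my own rendering: sorted Python lists stand in for the doubly-linked lists at each level, and the 20-bit membership vectors are an arbitrary cap):

import random
from collections import defaultdict

BITS = 20  # cap on membership-vector length for this sketch

def membership():
    """Random membership vector: one fresh coin flip per level."""
    return tuple(random.randrange(2) for _ in range(BITS))

def build_skip_graph(keys):
    """At level i, elements whose membership vectors share a prefix of
    length i form a sorted list (standing in for a doubly-linked list)."""
    vec = {k: membership() for k in keys}
    levels = []
    i = 0
    while True:
        groups = defaultdict(list)
        for k in sorted(keys):
            groups[vec[k][:i]].append(k)
        levels.append(dict(groups))
        if all(len(g) <= 1 for g in groups.values()) or i >= BITS:
            break
        i += 1
    return vec, levels

def search(vec, levels, start, target):
    """Greedy search: at each level, walk toward the target inside the
    list containing the current element; drop a level when the next step
    would overshoot. Restricted to the lists containing `start`, the
    structure traversed is exactly a skip list."""
    cur = start
    for i in range(len(levels) - 1, -1, -1):
        lst = levels[i].get(vec[cur][:i], [])
        if cur not in lst:          # above the top of cur's tower
            continue
        j = lst.index(cur)
        while j + 1 < len(lst) and lst[j + 1] <= target:
            j += 1
        while j > 0 and lst[j - 1] >= target:
            j -= 1
        cur = lst[j]
    return cur                      # element closest to `target` at level 0

random.seed(1)
vec, levels = build_skip_graph(list("AGJMRW"))
print(search(vec, levels, "W", "J"))  # -> 'J'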
25
Search: expected O(log m)
Same performance as skip lists and DHTs.
[Figure: a search descending through levels 2, 1, 0 of the skip graph.]
Restricting attention to the lists containing the starting element of the search, we get a skip list.
26
Resources vs. nodes
DHTs: elements are nodes. Skip graphs: elements are resources, so one physical node may hold many elements.
[Figure: physical network of nodes A..E; the DHT overlay links the nodes themselves, while the skip graph's level 0 links the resources stored across them.]
This does not affect search performance or load balancing, but it increases the number of pointers at each node.
27
SkipNet [HJSTW '03]
[Figure: a level-0 list ordered by name: com.apple, com.ibm, com.microsoft, com.sun; within com.ibm, a distributed hash table spreads documents (a.htm, r.htm, ..., f.htm, g.htm, ...) across machines com.ibm/m1 through com.ibm/m4.]
28
So far...
• Data availability
  - Decentralization
  - Scalability
  - Load balancing
  - Fault-tolerance
• Network maintenance
  - Dynamic node addition/deletion
  - Repair mechanism
• Efficient searching
  - Incorporating proximity
  - Incorporating locality
29
Insertion – 1
[Figure: new element J (membership vector 001) joins a skip graph containing A 000, G 100, M 011, R 110, W 101, starting at a buddy element.]
Starting at the buddy, find the nearest key at level 0: a range query looking for the key closest to the new key.
Takes O(log m) on average.
30
Insertion – 2
[Figure: J (001) links in at levels 1 and 2 next to neighbors whose membership vectors match prefixes of increasing length.]
Search for a matching prefix of increasing length at each level.
Adds O(1) time per level. Total time for insertion: O(log m), the same as most DHTs.
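Continuing the centralized sketch from the skip graph slide (again my own rendering, not the distributed message-passing protocol): insertion searches from the buddy for its level-0 position, then links in level by level until it is alone in a list.

from bisect import insort

def insert(vec, levels, buddy, new_key):
    """Insert new_key starting from the existing element `buddy`: a level-0
    search finds the neighbor closest to the new key, then the new element
    joins one list per level, matching one more membership bit each time,
    until it is alone in its list. O(log m) expected in total."""
    vec[new_key] = membership()
    neighbor = search(vec, levels, buddy, new_key)
    # In the distributed protocol, linking proceeds outward from `neighbor`;
    # in this centralized sketch we simply update the sorted lists.
    i = 0
    while True:
        prefix = vec[new_key][:i]
        insort(levels[i].setdefault(prefix, []), new_key)  # O(1) new links
        if len(levels[i][prefix]) == 1 or i >= BITS:
            break                     # alone: top of this element's tower
        i += 1
        if i >= len(levels):
            levels.append({})

insert(vec, levels, "A", "K")
print(search(vec, levels, "W", "K"))  # -> 'K'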
31
So far...
• Data availability
  - Decentralization
  - Scalability
  - Load balancing
  - Fault-tolerance
• Network maintenance
  - Dynamic node addition/deletion
  - Repair mechanism
• Efficient searching
  - Incorporating proximity
  - Incorporating locality
32
Locality and range queries
• Find any key < F or > F.
• Find the largest key < F.
• Find the least key > F.
• Find all keys in the interval [D..O] (sketched below).
• Initial element insertion happens at level 0.
[Figure: level-0 lists D F A I and D F A I L O S.]
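Reusing the same centralized sketch (my own illustration): one O(log m) search locates the low end of the interval, and the level-0 list supplies the rest of the answer.

def range_query(vec, levels, start, lo, hi):
    """Find all keys in [lo..hi]: one O(log m) search locates the element
    nearest lo, then the interval is swept along the level-0 list, so the
    total cost is O(log m + number of results)."""
    cur = search(vec, levels, start, lo)
    lst = levels[0][()]          # level 0 links every element (empty prefix)
    j = lst.index(cur)
    if lst[j] < lo:              # the search may stop just below the interval
        j += 1
    out = []
    while j < len(lst) and lst[j] <= hi:
        out.append(lst[j])
        j += 1
    return out

print(range_query(vec, levels, "W", "D", "O"))
# -> ['G', 'J', 'M'] (plus 'K' if the insertion above was run)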
33
Further applications of locality
1. Version control
Example: find the latest news before today, i.e. find the largest key < news:05/14.
[Figure: level-0 list news:01/31, news:03/01, news:03/18, news:04/03, news:05/14.]
34
2. Data replication
Example: find any copy of some Britney Spears song: search for britney*.
[Figure: replicas britney02, britney03, britney04 adjacent at level 0 and separated across levels 1 and 2.]
Provides hot-spot management and survivability.
35
What’s left?
• Data availability
  - Decentralization
  - Scalability
  - Load balancing
  - Fault-tolerance
• Network maintenance
  - Dynamic node addition/deletion
  - Repair mechanism
• Efficient searching
  - Incorporating proximity
  - Incorporating locality
36
Fault tolerance
How do failures affect skip graph performance?
Random failures: randomly chosen elements fail. Studied experimentally. [Experiments may not produce the worst failure pattern.]
Adversarial failures: an adversary carefully chooses the elements that fail. Studied theoretically.
37
Random failures
[Plot: size of the largest connected component as a fraction of live elements vs. probability of failure (0.00 to 0.95); 131072 elements.]
38
Searches with random failures
[Plot: fraction of failed searches vs. probability of failure (0.0 to 0.6); 131072 elements, 10000 messages; non-faulty source and target.]
39
Adversarial failures
dA = elements adjacent to A but not in A.
Expansion ratio = min |dA|/|A| over sets A with 1 ≤ |A| ≤ m/2.
Theorem: A skip graph with m elements has expansion ratio Ω(1/log m) w.h.p.
So f failures can isolate only O(f·log m) elements: isolating a set A requires at least |dA| ≥ |A|/log m failures.
40
Need for repair mechanism
[Figure: a skip graph with failed elements, leaving dangling links across levels 0, 1, and 2.]
Node failures can leave the skip graph in an inconsistent state.
41
Basic repair action
If an element detects a missing neighbor, it tries to patch the link using the other levels.
[Figure: in the level-0 list 1 2 4 5 6, element 3 has failed; its neighbors reconnect using the higher-level lists 1 5 6 and 1 5.]
It also relinks at the other lower levels.
Eventually each connected component of the disrupted skip graph reorganizes itself into an ideal skip graph.
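In the centralized sketch used earlier, the end state of this process can be imitated by splicing failed elements out of every list; the real protocol is distributed, with each element patching its own missing neighbors through its other levels:

def splice_out(vec, levels, failed):
    """Centralized stand-in for the repair process: every list containing a
    failed element splices it out, reconnecting its left and right
    neighbors. The distributed protocol converges to the same end state
    with each element repairing its own links via the other levels."""
    for groups in levels:
        for prefix in list(groups):
            groups[prefix] = [k for k in groups[prefix] if k not in failed]
    for k in failed:
        vec.pop(k, None)

splice_out(vec, levels, failed={"M"})
print(search(vec, levels, "W", "M"))  # -> a live element next to M's old spot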
42
Ideal skip graph
Let xR_i (xL_i) denote the right (left) neighbor of x at level i.
Invariant: xL_i < x < xR_i, and (xL_i)R_i = (xR_i)L_i = x.
Successor constraints (if xL_i and xR_i exist):
xR_i = (xR_{i-1})^{k'} for some k', and xL_i = (xL_{i-1})^{k} for some k; that is, the level-i neighbor is reached by repeating the level-(i-1) neighbor step some number of times.
[Figure: at level i-1, x (prefix ..00..) is followed by a ..01.. element and then a ..00.. element; that next ..00.. element is xR_i at level i.]
43
Constraint violation
A neighbor at level i is not present at level (i-1).
[Figure: elements with prefixes ..00.. and ..01.. across levels i-1, i, i+1; the violation is repaired by a merge at level i-1.]
45
Network congestion
We are interested in the average traffic through an element u, i.e. the number of searches from a source s to a destination t that pass through u.
Theorem: Let dist(u,t) = d. Then the probability that a search from s to t passes through u is < 2/(d+1), where V = {elements v : u ≤ v ≤ t} and |V| = d+1.
Elements near a popular target get loaded, but the effect drops off rapidly with distance.
46
Predicted vs. real load
[Plot: fraction of messages vs. element location (76400 to 76600), predicted load against actual load; destination = 76500.]
47
Knowledge of key space
DHTs require the size of the key space to be known initially. Skip graphs do not!
[Figure: elements E (membership bit 1) and Z (bit 0) linked at levels 0 and 1; when J is inserted, the existing elements extend their membership vectors with new bits so the new element separates from them at higher levels.]
Old elements extend their membership vectors as required when new elements arrive.
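A tiny sketch of this trick (my own illustration): a membership vector whose bits are flipped lazily on first use, so neither the key space size nor the number of levels has to be fixed in advance.

import random

class LazyVector:
    """Membership vector grown on demand: bit i is generated by a fair coin
    flip the first time it is examined, so no a-priori bound on the key
    space or the number of levels is needed."""
    def __init__(self):
        self.bits = []

    def bit(self, i):
        while len(self.bits) <= i:
            self.bits.append(random.randrange(2))  # flip a new bit lazily
        return self.bits[i]

    def prefix(self, i):
        return tuple(self.bit(j) for j in range(i))

v = LazyVector()
print(v.prefix(3))  # generates three fresh bits
print(v.prefix(5))  # extends the same vector by two more bits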
48
Similarities with DHTs
• Data availability
  - Decentralization
  - Scalability
  - Load balancing
  - Fault-tolerance [random failures]
• Network maintenance
  - Dynamic node addition/deletion
  - Repair mechanism
• Efficient searching
  - Incorporating proximity
  - Incorporating locality
49
Differences

Property                         DHTs       Skip graphs
Tolerance of adversarial faults  Not yet    Yes
Locality                         No         Yes
Key space size                   Required   Not required
Proximity                        Partially  No
50
Open Problems
• Design a more efficient repair mechanism.
• Incorporate proximity.
• Study the effect of byzantine/selfish behavior.
• Provide locality together with state minimization.
Some promising approaches:
• Solution: composition of data structures [AS '03, ZSZ '03]
• Tool: locality-sensitive hashing [LS '96, IMRV '97]