
1

Distributed Data Structures for a Peer-to-peer system

Advisor: James Aspnes
Committee: Joan Feigenbaum, Arvind Krishnamurthy, Antony Rowstron [MSR, Cambridge, UK]

Gauri Shah

2

P2P system

• Very large number of peers (nodes).

• Store resources identified by keys.

• Peers subject to crash failures.

• Question: how to locate resources efficiently?

[Figure: peers storing resources identified by keys.]

3

A brief history

June 1999: Shawn Fanning starts Napster.
Dec. 1999: RIAA sues Napster for copyright infringement.
July 2001: Napster is shut down!

Napster clones: KaZaA, Gnutella, Morpheus, MojoNation, ...
Academic research: CAN, Chord, Pastry, Tapestry, skip graphs, ...
Distributed computing: SETI@home, folding@home, ...

4

Answer: Central server? [Napster]

• Central server bottleneck.
• Wasted power at clients.
• No fault tolerance.

Using server farms?

5

Answer: Flooding? [Gnutella]

• Too much traffic.
• Available resources 'out-of-reach'.

6

Answer: Super-peers? [KaZaA/Morpheus]

• Inherently unscalable.

[Figure: a two-tier network in which ordinary peers connect through super-peers.]

7

What would we like?

• Data availability
  • Decentralization
  • Scalability
  • Load balancing
  • Fault-tolerance
• Network maintenance
  • Dynamic node addition/deletion
  • Repair mechanism
• Efficient searching
  • Incorporating proximity
  • Incorporating locality

8

Distributed Hash Tables

Resources and node ids are hashed into a common id space; a virtual overlay network is built on top of the physical network.

[Figure: a virtual route v1 → v3 → v4 over virtual links in the overlay maps to an actual route over physical links; node ids and keys share the hashed id space.]

9

Existing DHT systems

• CAN [RFHKS '01]: d-dimensional coordinate space.
• Chord [SMKKB '01]: ring of node ids.
• Pastry [RD '01], Tapestry [ZKJ '01]: prefix-based routing on node ids.

[Figures: a 2-D CAN coordinate space; a Chord ring with ids 0-7; prefix-routing examples with ids 123, 135, 327, 360, 365, 368, 427, 768.]

O(log n) time per search, O(log n) space per node.

10

What does this give us?

• Data availability
  • Decentralization
  • Scalability
  • Load balancing
  • Fault-tolerance
• Network maintenance
  • Dynamic node addition/deletion
  • Repair mechanism
• Efficient searching
  • Incorporating proximity
  • Incorporating locality

11

Analytical model [Aspnes-Diamadi-Shah, PODC 2002]

Questions:

• Performance with failures?

• Optimal link distribution for greedy routing?

• Construction and dynamic maintenance?

12

Our approach (Based on [Kleinberg 1999])

Simple metric space: the 1D line. Hash(key) = location in the metric space.

Each node has 2 short-hop links (immediate neighbors) and k long-hop links drawn from the inverse-distance distribution:

Pr[edge(u,v)] = (1/d(u,v)) / Σ_v' (1/d(u,v'))

Greedy routing: forward the message to the neighbor closest to the target in the metric space.
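To make the model concrete, here is a minimal Python sketch (not the paper's code) of greedy routing with inverse-distance long hops. The wraparound ring metric, node counts, and all names are illustrative assumptions.

```python
import random

def dist(u, v, n):
    """Distance in the metric space (assumption: a wrapped 1D line, i.e. a ring)."""
    return min((u - v) % n, (v - u) % n)

def build_links(n, k):
    """Each node u gets 2 short hops (immediate neighbors) plus k long hops,
    drawn with Pr[edge(u,v)] = (1/d(u,v)) / sum over v' of (1/d(u,v'))."""
    links = {}
    for u in range(n):
        others = [v for v in range(n) if v != u]
        weights = [1.0 / dist(u, v, n) for v in others]
        long_hops = random.choices(others, weights=weights, k=k)
        links[u] = {(u - 1) % n, (u + 1) % n} | set(long_hops)
    return links

def greedy_route(links, s, t, n):
    """Greedy routing: repeatedly forward to the neighbor closest to the target."""
    hops, u = 0, s
    while u != t:
        u = min(links[u], key=lambda v: dist(v, t, n))
        hops += 1
    return hops
```

For example, `build_links(1024, 10)` followed by `greedy_route(...)` lets one measure hop counts as n and k vary; the short hops guarantee progress, so the loop always terminates.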

13

Performance with failures

Each node has k ∈ [1, log n] long-hop links.

Without failures: routing time O((log² n)/k).

With failures: each node/link fails with probability p; routing time O((log² n)/((1-p)·k)). Only a (1-p) fraction of each node's links survive on average, so routing time scales up by a factor of 1/(1-p).

14

Search with random failures

[Plot: fraction of failed searches (non-faulty source and target) vs. probability of node failure, for n = 131072 nodes and log n = 17 links per node; curves for three strategies on hitting a failed node: Failed, Random Reroute, Backtrack.]

15

Lower bounds?

Is it possible to design a link distribution that beats the O(log² n) routing bound given by the 1/d distribution?

We want a lower bound on routing time as a function of the number of links per node.

Lower bound on routing time as a functionof number of links per node.

16

Lower bounds

Random graph G: node x has k links on average, each chosen independently*; x also links to (x-1) and (x+1). Let the target be 0. Expected time to reach 0 from a point chosen uniformly from 1..n:

Ω(log² n / (k · log log n))

* Assuming the probability of choosing links is symmetric about 0 and unimodal.

This Ω(log² n) is worse than the O(log n) possible for a tree: the cost of assuming symmetry between nodes.

17

Heuristic for construction

A new node chooses neighbors using the inverse-distance distribution, links to the live nodes closest to the chosen (possibly absent) ones, and selects older nodes to point to it. The same strategy is used for repairing broken links.

[Figure: the new node y's initial link toward an absent node x is adjusted to the nearest live older node; ideal links vs. adjusted and new links.]
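A hedged sketch of the adjustment step, reusing `dist` from the earlier routing sketch; the function name and arguments are assumptions for illustration.

```python
def adjusted_link(ideal, live_nodes, n):
    """Link to the live node closest to the ideally chosen neighbor.
    `ideal` is a location sampled from the inverse-distance distribution;
    `live_nodes` is the set of currently live node ids."""
    return min(live_nodes, key=lambda v: dist(ideal, v, n))
```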

18

Derived link distribution

[Log-log plot: probability of link vs. length of link, for n = 16384 nodes and log n = 14 links per node; the derived distribution tracks the ideal inverse-distance distribution.]

19

So far...

• Data availability
  • Decentralization
  • Scalability
  • Load balancing
  • Fault-tolerance
• Network maintenance
  • Dynamic node addition/deletion
  • Repair mechanism
• Efficient searching
  • Incorporating proximity
  • Incorporating locality

20

Disadvantages of DHTs

• No support for locality. A user who requests www.cnn.com is likely to also request www.cnn.com/weather; the system should use information from the first search to improve the performance of the second, but DHTs cannot, as hashing destroys locality.

• No support for complex queries.

21

Skip list [Pugh ’90]

A data structure based on a linked list. Each element is linked into the next higher level with probability 1/2.

Level 2:        J
Level 1:  A     J  M
Level 0:  A  G  J  M  R  W    (between HEAD and TAIL sentinels)

22

Searching in a skip list

Search for key 'R': start at HEAD at the top level; at each level, move right while the next key is less than the target, then drop down a level. Here the search descends via J (level 2), J → M (level 1), and M → R (level 0): success.

Time for search: O(log m) on average.
Number of pointers per element: O(1) on average.
[m = number of elements in the skip list]
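For concreteness, a standard skip list search in Python (a sketch, not Pugh's code); `head` is assumed to be a sentinel element whose key compares smaller than every real key and which is linked at every level.

```python
class Element:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * (level + 1)   # next[i]: right neighbor at level i

def skiplist_search(head, key, top_level):
    """Start at the top level; move right while the next key is < key,
    drop a level on overshoot.  Expected O(log m) time."""
    x = head
    for level in range(top_level, -1, -1):
        while x.next[level] is not None and x.next[level].key < key:
            x = x.next[level]
    x = x.next[0]                          # candidate element at level 0
    return x if x is not None and x.key == key else None
```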

23

Skip lists for P2P?

Advantages:
• O(log m) expected search time.
• Retains locality.
• Supports dynamic additions/deletions.

Disadvantages:
• Cannot reduce load on top-level elements.
• Cannot survive partitioning by failures.

Problem: lack of redundancy.

24

A skip graph [Aspnes-Shah, SODA 2003]

Membership vectors: A = 000, J = 001, M = 011, G = 100, W = 101, R = 110.

Level 2:  {A, J}  {M}  {G, W}  {R}
Level 1:  {A, J, M}  {G, R, W}
Level 0:  {A, G, J, M, R, W}

Elements link at level i to elements with a matching membership-vector prefix of length i.
Average O(log m) pointers per element [m = number of resources].
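The level structure translates directly into code. A centralized Python sketch for clarity (the real structure is distributed, one element per resource): group elements into the sorted lists implied by their membership vectors. The example data is taken from the slide.

```python
from collections import defaultdict

def level_lists(mvec, max_level):
    """mvec maps key -> membership vector; at level i, elements sharing a
    vector prefix of length i form one sorted doubly-linked list."""
    levels = []
    for i in range(max_level + 1):
        lists = defaultdict(list)
        for key in sorted(mvec):
            lists[mvec[key][:i]].append(key)
        levels.append(dict(lists))
    return levels

# Reproduces the slide's example: level 2 gives {A,J}, {M}, {G,W}, {R}.
example = {'A': '000', 'J': '001', 'M': '011',
           'G': '100', 'W': '101', 'R': '110'}
print(level_lists(example, 2))
```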

25

Search: expected O(log m)

Same performance as skip lists and DHTs: restricting to the lists containing the starting element of the search, we get a skip list.

[Figure: the levels of the example skip graph, with the lists touched by a search highlighted.]

26

Resources vs. nodes

Skip graphs: elements are resources.
DHTs: elements are nodes.

This does not affect search performance or load balancing, but it increases the number of pointers at each node.

[Figure: the same physical network of nodes A-E, viewed as a DHT overlay of nodes and as a skip graph whose level-0 list holds the resources stored on those nodes.]

27

SkipNet [HJSTW '03]

[Figure: SkipNet orders names by domain at level 0 (com.apple, com.ibm, com.microsoft, com.sun); within a domain, the machines com.ibm/m1 ... com.ibm/m4 store documents (a.htm, r.htm, f.htm, g.htm, ...) via a distributed hash table.]

28

So far...

• Data availability
  • Decentralization
  • Scalability
  • Load balancing
  • Fault-tolerance
• Network maintenance
  • Dynamic node addition/deletion
  • Repair mechanism
• Efficient searching
  • Incorporating proximity
  • Incorporating locality

29

Insertion – 1

[Figure: inserting the new element J (membership vector 001) into the skip graph of A, G, M, R, W.]

Starting at a buddy element, find the nearest key at level 0: a range query looking for the key closest to the new key. Takes O(log m) on average.

30

Insertion - 2

[Figure: J (001) then links into level 1 (prefix 0: A, J, M) and level 2 (prefix 00: A, J).]

At each higher level, search for elements matching a prefix of increasing length. This adds O(1) time per level, for a total insertion time of O(log m): the same as most DHTs.
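A minimal sketch of the two insertion steps on a centralized view (all names are assumptions; the real algorithm passes messages along the distributed lists).

```python
import bisect
import random

def level0_neighbors(sorted_keys, new_key):
    """Step 1: locate the new key's left/right neighbors at level 0;
    this is what the O(log m) range query from the buddy achieves."""
    i = bisect.bisect_left(sorted_keys, new_key)
    left = sorted_keys[i - 1] if i > 0 else None
    right = sorted_keys[i] if i < len(sorted_keys) else None
    return left, right

def link_higher_levels(mvec, new_key):
    """Step 2: draw membership-vector bits one at a time; at level i the
    new element joins the list of elements matching its length-i prefix,
    stopping once it is alone there (O(1) expected work per level)."""
    bits = ''
    while True:
        bits += random.choice('01')
        peers = [k for k, v in mvec.items() if v[:len(bits)] == bits]
        if not peers:
            break
    mvec[new_key] = bits
    return bits
```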

31

So far...

• Data availability
  • Decentralization
  • Scalability
  • Load balancing
  • Fault-tolerance
• Network maintenance
  • Dynamic node addition/deletion
  • Repair mechanism
• Efficient searching
  • Incorporating proximity
  • Incorporating locality

32

Locality and range queries

• Find any key < F or > F; find the largest key < F; find the least key > F.
• Find all keys in an interval, e.g. [D..O].
• Initial element insertion at level 0.

[Figure: range queries on the sorted level-0 list A, D, F, I, L, O, S.]
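All of these reduce to one search plus a walk along the sorted level-0 list. A sketch, assuming (as an illustration) that elements expose `key` and a level-0 right link `R[0]`:

```python
def range_query(start, lo, hi):
    """Report all keys in [lo, hi]: first search for the least key >= lo
    (O(log m)), then walk right along the level-0 list.  `start` is the
    element that search returned."""
    out, x = [], start
    while x is not None and x.key <= hi:
        if x.key >= lo:          # guard in case start precedes lo
            out.append(x.key)
        x = x.R[0]
    return out
```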

33

Further applications of locality

1. Version control: e.g., find the latest news before today = find the largest key < news:05/14.

[Figure: level-0 list news:01/31, news:03/01, news:03/18, news:04/03.]

34

2. Data replication: e.g., find any copy of some Britney Spears song: search for britney*.

[Figure: replicas britney02, britney03, britney04 adjacent in the level-0 list, spread across different higher-level lists.]

Provides hot-spot management and survivability.

35

What’s left?

• Data availability
  • Decentralization
  • Scalability
  • Load balancing
  • Fault-tolerance
• Network maintenance
  • Dynamic node addition/deletion
  • Repair mechanism
• Efficient searching
  • Incorporating proximity
  • Incorporating locality

36

Fault tolerance

How do failures affect skip graph performance?

Random failures: Randomly chosen elements fail. Experimental results.

[Experiments may not necessarily give worst failure pattern.]

Adversarial failures: Adversary carefully chooses elements that fail. Theoretical results.

37

Random failures

[Plot: size of the largest connected component as a fraction of live elements vs. probability of failure (0.00-0.95), for 131072 elements.]

38

Searches with random failures

[Plot: fraction of failed searches (non-faulty source and target) vs. probability of failure (0.0-0.6), for 131072 elements and 10000 messages.]

39

Adversarial failures

Theorem: A skip graph with m elements has expansion ratio Ω(1/log m) whp.

For a set of elements A, let ∂A = elements adjacent to A but not in A. The expansion ratio is min |∂A|/|A| over 1 <= |A| <= m/2.

To isolate the elements of A, the adversary must fail every element of ∂A, so

#failures >= |∂A| >= |A|/log m.

Hence f failures can isolate only O(f·log m) elements.

40

Need for repair mechanism

Node failures can leave the skip graph in an inconsistent state.

[Figure: the example skip graph of A, G, J, M, R, W with links broken by failures.]

41

Basic repair action

If an element detects a missing neighbor, it tries to patch the link using other levels, and also relinks at the other lower levels.

[Figure: element 3 fails out of the lists 1-2-4-5-6 (level 0), 1-5-6 (level 1), 1-5 (level 2); its neighbors patch the links around it.]

Eventually each connected component of the disrupted skip graph reorganizes itself into an ideal skip graph.
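One plausible reading of "patch the link using other levels", as a hedged sketch: the level-(i-1) list contains every element of the level-i list, so an element can scan right at level i-1 for a live element matching its length-i membership prefix. The field names (`R`, `L`, `alive`, `mv`) are assumptions.

```python
def patch_right(x, i):
    """Repair x's dead right neighbor at level i by scanning level i-1
    for the next live element whose membership vector matches x's
    length-i prefix, then relinking in both directions."""
    y = x.R[i - 1]
    while y is not None and (not y.alive or y.mv[:i] != x.mv[:i]):
        y = y.R[i - 1]
    x.R[i] = y
    if y is not None:
        y.L[i] = x
```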

42

Ideal skip graph

Let xR_i (xL_i) denote the right (left) neighbor of x at level i.

Invariant (whenever xL_i, xR_i exist):
1. xL_i < x < xR_i.
2. (xL_i)R_i = (xR_i)L_i = x.

Successor constraints: xR_i = (xR_{i-1})^k' for some k', and xL_i = (xL_{i-1})^k for some k; that is, level-i neighbors are iterated level-(i-1) successors.

[Figure: x (prefix ..00..) reaches xR_i by stepping right at level i-1 past elements with non-matching prefixes (..01..).]
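These conditions translate into a straightforward checker; a sketch under the assumption that each element x carries per-level neighbor lists x.L and x.R plus a key.

```python
def check_ideal(x):
    """Assert the invariant and successor constraints for element x."""
    for i in range(len(x.R)):
        l, r = x.L[i], x.R[i]
        if l is not None:
            assert l.key < x.key and l.R[i] is x   # xL_i < x, (xL_i)R_i = x
        if r is not None:
            assert x.key < r.key and r.L[i] is x   # x < xR_i, (xR_i)L_i = x
        if i > 0 and r is not None:
            y = x.R[i - 1]                 # xR_i must be an iterated
            while y is not None and y is not r:    # level-(i-1) successor
                y = y.R[i - 1]
            assert y is r
```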

43

Constraint violation

A neighbor at level i is not present at level (i-1).

[Figure: the repair merges the level-(i-1) lists (prefixes ..00.. and ..01..) so that the level-i neighbor reappears among the level-(i-1) successors.]

44

Additional properties

1. Low network congestion.

2. No need to know key space.

45

Network congestion

We are interested in the average traffic through any element u, i.e., the number of searches from a source s to a destination t that pass through u.

Theorem: Let dist(u, t) = d, and let V = {elements v : u <= v <= t}, so |V| = d + 1. Then the probability that a search from s to t passes through u is < 2/(d+1).

Elements near a popular target get loaded, but the effect drops off rapidly with distance.

46

Predicted vs. real load

[Plot: fraction of messages vs. element location (76400-76600) for destination 76500; predicted load vs. actual load.]

47

Knowledge of key space

DHTs require knowledge of the key space size initially. Skip graphs do not: old elements extend their membership vectors with new bits as required when new elements arrive.

[Figure: J is inserted between E and Z at level 0; existing elements draw new membership-vector bits on demand to separate themselves from J at higher levels.]
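A small sketch of this idea (the class and method names are illustrative): a membership vector behaves like an infinite random bit string whose bits are drawn lazily, so no key space size is ever fixed in advance.

```python
import random

class MembershipVector:
    """Conceptually an infinite random bit string; bits are generated
    only when a deeper level first needs them."""
    def __init__(self):
        self.bits = []

    def prefix(self, i):
        while len(self.bits) < i:
            self.bits.append(random.choice('01'))   # the 'new bit'
        return ''.join(self.bits[:i])
```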

48

Similarities with DHTs

• Data availability
  • Decentralization
  • Scalability
  • Load balancing
  • Fault-tolerance [random failures]
• Network maintenance
  • Dynamic node addition/deletion
  • Repair mechanism
• Efficient searching
  • Incorporating proximity
  • Incorporating locality

49

Differences

Property                          DHTs       Skip graphs
Tolerance of adversarial faults   Not yet    Yes
Locality                          No         Yes
Key space size                    Required   Not required
Proximity                         Partially  No

50

Open Problems

• Design a more efficient repair mechanism.
• Incorporate proximity.
• Study the effect of byzantine/selfish behavior.
• Provide locality and state-minimization.

Some promising approaches:
• Solution: composition of data structures [AS '03, ZSZ '03]
• Tool: locality-sensitive hashing [LS '96, IMRV '97]

51

Questions, Comments, Criticisms