Gnutella & Searching Algorithms in Unstructured Peer-to-Peer Networks CS780-3 Lecture Notes
Search and Replication in Unstructured Peer-to-Peer
Networks
Pei Cao, Christine Lv, Edith Cohen, Kai Li, and Scott Shenker
ICS 2002
Outline
• Brief survey of P2P architectures
• Evaluation Methodology
• Search Methods
• Replication
• Conclusions
Peer-to-Peer Networks
• Peers are connected by an overlay network.
• Users cooperate to share files (e.g., music, videos, etc.)
• Dynamic: nodes join or leave frequently
P2P Network Architectures I
• Centralized:
  – Uses a central directory server (CDS)
  – Peers query the CDS to find other peers that hold the desired object
Pros: very efficient
Cons: scales poorly; single point of failure
P2P Network Architectures II
• Decentralized: no central directory server
  – Structured:
    • P2P network topology is tightly controlled
    • Files are placed at specified locations
  – Unstructured:
    • No control over network topology or file placement
P2P Network Architectures III
Decentralized but Structured
• “Loosely structured”
  – Placement of files is based on hints
• “Tightly structured”
  – Precisely defined:
    • structure of the P2P network
    • file placement
  – Use of a distributed hash table
Pros: efficient satisfaction of queries; good scaling
Cons: no proof it works
P2P Network Architectures IV
Decentralized and Unstructured
• Placement of files is not based on topology knowledge
• Finding files
  – A node queries its neighbors (usually by flooding)
Pros: extremely resilient to network changes
Cons: extremely unscalable; generates large loads
Evaluation Methodology I
Terminology
• Network Topology: the graph formed by the nodes in the network at a given instant
• Query Distribution: frequency of lookups to files
• Replication Distribution: percentage of nodes that have a particular file
Evaluation Methodology II
• Network Topologies
  – Power-Law Random Graph (PLRG)
    • Max node degree: 1746, median: 1, average: 4.46
  – Normal Random Graph (Random)
    • Average and median node degree are both 4
  – Gnutella graph (Gnutella)
    • Oct 2000 snapshot
    • Max degree: 136, median: 2, average: 5.5
  – Two-dimensional Grid
    • 100x100, 10000 nodes
Evaluation Methodology III
• Object query distribution qi
  – Uniform
  – Zipf-like
• Object replication density distribution ri
  – Uniform
  – Proportional: ri ∝ qi
  – Square-Root: ri ∝ √qi
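The distributions listed above can be sketched in a few lines of Python. This is only an illustration: the function names `zipf_like` and `replication`, the exponent `alpha`, and the normalization of replication densities to sum to 1 are assumptions made for the example, not details taken from the slides.

```python
def zipf_like(m, alpha=1.0):
    """Query distribution over m objects: q_i proportional to 1 / rank^alpha.

    alpha is a hypothetical parameter; the slides only say "Zipf-like".
    """
    weights = [1.0 / (rank ** alpha) for rank in range(1, m + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def replication(q, strategy):
    """Replication density r_i for each object, normalized to sum to 1."""
    if strategy == "uniform":
        weights = [1.0] * len(q)
    elif strategy == "proportional":
        weights = list(q)                  # r_i proportional to q_i
    elif strategy == "square-root":
        weights = [x ** 0.5 for x in q]    # r_i proportional to sqrt(q_i)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    q = zipf_like(5)
    for name in ("uniform", "proportional", "square-root"):
        print(name, [round(r, 3) for r in replication(q, name)])
```

Square-root replication flattens the query distribution without removing the skew entirely, which is why it sits between the other two strategies.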
Evaluation Methodology IV
• Metrics
  – User aspects
    • Pr(success)
    • #hops
  – Load aspects
    • Average #messages per node
    • #nodes visited
    • Peak #messages
Limitation of Flooding I
• Gnutella uses a TTL to limit the number of hops a query travels
• Problem:
  – Hard to choose the TTL:
    • For objects that are widely present in the network, small TTLs suffice
    • For objects that are rare in the network, large TTLs are necessary
  – The number of query messages grows exponentially as the TTL grows
Limitation of Flooding II
• A node may receive the same message more than once
• Need for duplicate detection mechanisms
• Even so, duplication increases as the TTL increases in flooding
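To make the message growth and duplication concrete, here is a minimal sketch of TTL-limited flooding over an adjacency-list graph. It is not Gnutella's actual protocol code: the function name `flood`, the `holders` set, and the rule that a node forwards to every neighbor except the one it received the query from are assumptions for illustration.

```python
def flood(graph, source, holders, ttl):
    """TTL-limited flooding from `source`.

    graph   : dict mapping node -> list of neighbor nodes
    holders : set of nodes that store the requested object
    Returns (found, messages_sent, duplicate_messages).
    """
    seen = {source}                      # duplicate detection state
    frontier = [(source, None)]          # (node, node it got the query from)
    messages = duplicates = 0
    found = source in holders
    for _ in range(ttl):
        next_frontier = []
        for node, parent in frontier:
            for nbr in graph[node]:
                if nbr == parent:        # do not send the query straight back
                    continue
                messages += 1
                if nbr in seen:
                    duplicates += 1      # received again, dropped by nbr
                else:
                    seen.add(nbr)
                    found = found or nbr in holders
                    next_frontier.append((nbr, node))
        frontier = next_frontier
    return found, messages, duplicates


if __name__ == "__main__":
    # A 4-node ring: 0-1-2-3-0, with the object stored at node 2.
    ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    for ttl in (1, 2, 3):
        print(ttl, flood(ring, 0, {2}, ttl))
```

Note that the flood keeps going until the TTL expires even after the object is found, so a TTL larger than necessary only adds messages and duplicates.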
Limitation of Flooding Conclusion
• Flooding increases per-node overhead
• Need for more scalable search methods:
  – Expanding Ring
  – Random Walks
Expanding Ring
• Adaptively adjust the TTL
  – Multiple floods: start with TTL=1; increment TTL by 2 each time until the search succeeds
• Still produces duplicate messages
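The expanding ring policy can be sketched as a loop of successive floods. This is a simplified illustration: the helper `flood`, the `max_ttl` cutoff, and the message accounting (every forwarded copy counts as one message, including copies sent back toward the sender) are assumptions, while the starting TTL of 1 and the increment of 2 follow the slide.

```python
def flood(graph, source, holders, ttl):
    """One TTL-limited flood; returns (found, messages_sent)."""
    seen, frontier, messages = {source}, [source], 0
    found = source in holders
    for _ in range(ttl):
        next_frontier = []
        for node in frontier:
            for nbr in graph[node]:
                messages += 1
                if nbr not in seen:
                    seen.add(nbr)
                    found = found or nbr in holders
                    next_frontier.append(nbr)
        frontier = next_frontier
    return found, messages


def expanding_ring(graph, source, holders, start_ttl=1, step=2, max_ttl=9):
    """Repeat floods with a growing TTL until the object is found.

    Returns (ttl_that_succeeded or None, total_messages_over_all_floods).
    max_ttl is a hypothetical safety cutoff, not a value from the slides.
    """
    total = 0
    ttl = start_ttl
    while ttl <= max_ttl:
        found, messages = flood(graph, source, holders, ttl)
        total += messages          # earlier, smaller floods are paid for too
        if found:
            return ttl, total
        ttl += step
    return None, total


if __name__ == "__main__":
    # A path 0-1-2-3-4 with the object at node 3.
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
    print(expanding_ring(path, 0, {3}))
```

Each new flood revisits the nodes covered by the previous one, which is the source of the duplicate messages noted above.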
Random Walk
• Simple random walk
  – Takes too long to find anything
• Multiple-walker random walk
  – K walkers, each walking T steps, visit about as many nodes as 1 walker walking K*T steps
  – More messages, more overhead
  – When to terminate the search:
    • TTL
    • Checking: check back with the query originator once every C steps
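A minimal sketch of a K-walker random walk with both termination rules (TTL and periodic checking) follows. The function name `k_walker_search` and its default parameters are assumptions for the example. As a simplification, all walkers, including the one that finds the object, keep walking until the next checkpoint, and the check-back messages themselves are not counted.

```python
import random


def k_walker_search(graph, source, holders, k=4, ttl=64, check_every=4, seed=0):
    """K walkers start at `source` and each takes one random step per round.

    Returns (step_at_which_object_was_first_found or None, walk_messages).
    """
    if source in holders:
        return 0, 0
    rng = random.Random(seed)
    walkers = [source] * k
    messages = 0
    found_at = None
    for step in range(1, ttl + 1):          # TTL bounds every walker
        for i, node in enumerate(walkers):
            walkers[i] = rng.choice(graph[node])
            messages += 1                   # one message per walker per step
            if found_at is None and walkers[i] in holders:
                found_at = step
        # Checking: walkers learn about success only every `check_every` steps.
        if step % check_every == 0 and found_at is not None:
            break
    return found_at, messages


if __name__ == "__main__":
    ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
    print(k_walker_search(ring, 0, {4}, k=4, ttl=64, check_every=4))
```

A smaller `check_every` stops the walkers sooner after a hit at the cost of more frequent contact with the originator; a larger value does the opposite.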
Search Traffic Comparison
[Figure: bar chart of avg. # msgs per node per query for Flood, Ring, and Walk on the Random and Gnutella topologies. Data labels as extracted: 1.863, 2.85, 0.053, 0.961, 0.027, 0.031.]
Search Delay Comparison
[Figure: bar chart of # hops till success for Flood, Ring, and Walk on the Random and Gnutella topologies. Data labels as extracted: 2.51, 2.39, 4.03, 3.4, 9.12, 7.3.]
Lessons Learned about Search Methods
• Key: Cover the right number of nodes as quickly as possible and with as little overhead as possible
• Pay attention to
  – Adaptive termination
  – Minimizing message duplication
  – Small expansion in each step
Replication
• In unstructured P2P systems, search success is essentially about coverage: visiting enough nodes to find the object => replication density matters
• Goal: minimize average search size (number of probes till query is satisfied)
• Theoretical optimum: copy everything everywhere
  – Not feasible: node storage is limited
Replication Strategies
• Uniform Replication
  – pi = 1/m
  – Simple, resources are divided equally
• Proportional Replication
  – pi = qi
  – “Fair”, resources per item are proportional to demand
  – Reflects current P2P practices
Square-Root Replication
• pi is proportional to square-root(qi)
• Lies “in-between” Uniform and Proportional
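A short numeric check can show why the in-between allocation helps. The sketch assumes a simple random-probing model in which the expected number of probes for object i is inversely proportional to its share of replicas pi, so the average search size is proportional to the sum of qi / pi. The model, the function names, and the sample query distribution are assumptions for illustration.

```python
def normalize(weights):
    """Scale weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def relative_search_size(q, p):
    """Average search size up to a constant factor: sum of q_i / p_i."""
    return sum(qi / pi for qi, pi in zip(q, p))


if __name__ == "__main__":
    q = [0.64, 0.36]                           # hypothetical query rates
    uniform = normalize([1.0] * len(q))        # p_i = 1/m
    proportional = normalize(q)                # p_i = q_i
    square_root = normalize([x ** 0.5 for x in q])

    for name, p in (("uniform", uniform),
                    ("proportional", proportional),
                    ("square-root", square_root)):
        print(name, relative_search_size(q, p))
```

Under this model, Uniform and Proportional both give a relative search size equal to the number of objects m, while the square-root allocation gives (sum of sqrt(qi))^2, which is never larger than m.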
Achieving Square-Root Replication I
• Assume that each query keeps track of the number of probes needed
• Store an object at a number of nodes that is proportional to the number of probes
• Two implementations:
  – Path replication: store the object along the path of a successful “walk”
  – Random replication: store the object randomly among nodes visited by the agents
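The two implementations differ only in where the new copies go. A minimal sketch, assuming the successful walk is available as a list of nodes ending at the node that already holds the object, and that random replication places as many copies as there are distinct nodes on that path; the function names and both assumptions are illustrative, not taken from the slides.

```python
import random


def path_replication(path):
    """Nodes that receive a copy: every distinct node on the successful walk.

    `path` runs from the query originator to the node holding the object;
    the last node is skipped because it already has a copy.
    """
    return list(dict.fromkeys(path[:-1]))   # de-duplicate, keep walk order


def random_replication(path, visited, rng):
    """Same number of copies as path replication, placed on random visited nodes.

    `visited` is the set of nodes reached by any walker of this query.
    """
    count = len(set(path[:-1]))
    candidates = sorted(set(visited) - {path[-1]})
    return rng.sample(candidates, min(count, len(candidates)))


if __name__ == "__main__":
    walk = [0, 3, 5, 3, 7]                  # walker revisited node 3
    visited = {0, 1, 2, 3, 4, 5, 7}
    print(path_replication(walk))
    print(random_replication(walk, visited, random.Random(1)))
```

In both cases the number of new copies grows with the length of the search, which is what ties the replica count to the number of probes.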
Achieving Square-Root Replication II
Evaluation of Replication Methods I
• Metrics
  – Overall message traffic
  – Search delay
• Dynamic simulation
  – Assume Zipf-like object query probability
  – 5 queries/sec, Poisson arrival
  – Results are measured during 5000 sec to 9000 sec
  – Search method: 32-walker random walk with state keeping and a check every 4 steps
Evaluation of Replication Methods II
Square-Root Replication reduces search traffic
[Figure: bar chart of avg. # msgs per node (5000-9000 sec) for Owner Rep, Path Rep, and Random Rep; y-axis from 0 to 60000.]
Evaluation of Replication Methods III
[Figure: dynamic simulation, hop distribution (5000~9000 s). x-axis: #hops (ticks at 1, 2, 4, 8, ..., 256); y-axis: queries finished (%); curves: Owner Replication, Path Replication, Random Replication.]
Conclusions
• Multi-walker random walk scales much better than flooding
  – Can find data more quickly
  – Reduces the traffic load
• Square-root replication distribution is desirable
  – Minimizes search delay
  – Minimizes the overall search traffic