P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in...

49
1 p2p, Fall 06 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p

Transcript of P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in...

Page 1: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

1p2p, Fall 06

Topics in Database Systems: Data Management in Peer-to-Peer Systems

Search in Unstructured P2p

Page 2: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

2p2p, Fall 06

Topics in Database Systems: Data Management in Peer-to-Peer Systems

D. Tsoumakos and N. Roussopoulos, “A Comparison of Peer-to-Peer Search Methods”, WebDB03

Page 3: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

3p2p, Fall 06

Overview

Centralized

Constantly-updated directory hosted at central locations (do not scale well, updates, single points of failure)

Decentralized but structured

The overlay topology is highly controlled and files (or metadata/index) are not placed at random nodes but at specified locations

Decentralized and Unstructured

peers connect in an ad-hoc fashion

the location of document/metadata is not controlled by the system

No guarantee for the success of a search

No bounds on search time

No maintenance cost

Any kind of query (not just single key or range queries)

Page 4: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

4p2p, Fall 06

Flooding on Overlays

xyz.mp3 ?

xyz.mp3

Page 5: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

5p2p, Fall 06

Flooding on Overlays

xyz.mp3 ?

xyz.mp3

Flooding

Page 6: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

6p2p, Fall 06

Flooding on Overlays

xyz.mp3 ?

xyz.mp3

Flooding

Page 7: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

7p2p, Fall 06

Flooding on Overlays

xyz.mp3

Page 8: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

8p2p, Fall 06

Search in Unstructured P2P

Must find a way to stop the search: Time-to-Leave (TTL)

Exponential Number of Messages

Cycles (?)

Note: cycles can be detected but not avoided

Page 9: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

9p2p, Fall 06

Search in Unstructured P2P

BFS vs DFS

BFS better response time, larger number of nodes (message overhead per node and overall)

Note: search in BFS continues (if TTL is not reached), even if the object has been located on a different path

Recursive vs Iterative

During search, whether the node issuing the query directly contacts others, or recursively.

Does the result follows the same path?

Page 10: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

10p2p, Fall 06

retrieve (K1)

K V

K V

K V

K V

K V

K V

K V

K V

K V

K V

K V

Iterative vs. Recursive Routing Iterative: Originator requests IP address of each hop

• Message transport is actually done via direct IPRecursive: Message transferred hop-by-hop

Page 11: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

11p2p, Fall 06

Search in Unstructured P2P

Two general types of search in unstructured p2p:

Blind: try to propagate the query to a sufficient number of nodes (example Gnutella)

Informed: utilize information about document locations (example Routing Indexes)

Informed search increases the cost of join for an improved search cost

Page 12: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

12p2p, Fall 06

Blind Search Methods

Gnutella:

Use flooding (BFS) to contact all accessible nodes within the TTL value

Huge overhead to a large number of peers +

Overall network traffic

Hard to find unpopular items

Up to 60% bandwidth consumption of the total Internet traffic

Page 13: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

13p2p, Fall 06

Free-riding on Gnutella [Adar00]

• 24 hour sampling period:– 70% of Gnutella users share no files– 50% of all responses are returned by top 1% of sharing hosts

• A social problem not a technical one• Problems:

– Degradation of system performance: collapse?– Increase of system vulnerability– “Centralized” (“backbone”) Gnutella copyright issues?

• Verified hypotheses:– H1: A significant portion of Gnutella peers are free riders– H2: Free riders are distributed evenly across domains– H3: Often hosts share files nobody is interested in (are not downloaded)

Page 14: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

14p2p, Fall 06

Free-riding Statistics - 1 [Adar00]

H1: Most Gnutella users are free riders

Of 33,335 hosts:

– 22,084 (66%) of the peers share no files

– 24,347 (73%) share ten or less files

– Top 1 percent (333) hosts share 37% (1,142,645) of total files shared

– Top 5 percent (1,667) hosts share 70% (1,142,645) of total files shared

– Top 10 percent (3,334) hosts share 87% (2,692,082) of total files shared

Page 15: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

15p2p, Fall 06

Free-riding Statistics - 2 [Adar00]

H3: Many servents share files nobody downloadsOf 11,585 sharing hosts:

– Top 1% of sites provide nearly 47% of all answers– Top 25% of sites provide 98% of all answers– 7,349 (63%) never provide a query response

Page 16: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

16p2p, Fall 06

Free Riders

• File sharing studies– Lots of people download– Few people serve files

• Is this bad?– If there’s no incentive to serve, why do people do so?– What if there are strong disincentives to being a major

server?

Page 17: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

17p2p, Fall 06

Simple Solution: Thresholds

• Many programs allow a threshold to be set– Don’t upload a file to a peer unless it shares > k files

• Problems:– What’s k?– How to ensure the shared files are interesting?

Page 18: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

18p2p, Fall 06

Categories of Queries [Sripanidkulchai01]

Categorized top 20 queries

Page 19: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

19p2p, Fall 06

Popularity of Queries [Sripanidkulchai01]

• Very popular documents are approximately equally popular• Less popular documents follow a Zipf-like distribution (i.e., the

probability of seeing a query for the ith most popular query is proportional to 1/(ialpha))

• Access frequency of web documents also follows Zipf-like distributions caching might also work for Gnutella

Page 20: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

20p2p, Fall 06

Caching in Gnutella [Sripanidkulchai01]

• Average bandwidth consumption in tests: 3.5Mbps • Best case: trace 2 (73% hit rate = 3.7 times traffic reduction)

Page 21: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

21p2p, Fall 06

Topology of Gnutella [Jovanovic01]

• Power-law properties verified (“find everything close by”)• Backbone + outskirts

Power-Law Random Graph (PLRG):

The node degrees follow a power law distribution:

if one ranks all nodes from the most connected to the least connected, then the i’th most connected node has ω/ia neighbors,

where w is a constant.

Page 22: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

22p2p, Fall 06

Gnutella Backbone [Jovanovic01]

Page 23: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

23p2p, Fall 06

Why does it work? It’s a small World! [Hong01]

• Milgram: 42 out of 160 letters from Oregon to Boston (~ 6 hops)

• Watts: between order and randomness

– short-distance clustering + long-distance shortcuts

Regular graph:

n nodes, k nearest neighbors

path length ~ n/2k

4096/16 = 256

Random graph:

path length ~ log (n)/log(k)

~ 4

Rewired graph (1% of nodes):

path length ~ random graph

clustering ~ regular graph

Page 24: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

24p2p, Fall 06

Links in the small World [Hong01]

• “Scale-free” link distribution

– Scale-free: independent of the total number of nodes

– Characteristic for small-world networks

– The proportion of nodes having a given number of links n is: P(n) = 1 /n k

– Most nodes have only a few connections

– Some have a lot of links: important for binding disparate regions together

Page 25: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

25p2p, Fall 06

Freenet: Links in the small World [Hong01]

P(n) ~ 1/n 1.5

Page 26: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

26p2p, Fall 06

Gnutella: “New” Measurements

[1] Stefan Saroiu, P. Krishna Gummadi, Steven D. Gribble: A Measurement Study of Peer-to-Peer File Sharing Systems, Proceedings of Multimedia Computing and Networking (MMCN) 2002, San Jose, CA, USA, January 2002.  [2] M. Ripeanu, I. Foster, and A. Iamnitchi. Mapping the gnutella network: Properties of large-scale peer-to-peer systems and implications for system design. IEEE Internet Computing Journal, 6(1), 2002 [3] Evangelos P. Markatos, Tracing a large-scale Peer to Peer System: an hour in the life of Gnutella, 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2002. [4] Y. HawatheAWATHE, S. Ratnasamy, L. Breslau, and S. Shenker. Making Gnutella-like P2P Systems Scalable. In Proc. ACM SIGCOMM (Aug. 2003). [5] Qin Lv, Pei Cao, Edith Cohen, Kai Li, Scott Shenker: Search and replication in unstructured peer-to-peer networks. ICS 2002: 84-95 

Page 27: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

27p2p, Fall 06

Gnutella: Bandwidth Barriers

• Clip2 measured Gnutella over 1 month:– typical query is 560 bits long (including TCP/IP headers)– 25% of the traffic are queries, 50% pings, 25% other– on average each peer seems to have 3 other peers actively connected

• Clip2 found a scalability barrier with substantial performance degradation if queries/sec > 10: 10 queries/sec* 560 bits/query* 4 (to account for the other 3 quarters of message traffic)* 3 simultaneous connections67,200 bps 10 queries/sec maximum in the presence of many dialup users won’t improve (more bandwidth - larger files)

Page 28: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

28p2p, Fall 06

Gnutella: Summary

• Completely decentralized• Hit rates are high• High fault tolerance• Adopts well and dynamically to changing peer populations• Protocol causes high network traffic (e.g., 3.5Mbps). For example:

– 4 connections C / peer, TTL = 7– 1 ping packet can cause packets

• No estimates on the duration of queries can be given• No probability for successful queries can be given• Topology is unknown algorithms cannot exploit it• Free riding is a problem• Reputation of peers is not addressed• Simple, robust, and scalable (at the moment)

240,26)1(**20

TTL

i

iCC

Page 29: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

29p2p, Fall 06

Lessons and Limitations

• Client-Server performs well– But not always feasible

• Ideal performance is often not the key issue!

• Things that flood-based systems do well– Organic scaling– Decentralization of visibility and liability– Finding popular stuff (e.g., caching)– Fancy local queries

• Things that flood-based systems do poorly– Finding unpopular stuff [Loo, et al VLDB 04]– Fancy distributed queries– Vulnerabilities: data poisoning, tracking, etc.– Guarantees about anything (answer quality, privacy, etc.)

Page 30: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

30p2p, Fall 06

Comparison

Gnutella CAN Others?

Expressivness

Comprehensivness

Autonomy

Efficiency

Robustness

Topology pwr law

Data Placement arbitrary

Message Routing flooding

Page 31: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

31p2p, Fall 06

Comparison

Gnutella CAN Others?

Expressivness

Comprehensivness

Autonomy

Efficiency

Robustness

Topology pwr law grid

Data Placement arbitrary hashing

Message Routing flooding directed

Page 32: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

32p2p, Fall 06

Security & Privacy

• Issues:– Anonymity– Reputation– Accountability– Information Preservation– Information Quality– Trust– Denial of service attacks

Page 33: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

33p2p, Fall 06

Authenticity

title: origin of species

author: charles darwin

date: 1859

body: In an island far,far away ...

...

?

Page 34: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

34p2p, Fall 06

More than Just File Integrity

title: origin of species

author: charles darwin

date: 1859

body: In an island far,far away ...

checksum

? 00

Page 35: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

35p2p, Fall 06

More than Fetching One File

T=originY=1800

A=darwin

T=originY=1859

A=darwin

T=originY=1859

A=darwin

T=originY=?

A=darwinB=?

T=originY=1859

A=darwinB=abcd

Page 36: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

36p2p, Fall 06

Solutions

• Authenticity Function A(doc): T or F– at expert sites, at all sites?– can use signature expert sig(doc) user

• Voting Based– authentic is what majority says

• Time Based– e.g., oldest version (available) is authentic

Page 37: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

37p2p, Fall 06

Issues

• Trust computations in dynamic system• Overloading good nodes• Bad nodes can provide good content sometimes• Bad nodes can build up reputation• Bad nodes can form collectives• ...

Page 38: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

38p2p, Fall 06

Back to searching

Page 39: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

39p2p, Fall 06

Blind Search Methods

Iterative Deepening:

Start BFS with a small TTL and repeat the BFS at increasing depths if the first BFS fails

Works well when there is some stop condition and a “small” flood will satisfy the query

Else even bigger loads than standard flooding

(more later …)

Modified-BFS:

Choose only a ratio of the neighbors (some random subset)

Page 40: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

40p2p, Fall 06

Blind Search MethodsRandom Walks:

The node that poses the query sends out k query messages to an equal number of randomly chosen neighbors

Each step follows each own path at each step randomly choosing one neighbor to forward it

Each path – a walker

Two methods to terminate each walker: TTL-based or

checking method (the walkers periodically check with the query source if the stop condition has been met)

It reduces the number of messages to k x TTL in the worst case

Some kind of local load-balancing

Page 41: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

41p2p, Fall 06

Blind Search Methods

Random Walks:

In addition, the protocol bias its walks towards high-degree nodes (choose the highest degree neighbor)

Page 42: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

42p2p, Fall 06

Blind Search Methods

Using Super-nodes:

Super (or ultra) peers are connected to each other

Each super-peer is also connected with a number of leaf nodes

Routing among the super-peers

The super-peers then contact their leaf nodes

Page 43: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

43p2p, Fall 06

Blind Search Methods

Using Super-nodes:

Gnutella2

When a super-peer (or hub) receives a query from a leaf, it forwards it to its relevant leaves and to neighboring super-peers

The hubs process the query locally and forward it to their relevant leaves

Neighboring super-peers regularly exchange local repository tables to filter out traffic between them

Page 44: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

44p2p, Fall 06

Blind Search Methods

Ultrapeers can be installed (KaZaA) or self-promoted (Gnutella)

Interconnection between the superpeers

Page 45: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

45p2p, Fall 06

Informed Search Methods

Local Index

Each node indexes all files stored at all nodes within a certain radius r and can answer queries on behalf of them

Search process at steps of r, hop distance between two consecutive searches 2r+1

Increased cost for join/leave• Flood inside each r with TTL = r, when join/leave the

network

Page 46: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

46p2p, Fall 06

?query ...

Informed Search Methods

Intelligent BFS

Nodes store simple statistics on its neighbors:

(query, NeigborID) tuples for recently answered requests from or through their neighbors

so they can rank them

For each query, a node finds similar ones and selects a direction

How?

Page 47: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

47p2p, Fall 06

• Heuristics for Selecting Direction>RES: Returned most results for previous queries<TIME: Shortest satisfaction time<HOPS: Min hops for results>MSG: Forwarded the largest number of messages (all

types), suggests that the neighbor is stable<QLEN: Shortest queue<LAT: Shortest latency>DEG: Highest degree

?query ...

Informed Search Methods

Intelligent or Directed BFS

Page 48: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

48p2p, Fall 06

Informed Search Methods

Intelligent or Directed BFS

• No negative feedback• Depends on the assumption that nodes specialize in certain documents

Page 49: P2p, Fall 06 1 Topics in Database Systems: Data Management in Peer-to-Peer Systems Search in Unstructured P2p.

49p2p, Fall 06

Informed Search Methods

APS

Again, each node keeps a local index with one entry for each object it has requested per neighbor –

this reflects the relative probability of the node to be chosen to forward the query

k independent walkers and probabilistic forwarding

Each node forwards the query to one of its neighbor based on the local index (for each object, choose a neighbor using the stored probability)

If a walker, succeeds the probability is increased, else is decreased –

Take the reverse path to the requestor and update the probability, after a walker miss (optimistic update) or after a hit (pessimistic update)