

Content Distribution in Unstructured Peer-to-Peer Networks

Daniel Stutzbach

Committee Members: • Professor Reza Rejaie• Professor Ginnie Lo• Professor Art Farley


Why Peer-to-Peer?

Introduction


Why Study Peer-to-Peer?

• Peers on the edge band together to share resources.
• Peer-to-peer can self-scale.
• Peer-to-peer applications are becoming increasingly popular.
  • File Sharing: Kazaa, Gnutella
  • Bandwidth Sharing: BitTorrent
  • Cycle Sharing: SETI@Home
• UW found that file-sharing uses 3 times as much bandwidth as the Web.


Challenges in Peer-to-Peer

• Peer-to-peer is more complicated.
• Discovering and managing resources is more difficult because:
  • The resources are distributed.
  • Peers are not under the control of one authority.
  • Peers are unstable.
  • Peers have heterogeneous resources to provide.
• It’s harder to measure a deployed system.


Research on Peer-to-Peer

• Characterizing deployed systems:
  • Provides insight into how P2P systems behave in the real world.
  • Is needed to develop accurate models for simulation.
• Design of new techniques to leverage peer-to-peer resources:
  • Overlay construction
  • Search mechanisms
  • Resource allocation


Topics Covered

• Peer-to-Peer Search
  • Keyword search for matching filenames
  • Example implementations: Gnutella, Kazaa, eDonkey 2000
  • Example topologies: ad-hoc mesh, ultrapeers
  • Searching for files to transfer
• Peer-to-Peer Transfer
  • Spread parts of files quickly to scale and alleviate flash-crowds
  • Example implementations: eDonkey 2000, BitTorrent, Slurpie
  • Example topologies: ad-hoc mesh, a forest of trees
  • Searching for peers with the right blocks or more bandwidth
• Peer-to-Peer Streaming
  • Spread parts of a stream quickly to scale and alleviate flash-crowds
  • Example implementations: CoopNet, SplitStream, PRO
  • Example topologies: ad-hoc mesh, a forest of trees
  • Searching for peers with the right layers, more bandwidth, or lower delay


Relationship of Topics

[Diagram: Streaming, Search, and Transfer, with the labels “Similar Functions” and “Components of File-Sharing”]


Overview

• Background
• Measurement and Characterization
• Design:
  • Peer-to-Peer Search
  • Peer-to-Peer Transfer
  • Peer-to-Peer Streaming


Peer-to-Peer Search

• To the user, all file-sharing programs look pretty much the same: like a search engine.
• However, they operate in different ways.

Background: Peer-to-Peer Search


Gnutella Classic


Ultrapeers

Leaves


Peer-to-Peer Transfer

Background: Peer-to-Peer Transfer

• A source has been located.

• Now we need to download the file.

• Split the file into blocks.

• Download blocks from wherever we can.
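The steps on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the fixed block size, and the in-memory `sources` dictionary standing in for remote peers are all hypothetical, and real systems also verify block hashes and fetch blocks in parallel.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def download(num_blocks: int, sources: dict[str, dict[int, bytes]]) -> bytes:
    """Fetch every block from whichever peer happens to hold it.

    `sources` maps a peer name to the blocks that peer holds
    (block index -> block bytes). It is a stand-in for real network peers.
    """
    received = {}
    for index in range(num_blocks):
        for held in sources.values():
            if index in held:
                received[index] = held[index]
                break
    if len(received) != num_blocks:
        raise LookupError("some blocks are not available from any peer")
    return b"".join(received[i] for i in range(num_blocks))


# Toy example: two peers each hold half of the blocks.
original = b"peer-to-peer transfer"
blocks = split_into_blocks(original, 4)
sources = {
    "peer_a": {i: b for i, b in enumerate(blocks) if i % 2 == 0},
    "peer_b": {i: b for i, b in enumerate(blocks) if i % 2 == 1},
}
reassembled = download(len(blocks), sources)
```

Neither peer holds the whole file here, yet the downloader can still reassemble it, which is the point of splitting into blocks.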


Peer-to-Peer Streaming

• A source has been located.

• Now we need to view the stream.

• Split the file into encoded sub-streams.

• Listen to as many sub-streams as we can.

• Timing matters.

• But we don’t need all streams.

Background: Peer-to-Peer Streaming


Characterization

• Measurement Techniques
• Characterizations

Measurement


Measurement Techniques

There are five basic approaches to measuring peer-to-peer systems:
• Interception
• Participation
• Crawling
• Probing
• Centralized

Measurement: Techniques


Interception

• Pros:
  • It can monitor many users.
  • It can observe transfers.
  • It can observe throughput.
• Cons:
  • It captures a biased cross-section.
  • It misses quiet peers.


Participation

• Pros:
  • It can capture a cross-section of overlay traffic.
  • It can compare different open-source implementations.
• Cons:
  • It assumes the measurement node is “typical”.
  • It tells us nothing about atypical nodes.
  • It’s harder to do with closed-source software.


Crawling

• Pros:
  • It can capture the topology.
  • It provides a global perspective.
  • It can capture the entire peer population.
• Cons:
  • The network changes while the crawler runs.
  • It’s hard to verify the accuracy of the crawler.
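A crawler of this kind is essentially a graph traversal over peers' neighbor lists. The sketch below is a hypothetical, single-threaded illustration: `get_neighbors` stands in for whatever protocol message a real crawler would send, and the toy `overlay` dictionary replaces the live network. Peers that have left by the time they are contacted are recorded as unreachable, which hints at why a changing network distorts the snapshot.

```python
from collections import deque


def crawl(seed, get_neighbors):
    """Breadth-first crawl of an overlay starting from one known peer.

    `get_neighbors(peer)` is assumed to return the peer's neighbor list,
    or raise ConnectionError if the peer is no longer reachable.
    """
    seen = {seed}
    queue = deque([seed])
    edges = set()
    unreachable = set()
    while queue:
        peer = queue.popleft()
        try:
            neighbors = get_neighbors(peer)
        except ConnectionError:
            unreachable.add(peer)  # peer departed while the crawl was running
            continue
        for neighbor in neighbors:
            edges.add(frozenset((peer, neighbor)))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen, edges, unreachable


# Toy overlay: peer "d" is listed by "b" but has already gone offline.
overlay = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"]}


def lookup(peer):
    if peer not in overlay:
        raise ConnectionError(peer)
    return overlay[peer]


peers, links, gone = crawl("a", lookup)
```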


Probing

• Pros:
  • It’s easy to do.
  • It can capture many peer characteristics.
• Cons:
  • Sample population may be biased based on:
    • Degree
    • Availability
    • Files shared


Centralized

• Pros:
  • It provides global knowledge of some aspects.
  • It’s easy to do for systems with a central component.
• Cons:
  • Most peer-to-peer systems don’t have a centralized component.


Characterization

• Churn
• File Characteristics
• Peer Characteristics
• Query Characteristics
• Topology
• Implementation Characteristics

Measurement: Characterization


Churn

Citation    Systems Observed     Session Time
[SGG02]     Gnutella, Napster    50% <= 60 min.
[CLL02]     Gnutella, Napster    31% <= 10 min.
[SW04]      Kazaa                50% <= 1 min.
[BSV03]     Overnet              50% <= 60 min.
[GDS+03]    Kazaa                50% <= 2.4 min.

Adapted from [RGK04]
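Each entry in the Session Time column is a point on the cumulative distribution of session lengths: "50% <= 60 min." means half of the observed sessions lasted an hour or less. A minimal sketch of how such a figure could be computed from a list of session lengths follows; the sample values are made up for illustration and do not come from any of the cited studies.

```python
def fraction_at_most(session_minutes, threshold):
    """Fraction of sessions whose length is at most `threshold` minutes."""
    if not session_minutes:
        raise ValueError("need at least one session")
    short = sum(1 for s in session_minutes if s <= threshold)
    return short / len(session_minutes)


# Hypothetical session lengths in minutes (not measured data).
sessions = [0.5, 1, 2, 5, 30, 60, 120, 600]
share_within_hour = fraction_at_most(sessions, 60)
share_within_minute = fraction_at_most(sessions, 1)
```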

Measurement: Characterization: Churn


Churn: Open Issues

• Existing results are not consistent.
• The implications of churn on the topology are not well-understood.
• The downtime distribution is unknown.
• Correlations between uptime, downtime, and future uptimes and downtimes have not been examined.


File Characteristics: Storage

• The popularity of files stored follows a Zipf distribution.
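Under a Zipf distribution, the popularity of the item at rank r is proportional to 1/r^alpha, so a few top-ranked items dominate. The sketch below illustrates that shape; the catalog size of 1000 and alpha = 1 are arbitrary assumptions, not values fitted to any measurement in this talk.

```python
def zipf_shares(n, alpha=1.0):
    """Normalized popularity of ranks 1..n, with share(r) proportional to 1 / r**alpha."""
    weights = [1 / rank ** alpha for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical catalog of 1000 files with alpha = 1.
shares = zipf_shares(1000)
top_decile_share = sum(shares[:100])  # combined popularity of the top 10% of ranks
```

With these assumed parameters the top 10% of ranks account for well over half of the total popularity, and the rank-1 file is twice as popular as the rank-2 file.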

Measurement: Characterization: File Characteristics


Zipf

From [FHKM04]


File Characteristics: Storage

• The popularity of files stored follows a Zipf distribution.
• The 10% most popular files make up 50% of all stored bytes.
• The most popular files are around 4 MB.
• However, the 3% of files which are videos make up 21% of stored bytes.
• Most files are shared by a small fraction of users.
• 25%-67% of users share no files at all.


File Characteristics: Clustering

• 30% of files have a correlation of at least 60% with at least one other file.
• If two peers have 10 files in common, there’s an 80% chance they have at least one more file in common.
• Generating a graph by treating users as nodes and assigning edges where there are more than N files in common results in a small world.
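The graph construction described in the last bullet can be sketched as follows. The function name and the toy `libraries` data are hypothetical; the sketch only builds the edge set and does not test the small-world property itself.

```python
from itertools import combinations


def shared_file_graph(libraries, n):
    """Connect two users when they have more than `n` files in common.

    `libraries` maps each user to the set of files that user shares.
    Returns a set of (user, user) edges with the names in sorted order.
    """
    edges = set()
    for a, b in combinations(sorted(libraries), 2):
        if len(libraries[a] & libraries[b]) > n:
            edges.add((a, b))
    return edges


# Toy data: three users with overlapping libraries.
libraries = {
    "u1": {"a", "b", "c"},
    "u2": {"b", "c", "d"},
    "u3": {"c", "x"},
}
strict_edges = shared_file_graph(libraries, 1)  # need at least 2 common files
loose_edges = shared_file_graph(libraries, 0)   # need at least 1 common file
```

Raising N thins the graph, so the choice of threshold determines which interest clusters remain connected.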


File Characteristics: Transfers

• 90% of files transferred are smaller than 10 MB.
• Most bytes transferred are part of files larger than 700 MB.
• The most popular files are roughly equal in popularity, while unpopular files follow Zipf.


Not-So-Zipf

Taken from [GDS+03]


File Characteristics: Transfers

• 90% of files transferred are smaller than 10 MB.
• Most bytes transferred are part of files larger than 700 MB.
• The most popular files are roughly equal in popularity, while unpopular files follow Zipf.
• The most popular 5% of transferred files account for 50% of all transfers.
• That’s around 45,000 songs, which can be stored in 175 GB.
• An inverse cache can result in savings of between 67% and 86%.


File Characteristics: Open Issues

• The shift in popularity of files over time is not well understood, requiring observations over several months.
• It would be interesting to see if correlations between files can be used to predict which files a user will want.
• No studies have characterized the swarming download feature included in many modern file-sharing applications.


Query Characteristics

• The most popular queries are of relatively equal popularity, while less popular queries follow Zipf.
• There is little relationship between sharing many files and responding to many queries.
• 40% of queries are duplicates.

Measurement: Characterization: Query Characteristics


Query Characteristics: Open Issues

• The relationships between query, transfer, and file popularities are not well-understood.
• Queries are composed of several search terms, which we know little about.
• We don’t know how long query results typically remain valid.


Topology: Open Issues

• The most recent published topology data is from mid-2001.
• At the time, Gnutella had around 50,000 peers. Today, it has more than 1 million.
• The introduction of Ultrapeers has drastically altered the topology.
• Those crawls took an hour or more to complete, but the median peer lifetime may be just a few minutes.
• No topology studies have been done on peer-to-peer networks other than Gnutella.

Measurement: Characterization: Topology Characteristics


Characterization Summary

• Churn
• File Characteristics
• Peer Characteristics
• Query Characteristics
• Topology
• Implementation Characteristics


Designing Peer-to-Peer Search

• The Conventional Wisdom: Flooding doesn’t scale.
• Improvements:
  • Use ultrapeers
  • Walk instead of flood
  • Index replication
  • Interest-based short-cuts
  • Consider distributed hash tables (DHTs)
  • Overlay-to-Internet topology matching

Design: Search


Walking

• Directed walks are globally efficient, but send all the query traffic to certain nodes.
• Random walks are efficient over an ultrapeer network, but are slow for less popular results.
• K-random walks are efficient and fast.
• Issues:
  • How far does it scale?
  • There are subtle issues that make it hard to implement.
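The idea behind k-random walks is to launch k independent walkers from the querying node instead of flooding every neighbor. The sketch below is a simplified, hypothetical version: the walkers step in lockstep, the whole search stops at the first hit, and the `graph` and `has_file` arguments stand in for the overlay and for each peer's local index. Details that make real implementations subtle, such as walker termination checks and duplicate suppression, are left out.

```python
import random


def k_random_walk(graph, has_file, start, k=4, ttl=16, rng=None):
    """Search an overlay with k random walkers; return (peer or None, messages sent).

    `graph` maps each peer to a list of its neighbors.
    `has_file(peer)` reports whether that peer can answer the query.
    """
    rng = rng or random.Random()
    if has_file(start):
        return start, 0
    walkers = [start] * k
    messages = 0
    for _ in range(ttl):
        for i, node in enumerate(walkers):
            nxt = rng.choice(graph[node])  # forward the query to one random neighbor
            messages += 1
            if has_file(nxt):
                return nxt, messages
            walkers[i] = nxt
    return None, messages


# Toy ring where every peer has a single neighbor, so the walk is deterministic.
ring = {0: [1], 1: [2], 2: [3], 3: [0]}
hit = k_random_walk(ring, lambda peer: peer == 3, start=0, k=2)
miss = k_random_walk(ring, lambda peer: False, start=0, k=2, ttl=3)
```

The message count grows as k times the number of steps, rather than with the number of edges as in flooding.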

Design: Search: Walking


Index Replication

• It’s the other side of the coin: bringing the indexing information closer to the queries.
• Ultrapeers are a special case of index replication.
• Most file-sharing systems use some type of proportional indexing.
• Napster and DHT-based systems use uniform indexing.
• However, the optimal index replication system is square-root indexing. [CS02]
• Open Issues:
  • Highly distributed, unstructured indexing schemes may not work well under heavy churn.
  • Additional indexing doesn’t help for queries that have no matches.
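The three indexing strategies differ in how a fixed budget of index copies is divided among items: equally (uniform), in proportion to query rate (proportional), or in proportion to the square root of query rate (square-root). The sketch below compares them under a deliberately simplified cost model, assumed here for illustration, in which the expected effort to find an item is inversely proportional to its number of copies. The query rates and budget are made-up numbers.

```python
import math


def allocate(query_rates, budget, strategy):
    """Split `budget` index copies among items according to a strategy."""
    if strategy == "uniform":
        weights = [1.0 for _ in query_rates]
    elif strategy == "proportional":
        weights = [float(q) for q in query_rates]
    elif strategy == "square-root":
        weights = [math.sqrt(q) for q in query_rates]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    total = sum(weights)
    return [budget * w / total for w in weights]


def expected_search_cost(query_rates, copies):
    """Average of 1/copies over queries (simplified model, not a measured cost)."""
    total = sum(query_rates)
    return sum((q / total) / c for q, c in zip(query_rates, copies))


# Hypothetical workload: three items with query rates 16, 4, and 1.
rates = [16, 4, 1]
budget = 210
costs = {
    s: expected_search_cost(rates, allocate(rates, budget, s))
    for s in ("uniform", "proportional", "square-root")
}
```

In this model uniform and proportional allocation come out with the same average cost, while the square-root allocation is lower.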

Design: Search: Index Replication


Distributed Hash Tables

• Using a DHT to index all the files in a file-sharing network is daunting.
• Let’s just index the unpopular files, using a heuristic to decide which files are unpopular.
• Issues:
  • How good are the heuristics in practice?
  • DHTs are difficult to incrementally deploy.

Design: Search: DHT


Designing Peer-to-Peer Search

• Distributed search has improved dramatically from the early pure-flooding Gnutella.
• We don’t have a good “network health” meter to tell us how well a file-sharing network is performing.


Designing Peer-to-Peer Transfer

• No papers on incremental improvements.
• Systems:
  • BitTorrent
  • Slurpie
  • Rateless Codes

Design: Transfer


BitTorrent

• Unique features:
  • Connect, and the data will come.
  • Reward those uploading the most.
• Modeling studies show…
  • BitTorrent handles an initial flash-crowd well.
  • However, it does not do as well if a second flash-crowd arrives.
• Issues:
  • We don’t know enough about where BitTorrent’s bottlenecks are, when it works well, and when it doesn’t.
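"Reward those uploading the most" can be read as a ranking rule: a peer serves the neighbors that have recently sent it the most data. The sketch below shows only that ranking step. The function name, the slot count, and the rate table are hypothetical, and the sketch omits other parts of BitTorrent's choking behavior, such as periodically trying out a random peer.

```python
def choose_unchoked(upload_rates, slots=4):
    """Pick the peers to serve: those that uploaded to us fastest.

    `upload_rates` maps a peer name to the rate (for example bytes per second)
    at which that peer has recently been sending us data.
    Ties are broken by peer name so the result is deterministic.
    """
    ranked = sorted(upload_rates, key=lambda peer: (-upload_rates[peer], peer))
    return ranked[:slots]


# Hypothetical rates observed over the last measurement interval.
rates_seen = {"a": 50, "b": 200, "c": 10, "d": 120}
served = choose_unchoked(rates_seen, slots=2)
```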


Slurpie

• Unique features:
  • Estimates the total peer network size
  • Constant load on the root server
  • Chooses a block to download first, then looks for peers
• Open Issues:
  • Just a prototype
  • Limited head-to-head experiments against BitTorrent


Rateless Codes

• Unique features:
  • Other systems need to worry about finding the right block.
  • Using encoding, we can make nearly any block the right block.
  • Modern rateless codes can generate “practically infinite” codes in O(1) time per block.
• Issues:
  • It’s unknown if finding the right block is a significant problem in existing systems.
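To see why encoding makes "nearly any block the right block", consider a toy random linear fountain code: each encoded symbol is the XOR of a random subset of source blocks, and a receiver can decode once it holds enough linearly independent symbols, no matter which ones they are. This is a swapped-in, simplified scheme chosen for brevity. It is not one of the modern rateless codes the slide refers to and does not achieve O(1) time per block. Blocks are modeled as Python integers, and all names are hypothetical.

```python
import random


def encode_symbol(blocks, rng):
    """Return (mask, payload): payload is the XOR of the blocks selected by mask."""
    k = len(blocks)
    mask = rng.randrange(1, 1 << k)  # random non-empty subset of the k blocks
    payload = 0
    for i in range(k):
        if (mask >> i) & 1:
            payload ^= blocks[i]
    return mask, payload


class Decoder:
    """Incremental Gaussian elimination over GF(2)."""

    def __init__(self, k):
        self.k = k
        self.rows = {}  # highest set bit of mask -> (mask, payload)

    def add(self, mask, payload):
        """Absorb one symbol; return False if it carried no new information."""
        while mask:
            pivot = mask.bit_length() - 1
            if pivot not in self.rows:
                self.rows[pivot] = (mask, payload)
                return True
            row_mask, row_payload = self.rows[pivot]
            mask ^= row_mask
            payload ^= row_payload
        return False

    def done(self):
        return len(self.rows) == self.k

    def blocks(self):
        """Back-substitute from the lowest pivot upward to recover every block."""
        solved = {}
        for pivot in sorted(self.rows):
            mask, payload = self.rows[pivot]
            for j in range(pivot):
                if (mask >> j) & 1:
                    payload ^= solved[j]
            solved[pivot] = payload
        return [solved[i] for i in range(self.k)]


rng = random.Random(0)
source = [0x11, 0x22, 0x33, 0x44]
decoder = Decoder(len(source))
received = 0
while not decoder.done():
    decoder.add(*encode_symbol(source, rng))
    received += 1
recovered = decoder.blocks()
```

The receiver never asks for a specific block. It simply keeps collecting symbols until the decoder reports that it has enough.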


Peer-to-Peer File Transfers: Open Issues

• We don’t have a solid understanding of where the bottlenecks are in existing systems.
• We don’t have good metrics for determining where they are.
• We don’t have good models for comparing new systems against old ones.


Peer-to-Peer Streaming

• User characteristics are well-understood based on measurements of client-server streaming.
• Multiple Description Coding: the breakthrough
• CoopNet uses multiple trees, organized by a central server.
• SplitStream uses multiple trees, organized using a DHT.
• PRO proposes a decentralized gossip scheme to build a loose mesh.
• Open Issues:
  • CoopNet and SplitStream are delay-optimized, not bandwidth-optimized.
  • PRO is still in development.
  • No significant deployment.

Design: Streaming


Conclusion

• While much has been done in the area of peer-to-peer content distribution, there are still many open avenues.
• I presently have papers under submission regarding:
  • Developing metrics to measure the accuracy of a new, efficient topology crawler
  • Characterizing the modern Gnutella topology
  • Characterizing churn in peer-to-peer networks
  • Demonstrating the effectiveness of peer-to-peer file transfers to handle flash crowds

Conclusion