1

Characterizing Files in the Modern Gnutella Network:

A Measurement Study

Shanyu Zhao, Daniel Stutzbach, Reza Rejaie

University of Oregon

SPIE Multimedia Computing and Networking 2006 (MMCN’06), 18-19 January 2006, San Jose, California, USA

2

Outline

Measurement study of modern Gnutella system

Conduct static, topological and dynamic analysis

Help to improve design and evaluations of P2P file-sharing applications

3

Previous studies

Focus on a small population

Are more than three years old

Do not examine the dynamics of file characteristics over time, or the correlation between the overlay topology and file distribution

4

Why Gnutella

One of the top three P2P systems (eDonkey2K, FastTrack, Gnutella)

Gnutella has a Browse-Host extension to extract the list of shared files from peers

One of the most studied P2P systems, so results can be compared and contrasted with previous studies

5

Original Gnutella

A new node (Node A) joins the system

Node A connects to some node (Node B) found through a pre-existing list, a particular website, IRC, etc.

Node B sends its list of working nodes to Node A

Node A connects to the provided nodes until it reaches a certain threshold

During a search, Node A sends requests to its connected nodes, which in turn forward the requests

6

Original Gnutella

Nodes reply to the request directly or indirectly, depending on whether a firewall is present

Node A downloads file pieces from one or more nodes that replied positively

Unlike Napster, Gnutella is decentralized and uses flood-based searches
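The flood-based search described on the two slides above can be sketched as a breadth-first flood with a hop limit. This is a minimal illustration, not the Gnutella wire protocol: the `flood_search` function, the `ttl` parameter, and the in-memory `neighbors` and `shared` dictionaries are assumptions introduced here.

```python
from collections import deque

def flood_search(neighbors, shared, start, keyword, ttl=3):
    """Flood a query from `start`; return peers holding a matching file.

    neighbors: dict peer -> list of connected peers (hypothetical overlay)
    shared:    dict peer -> set of file names shared by that peer
    ttl:       maximum number of hops a query is forwarded (assumed parameter)
    """
    seen = {start}                    # peers that already handled the query
    queue = deque([(start, ttl)])
    hits = []
    while queue:
        peer, hops_left = queue.popleft()
        # A peer answers if any of its shared file names contains the keyword.
        if peer != start and any(keyword in name for name in shared.get(peer, ())):
            hits.append(peer)
        if hops_left == 0:
            continue                  # hop limit reached, stop forwarding
        for nxt in neighbors.get(peer, ()):
            if nxt not in seen:       # drop duplicate copies of the query
                seen.add(nxt)
                queue.append((nxt, hops_left - 1))
    return hits

# Invented toy overlay: A-B, A-C, B-D, D-E.
overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B", "E"], "E": ["D"]}
files = {"B": {"song.mp3"}, "D": {"song_live.mp3"}, "E": {"song_remix.mp3"}}
print(flood_search(overlay, files, "A", "song", ttl=2))
```

With `ttl=2` the query never reaches Node E, which shows how the hop limit bounds the reach of a flood.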

7

Modern Gnutella

In contrast to the unstructured overlay topology of the original design, most modern Gnutella clients adopt a two-tier overlay structure

Ultrapeers and leaf peers (majority)

Legacy peers (do not implement the ultrapeer feature)

8

Measurement methodology

Problems of general crawlers

  Slow, distorted, inflate the population

Previous studies

  Partial snapshot, or periodic probe of a fixed group

  Significance is doubted

Goal of this work

  Capture the entire population (?)

  Short period

9

Measurement methodology

Topology crawl

  List of neighboring nodes

Content crawl

  List of available files of each node

  Need more
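The two crawl types can be sketched as two passes over the overlay: one that collects neighbor lists, and one that asks each discovered peer for its files. This is only a sketch under stated assumptions. The `get_neighbors` and `browse_host` callbacks are hypothetical stand-ins for real network requests, and the toy network below is invented.

```python
from collections import deque

def topology_crawl(seed, get_neighbors):
    """Walk the overlay from `seed`, recording each peer's neighbor list."""
    topology = {}
    queue = deque([seed])
    while queue:
        peer = queue.popleft()
        if peer in topology:
            continue                  # already crawled this peer
        topology[peer] = list(get_neighbors(peer))
        queue.extend(n for n in topology[peer] if n not in topology)
    return topology

def content_crawl(peers, browse_host):
    """Ask every discovered peer for its list of shared files."""
    content = {}
    for peer in peers:
        files = browse_host(peer)
        if files is not None:         # None models an unreachable peer
            content[peer] = files
    return content

# Toy network standing in for live peers (invented data).
links = {"u1": ["u2", "l1"], "u2": ["u1", "l2"], "l1": ["u1"], "l2": ["u2"]}
shares = {"u1": ["a.mp3"], "u2": [], "l1": ["a.mp3", "b.avi"]}

topology = topology_crawl("u1", lambda p: links.get(p, []))
content = content_crawl(topology, shares.get)
print(sorted(topology), content)
```

Peer `l2` appears in the topology but not in the content result, mirroring a peer that is listed by a neighbor yet cannot be browsed.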

10

Cruiser

Parallel P2P crawler

Orders of magnitude faster than previous crawlers (?)

Master-slave architecture

  Each slave crawls hundreds of peers, and the master coordinates multiple slaves

  Increases the degree of concurrency
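The master-slave split can be illustrated with a small thread pool. This is not Cruiser's implementation: threads stand in for slave processes, and `crawl_peer`, `slave`, and `master` are hypothetical names that perform no network I/O.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_peer(peer):
    """Placeholder for contacting one peer (hypothetical; no network I/O here)."""
    return peer, f"files-of-{peer}"

def slave(batch):
    """A slave crawls every peer in the batch handed to it by the master."""
    return dict(crawl_peer(p) for p in batch)

def master(peers, num_slaves=3):
    """The master splits the peer list into batches and merges slave results."""
    batches = [peers[i::num_slaves] for i in range(num_slaves)]
    results = {}
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        for partial in pool.map(slave, batches):
            results.update(partial)
    return results

snapshot = master([f"peer{i}" for i in range(10)])
print(len(snapshot))
```

Raising `num_slaves` increases how many batches are crawled at the same time, which is the sense in which the architecture increases concurrency.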

11

Cruiser

Using six off-the-shelf 1 GHz GNU/Linux boxes, a crawl takes 15 min + 5.5 hr + 15 min ≈ 6 hours

Each content crawl produces a 10 GB log file containing file names and content hashes

12

Dataset

Three measurement periods; within each period, a snapshot is taken every day

6/8/2005-6/18/2005, 8/23/2005-9/9/2005 and 10/11/2005-10/21/2005

Examine both short and long timescales

13

Dataset

14

Sources of unreachable nodes

Firewall

Severe network congestion

Peer departed

Peer does not support the Browse Host protocol

Ultrapeers: depart

Leaf peers: depart and firewall

Contact 20% of peers (~half a million)

15

Problems

Low-bandwidth TCP connections

  Some crawls do not complete within the timeout threshold, because the data is sent at an extremely low rate

File identity

  The file name is not a reliable file identifier, so this work uses the content hash

Post-processing

  More than 100 million distinct files

  Divide into 7 segments randomly, trim files with fewer than 10 copies in a segment, then combine the trimmed segments back into one
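The post-processing step can be sketched as follows. The slide does not say exactly what is divided, so this sketch assumes one possible reading: each distinct content hash is assigned to one random segment, copies are counted per segment, rare files are trimmed, and the survivors are merged. The function name and its parameters are hypothetical, and a real run would process one segment at a time to bound memory.

```python
import random
from collections import Counter

def trim_rare_files(records, num_segments=7, min_copies=10, seed=0):
    """Return copy counts for files with at least `min_copies` copies.

    records: iterable of (peer, content_hash) pairs from a content crawl.
    Each distinct hash is assigned to one random segment (an assumed reading
    of the slide), so every copy of a file is counted in the same segment.
    """
    rng = random.Random(seed)
    segment_of = {}
    segments = [Counter() for _ in range(num_segments)]
    for _peer, content_hash in records:
        if content_hash not in segment_of:
            segment_of[content_hash] = rng.randrange(num_segments)
        segments[segment_of[content_hash]][content_hash] += 1

    combined = Counter()
    for counts in segments:
        # Trim within the segment, then merge what survives.
        combined.update({h: c for h, c in counts.items() if c >= min_copies})
    return combined

# Invented data: one file with 12 copies and one file with a single copy.
records = [(f"peer{i}", "popular") for i in range(12)] + [("peer0", "rare")]
print(trim_rare_files(records))
```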

16

Static analysis

Ratio of free riders

Degree of resource sharing among cooperative peers

File popularity distribution

File type analysis

17

Ratio of free riders

Ratio of free riders has dropped

Ratio among ultrapeers is lower

Ratio among long-lived peers is slightly higher

Number of files is not strongly correlated
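As a small worked example, the free-rider ratio of one snapshot can be computed as below. The slides do not define the term, so this assumes the common definition of a free rider as a peer that shares no files. The data is invented.

```python
def free_rider_ratio(files_per_peer):
    """Fraction of peers that share no files at all (assumed definition)."""
    if not files_per_peer:
        return 0.0
    free_riders = sum(1 for count in files_per_peer.values() if count == 0)
    return free_riders / len(files_per_peer)

snapshot = {"p1": 0, "p2": 120, "p3": 0, "p4": 3, "p5": 57}   # invented data
print(free_rider_ratio(snapshot))   # 2 of 5 peers share nothing
```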

18

Degree of resource sharing among cooperative peers

The number of peers sharing x files follows a power-law distribution

19

Degree of resource sharing among cooperative peers

Contributed disk space follows a power-law distribution

20

Degree of resource sharing among cooperative peers

Correlation is not as strong as in previous studies

Discernible line with a slope of 3.7 MB/file, which is the typical size of an MP3 audio file
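To make the slope concrete, the sketch below fits a line through the origin to per-peer data. It assumes the missing figure plots contributed space against number of shared files, which the MB/file unit suggests but the transcript does not state. The data is synthetic: peers that share only MP3-sized files, so the fit recovers 3.7 by construction.

```python
def slope_through_origin(num_files, megabytes):
    """Least-squares slope of megabytes = slope * num_files (no intercept)."""
    sxy = sum(x * y for x, y in zip(num_files, megabytes))
    sxx = sum(x * x for x in num_files)
    return sxy / sxx

files = [10, 50, 200, 1000]                 # shared files per synthetic peer
space_mb = [3.7 * n for n in files]         # every file assumed to be 3.7 MB
slope = slope_through_origin(files, space_mb)
print(round(slope, 2))
```

Peers that also share large video files would sit above this line, which is one way a weaker overall correlation could arise.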

21

File popularity distribution

22

File type analysis

23

File type analysis

         Previous studies              Current study

Music    67.2% files, 79.2% bytes      67% files, 40% bytes

Video    2.1% files, 19.1% bytes       6% files, 52.5% bytes

24

Topological analysis

Per-file perspective (figures a and b)

Per-peer perspective (figure c)

25

Topological analysis

Churn (dynamics of peer participation) is the dominant factor

  Peers depart

  Peers join

  Leaf peers become ultrapeers

Rapid change in the overlay topology prevents the formation of topological clustering

26

Dynamics analysis

Variations in shared files by individual peers

Variations in popularity of individual files

Trends in popularity variations
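The first two analyses can be sketched by comparing two daily snapshots. This is an illustrative sketch, not the authors' method: a snapshot is assumed to be a dictionary mapping each peer to the set of content hashes it shares, and popularity is assumed to mean the number of peers holding a file.

```python
from collections import Counter

def per_peer_changes(before, after):
    """For peers seen in both snapshots, count files added and removed."""
    changes = {}
    for peer in before.keys() & after.keys():
        added = len(after[peer] - before[peer])
        removed = len(before[peer] - after[peer])
        changes[peer] = (added, removed)
    return changes

def popularity(snapshot):
    """Number of peers holding each file (copies per content hash)."""
    return Counter(h for files in snapshot.values() for h in files)

# Invented snapshots from two consecutive days.
day1 = {"p1": {"h1", "h2"}, "p2": {"h1"}, "p3": {"h3"}}
day2 = {"p1": {"h1", "h4"}, "p2": {"h1", "h2"}, "p4": {"h1"}}

print(per_peer_changes(day1, day2))
delta = popularity(day2)
delta.subtract(popularity(day1))      # change in copies per file
print(dict(delta))
```

Peers that appear in only one snapshot (`p3`, `p4`) are left out of the per-peer comparison but still affect file popularity.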

27

Variations in shared files by individual peers

28

Variations in popularity of individual files

Focus on the top 100 and top 1000 files

29

Trends in popularity variations

Track the top 10 files across several days (figures a and b)

Track the top 10 files over several months (figure c)

30

Conclusion

Use a parallel crawler to obtain snapshots of peer connectivity and available files

Conduct three types of analysis

Understand the distribution, correlation and dynamics of available files

31

Summary of findings

Free riding has dropped significantly

The number of shared files and the storage space contributed by individual peers follow a power-law distribution

  Most peers contribute little disk space (<100 MB), while a small number of peers contribute very large space (50-100 GB)

The popularity of individual files follows a Zipf distribution

  A small number of files are extremely popular, but the majority of files are very unpopular
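One rough way to check for Zipf-like popularity is to fit a straight line to the rank versus copy-count data on log-log axes. The sketch below does that with plain least squares, which is a simple estimator chosen for illustration and not necessarily the one used in the study. The counts are synthetic and follow an exact 1/rank law.

```python
import math

def zipf_exponent(copy_counts):
    """Estimate s in count ~ C / rank**s by least squares on log-log values.

    Needs at least two positive counts.
    """
    ranked = sorted(copy_counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

counts = [1000 / rank for rank in range(1, 201)]   # synthetic 1/rank data
print(round(zipf_exponent(counts), 3))
```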

32

Summary of findings

The most popular file type is MP3 (2/3 of all files, 1/3 of all bytes)

The popularity of video files and the space they occupy have tripled over the past few years

Video files number less than 1/10 of audio files but occupy 25% more bytes

Multimedia files account for 93% of bytes and 73% of files

33

Summary of findings

Files are randomly distributed; there is no strong correlation between the files available at peers that are one, two or three hops apart in the overlay topology

The files shared by individual peers change slowly over a timescale of days; more popular files experience larger variations in popularity
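The first finding above, about peers one to three hops apart, can be illustrated by averaging a set-similarity score over peer pairs at each hop distance. Jaccard similarity is an assumption made for this sketch and the transcript does not say which measure the study used. The overlay is assumed to be undirected, and the data is invented.

```python
from collections import deque

def hop_distances(neighbors, source, max_hops):
    """Breadth-first search returning {peer: hops} within `max_hops`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        peer = queue.popleft()
        if dist[peer] == max_hops:
            continue
        for nxt in neighbors.get(peer, ()):
            if nxt not in dist:
                dist[nxt] = dist[peer] + 1
                queue.append(nxt)
    return dist

def mean_overlap_by_hops(neighbors, files, max_hops=3):
    """Average Jaccard similarity of file sets for peer pairs k hops apart."""
    totals = {k: [0.0, 0] for k in range(1, max_hops + 1)}
    for a in neighbors:
        for b, hops in hop_distances(neighbors, a, max_hops).items():
            if hops == 0 or a >= b:      # skip self and count each pair once
                continue
            union = files[a] | files[b]
            jaccard = len(files[a] & files[b]) / len(union) if union else 0.0
            totals[hops][0] += jaccard
            totals[hops][1] += 1
    return {k: s / n for k, (s, n) in totals.items() if n}

# Invented chain overlay a-b-c-d with small file sets.
overlay = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
shared = {"a": {"x", "y"}, "b": {"x"}, "c": {"y", "z"}, "d": {"x", "y"}}
result = mean_overlap_by_hops(overlay, shared)
print(result)
```

If files were clustered in the topology, the average would fall as the hop distance grows; similar values at every distance are what a random distribution of files would produce.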