Mapping the Gnutella Network Presented By: Tony Young M.Math Candidate October 7th, 2004.
Mapping the Gnutella Network
Presented By:
Tony Young
M.Math Candidate
October 7th, 2004
Outline
Introduction
Gnutella in Depth
The Crawler
Analysis of Network
Summary and Improvements
Paper Review
Introduction
Peer-to-peer systems have recently exploded onto the internet scene
Two main contributing factors:
• Low cost and high availability of resources (computing and storage)
• Increased network connectivity (proliferation of “always on” connections)
Introduction
Peer systems build a virtual topology (overlay) with its own routing mechanisms
The topology of the overlay and its routing protocols directly affect:
• Performance: number of physical hops needed to send a message through the virtual overlay
• Reliability: will a message actually reach the other end?
• Scalability: can other nodes be added while keeping performance good?
• Anonymity: can we protect the identity of nodes in the network?
Introduction
Gnutella is studied in depth, and analysis is performed to determine how the overlay affects the four characteristics previously mentioned
• Started by capturing the network topology and behaviour
• Performed a macroscopic analysis of the network to evaluate costs and benefits
• Investigated possible improvements
Introduction
Two questions drive the analysis:
• What is the connectivity structure of Gnutella?
• How well does the Gnutella overlay map to the actual network topology?
Introduction
Connectivity structure:
• Many diverse networks, natural ones included, have a few well-connected nodes and many poorly connected nodes (i.e. they are power law networks)
• We will see that Gnutella is not a pure power law network, but it still has good fault tolerance and is less vulnerable to DoS attacks
Introduction
Overlay topology:
• Important for ISPs: overlays that don’t map closely to the physical topology add stress on the infrastructure and cost ISPs more money
• Scalability is directly linked to efficient use of network resources
Gnutella in Depth
Gnutella is an open protocol
It is decentralized and unstructured
It allows group membership and searching of available files for download
Gnutella should operate in a dynamic environment where hosts can join/leave at any time
Gnutella should experience good performance and scalability
External attacks should not cause data loss or performance degradation
Users seeking or providing unpopular material should stay anonymous
Gnutella in Depth
Gnutella nodes are called “servents” (SERVer-cliENTS)
• Provide a client-side interface for searching the file base
• Provide server-side storage, routing and responses to network messages and requests
Gnutella in Depth
To connect, a node contacts an “always on” host (e.g. gnutella.com) and sends a PING
• The contacted node replies with a PONG and forwards the PING to other nodes in the network, which reply with PONG messages and forward the PING on in turn
• The PING stops after TTL hops
Gnutella in Depth
To find files, users submit QUERY messages to other nodes
• Messages are broadcast to all neighbours, who forward them on to their neighbours, and so on for TTL hops
QUERY RESPONSE messages are returned to the querying node
Gnutella in Depth
To download a file, nodes send GET and PUSH messages to the individual hosts holding the file
• I.e. transfer requests and transfers are routed directly between communicating hosts, and not back-propagated
Gnutella in Depth
The messaging protocol has three important features:
• TTL and “hops passed” fields are attached to each message
• A randomly generated message ID is attached to each message
• Each node keeps track of recently routed messages to prevent re-broadcasting and to implement back-propagation
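The three features above can be sketched in a few lines. This is a minimal illustration, not the Gnutella wire format: the `Message` class, its field names and the `Servent.should_forward` helper are assumptions made for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    """Hypothetical message header: kind, TTL, hops passed and message ID."""
    kind: str                      # e.g. "PING" or "QUERY"
    ttl: int                       # hops the message may still travel
    hops: int = 0                  # hops already travelled
    msg_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def next_hop(self) -> "Message":
        """Copy of the message as forwarded: TTL down by one, hops up by one."""
        return Message(self.kind, self.ttl - 1, self.hops + 1, self.msg_id)


class Servent:
    """Remembers recently routed message IDs (illustrative only)."""

    def __init__(self):
        self.seen = {}             # msg_id -> neighbour the message arrived from

    def should_forward(self, msg, came_from):
        """Return True only for a first-time message that still has TTL left."""
        if msg.msg_id in self.seen:
            return False           # duplicate: do not re-broadcast
        # Remember the sender so that replies can be back-propagated.
        self.seen[msg.msg_id] = came_from
        return msg.next_hop().ttl > 0
```

The `seen` table does double duty here: it suppresses duplicates and records which neighbour a reply should be sent back to.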
Gnutella in Depth
A PING message announces the sender to the network; in the Gnutella 0.4 protocol it carries no payload of its own
A PONG message carries the responding host’s address and port, the number of files it shares and the size of its data store
Gnutella in Depth
PING messages propagate until the TTL has expired
• The hop count is incremented at each servent receiving the PING
• The message propagates until the hop count equals the TTL
PONG messages are back-propagated (i.e. sent along the reverse of the path that the original message followed) to the host that initiated the PING
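A small simulation can make the PING/PONG flow concrete. The sketch below assumes a breadth-first traversal as a stand-in for flooding; the function name `flood_ping` and the five-node overlay are invented for the example.

```python
from collections import deque


def flood_ping(graph, origin, ttl):
    """Flood a PING from `origin`; return {node: reverse path back to origin}.

    `graph` maps each node to a list of its overlay neighbours.
    """
    came_from = {origin: None}          # neighbour each node first heard the PING from
    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        if hops[node] == ttl:           # hop count has reached the TTL: stop here
            continue
        for neighbour in graph[node]:
            if neighbour in came_from:  # message ID already seen: drop the copy
                continue
            came_from[neighbour] = node
            hops[neighbour] = hops[node] + 1
            queue.append(neighbour)

    pong_paths = {}
    for node in came_from:
        if node == origin:
            continue
        path, step = [node], came_from[node]
        while step is not None:         # walk the reverse path hop by hop
            path.append(step)
            step = came_from[step]
        pong_paths[node] = path
    return pong_paths


# Hypothetical five-node overlay.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
paths = flood_ping(overlay, "A", ttl=2)
```

With `ttl=2` node E is never reached, so it sends no PONG; raising the TTL to 3 adds the path E, D, B, A.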
Gnutella in Depth
QUERY messages are sent the same way as PING messages
• Nodes check the requested search string against the names of their locally stored files
QUERY RESPONSE messages are back-propagated to the querying node and include the information necessary to download the file
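The local check a node performs on a QUERY might look like the sketch below. The matching rule (every keyword must appear in the file name, ignoring case) and the shape of the response dictionary are assumptions for illustration; servents are free to interpret the search string differently.

```python
def match_query(search, shared_files):
    """Return the shared file names that contain every keyword in `search`."""
    keywords = search.lower().split()
    if not keywords:
        return []
    return [name for name in shared_files
            if all(word in name.lower() for word in keywords)]


def query_response(search, host, port, shared_files):
    """Build a hypothetical QUERY RESPONSE, or None when nothing matches."""
    hits = match_query(search, shared_files)
    if not hits:
        return None
    # Host and port are what the querying node needs to start a download.
    return {"host": host, "port": port, "files": hits}


files = ["Free_Jazz_Live.mp3", "lecture-notes.pdf", "jazz standards.ogg"]
```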
The Crawler
In order to conduct the network tests, a crawler was developed to gather information about the virtual topology
The crawler starts with a list of active nodes and sends a PING message to each of them
PONG messages are received, and the IP, port, number of stored files and size of archive are stored in a table
The PING propagates to other nodes and the PONGs back-propagate to the crawler
The Crawler
A sequential version of the crawler was developed initially
• I.e. send a PING with an empirically determined optimal TTL to a set of nodes; resend to the nodes where the PING stops, etc.
It proved to be very slow: 50 hours to collect data from a 4 000 node network
Slowness means two things:
• Not scalable: will get slower as more nodes are added
• Does not give an accurate network snapshot: the network changes drastically over 50 hours!
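The sequential strategy can be sketched against a simulated network. Here `ping` is a stand-in for sending a real PING and collecting the PONGs, and the seven-node chain is invented; neither comes from the crawler described in the slides.

```python
from collections import deque


def ping(graph, start, ttl):
    """Stand-in for one PING: return {responding node: hop count}."""
    hops = {start: 0}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        if hops[node] == ttl:
            continue
        for nxt in graph.get(node, []):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                frontier.append(nxt)
    del hops[start]
    return hops


def sequential_crawl(graph, seeds, ttl):
    """Contact one node at a time; resend from nodes where the PING stopped."""
    discovered = set(seeds)
    to_contact = deque(seeds)
    pings_sent = 0
    while to_contact:
        node = to_contact.popleft()
        pings_sent += 1
        for found, hops in ping(graph, node, ttl).items():
            is_new = found not in discovered
            discovered.add(found)
            if is_new and hops == ttl:   # the PING stopped here: resend from it
                to_contact.append(found)
    return discovered, pings_sent


# Hypothetical chain n0 - n1 - ... - n6.
chain = {f"n{i}": [f"n{j}" for j in (i - 1, i + 1) if 0 <= j <= 6]
         for i in range(7)}
found, pings_sent = sequential_crawl(chain, ["n0"], ttl=2)
```

Every PING has to finish before the next one is sent, which is why the running time grows with the size of the network.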
The Crawler
A distributed crawler was developed next
• Client-server architecture
• Server maintains the node list and creates a network graph
• Clients receive a list of nodes to contact and discover the neighbours of those nodes
Decided to use only 50 clients at once
• Reduces the invasiveness of the search and the consumption of network resources
Reduced crawling time to a couple of hours for a large initial list and a network of 30 000 nodes
The Crawler
Network membership is defined as follows:
• A node is a member of the network if the crawler is able to connect to it
• A node might be excluded from network membership if it was reported as active by a server or another node, but the crawler could not contact it (this might happen if nodes go offline before the crawler can contact them)
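A rough sketch of the client-server crawl and the membership rule follows, using a thread pool to stand in for the crawler clients. The function `crawl_distributed` and the `discover_neighbours` callback are hypothetical; only the cap of 50 simultaneous clients is taken from the slides.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CLIENTS = 50   # the slides cap the number of simultaneous clients at 50


def crawl_distributed(seed_nodes, discover_neighbours, max_clients=MAX_CLIENTS):
    """Server loop: hand nodes to clients and merge their results into a graph.

    `discover_neighbours(node)` is a hypothetical client call that returns the
    neighbours of `node`, or raises OSError if the node cannot be contacted.
    """
    graph = {}                     # node -> set of neighbours (members only)
    unreachable = set()            # reported as active, but never contacted
    seen = set(seed_nodes)
    pending = sorted(seen)         # node list maintained by the server

    def visit(node):
        try:
            return node, set(discover_neighbours(node))
        except OSError:
            return node, None

    with ThreadPoolExecutor(max_workers=max_clients) as clients:
        while pending:
            batch, pending = pending, []
            for node, neighbours in clients.map(visit, batch):
                if neighbours is None:
                    unreachable.add(node)      # excluded from membership
                    continue
                graph[node] = neighbours
                for n in sorted(neighbours - seen):
                    seen.add(n)
                    pending.append(n)
    return graph, unreachable
```

Nodes that were reported by a neighbour but could not be contacted end up in `unreachable` rather than in the graph, mirroring the membership definition above.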
Outline
Introduction
Gnutella in Depth
The Crawler
Analysis of Network
• Growth Trends
• Traffic Estimates
• Connectivity and Reliability
• Overlay vs. Topology
Summary and Improvements
Paper Review
Analysis of Network
Data was collected over a 6-month period
Data shows:
• Overhead traffic is decreasing
• Traffic volume is a significant barrier to growth
Growth Trends
The size of the network is growing rapidly
• Largest connected component in November 2000 had 2 063 nodes
• Largest connected component in May 2001 had 48 195 nodes!
• The largest connected component has grown roughly 23-fold (48 195 / 2 063 ≈ 23.4)
Growth Trends
Despite the explosive growth, most nodes do not stay connected for long
Successive crawls of the network found:
• 40% of nodes leave the network in less than 4 hours
• 25% of nodes are alive for more than 24 hours
Traffic Estimates
A modified version of the crawler recorded the traffic generated across one randomly chosen link
• 36% of total traffic (in bytes) is user-generated QUERY messages
• 55% is group membership (PING/PONG) messages
• 9% is non-standard or malformed messages
N.B. File transfer traffic is excluded
Traffic Estimates
After June 2001 (when a new Gnutella implementation was released):
• 92% of total traffic (in bytes) was QUERY messages
• 8% was group membership (PING/PONG) messages
N.B. File transfer traffic is excluded
Traffic Estimates
95% of all nodes are reachable within 7 hops
• Thus, messages typically use TTL = 7
• For these reasons, most links are expected to carry similar amounts of traffic
As verified empirically, the total Gnutella-generated traffic is proportional to the number of connections in the network
• However, the average number of connections per node stays relatively constant as the network grows
Traffic Estimates
The total traffic estimate for the Gnutella network is 1 Gbps
• I.e. 170 000 connections for a 50 000 node network times 6 kbps per connection
• This is approximately 330 TB/month!
• Excluding file transfers!
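The arithmetic behind these figures can be checked directly. The connection count and per-connection rate come from the slide; the 30-day month and the decimal terabyte (10^12 bytes) are assumptions needed to reproduce the monthly total.

```python
connections = 170_000                 # from the slide, for a 50 000 node network
rate_bps = 6_000                      # 6 kbps per connection
seconds_per_month = 30 * 24 * 3600    # assumption: a 30-day month

total_bps = connections * rate_bps
total_gbps = total_bps / 1e9
bytes_per_month = total_bps / 8 * seconds_per_month
terabytes_per_month = bytes_per_month / 1e12   # assumption: 1 TB = 10^12 bytes

print(f"{total_gbps:.2f} Gbps, about {terabytes_per_month:.0f} TB/month")
# prints "1.02 Gbps, about 330 TB/month"
```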
Traffic Estimates
This total is 1.7% of the total internet traffic in US backbones in December 2000
This volume of traffic is believed to be an obstacle to further growth
The underlying network topology must be used more efficiently to allow scaling and wider deployment
Connectivity and Reliability
Note: nodes decide locally:
• How many connections to support
• When to add or drop a connection
Recent research shows that many natural systems organize themselves into “power law networks”
• I.e. networks where a few nodes are well connected and most nodes have very few connections
Connectivity and Reliability
Power law networks:
• The number of nodes with L links (connections) is proportional to L^{-k}, where k is system-dependent
• Resilient to losing many poorly connected nodes
• Fall apart quickly if only a few well-connected nodes are lost
• Extremely robust to random failures, but vulnerable to targeted attacks
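The contrast between losing poorly connected nodes and losing well-connected ones can be illustrated on a toy graph. The hub-and-leaf network below is hypothetical and far more extreme than any measured Gnutella topology; it only shows the mechanism.

```python
def largest_component(graph, removed):
    """Size of the largest connected component once `removed` nodes are gone."""
    alive = set(graph) - set(removed)
    best, seen = 0, set()
    for start in alive:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for nxt in graph[node]:
                if nxt in alive and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best


def hub_network(hubs=3, leaves_per_hub=10):
    """Hypothetical hub-dominated graph: a few hubs, many one-link leaves."""
    graph = {f"hub{h}": set() for h in range(hubs)}
    for h in range(hubs):
        for other in range(hubs):
            if other != h:
                graph[f"hub{h}"].add(f"hub{other}")
        for i in range(leaves_per_hub):
            leaf = f"leaf{h}-{i}"
            graph[leaf] = {f"hub{h}"}
            graph[f"hub{h}"].add(leaf)
    return graph


graph = hub_network()
by_degree = sorted(graph, key=lambda n: (len(graph[n]), n))
poorly_connected = by_degree[:3]      # three one-link leaves
well_connected = by_degree[-3:]       # the three hubs

after_leaf_loss = largest_component(graph, poorly_connected)   # 30 of 33 remain connected
after_hub_loss = largest_component(graph, well_connected)      # only isolated leaves remain
```

Removing three leaves barely changes the network, while removing the three hubs leaves every remaining node isolated.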
Connectivity and Reliability
Power law networks appear as a straight line on a log-log plot
Data for December 2000 shows that early Gnutella networks were power law
Data for March 2001 shows that later Gnutella networks are a mixture:
• The number of nodes is roughly constant for link counts below 10
• Above 10 links, nodes follow a power law structure
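One rough way to test for a straight line on a log-log plot is a least-squares fit of log(count) against log(links). The sketch below is an assumption about how such a check could be done, not the method used in the paper, and the data is synthetic.

```python
import math


def fit_power_law(degree_counts):
    """Least-squares fit of log(count) = log(c) - k * log(links).

    `degree_counts` maps a number of links L to the number of nodes with
    L links. Returns the pair (k, c).
    """
    points = [(math.log(links), math.log(count))
              for links, count in degree_counts.items()
              if links > 0 and count > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic data following count = 1000 * L^-2 exactly (not measured data).
synthetic = {links: 1000 * links ** -2 for links in (1, 2, 4, 8, 16, 32)}
k, c = fit_power_law(synthetic)
```

For a mixed distribution like the March 2001 data, the fit would only be meaningful for the part above 10 links.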
Connectivity and Reliability
Why did the distribution change? Two possible reasons:
• About 20% of Gnutella users have modem connections; DSL and faster links can support more connections
• Gnutella users run as many connections as their network can support, because the perception is that more connections = better query results
Connectivity and Reliability
Does the change in distribution affect reliability? Yes!
• Preserves resilience to random failures
• Makes the network less dependent on well-connected nodes and hence less prone to DoS attacks
Overlay vs. Topology
Peer systems change the way bandwidth is used on the internet
• Servers are at the edge of the network now, and peers are constantly downloading
Most ISPs use flat-rate billing
• Peer systems may break this model!
Overlay vs. Topology
Due to the amount of traffic peer systems generate, efficient use of resources is important
The greater the mismatch between the overlay and the physical network topology, the more messages need to be transmitted to route information from A to B
• This means more stress on the network resources
Overlay vs. Topology
Figure: Communication from A to all other nodes requires one message over the D - E link
Overlay vs. Topology
Figure: Communication from A to all other nodes requires six messages over the D - E link
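The effect of overlay mismatch can be reproduced with a small count. The eight-host layout below is hypothetical: it assumes hosts A to D sit on one side of the physical D - E link, hosts E to H on the other, and that D - E is the only route between the two sides. The two overlay trees were chosen so that the counts come out as one and six; they are not taken from the slides.

```python
LEFT = {"A", "B", "C", "D"}      # hosts on D's side of the physical D - E link
RIGHT = {"E", "F", "G", "H"}     # hosts on E's side


def messages_over_bridge(overlay_tree):
    """Count overlay links whose endpoints sit on opposite sides of D - E.

    Each such overlay link carries one copy of A's broadcast across the
    physical D - E link.
    """
    return sum(1 for u, v in overlay_tree
               if (u in LEFT and v in RIGHT) or (u in RIGHT and v in LEFT))


# Overlay that follows the physical layout: only D and E talk across the link.
matched = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E"),
           ("E", "F"), ("E", "G"), ("E", "H")]

# Overlay that ignores it: neighbours mostly sit on opposite sides.
mismatched = [("A", "E"), ("E", "B"), ("B", "F"), ("F", "C"),
              ("C", "G"), ("G", "H"), ("H", "D")]

matched_cost = messages_over_bridge(matched)        # 1
mismatched_cost = messages_over_bridge(mismatched)  # 6
```

Both overlays connect the same eight hosts with seven links; only the choice of neighbours differs.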
Overlay vs. Topology
How well does Gnutella map to the topology?
• Assume that domain names are roughly indicative of the hierarchy of the internet
• Check how well the generated traffic maps to the clusters of domain names found by the crawler
Overlay vs. Topology
After analysis of 10 overlays, it was found that Gnutella nodes often connect to peers outside of their respective domains
Thus, it appears that Gnutella does not make efficient use of the underlying topology
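A much simplified version of the domain-name check is to measure what fraction of overlay links stay inside one domain. The helper names, the two-label definition of a domain and the hostnames below are all assumptions for illustration; the paper's own analysis is not reproduced here.

```python
def domain_of(hostname, levels=2):
    """Last `levels` labels of a hostname, e.g. 'campus.example'."""
    return ".".join(hostname.lower().rstrip(".").split(".")[-levels:])


def intra_domain_fraction(links):
    """Fraction of overlay links whose two endpoints share a domain."""
    if not links:
        return 0.0
    same = sum(1 for a, b in links if domain_of(a) == domain_of(b))
    return same / len(links)


# Hypothetical overlay links (hostnames invented for the example).
links = [
    ("pc1.cs.campus.example", "pc2.math.campus.example"),
    ("pc1.cs.campus.example", "dsl-17.isp.example"),
    ("dsl-17.isp.example", "host.lab.example"),
    ("host.lab.example", "pc2.math.campus.example"),
]
fraction = intra_domain_fraction(links)
```

A low fraction means that most overlay links leave the domain, which is the kind of mismatch the slide describes.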
Summary and Improvements
Gnutella has a multimodal connectivity distribution that is partially constant and partially power law
• Network is resilient to random failures
• Network is harder for malicious parties to attack, but not immune to DoS attacks
Gnutella makes little effort to ward off attackers
• E.g. topology, connectivity and traffic information is easy to obtain and can be used to plan attacks
Summary and Improvements
Gnutella’s traffic volume is a significant fraction of all internet traffic
• Makes the future growth of the network reliant on efficient use of the topology
Gnutella’s overlay does not match the network topology very well
• This substantially increases the number of messages and the amount of network traffic generated
Summary and Improvements
Necessary improvements:
• Make efforts to hide overlay and connectivity information (encryption?)
• Match the overlay more closely with the topology
• Limits to growth must be addressed first, and quickly, given the rate at which Gnutella is growing
Summary and Improvements
Suggested improvements:
• Exploit locality of files and query distribution (e.g. caching and localized queries)
• Replace the query flooding strategy with something more efficient (e.g. superpeer routing and group communication)
Paper Review
Organization
• Some discussions of the Gnutella architecture and protocols were scattered throughout the paper
• Should have combined everything into a more logical order inside the protocol section
Writing Style
• Generally very good. Some missing words and poor grammar
Paper Review
Novel Ideas
• Presented a qualitative and quantitative analysis of the Gnutella network, and some important points for P2P as a whole
Content
• Some backing information was missing
• Some claims were made without supporting evidence, or by just referring the reader to another paper
Questions?