Respondent-driven Sampling for Characterizing Unstructured Overlays A. H. Rasti University of Oregon...


Respondent-driven Sampling for Characterizing Unstructured Overlays

A. H. Rasti, M. Torkjazi, R. Rejaie (University of Oregon)
N. Duffield, W. Willinger (AT&T Labs - Research)
D. Stutzbach (Stutzbach Enterprises)

Graciously Presented By: Shubho Sen (AT&T Labs - Research)

Motivation

• P2P systems are very popular in practice.
  – Millions of simultaneous users.
  – A significant fraction of Internet traffic.
• Measurement studies aid understanding existing systems and user behavior.
• Capturing an accurate global “snapshot” is often infeasible.
  – P2P systems are distributed, large, and rapidly changing.
  – P2P crawlers are likely to capture incomplete or distorted snapshots.
• Sampling is a natural approach, and has been used implicitly in most earlier P2P measurement studies.

How can we collect representative samples?

The Graph Sampling Problem

• We focus on sampling peer properties, such as number of neighbors (degree), access link bandwidth, session time, # files
• Sampling peer properties has two steps:
  – Discovering and selecting peers (or samples)
  – Measuring the desired properties of selected peers
• Selecting peers uniformly at random is hard: there are two sources of bias [Stutzbach:IMC06]
  – Topological: high-degree peers are more likely to be selected
  – Temporal: short-lived peers are more likely to be selected
• Random walks are a promising approach to sampling
  – The resulting bias is precisely known
  – Samples can be collected in parallel by multiple walkers

Sampling Using Random Walk

• Random walks can be described with a transition matrix P(x,y)
• P(x,y): probability of moving from x to y

$$P(x,y) = \begin{cases} \dfrac{1}{\deg(x)} & \text{if } y \text{ is a neighbor of } x \\ 0 & \text{otherwise} \end{cases}$$

• P^r(x,y): probability of moving from x to y after r moves
• Random walks converge to a stationary distribution

$$\pi(x) = \lim_{r \to \infty} (v P^r)(x) = \frac{\deg(x)}{2|E|}$$

• Problem: we need a uniform distribution
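The degree bias of a regular random walk can be checked empirically. The following is a minimal Python sketch, not taken from the slides: the four-node graph `adj`, the seed, and the step count are hypothetical choices made only for illustration. It compares how often each node is visited with the predicted stationary probability deg(x) / (2|E|).

```python
import random
from collections import Counter

def random_walk(adj, start, steps, rng):
    """Regular random walk: from x, move to a uniformly chosen neighbor."""
    x = start
    visits = Counter()
    for _ in range(steps):
        x = rng.choice(adj[x])
        visits[x] += 1
    return visits

# Hypothetical toy graph: undirected, connected, and non-bipartite.
adj = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1, 3],
    3: [0, 2],
}

rng = random.Random(1)
steps = 200_000
visits = random_walk(adj, 0, steps, rng)

# Each undirected edge appears in two adjacency lists.
num_edges = sum(len(neighbors) for neighbors in adj.values()) // 2

for x in sorted(adj):
    empirical = visits[x] / steps
    predicted = len(adj[x]) / (2 * num_edges)
    print(x, round(empirical, 3), round(predicted, 3))
```

On this toy graph the degree-3 nodes should be visited about 1.5 times as often as the degree-2 nodes, which is the topological bias that the sampling techniques must correct.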

Metropolized Random Walk (MRW)

• The Metropolis-Hastings method modifies the transition matrix to yield the desired uniform distribution [Stutzbach:IMC06]

$$Q(x,y) = \begin{cases} P(x,y)\,\min\!\left(\dfrac{\deg(x)}{\deg(y)},\, 1\right) & \text{if } x \neq y \\ 1 - \sum_{y \neq x} Q(x,y) & \text{if } x = y \end{cases}$$

• MRW method:
  – Select a neighbor y of x uniformly at random
  – Transition to y with probability min(deg(x)/deg(y), 1)
  – Otherwise, self-loop to x.
  – Results in uniform stationary dist. π(x) = 1/|V|

MRW compensates for bias as samples are collected
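The MRW steps can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `mrw_step`, the toy graph `adj`, the seed, and the step count are all assumptions. It proposes a uniformly chosen neighbor, accepts it with probability min(deg(x)/deg(y), 1), and otherwise stays at x.

```python
import random
from collections import Counter

def mrw_step(adj, x, rng):
    """One Metropolized random walk step from node x."""
    y = rng.choice(adj[x])                        # propose a uniformly chosen neighbor
    accept = min(len(adj[x]) / len(adj[y]), 1.0)  # min(deg(x)/deg(y), 1)
    return y if rng.random() < accept else x      # otherwise self-loop at x

# Hypothetical toy graph: undirected, connected, with mixed degrees.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}

rng = random.Random(7)
steps = 200_000
x = 0
visits = Counter()
for _ in range(steps):
    x = mrw_step(adj, x, rng)
    visits[x] += 1

for node in sorted(adj):
    print(node, round(visits[node] / steps, 3), "target", 1 / len(adj))
```

Despite the unequal degrees, the visit frequencies should all approach 1/|V| = 0.25, because the acceptance rule makes the transition probabilities symmetric.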

This paper

• Presents a new graph sampling technique, Respondent-Driven Sampling (RDS)
• Compares the performance of RDS and MRW sampling techniques using simulations & experiments

Respondent-driven Sampling

• A development of Snowball Sampling [Salganik04]
• Commonly used in social sciences to sample “hidden” populations, e.g. HIV+ individuals
• Social relationships (references) are used by sampler to diffuse into hidden populations
  – Each person introduces n other persons
  – Similar to random walk (n = 1)
• We adopt the RDS technique from social sciences for sampling P2P networks

RDS Formulation

• Goal: Estimate the distribution of node property X
• Perform a regular random walk, collecting the value of property X and the node degree deg(v) at each visited node
• Deal with the bias during the post-processing as follows:
  – Divide possible values for X into several ranges: {R_1, ..., R_m}
  – Partition nodes with the X value within the same range: {V_1, ..., V_m}
• Using the Hansen-Hurwitz estimator to compensate for the bias, the proportion of all nodes in group i is estimated as follows:

$$\hat{p}_i = \frac{\sum_{v \in T_i} \frac{1}{\deg(v)}}{\sum_{v \in T} \frac{1}{\deg(v)}}$$

• T_i: visited samples in group i
• T: all visited samples
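The estimator above weights each visited sample by 1/deg(v), which cancels the degree bias of the regular random walk. A minimal Python sketch follows; the function name `rds_estimate`, the `group_of` callback, and the sample values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict

def rds_estimate(samples, group_of):
    """Estimate group proportions from random-walk samples.

    samples:  list of (x_value, degree) pairs, one per visited node (the set T).
    group_of: function mapping an X value to its group index i.
    """
    weight = defaultdict(float)
    total = 0.0
    for x_value, degree in samples:
        w = 1.0 / degree                  # weight of one visited sample
        weight[group_of(x_value)] += w    # numerator: sum over T_i
        total += w                        # denominator: sum over T
    return {group: w / total for group, w in weight.items()}

# Hypothetical walk output: high-degree nodes are over-represented.
samples = [(10, 2), (10, 2), (50, 4), (50, 4), (50, 4), (50, 4)]

def group_of(x_value):
    """Two hypothetical ranges: R_1 = [0, 30) and R_2 = [30, inf)."""
    return 0 if x_value < 30 else 1

estimate = rds_estimate(samples, group_of)
naive = {0: 2 / 6, 1: 4 / 6}
print("RDS estimate:", estimate)
print("naive proportions:", naive)
```

In this example the naive proportions are 1/3 and 2/3, while the weighted estimate is 0.5 for each group: the degree-4 nodes were sampled twice as often, and each carries half the weight.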

Evaluation Overview

• Performance metric
  – Consider only peer properties that may interact with the walk:
    • 1) Peer Degree, 2) Peer Uptime, 3) Peer RTT
  – Compare the dist. of these peer properties from samples and “ground truth” using Kolmogorov-Smirnov (KS) statistics
• Evaluation Methodology
  – Evaluation over static graphs
    • Effect of graph structure
  – Evaluation over dynamic graphs (session level simulation)
    • Benefits of parallel sampling (see the paper)
    • Effect of 1) churn, 2) peer discovery, 3) target peer degree
  – Experiments over Gnutella network
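The KS statistic is the largest vertical gap between two cumulative distribution functions, so 0 means identical distributions and 1 means no overlap. The slides do not say how it was computed; the following is a minimal pure-Python sketch of the two-sample form, with the function name `ks_statistic` and the example values being assumptions.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        value = min(a[i], b[j])
        # Advance both pointers past every observation <= value,
        # so i/len(a) and j/len(b) are the CDFs evaluated at value.
        while i < len(a) and a[i] <= value:
            i += 1
        while j < len(b) and b[j] <= value:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

# Hypothetical degree samples: one from a sampler, one used as reference.
sampled = [1, 2, 3, 4]
reference = [3, 4, 5, 6]
print(ks_statistic(sampled, reference))
```

The loop can stop as soon as one sample is exhausted, because from that point its CDF is 1 and the gap to the other CDF can only shrink.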

Evaluation: Static Graphs

• Using graphs with different degree distribution & clustering characteristics:
  – Random graphs (ER): Erdos-Renyi
  – Small-world graphs (SW): Watts and Strogatz
  – Scale-free graphs (BA): Barabasi and Albert
  – Hierarchical Scale-Free graphs (HSF): Barabasi ‘02
    • Power-law degree distribution
    • Node clustering is inversely proportional to node degree
  – Gnutella graphs (GA): Snapshots of Gnutella Ultrapeer topology

Hierarchical Scale-Free (HSF)

Static Graphs

• Accuracy of both techniques is improved with the number of samples in most cases
• The rate of improvement in accuracy is much lower over HSF especially for MRW
• Walkers are likely to get trapped within clusters in HSF graphs
• Leaving a cluster requires visiting high degree nodes but MRW is less likely to visit these nodes
• Rewiring a small fraction of randomly selected edges in HSF significantly improves accuracy for both techniques

RDS is less sensitive to graph clustering than MRW

Dynamic Graphs

• Churn is a primary limiting factor for accuracy
• Session len. > 5m: very good sampling accuracy
• Churn model has little effect
• Similar impact on other peer properties (see the paper)
• Sampling error is small once nodes have sufficient connectivity (> 5)
• Lower accuracy for smaller degree is due to graph partitioning
• Partitioned nodes in the History mech. reduce the accuracy of sampling

Experiment: Gnutella

• Run crawler, 1000 RDS & 1000 MRW walkers in parallel
  – 500 steps per walker
• Use snapshots captured by the crawler as a “rough” reference
  – Show min, max, avg KS over 6 experiments
  – Focus only on degree dist
• The degree dist from samples & crawls are very similar (KS~0.03)
• The accuracy is an order of magnitude lower than in the dynamic sim, due to the inaccurate reference.
• Both sampling techniques achieve similar accuracy

Conclusions & Future Work

• RDS always performs as well as or better than MRW
• High level of graph clustering can significantly degrade the accuracy of both RDS and MRW
  – RDS is less sensitive than MRW to graph clustering
• There is a sweet spot for the number of parallel samplers.
• Poor connectivity & high dynamics adversely affect the accuracy of both techniques.
• Future Work:
  – RDS is a promising approach for sampling user properties in Online Social Networks
  – Sampling over directed graphs raises new challenges.

Thank You!

Different Graph Structures

Dynamic Simulation Setting

• Simulation environment
  – Session-time distributions: Weibull, Exponential, Pareto
  – Poisson arrival process
  – Peer discovery: Oracle, FIFO, HeartBeat, History
  – Target population: 100,000
  – Min. Degree: 3-30
  – Sampling Parameters:
    • Node degree (DEG)
    • Node query latency (RTT)
    • Session length/uptime (UT)
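As a rough illustration of the three session-time models listed above, the sketch below draws session lengths with Python's standard `random` module. The slides do not give any distribution parameters, so every scale, shape, mean, and minimum value here is a hypothetical placeholder, as are the function names.

```python
import random

rng = random.Random(42)

# All parameter values below are hypothetical; the slides do not specify them.
def exponential_session(mean_minutes=30.0):
    """Exponential session length with the given mean."""
    return rng.expovariate(1.0 / mean_minutes)

def weibull_session(scale=30.0, shape=0.6):
    """Weibull session length; shape < 1 gives a heavier tail."""
    return rng.weibullvariate(scale, shape)

def pareto_session(minimum=5.0, shape=1.5):
    """Pareto session length, never shorter than `minimum`."""
    return minimum * rng.paretovariate(shape)

samplers = {
    "exponential": exponential_session,
    "weibull": weibull_session,
    "pareto": pareto_session,
}

sessions = {name: [draw() for _ in range(1000)] for name, draw in samplers.items()}

for name, values in sessions.items():
    median = sorted(values)[len(values) // 2]
    print(name, "median session length (min):", round(median, 1))
```

Swapping the sampler while keeping the arrival process fixed is one simple way to test whether the churn model itself matters.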

Dynamic Graphs: Effect of Parallelism

• Too much parallelism does not improve performance
• Overly long random walks have a negative effect
• A sweet spot exists