WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Page 1: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS

2010.6.9 junction

Page 2: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Outline

• Motivation and Problem Statement
• Sampling Methodology
• Evaluation of Sampling Techniques
• Facebook Data Analysis
• Conclusion

Page 3: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Online Social Networks (OSNs)
• A network of declared friendships between users, allowing users to maintain relationships
• Many popular OSNs with different focus: Facebook, LinkedIn, Flickr, …

[Figure: example social graph with nodes A, B, C, D, E, F, G, H]

Facebook
• More than 400 million active users
• 50% of them log on to Facebook on any given day
• The average user has 130 friends
• People spend over 500 billion minutes per month on Facebook
• More than 100 million mobile users
• Mobile users are twice as active as non-mobile users

Social Graph
• undirected graph $G = (V, E)$
• $V$: nodes (users)
• $E$: edges (relationships)
• $k_v$: degree of node $v$

Page 4: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Why Sample OSNs?

Representative samples are desirable
• to study properties
• to test algorithms

Obtaining a complete dataset is difficult
• companies are usually unwilling to share data
• tremendous overhead to measure everything (~100 TB for Facebook)

Page 5: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Problem Statement

Obtain a representative sample of users in a given OSN by exploration of the social graph.
• Uniform sample of Facebook users
• Explore the graph using various crawling techniques

Page 6: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Outline

• Motivation and Problem Statement
• Sampling Methodology
  - Crawling Methods
  - Convergence Evaluation
  - Data Collection
• Evaluation of Sampling Techniques
• Facebook Data Analysis
• Conclusion

Page 7: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Crawling Methods
• Breadth First Search (BFS)
• Random Walk (RW)
• Re-Weighted Random Walk (RWRW)
• Metropolis-Hastings Random Walk (MHRW)
• Uniform Sampling (UNI)

Page 8: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Breadth First Search (BFS)

Early measurement studies of OSNs use BFS as primary sampling technique

Starting from a seed, explores all neighbor nodes.

As this method discovers all nodes within some distance from the starting point, an incomplete BFS is likely to densely cover only some specific region of the graph.

BFS leads to bias towards high degree nodes

[Figure: BFS on the example graph (nodes A to H); legend: Unexplored, Explored, Visited]
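The BFS crawl described on this slide can be sketched as follows. This is a minimal illustration, not the crawler used in the study: the toy graph `GRAPH`, the function name `bfs_sample`, and the `budget` parameter are all assumptions made for the example.

```python
from collections import deque

# Hypothetical toy graph (adjacency lists); not taken from the slides.
GRAPH = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A", "E", "F", "G", "H"],
    "D": ["A", "B"],
    "E": ["C", "F"],
    "F": ["C", "E"],
    "G": ["C", "H"],
    "H": ["C", "G"],
}


def bfs_sample(graph, seed, budget):
    """Return the first `budget` nodes discovered by BFS from `seed`."""
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < budget:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                visited.append(neighbor)
                queue.append(neighbor)
                if len(visited) == budget:
                    break
    return visited
```

With a small budget, `bfs_sample(GRAPH, "A", 4)` only covers A and its immediate neighborhood, which illustrates why an incomplete BFS densely covers one region of the graph.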

Page 9: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Random Walk (RW)

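• Explores the graph one node at a time, with replacement
• Transition probability from the current node $v$ to a neighbor $w$ (the next candidate):

$$P^{RW}_{v,w} = \frac{1}{k_v}$$

where $k_v$ is the degree of node $v$.

• In the stationary distribution, the walk is biased towards higher-degree nodes ($\pi_v \sim k_v$):

$$\pi_v = \frac{k_v}{2|E|}$$

where $|E|$ is the number of edges.

[Figure: random walk on the example graph (nodes A to H); each of the three neighbors of the current node is the next candidate with probability 1/3]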
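A small sketch of the simple random walk and its degree-proportional stationary distribution. The toy graph and the names `rw_step`, `random_walk`, and `rw_stationary` are assumptions for illustration only. Exact fractions are used so the balance condition $\sum_v \pi_v P^{RW}_{v,w} = \pi_w$ can be checked without rounding.

```python
import random
from fractions import Fraction

# Hypothetical toy graph (adjacency lists); not taken from the slides.
GRAPH = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A", "E", "F", "G", "H"],
    "D": ["A", "B"],
    "E": ["C", "F"],
    "F": ["C", "E"],
    "G": ["C", "H"],
    "H": ["C", "G"],
}


def rw_step(graph, node, rng):
    """Move to a neighbor chosen uniformly at random (probability 1/k_v)."""
    return rng.choice(graph[node])


def random_walk(graph, seed, steps, rng):
    """Return the nodes visited by a simple random walk, with repetitions."""
    node = seed
    samples = []
    for _ in range(steps):
        node = rw_step(graph, node, rng)
        samples.append(node)
    return samples


def rw_stationary(graph):
    """Stationary distribution pi_v = k_v / (2|E|) of the simple random walk."""
    two_e = sum(len(nbrs) for nbrs in graph.values())  # each edge counted twice
    return {v: Fraction(len(nbrs), two_e) for v, nbrs in graph.items()}
```

In this toy graph node C has 5 of the 20 edge endpoints, so a long walk spends about a quarter of its time there even though C is only one of eight nodes.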

Page 10: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Re-Weighted Random Walk (RWRW)

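• Corrects for degree bias at the end of collection
• Without re-weighting, the probability distribution for a node property A (e.g. the degree, network size, ...) is:

$$p(A_i) = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1}$$

• Re-weighted probability distribution:

$$\hat{p}(A_i) = \frac{\sum_{u \in A_i} 1/k_u}{\sum_{u \in V} 1/k_u}$$

where $k_u$ is the degree of node $u$. In practice both sums run over the collected sample: the numerator over the sampled nodes whose property takes the value $A_i$, the denominator over all sampled nodes.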
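The re-weighting step can be sketched as below. The function name `rwrw_estimate`, the `DEGREE` table, and the idealized sample are assumptions made for this example. Each sampled node contributes a weight of $1/k_u$ instead of 1.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical node degrees for a toy graph; not taken from the slides.
DEGREE = {"A": 3, "B": 2, "C": 5, "D": 2, "E": 2, "F": 2, "G": 2, "H": 2}


def rwrw_estimate(samples, degree, prop):
    """Estimate the distribution of prop(u), weighting each sample by 1/k_u."""
    weights = defaultdict(Fraction)
    total = Fraction(0)
    for u in samples:
        w = Fraction(1, degree[u])
        weights[prop(u)] += w
        total += w
    return {value: w / total for value, w in weights.items()}


# Idealized RW sample: each node appears exactly k_u times (pi_u ~ k_u).
IDEAL_RW_SAMPLE = [u for u, k in DEGREE.items() for _ in range(k)]
```

Calling `rwrw_estimate(IDEAL_RW_SAMPLE, DEGREE, lambda u: DEGREE[u])` recovers the true degree distribution of the eight nodes (six nodes of degree 2, one of degree 3, one of degree 5), whereas raw counts on the same sample would over-weight degrees 3 and 5.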

Page 11: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

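Metropolis-Hastings Random Walk (MHRW)

• Explores the graph one node at a time, with replacement
• Transition probability from the current node $v$ to node $w$:

$$P^{MH}_{v,w} = \begin{cases} \dfrac{1}{k_v}\min\left(1, \dfrac{k_v}{k_w}\right) & \text{if } w \text{ is a neighbor of } v \\ 1 - \sum_{y \neq v} P^{MH}_{v,y} & \text{if } w = v \end{cases}$$

• In the stationary distribution: exactly the uniform distribution

$$\pi_v = \frac{1}{|V|}$$

• Example (current node A with three neighbors, next candidate C):

$$P^{MH}_{A,C} = \frac{1}{3}\cdot\frac{3}{5} = \frac{1}{5}, \qquad P^{MH}_{A,A} = 1 - \left(\frac{1}{3} + \frac{1}{3} + \frac{1}{5}\right) = \frac{2}{15}$$

[Figure: MHRW on the example graph (nodes A to H); transition probabilities 1/3, 1/5, 1/3 to the neighbors of the current node and 2/15 for staying]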
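A sketch of the MHRW transition rule, checked against the worked numbers on this slide. The toy graph is hypothetical: it is only chosen so that A has degree 3 and its neighbor C has degree 5, matching the slide's example. The names `mh_transition` and `mhrw_step` are likewise assumptions.

```python
import random
from fractions import Fraction

# Hypothetical toy graph with k_A = 3 and k_C = 5, as in the slide's example.
GRAPH = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A", "E", "F", "G", "H"],
    "D": ["A", "B"],
    "E": ["C", "F"],
    "F": ["C", "E"],
    "G": ["C", "H"],
    "H": ["C", "G"],
}


def mh_transition(graph, v, w):
    """Exact MHRW transition probability from v to w (graph has no self-loops)."""
    k_v = len(graph[v])
    if w == v:
        # Probability of staying: whatever is left after all proposed moves.
        return 1 - sum(mh_transition(graph, v, y) for y in graph[v])
    if w in graph[v]:
        return Fraction(1, k_v) * min(Fraction(1), Fraction(k_v, len(graph[w])))
    return Fraction(0)


def mhrw_step(graph, v, rng):
    """Propose a uniform neighbor w; accept with probability min(1, k_v / k_w)."""
    w = rng.choice(graph[v])
    if rng.random() < len(graph[v]) / len(graph[w]):
        return w
    return v  # rejected: the walk stays at v, and v is sampled again
```

Here `mh_transition(GRAPH, "A", "C")` gives 1/5 and `mh_transition(GRAPH, "A", "A")` gives 2/15, as on the slide. Because the transition matrix is symmetric, every column also sums to 1, which is why the uniform distribution is stationary.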

Page 12: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Uniform Sampling (UNI)

• Serves as a basis for comparison (ground truth)
• Rejection sampling
  - sample uniformly from the 32-bit ID space
  - discard the non-existing IDs
  - this yields a uniform sample of the existing user IDs in Facebook for any allocation policy (i.e. even if the userIDs are not evenly allocated in the 32-bit address space)
• UNI is not a general solution for sampling OSNs
  - the userID space must not be sparse
  - some OSNs use names instead of numbers
  - it must be supported by the system
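Rejection sampling over the ID space can be sketched as follows. The `user_exists` callback is hypothetical: it stands in for whatever lookup the OSN offers and is not a real Facebook API. The `id_bits` parameter is added only so the sketch can be tried on a small ID space.

```python
import random


def uni_sample(user_exists, n_samples, rng, id_bits=32):
    """Draw IDs uniformly from [1, 2**id_bits] and keep only existing users."""
    samples = []
    while len(samples) < n_samples:
        candidate = rng.randint(1, 2 ** id_bits)
        if user_exists(candidate):  # hypothetical existence check
            samples.append(candidate)
    return samples


# Usage on a tiny, made-up 8-bit ID space with three existing users.
EXISTING = {3, 10, 200}
demo = uni_sample(EXISTING.__contains__, 30, random.Random(1), id_bits=8)
```

Every existing ID is accepted with the same probability regardless of where it sits in the ID space, which is why the allocation policy does not matter. The cost is the number of rejected draws, which grows as the ID space gets sparser.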

Page 13: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Convergence Detection

How many samples (iterations) are needed to lose the dependence on the starting point?

Page 14: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Convergence Evaluation

Using multiple parallel walks
• to improve convergence
• to avoid getting trapped in a certain region
• starting from 28 different randomly chosen initial nodes

Detecting convergence with online diagnostics
• sample longer and discard a number of initial "burn-in" iterations
• cost: consumed bandwidth (TB) and measurement time (days)
• crucial to decide on an appropriate "burn-in" and total running time
• Geweke Diagnostic
• Gelman-Rubin Diagnostic

Page 15: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Geweke Diagnostic

• Detects the convergence of a single Markov chain
• Compares an early segment $X_a$ of the chain with a late segment $X_b$:

$$z = \frac{E(X_a) - E(X_b)}{\sqrt{Var(X_a) + Var(X_b)}}$$

• With an increasing number of iterations, $X_a$ and $X_b$ move further apart, which limits the correlation between them.
• According to the law of large numbers, the z values become normally distributed, $z \sim N(0, 1)$
• Declare convergence when most values fall in the [-1, 1] interval
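A minimal sketch of the z-score as written on this slide. The segment sizes are an assumption: the slide does not state them, and the customary choice of the first 10% and last 50% of the chain is used here. The function name `geweke_z` is also made up for the example.

```python
from math import sqrt
from statistics import mean, variance


def geweke_z(chain, first=0.1, last=0.5):
    """z-score comparing an early segment X_a with a late segment X_b.

    Follows the slide's formula literally; `first` and `last` are assumed
    segment fractions, not values given in the slides.
    """
    n = len(chain)
    x_a = chain[: int(first * n)]
    x_b = chain[n - int(last * n):]
    return (mean(x_a) - mean(x_b)) / sqrt(variance(x_a) + variance(x_b))
```

A chain whose early and late segments have the same mean gives z = 0, while a chain that is still drifting (for example `list(range(100))`) gives a z-score far outside [-1, 1].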

Page 16: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Gelman-Rubin Diagnostic

• Detects convergence for m > 1 walks (m: number of chains)
• Compares the empirical distributions of individual chains with the empirical distribution of all sequences together:

$$R = \frac{n-1}{n} + \frac{m+1}{mn}\cdot\frac{B}{W}$$

where $n$ is the number of iterations per walk, $B$ is the between-walks variance and $W$ is the within-walks variance.

• If they are similar enough (R < 1.02), declare convergence

[Figure: three parallel walks (Walk 1, Walk 2, Walk 3)]
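The R score can be sketched as below, assuming the usual definitions of the two variances, which the slide does not spell out: B is n times the sample variance of the per-walk means, and W is the average of the per-walk sample variances. The function name `gelman_rubin_r` is an assumption.

```python
from statistics import mean, variance


def gelman_rubin_r(chains):
    """R = (n-1)/n + (m+1)/(m*n) * B/W for m equally long chains."""
    m = len(chains)
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    b = n * variance(chain_means)             # between-walks variance B
    w = mean([variance(c) for c in chains])   # within-walks variance W
    return (n - 1) / n + (m + 1) / (m * n) * b / w
```

Walks that explore the same region give a small B and therefore a small R. Walks stuck in different regions, such as `[[0, 1, 0, 1], [10, 11, 10, 11]]`, give an R far above the 1.02 threshold.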

Page 17: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Data Collection

Information collected for each sampled user u
• UserID, Name, Networks, Privacy settings
• Friend list: UserID, Name, Networks, Privacy settings of each friend
• Networks: Regional, School/Workplace
• Privacy settings (one bit each, e.g. 1111): Profile Photo, Add as Friend, View Friends, Send Message

Page 18: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Data Collection

Summary of data set
• 28 x 81K = 2.26M
• 28 initial starting nodes
• crawl until exactly 81K samples are collected

• a walk may repeat the same node
• # of rejected nodes without repetition: 645K

• 18.53M nodes picked uniformly from [1, $2^{32}$]
• only 1216K users existed
• 228K users had zero friends

• RW: 97% of nodes are unique
• BFS: 97% of nodes are unique
• confirms that the random seeding chose different areas of FB

Page 19: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Outline

• Motivation and Problem Statement
• Sampling Methodology
• Evaluation of Sampling Techniques
  - Convergence Analysis
  - Methods Comparison
  - Unbiased Estimation
• Facebook Data Analysis
• Conclusion

Page 20: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Convergence Analysis

What is a fair way to compare the results of MHRW with RW and BFS?
• MHRW visits fewer unique nodes than RW and BFS
• MHRW stays at some nodes for a relatively long time (many iterations)
• This usually happens at a low-degree node
• An appropriate practical comparison should be based on the number of visited unique nodes

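$$P^{MH}_{v,w} = \begin{cases} \dfrac{1}{k_v}\min\left(1, \dfrac{k_v}{k_w}\right) & \text{if } w \text{ is a neighbor of } v \\ 1 - \sum_{y \neq v} P^{MH}_{v,y} & \text{if } w = v \end{cases}$$

Example (low-degree node C with neighbor A):

$$p^{MH}_{C,A} = \frac{1}{1}\min\left(1, \frac{1}{3}\right) = \frac{1}{3}, \qquad 1 - \frac{1}{3} = \frac{2}{3}$$

so the walk stays at C with probability 2/3.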

Page 21: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Node Degree

Convergence Test

When does it reach equilibrium?

Burn-in determined to be 3K -> discard 6K

• Geweke: converge when all 28 values fall in the [-1, 1] interval
• 500 iterations
• Gelman-Rubin: converge when all R scores drop below 1.02
• (0,1): not in / in
• 3000 iterations

Page 22: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Methods Comparison

• MHRW and RWRW produce good estimates of the probability of a node degree
• The degree distribution converges fast to that of a good uniform sample
• Poor performance for BFS and RW

[Figure: results over 28 crawls]

Page 23: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Unbiased Estimation (BFS, RW)
• Node degree distribution
• BFS and RW introduce a strong bias towards the high-degree nodes
• The low-degree nodes are under-represented

Page 24: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Unbiased Estimation (MHRW)
• Degree distribution identical to UNI (MHRW, RWRW)

Page 25: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Outline

• Motivation and Problem Statement
• Sampling Methodology
• Evaluation of Sampling Techniques
• Facebook Data Analysis
• Conclusion

Page 26: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

FB Social Graph – degree distribution

• Degree distribution is not a power law

[Figure: degree distribution plot, annotated with a1 = 1.32 and a2 = 3.38]

Page 27: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

FB Social Graph - Assortativity

• Assortativity: do nodes tend to connect to similar or different nodes?
• Positive correlation: high-degree nodes tend to connect to other high-degree nodes
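One common way to quantify this is the Pearson correlation between the degrees at the two ends of each edge. The sketch below assumes that definition and computes it on a fully known toy graph rather than on a crawl sample. The function name `degree_assortativity` and the `STAR` graph are made up for the example.

```python
from math import sqrt


def degree_assortativity(graph):
    """Pearson correlation of endpoint degrees over all edges.

    Each undirected edge is counted in both directions. The result is
    undefined (division by zero) when all nodes have the same degree.
    """
    xs, ys = [], []
    for v, nbrs in graph.items():
        for w in nbrs:
            xs.append(len(graph[v]))
            ys.append(len(graph[w]))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical star graph: the high-degree hub connects only to degree-1 leaves.
STAR = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
```

The star graph is the extreme disassortative case and gives a coefficient of -1. A positive value, as reported on this slide for Facebook, means high-degree nodes tend to link to other high-degree nodes.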

Page 28: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

FB Social Graph – Privacy Awareness

Page 29: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Outline

• Motivation and Problem Statement
• Sampling Methodology
• Evaluation of Sampling Techniques
• Facebook Data Analysis
• Conclusion

Page 30: WALKING IN FACEBOOK: A CASE STUDY OF UNBIASED SAMPLING OF OSNS 2010.6.9 junction.

Conclusion

• Compared graph crawling methods
  - MHRW and RWRW performed remarkably well
  - BFS and RW lead to substantial bias
• Practical recommendations
  - correct for bias
  - use online convergence diagnostics
  - make proper use of multiple chains