=1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of...

33
Network Data Sampling and Estimation Hui Yang and Yanan Jia September 25, 2014 Yang, Jia September 25, 2014 1 / 33

Transcript of =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of...

Page 1: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Network Data Sampling and Estimation

Hui Yang and Yanan Jia

September 25, 2014

Yang, Jia September 25, 2014 1 / 33

Page 2: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Outline

1 Introduction

2 Network Sampling DesignsInduced and Incident Subgraph SamplingStar and Snowball SamplingLink Tracing Sampling

3 Background on Statistical Sampling TheoryHorvitz-Thompson Estimation for TotalsEstimation of Group Size

4 Estimation of Totals in Network GraphsoverviewVertex TotalsTotals on Vertex PairsTotals of Higher Order

5 Estimation of Network Group SizeEstimation of Network Group Size

Yang, Jia September 25, 2014 2 / 33

Page 3: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Introduction

Introduction

Yang, Jia September 25, 2014 3 / 33

Page 4: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Introduction

Introduction

Population graph: network graph G = (V ,E )

Sampled graph: G ? = (V ?,E ?)

A characristic of network graph G : η(G )

Estimation of η(G ): η̂(G )

Example: Estimate the average degree of a network graph G by theaverage degree of a sampled graph G ?, η̂(G ) = η(G ?), is this aproper estimator? Depend on the sampling design!

Yang, Jia September 25, 2014 4 / 33

Page 5: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Network Sampling Designs

Network Sampling Designs

Yang, Jia September 25, 2014 5 / 33

Page 6: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Network Sampling Designs Induced and Incident Subgraph Sampling

Induced and Incident SubgraphSampling

Yang, Jia September 25, 2014 6 / 33

Page 7: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Network Sampling Designs Induced and Incident Subgraph Sampling

Induced and Incident Subgraph Sampling

Figure : Induced Subgraph Sampling,vertices→edges

Figure : Incident Subgraph Sampling,edges→vertices

Yang, Jia September 25, 2014 7 / 33

Page 8: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Network Sampling Designs Induced and Incident Subgraph Sampling

Induced and Incident Subgraph Sampling

Figure : Induced Subgraph Sampling,vertices→edges

Figure : Incident Subgraph Sampling,edges→vertices

Yang, Jia September 25, 2014 8 / 33

Page 9: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Network Sampling Designs Star and Snowball Sampling

Star and Snowball Sampling

Yang, Jia September 25, 2014 9 / 33

Page 10: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Network Sampling Designs Star and Snowball Sampling

Unlabeled Star Subgraph Sampling

Figure : Unlabeled Star Subgraph Sampling

Yang, Jia September 25, 2014 10 / 33

Page 11: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Network Sampling Designs Star and Snowball Sampling

Labeled Star (One-stage Snowball) Subgraph Sampling

Figure : Labeled Star Subgraph Sampling

Yang, Jia September 25, 2014 11 / 33

Page 12: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Network Sampling Designs Star and Snowball Sampling

Two-stage Snowball Subgraph Sampling

Figure : Two-stage Snowball Subgraph Sampling

Yang, Jia September 25, 2014 12 / 33

Page 13: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Network Sampling Designs Star and Snowball Sampling

Inclusion Probabilities of Star Subgraph Sampling

Yang, Jia September 25, 2014 13 / 33

Page 14: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Network Sampling Designs Link Tracing Sampling

Link Tracing Sampling

Yang, Jia September 25, 2014 14 / 33

Page 15: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Network Sampling Designs Link Tracing Sampling

Link Tracing Subgraph Sampling

Figure : Link Tracing Subgraph Sampling

Yang, Jia September 25, 2014 15 / 33

Page 16: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Background on Statistical Sampling Theory

Background on Statistical SamplingTheory

Yang, Jia September 25, 2014 16 / 33

Page 17: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Background on Statistical Sampling Theory Horvitz-Thompson Estimation for Totals

Horvitz-Thompson Estimation for Totals

Poputation U of size Nu

A value yi associated with each unit i ∈ U

Sample S of size n, each unit i ∈ U has probability πi of beingincluded in S

Population total: τ =∑

i∈U yi

Horvitz-Thompson estimation of τ : τ̂π =∑

i∈S yi/πi

τ̂π is an unbiased estimate of τ

Yang, Jia September 25, 2014 17 / 33

Page 18: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Background on Statistical Sampling Theory Estimation of Group Size

Estimation of Group Size

Group size Nu is typically needed to compute πi ’s, in many cases Nu

is unknown

Capture-recapture estimator: first sample S1 of size n1 is taken, andall of the units in S1 are marked. All of the units in S1 are thenreturned to the population. Next, a sample of size n2 is taken. Then

N̂(c/r)u = n1/(m/n2) = n1n2/m, where m is the number of marked

units observed in the second sample.

Yang, Jia September 25, 2014 18 / 33

Page 19: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Totals in Network Graphs

Estimation of Totals in NetworkGraphs

Yang, Jia September 25, 2014 19 / 33

Page 20: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Totals in Network Graphs overview

Estimation of Totals in Network Graphs

Population U = {1, . . . ,Nu}Unit Values yi for i ∈ U .

Total τ =∑

i yi and average µ = τ/Nu.

With appropriate choice of a population of units U and unit values y ,various graph summary characteristics η(G ) can be written in a form thatinvolves a total τ =

∑i yi .

Yang, Jia September 25, 2014 20 / 33

Page 21: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Totals in Network Graphs Vertex Totals

Vertex Totals

Let U = V and yi = di . The average degree of a graph G isobtained by scaling the total

∑i∈V di by Nv .

Let U = V and yi be a binary variable indicating that a vertex has agiven characteristic. τ counts the number of vertices with thatcharacteristic, and τ/Nv , the proportion.

Given a sample of vertices V ∗ ⊆ V , the Horvitz-Thompson estimator forvertex totals τ =

∑i∈V yi takes the form

τ̂π =∑i∈V ∗

yiπi

where the πi are the vertex inclusion probabilities corresponding to theunderlying network sampling design.

Yang, Jia September 25, 2014 21 / 33

Page 22: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Totals in Network Graphs Totals on Vertex Pairs

Totals on Vertex Pairs

Let U = V (2) and y(i ,j) = I(i ,j)∈E be the indicator of the event thatthere is an edge between i and j . The number of edges Ne is given bythe total

∑(i ,j)∈V (2) I(i ,j)∈E .

Let U = V (2) and y(i ,j) = Ik∈(i ,j) be the indicator of the event thatthe shortest path between i and j contains node k. In the case ofunique shortest paths, the betweenness centrality cB(k) of a vertexk ∈ V is given by the total

∑(i ,j)∈V (2) Ik∈(i ,j) .

Given a sample of vertices pair V ∗(2) ⊆ V (2), the Horvitz-Thompson

estimator for vertex totals τ =∑

(i ,j)∈V (2) yij takes the form

τ̂π =∑

(i ,j)∈V ∗(2)

yi ,jπi ,j

Yang, Jia September 25, 2014 22 / 33

Page 23: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Totals in Network Graphs Totals on Vertex Pairs

Totals on Vertex Pairs Example

A network of interactions among Nv = 5, 151 proteins in S. cerevisiaewith Ne = 31, 201.

Induced subgraph sampling, with Bernoulli sampling of vertices, usingp = 0.10, 0.20, 0.30.

Estimator of Ne is N̂e =∑

(i ,j)∈V ∗(2)yi,jπi,j

= N?e

p2

Histograms of N̂e based on 10, 000 trials.

Yang, Jia September 25, 2014 23 / 33

Page 24: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Totals in Network Graphs Totals of Higher Order

Totals of Higher Order

Let U = V (3) be the set of all triples of distinct vertices (i , j , k)

yijk = AijAjkAki (A is the adjacency matrix of G ), thenτ∆(G ) =

∑(i ,j ,k)∈V (3) y(i ,j ,k) is the number of triangles in the graph.

yijk = AijAjk(1− Aki ) + Aij(i − Ajk)Aki + (1− Aij)AjkAki , thenτ?3 (G ) =

∑(i ,j ,k)∈V (3) y(i ,j ,k) is the number of vertex triples that are

connected by exactly two edges.

Given a sample V ∗(3) ⊆ V (3), the Horvitz-Thompson estimator for

τ =∑

(i ,j ,k)∈V (3) y(i ,j ,k) takes the form

τ̂π =∑

(i ,j ,k)∈V ∗(3)

yi ,j ,kπi ,j ,k

Yang, Jia September 25, 2014 24 / 33

Page 25: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Totals in Network Graphs Totals of Higher Order

Totals of Higher Order

Clustering coefficient clT of a graph G :

clT (G ) =3τ4(G )

τ3(G )=

3τ4(G )

τ?3 (G ) + 3τ4(G )

where τ3(G ) is the number of connected triples, τ∆(G ) the number oftriangles in the graph, and τ?3 (G ) = τ3(G )− 3τ4(G ), is the number ofvertex triples that are connected by exactly two edges.

The value clT (G ), called the transitivity of the graph. clT (G ) is a functionof two different totals of the form

τ̂π =∑

(i ,j ,k)∈V (3)

yi ,j ,k

Yang, Jia September 25, 2014 25 / 33

Page 26: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Totals in Network Graphs Totals of Higher Order

Totals of Higher Order Example

Protein interactions Network has τ4(G ) = 44, 858 triangles,τ?3 (G ) = 1, 006, 575 triples connected by exactly two edges, and aclustering coefficient clT (G ) = 0.1179.

We simulated 10,000 trials of induced subgraph sampling, withBernoulli sampling of vertices, using p = 0.20.

Unbiased estimates of the two totals:

τ4(G ) = p−3τ4(G ∗)

τ?(G ) = p−3τ?(G ∗)

clT (G ) =3τ4(G )

τ?3 (G ) + 3τ4(G )

Yang, Jia September 25, 2014 26 / 33

Page 27: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Totals in Network Graphs Totals of Higher Order

Yang, Jia September 25, 2014 27 / 33

Page 28: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Network Group Size

Estimation of Network Group Size

Yang, Jia September 25, 2014 28 / 33

Page 29: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Network Group Size Estimation of Network Group Size

Estimation of Network Group Size

Simple random sampling without replacement or Bernoulli sampling

Doing the sampling twice, after ’marking’ the first sample, usecapture-recapture estimators,

N̂v =n2

mn1.

Yang, Jia September 25, 2014 29 / 33

Page 30: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Network Group Size Estimation of Network Group Size

Estimating the Size of a ’Hidden Population’

Snowball Sampling

G = (V ,E ) a directed graph.

G ? a subgraph of G , with vertices V ? = V ?0 ∪ V ?

1 obtained through aone-wave snowball sample, V ?

0 selected through Bernoulli samplingwith p0.

N the size of the initial sample, M1 the number of arcs amongindividuals in V ?

0 , M2 the number of arcs pointing from individuals inV ?

0 to individuals in V ?1 .

Estimator of Nv will be derived using the method-of-moments.

N̂v = nm1 + m2

m1

Yang, Jia September 25, 2014 30 / 33

Page 31: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Network Group Size Estimation of Network Group Size

Other Network Graph Estimation Problems

Estimation of Degree Frequency.

The estimation of the number of connected components in a graph.

The estimation of quantities not easily expressed as totals.

Sampling and estimation are also being used as a way of producingcomputationally efficient ’approximations’ to quantities that, if computedfor the full network graph, would be prohibitively expensive.

Yang, Jia September 25, 2014 31 / 33

Page 32: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Network Group Size Estimation of Network Group Size

Summary

Formalize the problem of sampling and estimation in network graphsDescribe a handful of common network sampling designsDevelop estimators of a number of quantities of interest.

Yang, Jia September 25, 2014 32 / 33

Page 33: =1=Network Data Sampling and Estimation...Sampling and estimation are also being used as a way of producing computationally e cient ’approximations’ to quantities that, if computed

Estimation of Network Group Size Estimation of Network Group Size

Thanks for your attention !

Yang, Jia September 25, 2014 33 / 33