Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R...

10
Statistical Network Analysis and Data Science: Things we do well, and things we’re still working on Eric D. Kolaczyk Dept of Mathematics and Statistics, Boston University [email protected] BU Data Science Day, Jan 2016

Transcript of Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R...

Page 1: Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R Statistical Analysis of Network Data with R Kolaczyk · Csárdi EricKolaczyk· Gábor

Statistical Network Analysis and Data Science:Things we do well, and things we’re still working on

Eric D. Kolaczyk

Dept of Mathematics and Statistics, Boston University

[email protected]

BU Data Science Day, Jan 2016

Page 2: Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R Statistical Analysis of Network Data with R Kolaczyk · Csárdi EricKolaczyk· Gábor

Introduction

Networks and Data Science

Network-based perspectives are now ubiquitous in data science. GoogleScholar reports ∼ 436, 000 articles with ‘network’ in the title publishedsince 1999.

Contributions have been from across the data sciences:

Computer Science, Mathematics, StatisticsSignal processing, Statistical Physics, Information SciencesBioinformatics, Economics, Neuroscience, SociologyDigital Humanities

1

USE R !

ISBN 978-1-4939-0982-7

Use R ! Use R !

Eric KolaczykGábor Csárdi

Statistical Analysis of Network Data with R

Statistical Analysis of Netw

ork Data w

ith RKolaczyk · Csárdi

Eric Kolaczyk · Gábor Csárdi

Statistical Analysis of Network Data with R

Th is book is the fi rst of its kind in network research. It can be used as a stand-alone resource in which multiple R packages are used to illustrate how to use the base code for many tasks. igraph is the central package and has created a standard for developing and manipulating network graphs in R. Measurement and analysis are integral components of network research. As a result, there is a critical need for all sorts of statistics for network analysis, ranging from applications to methodology and theory. Networks have permeated everyday life through everyday realities like the Internet, social networks, and viral marketing, and as such, network analysis is an important growth area in the quantitative sciences. Th eir roots are in social network analysis going back to the 1930s and graph theory going back centuries. Th is text also builds on Eric Kolaczyk’s book Statistical Analysis of Network Data (Springer, 2009).

Eric Kolaczyk is a professor of statistics, and Director of the Program in Statistics, in the Department of Mathematics and Statistics at Boston University, where he also is an affi liated faculty member in the Bioinformatics Program, the Division of Systems Engineering, and the Program in Computational Neuroscience. His publications on network-based topics, beyond the development of statistical methodology and theory, include work on applications ranging from the detection of anomalous traffi c patterns in computer networks to the prediction of biological function in networks of interacting proteins to the characterization of infl uence of groups of actors in social networks. He is an elected fellow of the American Statistical Association (ASA) and an elected senior member of the Institute of Electrical and Electronics Engineers (IEEE).

Gábor Csárdi is a research associate at the Department of Statistics at Harvard University, Cambridge, Mass. He holds a PhD in Computer Science from Eötvös University, Hungary. His research includes applications of network analysis in biology and social sciences, bioinformatics and computational biology, and graph algorithms. He created the igraph soft ware package in 2005 and has been one of the lead developers since then.

Statistics

9 781493 909827

BU Data Science Day, Jan 2016

Page 3: Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R Statistical Analysis of Network Data with R Kolaczyk · Csárdi EricKolaczyk· Gábor

Introduction

Networks Analysis

Pdp

dCLK Cyc

Tim

VriPer

Network-based analysis traditionally a relatively small ‘field’ of study

Epidemic-like spread of interest in networks since mid/late-1990s

Arguably due to various factors, such as

Increasingly systems-level perspective in science,away from reductionism;Flood of high-throughput data;Globalization, the Internet, etc.

BU Data Science Day, Jan 2016

Page 4: Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R Statistical Analysis of Network Data with R Kolaczyk · Csárdi EricKolaczyk· Gábor

Introduction

Where is Statistics Amid All of this Work?

Everywhere!

Statistical aspects of network analysis include problems involving

sampling and design

description and visualization

modeling and inference

prediction

for data both of and on networks.

Much of this work occurs in domain-specific areas;a nontrivial amount of it is general.

This is an actively-evolving field . . . much of network analysis we can dowell, while for a good deal of the rest we still have a ways to go!

BU Data Science Day, Jan 2016

Page 5: Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R Statistical Analysis of Network Data with R Kolaczyk · Csárdi EricKolaczyk· Gábor

Introduction

Statistical Challenges Working with Network Data

Broadly speaking, the primary statistical challenge(s) in most networkproblems comes from nontrivial interplay between

relational/dependent nature of the data;

lack of (traditional) geometry; and

‘big data’.

Question: Despite a fairly well-developed ability to deal successfully withmany classes of problems in many contexts and domain areas, how well dowe truly understand the implications of network data (e.g., as opposed toIID, temporal, or spatial data) on statistics at a foundational level?

My Answer: Somewhat . . . but there’s still a long way to go!

BU Data Science Day, Jan 2016

Page 6: Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R Statistical Analysis of Network Data with R Kolaczyk · Csárdi EricKolaczyk· Gábor

Three Vignettes

Illustrations

Through the use of three vignettes of current/ongoing work from myresearch group, I will illustrate

what I mean by ‘foundations’;

the novelty lent to foundational topics when intersected with complexnetworks; and

some of the resulting challenges and open problems.

Vignette Topics

1 Adjusting for Bias in Network Sampling

2 Propagation of Uncertainty To Network Summary Statistics

3 Inference for Large Samples of Network Data Objects

BU Data Science Day, Jan 2016

Page 7: Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R Statistical Analysis of Network Data with R Kolaczyk · Csárdi EricKolaczyk· Gábor

Three Vignettes

Estimation of Degree Distribution Under Network Sampling

0 5 10−15

−10

−5

log2(degree)

log2

(fre

q)

Friendster cmty 1−5

0 5 10−15

−10

−5

log2(degree)

log2

(fre

q)

Friendster cmty 6−15

0 5 10−15

−10

−5

log2(degree)

log2

(fre

q)

Friendster cmty 16−30

0 5 10−15

−10

−5

log2(degree)

log2

(fre

q)

Orkut cmty 1−5

0 5 10−15

−10

−5

log2(degree)

log2

(fre

q)

Orkut cmty 6−15

0 5 10−15

−10

−5

log2(degree)

log2

(fre

q)

Orkut cmty 16−30

0 5 10−15

−10

−5

log2(degree)

log2

(fre

q)

LiveJournal cmty 1−5

0 5 10−15

−10

−5

log2(degree)

log2

(fre

q)

LiveJournal cmty 6−15

0 5 10−15

−10

−5

log2(degree)lo

g2(f

req)

LiveJournal cmty 16−30

Figure: Estimating degree distributions of communities from Friendster, Orkut and Livejournal. Blue dots represent the truedegree distributions, black dots represent the sample degree distributions, red dots represent the estimated degree distributions.

Sampling rate=30%. Dots which correspond to a density < 10−4 are eliminated from the plot.

BU Data Science Day, Jan 2016

Page 8: Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R Statistical Analysis of Network Data with R Kolaczyk · Csárdi EricKolaczyk· Gábor

Three Vignettes

Propagation of Uncertaintyilv

Y_b

3773

_at

hydG

_b40

04_a

tga

tR_1

_b20

87_a

tcy

nR_b

0338

_at

leuO

_b00

76_a

tm

etR

_b38

28_a

tar

sR_b

3501

_at

yiaJ

_b35

74_a

ttd

cR_b

3119

_at

soxR

_b40

63_a

trh

aS_b

3905

_at

rhaR

_b39

06_a

tal

pA_b

2624

_at

rtcR

_b34

22_a

ttd

cA_b

3118

_at

rbsR

_b37

53_a

txy

lR_b

3569

_at

emrR

_b26

84_a

tcp

xR_b

3912

_at

yhhG

_b34

81_a

tm

elR

_b41

18_a

thy

cA_b

2725

_at

rpiR

_b40

89_a

tgl

nG_b

3868

_at

hyfR

_b24

91_a

tfh

lA_b

2731

_at

iciA

_b29

16_a

tyg

iX_b

3025

_at

yhiX

_b35

16_a

tyh

iW_b

3515

_at

adiY

_b41

16_a

tyh

iE_b

3512

_at

srlR

_b27

07_a

tgu

tM_b

2706

_at

gntR

_b34

38_a

tm

alT

_b34

18_a

tcy

tR_b

3934

_at

crp_

b335

7_at

yjdG

_b41

24_a

tyj

bK_b

4046

_at

yijC

_b39

63_a

tly

sR_b

2839

_at

glnL

_b38

69_a

tux

uR_b

4324

_at

ygaE

_b26

64_a

tyj

fQ_b

4191

_at

cadC

_b41

33_a

tlld

R_b

3604

_at

sohA

_b31

29_a

tid

nR_b

4264

_at

treR

_b42

41_a

tfr

uR_b

0080

_at

lacI

_b03

45_a

ten

vY_b

0566

_at

fucR

_b28

05_a

tom

pR_b

3405

_at

fecI

_b42

93_a

tfis

_b32

61_a

thu

pB_b

0440

_at

hupA

_b40

00_a

tcs

pE_b

0623

_at

nadR

_b43

90_a

trp

oN_b

3202

_at

gcvR

_b24

79_a

tds

dC_b

2364

_at

atoC

_b22

20_a

tap

pY_b

0564

_at

rcsB

_b22

17_a

tna

gC_b

0676

_at

him

A_b

1712

_at

lrp_

b088

9_at

fnr_

b133

4_at

putA

_b10

14_a

tui

dA_b

1617

_at

b139

9_at

baeR

_b20

79_a

tga

tR_2

_b20

90_f

_at

fadR

_b11

87_a

tna

rL_b

1221

_at

pspF

_b13

03_a

tph

oP_b

1130

_at

fliA

_b19

22_a

tdn

aA_b

3702

_at

birA

_b39

73_a

tle

xA_b

4043

_at

yifA

_b37

62_a

tas

nC_b

3743

_at

caiF

_b00

34_a

tna

c_b1

988_

athn

s_b1

237_

atcs

pA_b

3556

_at

uhpA

_b36

69_a

tar

cA_b

4401

_at

nhaR

_b00

20_a

tso

xS_b

4062

_at

rpoH

_b34

61_a

tar

gR_b

3237

_at

fur_

b068

3_at

evgA

_b23

69_a

tro

b_b4

396_

atrp

oD_b

3067

_at

hcaR

_b25

37_a

trp

oE_b

2573

_at

ebgR

_b30

75_a

tar

aC_b

0064

_at

exuR

_b30

94_a

tga

lR_b

2837

_at

mtlR

_b36

01_a

tgl

pR_b

3423

_at

sdiA

_b19

16_a

tflh

D_b

1892

_at

flhC

_b18

91_a

tcs

gD_b

1040

_at

lrhA

_b22

89_a

txa

pR_b

2405

_at

gcvA

_b28

08_a

tyg

aA_b

2709

_at

rcsA

_b19

51_a

tic

lR_b

4018

_at

glcC

_b29

80_a

tbe

tI_b0

313_

atac

rR_b

0464

_at

yhdM

_b32

92_a

thi

pA_b

1507

_at

hipB

_b15

08_a

trp

oS_b

2741

_at

mod

E_b

0761

_at

tyrR

_b13

23_a

tm

alI_

b162

0_at

slyA

_b16

42_a

tcy

sB_b

1275

_at

ada_

b221

3_at

trpR

_b43

93_a

tde

oR_b

0840

_at

torR

_b09

95_a

tb2

531_

atm

arR

_b15

30_a

tm

arA

_b15

31_a

tga

lS_b

2151

_at

narP

_b21

93_a

tm

lc_b

1594

_at

purR

_b16

58_a

tpd

hR_b

0113

_at

him

D_b

0912

_at

farR

_b07

30_a

tyl

cA_b

0571

_at

oxyR

_b39

61_a

tm

etJ_

b393

8_at

cbl_

b198

7_at

phoB

_b03

99_a

tb1

499_

atm

hpR

_b03

46_a

tkd

pE_b

0694

_at

yebF__U_N0075zipA__U_N0075b2618_U_N0075bcp___U_N0075cpxR__U_N0075crcB__U_N0075crp___U_N0075cspF__U_N0075dam___U_N0075dnaA__U_N0075dnaN__U_N0075dnaT__U_N0075era___U_N0075fis___U_N0075fklB__U_N0075folA__U_N0075galF__U_N0075gcvR__U_N0075gyrA__U_N0075gyrI__U_N0075hlpA__U_N0075holD__U_N0075hscA__U_N0075ldrA__U_N0075mcrB__U_N0075mcrC__U_N0075menB__U_N0075menC__U_N0075minD__U_N0075minE__U_N0075murI__U_N0075yoeB__U_N0075nrdA__U_N0075nupC__U_N0075pyrC__U_N0075rimI__U_N0075rstB__U_N0075ruvC__U_N0075sbcB__U_N0075uspA__U_N0075

=⇒

●●

●●

●● ●

●●

●●

YBR083W

YHR084W

YDR480W

YGR088W

YHR030C

YLR342W

YCL027W

YDR461WYFL026W

YBL016W

YJL157C

YMR043W

YGR032W

YER111C

YJL128C

YLR113W

YKL062W

YML004C

YNL145W

YKR095W

YCR089W

YBR040W

YMR198WYDR309C

YDR085C

YMR065W

YNL192W

YNR044W

YGL021WYDR146C YGL116W

YBR023C

YDR055W

YGR189C

YGR282C

YIL076W

YJL158C

YJL159W

YKL096WYKL163W

YKL164C

YMR238W

YCR012W

YNL160W

YBR072W

YDL204WYBR126C

YJL141C

YKL150W

YER054C

YOR173W

YGR248W

YJL161WYHR139C

YIR019C

YJR153W YHR084W

YMR043W

YLR113W

YMR037C

YKL062W

YPL089CYBR083W

=⇒ Density = 0.14(?)

Examples particularly prevalent in biology (e.g., gene regulatory networks,protein-protein interaction networks, and neural functional connectivitynetworks), but some noise likely present in most network applications.

BU Data Science Day, Jan 2016

Page 9: Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R Statistical Analysis of Network Data with R Kolaczyk · Csárdi EricKolaczyk· Gábor

Three Vignettes

‘Statistics 101’ for Collections of Networks

Current state of the art (aka ‘mass univariate’) uses edge-wise testing(presence/absence) and multiple correction.

(A) Mass-univariate analysis (Sex) (B) Mass-univariate analysis (Age)

Uncorrected Corrected Uncorrected Corrected

Comparison: Mass-Univariate vs Multivariate

Both methods detect differences in mean networks across gender andage, when using the full 1000 connectomes; but . . .

Only our multivariate method detects those differences at smallsample sizes (i.e., relevant to single labs).

BU Data Science Day, Jan 2016

Page 10: Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R Statistical Analysis of Network Data with R Kolaczyk · Csárdi EricKolaczyk· Gábor

Three Vignettes

Collaborators and Support

Supported in part by AFOSR, NIH, NSF, and ONR.

BU Data Science Day, Jan 2016