Multilevel network data facilitate statistical inference ...
Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R...
Transcript of Statistical Network Analysis and Data Science · Statistical Analysis of Network Data with R...
Statistical Network Analysis and Data Science:Things we do well, and things we’re still working on
Eric D. Kolaczyk
Dept of Mathematics and Statistics, Boston University
BU Data Science Day, Jan 2016
Introduction
Networks and Data Science
Network-based perspectives are now ubiquitous in data science. GoogleScholar reports ∼ 436, 000 articles with ‘network’ in the title publishedsince 1999.
Contributions have been from across the data sciences:
Computer Science, Mathematics, StatisticsSignal processing, Statistical Physics, Information SciencesBioinformatics, Economics, Neuroscience, SociologyDigital Humanities
1
USE R !
ISBN 978-1-4939-0982-7
Use R ! Use R !
Eric KolaczykGábor Csárdi
Statistical Analysis of Network Data with R
Statistical Analysis of Netw
ork Data w
ith RKolaczyk · Csárdi
Eric Kolaczyk · Gábor Csárdi
Statistical Analysis of Network Data with R
Th is book is the fi rst of its kind in network research. It can be used as a stand-alone resource in which multiple R packages are used to illustrate how to use the base code for many tasks. igraph is the central package and has created a standard for developing and manipulating network graphs in R. Measurement and analysis are integral components of network research. As a result, there is a critical need for all sorts of statistics for network analysis, ranging from applications to methodology and theory. Networks have permeated everyday life through everyday realities like the Internet, social networks, and viral marketing, and as such, network analysis is an important growth area in the quantitative sciences. Th eir roots are in social network analysis going back to the 1930s and graph theory going back centuries. Th is text also builds on Eric Kolaczyk’s book Statistical Analysis of Network Data (Springer, 2009).
Eric Kolaczyk is a professor of statistics, and Director of the Program in Statistics, in the Department of Mathematics and Statistics at Boston University, where he also is an affi liated faculty member in the Bioinformatics Program, the Division of Systems Engineering, and the Program in Computational Neuroscience. His publications on network-based topics, beyond the development of statistical methodology and theory, include work on applications ranging from the detection of anomalous traffi c patterns in computer networks to the prediction of biological function in networks of interacting proteins to the characterization of infl uence of groups of actors in social networks. He is an elected fellow of the American Statistical Association (ASA) and an elected senior member of the Institute of Electrical and Electronics Engineers (IEEE).
Gábor Csárdi is a research associate at the Department of Statistics at Harvard University, Cambridge, Mass. He holds a PhD in Computer Science from Eötvös University, Hungary. His research includes applications of network analysis in biology and social sciences, bioinformatics and computational biology, and graph algorithms. He created the igraph soft ware package in 2005 and has been one of the lead developers since then.
Statistics
9 781493 909827
BU Data Science Day, Jan 2016
Introduction
Networks Analysis
Pdp
dCLK Cyc
Tim
VriPer
Network-based analysis traditionally a relatively small ‘field’ of study
Epidemic-like spread of interest in networks since mid/late-1990s
Arguably due to various factors, such as
Increasingly systems-level perspective in science,away from reductionism;Flood of high-throughput data;Globalization, the Internet, etc.
BU Data Science Day, Jan 2016
Introduction
Where is Statistics Amid All of this Work?
Everywhere!
Statistical aspects of network analysis include problems involving
sampling and design
description and visualization
modeling and inference
prediction
for data both of and on networks.
Much of this work occurs in domain-specific areas;a nontrivial amount of it is general.
This is an actively-evolving field . . . much of network analysis we can dowell, while for a good deal of the rest we still have a ways to go!
BU Data Science Day, Jan 2016
Introduction
Statistical Challenges Working with Network Data
Broadly speaking, the primary statistical challenge(s) in most networkproblems comes from nontrivial interplay between
relational/dependent nature of the data;
lack of (traditional) geometry; and
‘big data’.
Question: Despite a fairly well-developed ability to deal successfully withmany classes of problems in many contexts and domain areas, how well dowe truly understand the implications of network data (e.g., as opposed toIID, temporal, or spatial data) on statistics at a foundational level?
My Answer: Somewhat . . . but there’s still a long way to go!
BU Data Science Day, Jan 2016
Three Vignettes
Illustrations
Through the use of three vignettes of current/ongoing work from myresearch group, I will illustrate
what I mean by ‘foundations’;
the novelty lent to foundational topics when intersected with complexnetworks; and
some of the resulting challenges and open problems.
Vignette Topics
1 Adjusting for Bias in Network Sampling
2 Propagation of Uncertainty To Network Summary Statistics
3 Inference for Large Samples of Network Data Objects
BU Data Science Day, Jan 2016
Three Vignettes
Estimation of Degree Distribution Under Network Sampling
0 5 10−15
−10
−5
log2(degree)
log2
(fre
q)
Friendster cmty 1−5
0 5 10−15
−10
−5
log2(degree)
log2
(fre
q)
Friendster cmty 6−15
0 5 10−15
−10
−5
log2(degree)
log2
(fre
q)
Friendster cmty 16−30
0 5 10−15
−10
−5
log2(degree)
log2
(fre
q)
Orkut cmty 1−5
0 5 10−15
−10
−5
log2(degree)
log2
(fre
q)
Orkut cmty 6−15
0 5 10−15
−10
−5
log2(degree)
log2
(fre
q)
Orkut cmty 16−30
0 5 10−15
−10
−5
log2(degree)
log2
(fre
q)
LiveJournal cmty 1−5
0 5 10−15
−10
−5
log2(degree)
log2
(fre
q)
LiveJournal cmty 6−15
0 5 10−15
−10
−5
log2(degree)lo
g2(f
req)
LiveJournal cmty 16−30
Figure: Estimating degree distributions of communities from Friendster, Orkut and Livejournal. Blue dots represent the truedegree distributions, black dots represent the sample degree distributions, red dots represent the estimated degree distributions.
Sampling rate=30%. Dots which correspond to a density < 10−4 are eliminated from the plot.
BU Data Science Day, Jan 2016
Three Vignettes
Propagation of Uncertaintyilv
Y_b
3773
_at
hydG
_b40
04_a
tga
tR_1
_b20
87_a
tcy
nR_b
0338
_at
leuO
_b00
76_a
tm
etR
_b38
28_a
tar
sR_b
3501
_at
yiaJ
_b35
74_a
ttd
cR_b
3119
_at
soxR
_b40
63_a
trh
aS_b
3905
_at
rhaR
_b39
06_a
tal
pA_b
2624
_at
rtcR
_b34
22_a
ttd
cA_b
3118
_at
rbsR
_b37
53_a
txy
lR_b
3569
_at
emrR
_b26
84_a
tcp
xR_b
3912
_at
yhhG
_b34
81_a
tm
elR
_b41
18_a
thy
cA_b
2725
_at
rpiR
_b40
89_a
tgl
nG_b
3868
_at
hyfR
_b24
91_a
tfh
lA_b
2731
_at
iciA
_b29
16_a
tyg
iX_b
3025
_at
yhiX
_b35
16_a
tyh
iW_b
3515
_at
adiY
_b41
16_a
tyh
iE_b
3512
_at
srlR
_b27
07_a
tgu
tM_b
2706
_at
gntR
_b34
38_a
tm
alT
_b34
18_a
tcy
tR_b
3934
_at
crp_
b335
7_at
yjdG
_b41
24_a
tyj
bK_b
4046
_at
yijC
_b39
63_a
tly
sR_b
2839
_at
glnL
_b38
69_a
tux
uR_b
4324
_at
ygaE
_b26
64_a
tyj
fQ_b
4191
_at
cadC
_b41
33_a
tlld
R_b
3604
_at
sohA
_b31
29_a
tid
nR_b
4264
_at
treR
_b42
41_a
tfr
uR_b
0080
_at
lacI
_b03
45_a
ten
vY_b
0566
_at
fucR
_b28
05_a
tom
pR_b
3405
_at
fecI
_b42
93_a
tfis
_b32
61_a
thu
pB_b
0440
_at
hupA
_b40
00_a
tcs
pE_b
0623
_at
nadR
_b43
90_a
trp
oN_b
3202
_at
gcvR
_b24
79_a
tds
dC_b
2364
_at
atoC
_b22
20_a
tap
pY_b
0564
_at
rcsB
_b22
17_a
tna
gC_b
0676
_at
him
A_b
1712
_at
lrp_
b088
9_at
fnr_
b133
4_at
putA
_b10
14_a
tui
dA_b
1617
_at
b139
9_at
baeR
_b20
79_a
tga
tR_2
_b20
90_f
_at
fadR
_b11
87_a
tna
rL_b
1221
_at
pspF
_b13
03_a
tph
oP_b
1130
_at
fliA
_b19
22_a
tdn
aA_b
3702
_at
birA
_b39
73_a
tle
xA_b
4043
_at
yifA
_b37
62_a
tas
nC_b
3743
_at
caiF
_b00
34_a
tna
c_b1
988_
athn
s_b1
237_
atcs
pA_b
3556
_at
uhpA
_b36
69_a
tar
cA_b
4401
_at
nhaR
_b00
20_a
tso
xS_b
4062
_at
rpoH
_b34
61_a
tar
gR_b
3237
_at
fur_
b068
3_at
evgA
_b23
69_a
tro
b_b4
396_
atrp
oD_b
3067
_at
hcaR
_b25
37_a
trp
oE_b
2573
_at
ebgR
_b30
75_a
tar
aC_b
0064
_at
exuR
_b30
94_a
tga
lR_b
2837
_at
mtlR
_b36
01_a
tgl
pR_b
3423
_at
sdiA
_b19
16_a
tflh
D_b
1892
_at
flhC
_b18
91_a
tcs
gD_b
1040
_at
lrhA
_b22
89_a
txa
pR_b
2405
_at
gcvA
_b28
08_a
tyg
aA_b
2709
_at
rcsA
_b19
51_a
tic
lR_b
4018
_at
glcC
_b29
80_a
tbe
tI_b0
313_
atac
rR_b
0464
_at
yhdM
_b32
92_a
thi
pA_b
1507
_at
hipB
_b15
08_a
trp
oS_b
2741
_at
mod
E_b
0761
_at
tyrR
_b13
23_a
tm
alI_
b162
0_at
slyA
_b16
42_a
tcy
sB_b
1275
_at
ada_
b221
3_at
trpR
_b43
93_a
tde
oR_b
0840
_at
torR
_b09
95_a
tb2
531_
atm
arR
_b15
30_a
tm
arA
_b15
31_a
tga
lS_b
2151
_at
narP
_b21
93_a
tm
lc_b
1594
_at
purR
_b16
58_a
tpd
hR_b
0113
_at
him
D_b
0912
_at
farR
_b07
30_a
tyl
cA_b
0571
_at
oxyR
_b39
61_a
tm
etJ_
b393
8_at
cbl_
b198
7_at
phoB
_b03
99_a
tb1
499_
atm
hpR
_b03
46_a
tkd
pE_b
0694
_at
yebF__U_N0075zipA__U_N0075b2618_U_N0075bcp___U_N0075cpxR__U_N0075crcB__U_N0075crp___U_N0075cspF__U_N0075dam___U_N0075dnaA__U_N0075dnaN__U_N0075dnaT__U_N0075era___U_N0075fis___U_N0075fklB__U_N0075folA__U_N0075galF__U_N0075gcvR__U_N0075gyrA__U_N0075gyrI__U_N0075hlpA__U_N0075holD__U_N0075hscA__U_N0075ldrA__U_N0075mcrB__U_N0075mcrC__U_N0075menB__U_N0075menC__U_N0075minD__U_N0075minE__U_N0075murI__U_N0075yoeB__U_N0075nrdA__U_N0075nupC__U_N0075pyrC__U_N0075rimI__U_N0075rstB__U_N0075ruvC__U_N0075sbcB__U_N0075uspA__U_N0075
=⇒
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
YBR083W
YHR084W
YDR480W
YGR088W
YHR030C
YLR342W
YCL027W
YDR461WYFL026W
YBL016W
YJL157C
YMR043W
YGR032W
YER111C
YJL128C
YLR113W
YKL062W
YML004C
YNL145W
YKR095W
YCR089W
YBR040W
YMR198WYDR309C
YDR085C
YMR065W
YNL192W
YNR044W
YGL021WYDR146C YGL116W
YBR023C
YDR055W
YGR189C
YGR282C
YIL076W
YJL158C
YJL159W
YKL096WYKL163W
YKL164C
YMR238W
YCR012W
YNL160W
YBR072W
YDL204WYBR126C
YJL141C
YKL150W
YER054C
YOR173W
YGR248W
YJL161WYHR139C
YIR019C
YJR153W YHR084W
YMR043W
YLR113W
YMR037C
YKL062W
YPL089CYBR083W
=⇒ Density = 0.14(?)
Examples particularly prevalent in biology (e.g., gene regulatory networks,protein-protein interaction networks, and neural functional connectivitynetworks), but some noise likely present in most network applications.
BU Data Science Day, Jan 2016
Three Vignettes
‘Statistics 101’ for Collections of Networks
Current state of the art (aka ‘mass univariate’) uses edge-wise testing(presence/absence) and multiple correction.
(A) Mass-univariate analysis (Sex) (B) Mass-univariate analysis (Age)
Uncorrected Corrected Uncorrected Corrected
Comparison: Mass-Univariate vs Multivariate
Both methods detect differences in mean networks across gender andage, when using the full 1000 connectomes; but . . .
Only our multivariate method detects those differences at smallsample sizes (i.e., relevant to single labs).
BU Data Science Day, Jan 2016
Three Vignettes
Collaborators and Support
Supported in part by AFOSR, NIH, NSF, and ONR.
BU Data Science Day, Jan 2016