Advanced Statistical Modeling In Real Estate Appraisal John A. Kilpatrick, Ph.D., MAI
DAY 2: STATISTICAL METHODS FOR NETWORK ANALYSISstatnet.org/nme/d2-s2.pdf · STATISTICAL METHODS FOR...
Transcript of DAY 2: STATISTICAL METHODS FOR NETWORK ANALYSISstatnet.org/nme/d2-s2.pdf · STATISTICAL METHODS FOR...
DAY 2:
STATISTICAL METHODS FOR NETWORK ANALYSIS
Martina Morris, Ph.D.Steven M. Goodreau, Ph.D.Samuel M. Jenness, Ph.D.
Supported by the US National Institutes of Health
NME Workshop 1
Today we will cover
NME Workshop 2
Three classes of statistical models for networks
Simple null models
Generative models for static networks
Generative models for dynamic networks
Ending with an example How we can use these models to answer questions about
epidemic dynamics and interventions
morning
afternoon
Basic null hypothesis statistical tests Does your network differ from a simple random graph?
Simple null models to test against: CUG and BRG
ERGMs : generative models to test for multiple structural properties simultaneously in static networks Components of an ERG model
Interpretation of coefficients
Estimation algorithms (and when they fail)
Model Diagnostics (estimation, and goodness of fit)
Simulation from fitted models
Network data Types
Principles for estimation
Outline for the morning session
NME Workshop 3
This is what we’ll use ERGMs for in Epi modeling
Note
NME Workshop 4
We’ll cover a lot of ground here some of the material and vocabulary may be unfamiliar
Don’t worry if you don’t understand everything Focus on getting the big picture, not the details
EpiModel puts a lot of this behind the curtain So you don’t have to deal with it, for the most part
The details do matter when you have a problematic model though
And – don’t be afraid to ask questions
Getting started
NME Workshop 5
Recall, the two ways to access statnetWeb https://statnet.shinyapps.io/statnetWeb/ library(statnetWeb); run_sw();
Open statnetWeb and load the faux.mesa.high network
How do you know if your network is significantly different than a simple random graph?
Statistical Testing: Basics6
NME Workshop
Description vs. Inference in statistics
NME Workshop 7
So far we have been using descriptive statistics to explore our network data Density
Degree and geodesic distributions
Mixing matrices
Component size distributions
Next, we might want to compare these statistics to what we would expect by chance What do we mean “by chance”?
Is there a natural “null hypothesis test” in this context?
Recap
NME Workshop 8
Does the structure of our social network differ from a simple random graph?
What are some structural differences you can see?
faux.mesa.high network Simple random graph with the same tie probability
Consider triangles
Suppose kids have a tendency to become friends with their friends’ friends
And this is the only generative process occurring.
Presumably, this would mean that you would observe more triangles than expected by chance in the graph.
How would you test this for a specific network?
NME Workshop 9
A basic statistical test for triangles
Begin by counting the # triangles in your network
Say this is “T”, your test statistic
Then determine the probability of observing T or more triangles in this network …
And see if it is less than 5%
NME Workshop 10
But … how do you determine that probability?
For that you need a null hypothesis of some sort
What is the natural null hypothesis?
NME Workshop 11
It turns out there’s more than one …
But they all get used the same way when constructing a statistical test.
To create a sampling distribution consistent with the null
And compare your observed value to that distribution
Null probability distribution (1)
Unconditional: For a network this size (size = # nodes)
enumerate all possible networks for a fixed number of nodes,
count the number of triangles in each network, and
construct the frequency distribution of these counts.
Where does the number of triangles in your network lie in this distribution?
Top 5%?
Bottom 5%
Near the middle?
NME Workshop 12
Null probability distribution (1)
For example: Take a network with 3 nodes
How many dyads are there?
32=
3!
2! 3−2 != 3
How many different networks on these dyads? Every dyad has 2 possible values, and there are 3 dyads
So the number of possible networks is: 23 = 8
What is the distribution of triangle counts? 7 networks have 0 triangles
1 network has 1 triangle
So if your network has 1 triangle, what do you think?
NME Workshop 13
0
2
4
6
8
0 1
Triangle distribution for 3 node networks
Null probability distribution (1)
One problem with the unconditional null distribution
enumerate all possible networks for a fixed number of nodes
this is not so easy in practice
for 4 nodes: # of dyads is 4*3/2 = 6
# of possible networks = 26 = 64
for 10 nodes: # of dyads is 10*9/2 = 45
# of possible networks = 245 ≈ 35 trillion
for 20 nodes: # of dyads is 20*19/2 = 190
# of possible networks = 2190 ≈ 1057
We can solve this problem by sampling from the space of networks.
NME Workshop 14
Null probability distribution (1)
More important question for the unconditional null distribution
Do you really care about comparing your network to networks with zero ties?
Or all possible ties?
Or does it make more sense to compare your network to other networks with the same number of ties?
Controlling for density, does your network have more or less triangles than expected?
NME Workshop 15
Null probability distribution (2)
Condition on the density, the number of nodes and links
This is the Conditional Uniform Graph test (CUG)
enumerate all possible networks for a fixed number of nodes and links,
count the number of triangles in each network,
construct the frequency distribution of the counts
compare the value in your network
This also reduces the sample space
but it’s still a lot of graphs… 𝑛2𝑒= 𝑛2! /𝑒! ( 𝑛
2− 𝑒)!
so we will still need to sample from this space in practice
NME Workshop 16
The CUG is implemented as a permutation test
NME Workshop 17
Since full enumeration is typically not possible
We sample the enumeration space by permutation
Randomly choose a tied dyad, and a dyad without a tie
Permute the tie and the non-tie This preserves the exact density in the network
Count the number of triangles in the new network
Repeat until you have the desired sample size
Permutation tests are often used in statistics When the distribution of a sample statistic is not known
Null probability distribution (3)
Condition on the probability of a tie
This is the Bernoulli Random Graph test (BRG)
Similar to the CUG, but treats density as a random variable
Implemented via Markov Chain Monte-Carlo (MCMC)
Randomly choose a dyad
Flip a coin with probability(tie) = density of the network This will not preserve the exact density for each network, but will preserve it on average
Repeat many times, then count the number of triangles in the final network
Repeat until you have a sample of the desired size
NME Workshop 18
Null models in statnetWeb
NME Workshop
Select a summary measure for the observed data
Compare it to the distribution simulated from a null model
In statnetWeb:
We can plot null distribution overlays on degree and geodesic distributions
And plot the CUG and BRG distributions for selected network summary measures
19
In statnetWeb: Degree distribution
Compare the degree distribution in faux.mesa.high to what we would expect by chance
20NME Workshop
Network Descriptives
Degree Distribution
Select CUG and BRG
null models
What do you see now?
Overlays the mean and 95% confidence intervals from 100 simulations
Test for the number of isolates
Compare the number of isolates in faux.mesa.high to what we would expect by chance
21NME Workshop
Network Descriptives
MoreNull model
tests
How likely is the number of isolates we observed in our network, under the null model?
CUG test for triangles
NME Workshop 22
Are there more triangles in the observed network?
Choose the triangle term from the dropdown menu and run 100 simulations to see how our network compares to the two null models
“CUG” and “BRG”
Indeed…
NME Workshop 23
Yes the observed triangle count is high
NME Workshop 24
But why?
… a simple null hypothesis test doesn’t provide any insight about that.
Limitations of simple null hypotheses
If we are only interested in whether the triangle counts are different than expected given the density of the graph
One can use these simple null hypothesis tests
But if we want to understand the underlying generative process, quantify the impact of each process on our network, and control for other network features …
This requires a more general statistical modeling framework
NME Workshop 25
Can you control for more than just density?
What if you want to test more than one network feature?
And you want a model grounded in generative social theory?
… That’s when you turn to ERGMs
Statistical Testing: Beyond Basics26
NME Workshop
Motivation
NME Workshop 27
Why are there so many more triangles?
What do you see when color-coding the nodes by their attributes?
faux.mesa.high network Simple random graph with the same tie probability
(At least) two theories about the process that generates triangles:
1. Homophily: People tend to chose friends who are like them, in terms of
grade, race, etc. (“birds of a feather”), triad closure is a by-product
2. Transitivity: People who have friends in common tend to become friends
(“friend of a friend”), triad closure is the key process
So, for three actors in
the same grade
A cycle-closing tie may
form due to transitivityBut it may be due
instead to homophily
Friend of a friend, or birds of a feather?
NME Workshop 28
But not completely. Any tie may be classified by whether it is:
The cells represent how the processes jointly influence that tie, so the distribution of ties in this table is informative.
This suggests we should be able to disentangle the two processes statistically
Transitivity and homophily are confounded
NME Workshop 29
Within Grade:
Triangle forming:
Yes No
Yes Both Homophily
No Transitivity Neither
partially
ERGMs: Basic idea
NME Workshop 30
We want to model the probability of a tie as a function of:
Nodal attributes (that influence degree and mixing)
The propensity for certain “configurations” (like triangles)
The dyads may be dependent
Nodal attribute effects do not induce dyad dependence
But triad closure does
So we model the joint distribution directly
where: g(y) = vector of network statistics
= vector of model parameters
k( ) = numerator summed over all possible networks on node set y
Probability of observing a graph (set of relationships) y on a fixed set of nodes:
• Exponential family model• Well understood statistical properties (≠ well understood models)• Very general and flexible
Exponential Random Graph Model (ERGM)
NME Workshop 31
𝑃 𝑌 = 𝑦 ) =exp(𝜽′𝒈 𝒚 )
𝑘( )
Probability of observing a graph (set of relationships) y on a fixed set of nodes:
If you’re not familiar with this kind of compact vector notation, the numerator is just:
exp(𝜽1𝑥1 + 𝜽2𝑥2 +⋯+ 𝜽𝑝𝑥𝑝)
Kind of like a linear model, but a bit different (watch out for this later)
Exponential Random Graph Model (ERGM)
NME Workshop 32
𝑃 𝑌 = 𝑦 ) =exp(𝜽′𝒈 𝒚 )
𝑘( )
where 𝝏 𝒈 𝒚 represents the change in 𝒈 𝒚 when Yij is toggled between 0 and 1
The conditional odds of a tie
NME Workshop 33
can be re-expressed as𝑃 𝑌 = 𝑦 ) =exp(𝜽′𝒈 𝒚 )
𝑘( )
𝑙𝑜𝑔𝑖𝑡 𝑃 𝑌𝑖𝑗 = 1 rest of the graph ) = 𝑙𝑜𝑔𝑃 𝑌𝑖𝑗 = 1 rest of the graph )
𝑃 𝑌𝑖𝑗 = 0 rest of the graph )
= 𝜽′𝝏 𝒈 𝒚
The probability of the graph
The conditional log odds of a specific tie
This is an auto logistic regression (auto because of the possible dependence)
ERGM specification: 𝜽′𝒈 𝒚
NME Workshop 34
The 𝒈 𝒚 terms in the model are summary “network statistics”
Counts of network configurations, for example:1. Edges: 𝑦𝑖𝑗
2. Within-group ties: 𝑦𝑖𝑗𝐼(𝑖 ∈ 𝐶, 𝑗 ∈ 𝐶)
3. 2-stars: 𝑦𝑖𝑗𝑦𝑖𝑘
4. 3-cycles: 𝑦𝑖𝑗𝑦𝑖𝑘𝑦𝑗𝑘
A key distinction in the types of terms: Dyad independent (1 & 2 are examples)
Dyad dependent (3 & 4 are examples)
ERGM specification: 𝜽′𝒈 𝒚
NME Workshop 35
Model specification involves:
1. Choosing the set of network statistics 𝒈 𝒚
From minimal : # of edges
To saturated: one term for every dyad in the network
NB: statnetWeb allows you to choose from the list of terms and retrieve documentation for each one
2. Choosing “homogeneity constraints” on the parameter , for example, with edges:
all homogeneous
group specific (e.g., sex or age specific )
dyad specific
Let’s explore the Florentine marriage network
… to StatnetWeb36
NME Workshop
Flomarriage: Bernoulli Model
Load the flomarriage network
Network of marriage ties between families in Renaissance Florence
On the Fit Model page, look up the documentation on the edges term
37NME Workshop
Flomarriage: Bernoulli Model
Add edges to the ergm formula
Fit the model
What does this model imply? Homogeneous edge probability
Every tie is equally likely
Not a very interesting model
38NME Workshop
Step 1
Step 2
Interpreting the coefficients
The log-odds of any tie existing is:
Corresponding probability:
39NME Workshop
= −1.609 × change in # ties
= −1.609 × 1
=exp −1.609
1 + exp −1.609
= 0.1667
You can confirm that this is the density of the network
𝜃 = 𝑙𝑛𝑝
(1 − 𝑝)
𝑝 =𝑒𝜃
1 + 𝑒𝜃
Flomarriage: Triad Formation
NME Workshop 40
The “triangle” term is a measure of clustering
Read the documentation for the triangle term
Fit the model edges + triangle
Hint: you can just add the triangle term if edges is already in your formula
Then click Fit Model
Triangle is a dyad dependent term, so the estimation algorithm changes to MCMC (more on this later)
Flomarriage: Triad Formation
NME Workshop 41
Now how to interpret the coefficients?
Conditional log-odds of two actors having a tie:
(−1.68 × change in the # of ties) + (0.16 × change in # of triangles)
always=1 how many triangles can one tie change?
For a tie that will create zero triangles −1.68 + 0 = −1.68
One triangle −1.68 + (0.16 × 1) = −1.52
Two triangles −1.68 + (0.16 × 2) = −1.36
Note, not significant
Still unlikely, but a bit less so
Flomarriage: Nodal covariates
NME Workshop 42
What do you notice?
We can test whether edge probabilities are a function of wealth
This is a quantitative nodal attribute, so we use the ergm term “nodecov”
flomarriage sized by wealth
Flomarriage: Nodal covariates
NME Workshop 43
Reset the ergm formula and fit the following model:
There is a significant positive wealth effect on the odds of a tie
What does the positive coefficient mean?
Not that there is homophily by wealth
Just that wealthy nodes have more ties
Note that the wealth effect operates on both nodes in a dyad.
Flomarriage: Nodal covariates
NME Workshop 44
The conditional log-odds of a tie between two actors is:
For a tie between two nodes with minimum wealth (3)
For a tie between two nodes with maximum wealth (146)
For a tie between nodes with maximum and minimum wealth
Note: To specify homophily on wealth, you would use the ergm-term absdiff
−2.59 × change in # ties + 0.01 × wealth of node 1 + 0.01 × wealth of node 2
−2.59 + 0.01 × 3 + 3 = −2.53
−2.59 + 0.01 × 146 + 146 = 0.33
−2.59 + 0.01 × 146 + 3 = −1.1
Estimation (in one slide)
NME Workshop 45
There is no “closed form” or analytic solution for the estimated coefficients (as there is in OLS: 𝛽 = 𝑋′𝑋 −1(𝑋′𝑌))
Instead, we rely on a defining property of Maximum Likelihood Estimates (MLEs) for exponential family models
At the MLE of the coefficients:
expected values of the statistics under the model = the observed statistics
And we find these MLEs using an iterative search algorithm
A “Markov Chain Monte Carlo” (MCMC) algorithm Start with some initial θ values, simulate a sample of networks from those values
Compare the means of the simulated statistics to the observed values
Update the values of θ based on the deviations
Repeat until the (expected – observed) < epsilon
Estimation (ok, I needed 2 slides)
NME Workshop 46
What does it mean to “simulate networks from those values”?
Pick a dyad at random
Toss a coin to set the tie status
The probability of the tie is determined by the model
And the details of the MCMC sampling algorithm (Gibbs, Metropolis, Metropolis-Hastings)
Repeat (many many many times)
This produces a Markov Chain of networks
Sample from this chain, every 1000th element (say)
Calculate the mean of the model statistics from this sample
And compare the this mean to the observed network statistics
Computationally intensive estimation
NME Workshop 47
Has been key to statistical estimation of complex (i.e., realistic and interesting) models for dependent data
And to the emergence of the field of “data science”
In most cases, it works really well
And there is lots of mathematical theory proving it has good convergence properties
… but, it can run into trouble
especially if the model you’re trying to fit is not a good one for the observed network
Model Degeneracy
NME Workshop 48
Models with dyad dependent terms can behave differently than we expect
They look simple, almost like logistic regression
But they represent effects that cascade through a network via a chain of dependence (this is the “watch out” from earlier)
Homogeneous triangle and k-star terms turn out to be some of the worst offenders
Model Degeneracy
NME Workshop 49
Technical Definition:
When a model places almost all probability on a small number of uninteresting graphs
Most common “uninteresting” graphs: Complete (all links exist)
Empty
Model degeneracy = misspecification The model you specified would almost never produce the network you
observed
Model Degeneracy
NME Workshop 50
Switch to the faux.mesa.high network
Fit a model with: edges + triangle
What happens?
Trying to fit this model, the algorithm heads off into networks that are much more dense than the observed network.
What does this mean? That this model would not have produced this network, for any combination of parameter estimates for the two terms
i.e., this is a model misspecification problem
Degeneracy Plot (for the 2 star model)
NME Workshop 51
Only the white area has networks with some interesting variation
The dark areas are complete graphs, or empty graphs (+/-1 or 2 edges)
This model does not produce many useful networksFrom Mark Handcock’s 2003 tech report:
https://www.csss.washington.edu/Papers/2003/wp39.pdf
Solution: better network statistics
NME Workshop 52
Old statistic: t(x) = # of triangles in the graph
Here, every additional 3-cycle has the same impact,
New statistic: Set declining marginal returns for each additional 3-cycle involving the
same edge
The specific function we place on this shared partner distribution involves a geometric weighting
Hence the name: geometrically weighted edge-wise shared partners A.k.a. GWESP
The parameter that specifies the rate of decline in marginal returns is α
The smaller the α, the more rapid the decline
Solution: better network statistics
NME Workshop 53
spi = # of edges with i shared partners𝑔𝑤𝑒𝑠𝑝 = 𝑒α
𝑖=1
𝑛−2
1 − 1 − 𝑒−α 𝑖 𝑠𝑝𝑖
This configuration contains:• 1 edge with 3 shared partners • 6 edges with 1 shared partner
α GWESP(α)
0 𝑒0 1 − 1 − 𝑒−0 1 × 6 + 𝑒0 1 − 1 − 𝑒−0 3 × 1 = 7
0.5 𝑒0.5 1 − 1 − 𝑒−0.5 1 × 6 + 𝑒0.5 1 − 1 − 𝑒−0.5 3 × 1 = 7.55
1 𝑒1 1 − 1 − 𝑒−1 1 × 6 + 𝑒1 1 − 1 − 𝑒−1 3 × 1 = 8.03
The # of edges with 1+ shared partners
Solution: better network statistics
NME Workshop 54
spi = # of edges with i shared partners𝑔𝑤𝑒𝑠𝑝 = 𝑒α
𝑖=1
𝑛−2
1 − 1 − 𝑒−α 𝑖 𝑠𝑝𝑖
Count of edges in each triangle (i.e. # of triangles x 3)
Count of edges in at least one triangle (because only an edge’s first triangle counts)
Adding a gwesp term to the faux.mesa.high model
And conducting model assessments
… to StatnetWeb55
NME Workshop
Fitting and diagnosing a model
NME Workshop 56
Convergence is the first assessment
Dyad independent models will always converge
Dyad dependent models may or may not
Next step depends on the model:
Dyad independent Dyad dependent
Goodness of fit assessment: GOF plots
Convergence assessment: MCMC diagnostics
What are MCMC Diagnostics?
NME Workshop 57
MCMC Diagnostics tell us if the estimation algorithm is mixing well
These look pretty good
The traceplots on the left show random walks around the target value (a bit of correlation in the sampled networks, but not enough to cause concern)
The distribution of sampled statistics on the right is centered on the target values
These are taken from the penultimate MCMC chain, which is stored in the ergmoutput object
Goodness of Fit (GOF)
NME Workshop 58
Traditional GOF stats can be used AIC, BIC are included in the model summary
We also take another approach
We are interested in how well we fit aggregate properties of the network structure that we did not include as model terms
This helps to identify what the model gets wrong
We use 3 “higher order” statistics:
Degree distribution
Shared partner distribution (non-parametric) (local clustering)
Geodesic distance distribution (global clustering)
DATA
MODEL
ESTIMATEDCOEFFICIENTS
SIMULATED DATA(draws from the prob. dist.)
HIGHER ORDERGRAPH STATISTICS
OF DATA
HIGHER ORDERGRAPH STATISTICS
OF SIMULATED DATA
GOODNESS OF FIT OF MODEL
TO DATA
NME Workshop 59
We’ll show how to do this next
Take a break?60
NME Workshop
So let’s run and compare several models
NME Workshop 61
These will allow us to examine the evidence for homophily vs. transitivity
We’ll assess the convergence of the different models
As well as the goodness of fit
And the implications for the generative process of high school friendship patterns in this network
Fit the Bernoulli model to faux.mesa.high
62
Estimate, and run the default set of GOF terms for this model:
NME Workshop
faux.mesa.high ~ edges
Is this a dyad independent or dyad dependent model?
Dyad independent models are not fit with MCMC, so we don’t need to check MCMC diagnostics
We can move directly to GOF
Save the model
NME Workshop 63
This will keep the results so we can compare them later
Run the GOF for this model
64
Go to the Goodness of Fit tab
NME Workshop
Run the default GOF
This will take a moment because GOF is simulating 100 networks from the model, and calculating the default summary statistics for each one
Data:
Model: Bernoulli
(i.e. edges only)
Goodness of fit measure 1: degree distribution
NME Workshop 65
Boxplots show 100 simulations from the Bernoulli model
Black line shows the observed data from faux.mesa.high
Data:
Model: Bernoulli
(i.e. edges only)
This edge has an ESP value of 3
Goodness of fit measure 2: ESP distribution (local clustering)
NME Workshop 66
Data:
Model: Bernoulli
(i.e. edges only)
A/B have geodesic 2
A/C have geodesic ∞
AB
C
Goodness of fit measure 3: geodesic distribution(global clustering)
NME Workshop 67
faux.mesa.high ~ edges
degree edgewise shared partner geodesic
Goodness of fit measures assembled
NME Workshop 68
Summary: Not a good fit to any of the aggregate structural properties observed
Fit a model with gwesp
69
Estimate, save and assess this model:
NME Workshop
faux.mesa.high ~ edges + gwesp(0.25, fixed = TRUE)
Save this model too.
This is a dyad dependent model
It converges (unlike the triangle model)
It is fit with MCMC
So we should check the MCMC diagnostics
Run the MCMC diagnostics for this model
70
Go to the MCMC diagnostics tab
NME Workshop
Select Model 2
Looks pretty good
Run the Goodness of Fit
71
edge-wise shared partnersdegree minimum geodesic distance
NME Workshop
faux.mesa.high ~ edges + gwesp(0.25, fixed = TRUE)
Much better, though the ESP distribution fit isn’t great
And, a quick eyeball test…
NME Workshop 72
So, back to our original question:How much of the clustering is due to homophily, and how much to transitivity?
The global structure looks kinda similar now, ... But something is not right
Observed network Network simulated from model*
* We’ll get to simulation in just a bit
Test this by comparing four models
Model Network Statistics g(y)
Edges # of edges
Edges + GWESP
(transitivity)
# of edges
weighted shared partners
Edges + Attributes
(homophily)
# of edges
# of edges for each race, sex, grade
# of edges that are within-race, within-grade, within-sex
Edges + Attributes + GWESP
(both)
# of edges
# of edges for each race, sex, grade
# of edges that are within-race, within-grade, within-sex
weighted shared partners
NME Workshop 73
Fitting and saving models
NME Workshop 74
statnetWeb allows you to save up to 5 models – we’ll fit 4 (you can cut and paste from here to statnetWeb):
1. edgesFit model, save model, reset formula
2. edges + gwesp(0.25, fixed = T)Fit model, save model, reset formula
3. edges + nodefactor("Grade") + nodefactor("Race") + nodefactor("Sex") + nodematch("Grade", diff = T) + nodematch("Race", diff = F) + nodematch("Sex", diff = F) Fit model, save model, reset formula
4. edges + nodefactor("Grade") + nodefactor("Race") + nodefactor("Sex") + nodematch("Grade", diff = TRUE) + nodematch("Race", diff = FALSE) + nodematch("Sex", diff = FALSE) + gwesp(0.25, fixed = TRUE)Fit model, save model
You’ve already fit and saved these
Model Comparison
NME Workshop 75
Note how the gwespestimate changes from model 2 to 4
About 25% smaller
That’s the impact of controlling for attribute effects, including homophily
Homophilyestimates change also, once you control for transitivity
GOF comparison for all 4 models:
1. Edges
AIC: 2288
3. Edges + Attributes
AIC: 1809
2. Edges + GWESP
AIC: 1999
4. Edges + Attributes +
GWESP
AIC: 1648
76NME Workshop
This will take some time to run
Summary
NME Workshop 77
Both transitivity and homophily play a role in clustering these friendships• Homophily accounts for the distribution of path lengths (geodesics)
• Transitivity (Triadic closure)
• Accounts for the large number of isolates
• Captures the local clustering (ESP) reasonably well
• ~25% of the transitivity effect is a by-product of homophily
• The gwesp coefficient drops by ~25% when homophily is added to the model
The GOF suggests the ESP distribution is still not well fit• You could tinker some more, if this was a real research question
• But we’ll move on…
Simulating networks from the model
NME Workshop 78
A fitted model describes a probability distribution across all networks of this size
The model assigns a probability to every possible network
The model terms and the estimated coefficients make some networks more likely than others
You can simulate networks from this distribution
Using the same MCMC algorithm that was used for estimation
And the simulated networks will be centered on the network statistics in the original observed network
This, of course, is why these models are really useful for network epidemiology
Simulations
NME Workshop 79
Choose one of the models that you have saved and run 100 simulations with the default control settings
Choose the model on the Simulations page next to “ergm formula”
Do you see autocorrelation in the simulation statistics?
Increase the MCMC interval to 10,000 and re-run the simulations to see how this changes the autocorrelation
Some common statistics in ergms
NME Workshop 80
Term Formula Unit Value(s)
~edges # of edges edges 8
~nodefactor(“color”) Sum of degrees for nodes of each color nodes/edges* [8,] 6, 2
~nodefactor(“color”,
base=2)
Sum of degrees for nodes of each color nodes/edges* 8, [6,] 2
~nodematch(“color”) # of edges between nodes of same color edges 6
~nodematch(“color”, diff
= TRUE)
# of edges between nodes of same color, for each color
edges 3, 2, 1
undirected network of 10 nodes, including nodal attribute “color”, with values:
1=black,
2=red,
3=green
Some common statistics in ergms
NME Workshop 81
Term Formula Unit Value(s)
~nodemix(“color”, base=1) # of edges between nodes of each color combo edges [3,] 2, 2, 0, 0, 1
~degree(0) # of nodes of degree 0 nodes 2
~degree(2:5) # of nodes of degrees 2, 3, 4, 5 each nodes 1, 2, 1, 0
~concurrent # of nodes of at least degree 2 nodes 4
undirected network of 10 nodes, including nodal attribute “color”, with values:
1=black,
2=red,
3=green
Some common statistics in ergms
NME Workshop 82
Term Formula Unit Value(s)
~triangle # of triangles (beware!) triangles 2
~gwesp(0) # of edges in at least one triangle edges 5
~gwesp(∞) # of edges in triangles total (=3 * # triangles) triangles 6
undirected network of 10 nodes, including nodal attribute “color”, with values:
1=black,
2=red,
3=green
Next
NME Workshop 83
Collecting network data
There are different types
And the type you collect has implications for the statistical methods that you can use
Selected References
NME Workshop 84
Journal of Statistical Software (v42) 2008 – Eight papers on the statnet software; covering theory, algorithms and usage
Goodreau, S., et al. (2009). "Birds of a Feather, or Friend of a Friend? Using Statistical Network Analysis to Investigate Adolescent Social Networks." Demography 46(1): 103–125.
Handcock MS. (2003) Assessing Degeneracy in Statistical Models of Social Networks. CSSS working paper 39. https://www.csss.washington.edu/research/working-papers/39
Hunter DR. Curved Exponential Family Models for Social Networks. (2007) Social networks. 29(2):216-30. doi: 10.1016/j.socnet.2006.08.005. PubMed PMID: PMC2031865.
Hunter DR, Handcock MS. Inference in Curved Exponential Family Models for Networks. (2006) Journal of Computational and Graphical Statistics. 15(3):565-83. doi: 10.1198/106186006X133069.
Hunter DR, Goodreau SM, Handcock MS. Goodness of fit of social network models. (2008) J Am Stat Assoc. 103(481):248-58. doi: 10.1198/016214507000000446. PubMed PMID: WOS:000254311500029.