ICPSR - Complex Systems Models in the Social Sciences - Lecture 5(a) - Professor Daniel Martin Katz
COMPLEX SYSTEMS MODELS IN THE SOCIAL SCIENCES
MICHAEL J BOMMARITO II DANIEL MARTIN KATZ
Structure and Community Detection in Networks
Definition – Simple Version
• Broadly: “a group of nodes that are relatively densely connected to each other but sparsely connected to other dense groups in the network”
  ◦ Porter, Onnela, Mucha. Communities in Networks. Notices of the AMS, 2009.
• Examples:
  ◦ Cliques in a high school social network
  ◦ Voting coalitions in Congress
  ◦ Consumer types in a network of co-purchases
Michael J. Bommarito II, Daniel Martin Katz
Example – Social Networks
Imagine this graph …
Example – Social Networks
What factors might affect the formation of friendships in a high school social network?
Ideas: age, gender, class, race, interests
How might we assign communities to this network?
Vertices: people
Edges: friendship
Example – Social Networks
One possible assignment: Girls, Boys
Example – Voting Coalitions
Now let’s look at the same network as if it represented co-voting in the Senate.
Ideas: issue position, geography, ethnicity, gender
How might we assign communities to this network?
Vertices: people
Edges: co-voted at least once
Example – Voting Coalitions
One possible assignment: Republicans, Democrats, Independents
Context!
Note that we have assigned community membership differently despite observing the same graph! Community detection is not a concept that can be divorced from context.
Directedness
Undirected Directed
Directedness
Many methods do not incorporate direction!
Many methods that do incorporate direction do not allow for bidirected edges.
Different software packages may implement the same “method” with or without support for directed edges.
Weights
Unweighted:
• Binary relationships
• Data limitations
Weighted:
• Relationship strength
• Frequency of relationship
• Flow
Weights
Note edge thickness.
Weights
Many methods do not incorporate edge weights!
Methods that do incorporate edge weights may differ in acceptable values!
• Integers or real weights
• Strictly positive weights
Different software packages may implement the same “method” with or without support for weighted edges.
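To make the caveats about direction and weights concrete, here is a minimal Python sketch (not from the lecture) of what a method effectively sees when it supports neither. The function names, the toy arcs, and the rule of summing opposite arcs are assumptions chosen for illustration; other merging rules such as the maximum or the mean are equally defensible.

```python
from collections import defaultdict

def to_undirected_unweighted(weighted_arcs):
    """Collapse (source, target, weight) arcs into undirected, unweighted edges."""
    edges = set()
    for source, target, _weight in weighted_arcs:
        if source != target:                      # ignore self-loops
            edges.add(frozenset((source, target)))
    return edges

def to_undirected_weighted(weighted_arcs):
    """Merge opposite arcs into one undirected edge whose weight is their sum.

    Summing is an assumption of this sketch, not a universal convention.
    """
    totals = defaultdict(float)
    for source, target, weight in weighted_arcs:
        if source != target:
            totals[frozenset((source, target))] += weight
    return dict(totals)

# Hypothetical toy data: a reciprocated tie a<->b and a one-way tie b->c.
arcs = [("a", "b", 3.0), ("b", "a", 1.0), ("b", "c", 0.5)]
simple = to_undirected_unweighted(arcs)
summed = to_undirected_weighted(arcs)
print(len(simple), summed[frozenset(("a", "b"))])
```

Running both conversions on the same data before choosing a method shows how much information (asymmetry, tie strength) a given implementation would discard.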
Resolution
Resolution is a concept inherited from optics. According to Wikipedia, optical resolution describes the ability of an imaging system to resolve detail in the object that is being imaged.
High resolution:
• Can make out many details! (15.1 MP)
• But…
  ◦ Details may be noise
  ◦ Sometimes they don’t matter!
Low resolution:
• Can’t read a word!
• But…
  ◦ Can focus on broad regions
  ◦ Noise is out of focus
Resolution
High resolution (microscopic) vs. low resolution (macroscopic)
Same graphs!
Resolution
Different hypotheses or questions correspond to different resolutions.
Different methods are more or less effective at detecting community structure at different resolutions.
Modularity-based methods cannot detect structure below a known resolution limit.
Overlapping Communities
Palla, Derényi, Farkas, Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature 435, 2005.
Computational Complexity Refresher
Computational complexity is a serious issue!
Data is becoming more abundant and more detailed. Many quantitative research projects hinge on the feasibility of calculations. Understanding computational complexity can allow you to communicate with department IT personnel or computer scientists to solve your problem. Make sure your project is feasible before committing the time!
Computational Complexity Refresher
Computational complexity in the context of modern computing is primarily focused on two resources:
1. Time: How long does it take to perform a sequence of operations?
   • CPU/GPU
   • Exact vs. approximate solutions
2. Storage: How much space does it take to store our problem?
   • Memory and “persistent” storage (to a lesser degree)
   • Data representations
We tend to communicate time and storage complexity through “Big-O notation.”
Computational Complexity Refresher
In computational complexity, “Big-O notation” conveys information about how time and storage costs scale with inputs.
• O(1): constant, independent of input
• O(n): scales linearly with the size of input
• O(n^2): scales quadratically with the size of input
• O(n^3): scales cubically with the size of input
These terms often occur with log n terms and are then given the prefix “quasi-.”
For graph algorithms, the input n is typically
• |V|, the number of vertices
• |E|, the number of edges
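As a rough illustration of how storage scales with |V| and |E|, the sketch below compares a dense adjacency matrix, which needs O(|V|^2) cells, with an edge list, which needs O(|E|) entries. The helper names and the assumed density of about 10 edges per vertex are illustrative choices, not figures from the lecture.

```python
def adjacency_matrix_cells(num_vertices):
    """Cells in a dense adjacency matrix: grows as O(|V|^2)."""
    return num_vertices * num_vertices

def edge_list_entries(num_edges):
    """Endpoints stored in an edge list: grows as O(|E|)."""
    return 2 * num_edges

for n in (1_000, 10_000, 100_000):
    m = 10 * n  # assumption: a sparse graph with roughly 10 edges per vertex
    print(f"|V|={n:>7}  matrix cells={adjacency_matrix_cells(n):>14,}  "
          f"edge-list entries={edge_list_entries(m):>10,}")
```

Multiplying the number of vertices by 10 multiplies the matrix size by 100 but the edge list by only 10, which is why the choice of data representation can decide whether a project is feasible.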
Taxonomy of Methods
This taxonomy of methods follows the history of their development.
• Divisive Methods
  ◦ Edge-betweenness (2002)
• Modularity Methods
  ◦ Fast-greedy (2004)
  ◦ Leading eigenvector (2006)
• Dynamic Methods
  ◦ Clique percolation (2005)
  ◦ Walktrap (2005)
More on my blog: Summary of community detection algorithms in igraph
• http://bommaritollc.com/2012/06/17/summary-community-detection-algorithms-igraph-0-6/
Edge Betweenness
Publication(s): Girvan, Newman. Community structure in social and biological networks. PNAS, 2002.
Basic Idea: Divide the network into successively smaller pieces by finding edges that “bridge” communities.
Constraints:
• Can be adapted to directed networks (igraph).
• Can be adapted to weights (no public software).
Time Complexity: O(|V|^3) in general, O(|V|^2 log |V|) for special cases
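The basic idea can be sketched in plain Python as follows. This is an illustrative reimplementation for small unweighted, undirected graphs without self-loops, not the igraph code: it computes edge betweenness with a breadth-first search from every vertex and then removes the highest-scoring edge until the graph falls apart into more pieces. The function names and the two-triangle test graph are assumptions made for the example.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph.

    adj maps each vertex to the set of its neighbours.
    """
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        # Breadth-first search that counts shortest paths from source.
        dist = {source: 0}
        paths = {v: 0 for v in adj}
        paths[source] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Push path counts back from the farthest vertices toward the source.
        dependency = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = paths[v] / paths[w] * (1.0 + dependency[w])
                score[frozenset((v, w))] += share
                dependency[v] += share
    # Every undirected path was counted once from each endpoint.
    return {edge: value / 2.0 for edge, value in score.items()}

def components(adj):
    """Connected components as a list of vertex sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts

def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the component count increases."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)
        if not scores:
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Hypothetical test graph: two triangles joined by the bridge 2-3.
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
bridge_score = edge_betweenness(two_triangles)[frozenset((2, 3))]
parts = girvan_newman_split(two_triangles)
print(bridge_score, parts)
```

Recomputing betweenness after every removal is what makes the method expensive, which is consistent with the cubic cost quoted above.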
Edge Betweenness
From the paper:
Quick Aside – Zach’s Karate Club
Zachary's Karate Club: Social network of friendships between 34 members of a karate club at a US university in the 1970s.
Event: During the observation period, the club broke into 2 smaller clubs. This split occurred along a pre-existing social division between the two “communities” in the network.
Drawn from the paper: Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research 33, 1977.
Download the data: http://www-personal.umich.edu/~mejn/netdata/
Edge Betweenness
Only misclassification
Edge Betweenness
Betweenness tends to get the big picture right. However, resolution can be a problem! Do not draw conclusions about small communities from this algorithm alone.
Modularity
• e_i is the number of edges in module i
• d_i is the total degree of vertices in module i
• m is the total number of edges in the network
Q is the difference between the observed connectivity within modules and its expected value under the configuration model (degree distribution fixed).
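With the symbols defined on this slide, the standard form of modularity for an undirected, unweighted graph is Q = Σ_i [ e_i/m − (d_i/(2m))^2 ], summed over modules i. The sketch below computes it directly from an edge list; the function name, the `membership` dictionary, and the two-triangle example graph are assumptions made for illustration.

```python
def modularity(edges, membership):
    """Q = sum over modules i of e_i/m - (d_i/(2m))^2.

    edges: list of (u, v) pairs of an undirected, unweighted graph.
    membership: dict mapping each vertex to its module label.
    """
    m = len(edges)
    internal = {}  # e_i: edges with both endpoints in module i
    degree = {}    # d_i: total degree of the vertices in module i
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

# Hypothetical example: two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
split = {0: "left", 1: "left", 2: "left", 3: "right", 4: "right", 5: "right"}
lumped = {v: "all" for v in range(6)}
q_split = modularity(edges, split)
q_lumped = modularity(edges, lumped)
print(q_split, q_lumped)
```

Placing every vertex in a single module always gives Q = 0, so a positive Q for the two-triangle split indicates more within-module edges than the configuration model would expect.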
Modularity
Remember our previous discussion on computational complexity?
Modularity maximization is an NP-hard problem.
This means that no polynomial-time algorithm for finding the exact optimum is known, and none exists unless P = NP.
Practical methods therefore settle for approximate solutions.
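One way to see why exhaustive search is hopeless is to count the candidate partitions. The number of ways to divide n labelled vertices into communities is the Bell number B(n); the sketch below computes it with the Bell triangle. The size of the search space does not by itself prove NP-hardness, so this is offered only as intuition, and the function name is an illustrative choice.

```python
def bell_number(n):
    """Number of ways to partition n labelled vertices into groups (Bell triangle)."""
    row = [1]
    for _ in range(n - 1):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]

for n in (5, 10, 34):  # 34 is the size of Zachary's karate club
    print(n, bell_number(n))
```

Even for a 34-vertex network the count has dozens of digits, so evaluating modularity for every partition is infeasible.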
Modularity
Benjamin H. Good, Yves-Alexandre de Montjoye & Aaron Clauset, The Performance of Modularity Maximization in Practical Contexts, Phys. Rev. E 81, 046106 (2010)
Fast Greedy
Publication(s):
• Newman. Fast algorithm for detecting community structure in networks. Phys. Rev. E, 2004.
• Clauset, Newman, Moore. Finding community structure in very large networks. Phys. Rev. E, 2004.
• Wakita, Tsurumi. Finding Community Structure in Mega-scale Social Networks. 2007.
Basic Idea: Greedily assemble larger and larger communities from the ground up. Start by placing each vertex in its own community and then, at each step, combine the pair of communities whose merger produces the best modularity.
Constraints:
• Can be adapted to directed edges (no public software).
• Can be adapted to weights (igraph).
Time Complexity: O(|E||V| log |V|) worst case
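The agglomerative idea can be sketched without any of the data structures that make the published algorithms fast. The version below is deliberately naive and is not the Clauset-Newman-Moore implementation: at every step it rescans all edges, merges the connected pair of communities with the largest modularity gain ΔQ = e_ij/m − d_i·d_j/(2m^2), and stops when no merge improves Q. The function name is an assumption, and vertex labels are assumed to be mutually comparable (for example, integers).

```python
def greedy_modularity(edges):
    """Naive greedy agglomeration on an undirected, unweighted edge list."""
    m = len(edges)
    community = {}
    for u, v in edges:
        community.setdefault(u, u)  # each vertex starts in its own community
        community.setdefault(v, v)
    degree = {c: 0 for c in community}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    while True:
        # e_ij: number of edges running between each pair of communities.
        between = {}
        for u, v in edges:
            cu, cv = community[u], community[v]
            if cu != cv:
                pair = (cu, cv) if cu < cv else (cv, cu)
                between[pair] = between.get(pair, 0) + 1
        best_gain, best_pair = 0.0, None
        for (a, b), e_ab in sorted(between.items()):  # sorted for determinism
            gain = e_ab / m - degree[a] * degree[b] / (2 * m * m)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            break  # no merge increases modularity
        a, b = best_pair
        for vertex in list(community):
            if community[vertex] == b:
                community[vertex] = a
        degree[a] += degree.pop(b)
    return community

# Hypothetical example: two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
found = greedy_modularity(edges)
print(found)
```

Because each merge is final, an early merge that looks good locally can never be undone, which is one reason greedy methods may absorb small communities into large ones.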
Fast Greedy
Fast-Greedy also tends to aggressively create larger communities to the detriment of smaller communities.
Why is this node red instead of blue?
Leading Eigenvector
Publication(s):
• Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 2006.
• Leicht, Newman. Community structure in directed networks. Phys. Rev. Lett., 2008.
Basic Idea: Use the signs of the components of the leading eigenvector of the modularity matrix (which plays a role analogous to the graph Laplacian) to sequentially divide the network.
Constraints:
• Can be adapted to directed edges (no public software).
• Can be adapted to weights (igraph).
Time Complexity: O(|V|^2)
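A single bisection step can be sketched in plain Python. The sketch assumes the matrix in question is Newman's modularity matrix B_ij = A_ij − k_i·k_j/(2m) and finds its leading eigenvector by shifted power iteration; the function name, the iteration count, the Gershgorin shift, and the random seed are illustrative assumptions. A full implementation would also check that the leading eigenvalue is positive and would keep subdividing the resulting groups.

```python
import random

def leading_eigenvector_split(edges, iterations=500, seed=0):
    """Split a simple undirected graph in two by eigenvector signs."""
    vertices = sorted({v for edge in edges for v in edge})
    index = {v: i for i, v in enumerate(vertices)}
    n, m = len(vertices), len(edges)
    adjacency = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        adjacency[index[u]][index[v]] += 1.0
        adjacency[index[v]][index[u]] += 1.0
    degree = [sum(row) for row in adjacency]
    # Modularity matrix: B_ij = A_ij - k_i * k_j / (2m)
    B = [[adjacency[i][j] - degree[i] * degree[j] / (2.0 * m)
          for j in range(n)] for i in range(n)]
    # Shift by a Gershgorin bound so that power iteration converges to the
    # eigenvector of the largest (not the largest-magnitude) eigenvalue.
    shift = max(sum(abs(x) for x in row) for row in B)
    rng = random.Random(seed)
    vector = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        product = [sum(B[i][j] * vector[j] for j in range(n)) + shift * vector[i]
                   for i in range(n)]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    positive = {v for v in vertices if vector[index[v]] > 0}
    return positive, set(vertices) - positive

# Hypothetical example: two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
groups = leading_eigenvector_split(edges)
print(groups)
```

Which group ends up labelled "positive" depends on the arbitrary sign of the eigenvector, so only the division itself is meaningful.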
Leading Eigenvector
Note that the leading eigenvector results seem to split the difference between edge betweenness and fast-greedy in this case.
Why are these nodes not a part of the larger modules?
Walktrap
Publication(s): Pons, Latapy. Computing communities in large networks using random walks. JGAA, 2006.
Basic Idea: Simulate many short random walks on the network and compute pairwise similarity measures based on these walks. Use these similarity values to aggregate vertices into communities.
Constraints:
• Can be adapted to directed edges (igraph).
• Can be adapted to weights (igraph).
• Can alter resolution by walk length (igraph).
Time Complexity: depends on walk length, O(|V|^2 log |V|) typically
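The similarity measure behind this idea can be sketched as follows. Instead of sampling walks, the sketch propagates exact walk probabilities for a fixed number of steps, which gives the quantity that sampled walks would estimate, and then compares two vertices with a degree-normalized distance in the spirit of Pons and Latapy. The function names, the default of 4 steps, and the test graph are assumptions for illustration, and the graph is assumed to have no isolated vertices.

```python
def walk_probabilities(adj, start, steps):
    """Probability of being at each vertex after `steps` random-walk steps."""
    prob = {v: 0.0 for v in adj}
    prob[start] = 1.0
    for _ in range(steps):
        nxt = {v: 0.0 for v in adj}
        for v, p in prob.items():
            if p:
                share = p / len(adj[v])  # move to a uniformly chosen neighbour
                for w in adj[v]:
                    nxt[w] += share
        prob = nxt
    return prob

def walk_distance(adj, a, b, steps=4):
    """Degree-normalized distance between the walk distributions of a and b."""
    pa = walk_probabilities(adj, a, steps)
    pb = walk_probabilities(adj, b, steps)
    return sum((pa[k] - pb[k]) ** 2 / len(adj[k]) for k in adj) ** 0.5

# Hypothetical test graph: two triangles joined by the bridge 2-3.
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
near = walk_distance(two_triangles, 0, 1)
far = walk_distance(two_triangles, 0, 5)
print(near, far)
```

Vertices in the same triangle end up close together and vertices in different triangles far apart; an agglomerative step would then merge the closest groups first. Changing `steps` changes how far each walk "sees", which is the resolution knob mentioned above.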
Walktrap
Walktrap assigns vertices to different communities than previous algorithms. Note that the simulated walk length can be changed to alter resolution. Furthermore, simulation is stochastic and thus results may change even after fixing the walk length and input graph!
Method Comparison
Panels: Edge-Betweenness, Fast-Greedy, Leading Eigenvector, Walktrap
Recommended Software - igraph
• Core Library: C
• Interfaces: Python, R, Ruby
• Features: Graph operations & algorithms, random graph generation, graph statistics, community detection, visualization layout, plotting
• URL: http://igraph.sourceforge.net/
• Documentation: http://igraph.sourceforge.net/documentation.html
Example Python Source Code
Frontiers of Community Detection: Temporal Network Dynamics
Gergely Palla, Albert-László Barabási & Tamás Vicsek, Quantifying Social Group Evolution, Nature 446:7136, 664-667 (2007)
Frontiers of Community Detection: Community Structure Over Scales, Time Period, etc.
Science, 14 May 2010, Vol. 328, no. 5980, pp. 876-878
Community Detection Review Articles
Some Useful Review Articles:
• Mason A. Porter, Jukka-Pekka Onnela and Peter J. Mucha. 2009. “Communities in Networks.” Notices of the American Mathematical Society 56: 1082-1166.
• Santo Fortunato. 2010. “Community detection in graphs.” Physics Reports 486: 75-174.
A Transition to Our Sink Method Paper
• Provide a very brief introduction to the Exponential Random Graph Models (p*)
• Now we are going to transition to a specific project, where we apply some of the ideas contained herein
Our Sink Paper – Physica A
Dynamic Acyclic Digraphs
• We are interested in conducting community detection in the special case of dynamic acyclic digraphs.
• Before we transition to the full presentation, some background:
  ◦ Dynamic = Changing both locally and globally
  ◦ Digraph = Directed graph
  ◦ Acyclic = No cycles because current documents generally cannot cite documents in the future
Dynamic Acyclic Digraphs
Case-to-case judicial citation networks are dynamic acyclic digraphs.
So are academic citation networks, patent citation networks, etc.
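Real citation data sometimes contains errors that introduce cycles, so it can be worth verifying the acyclic property before relying on it. The sketch below checks a list of (citing, cited) pairs with Kahn's topological-sort algorithm; the function name and the toy document identifiers are hypothetical.

```python
from collections import deque

def is_acyclic(citations):
    """Return True if the directed citation graph contains no cycle."""
    out_edges, in_degree = {}, {}
    for citing, cited in citations:
        out_edges.setdefault(citing, []).append(cited)
        out_edges.setdefault(cited, [])
        in_degree[cited] = in_degree.get(cited, 0) + 1
        in_degree.setdefault(citing, 0)
    # Start from documents that nothing cites and peel the graph layer by layer.
    queue = deque(v for v, d in in_degree.items() if d == 0)
    visited = 0
    while queue:
        v = queue.popleft()
        visited += 1
        for w in out_edges[v]:
            in_degree[w] -= 1
            if in_degree[w] == 0:
                queue.append(w)
    # Vertices that are never reached must lie on, or downstream of, a cycle.
    return visited == len(in_degree)

clean = [("case_c", "case_b"), ("case_b", "case_a"), ("case_c", "case_a")]
broken = [("case_a", "case_b"), ("case_b", "case_a")]
print(is_acyclic(clean), is_acyclic(broken))
```

The check runs in O(|V| + |E|) time, so it is cheap even for large citation networks.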
Dynamic Acyclic Digraphs
Question: What does modularity mean when there can be no closed paths/walks?
Answer:
Read the paper!
Takeaway: Correct methodologies are ones that make sense in the context of your data.
They don’t always exist already!