Dynamics of networks
Jure Leskovec, Machine Learning Department, Carnegie Mellon University
2
Cyberspace
The emergence of ‘cyberspace’ and the World Wide Web is like the discovery of a new continent. Jim Gray, 1998 Turing Award address
3
Today: Rich data
Large on-line systems have detailed records of human activity
On-line communities:
▪ Facebook (64 million users, billion dollar business)
▪ MySpace (300 million users)
Communication:
▪ Instant Messenger (~1 billion users)
News and social media:
▪ Blogging (250 million blogs world-wide, presidential candidates run blogs)
On-line worlds:
▪ World of Warcraft (internal economy 1 billion USD)
▪ Second Life (GDP of 700 million USD in ’07)
Opportunities for impact in science and industry
4
The networks
[Figure panels: a) World wide web, b) Internet (AS), c) Social networks, d) Communication, e) Citations, f) Protein interactions]
5
Networks: What do we know?
We know lots about the network structure:
Properties: Scale free [Barabasi ’99], 6-degrees of separation [Milgram ’67], Navigation [Adamic-Adar ’03, Liben-Nowell ’05], Bipartite cores [Kumar et al. ’99], Network motifs [Milo et al. ’02], Communities [Newman ’99], Conductance [Mihail-Papadimitriou-Saberi ’06], Hubs and authorities [Page et al. ’98, Kleinberg ’99]
Models: Preferential attachment [Barabasi ’99], Small-world [Watts-Strogatz ’98], Copying model [Kleinberg et al. ’99], Heuristically optimized tradeoffs [Fabrikant et al. ’02], Congestion [Mihail et al. ’03], Searchability [Kleinberg ’00], Bowtie [Broder et al. ’00], Transit-stub [Zegura ’97], Jellyfish [Tauro et al. ’01]
We know much less about processes and dynamics of networks
6
My research: Network dynamics
Network dynamics:
Network evolution
▪ How does network structure change as the network grows and evolves?
Diffusion and cascading behavior
▪ How do rumors and diseases spread over networks?
7
My research: Scale matters
We need massive network data for the patterns to emerge:
MSN Messenger network [WWW ’08, Nature ’08] (the largest social network ever analyzed)
▪ 240M people, 255B messages, 4.5 TB data
Product recommendations [EC ’06]
▪ 4M people, 16M recommendations
Blogosphere [work in progress]
▪ 60M posts, 120M links
8
My research: The structure
Diffusion & Cascades
Network Evolution
Patterns & Observations
Q1: What do cascades look like?
Q2: How likely are people to get influenced?
Q5: How does network structure change as the network grows/evolves?
Models & Algorithms
Q3: How do we find influential nodes?
Q4: How do we quickly detect epidemics?
Q6: How do we generate realistic looking and evolving networks?
10
Diffusion and Cascades
Behavior that cascades from node to node like an epidemic:
▪ News, opinions, rumors
▪ Word-of-mouth in marketing
▪ Infectious diseases
As activations spread through the network they leave a trace: a cascade
[Figure: a network and the cascade (propagation graph) left on it]
11
Setting 1: Viral marketing
People send and receive product recommendations, purchase products
Data: Large online retailer: 4 million people, 16 million recommendations, 500k products
10% credit 10% off
[w/ Adamic-Huberman, EC ’06]
12
Setting 2: Blogosphere
Bloggers write posts and refer (link) to other posts and the information propagates
Data: 10.5 million posts, 16 million links
[w/ Glance-Hurst et al., SDM ’07]
13
Q1) What do cascades look like? Are they stars? Chains? Trees?
Information cascades (blogosphere) and viral marketing (DVD recommendations)
[Figures: propagation shapes for each setting (ordered by frequency)]
Viral marketing cascades are more social: collisions (no summarizers), richer non-tree structures
[w/ Kleinberg-Singh, PAKDD ’06]
14
Q2) Human adoption curves
Prob. of adoption depends on the number of friends who have adopted [Bass ’69, Granovetter ’78]
What is the shape? Diminishing returns? Critical mass?
Distinction has consequences for models and algorithms
[Figure: two candidate curves of prob. of adoption vs. k = number of friends adopting]
To find the answer we need lots of data
15
Q2) Adoption curve: Validation
Later similar findings were made for group membership [Backstrom-Huttenlocher-Kleinberg ‘06], and probability of communication [Kossinets-Watts ’06]
[Figure: probability of purchasing (0 to 0.1) vs. # recommendations received (0 to 40); DVD recommendations (8.2 million observations)]
Adoption curve follows diminishing returns. Can we exploit this?
[w/ Adamic-Huberman, EC ’06]
16
My research: The structure
Diffusion & Cascades
Network Evolution
Patterns & Observations
A1: Cascade shapes
A2: Human adoption follows diminishing returns
Q5: How does network structure change as the network grows/evolves?
Models & Algorithms
Q3: How do we find influential nodes?
Q4: How do we quickly detect epidemics?
Q6: How do we generate realistic looking and evolving networks?
17
Cascade & outbreak detection
Blogs: information epidemics
▪ Which are the influential/infectious blogs?
Viral marketing [Domingos-Richardson]
▪ Who are the trendsetters? Influential people?
Disease spreading
▪ Where to place monitoring stations to detect epidemics?
18
The problem: Detecting cascades
How to quickly detect epidemics as they spread?
[Figure: cascades c1, c2, c3 spreading over a network]
[w/ Krause-Guestrin et al., KDD ’07](best student paper)
19
Two parts to the problem
Cost: Cost of monitoring is node dependent
Reward: Minimize the number of affected nodes:
▪ If A are the monitored nodes, let R(A) denote the number of nodes we save
We also consider other rewards:
▪ Minimize time to detection
▪ Maximize number of detected outbreaks
[Figure: monitored set A and the nodes it saves, R(A)]
[w/ Krause-Guestrin et al., KDD ’07](best student paper)
20
Optimization problem
Given: Graph G(V,E), budget M, and data on how cascades C1, …, Ci, …, CK spread over time
Select a set of nodes A maximizing the reward R(A) = ∑i Prob(i) Ri(A), where Ri(A) is the reward for detecting cascade i,
subject to cost(A) ≤ M
Solving the problem exactly is NP-hard: Max-cover [Khuller et al. ’99]
21
Detection: Solution outline
Submodularity of the reward functions
CELF: algorithm with approximation guarantee
Lazy evaluation to speed-up CELF
We extend the result of [Kempe-Kleinberg-Tardos ’03]
[w/ Krause-Guestrin et al., KDD ’07](best student paper)
22
Background: Greedy hill-climbing
If the objective function exhibits diminishing returns: Submodularity (think of it as “convexity”)
Theorem [Nemhauser et al. ’78]: If R is a function that is monotone and submodular, then k-step greedy produces a set A for which R(A) is within (1-1/e) (~63%) of optimal.
Greedy: Add the sensor with the highest marginal gain
[Figure: nodes a, b, c, d, e ranked by marginal reward]
Slow! O(kN)
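The greedy step on this slide can be sketched on a toy coverage problem. This is only an illustration under stated assumptions: the reward is the size of the union of covered items, and `toy_sets`, `coverage`, and `greedy` are hypothetical names that do not come from the talk.

```python
def coverage(selected, sets):
    """Reward R(A): number of distinct items covered by the chosen sets."""
    covered = set()
    for name in selected:
        covered |= sets[name]
    return len(covered)


def greedy(sets, k):
    """Pick up to k sets, each time adding the one with the highest marginal gain."""
    chosen = []
    for _ in range(k):
        base = coverage(chosen, sets)
        best, best_gain = None, -1
        # Every remaining candidate is re-evaluated in every round (the slow part).
        for name in sorted(sets):  # sorted for deterministic tie-breaking
            if name in chosen:
                continue
            gain = coverage(chosen + [name], sets) - base
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break
        chosen.append(best)
    return chosen


# Hypothetical toy data: which items each candidate sensor would cover.
toy_sets = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6},
    "d": {1, 6},
}
picked = greedy(toy_sets, 2)
print(picked, coverage(picked, toy_sets))
```

Each of the k rounds scans all N candidates, which is where the O(kN) count of reward evaluations comes from.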
23
Reward function is submodular
R(A) is monotone: monitoring more nodes never hurts
We must show R is submodular: for A ⊆ B and a node u,
R(A ∪ {u}) – R(A) ≥ R(B ∪ {u}) – R(B)
(gain of adding a node to a small set ≥ gain of adding a node to a large set)
Natural example: Sets A1, A2, …, An
R(A) = size of union of Ai (size of covered area)
If R1, …, RK are submodular, then ∑ pi Ri is submodular (for weights pi ≥ 0)
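A brute-force check of the diminishing-returns inequality may help make the definition concrete. This is a sketch for tiny ground sets only; `is_submodular`, `toy`, and `cover` are hypothetical names, and the squared-size function is included only as a counter-example.

```python
from itertools import combinations


def is_submodular(reward, ground):
    """Check R(A ∪ {u}) - R(A) >= R(B ∪ {u}) - R(B) for every A ⊆ B and u not in B."""
    items = sorted(ground)
    for size_b in range(len(items) + 1):
        for b in combinations(items, size_b):
            for size_a in range(size_b + 1):
                for a in combinations(b, size_a):  # every subset A of B
                    for u in items:
                        if u in b:
                            continue
                        gain_small = reward(set(a) | {u}) - reward(set(a))
                        gain_large = reward(set(b) | {u}) - reward(set(b))
                        if gain_small < gain_large:
                            return False
    return True


# Hypothetical toy sets; the reward is the size of the union (covered area).
toy = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}


def cover(selected):
    return len(set().union(*(toy[name] for name in selected)))


print(is_submodular(cover, toy))                    # coverage passes the check
print(is_submodular(lambda s: len(s) ** 2, toy))    # squared size does not
```

The check enumerates all subset pairs, so it is exponential and only meant for sanity-testing a reward function on a handful of nodes.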
24
Reward function is submodular
Theorem: Reward function is submodular
Consider cascade i:
▪ Ri(uk) = set of nodes saved from uk
▪ Ri(A) = size of union of Ri(uk), uk ∈ A
⇒ Ri is submodular
Global optimization: R(A) = ∑i Prob(i) Ri(A)
⇒ R is submodular
[Figure: cascade i with sensors u1, u2 and the node sets Ri(u1), Ri(u2)]
[w/ Krause-Guestrin et al., KDD ’07] (best student paper)
25
Solution: CELF Algorithm
We extend the setting to the variable node cost case
We develop the CELF (cost-effective lazy forward-selection) algorithm:
Two independent runs of a modified greedy
▪ Solution set A’: ignore cost, greedily optimize reward
▪ Solution set A’’: greedily optimize reward/cost ratio
Pick best of the two: arg max(R(A’), R(A’’))
Theorem: CELF is near optimal
▪ CELF achieves ½(1-1/e) factor approximation
For the size of our problems naïve CELF is too slow
[w/ Krause-Guestrin et al., KDD ’07] (best student paper)
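The two-run idea can be sketched as follows, without the lazy speed-up. This is an illustrative reading of the slide, not the authors' implementation: the coverage reward, the strictly positive costs, and the names `budgeted_greedy` and `best_of_two` are all assumptions.

```python
def reward(selected, sets):
    """Toy reward: number of distinct items covered."""
    covered = set()
    for name in selected:
        covered |= sets[name]
    return len(covered)


def budgeted_greedy(sets, cost, budget, use_ratio):
    """Greedy under a budget; scores candidates by gain or by gain/cost."""
    chosen, spent = [], 0.0
    while True:
        base = reward(chosen, sets)
        best, best_score = None, 0.0
        for name in sorted(sets):
            if name in chosen or spent + cost[name] > budget:
                continue  # already picked, or does not fit the remaining budget
            gain = reward(chosen + [name], sets) - base
            score = gain / cost[name] if use_ratio else gain  # assumes cost > 0
            if score > best_score:
                best, best_score = name, score
        if best is None:
            return chosen
        chosen.append(best)
        spent += cost[best]


def best_of_two(sets, cost, budget):
    """Run both variants and keep the solution with the higher reward."""
    ignore_cost = budgeted_greedy(sets, cost, budget, use_ratio=False)
    by_ratio = budgeted_greedy(sets, cost, budget, use_ratio=True)
    return max(ignore_cost, by_ratio, key=lambda a: reward(a, sets))


# Hypothetical instance where the reward/cost run wins.
sets = {"big": {1, 2, 3}, "s1": {1, 2}, "s2": {3, 4}, "s3": {5, 6}}
cost = {"big": 3, "s1": 1, "s2": 1, "s3": 1}
solution = best_of_two(sets, cost, budget=3)
print(solution, reward(solution, sets))
```

Neither run is reliable on its own: with one expensive node that covers everything, the ratio run can be the worse of the two, which is why the better of the two solutions is returned.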
26
Scaling up: Lazy evaluation
Observation: Submodularity guarantees that marginal rewards decrease with the solution size
Idea: Use the marginal reward from the previous step as an upper bound on the current marginal reward
[Figure: nodes a, b, c, d, e ranked by marginal reward]
27
CELF: Lazy evaluation
CELF algorithm:
▪ Keep an ordered list of marginal rewards ri from the previous step
▪ Re-evaluate ri only for the top node
[Figure: nodes a, b, c, d, e ranked by marginal reward; the list is re-sorted as top nodes are re-evaluated]
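The lazy evaluation trick can be sketched with a priority queue. This is a unit-cost sketch assuming a coverage reward; `lazy_greedy` and `toy_sets` are hypothetical names, and the full CELF algorithm additionally handles node costs.

```python
import heapq


def lazy_greedy(sets, k):
    """Greedy selection that re-evaluates a marginal gain only when it reaches the top."""
    covered, chosen, evaluations = set(), [], 0
    # Max-heap via negated gains; entries are (-gain, name, round in which gain was computed).
    heap = []
    for name in sorted(sets):
        evaluations += 1
        heap.append((-len(sets[name]), name, 0))
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        neg_gain, name, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            # The gain is up to date. Stale gains below it are upper bounds
            # (submodularity), so no other node can beat this one.
            chosen.append(name)
            covered |= sets[name]
        else:
            # Stale entry: recompute its gain and push it back.
            evaluations += 1
            fresh_gain = len(sets[name] - covered)
            heapq.heappush(heap, (-fresh_gain, name, len(chosen)))
    return chosen, evaluations


# Hypothetical toy data: which items each candidate sensor would cover.
toy_sets = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6},
    "d": {1, 6},
}
chosen, evaluations = lazy_greedy(toy_sets, 2)
print(chosen, evaluations)
```

On this toy input the lazy version selects the same nodes as plain greedy while skipping the re-evaluation of node d in the second round.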
30
Blogs: Information epidemics
Which blogs should one read to catch big stories?
[Figure: reward (higher is better) vs. number of selected blogs (sensors); curves: CELF, In-links, Random, # posts, Out-links; one baseline is annotated “used by Technorati”]
For more info see our website: www.blogcascade.org
31
CELF: Scalability
CELF runs 700x faster than the simple greedy algorithm
[Figure: run time in seconds (lower is better) vs. number of selected blogs (sensors); curves: CELF, Greedy, Exhaustive search]
32
Same problem: Water Network
Given:
▪ a real city water distribution network
▪ data on how contaminants spread over time
Place sensors (to save lives)
Problem posed by the US Environmental Protection Agency
[Figure: water network with sensors S and contamination cascades c1, c2]
[w/ Krause et al., J. of Water Resource Planning]
33
Water network: Results
Our approach performed best at the Battle of Water Sensor Networks competition
Author             Score
CMU (CELF)         26
Sandia             21
U Exeter           20
Bentley Systems    19
Technion (1)       14
Bordeaux           12
U Cyprus           11
U Guelph           7
U Michigan         4
Michigan Tech U    3
Malcolm            2
Proteo             2
Technion (2)       1
[Figure: population saved (higher is better) vs. number of placed sensors; curves: CELF, Population, Random, Flow, Degree]
[w/ Ostfeld et al., J. of Water Resource Planning]
34
My research: The structure
Diffusion & Cascades
Network Evolution
Patterns & Observations
A1: Cascade shapes
A2: Human adoption follows diminishing returns
Q5: How does network structure change as the network grows/evolves?
Models & Algorithms
A3, A4: CELF algorithm for detecting cascades and outbreaks
Q6: How do we generate realistic looking and evolving networks?
35
Background: Network models
Empirical findings on real graphs led to new network models
[Figure: power-law degree distribution (log prob. vs. log degree), explained by the preferential attachment model]
Such models make assumptions/predictions about other network properties
What about network evolution?
36
Q5) Network evolution
What is the relation between the number of nodes and the edges over time?
Prior work assumes: constant average degree over time
Networks are denser over time
Densification Power Law: E(t) ∝ N(t)^a
a … densification exponent (1 ≤ a ≤ 2)
[Figure: E(t) vs. N(t); Internet: a=1.2, Citations: a=1.6]
[w/ Kleinberg-Faloutsos, KDD ’05](best research paper)
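One way to estimate a densification exponent from graph snapshots is a least-squares fit of log E(t) against log N(t). The sketch below assumes that approach and uses synthetic snapshot counts, not data from the talk; `fit_densification_exponent` is a hypothetical helper name.

```python
import math


def fit_densification_exponent(nodes, edges):
    """Least-squares slope of log E(t) against log N(t)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic snapshots (not real data): E(t) = 2 * N(t)^1.5
snapshot_nodes = [100, 400, 1600, 6400]
snapshot_edges = [2 * n ** 1.5 for n in snapshot_nodes]
a = fit_densification_exponent(snapshot_nodes, snapshot_edges)
print(round(a, 3))
```

A slope of 1 corresponds to constant average degree; a slope above 1 means the graph densifies as it grows.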
37
Q5) Network evolution
Prior models and intuition say that the network diameter slowly grows (like log N, log log N)
Diameter shrinks over time: as the network grows the distances between the nodes slowly decrease
[Figure: diameter vs. time and vs. size of the graph, for the Internet and Citations networks]
[w/ Kleinberg-Faloutsos, KDD ’05](best research paper)
38
Q6) Need a new network model
None of the existing models generates graphs with these properties
We need a new model
Idea: Generate graphs recursively
▪ Recursion (fractals)
▪ Skewed distributions (power-laws)
39
The model: Kronecker graphs
Kronecker product of graph adjacency matrices
[Figure: initiator (3x3) and its Kronecker powers (9x9), (27x27); entries pij are edge probabilities]
We prove Kronecker graphs mimic real graphs: power-law degree distribution, densification, shrinking/stabilizing diameter, spectral properties
(actually, there is also a nice social interpretation of the model)
[w/ Chakrabarti-Kleinberg-Faloutsos, PKDD ’05]
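The recursive construction can be sketched by taking Kronecker powers of a small matrix of edge probabilities and then sampling each edge independently. The 2x2 initiator values below are made up for illustration, and `kron`, `kron_power`, and `sample_graph` are hypothetical helper names.

```python
import random


def kron(a, b):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(a), len(b)
    return [
        [a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
        for i in range(n * m)
    ]


def kron_power(initiator, k):
    """k-th Kronecker power of the initiator (k >= 1)."""
    result = initiator
    for _ in range(k - 1):
        result = kron(result, initiator)
    return result


def sample_graph(prob, seed=0):
    """Draw a directed graph: edge (i, j) appears with probability prob[i][j]."""
    rng = random.Random(seed)
    n = len(prob)
    return [(i, j) for i in range(n) for j in range(n) if rng.random() < prob[i][j]]


# Hypothetical initiator of edge probabilities (illustrative values only).
initiator = [[0.9, 0.5], [0.5, 0.1]]
p2 = kron_power(initiator, 2)   # 4x4 matrix of edge probabilities
edges = sample_graph(p2)
print(len(p2), round(sum(map(sum, p2)), 6), len(edges))
```

Sampling this way touches all N^2 entries, so it is only practical for small graphs; it is meant to show the structure of the model rather than an efficient generator.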
40
Q6) Generating realistic graphs
Want to generate realistic networks: given a real network, generate a synthetic network
Compare graph properties, e.g., degree distribution
Why synthetic graphs?
▪ Anomaly detection, Simulations, Predictions, Null-model, Sharing privacy sensitive graphs, …
Q: Which network properties do we care about?
A: Don’t commit, let’s match adjacency matrices
41
Kronecker graphs: Estimation
Maximum likelihood estimation: arg max over the initiator [a b; c d] of P(graph | Kronecker initiator)
Naïve estimation takes O(N! N^2):
▪ N! for different node labelings. Our solution: Metropolis sampling: N! → (big) const
▪ N^2 for traversing the graph adjacency matrix. Our solution: Kronecker product (E << N^2): N^2 → E
Do stochastic gradient descent
We estimate the model in O(E)
[w/ Faloutsos, ICML ’07]
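For a fixed node labeling, the likelihood of a graph under a matrix of independent edge probabilities can be written down directly. The sketch below shows only this naive N^2 traversal; it does not include the Metropolis sampling over labelings or the O(E) speed-up, and `log_likelihood` and the example matrix are assumptions for illustration.

```python
import math


def log_likelihood(edges, prob):
    """Log-probability of a directed graph under independent edge probabilities.

    Assumes a fixed node labeling and 0 < prob[u][v] < 1 for all entries.
    """
    edge_set = set(edges)
    n = len(prob)
    total = 0.0
    for u in range(n):          # naive traversal of all N^2 node pairs
        for v in range(n):
            p = prob[u][v]
            total += math.log(p) if (u, v) in edge_set else math.log(1.0 - p)
    return total


# Hypothetical 2-node example with made-up probabilities.
prob = [[0.9, 0.5], [0.5, 0.1]]
likely = log_likelihood([(0, 0), (0, 1)], prob)
unlikely = log_likelihood([(1, 1)], prob)
print(likely > unlikely)
```

The full estimation problem also has to account for the unknown correspondence between graph nodes and rows of the probability matrix, which is where the N! factor on the slide comes from.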
42
Estimation: Epinions (N=76k, E=510k)
We search the space of ~10^1,000,000 permutations
Fitting takes 2 hours
Real and Kronecker are very close
Fitted initiator: [0.99 0.54; 0.49 0.13]
[Figure panels: Degree distribution (probability vs. node degree); Path lengths (# reachable pairs vs. number of hops); “Network” values (network value vs. rank)]
[w/ Faloutsos, ICML ’07]
43
My research: The structure
Diffusion & Cascades
Network Evolution
Patterns & Observations
A1: Cascade shapes
A2: Human adoption follows diminishing returns
A5: Densification, Shrinking diameters
Models & Algorithms
A3, A4: CELF algorithm for detecting cascades and outbreaks
A6: Kronecker graphs, Estimating Kronecker initiator
Two quick applications
Conclusion and Future work
44
Two applications
P1) Web Projections: Predicting search result quality without page content
P2) MSN Instant Messenger: Social network of the whole world
45
P1) Web Projections
User types in a query to a search engine
Search engine returns results: Is this a good set of search results?
[Figure: results returned by the search engine, hyperlinks between them, and non-search results connecting the graph]
[w/ Dumais-Horvitz, WWW ’07]
46
P1) Web Projections: Results
We can predict search result quality with 80% accuracy just from the connection patterns between the results
[Figure: example result graphs (Good? Poor?) with predictions “Good” and “Poor”]
[w/ Dumais-Horvitz, WWW ’07]
47
P2) Planetary look on a small-world
Small-world experiment [Milgram ’67]: People send letters from Nebraska to Boston
How many steps does it take? How to learn short paths?
Messenger social network: largest network analyzed
▪ 240M people, 30B conversations, 4.5TB data
Lots of engineering effort and good hardware:
▪ 4 dual-core Opteron server, 48GB memory, 6.8TB of fast SCSI disks
[w/ Horvitz, WWW ’08, Nature ’08]
[Figure: Milgram’s small world experiment and the MSN Messenger network: number of steps between pairs of people (i.e., hops + 1)]
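Counting the number of hops between pairs of people amounts to running breadth-first search from each node. The sketch below does this exhaustively on a tiny hypothetical graph; at the scale of the Messenger network one would presumably sample source nodes instead, which is an assumption and not something stated on the slide.

```python
from collections import Counter, deque


def hop_distribution(adj):
    """Count ordered pairs (source, target) by shortest-path length in hops."""
    counts = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from this source
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        for target, d in dist.items():
            if target != source:
                counts[d] += 1
    return counts


# Hypothetical undirected path graph 0 - 1 - 2 - 3 as an adjacency list.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
hops = hop_distribution(path)
print(dict(hops))
```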
48
My research: The structure
Diffusion & Cascades
Network Evolution
Patterns & Observations
A1: Cascade shapes
A2: Human adoption follows diminishing returns
A5: Densification, Shrinking diameters
Models & Algorithms
A3, A4: CELF algorithm for detecting cascades and outbreaks
A6: Kronecker graphs, Estimating Kronecker initiator
Big data matters
49
Future direction 1: Diffusion
How do news and information spread
▪ New ranking and influence measures for blogs
▪ Sentiment analysis from cascade structure
▪ Crawling 2 million blogs/day (with Spinn3r)
[Figure: obscure technology story; small tech blog, Wired, Slashdot, New Scientist, New York Times, CNN, BBC]
50
Future direction 1: Diffusion
Models of information diffusion
▪ When, where and what post will create a cascade?
▪ Where should one tap the network to get the effect they want?
▪ Social Media Marketing
How to handle richer classes of graphs
▪ Interaction of network structure and node/edge attributes
51
Future direction 2: Evolution
Why are networks the way they are?
▪ Continue work on fundamental properties of networks
▪ Build models and understanding
▪ Network community structure
Health of a social network
▪ How to better design and steer networks
▪ Adoption of social networking services (with Facebook)
Predictive modeling of large communities
▪ Online multi-player games are closed worlds with detailed traces of human activity
▪ Predict success of guilds, battles, elections (with EVE online)
52
Future direction 3: Scaling up
Map-reduce for massive graphs
Algorithms and techniques for massive graphs
With Yahoo! Research: 4000 node Hadoop cluster
▪ world’s 50th fastest supercomputer
▪ 3 TB memory
▪ 1.5 PB storage
We already have new ways to peek into the structure of massive networks
53
Networks & Computer Science
[Diagram: (complex) networks, connected to Computer systems, Theory and algorithms, Machine learning / Data mining]
54
Networks & Science
[Diagram: (complex) networks, connected to Computer systems, Theory and algorithms, Machine learning / Data mining, Social Sciences, Biology, Physics, Statistics, Applications & Industry]
55
Christos Faloutsos, Jon Kleinberg
Andreas Krause, Carlos Guestrin
Eric Horvitz, Susan Dumais
Andrew Tomkins, Ravi Kumar
Lada Adamic, Bernardo Huberman
Kevin Lang, Michael Mahoney
Zoubin Ghahramani, Anirban Dasgupta
Tom Mitchell, John Lafferty
Larry Wasserman, Avrim Blum
Samuel Madden, Michalis Faloutsos
Ajit Singh, John Shawe-Taylor
Lars Backstrom, Marko Grobelnik
Natasa Milic-Frayling, Dunja Mladenic
Jeff Hammerbacher, Matt Hurst
Deepay Chakrabarti, Natalie Glance
Jeanne VanBriesen
Microsoft Research, Yahoo Research
Facebook, EVE online
Spinn3r, LinkedIn
Nielsen BuzzMetrics, Microsoft Live Labs
Thanks!