Post on 27-Dec-2015
1
Tools for Large Graph Mining
Thesis Committee:
Christos Faloutsos
Chris Olston
Guy Blelloch
Jon Kleinberg (Cornell)
- Deepayan Chakrabarti
2
Introduction
Internet Map [lumeta.com]
Food Web [Martinez ’91]
Protein Interactions [genomebiology.com]
Friendship Network [Moody ’01]
► Graphs are ubiquitous
3
Introduction
What can we do with graphs? How quickly will a disease
spread on this graph?
“Needle exchange” networks of drug users
[Weeks et al. 2002]
4
Introduction
What can we do with graphs? How quickly will a disease
spread on this graph? Who are the “strange
bedfellows”? Who are the key people? …
► Graph analysis can have great impact
Hijacker network [Krebs ‘01]
“Key” terrorist
5
Graph Mining: Two Paths
Specific applications
• Node grouping
• Viral propagation
• Frequent pattern mining
• Fast message routing
General issues
• Realistic graph generation
• Graph patterns and “laws”
• Graph evolution over time?
7
Our Work
General issues
• Realistic graph generation
• Graph patterns and “laws”
• Graph evolution over time?
Specific applications
• Node grouping
• Viral propagation
• Frequent pattern mining
• Fast message routing
Node Grouping: Find “natural” partitions and outliers automatically.
Viral Propagation: Will a virus spread and become an epidemic?
Graph Generation: How can we mimic a given real-world graph?
8
Roadmap
Specific applications
1. Node grouping
2. Viral propagation
General issues
3. Realistic graph generation
• Graph patterns and “laws”
4. Conclusions
Find “natural” partitions and outliers automatically
Focus of this talk
9
Node Grouping [KDD 04]
[Figure: Customers × Products matrix, reordered into Customer Groups × Product Groups]
Simultaneously group customers and products, or, documents and words, or, users and preferences …
Customers
Products
10
Node Grouping [KDD 04]
[Figure: two matrices, each arranged into Customer Groups × Product Groups]
Row and column groups
• need not be along a diagonal, and
• need not be equal in number
Both are fine
11
Motivation
Visualization
Summarization
Detection of outlier nodes and edges
Compression, and others…
12
Node Grouping
Desiderata:
1. Simultaneously discover row and column groups
2. Fully Automatic: No “magic numbers”
3. Scalable to large matrices
4. Online: New data should not require full recomputations
13
Closely Related Work
Information Theoretic Co-clustering [Dhillon+/2003]: the number of row and column groups must be specified
Desiderata:
Simultaneously discover row and column groups
Fully Automatic: No “magic numbers”
Scalable to large graphs
Online
14
Other Related Work
K-means and variants [Pelleg+/2000, Hamerly+/2003]: Do not cluster rows and cols simultaneously
“Frequent itemsets” [Agrawal+/1994]: User must specify “support”
Information Retrieval [Deerwester+/1990, Hofmann/1999]: Choosing the number of “concepts”
Graph Partitioning [Karypis+/1998]: Number of partitions; measure of imbalance between clusters
15
What makes a cross-association “good”?
[Figure: two alternative arrangements of the same matrix into row groups and column groups, shown side by side (“versus”)]
Good Clustering
1. Similar nodes are grouped together
2. As few groups as necessary
A few, homogeneous
blocks
Good Compression
Why is this better?
implies
16
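Main Idea
Good Compression implies Good Clustering
[Figure: binary matrix arranged into row groups and column groups; block i has density p_i^1 = % of dots]
$$\text{Total Encoding Cost} = \underbrace{\sum_i \text{size}_i \cdot H(p_i^1)}_{\text{Code Cost}} + \underbrace{\sum_i \left(\text{cost of describing } n_i^1,\ n_i^0 \text{ and groups}\right)}_{\text{Description Cost}}$$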
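The code cost term (block size times the entropy of the block density, summed over all blocks) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `binary_entropy` and `code_cost` and the toy matrix are assumptions made for this example, and the description cost is left out.

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) source; H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def code_cost(matrix, row_groups, col_groups):
    """Sum over all blocks of size * H(density), in bits (hypothetical helper)."""
    matrix = np.asarray(matrix)
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)
    total = 0.0
    for r in np.unique(row_groups):
        for c in np.unique(col_groups):
            # Sub-matrix of rows in row group r and columns in column group c.
            block = matrix[np.ix_(row_groups == r, col_groups == c)]
            if block.size == 0:
                continue
            density = block.sum() / block.size
            total += block.size * binary_entropy(density)
    return total

# Toy 4x4 matrix with two perfectly homogeneous diagonal blocks.
M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
one_group = code_cost(M, [0, 0, 0, 0], [0, 0, 0, 0])   # one block of density 0.5
two_groups = code_cost(M, [0, 0, 1, 1], [0, 0, 1, 1])  # four pure blocks
```

With a single group the 16 cells have density 0.5 and cost 16 bits; with two row and two column groups every block is pure and the code cost drops to 0 bits.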
17
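Examples
One row group, one column group: code cost high, description cost low
m row groups, n column groups: code cost low, description cost high
$$\text{Total Encoding Cost} = \underbrace{\sum_i \text{size}_i \cdot H(p_i^1)}_{\text{Code Cost}} + \underbrace{\sum_i \left(\text{cost of describing } n_i^1,\ n_i^0 \text{ and groups}\right)}_{\text{Description Cost}}$$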
18
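What makes a cross-association “good”?
[Figure: two alternative arrangements of the same matrix into row groups and column groups, shown side by side (“versus”)]
Why is this better? Both the code cost and the description cost are low.
$$\text{Total Encoding Cost} = \underbrace{\sum_i \text{size}_i \cdot H(p_i^1)}_{\text{Code Cost}} + \underbrace{\sum_i \left(\text{cost of describing } n_i^1,\ n_i^0 \text{ and groups}\right)}_{\text{Description Cost}}$$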
20
Formal problem statement
Given a binary matrix,
Re-organize the rows and columns into groups, and
Choose the number of row and column groups, to
Minimize the total encoding cost.
Note: No Parameters
21
Algorithms
Search over the number of groups: (k=1, l=2) → (k=2, l=2) → (k=2, l=3) → (k=3, l=3) → (k=3, l=4) → (k=4, l=4) → (k=4, l=5) → k = 5 row groups, l = 5 col groups
22
Algorithms
l = 5
k = 5
Start with initial matrix
Find good groups for fixed k and l
Choose better values for k and l
Final cross-association
Lower the encoding cost
23
Fixed k and l
l = 5
k = 5
Start with initial matrix
Find good groups for fixed k and l
Choose better values for k and l
Final cross-association
Lower the encoding cost
24
Fixed k and l
Re-assign: for each row x, re-assign it to the row group which minimizes the code cost
1. Row re-assigns
2. Column re-assigns
3. and repeat …
[Figure: matrix arranged into row groups and column groups, before and after re-assignment]
Choosing k and l
l = 5
k = 5
Start with initial matrix
Choose better values for k and l
Final cross-association
Lower the encoding cost
Find good groups for fixed k and l
26
Choosing k and l
Split:
1. Find the most “inhomogeneous” group.
2. Remove the rows/columns which make it inhomogeneous.
3. Create a new group for these rows/columns.
[Figure: matrix arranged into row groups and column groups, before and after a split]
27
Algorithms
l = 5
k = 5
Start with initial matrix
Find good groups for fixed k and l
Choose better values for k and l
Final cross-association
Lower the encoding cost
Re-assigns
Splits
28
Experiments
“Customer-Product” graph with Zipfian sizes, no noise
k = 5 row groups, l = 5 col groups
29
Experiments
“Quasi block-diagonal” graph with Zipfian sizes, noise=10%
k = 6 row groups, l = 8 col groups
30
Experiments
“White Noise” graph: we find the existing spurious patterns
k = 2 row groups, l = 3 col groups
31
Experiments
“CLASSIC”
• 3,893 documents
• 4,303 words
• 176,347 “dots”
Combination of 3 sources:
• MEDLINE (medical)
• CISI (info. retrieval)
• CRANFIELD (aerodynamics)
[Figure: Documents × Words matrix]
32
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
[Figure: Documents × Words matrix, reordered into groups]
33
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
MEDLINE(medical)
insipidus, alveolar, aortic, death, …
blood, disease, clinical, cell, …
34
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
MEDLINE(medical)
CISI(Information Retrieval)
providing, studying, records, development, …
abstract, notation, works, construct, …
35
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
MEDLINE(medical)
CRANFIELD (aerodynamics)
shape, nasa, leading, assumed, …
CISI(Information Retrieval)
36
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
MEDLINE(medical)
CRANFIELD (aerodynamics)
paint, examination, fall, raise, leave, based, …
CISI(Information Retrieval)
37
Experiments
“GRANTS”
• 13,297 documents
• 5,298 words
• 805,063 “dots”
[Figure: NSF Grant Proposals × Words in abstract matrix]
38
Experiments
“GRANTS” graph of documents & words: k=41, l=28
[Figure: NSF Grant Proposals × Words in abstract matrix, reordered into groups]
39
Experiments
“GRANTS” graph of documents & words: k=41, l=28
The Cross-Associations refer to topics:
• Genetics
encoding, characters, bind, nucleus
40
Experiments
“GRANTS” graph of documents & words: k=41, l=28
The Cross-Associations refer to topics:
• Genetics
• Physics
coupling, deposition, plasma, beam
41
Experiments
“GRANTS” graph of documents & words: k=41, l=28
The Cross-Associations refer to topics:
• Genetics
• Physics
• Mathematics
• …
manifolds, operators, harmonic
42
Experiments
[Figure: Time (secs) vs Number of “dots”, plotted for Splits and Re-assigns]
Linear in the number of “dots”: Scalable
43
Summary of Node Grouping
Desiderata:
Simultaneously discover row and column groups
Fully Automatic: No “magic numbers”
Scalable to large matrices
Online: New data does not need full recomputation
44
Extensions
We can use the same MDL-based framework for other problems:
1. Self-graphs
2. Detection of outlier edges
45
Extension #1 [PKDD 04]
Self-graphs, such as:
• Co-authorship graphs
• Social networks
• The Internet, and the World-wide Web
[Figure: bipartite graph (Customers, Products) versus self-graph (Authors)]
46
Extension #1 [PKDD 04]
Self-graphs: rows and columns represent the same nodes, so row re-assigns affect column re-assigns…
[Figure: bipartite graph (Customers, Products) versus self-graph (Authors)]
47
Experiments
DBLP dataset
• 6,090 authors in:
  • SIGMOD
  • ICDE
  • VLDB
  • PODS
  • ICDT
• 175,494 co-citation or co-authorship links
[Figure: Authors × Authors matrix]
48
Experiments
[Figure: Authors × Authors matrix, reordered into Author groups × Author groups]
k=8 author groups found
Stonebraker, DeWitt, Carey
49
Extension #2 [PKDD 04]
Outlier edges
• Which links should not exist? (illegal contact/access?)
• Which links are missing? (missing data?)
50
Extension #2 [PKDD 04]
[Figure: Nodes × Nodes matrix, reordered into Node Groups × Node Groups, with outlier edges marked]
Outliers: deviations from “normality”, which lower the quality of compression
Find edges whose removal maximally reduces cost
51
Roadmap
Specific applications
1. Node grouping
2. Viral propagation
General issues
3. Realistic graph generation
• Graph patterns and “laws”
4. Conclusions
Will a virus spread and become an epidemic?
52
The SIS (or “flu”) model
(Virus) birth rate β: probability that an infected neighbor attacks
(Virus) death rate δ: probability that an infected node heals
Cured = Susceptible
[Figure: undirected network with node N and neighbors N1, N2 and N3; arrows labeled Prob. β (twice) and Prob. δ; legend: Infected, Healthy]
53
The SIS (or “flu”) model
Competition between virus birth and death
Epidemic or extinction?
• depends on the ratio β/δ
• but also on the network topology
[Figure: example of the effect of network topology (epidemic or extinction)]
54
Epidemic threshold
The epidemic threshold τ is the value such that, if β/δ < τ, there is no epidemic,
where β = birth rate, and δ = death rate
55
Previous models
Question: What is the epidemic threshold?
Answer #1: 1/⟨k⟩ [Kephart and White ’91, ’93]
Homogeneity assumption: All nodes have the same degree
BUT most graphs have power laws
Answer #2: ⟨k⟩/⟨k²⟩ [Pastor-Satorras and Vespignani ’01]
Mean-field assumption: All nodes of the same degree are equally affected
BUT susceptibility should depend on position in network too
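Both earlier thresholds depend only on the degree sequence, which a short sketch makes concrete. The function name `previous_thresholds` and the 100-node star used as input are assumptions chosen for illustration.

```python
import numpy as np

def previous_thresholds(A):
    """Return (1/<k>, <k>/<k^2>) computed from the degrees of adjacency matrix A."""
    degrees = np.asarray(A).sum(axis=1)
    mean_k = degrees.mean()
    mean_k2 = (degrees ** 2).mean()
    return 1.0 / mean_k, mean_k / mean_k2

# 100-node star: node 0 is the hub, nodes 1..99 are leaves.
A = np.zeros((100, 100))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
kw, psv = previous_thresholds(A)   # <k> = 1.98 and <k^2> = 99 for this star
```

On this star the two answers differ by a factor of about 25 (roughly 0.505 versus 0.02), even though they are computed from the same degrees.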
56
The full solution is intractable!
The full Markov Chain has 2^N states, which is intractable, so a simplification is needed.
Independence assumption: Probability that two neighbors are infected = Product of individual probabilities of infection
This is a point estimate of the full Markov Chain.
57
Our model
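A non-linear dynamical system (NLDS) which makes no assumptions about the topology
$$1 - p_{i,t} = \left[\,1 - p_{i,t-1} + \delta\, p_{i,t-1}\,\right] \cdot \prod_{j=1}^{N} \left(1 - \beta\, A_{ji}\, p_{j,t-1}\right)$$
where
• p_{i,t} = probability of node i being infected at time t
• A = adjacency matrix
• 1 − p_{i,t}: healthy at time t
• 1 − p_{i,t−1}: healthy at time t−1
• δ p_{i,t−1}: infected but cured
• the product over j: no infection received from another node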
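The update rule can be iterated directly. The sketch below is an illustration under stated assumptions: the function names `nlds_step` and `run`, the 100-node star, the starting probabilities of 0.5, and the two (β, δ) settings are all chosen for this example and do not come from the slides.

```python
import numpy as np

def nlds_step(p, A, beta, delta):
    """One step of 1 - p_i(t) = [1 - p_i(t-1) + delta*p_i(t-1)] * prod_j (1 - beta*A[j,i]*p_j(t-1))."""
    # Element [j, i] of A * p[:, None] is A[j, i] * p[j]; the product runs over j.
    no_infection = np.prod(1.0 - beta * A * p[:, None], axis=0)
    healthy = (1.0 - p + delta * p) * no_infection
    return 1.0 - healthy

def run(A, beta, delta, steps=200):
    """Iterate the system from p = 0.5 everywhere; return the expected number of infected nodes."""
    p = np.full(A.shape[0], 0.5)
    for _ in range(steps):
        p = nlds_step(p, A, beta, delta)
    return p.sum()

# 100-node star: node 0 is the hub, nodes 1..99 are leaves.
A = np.zeros((100, 100))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
low_ratio_total = run(A, beta=0.01, delta=0.5)   # beta/delta = 0.02
high_ratio_total = run(A, beta=0.1, delta=0.2)   # beta/delta = 0.5
```

With the small ratio the expected number of infected nodes shrinks towards zero, while with the large ratio it settles at a sizeable positive value.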
58
Epidemic threshold
[Theorem 1] We have no epidemic if:
$$\frac{\beta}{\delta} < \tau = \frac{1}{\lambda_{1,A}}$$
where β = (virus) birth rate, δ = (virus) death rate, τ = epidemic threshold, and λ_{1,A} = largest eigenvalue of the adjacency matrix A
► λ_{1,A} alone decides viral epidemics!
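Computing the threshold for a given undirected graph takes one eigenvalue computation. A minimal sketch, assuming a dense symmetric adjacency matrix; the helper names `epidemic_threshold` and `star_graph` and the sample rates are made up for the example.

```python
import numpy as np

def epidemic_threshold(A):
    """tau = 1 / lambda_1 for a symmetric (undirected) adjacency matrix A."""
    lam1 = np.linalg.eigvalsh(A)[-1]   # eigvalsh returns eigenvalues in ascending order
    return 1.0 / lam1

def star_graph(n_leaves):
    """Adjacency matrix of a star: node 0 is the hub."""
    A = np.zeros((n_leaves + 1, n_leaves + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    return A

A = star_graph(99)                     # largest eigenvalue of a star is sqrt(#leaves)
tau = epidemic_threshold(A)
beta, delta = 0.01, 0.5
below_threshold = beta / delta < tau   # Theorem 1: no epidemic when this holds
```

For large sparse graphs a sparse eigensolver would be the natural substitute for the dense call used here.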
59
Recall the definition of eigenvalues
$$A\,x = \lambda_A\, x$$
λ_A = eigenvalue of A
λ_{1,A} = largest eigenvalue ≈ size of the largest “blob”
60
Experiments (100-node Star)
β/δ = τ (close to the threshold)
β/δ < τ (below threshold)
β/δ > τ (above threshold)
……
……
61
Experiments (Oregon)
β/δ > τ (above threshold)
β/δ = τ (at the threshold)
β/δ < τ (below threshold)
10,900 nodes and 31,180 edges
62
Extensions
This dynamical-systems framework can be exploited further
1. The rate of decay of the infection
2. Information survival thresholds in sensor/P2P networks
63
Extension #1
Below the threshold: How quickly does an infection die out?
[Theorem 2] Exponentially quickly
64
Experiment (10K Star Graph)
“Score” s = β/δ * λ_{1,A} = “fraction” of threshold
[Figure: number of infected nodes (log-scale) vs time-steps (linear-scale)]
Linear on log-lin scale ⇒ exponential decay
65
Experiment (Oregon Graph)
“Score” s = β/δ * λ_{1,A} = “fraction” of threshold
[Figure: number of infected nodes (log-scale) vs time-steps (linear-scale)]
Linear on log-lin scale ⇒ exponential decay
68
Extension #2
• Sensors gain new information
• but they may die due to harsh environment or battery failure
• so they occasionally try to transmit data to nearby sensors
• and failed sensors are occasionally replaced.
• Under what conditions does the information survive?
Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden]
69
Extension #2
[Theorem 1] The information dies out exponentially quickly if
Retransmission rate
Resurrection rate
Failure rate of sensors
Largest eigenvalue of the “link quality” matrix
70
Roadmap
Specific applications
1. Node grouping
2. Viral propagation
General issues
3. Realistic graph generation
• Graph patterns and “laws”
4. Conclusions
How can we generate a “realistic” graph, that mimics a given real-world graph?
71
Experiments (Clickstream bipartite graph)
[Figure: Users and Websites; Count vs In-degree for Clickstream (+) and R-MAT (x), ranging from “Some personal webpage” to “Yahoo, Google and others”]
72
Experiments (Clickstream bipartite graph)
[Figure: Users and Websites; Count vs Out-degree for Clickstream (+) and R-MAT (x), with “Email-checking surfers” and “All-night” surfers marked]
73
Experiments (Clickstream bipartite graph)
[Figure panels: Count vs Out-degree; Count vs In-degree; Hop-plot; Left “Network value”; Right “Network value”; Singular value vs Rank]
►R-MAT can match real-world graphs
74
Roadmap
Specific applications
1. Node grouping
2. Viral propagation
General issues
3. Realistic graph generation
• Graph patterns and “laws”
4. Conclusions
75
Conclusions
Two paths in graph mining:
Specific applications:
• Viral Propagation: non-linear dynamical system, epidemic depends on largest eigenvalue
• Node Grouping: MDL-based approach for automatic grouping
General issues:
• Graph Patterns: Marks of “realism” in a graph
• Graph Generators: R-MAT, a scalable generator matching many of the patterns
76
Software
http://www-2.cs.cmu.edu/~deepay/#Sw
CrossAssociations: To find natural node groups.
• Used by “anonymous” large accounting firm.
• Used by Intel Research, Cambridge, UK.
• Used at UC, Riverside (net intrusion detection).
• Used at the University of Porto, Portugal
NetMine: To extract graph patterns quickly + build realistic graphs.
• Used by Northrop Grumman corp.
F4: A non-linear time series forecasting package.
77
===CROSS-ASSOCIATIONS===
• Why simultaneous grouping?
• Differences from co-clustering and others?
• Other parameter-fitting criteria?
• Cost surface
• Exact cost function
• Exact complexity, wall-clock times
• Soft clustering
• Different weights for code and description costs?
• Precision-recall for CLASSIC
• Inter-group “affinities”
• Collaborative filtering and recommendation systems?
• CA versus bipartite cores
• Extras
• General comments on CA communities
78
===Viral Propagation===
• Comparison with previous methods
• Accuracy of dynamical system
• Relationship with full Markov chain
• Experiments on information survival threshold
• Comparison with Infinite Particle Systems
• Intuition behind the largest eigenvalue
• Correlated failures
79
===R-MAT===
• Graph patterns
• Generator desiderata
• Description of R-MAT
• Experiments on a directed graph
• R-MAT communities via Cross-Associations?
• R-MAT versus tree-based generators
80
===Graphs in general===
• Relational learning
• Graph Kernels
81
Simultaneous grouping is useful
Sparse blocks, with little in common between rows
Grouping rows first would collapse these two into one!
Index
82
Index
Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering:
1. Lossy Compression.
2. Approximates the original matrix, while trying to minimize KL-divergence.
3. The number of row and column groups must be given by the user.
Cross-Associations:
1. Lossless Compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of row and column groups is chosen automatically using the MDL principle.
83
Index
Other parameter-fitting methods
The Gap statistic [Tibshirani+ ’01]: Minimize the “gap” of log-likelihood of intra-cluster distances from the expected log-likelihood.
But:
• Needs a distance function between graph nodes
• Needs a “reference” distribution
• Needs multiple MCMC runs to remove “variance due to sampling”, which takes more time.
84
Other parameter-fitting methods
Stability-based method [Ben-Hur+ ’02, ’03]
• Run clustering multiple times on samples of data, for several values of “k”
• For low k, clustering is stable; for high k, unstable
• Choose this transition point.
But:
• Needs many runs of the clustering algorithm
• Arguments possible over definition of transition point
Index
85
Precision-Recall for CLASSIC
Index
86
Cost surface (total cost)
[Figure: surface plot and contour plot of the total cost over k and l]
With increasing k and l: Total cost decays very rapidly initially, but then starts increasing slowly
Index
87
Cost surface (code cost only)
[Figure: surface plot and contour plot of the code cost over k and l]
With increasing k and l: Code cost decays very rapidly
Index
88
Encoding Cost Function
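$$\text{Total encoding cost} = \underbrace{\log^*(k) + \log^*(l)}_{\text{cluster number}} + \underbrace{N \log N + M \log M}_{\text{row/col order}} + \underbrace{\sum_i \log a_i + \sum_j \log b_j}_{\text{cluster sizes}} + \underbrace{\sum_i \sum_j \log(a_i b_j + 1)}_{\text{block densities}} + \underbrace{\sum_i \sum_j a_i b_j \, H(p_{i,j})}_{\text{code cost}}$$
The first four terms make up the description cost; the last term is the code cost.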
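The description-cost terms can be evaluated with a few lines of code. This sketch assumes base-2 logarithms and takes log* to be Rissanen's universal code length for integers (the sum of the positive iterated logarithms plus a constant); the slide does not define log*, so that definition, like the function names `log_star` and `description_cost`, is an assumption.

```python
import math

def log_star(x):
    """Universal code length (bits) for a positive integer x (assumed definition of log*)."""
    total = math.log2(2.865064)        # normalizing constant of the universal code
    term = math.log2(x)
    while term > 0:
        total += term
        term = math.log2(term)
    return total

def description_cost(a, b):
    """Description cost for row-group sizes a and column-group sizes b."""
    k, l = len(a), len(b)
    N, M = sum(a), sum(b)
    cost = log_star(k) + log_star(l)                       # cluster number
    cost += N * math.log2(N) + M * math.log2(M)            # row/col order
    cost += sum(math.log2(ai) for ai in a)                 # cluster sizes (rows)
    cost += sum(math.log2(bj) for bj in b)                 # cluster sizes (columns)
    cost += sum(math.log2(ai * bj + 1) for ai in a for bj in b)   # block densities
    return cost

# 4x4 matrix split into two row groups and two column groups of size 2 each.
cost = description_cost([2, 2], [2, 2])
```

Adding the code cost (block size times entropy, summed over blocks) to this value gives the total that the search over k and l tries to minimize.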
Index
89
Complexity of CA
O(E · (k² + l²)), ignoring the number of re-assign iterations, which is typically low.
Index
90
Complexity of CA
[Figure: Time / Σ(k+l) vs Number of edges]
Index
91
Inter-group distances
[Figure: Nodes × Nodes matrix, reordered into Node Groups × Node Groups, with groups Grp1, Grp2 and Grp3]
Two groups are “close” if merging them does not increase cost by much
distance(i,j) = relative increase in cost on merging i and j
Index
92
Inter-group distances
[Figure: Node Groups × Node Groups matrix with groups Grp1, Grp2 and Grp3, and a graph over Grp1, Grp2 and Grp3 with inter-group distances 5.5, 4.5 and 5.1]
Two groups are “close” if merging them does not increase cost by much
distance(i,j) = relative increase in cost on merging i and j
Index
93
Experiments
[Figure: Author groups × Author groups matrix, with groups Grp1 to Grp8]
Inter-group distances can aid in visualization
Stonebraker, DeWitt, Carey
Index
94
Collaborative filtering and recommendation systems
Q: If someone likes a product X, will (s)he like product Y?
A: Check if others who liked X also liked Y.
Focus on distances between people, typically cosine similarity, and not on clustering
Index
95
CA and bipartite cores: related but different
A 3x2 bipartite core
Hubs Authorities
Kumar et al. [1999] say that bipartite cores correspond to communities.
Index
96
CA and bipartite cores: related but different
CA finds two communities there: one for hubs, and one for authorities.
We gracefully handle cases where a few links are missing.
CA considers connections between all sets of clusters, and not just two sets.
Not each node need belong to a non-trivial bipartite core.
CA is (informally) a generalization
Index
97
Comparison with soft clustering
Soft clustering: each node belongs to each cluster with some probability
Hard clustering: one cluster per node
Index
98
Comparison with soft clustering
1. Far more degrees of freedom
   1. Parameter fitting is harder
   2. Algorithms can be costlier
2. Hard clustering is better for exploratory data analysis
3. Some real-world problems require hard clustering, e.g., fraud detection for accountants
Index
99
Weights for code cost vs description cost
Total = 1 · (code cost) + 1 · (description cost)
Physical meaning: Total number of bits
Total = α · (code cost) + β · (description cost)
Physical meaning: Number of encoding bits under some prior
Index
100
Re-assign: for each row x
Formula for re-assigns
[Figure: matrix arranged into row groups and column groups]
Index
101
Choosing k and l
l = 5
k = 5
Split:
1. Find the row group R with the maximum entropy per row
2. Choose the rows in R whose removal reduces the entropy per row in R
3. Send these rows to the new row group, and set k=k+1
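The three split steps translate fairly directly into code. The sketch below is one possible reading, not the authors' implementation: "entropy per row" is taken to be the code cost of the row group divided by its number of rows, each row is tested against the group's entropy before any row is removed, and the function names are hypothetical.

```python
import numpy as np

def entropy_bits(ones, cells):
    """Elementwise cells * H(ones / cells) in bits, treating 0 * log 0 as 0."""
    ones = np.asarray(ones, dtype=float)
    cells = np.asarray(cells, dtype=float)
    zeros = cells - ones
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(ones > 0, ones * np.log2(cells / ones), 0.0)
        t0 = np.where(zeros > 0, zeros * np.log2(cells / zeros), 0.0)
    return t1 + t0

def entropy_per_row(ones_rows, widths):
    """Code cost of one row group divided by its number of rows."""
    n = ones_rows.shape[0]
    if n == 0:
        return 0.0
    return entropy_bits(ones_rows.sum(axis=0), n * widths).sum() / n

def split_row_group(matrix, row_groups, col_groups, k, l):
    matrix = np.asarray(matrix, dtype=float)
    row_groups = np.asarray(row_groups).copy()
    col_groups = np.asarray(col_groups)
    n_cols = matrix.shape[1]
    indicator = np.zeros((n_cols, l))
    indicator[np.arange(n_cols), col_groups] = 1.0
    ones = matrix @ indicator            # 1s of each row inside each column group
    widths = indicator.sum(axis=0)
    # Step 1: the row group R with the maximum entropy per row.
    per_row = [entropy_per_row(ones[row_groups == i], widths) for i in range(k)]
    worst = int(np.argmax(per_row))
    members = np.flatnonzero(row_groups == worst)
    # Step 2: rows whose removal reduces the entropy per row of R.
    movers = [x for x in members
              if len(members) > 1
              and entropy_per_row(ones[members[members != x]], widths) < per_row[worst]]
    # Step 3: send these rows to a new row group and set k = k + 1.
    if movers:
        row_groups[movers] = k
        k += 1
    return row_groups, k

M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])
new_groups, new_k = split_row_group(M, [0, 0, 0, 0], [0, 0, 1, 1], k=1, l=2)
```

In this toy input only the last row differs from the others, so it alone moves to the new group. Column splits can reuse the same function on the transposed matrix.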
Index
102
Experiments
Epinions dataset
• 75,888 users
• 508,960 “dots”, one “dot” per “trust” relationship
[Figure: User groups × User groups matrix, with a small dense “core”]
k=19 groups found
Index
103
Comparison with previous methods
• Our threshold subsumes the homogeneous model (Proof)
• We are more accurate than the Mean-Field Assumption model.
Index
104
Comparison with previous methods 10K Star Graph
Index
105
Comparison with previous methods Oregon Graph
Index
106
Accuracy of dynamical system 10K Star Graph
Index
107
Accuracy of dynamical system Oregon Graph
Index
108
Accuracy of dynamical system 10K Star Graph
Index
109
Accuracy of dynamical system Oregon Graph
Index
110
Relationship with full Markov Chain
The full Markov Chain is of the form:
Prob(infection at time t) = X_{t-1} + Y_{t-1} − Z_{t-1}, where Z_{t-1} is the non-linear component
Independence assumption leads to a point estimate for Z_{t-1} ⇒ non-linear dynamical system.
Still non-linear, but now tractable
Index
111
Experiments: Information survival
• INTEL sensor map (54 nodes)
• MIT sensor map (40 nodes)
• and others…
Index
112
Experiments: Information survival
INTEL sensor map
Index
113
Survival threshold on INTEL
Index
114
Survival threshold on INTEL
Index
115
Experiments: Information survival
MIT sensor map
Index
116
Survival threshold on MIT
Index
117
Survival threshold on MIT
Index
118
Infinite Particle Systems
“Contact Process” ≈ SIS model
Differences:
• Infinite graphs only ⇒ the questions asked are different
• Very specific topologies ⇒ lattices, trees
• Exact thresholds have not been found for these; proving existence of thresholds is important
• Our results match those on the finite line graph [Durrett+ ’88]
Index
119
Intuition behind the largest eigenvalue
Approximately size of the largest “blob”
Consider the special case of a “caveman” graph
Largest eigenvalue = 4
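A quick numerical check of the “blob” intuition: a graph made of disjoint cliques of size m has largest eigenvalue m − 1. The sketch assumes caves of 5 nodes each, chosen because that reproduces the value 4; the slide does not state the cave size, and the helper name `disjoint_cliques` is hypothetical.

```python
import numpy as np

def disjoint_cliques(num_cliques, clique_size):
    """Adjacency matrix of num_cliques separate cliques (an idealized caveman graph)."""
    n = num_cliques * clique_size
    A = np.zeros((n, n))
    for c in range(num_cliques):
        lo, hi = c * clique_size, (c + 1) * clique_size
        A[lo:hi, lo:hi] = 1.0          # fully connect the nodes of one cave
    np.fill_diagonal(A, 0.0)           # no self-loops
    return A

A = disjoint_cliques(num_cliques=4, clique_size=5)   # cave size 5 is an assumption
lam1 = np.linalg.eigvalsh(A)[-1]                     # largest eigenvalue
```

The result does not depend on the number of caves, only on the size of the largest one.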
Index
120
Intuition behind the largest eigenvalue
Approximately size of the largest “blob”
Largest eigenvalue = 4.016
Index
121
Graph Patterns
Power Laws
Count vs Outdegree
Count vs Indegree
The “epinions” graph with 75,888 nodes and 508,960 edges
Index
123
Graph Patterns
Power Laws and deviations (DGX/Lognormals [Bi+ ’01])
[Figure: Count vs Indegree (axes: Degree, Count)]
Index
124
Graph Patterns
• Power Laws and deviations
• Small-world
• “Community” effect
• …
[Figure: # reachable pairs vs hops, with the Effective Diameter marked]
Index
125
Graph Generator Desiderata
• Power Laws and deviations
• Small-world
• “Community” effect
• …
Most current graph generators fail to match some of these.
Other desiderata:
• Few parameters
• Fast parameter-fitting
• Scalable graph generation
• Simple extension to undirected, bipartite and weighted graphs
Index
126
The R-MAT generator
[SIAM DM’04]
Subdivide the adjacency matrix (2^n × 2^n, rows = From, columns = To) and choose one quadrant with probability (a,b,c,d)
[Figure: quadrants a (0.5), b (0.1), c (0.15), d (0.25)]
Intuition: The “80-20 law”
Index
127
The R-MAT generator
[SIAM DM’04]
Subdivide the adjacency matrix (2^n × 2^n) and choose one quadrant with probability (a,b,c,d)
Recurse till we reach a 1*1 cell, where we place an edge, and repeat for all edges.
[Figure: recursively subdivided adjacency matrix with quadrants a, b, c, d]
Intuition: The “80-20 law”
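The recursive quadrant choice can be sketched in a few lines. This assumes the usual layout with a as the top-left quadrant, b top-right, c bottom-left and d bottom-right, and it discards duplicate edges by collecting them in a set; both choices, and the function names `rmat_edge` and `rmat_graph`, are assumptions for this example. The default probabilities are the sample values shown on the earlier R-MAT slide.

```python
import random

def rmat_edge(n, a, b, c, rng):
    """Pick one (from, to) cell of a 2^n x 2^n adjacency matrix by recursive quadrant choice."""
    row = col = 0
    for _ in range(n):
        r = rng.random()
        row, col = 2 * row, 2 * col    # descend one level of the recursion
        if r < a:
            pass                       # quadrant a: top-left
        elif r < a + b:
            col += 1                   # quadrant b: top-right
        elif r < a + b + c:
            row += 1                   # quadrant c: bottom-left
        else:
            row += 1                   # quadrant d: bottom-right, probability 1-a-b-c
            col += 1
    return row, col

def rmat_graph(n, num_edges, a=0.5, b=0.1, c=0.15, seed=0):
    """Generate num_edges distinct directed edges over 2^n nodes."""
    if not 0 < a + b + c < 1:
        raise ValueError("a, b, c must leave positive probability for d")
    if num_edges > 4 ** n:
        raise ValueError("more edges requested than cells in the matrix")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < num_edges:      # duplicates are dropped and redrawn (assumption)
        edges.add(rmat_edge(n, a, b, c, rng))
    return edges

edges = rmat_graph(n=6, num_edges=200)   # 64 nodes, 200 directed edges
```

Because a is the largest probability, edges concentrate in the top-left corner at every level of the recursion, which is the "80-20" skew the slide refers to.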
Index
128
The R-MAT generator
[SIAM DM’04]
Only 3 parameters a, b and c (d = 1-a-b-c).
We have a fast parameter fitting algorithm.
[Figure: recursively subdivided 2^n × 2^n adjacency matrix with quadrants a, b, c, d]
Intuition: The “80-20 law”
Index
129
Experiments (Epinions directed graph)
[Figure panels: Count vs Indegree; Count vs Outdegree; Hop-plot (with Effective Diameter); Eigenvalue vs Rank; “Network value”; Count vs Stress]
►R-MAT matches directed graphs
Index
130
R-MAT communities and Cross-Associations
R-MAT builds communities in graphs, and Cross-Associations finds them. Relationship?
• R-MAT builds a hierarchy of communities, while CA finds a flat set of communities
• Linkage in the sizes of communities found by CA: when the R-MAT parameters are very skewed, the community sizes for CA are skewed, and vice versa
Index
131
R-MAT and tree-based generators
Recursive splitting in R-MAT ≈ following a tree from root to leaf.
Relationship with other tree-based generators [Kleinberg ’01, Watts+ ’02]?
• The R-MAT tree has edges as leaves, the others have nodes
• Tree-distance between nodes is used to connect nodes in other generators, but what does tree-distance between edges mean?
Index
132
Comparison with relational learning
Relational Learning (typical):
1. Aims to find small structure/patterns at the local level
2. Labeled nodes and edges
3. Semantics of labels are important
4. Algorithms are typically costlier
Graph Mining (typical):
1. Emphasis on global aspects of large graphs
2. Unlabeled graphs
3. More focused on topological structure and properties
4. Scalability is more important
Index
133
===OTHER WORK===
134
Other Work
Time Series Prediction [CIKM 2002]
• We use the fractal dimension of the data
• This is related to chaos theory and Lyapunov exponents…
135
Other Work
Time Series Prediction[CIKM 2002]
Logistic Parabola
136
Other Work
Time Series Prediction[CIKM 2002]
Lorenz attractor
137
Other Work
Time Series Prediction[CIKM 2002]
Laser fluctuations
138
Other Work
Adaptive histograms with error guarantees [+ Ashraf Aboulnaga, Yufei Tao, Christos Faloutsos]
• Maintain count probabilities for buckets
• to give statistically correct query result-size estimation
• and query feedback
• + …
[Figure: histogram of Count vs Salary under insertions and deletions, with a probability (Prob.) distribution over a bucket's Count]
139
Other Work
• User-personalization: Patent number 6,611,834 (IBM)
• Relevance feedback in multimedia image search: Filed for patent (IBM)
• Building 3D models using robot camera and rangefinder data [ICML 2001]
140
===EXTRAS===
141
Conclusions
Two paths in graph mining:
Specific applications:
• Viral Propagation: resilience testing, information dissemination, rumor spreading
• Node Grouping: automatically grouping nodes, AND finding the correct number of groups
References:
1. Fully automatic Cross-Associations, by Chakrabarti, Papadimitriou, Modha and Faloutsos, in KDD 2004
2. AutoPart: Parameter-free graph partitioning and Outlier detection, by Chakrabarti, in PKDD 2004
3. Epidemic spreading in real networks: An eigenvalue viewpoint, by Wang, Chakrabarti, Wang and Faloutsos, in SRDS 2003
142
Conclusions
Two paths in graph mining:
Specific applications
General issues:
• Graph Patterns: Marks of “realism” in a graph
• Graph Generators: R-MAT, a fast, scalable generator matching many of the patterns
References:
1. R-MAT: A recursive model for graph mining, by Chakrabarti, Zhan and Faloutsos, in SIAM Data Mining 2004
2. NetMine: New mining tools for large graphs, by Chakrabarti, Zhan, Blandford, Faloutsos and Blelloch, in the SIAM 2004 Workshop on Link analysis, counter-terrorism and privacy
143
Other References
F4: Large Scale Automated Forecasting using Fractals, by D. Chakrabarti and C. Faloutsos, in CIKM 2002
Using EM to Learn 3D Models of Indoor Environments with Mobile Robots, by Y. Liu, R. Emery, D. Chakrabarti, W. Burgard and S. Thrun, in ICML 2001
Graph Mining: Laws, Generators and Algorithms, by D. Chakrabarti and C. Faloutsos, under submission to ACM Computing Surveys
144
References --- graphs
1. R-MAT: A recursive model for graph mining, by D. Chakrabarti, Y. Zhan, C. Faloutsos in SIAM Data Mining 2004.
2. Epidemic spreading in real networks: An eigenvalue viewpoint, by Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, in SRDS 2003
3. Fully automatic Cross-Associations, by D. Chakrabarti, S. Papadimitriou, D. Modha and C. Faloutsos, in KDD 2004
4. AutoPart: Parameter-free graph partitioning and Outlier detection, by D. Chakrabarti, in PKDD 2004
5. NetMine: New mining tools for large graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SIAM 2004 Workshop on Link analysis, counter-terrorism and privacy
145
Roadmap
Specific applications
1. Node grouping
2. Viral propagation
General issues
3. Realistic graph generation
• Graph patterns and “laws”
4. Other Work
5. Conclusions
146
Experiments (Clickstream bipartite graph)
[Figure: Users and Websites; Count vs In-degree for Clickstream (+), ranging from “Some personal webpage” to “Yahoo, Google and others”]
147
Experiments (Clickstream bipartite graph)
[Figure: Users and Websites; Count vs Out-degree for Clickstream (+), with “Email-checking surfers” and “All-night” surfers marked]
148
Experiments (Clickstream bipartite graph)
[Figure: Users and Websites; # Reachable pairs vs Hops for Clickstream and R-MAT]
149
Graph Generation
Important for:
• Simulations of new algorithms
• Compression using a good graph generation model
• Insight into the graph formation process
Our R-MAT (Recursive MATrix) generator can match many common graph patterns.
150
Recall the definition of eigenvalues
$$\frac{\beta}{\delta} < \tau = \frac{1}{\lambda_{1,A}}$$
$$A\,x = \lambda_A\, x$$
λ_A = eigenvalue of A
λ_{1,A} = largest eigenvalue
151
Tools for Large Graph Mining
Deepayan Chakrabarti Carnegie Mellon University