CMU SCS
Talk 3: Graph Mining Tools – Tensors, communities, parallelism
Christos Faloutsos CMU
(C) 2011, C. Faloutsos 2
Overall Outline
• Introduction – Motivation
• Talk#1: Patterns in graphs; generators
• Talk#2: Tools (Ranking, proximity)
• Talk#3: Tools (Tensors, scalability)
• Conclusions
KAIST-2011
Outline
• Task 4: time-evolving graphs – tensors
• Task 5: community detection
• Task 6: virus propagation
• Task 7: scalability, parallelism and hadoop
• Conclusions
Thanks to
• Tamara Kolda (Sandia), for the foils on tensor definitions, and on TOPHITS
Detailed outline
• Motivation
• Definitions: PARAFAC and Tucker
• Case study: web mining
Examples of Matrices: Authors and terms
[Figure: author x term matrix. Rows: John, Peter, Mary, Nick, ...; columns: data, mining, classif., tree, ...]
Motivation: Why tensors?
• Q: what is a tensor?
Motivation: Why tensors?
• A: N-D generalization of matrix:
[Figure: author x term matrix (rows: John, Peter, Mary, Nick, ...; columns: data, mining, classif., tree, ...), labeled KDD’09]
[Figure: one author x term matrix per conference, stacked along a third dimension: KDD’07, KDD’08, KDD’09]
Tensors are useful for 3 or more modes
Terminology: ‘mode’ (or ‘aspect’):
[Figure: 3-mode tensor (terms: data, mining, classif., tree, ...) with its axes labeled Mode (== aspect) #1, Mode #2, Mode #3]
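As a rough illustration of the terminology, a 3-mode tensor can be held as a 3-dimensional array. The sketch below assumes NumPy; the author, term, and conference names come from the slides, but the counts in `records` are made up for illustration.

```python
import numpy as np

# Mode labels taken from the slides; the counts below are hypothetical.
authors = ["John", "Peter", "Mary", "Nick"]
terms = ["data", "mining", "classif.", "tree"]
confs = ["KDD'07", "KDD'08", "KDD'09"]

# (author, term, conference, count) records: illustrative values only.
records = [
    ("John", "data", "KDD'07", 3),
    ("John", "mining", "KDD'07", 2),
    ("Mary", "classif.", "KDD'08", 1),
    ("Nick", "tree", "KDD'09", 4),
]

# author x term x conference tensor (modes #1, #2, #3).
X = np.zeros((len(authors), len(terms), len(confs)))
for a, t, c, n in records:
    X[authors.index(a), terms.index(t), confs.index(c)] = n

# Fixing the third mode gives back an ordinary author x term matrix.
kdd07 = X[:, :, confs.index("KDD'07")]
print(X.shape, kdd07.shape)
```

A dense array is only reasonable for toy sizes; real data of this kind is typically sparse.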
Notice
• 3rd mode does not need to be time
• we can have more than 3 modes
[Figure: 3-mode tensor with modes IP source, IP destination, and Dest. port (e.g., ports 80, 125)]
Notice
• 3rd mode does not need to be time
• we can have more than 3 modes
– E.g., fMRI: x, y, z, time, person-id, task-id
http://denlab.temple.edu/bidms/cgi-bin/browse.cgi
From DENLAB, Temple U. (Prof. V. Megalooikonomou +)
Motivating Applications
• Why are tensors useful?
– web mining (TOPHITS)
– environmental sensors
– intrusion detection (src, dst, time, dest-port)
– social networks (src, dst, time, type-of-contact)
– face recognition
– etc.
Tensor basics
• Multi-mode extensions of SVD – recall that:
Reminder: SVD
– Best rank-k approximation in L2
$A_{[m \times n]} \approx U \, \Sigma \, V^T$
$A_{[m \times n]} \approx \sigma_1 \, u_1 \circ v_1 + \sigma_2 \, u_2 \circ v_2$
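The rank-k reading of the SVD can be checked numerically. The following is a minimal sketch, assuming NumPy and an arbitrary random matrix (not data from the talk): it rebuilds the matrix from the first k rank-1 terms and compares the L2 (spectral) error with the next singular value, which is what the best rank-k property predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))  # arbitrary m x n example matrix (assumption)

# Thin SVD: A = U diag(s) Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k(k):
    """Sum of the first k rank-1 terms sigma_i * (u_i outer v_i)."""
    return sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(k))

A2 = rank_k(2)

# Spectral norm of the residual; for the best rank-2 approximation
# this equals the third singular value.
err2 = np.linalg.norm(A - A2, 2)
print(err2, s[2])
```

Using all four terms reproduces `A` up to rounding, which is the same sum-of-outer-products view that the tensor decompositions extend to three or more modes.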
Goal: extension to >=3 modes
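[Figure: a tensor of size I x J x K approximated (~) by factor matrices A (I x R), B (J x R), ... and an R x R x R core; equivalently, a sum of rank-1 terms (+…+)]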
Main points:
• 2 major types of tensor decompositions: PARAFAC and Tucker
• both can be solved with “alternating least squares” (ALS)
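To make the ALS idea concrete, here is a minimal sketch for the PARAFAC case, assuming NumPy and a small dense 3-mode tensor. The function names (`khatri_rao`, `cp_als`, and so on) are illustrative choices, not an API from the talk or from the Tensor Toolbox. Each step fixes two factor matrices and solves an ordinary least-squares problem for the third.

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product of P (p x R) and Q (q x R): (p*q) x R."""
    R = P.shape[1]
    return (P[:, None, :] * Q[None, :, :]).reshape(-1, R)

def cp_reconstruct(A, B, C):
    """Sum over r of the rank-1 terms a_r o b_r o c_r."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def rel_error(X, A, B, C):
    """Relative reconstruction error of the PARAFAC model."""
    return np.linalg.norm(X - cp_reconstruct(A, B, C)) / np.linalg.norm(X)

def cp_als(X, rank, n_iter=50, seed=0):
    """Plain alternating least squares for a 3-mode PARAFAC model."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Unfold X along each mode; column order matches khatri_rao below.
    X1 = X.reshape(I, J * K)                      # columns indexed by (j, k)
    X2 = X.transpose(1, 0, 2).reshape(J, I * K)   # columns indexed by (i, k)
    X3 = X.transpose(2, 0, 1).reshape(K, I * J)   # columns indexed by (i, j)
    for _ in range(n_iter):
        # Fix two factors, solve least squares for the third.
        A = X1 @ np.linalg.pinv(khatri_rao(B, C).T)
        B = X2 @ np.linalg.pinv(khatri_rao(A, C).T)
        C = X3 @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

# Synthetic rank-2 tensor (assumption: random factors, not real data).
rng = np.random.default_rng(1)
X = cp_reconstruct(rng.random((4, 2)), rng.random((5, 2)), rng.random((6, 2)))
A, B, C = cp_als(X, rank=2)
print(rel_error(X, A, B, C))
```

Each update is an exact least-squares solve, so the error cannot increase from one sweep to the next, but ALS in general only reaches a local optimum and can be slow; production code would also exploit sparsity rather than forming dense unfoldings.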
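Specially Structured Tensors
• Tucker Tensor: an I x J x K tensor written as a “core” of size R x S x T combined with factor matrices U (I x R), V (J x S), W (K x T)
• Kruskal Tensor: an I x J x K tensor written with an R x R x R core and factor matrices U (I x R), V (J x R), W (K x R); equivalently
$\mathcal{X} = u_1 \circ v_1 \circ w_1 + \dots + u_R \circ v_R \circ w_R$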
Tucker Decomposition - intuition
[Figure: an I x J x K tensor ~ a core G (R x S x T) combined with factor matrices A (I x R), B (J x S), C (K x T)]
• author x keyword x conference
• A: author x author-group
• B: keyword x keyword-group
• C: conf. x conf-group
• G: how groups relate to each other
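The roles of A, B, C and G can be sketched with a few lines of NumPy. The sizes and random values below are assumptions chosen only to make the shapes visible. The second half illustrates that a PARAFAC model is the special case where all modes have the same number of groups and the core is superdiagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 5, 3   # authors, keywords, conferences (toy sizes)
R, S, T = 2, 3, 2   # author-groups, keyword-groups, conf-groups

A = rng.random((I, R))      # author x author-group
B = rng.random((J, S))      # keyword x keyword-group
C = rng.random((K, T))      # conf. x conf-group
G = rng.random((R, S, T))   # core: how the groups relate to each other

# Tucker model: x_ijk = sum over r,s,t of g_rst * a_ir * b_js * c_kt
X = np.einsum('rst,ir,js,kt->ijk', G, A, B, C)

# PARAFAC as a special case: equal group counts and a superdiagonal core.
R2 = 2
A2, B2, C2 = rng.random((I, R2)), rng.random((J, R2)), rng.random((K, R2))
G_diag = np.zeros((R2, R2, R2))
G_diag[np.arange(R2), np.arange(R2), np.arange(R2)] = 1.0
X_tucker = np.einsum('rst,ir,js,kt->ijk', G_diag, A2, B2, C2)
X_parafac = np.einsum('ir,jr,kr->ijk', A2, B2, C2)
print(np.allclose(X_tucker, X_parafac))
```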
Intuition behind core tensor
• 2-d case: co-clustering
• [Dhillon et al. Information-Theoretic Co-clustering, KDD’03]
[Figure: an m x n matrix (e.g., terms x documents) written as a product of an m x k, a k x l, and an l x n matrix]
[Figure: the three factors: term x term-group, term group x doc. group, doc x doc group. Term groups: med. terms, cs terms, common terms; doc groups: med. doc, cs doc]
Tensor tools - summary
• Two main tools
– PARAFAC
– Tucker
• Both find row-, column-, tube-groups
– but in PARAFAC all three modes share the same number of groups, matched one-to-one (diagonal core)
• ( To solve: Alternating Least Squares )
Web graph mining
• How to order the importance of web pages?
– Kleinberg’s algorithm HITS
– PageRank
– Tensor extension on HITS (TOPHITS)
Kleinberg’s Hubs and Authorities (the HITS method)
Sparse adjacency matrix and its SVD:
[Figure: the from x to adjacency matrix ≈ (hub scores for 1st topic) ∘ (authority scores for 1st topic) + (hub scores for 2nd topic) ∘ (authority scores for 2nd topic)]
Kleinberg, JACM, 1999
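The link between HITS and the SVD can be seen on a toy graph. The sketch below assumes NumPy and a made-up 6-page link structure: it runs the usual hub/authority iteration and compares the result with the leading singular vectors of the from x to adjacency matrix.

```python
import numpy as np

# Hypothetical toy web graph: pages 0-2 link to pages 3-5.
links = [(0, 3), (0, 4), (1, 3), (1, 4), (2, 3), (2, 4), (2, 5)]
A = np.zeros((6, 6))
for src, dst in links:
    A[src, dst] = 1.0   # row = "from", column = "to"

def hits(A, n_iter=100):
    """Hub/authority iteration with L2 normalization at each step."""
    h = np.ones(A.shape[0])
    for _ in range(n_iter):
        a = A.T @ h              # authority: sum of hub scores pointing in
        a /= np.linalg.norm(a)
        h = A @ a                # hub: sum of authority scores pointed to
        h /= np.linalg.norm(h)
    return h, a

hub, auth = hits(A)

# Leading singular vectors of A; abs() removes the sign ambiguity.
U, s, Vt = np.linalg.svd(A)
print(np.allclose(hub, np.abs(U[:, 0]), atol=1e-8),
      np.allclose(auth, np.abs(Vt[0]), atol=1e-8))
```

Only the first singular pair is recovered by this iteration; the "2nd topic" scores in the figure correspond to the next singular vectors.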
HITS Authorities on Sample Data
We started our crawl from http://www-neos.mcs.anl.gov/neos, and crawled 4700 pages, resulting in 560 cross-linked hosts.
Three-Dimensional View of the Web
Observe that this tensor is very sparse!
Kolda, Bader, Kenny, ICDM05
![Page 31: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/31.jpg)
CMU SCS
KAIST-2011 (C) 2011, C. Faloutsos 31
Three-Dimensional View of the Web
Observe that this tensor is very sparse!
Kolda, Bader, Kenny, ICDM05
![Page 32: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/32.jpg)
CMU SCS
KAIST-2011 (C) 2011, C. Faloutsos 32
Three-Dimensional View of the Web
Observe that this tensor is very sparse!
Kolda, Bader, Kenny, ICDM05
![Page 33: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/33.jpg)
CMU SCS
KAIST-2011 (C) 2011, C. Faloutsos 33
Topical HITS (TOPHITS)
Main Idea: Extend the idea behind the HITS model to incorporate term (i.e., topical) information.
[Figure: the from x to x term tensor ≈ (hub scores for 1st topic) ∘ (authority scores for 1st topic) ∘ (term scores for 1st topic) + (hub scores for 2nd topic) ∘ (authority scores for 2nd topic) ∘ (term scores for 2nd topic)]
TOPHITS Terms & Authorities on Sample Data
TOPHITS uses 3D analysis to find the dominant groupings of web pages and terms.
[Figure: Tensor PARAFAC of the from x to x term tensor into hub, authority, and term scores for the 1st and 2nd topics]
w_k = # unique links using term k
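As a small illustration of the quantity w_k, the sketch below counts, for each term, the number of distinct (from, to) links that carry it. The page names and terms are hypothetical, and the reading of "links using term k" as terms attached to a link (for example, its anchor text) is an assumption. How w_k is then used to weight tensor entries is not shown on the slide and is left out here.

```python
from collections import defaultdict

# Hypothetical (from_page, to_page, term) triples.
links = [
    ("a.org", "b.org", "optimization"),
    ("a.org", "c.org", "optimization"),
    ("b.org", "c.org", "solver"),
    ("a.org", "b.org", "optimization"),   # repeated link: counted once
]

# Distinct (from, to) pairs per term.
pairs_per_term = defaultdict(set)
for src, dst, term in links:
    pairs_per_term[term].add((src, dst))

# w[k] = number of unique links using term k.
w = {term: len(pairs) for term, pairs in pairs_per_term.items()}

# Sparse coordinate view of the from x to x term tensor:
# only the nonzero cells are stored.
nonzeros = {(src, dst, term) for src, dst, term in links}
print(w, len(nonzeros))
```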
Conclusions
• Real data are often in high dimensions with multiple aspects (modes)
• Tensors provide elegant theory and algorithms
– PARAFAC and Tucker: discover groups
References
• T. G. Kolda, B. W. Bader and J. P. Kenny. Higher-Order Web Link Analysis Using Multilinear Algebra. In: ICDM 2005, pages 242-249, November 2005.
• Jimeng Sun, Spiros Papadimitriou, Philip Yu. Window-based Tensor Analysis on High-dimensional and Multi-aspect Streams. In: ICDM 2006, Hong Kong, China, December 2006.
Resources
• See tutorial on tensors, KDD’07 (w/ Tamara Kolda and Jimeng Sun):
www.cs.cmu.edu/~christos/TALKS/KDD-07-tutorial
Tensor tools - resources
• Toolbox: from Tamara Kolda: csmr.ca.sandia.gov/~tgkolda/TensorToolbox
• T. G. Kolda and B. W. Bader. Tensor Decompositions and Applications. SIAM Review, Volume 51, Number 3, September 2009 csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/TensorReview-preprint.pdf
• T. Kolda and J. Sun: Scalable Tensor Decomposition for Multi-Aspect Data Mining (ICDM 2008)
Outline
• Task 4: time-evolving graphs – tensors
• Task 5: community detection
• Task 6: virus propagation
• Task 7: scalability, parallelism and hadoop
• Conclusions
Detailed outline
• Motivation
• Hard clustering – k pieces
• Hard co-clustering – (k,l) pieces
• Hard clustering – optimal # pieces
• Observations
Problem
• Given a graph, and k
• Break it into k (disjoint) communities
Example: k = 2
Solution #1: METIS
• Arguably, the best algorithm
• Open source, at
– http://www.cs.umn.edu/~metis
• and *many* related papers, at same url
• Main idea:
– coarsen the graph;
– partition;
– un-coarsen
Solution #1: METIS
• G. Karypis and V. Kumar. METIS 4.0: Unstructured graph partitioning and sparse matrix ordering system. TR, Dept. of CS, Univ. of Minnesota, 1998.
• <and many extensions>
Solution #2 (problem: hard clustering, k pieces)
Spectral partitioning:
• Consider the eigenvector of the 2nd smallest eigenvalue of the (normalized) Laplacian
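A minimal sketch of this idea for k = 2, assuming NumPy and a made-up graph of two triangles joined by a single edge: build the symmetric normalized Laplacian, take the eigenvector of its 2nd smallest eigenvalue, and split the nodes by sign. Splitting by sign is one simple choice among several; other thresholds are possible.

```python
import numpy as np

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5}, bridged by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}.
deg = A.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(deg)
L_norm = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# eigh returns eigenvalues in ascending order for symmetric matrices.
eigvals, eigvecs = np.linalg.eigh(L_norm)
second = eigvecs[:, 1]   # eigenvector of the 2nd smallest eigenvalue

# Two communities from the sign pattern (which side is "1" is arbitrary).
labels = (second > 0).astype(int)
print(labels)
```

On this graph the split separates the two triangles, cutting only the bridge edge.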
Solutions #3, …
Many more ideas:
• Clustering on A^2 (square of the adjacency matrix) [Zhou, Woodruff, PODS’04]
• Minimum cut / maximum flow [Flake+, KDD’00]
• …
Detailed outline
• Motivation
• Hard clustering – k pieces
• Hard co-clustering – (k,l) pieces
• Hard clustering – optimal # pieces
• Soft clustering – matrix decompositions
• Observations
Problem definition
• Given a bi-partite graph, and k, l
• Divide it into k row groups and l column groups
• (Also applicable to uni-partite graph)
Co-clustering
• Given data matrix and the number of row and column groups k and l
• Simultaneously
– Cluster rows into k disjoint groups
– Cluster columns into l disjoint groups
Co-clustering
• Let X and Y be discrete random variables
– X and Y take values in {1, 2, …, m} and {1, 2, …, n}
– p(X, Y) denotes the joint probability distribution; if not known, it is often estimated based on co-occurrence data
– Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.
• Key Obstacles in Clustering Contingency Tables
– High Dimensionality, Sparsity, Noise
– Need for robust and scalable algorithms
Reference: 1. Dhillon et al. Information-Theoretic Co-clustering, KDD’03
[Figure: an m x n matrix (e.g., terms x documents) written as a product of an m x k, a k x l, and an l x n matrix]
[Figure: the three factors: term x term-group, term group x doc. group, doc x doc group. Term groups: med. terms, cs terms, common terms; doc groups: med. doc, cs doc]
Co-clustering
Observations
• uses KL divergence, instead of L2
• the middle matrix is not diagonal
– we saw that earlier in the Tucker tensor decomposition
• s/w at: www.cs.utexas.edu/users/dml/Software/cocluster.html
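The role of the KL divergence can be sketched as follows, assuming NumPy. For a given assignment of rows and columns to groups, the code forms the co-clustering approximation q(x, y) = p(x̂, ŷ) p(x | x̂) p(y | ŷ) used in the information-theoretic formulation of Dhillon et al., and measures KL(p || q) in bits. The joint distribution `p` is a made-up block example, and the sketch only scores a fixed grouping; it does not search for the best one.

```python
import numpy as np

def cocluster_kl(p, row_groups, col_groups):
    """KL(p || q) in bits, where q is the co-clustering approximation of p.

    Assumes every row group and column group has positive probability mass.
    """
    p = np.asarray(p, dtype=float)
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)
    k, l = row_groups.max() + 1, col_groups.max() + 1
    Rm = np.eye(k)[row_groups]      # m x k row-group indicator matrix
    Cm = np.eye(l)[col_groups]      # n x l column-group indicator matrix
    p_hat = Rm.T @ p @ Cm           # k x l: the non-diagonal "middle" matrix
    px, py = p.sum(axis=1), p.sum(axis=0)
    px_hat, py_hat = Rm.T @ px, Cm.T @ py
    # q(x, y) = p(xhat, yhat) * p(x | xhat) * p(y | yhat)
    q = (Rm @ p_hat @ Cm.T) * np.outer(px / px_hat[row_groups],
                                       py / py_hat[col_groups])
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Hypothetical joint distribution with two clean blocks.
p = np.zeros((4, 4))
p[:2, :2] = 0.125
p[2:, 2:] = 0.125

good = cocluster_kl(p, [0, 0, 1, 1], [0, 0, 1, 1])   # groups match the blocks
bad = cocluster_kl(p, [0, 1, 0, 1], [0, 1, 0, 1])    # groups cut across blocks
print(good, bad)
```

A grouping that respects the block structure loses no information, while one that cuts across the blocks does.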
Problem with Information Theoretic Co-clustering
• Number of row and column groups must be specified
Desiderata:
• Simultaneously discover row and column groups
• Fully Automatic: No “magic numbers”
• Scalable to large graphs
Cross-association
Desiderata:
• Simultaneously discover row and column groups
• Fully Automatic: No “magic numbers”
• Scalable to large matrices
Reference: 1. Chakrabarti et al. Fully Automatic Cross-Associations, KDD’04
What makes a cross-association “good”?
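[Figure: two alternative cross-associations of the same matrix, side by side (“versus”); in each, rows are arranged into row groups and columns into column groups]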
Why is this better?
simpler; easier to describe; easier to compress!
What makes a cross-association “good”?
Problem definition: given an encoding scheme
• decide on the # of col. and row groups k and l
• and reorder rows and columns,
• to achieve best compression
Main Idea
Good Compression
Better Clustering
$\text{Total Encoding Cost} = \underbrace{\sum_i \text{size}_i \cdot H(x_i)}_{\text{Code Cost}} + \underbrace{\text{Cost of describing cross-associations}}_{\text{Description Cost}}$
Minimize the total cost (# bits) for lossless compression
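The code-cost term can be sketched in a few lines, assuming NumPy and a binary matrix. Here block i is the submatrix for one (row group, column group) pair, size_i is its number of cells, and H(x_i) is taken to be the binary entropy of the fraction of 1s in the block; that reading of the formula, and the toy matrix, are assumptions. The description cost is omitted, so this is only part of the total.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a 0/1 source with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def code_cost(M, row_groups, col_groups):
    """Sum over blocks of (number of cells) * H(density of 1s in the block)."""
    M = np.asarray(M)
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)
    total = 0.0
    for r in np.unique(row_groups):
        for c in np.unique(col_groups):
            block = M[np.ix_(row_groups == r, col_groups == c)]
            total += block.size * binary_entropy(block.mean())
    return total

# Hypothetical binary matrix with two dense blocks.
M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])

aligned = code_cost(M, [0, 0, 1, 1], [0, 0, 1, 1])   # blocks are all-0 or all-1
mixed = code_cost(M, [0, 1, 0, 1], [0, 1, 0, 1])     # every block is half 1s
print(aligned, mixed)
```

Homogeneous blocks cost nothing to encode, which is the sense in which good compression and good clustering go together.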
Algorithm
[Figure: the search alternates between adding a column group and a row group: k=1, l=2; k=2, l=2; k=2, l=3; k=3, l=3; k=3, l=4; k=4, l=4; k=4, l=5; ending at k = 5 row groups, l = 5 col groups]
![Page 63: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/63.jpg)
Experiments “CLASSIC”
• 3,893 documents
• 4,303 words
• 176,347 “dots”
Combination of 3 sources:
• MEDLINE (medical)
• CISI (info. retrieval)
• CRANFIELD (aerodynamics)
[Figure: documents × words matrix; axes: Documents, Words]
![Page 64: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/64.jpg)
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
[Figure: documents × words matrix with the k=15, l=19 groups; axes: Documents, Words]
![Page 65: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/65.jpg)
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
MEDLINE (medical)
insipidus, alveolar, aortic, death, prognosis, intravenous
blood, disease, clinical, cell, tissue, patient
![Page 66: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/66.jpg)
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
CISI (Information Retrieval)
providing, studying, records, development, students, rules
abstract, notation, works, construct, bibliographies
MEDLINE (medical)
![Page 67: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/67.jpg)
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
CRANFIELD (aerodynamics)
shape, nasa, leading, assumed, thin
CISI (Information Retrieval)
MEDLINE (medical)
![Page 68: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/68.jpg)
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
paint, examination, fall, raise, leave, based
CRANFIELD (aerodynamics)
CISI (Information Retrieval)
MEDLINE (medical)
![Page 69: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/69.jpg)
Algorithm
Code for cross-associations (matlab):
www.cs.cmu.edu/~deepay/mywww/software/CrossAssociations-01-27-2005.tgz
Variations and extensions:
• ‘Autopart’ [Chakrabarti, PKDD’04]
• www.cs.cmu.edu/~deepay
![Page 70: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/70.jpg)
Algorithm • Hadoop implementation [ICDM’08]
Spiros Papadimitriou, Jimeng Sun: DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining. ICDM 2008: 512-521
![Page 71: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/71.jpg)
Detailed outline
• Motivation • Hard clustering – k pieces • Hard co-clustering – (k,l) pieces • Hard clustering – optimal # pieces • Observations
![Page 72: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/72.jpg)
Observation #1
• Skewed degree distributions – there are nodes with huge degree (> O(10^4), in Facebook/LinkedIn popularity contests!)
![Page 73: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/73.jpg)
Observation #2
• Maybe there are no good cuts: ‘jellyfish’ shape [Tauro+,’01], [Siganos+,’06], strange behavior of cuts [Chakrabarti+,’04], [Leskovec+,’08]
![Page 74: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/74.jpg)
![Page 75: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/75.jpg)
Jellyfish model [Tauro+]
…
A Simple Conceptual Model for the Internet Topology, L. Tauro, C. Palmer, G. Siganos, M. Faloutsos, Global Internet, November 25-29, 2001
Jellyfish: A Conceptual Model for the AS Internet Topology G. Siganos, Sudhir L Tauro, M. Faloutsos, J. of Communications and Networks, Vol. 8, No. 3, pp 339-350, Sept. 2006.
![Page 76: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/76.jpg)
Strange behavior of min cuts
• ‘negative dimensionality’ (!)
NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy
Statistical Properties of Community Structure in Large Social and Information Networks, J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. WWW 2008.
![Page 77: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/77.jpg)
“Min-cut” plot
• Do min-cuts recursively.
[Figure: a graph with N nodes, mincut size = sqrt(N); plot axes: log (mincut-size / #edges) vs. log (# edges)]
![Page 78: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/78.jpg)
“Min-cut” plot
• Do min-cuts recursively.
[Figure: the N-node graph with a new min-cut on one of the pieces; plot axes: log (mincut-size / #edges) vs. log (# edges)]
![Page 79: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/79.jpg)
“Min-cut” plot
• Do min-cuts recursively.
[Figure: the N-node graph with successive min-cuts; plot axes: log (mincut-size / #edges) vs. log (# edges); slope = -0.5]
For a d-dimensional grid, the slope is -1/d
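A small numeric check of the -1/d claim, as a sketch only: it assumes the cut is a single axis-aligned plane through an n × … × n grid (a stand-in for a true min-cut), and the helper names are made up for illustration.

```python
import math

def grid_point(n, d):
    """Return (log #edges, log(mincut-size / #edges)) for a d-dim grid of side n."""
    edges = d * n ** (d - 1) * (n - 1)   # n^(d-1) * (n-1) edges along each of d axes
    cut = n ** (d - 1)                   # edges crossing one axis-aligned plane
    return math.log(edges), math.log(cut / edges)

def grid_slope(d, n_small=100, n_large=1000):
    """Slope of the min-cut plot between two grid sizes."""
    x1, y1 = grid_point(n_small, d)
    x2, y2 = grid_point(n_large, d)
    return (y2 - y1) / (x2 - x1)

for d in (1, 2, 3):
    print(d, round(grid_slope(d), 3))
```

Under these assumptions the slope for d = 2 comes out close to -0.5, matching the value on the slide, and it approaches -1/d more closely as the grids grow.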
![Page 80: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/80.jpg)
“Min-cut” plot
[Figure: two panels, both with axes log (mincut-size / #edges) vs. log (# edges)]
• For a d-dimensional grid, the slope is -1/d
• For a random graph, the slope is 0
![Page 81: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/81.jpg)
“Min-cut” plot
• What does it look like for a real-world graph?
[Figure: empty plot marked “?”; axes: log (mincut-size / #edges) vs. log (# edges)]
![Page 82: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/82.jpg)
Experiments • Datasets:
– Google Web Graph: 916,428 nodes and 5,105,039 edges
– Lucent Router Graph: Undirected graph of network routers from www.isi.edu/scan/mercator/maps.html; 112,969 nodes and 181,639 edges
– User Website Clickstream Graph: 222,704 nodes and 952,580 edges
NetMine: New Mining Tools for Large Graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SDM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy
![Page 83: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/83.jpg)
Experiments
• Used the METIS algorithm [Karypis, Kumar, 1995]
• Google Web graph
• Values along the y-axis are averaged
• We observe a “lip” for large # edges
• Slope of about -0.4, which corresponds to a 2.5-dimensional grid!
[Figure: log (mincut-size / #edges) vs. log (# edges); slope ~ -0.4]
![Page 84: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/84.jpg)
Google graph
[Figure: two panels of log (mincut-size / #edges) vs. log (# edges); one panel shows all min-cuts, the other the averaged values]
![Page 85: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/85.jpg)
Experiments • Same results for other graphs too…
Lucent Router graph
Clickstream graph
![Page 86: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/86.jpg)
Conclusions – Practitioner’s guide
• Hard clustering – k pieces: METIS
• Hard co-clustering – (k,l) pieces: Co-clustering
• Hard clustering – optimal # pieces: Cross-associations
• Observations: ‘jellyfish’: maybe there are no good cuts
![Page 87: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/87.jpg)
Outline • Task 4: time-evolving graphs – tensors • Task 5: community detection • Task 6: virus propagation • Task 7: scalability, parallelism and hadoop • Conclusions
![Page 88: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/88.jpg)
Detailed outline • Problem definition • Analysis • Experiments
![Page 89: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/89.jpg)
Immunization and epidemic thresholds
• Q1: which nodes to immunize?
• Q2: will a virus vanish, or will it create an epidemic?
![Page 90: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/90.jpg)
Q1: Immunization:
• Given
– a network,
– k vaccines, and
– the virus details
• Which nodes to immunize?
![Page 91: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/91.jpg)
![Page 92: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/92.jpg)
![Page 93: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/93.jpg)
Q1: Immunization: which nodes to immunize?
A: immunize the ones that maximally raise the ‘epidemic threshold’ [Tong+, ICDM’10]
![Page 94: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/94.jpg)
Q2: will a virus take over?
• Flu-like virus (no immunity, ‘SIS’)
• Mumps (life-time immunity, ‘SIR’)
• Pertussis (finite-length immunity, ‘SIRS’)
β: attack prob.; δ: heal prob.
![Page 95: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/95.jpg)
Q2: will a virus take over?
A: depends on connectivity (avg. degree? max degree? variance? something else?)
![Page 96: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/96.jpg)
The model: SIS
• ‘Flu’-like: Susceptible-Infected-Susceptible
• Virus ‘strength’ s = β/δ
[Figure: node N with neighbors N1, N2, N3, each either infected or healthy; an infected neighbor attacks with prob. β; an infected node heals with prob. δ]
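The SIS model above could be simulated along the following lines. This is a sketch under stated assumptions: discrete time steps, an undirected graph given as neighbor lists, and a node being healthy at the next step only if it is (healthy or healed) and not attacked; `sis_step` and `simulate_sis` are illustrative names, not code from the talk.

```python
import random

def sis_step(adj, infected, beta, delta, rng):
    """One synchronous SIS step; returns the new set of infected nodes."""
    nxt = set()
    for node, neighbors in enumerate(adj):
        # Each infected neighbor attacks independently with probability beta.
        attacked = any(nb in infected and rng.random() < beta for nb in neighbors)
        # An infected node heals with probability delta.
        healed = node in infected and rng.random() < delta
        healthy_or_healed = node not in infected or healed
        if not (healthy_or_healed and not attacked):
            nxt.add(node)
    return nxt

def simulate_sis(adj, seeds, beta, delta, steps, seed=0):
    """Return the number of infected nodes at each time step."""
    rng = random.Random(seed)
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        infected = sis_step(adj, infected, beta, delta, rng)
        history.append(len(infected))
    return history

# Toy path graph 0 - 1 - 2 with node 0 initially infected.
path = [[1], [0, 2], [1]]
always_spreads = simulate_sis(path, [0], beta=1.0, delta=0.0, steps=2)
never_spreads = simulate_sis(path, [0], beta=0.0, delta=1.0, steps=2)
```

The two extreme settings bracket the behavior: with β = 1 and δ = 0 the infection marches down the path, and with β = 0 and δ = 1 it is gone after one step.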
![Page 97: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/97.jpg)
Epidemic threshold τ of a graph: the value of τ such that, if strength s = β / δ < τ, an epidemic cannot happen.
Thus,
• given a graph
• compute its epidemic threshold
![Page 98: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/98.jpg)
Detailed outline • Problem definition • Analysis • Experiments
![Page 99: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/99.jpg)
Epidemic threshold τ
What should τ depend on?
• avg. degree? and/or highest degree?
• and/or variance of degree?
• and/or third moment of degree?
• and/or diameter?
![Page 100: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/100.jpg)
Epidemic threshold
• [Theorem] We have no epidemic, if
$$\beta/\delta < \tau = 1/\lambda_{1,A}$$
![Page 101: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/101.jpg)
Epidemic threshold
Legend: β = attack prob.; δ = recovery prob.; τ = epidemic threshold; $\lambda_{1,A}$ = largest eigenvalue of adj. matrix A
Proof: [Wang+03] (proof: for SIS=flu only)
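One way to compute the threshold for a small graph is sketched below, assuming an undirected graph given as neighbor lists. It uses power iteration on A + I rather than A, a choice made here so that the iteration also converges on bipartite graphs such as stars; the function names are illustrative.

```python
import math

def largest_eigenvalue(adj, iters=500):
    """Largest adjacency eigenvalue via power iteration on A + I."""
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        # y = (A + I) x; the shift by I keeps the top eigenvalue strictly dominant.
        y = [x[i] + sum(x[j] for j in adj[i]) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    # Rayleigh quotient x^T A x for the unit vector x.
    return sum(x[i] * sum(x[j] for j in adj[i]) for i in range(n))

def epidemic_threshold(adj):
    """tau = 1 / lambda_1 of the adjacency matrix."""
    return 1.0 / largest_eigenvalue(adj)

# Star with a center of degree d = 4: lambda_1 = sqrt(4) = 2, so tau = 0.5.
star = [[1, 2, 3, 4], [0], [0], [0], [0]]
tau_star = epidemic_threshold(star)
```

For graphs with millions of nodes a sparse eigensolver would be the more natural tool; this plain-Python version is only meant to make the definition concrete.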
![Page 102: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/102.jpg)
Beginning of proof
Healthy @ t+1:
- ( healthy or healed )
- and not attacked @ t
Let: p(i, t) = Prob node i is sick @ t
$$1 - p(i, t+1) = \bigl(1 - p(i, t) + p(i, t)\,\delta\bigr) \prod_j \bigl(1 - \beta\, a_{ji}\, p(j, t)\bigr)$$
Below threshold, if the above non-linear dynamical system is ‘stable’ (eigenvalue of Hessian < 1)
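A sketch of how the stability condition could lead to the threshold, assuming the infection probabilities are small enough to drop second-order terms in p and that A is symmetric (undirected graph). The vector $\mathbf{p}_t$ and the matrix $S$ are notation introduced here for illustration, not taken from the slides:

```latex
\begin{align*}
1 - p(i,t+1) &= \bigl(1 - (1-\delta)\,p(i,t)\bigr)\prod_j \bigl(1 - \beta\,a_{ji}\,p(j,t)\bigr) \\
             &\approx 1 - (1-\delta)\,p(i,t) - \beta \sum_j a_{ji}\,p(j,t) \\
\mathbf{p}_{t+1} &\approx S\,\mathbf{p}_t, \qquad S = (1-\delta)\,I + \beta A \\
\lambda_{1,S} &= 1 - \delta + \beta\,\lambda_{1,A} < 1
  \iff \beta/\delta < 1/\lambda_{1,A} = \tau
\end{align*}
```

Under these assumptions the infection probabilities shrink over time exactly when the largest eigenvalue of S is below 1, which rearranges to the condition in the theorem.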
![Page 103: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/103.jpg)
Epidemic threshold for various networks
Formula includes older results as special cases:
• Homogeneous networks [Kephart+White]
– $\lambda_{1,A} = \langle k \rangle$; $\tau = 1/\langle k \rangle$ ($\langle k \rangle$: avg degree)
• Star networks (d = degree of center)
– $\lambda_{1,A} = \sqrt{d}$; $\tau = 1/\sqrt{d}$
• Infinite power-law networks
– $\lambda_{1,A} = \infty$; $\tau = 0$ [Barabasi]
![Page 104: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/104.jpg)
Epidemic threshold
• [Theorem 2] Below the epidemic threshold, the epidemic dies out exponentially
![Page 105: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/105.jpg)
Recent generalization
• [Prakash+, arxiv ‘10]: similar threshold, for almost all virus propagation models (VPM)
– SIS -> flu
– SIR -> mumps
– SIRS -> whooping cough (temporary immunity)
– SIIR (-> HIV)
– …
![Page 106: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/106.jpg)
A2: will a virus take over?
• For all typical virus propagation models (flu, mumps, pertussis, HIV, etc.)
• The only connectivity measure that matters is $1/\lambda_1$, where $\lambda_1$ is the first eigenvalue of the adj. matrix
Proof for all VPM: [Prakash+, ‘10, arxiv]
![Page 107: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/107.jpg)
Detailed outline • Epidemic threshold
– Problem definition – Analysis – Experiments
![Page 108: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/108.jpg)
Experiments (Oregon)
β/δ > τ (above threshold)
β/δ = τ (at the threshold)
β/δ < τ (below threshold)
![Page 109: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/109.jpg)
SIS simulation - # infected nodes vs time
[Figure: log-lin plot; x-axis: time (linear scale); y-axis: # infected (log scale); curves: above, at, and below threshold]
![Page 110: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/110.jpg)
SIS simulation - # infected nodes vs time
[Figure: the same log-lin plot; x-axis: time (linear scale); y-axis: # infected (log scale); curves: above, at, and below threshold; annotation: exponential decay]
![Page 111: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/111.jpg)
SIS simulation - # infected nodes vs time
[Figure: log-log plot; x-axis: time (log scale); y-axis: # infected (log scale); curves: above, at, and below threshold]
![Page 112: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/112.jpg)
SIS simulation - # infected nodes vs time
[Figure: the same log-log plot; x-axis: time (log scale); y-axis: # infected (log scale); curves: above, at, and below threshold; annotation: power-law decay (!)]
![Page 113: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/113.jpg)
How about other VPMs?
![Page 114: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/114.jpg)
A2: will a virus take over? (SIRS case)
[Figure: fraction of infected vs. time ticks; below threshold: exp. extinction; above threshold: take-over]
Graph: Portland, OR; 31M links, 1.5M nodes
![Page 115: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/115.jpg)
Conclusions
$\lambda_{1,A}$: the largest eigenvalue of the adjacency matrix determines the survival of (almost) any virus
• measure of connectivity (~ # paths)
• Can answer ‘what-if’ scenarios
– May guide immunization policies
• Can help us avoid expensive simulations
![Page 116: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/116.jpg)
References
• D. Chakrabarti, Y. Wang, C. Wang, J. Leskovec, and C. Faloutsos, Epidemic Thresholds in Real Networks, in ACM TISSEC, 10(4), 2008
• Ganesh, A., Massoulie, L., and Towsley, D., 2005. The effect of network topology on the spread of epidemics. In INFOCOM.
![Page 117: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/117.jpg)
References (cont’d)
• Hethcote, H. W. 2000. The mathematics of infectious diseases. SIAM Review 42, 599–653.
• Hethcote, H. W. AND Yorke, J. A. 1984. Gonorrhea Transmission Dynamics and Control. Vol. 56. Springer. Lecture Notes in Biomathematics.
![Page 118: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/118.jpg)
References (cont’d)
• Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint, in SRDS 2003 (pages 25-34), Florence, Italy
![Page 119: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/119.jpg)
Outline • Task 4: time-evolving graphs – tensors • Task 5: community detection • Task 6: virus propagation • Task 7: scalability, parallelism and hadoop • Conclusions
![Page 120: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/120.jpg)
Scalability
• How about if graph/tensor does not fit in core?
• How about handling huge graphs?
![Page 121: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/121.jpg)
Scalability
• How about if graph/tensor does not fit in core?
– [‘MET’: Kolda, Sun, ICDM’08, best paper award]
• How about handling huge graphs?
![Page 122: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/122.jpg)
Scalability
• Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003]
• Yahoo: 5Pb of data [Fayyad, KDD’07]
• Problem: machine failures, on a daily basis
• How to parallelize data mining tasks, then?
![Page 123: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/123.jpg)
Scalability
• How to parallelize data mining tasks, then?
• A: map/reduce – hadoop (open-source clone) http://hadoop.apache.org/
![Page 124: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/124.jpg)
2’ intro to hadoop
• master-slave architecture; n-way replication (default n=3)
• ‘group by’ of SQL (in parallel, fault-tolerant way)
• e.g., find histogram of word frequency
– compute local histograms
– then merge into global histogram
select course-id, count(*) from ENROLLMENT group by course-id
![Page 125: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/125.jpg)
2’ intro to hadoop
• master-slave architecture; n-way replication (default n=3)
• ‘group by’ of SQL (in parallel, fault-tolerant way)
• e.g., find histogram of word frequency
  – compute local histograms (map)
  – then merge into global histogram (reduce)
select course-id, count(*) from ENROLLMENT group by course-id
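The ‘group by’ analogy can be sketched in a few lines of single-machine Python. This is only an illustration of the map and reduce roles, not Hadoop code: the function names (`map_split`, `merge`, `group_by_count`) and the tiny ENROLLMENT data are assumptions made up for the example.

```python
from collections import Counter
from functools import reduce


def map_split(records):
    """Map step: build a local histogram for one input split."""
    return Counter(records)


def merge(h1, h2):
    """Reduce step: merge two histograms by summing the counts per key."""
    return h1 + h2


def group_by_count(splits):
    """Equivalent of: select key, count(*) ... group by key."""
    # Each split is processed independently, so this loop could run in parallel.
    local_histograms = [map_split(s) for s in splits]
    return reduce(merge, local_histograms, Counter())


# Hypothetical ENROLLMENT table, stored as three splits of course-id values.
splits = [
    ["db", "ml", "db"],
    ["os", "db"],
    ["ml", "os", "os"],
]
histogram = group_by_count(splits)
print(dict(histogram))
```

Because merging histograms is associative and commutative, a failed split can simply be recomputed and merged later, which is what makes the fault-tolerant, parallel execution possible.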
![Page 126: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/126.jpg)
[Figure: map/reduce execution. The User Program forks the Master and the workers; the Master assigns map and reduce tasks. Mappers read the input splits (Split 0, Split 1, Split 2) of the Input Data (on HDFS) and write locally; Reducers do a remote read, sort, and write Output File 0 and Output File 1.]
By default: 3-way replication; late/dead machines: ignored, transparently (!)
![Page 127: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/127.jpg)
D.I.S.C.
• ‘Data Intensive Scientific Computing’ [R. Bryant, CMU]
  – ‘big data’
  – www.cs.cmu.edu/~bryant/pubdir/cmu-cs-07-128.pdf
![Page 128: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/128.jpg)
Analysis of a large graph
~200 GB (Yahoo crawl) – Degree Distribution:
• in 12 minutes with 50 machines
• Many (link spam?) at out-degree 1200
![Page 129: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/129.jpg)
Outline – Algorithms & results

| | Centralized | Hadoop/PEGASUS |
|---|---|---|
| Degree Distr. | old | old |
| Pagerank | old | old |
| Diameter/ANF | old | DONE |
| Conn. Comp | old | DONE |

Triangles: DONE
Visualization: STARTED
![Page 130: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/130.jpg)
HADI for diameter estimation
• Radius Plots for Mining Tera-byte Scale Graphs. U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10
• Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B)
• Our HADI: linear on E (~10B)
  – Near-linear scalability wrt # machines
  – Several optimizations -> 5x faster
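To make the quantities in a radius plot concrete, here is a minimal exact sketch for a small undirected graph. It is not HADI: it keeps a full reachability set per node, which is exactly the O(N**2) space that is prohibitive at scale, whereas HADI replaces these sets with compact approximate sketches. The function name `radius_stats`, the 90% `coverage` threshold and the integer (non-interpolated) effective diameter are assumptions of this example.

```python
def radius_stats(edges, coverage=0.9):
    """Exact neighbourhood expansion on a small undirected graph.

    Returns (radii, effective_diameter), where radii[u] is the number of
    hops after which u's reachable set stops growing, and the effective
    diameter is the smallest h such that at least `coverage` of all
    reachable pairs are within h hops.
    """
    nodes = {u for edge in edges for u in edge}
    neighbors = {u: set() for u in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    reach = {u: {u} for u in nodes}   # nodes within h hops of u, starting at h = 0
    radii = {u: 0 for u in nodes}
    totals = [len(nodes)]             # N(0): every node reaches itself
    h = 0
    while True:
        h += 1
        new_reach = {}
        changed = False
        for u in nodes:
            grown = set(reach[u])
            for v in neighbors[u]:
                grown |= reach[v]     # one "iterated multiplication" step
            if len(grown) > len(reach[u]):
                radii[u] = h
                changed = True
            new_reach[u] = grown
        if not changed:
            break
        reach = new_reach
        totals.append(sum(len(s) for s in reach.values()))

    target = coverage * totals[-1]
    effective_diameter = next(k for k, n in enumerate(totals) if n >= target)
    return radii, effective_diameter


# A path 1-2-3-4-5: the end points have radius 4, the middle node radius 2.
path = [(1, 2), (2, 3), (3, 4), (4, 5)]
radii, eff_diameter = radius_stats(path)
print(radii, eff_diameter)
```

A radius plot is then simply the histogram of the values in `radii` (count of nodes per radius).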
![Page 131: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/131.jpg)
[Plot: Count vs. Radius. Annotations: ‘????’; ‘19+ [Barabasi+]’; ‘~1999, ~1M nodes’]
![Page 132: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/132.jpg)
YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges)
• Largest publicly available graph ever studied.
[Plot: Count vs. Radius. Annotations: ‘????’; ‘19+ [Barabasi+]’ (~1999, ~1M nodes); ‘??’]
![Page 133: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/133.jpg)
YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges)
• Largest publicly available graph ever studied.
[Plot: Count vs. Radius. Annotations: ‘????’; ‘19+? [Barabasi+]’; ‘14 (dir.), ~7 (undir.)’]
![Page 134: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/134.jpg)
YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges)
• 7 degrees of separation (!)
• Diameter: shrunk
[Plot: Count vs. Radius. Annotations: ‘????’; ‘19+? [Barabasi+]’; ‘14 (dir.), ~7 (undir.)’]
![Page 135: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/135.jpg)
YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges)
Q: Shape?
[Plot: Count vs. Radius. Annotations: ‘????’; ‘~7 (undir.)’]
![Page 136: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/136.jpg)
YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges)
• effective diameter: surprisingly small.
• Multi-modality (?!)
![Page 137: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/137.jpg)
Radius Plot of GCC of YahooWeb.
![Page 138: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/138.jpg)
YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges)
• effective diameter: surprisingly small.
• Multi-modality: probably mixture of cores.
![Page 139: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/139.jpg)
YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges)
• effective diameter: surprisingly small.
• Multi-modality: probably mixture of cores.
[Plot: radius plot. Annotations: ‘~7’; ‘Conjecture:’ with labels EN, DE, BR]
![Page 140: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/140.jpg)
YahooWeb graph (120 GB, 1.4B nodes, 6.6B edges)
• effective diameter: surprisingly small.
• Multi-modality: probably mixture of cores.
[Plot: radius plot. Annotations: ‘~7’; ‘Conjecture:’]
![Page 141: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/141.jpg)
Running time – Kronecker and Erdős-Rényi graphs with billions of edges. (details)
![Page 142: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/142.jpg)
Outline – Algorithms & results

| | Centralized | Hadoop/PEGASUS |
|---|---|---|
| Degree Distr. | old | old |
| Pagerank | old | old |
| Diameter/ANF | old | DONE |
| Conn. Comp | old | DONE |

Triangles: DONE
Visualization: STARTED
![Page 143: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/143.jpg)
Generalized Iterated Matrix Vector Multiplication (GIMV)
PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. ICDM 2009, Miami, Florida, USA. Best Application Paper (runner-up).
![Page 144: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/144.jpg)
Generalized Iterated Matrix Vector Multiplication (GIMV) (details)
• PageRank
• proximity (RWR)
• Diameter
• Connected components
• (eigenvectors,
• Belief Prop.
• … )
All of these: matrix–vector multiplication (iterated)
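The idea that PageRank is "just" an iterated, generalized matrix–vector multiplication can be sketched as below. The three customizable operations are named after the `combine2` / `combineAll` / `assign` operations of the PEGASUS paper as best recalled here; the Python signatures, the sparse dict-of-dicts matrix layout and the assumption that every node has at least one out-link are choices made for this example, not the PEGASUS implementation.

```python
def gimv(matrix, vector, combine2, combine_all, assign):
    """One generalized matrix-vector multiplication step.

    matrix: sparse dict, row i -> {column j: m_ij}
    vector: dict, node -> value
    """
    result = {}
    for i, v_i in vector.items():
        partial = [combine2(m_ij, vector[j])
                   for j, m_ij in matrix.get(i, {}).items()]
        result[i] = assign(v_i, combine_all(i, partial))
    return result


def pagerank(out_links, damping=0.85, iterations=50):
    """PageRank as iterated GIM-V; assumes no node has an empty out-link list."""
    nodes = sorted(out_links)
    n = len(nodes)
    # Column-normalized transposed adjacency: m[i][j] = 1/outdeg(j) if j -> i.
    m = {i: {} for i in nodes}
    for j, targets in out_links.items():
        for i in targets:
            m[i][j] = 1.0 / len(targets)
    v = {i: 1.0 / n for i in nodes}
    for _ in range(iterations):
        v = gimv(
            m, v,
            combine2=lambda m_ij, v_j: damping * m_ij * v_j,
            combine_all=lambda i, xs: (1.0 - damping) / n + sum(xs),
            assign=lambda old, new: new,
        )
    return v


# On a directed 3-cycle every node is equally important.
cycle_ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
# A small graph where "c" collects links from both "a" and "b".
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(cycle_ranks, ranks)
```

Swapping the three operations changes the algorithm without touching `gimv` itself; for example, using `min` in place of the sum turns the same loop into label propagation for connected components.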
![Page 145: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/145.jpg)
Example: GIM-V At Work
• Connected Components – 4 observations:
[Plot: Count vs. Size of connected components]
![Page 146: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/146.jpg)
Example: GIM-V At Work
• Connected Components
[Plot: Count vs. Size of connected components]
1) 10K x larger than next
![Page 147: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/147.jpg)
Example: GIM-V At Work
• Connected Components
[Plot: Count vs. Size of connected components]
2) ~0.7B singleton nodes
![Page 148: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/148.jpg)
Example: GIM-V At Work
• Connected Components
[Plot: Count vs. Size of connected components]
3) SLOPE!
![Page 149: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/149.jpg)
Example: GIM-V At Work
• Connected Components
[Plot: Count vs. Size of connected components]
4) Spikes!
• 300-size cmpt X 500. Why?
• 1100-size cmpt X 65. Why?
![Page 150: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/150.jpg)
Example: GIM-V At Work
• Connected Components
[Plot: Count vs. Size of connected components; spikes annotated: suspicious financial-advice sites (not existing now)]
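A Count-vs-Size plot like the one above can be produced by labeling every node with the smallest node id in its component and then counting component sizes. The sketch below does this with repeated min-label propagation, which has the iterated matrix–vector shape of GIM-V; the function names and the toy graph are assumptions for illustration, and node ids are assumed to be mutually comparable.

```python
from collections import Counter


def connected_components(edges, nodes=()):
    """Label every node with the smallest node id in its component.

    `nodes` may list extra isolated nodes that appear in no edge.
    """
    label = {u: u for u in nodes}
    neighbors = {u: set() for u in nodes}
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            label.setdefault(a, a)
            neighbors.setdefault(a, set()).add(b)

    changed = True
    while changed:
        changed = False
        new_label = {}
        for u in label:
            # Take the minimum over the neighbours' labels and the old label.
            best = min([label[u]] + [label[v] for v in neighbors[u]])
            new_label[u] = best
            changed = changed or best != label[u]
        label = new_label
    return label


def component_size_histogram(label):
    """Return {component size: number of components of that size}."""
    sizes = Counter(label.values())    # component id -> size
    return Counter(sizes.values())     # size -> count


labels = connected_components([(1, 2), (2, 3), (4, 5)], nodes=[6])
size_hist = component_size_histogram(labels)
print(labels, dict(size_hist))
```

Spikes such as "300-size component × 500" would show up here as a size whose count is far above its neighbours in the histogram.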
![Page 151: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/151.jpg)
GIM-V At Work
• Connected Components over Time
• LinkedIn: 7.5M nodes and 58M edges
Stable tail slope after the gelling point
![Page 152: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/152.jpg)
Conclusions
• Hadoop: promising architecture for Tera/Peta scale graph mining
Resources:
• http://hadoop.apache.org/core/
• http://hadoop.apache.org/pig/ (higher-level language for data processing)
![Page 153: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/153.jpg)
References
• Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI'04
• Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: Pig Latin: a not-so-foreign language for data processing. SIGMOD 2008: 1099-1110
![Page 154: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/154.jpg)
Overall Conclusions
• Real graphs exhibit surprising patterns (power laws, shrinking diameter, super-linearity on edge weights, triangles, etc.)
• SVD: a powerful tool (HITS, PageRank)
• Several other tools: tensors, METIS, …
  – But: good communities might not exist…
• Immunization: first eigenvalue
• Scalability: hadoop/parallelism
![Page 155: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/155.jpg)
Our goal:
Open source system for mining huge graphs:
PEGASUS project (PEta GrAph mining System)
• www.cs.cmu.edu/~pegasus
• code and papers
![Page 156: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/156.jpg)
Project info
www.cs.cmu.edu/~pegasus
• Akoglu, Leman
• Chau, Polo
• Kang, U
• McGlohon, Mary
• Tong, Hanghang
• Prakash, Aditya
• Koutra, Danae
Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, Google, INTEL, HP, iLab
![Page 157: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/157.jpg)
Extra material
• E-bay fraud detection
• Outlier detection
![Page 158: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/158.jpg)
Detailed outline
• Fraud detection in e-bay
• Anomaly detection
![Page 159: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/159.jpg)
E-bay Fraud detection
w/ Polo Chau & Shashank Pandit, CMU
NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks, S. Pandit, D. H. Chau, S. Wang, and C. Faloutsos (WWW'07), pp. 201-210
![Page 160: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/160.jpg)
E-bay Fraud detection
• lines: positive feedbacks
• would you buy from him/her?
![Page 161: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/161.jpg)
E-bay Fraud detection
• lines: positive feedbacks
• would you buy from him/her?
• or him/her?
![Page 162: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/162.jpg)
E-bay Fraud detection - NetProbe
Belief Propagation gives:
![Page 163: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/163.jpg)
Popular press
And less desirable attention:
• E-mail from ‘Belgium police’ (‘copy of your code?’)
![Page 164: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/164.jpg)
Extra material
• E-bay fraud detection
• Outlier detection
![Page 165: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/165.jpg)
OddBall: Spotting Anomalies in Weighted Graphs
Leman Akoglu, Mary McGlohon, Christos Faloutsos
Carnegie Mellon University School of Computer Science
PAKDD 2010, Hyderabad, India
![Page 166: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/166.jpg)
Main idea
For each node,
• extract ‘ego-net’ (=1-step-away neighbors)
• extract features (#edges, total weight, etc.)
• compare with the rest of the population
![Page 167: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/167.jpg)
What is an egonet?
[Figure: a node (‘ego’) and its surrounding ‘egonet’]
![Page 168: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/168.jpg)
Selected Features
• Ni: number of neighbors (degree) of ego i
• Ei: number of edges in egonet i
• Wi: total weight of egonet i
• λw,i: principal eigenvalue of the weighted adjacency matrix of egonet i
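The first three features can be computed directly from an edge list, as in the sketch below. It assumes an undirected weighted graph without self-loops, treats the egonet as the ego, its neighbors and every edge among them, and leaves out the eigenvalue feature λw,i; the function name `egonet_features` and the toy graph are made up for the example, and this is not the OddBall code.

```python
from collections import defaultdict


def egonet_features(weighted_edges):
    """Return {ego: (N_i, E_i, W_i)} for an undirected weighted graph.

    weighted_edges: iterable of (u, v, weight) with u != v.
    """
    weight = {}
    neighbors = defaultdict(set)
    for u, v, w in weighted_edges:
        weight[frozenset((u, v))] = w
        neighbors[u].add(v)
        neighbors[v].add(u)

    features = {}
    for ego, nbrs in neighbors.items():
        members = nbrs | {ego}
        # An edge belongs to the egonet if both of its end points do.
        egonet_edges = [e for e in weight if e <= members]
        features[ego] = (
            len(nbrs),                              # N_i
            len(egonet_edges),                      # E_i
            sum(weight[e] for e in egonet_edges),   # W_i
        )
    return features


# A star centred on "c" (unit weights) and a triangle x-y-z (weight 2 each).
toy_edges = [
    ("c", "a", 1), ("c", "b", 1), ("c", "d", 1),
    ("x", "y", 2), ("y", "z", 2), ("x", "z", 2),
]
feats = egonet_features(toy_edges)
print(feats["c"], feats["x"])
```

In this toy graph the star center has as many egonet edges as neighbors, while the triangle node has more edges than neighbors; comparing E_i against N_i across all nodes is one way to spot egonets that are unusually star-like or clique-like.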
![Page 169: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/169.jpg)
Near-Clique/Star
![Page 170: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/170.jpg)
Near-Clique/Star
![Page 171: Talk 3: Graph Mining Tools – Tensors, communities, parallelismfor the foils on tensor definitions, and on TOPHITS . CMU SCS KAIST-2011 (C) 2011, C. Faloutsos 5 ... hard clustering,](https://reader033.fdocuments.net/reader033/viewer/2022052004/6017b10829f77d33e06d6947/html5/thumbnails/171.jpg)
END