New Directions in Statistical Graph Mining
Max Planck Institute for Biological Cybernetics
Koji Tsuda
Joint work with Hiroto Saigo, Sebastian Nowozin, Nicole Kraemer, Takeaki Uno and Taku Kudo
Graphs and itemsets
(40F, 41L, 43E, 210W, 211K, 215Y)
(43Q, 75M, 122E, 210W, 211K)
(75I, 77L, 116Y, 151M, 184V)
Graph Structures in Biology
DNA Sequence
RNA
Texts in literature (e.g., "Amitriptyline inhibits adenosine uptake")
Compounds
Structurization of statistical methods
Statistical methods for numerical vectors: SVM, Boosting, Clustering, Regression, PCA, CCA, ...
Structurization: Modify the algorithm so that it accepts structured data as input
Recipe:
– If the algorithm has a forward feature selection step, replace it with a top-k pattern mining algorithm
– If not, the algorithm has to be engineered to have a forward feature selection step
Pattern Mining Algorithms
Pattern: Substructure (subset, subtree, subgraph, etc.)
Combinatorial algorithm to enumerate all patterns satisfying a criterion:
– Frequency in database
– Weighted frequency
– Top-k patterns for a given evaluation function
See a textbook for data mining (e.g., Han and Kamber, 2002)
Naïve Structurization
Apply frequent pattern mining and create the complete feature space
Then, apply a statistical method
Too many patterns!
– It is likely to miss important features if the frequency threshold (minimum support) is set high
– Memory overflow
See tutorial by Xifeng Yan (KDD08)
Progressive collection of patterns
Most statistical methods are iterative: Boosting, PLS, Clustering, etc.
In each iteration, call a pattern mining algorithm to collect a few patterns
Discover the optimal patterns based on the current parameter
Feature space is gradually constructed
0/1 vector of pattern indicators
Dimensionality grows slightly in each iteration
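A minimal sketch of how such a feature space could grow, using the itemsets from the earlier slide as a stand-in for graphs. The helper `indicator_column` and the patterns collected per iteration are hypothetical, chosen only for illustration; in the actual methods the patterns come from a mining algorithm.

```python
import numpy as np

def indicator_column(transactions, pattern):
    """0/1 column: 1 where the transaction contains every item of the pattern."""
    p = set(pattern)
    return np.array([1 if p <= set(t) else 0 for t in transactions])

# The three itemsets shown on the "Graphs and itemsets" slide.
transactions = [
    {"40F", "41L", "43E", "210W", "211K", "215Y"},
    {"43Q", "75M", "122E", "210W", "211K"},
    {"75I", "77L", "116Y", "151M", "184V"},
]

# Start with zero columns; each iteration appends the columns of the
# (hypothetical) patterns returned by the miner in that iteration.
X = np.zeros((len(transactions), 0), dtype=int)
for new_patterns in [[("210W",)], [("210W", "211K"), ("75I",)]]:
    cols = [indicator_column(transactions, p) for p in new_patterns]
    X = np.hstack([X, np.column_stack(cols)])
```

After the two iterations `X` has three columns, one per collected pattern; no column is ever created for a pattern that was not requested.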
Structurization examples from my group
gBoost (Saigo et al., 2006, 2008)
Graph Classification: L1-SVM + Graph Mining
Graph Regression: LASSO + Graph Mining
Related papers from other groups
Stream Prediction Using a Generative Model Based on Frequent Episodes in Event Sequences. Srivatsan Laxman, Vikram Tankasali, Ryen W. White. (KDD 2008)
HMM + Sequence Mining
Fast Logistic Regression for Text Categorization with Variable-Length N-grams. Georgiana Ifrim, Goekhan Bakir, Gerhard Weikum. (KDD 2008)
Logistic regression + Sequence Mining
Why not kernels?
Kernels take all features into account
– Graph kernels, Tree kernels, Weighted degree kernels
The prediction rules are not interpretable
Rescue: Mine the prediction rule of SVM
– POIMs by S. Sonnenburg et al. (2008)
– Also a meaningful direction to pursue
Three generations of structurization
1G: AdaBoost (2004); #iter: 1000s
2G: L1 regularization & Column generation (–2008); #iter: 100s
– gBoost, gLARS, iBoost, etc.
3G: Krylov subspace methods; #iter: 10s
– Partial least squares, Lanczos methods, conjugate gradient
Overview
Quick Review of Graph Mining
Graph Partial Least Squares Regression (gPLS), KDD 2008
Graph Principal Component Analysis (gPCA), ICDM 2008
Quick Review of Graph Mining
Graph Mining
Analysis of Graph Databases
Find all patterns satisfying predetermined conditions
– Frequent Substructure Mining
Combinatorial, Exhaustive
Recently developed
– AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)
Graph Mining
Frequent Substructure Mining
Enumerate all patterns that occur in at least m graphs
$x_{ik}$: indicator of pattern k in graph i
Support(k): number of occurrences of pattern k, $\sum_i x_{ik}$
gSpan (Yan and Han, 2002)
Efficient Frequent Substructure Mining Method
DFS Code
– Efficient detection of isomorphic patterns
We extend gSpan for our work
Enumeration on Tree-shaped Search Space
Each node has a pattern
Generate nodes from the root:
– Add an edge at each step
Tree Pruning
Anti-monotonicity: if g is a subgraph of g', then support(g) ≥ support(g')
If support(g) < m, stop exploring! Supergraphs of g are not generated
Support(g): number of occurrences of pattern g
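A minimal sketch of enumeration with support-based pruning, written for itemsets rather than graphs so that it stays short (no DFS codes or isomorphism checks are needed). The function name `frequent_patterns` and the toy database are illustrative assumptions, not gSpan itself.

```python
def frequent_patterns(transactions, min_support):
    """Depth-first enumeration of all itemsets occurring in >= min_support transactions."""
    items = sorted({i for t in transactions for i in t})
    found = {}

    def support(pattern):
        # Number of transactions that contain the whole pattern.
        return sum(1 for t in transactions if pattern <= t)

    def expand(pattern, start):
        # Children extend the pattern by one item, in a fixed item order,
        # so every itemset is visited at most once (tree-shaped search space).
        for idx in range(start, len(items)):
            child = pattern | {items[idx]}
            s = support(child)
            if s < min_support:
                continue  # anti-monotonicity: no superset of child can be frequent
            found[frozenset(child)] = s
            expand(child, idx + 1)

    expand(frozenset(), 0)
    return found

db = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "D"}]
result = frequent_patterns(db, min_support=2)
```

With `min_support=2`, the branch below {A, B} is cut as soon as {A, B, C} is found to occur only once, so its supersets are never generated.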
Discriminative patterns: Weighted Substructure Mining
w_i > 0: positive class
w_i < 0: negative class
Weighted Substructure Mining
– Patterns with large frequency difference
– Not anti-monotonic: use a bound for pruning
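A minimal sketch of weighted mining with a pruning bound, again on itemsets instead of graphs. It assumes the score of a pattern is the absolute weighted frequency |Σ w_i x_i| (the slide's formula did not survive extraction, so this scoring function is an assumption). The bound used is the standard one: any extension of a pattern occurs in a subset of the same examples, so its score cannot exceed the larger of the positive and negative weight mass of those examples.

```python
def best_weighted_pattern(transactions, weights):
    """Branch-and-bound search for the itemset maximizing |sum_i w_i * x_i|."""
    items = sorted({i for t in transactions for i in t})
    best = {"pattern": None, "gain": 0.0}

    def expand(pattern, start, rows):
        for idx in range(start, len(items)):
            item = items[idx]
            child_rows = [i for i in rows if item in transactions[i]]
            if not child_rows:
                continue
            pos = sum(weights[i] for i in child_rows if weights[i] > 0)
            neg = -sum(weights[i] for i in child_rows if weights[i] < 0)
            gain = abs(pos - neg)
            child = pattern + (item,)
            if gain > best["gain"]:
                best["pattern"], best["gain"] = child, gain
            # Bound: every superpattern scores at most max(pos, neg).
            if max(pos, neg) > best["gain"]:
                expand(child, idx + 1, child_rows)

    expand((), 0, range(len(transactions)))
    return best["pattern"], best["gain"]

db = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"C"}]
w = [1, 1, -1, -1]  # two positive and two negative examples
pattern, gain = best_weighted_pattern(db, w)
```

Unlike the frequency case, a pattern with a small score can have an extension with a large one, which is why the search prunes on the bound rather than on the score itself.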
Graph PLS
Problem: QSAR (Quantitative Structure Activity Relationship)
• Input: Set of chemical compounds
• Output: Activity
[Figure: chemical compounds with known activities (3, 1.2) and one compound with unknown activity (?)]
Binary representation of feature vector
• Binary vector of pattern indicators
• Huge dimensionality
– Too expensive to mine at once
– Progressive feature collection
[Figure: a chemical compound and its subgraph patterns, mapped to the vector (1, 0, …, 1, 0, …, 1, 0, 0, …)]
PLS (Partial Least Squares Regression)
• Design matrix: X
• Response: y
• Dimensionality reduction
– Weight matrix W: Project examples to a lower dimensional space
– Latent components T: Projected examples, T = XW
• Learn W to maximize the covariance between y and T
• How to perform PLS without accessing all columns of X?
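For reference, a minimal sketch of ordinary PLS regression on a dense matrix, using the textbook NIPALS-style PLS1 with deflation. This is not necessarily the variant used in gPLS (the algorithm slide did not survive extraction); it assumes rows of X are examples, and the name `pls1_fit` is illustrative.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Textbook PLS1: returns regression coefficients for centered X and y, plus the weights."""
    Xk = X - X.mean(axis=0)
    yc = y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yc                  # direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                     # latent component (projected examples)
        tt = t @ t
        p = Xk.T @ t / tt              # loading of X on t
        q.append(yc @ t / tt)          # regression of y on t
        Xk = Xk - np.outer(t, p)       # deflation: remove what t explains
        W.append(w)
        P.append(p)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, W

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((30, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(30)
beta_full, W_demo = pls1_fit(X_demo, y_demo, n_components=3)
```

Note that every step touches X only through products of the form X^T (vector) and X (vector), which is what makes it possible to replace the first product by a pattern search.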
Graph PLS
• PLS calls graph mining repeatedly
• In the i-th iteration,
– A set of graph patterns corresponding to the i-th latent component t_i is collected by weighted subgraph mining
[Figure: PLS passes weights to gSpan, gSpan returns patterns; latent components 1 to 6]
PLS algorithm
Idea of structurization
• Pre-weight vector computation (step 2)
• Collect the features with major pre-weights by weighted substructure mining
• Other pre-weights are set to zero
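A minimal sketch of the truncation described above, on an explicit matrix. It assumes the pre-weight vector has the form X^T r for some per-example vector r (for instance the current residual of y); both r and the function name `sparse_preweights` are assumptions for illustration. In the structured setting X is never materialized, and the indices kept here would instead be returned by weighted substructure mining with example weights r.

```python
import numpy as np

def sparse_preweights(X, r, k):
    """Keep the k pre-weights of largest magnitude; set all others to zero."""
    w = X.T @ r                          # pre-weight of every feature
    keep = np.argsort(-np.abs(w))[:k]    # indices of the k largest |w_j|
    w_sparse = np.zeros_like(w)
    w_sparse[keep] = w[keep]
    return w_sparse, np.sort(keep)

X_small = np.array([[1, 0, 1],
                    [1, 1, 0],
                    [0, 1, 0],
                    [0, 1, 1]], dtype=float)
r_small = np.array([2.0, 1.0, -1.0, -3.0])
w_sparse, kept = sparse_preweights(X_small, r_small, k=2)
```

Each pre-weight is a weighted frequency of one pattern, so finding the largest ones is exactly a weighted mining problem.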
Graph PLS
Classification/regression performance
• 6 chemical compound datasets, 5-fold cross-validation
• Accuracies of gPLS are competitive, but gPLS is substantially faster.
– Numerical computation much simpler than LP
Table columns, left to right: gBoost (AUC/Q2, Numerical Time, Mining Time); gPLS (AUC/Q2, Numerical Time, Mining Time); data size; dataset
over 24h 0.8835095710039965 AIDS3
over 24h 0.7471675030040939 AIDS2
0.7522997830.7730.06522901324 AIDS1
0.86739186300.87014.13579437 CAS
0.86234422.80.8620.47426.8684 CPDB
0.63983.315.60.6470.002516.059 EDKB
gPLS efficiency
• “Frequent mining + PLS” vs gPLS
• gPLS scales better for larger maximum pattern size threshold (maxpat)
Latent components by gPLS
[Figure: latent components 1 to 10]
Summary (gPLS)
• Introduction of a new PLS algorithm for use with graph mining.
• High efficiency compared with the naïve counterpart.
• Graph mining algorithm can be replaced with other frequent pattern mining algorithms such as itemset mining, sequence mining and tree mining.
Graph PCA
Graph PCA method
Design matrix X
Gram matrix $A = XX^\top$
Derive eigenvalues and eigenvectors of A
Equivalent to kernel PCA with linear kernel
How to perform eigendecomposition without accessing all columns of X?
Lanczos algorithm
Find an orthonormal matrix Q such that $T = Q^\top A Q$ is a tridiagonal matrix
Q = (q1,…,qm): Lanczos vectors
Then, T is easily eigendecomposed
Eigenvalues preserved in T
Eigenvectors of A are recovered as $Qy$, where $y$ is an eigenvector of T
Lanczos method for computing all eigenvectors
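A minimal sketch of the Lanczos iteration with full reorthogonalization, applied to a Gram matrix of the form X X^T. The algorithm shown on the slide did not survive extraction, so this is the textbook procedure rather than the authors' exact listing; the names `lanczos` and `matvec` and the random binary matrix are illustrative assumptions. The matrix is touched only through `matvec`, which multiplies by X^T and then by X.

```python
import numpy as np

def lanczos(matvec, q1, m, tol=1e-10):
    """Run up to m Lanczos steps; returns Q (n x k) and tridiagonal T (k x k)."""
    n = q1.shape[0]
    Q = np.zeros((n, m))
    alpha, beta = [], []
    q = q1 / np.linalg.norm(q1)
    for j in range(m):
        Q[:, j] = q
        v = matvec(q)
        alpha.append(q @ v)
        # Full reorthogonalization against all Lanczos vectors found so far.
        v = v - Q[:, : j + 1] @ (Q[:, : j + 1].T @ v)
        b = np.linalg.norm(v)
        if j == m - 1 or b < tol:
            break  # requested size reached, or an invariant subspace was found
        beta.append(b)
        q = v / b
    k = len(alpha)
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return Q[:, :k], T

rng = np.random.default_rng(0)
X_bin = rng.integers(0, 2, size=(8, 30)).astype(float)  # 8 examples, 30 binary features
A = X_bin @ X_bin.T
Q, T = lanczos(lambda q: X_bin @ (X_bin.T @ q), rng.standard_normal(8), m=8)

# Recover the leading eigenpair of A from the small tridiagonal problem.
evals, Y = np.linalg.eigh(T)
lam_top, u_top = evals[-1], Q @ Y[:, -1]
```

Stopping after fewer than n steps gives approximations (Ritz pairs) to the leading eigenpairs, which is the regime of interest when only a few principal components are needed.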
Importance of initial Lanczos vector q1
If q1 corresponds to the first eigenvector, Lanczos finishes in n iterations, creating all eigenvectors
Typically, the first vector is not known, so more iterations are necessary
If one needs only major eigenvectors, early stopping is possible
Major eigenvalues converge faster
Number of iterations determined by convergence bound (Saad, 1980)
Idea of structurization
Substitute $A = XX^\top$
Lanczos accesses X only through $g_k = X^\top q_k$
Collect the features whose $|g_{ki}|$ are large
Implemented by weighted subgraph mining
Three dimensional plot of EDKB-ER
Principal Components of EDKB-ER
Principal Components of EDKB-AR
Summary (gPCA)
In gPCA, patterns are summarized in several components
Each component characterizes a different aspect of the graph database
Easier to browse than frequent patterns
Conclusion
Combination of mining and learning algorithms results in a very powerful framework
Interpretability is the key
Successful applications to biological and computer vision domains
Further applications possible to many different fields
Publication and software
iboost: itemset boosting http://www.kyb.mpg.de/bs/people/hiroto/iboost/
gboost: graph boosting http://www.kyb.mpg.de/bs/people/nowozin/gboost/
pboost: sequence boosting http://www.kyb.mpg.de/bs/people/nowozin/pboost/
H. Saigo et al., KDD 2008
H. Saigo and K. Tsuda, ICDM 2008
H. Saigo et al., Machine Learning, to appear
H. Saigo et al., Bioinformatics, 2007
S. Nowozin et al., CVPR 2007
S. Nowozin et al., ICCV 2007
K. Tsuda, ICML 2006, ICML 2007