Small World Clustering Algorithms
![Page 1: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/1.jpg)
Small World Clustering Algorithms
Brant Chee
![Page 2: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/2.jpg)
Experiments
3 clustering algorithms:
Complete Link (Cluto)
K-means (Cluto)
Small World
![Page 3: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/3.jpg)
Test Collections

| Collection | Search Terms | Number of Abstracts | Number of Terms |
|---|---|---|---|
| C1 | plasticity OR acetylcholine | 81,746 | 267,981 |
| C2 | microarray OR muscarinic OR plasticity OR ((cholinergic OR noradrenergic) AND receptor) | 74,533 | 285,623 |
![Page 4: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/4.jpg)
Experimental Setup
Parameters left at package defaults.
Clustered with n = 50, 100, 150 and 200.
Clusters with fewer than 4 elements or more than 50 elements were eliminated, and the clustering which resulted in fewer than 40 clusters was chosen to be evaluated.
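
The selection rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the dictionary layout are hypothetical, and the slides do not say which clustering wins when several values of n leave fewer than 40 clusters, so the sketch assumes the largest qualifying n is kept.

```python
def filter_clusters(clusters, min_size=4, max_size=50):
    """Drop clusters with fewer than min_size or more than max_size elements."""
    return [c for c in clusters if min_size <= len(c) <= max_size]


def choose_clustering(clusterings_by_n, max_clusters=40):
    """Pick a clustering that leaves fewer than max_clusters clusters after filtering.

    clusterings_by_n maps the requested number of clusters n (e.g. 50, 100,
    150, 200) to a list of clusters, each cluster being a list of items.
    Assumption: when several n qualify, the largest one is returned.
    Returns (n, kept_clusters), or None if no clustering qualifies.
    """
    best = None
    for n in sorted(clusterings_by_n):
        kept = filter_clusters(clusterings_by_n[n])
        if len(kept) < max_clusters:
            best = (n, kept)
    return best


# Toy usage: n=50 leaves 10 clusters after filtering, n=100 leaves 45 (too many).
toy = {
    50: [list(range(5))] * 10 + [list(range(2))] * 2,
    100: [list(range(4))] * 45,
}
chosen_n, kept = choose_clustering(toy)
```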
![Page 5: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/5.jpg)
Quantitative Results
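| Collection | Algorithm | Threshold | Running Time (s) |
|---|---|---|---|
| C1 | SW | N/A | 40.54 |
| C1 | C-Link | 50 | 214.106 |
| C1 | K-Means | 200 | 11.581 |
| C2 | SW | N/A | 47.35 |
| C2 | C-Link | 100 | 198.147 |
| C2 | K-Means | 200 | 5.538 |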
![Page 6: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/6.jpg)
Quantitative Results II
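| Collection | Algorithm | # of Clusters | Avg. # of Terms per Cluster | Avg. # of Documents per Cluster |
|---|---|---|---|---|
| C1 | SW | 21 | 6 | 15,413 |
| C1 | C-Link | 22 | 7 | 12,466 |
| C1 | K-Means | 11 | 39 | 4,425 |
| C2 | SW | 40 | 12 | 10,258 |
| C2 | C-Link | 28 | 6 | 25,070 |
| C2 | K-Means | 38 | 30 | 11,978 |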
![Page 7: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/7.jpg)
Qualitative Evaluation
2 criteria: Utility and Coherence
3-point scale: 1 good, 2 poor, 3 bad
Good: >60% of articles
Poor: 59-41%
Bad: <40%
Evaluate terms in cluster to get context.
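
As a rough illustration, the 3-point scale can be written as a small function. The function name is hypothetical, and the thresholds on the slide leave exactly 60% and 40% unassigned; this sketch assumes those boundary values fall into the middle ("poor") band.

```python
def rate_cluster(percent_of_articles):
    """Map a percentage of articles to the 3-point scale (1 good, 2 poor, 3 bad).

    Assumption: exactly 60% and exactly 40% are not covered by the stated
    thresholds, so they are treated as "poor" here.
    """
    if percent_of_articles > 60:
        return 1  # good
    if percent_of_articles < 40:
        return 3  # bad
    return 2      # poor


scores = [rate_cluster(p) for p in (75, 50, 10)]
```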
![Page 8: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/8.jpg)
Quantitative Results Cont…
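Number of clusters receiving each score (1 good, 2 poor, 3 bad):

| Collection | Criterion | Score | SW | C-Link | K-Means |
|---|---|---|---|---|---|
| C1 | Utility | 3 | 18 | 22 | 9 |
| C1 | Utility | 2 | 1 | 0 | 1 |
| C1 | Utility | 1 | 2 | 0 | 1 |
| C1 | Coherence | 3 | 7 | 13 | 7 |
| C1 | Coherence | 2 | 6 | 5 | 3 |
| C1 | Coherence | 1 | 8 | 4 | 1 |
| C2 | Utility | 3 | 37 | 28 | 38 |
| C2 | Utility | 2 | 2 | 0 | 0 |
| C2 | Utility | 1 | 1 | 0 | 0 |
| C2 | Coherence | 3 | 9 | 18 | 38 |
| C2 | Coherence | 2 | 21 | 9 | 0 |
| C2 | Coherence | 1 | 10 | 1 | 0 |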
![Page 9: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/9.jpg)
Sample Session
![Page 10: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/10.jpg)
![Page 11: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/11.jpg)
![Page 12: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/12.jpg)
![Page 13: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/13.jpg)
![Page 14: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/14.jpg)
Other Approaches
Statistical Methods
![Page 15: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/15.jpg)
Other Clustering Approaches
Can we choose other types of clustering algorithms which could provide better quality results or better cluster labels?
SOM (Self-Organizing Map): slow for high numbers of dimensions and large numbers of objects.
Carrot2: slow for large numbers of items; huge memory consumption.
![Page 16: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/16.jpg)
Random Projection
Can we reduce the dimensionality of vectors (e.g. 50,000 → 1,000) while preserving distances?
Speeds up similarity calculations.
Various methods: random projection, latent semantic indexing, multidimensional scaling.
![Page 17: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/17.jpg)
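Very Sparse Random Projections

Let $A \in \mathbb{R}^{n \times D}$ be our $n$ points in $D$ dimensions.
Multiply $A$ by a random matrix $R \in \mathbb{R}^{D \times k}$ whose entries are in $\{-1, 0, 1\}$ with probabilities

$$\left\{ \frac{1}{2\sqrt{D}},\; 1 - \frac{1}{\sqrt{D}},\; \frac{1}{2\sqrt{D}} \right\}$$

Cost: $O(nDk + n^2 k)$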
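
A minimal NumPy sketch of a very sparse random projection follows. It assumes the entry probabilities are 1/(2√D), 1 − 1/√D and 1/(2√D) for −1, 0 and +1. It also assumes the usual rescaling by sqrt(√D / k), which the slide does not show, so that squared lengths are preserved in expectation. The function names are hypothetical, and a dense matrix product is used for brevity where a real implementation would exploit the sparsity of R.

```python
import numpy as np


def sparse_sign_matrix(D, k, rng):
    """Draw a D x k matrix with entries -1, 0, +1.

    Probabilities: 1/(2*sqrt(D)) for -1 and +1, 1 - 1/sqrt(D) for 0.
    """
    p = 1.0 / (2.0 * np.sqrt(D))
    return rng.choice([-1.0, 0.0, 1.0], size=(D, k), p=[p, 1.0 - 2.0 * p, p])


def very_sparse_projection(A, k, seed=0):
    """Project the n x D matrix A down to n x k.

    Assumption: the result is scaled by sqrt(sqrt(D) / k) so that each row's
    squared norm is preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    n, D = A.shape
    signs = sparse_sign_matrix(D, k, rng)
    scale = np.sqrt(np.sqrt(D) / k)
    return scale * (A @ signs)


# Usage: 20 random points in 2,500 dimensions reduced to 500 dimensions.
points = np.random.default_rng(1).normal(size=(20, 2500))
reduced = very_sparse_projection(points, 500)
```

With D = 2,500 only about 2% of the entries of R are nonzero, which is where the speedup in the matrix product comes from.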
![Page 18: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/18.jpg)
Reducing Dimensionality
Bank Dataset: 11,000 articles from 11 categories in Dmoz.
11,000 articles reduced from 30K terms, 1GB heap, in 11s.
Increase in Purity and decrease in Entropy (measures of clustering quality).

| Matrix | Entropy | Purity |
|---|---|---|
| Original | 0.975 | 0.146 |
| 512_1 | 0.584 | 0.476 |
| 512_2 | 0.589 | 0.495 |
| 512_3 | 0.62 | 0.502 |
| 1000_1 | 0.533 | 0.532 |
| 1000_2 | 0.544 | 0.496 |
| 1000_3 | 0.546 | 0.485 |
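
The slides do not define Purity and Entropy. The sketch below uses the definitions commonly applied to document clustering: purity is the size-weighted fraction of each cluster taken up by its majority class, and entropy is the size-weighted class entropy per cluster, normalized by the log of the number of classes. Whether the figures in the table were computed exactly this way is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_ids, class_labels):
    """Return (purity, entropy) for a clustering against known class labels.

    purity  = (1/n) * sum over clusters of the majority-class count
    entropy = sum over clusters of (n_r/n) * H(cluster) / log(q),
              where q is the number of distinct classes.
    """
    n = len(class_labels)
    q = len(set(class_labels))
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        by_cluster[cluster][label] += 1

    purity = 0.0
    entropy = 0.0
    for counts in by_cluster.values():
        n_r = sum(counts.values())
        purity += max(counts.values()) / n
        if q > 1:
            h = -sum((m / n_r) * math.log(m / n_r) for m in counts.values())
            entropy += (n_r / n) * h / math.log(q)
    return purity, entropy


# A perfect clustering has purity 1 and entropy 0; lumping two equally
# sized classes into one cluster gives purity 0.5 and entropy 1.
perfect = purity_and_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
lumped = purity_and_entropy([0, 0, 0, 0], ["a", "a", "b", "b"])
```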
![Page 19: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/19.jpg)
MI on Phrases
More context than single words
More meaningful term clusters
![Page 20: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/20.jpg)
Other Approaches
Knowledge Intensive Approaches
![Page 21: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/21.jpg)
Hypernym
“Is-a” relationship:
Shakespeare is an author.
A pug is a dog.
Implicitly hierarchical.
Basis of many ontologies and semantic networks: WordNet, UMLS.
![Page 22: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/22.jpg)
Portion of the UMLS Semantic Network: Biologic Function
![Page 23: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/23.jpg)
Hypernym Relations

NP such as {, NP}* {(or | and)} NP
  Vegetables such as Beets, Carrots and Peas.
Such NP as {NP,}* {(or|and)} NP
  …works by such authors as Herrick, Goldsmith and Shakespeare.
NP {, NP}* {,} or|and other NP
  Bruises, …, broken bones or other injuries
NP {,} including {NP,}* {or|and} NP
  All common-law countries, including Canada and England …
NP {,} especially {NP,}* {or|and} NP
  … most European countries, especially France, England and Spain.
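
Three of the lexical patterns above ("such as", "including", "especially") can be approximated with regular expressions. This is a deliberately simplified sketch, not the method used for the results in these slides: it assumes no noun-phrase chunker is available, so only the single word before the cue is taken as the hypernym, and the hyponym list is split on commas and "and"/"or". All names are hypothetical.

```python
import re

WORD = r"[A-Za-z][\w-]*"

# Each pattern captures the word before the cue and the list that follows it.
PATTERNS = [
    re.compile(rf"(?P<hyper>{WORD}) such as (?P<hypos>[^.;:]+)"),
    re.compile(rf"(?P<hyper>{WORD}),? (?:including|especially) (?P<hypos>[^.;:]+)"),
]

# Separators inside the hyponym list: ", and", ", or", ",", " and ", " or ".
SPLIT = re.compile(r"\s*,\s*(?:and|or)\s+|\s*,\s*|\s+(?:and|or)\s+")


def extract_hypernyms(sentence):
    """Return lowercased (hypernym, hyponym) pairs found in one sentence."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            hyper = match.group("hyper").lower()
            for hypo in SPLIT.split(match.group("hypos")):
                hypo = hypo.strip().lower()
                if hypo:
                    pairs.append((hyper, hypo))
    return pairs


veg = extract_hypernyms("Vegetables such as Beets, Carrots and Peas.")
eur = extract_hypernyms("Most European countries, especially France, England and Spain.")
```

Because only the head word is captured, "European countries" is reduced to "countries"; a real system would run a noun-phrase chunker first.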
![Page 24: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/24.jpg)
Uses of Hypernym Trees
Search: query expansion, faceted metadata
Clustering: parent node defines a cluster
Keyword assignment
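
As a small illustration of "parent node defines a cluster", extracted (hypernym, hyponym) pairs can be grouped by their hypernym. The function name is hypothetical, and the sketch only covers a single level of the tree.

```python
from collections import defaultdict


def clusters_from_hypernyms(pairs):
    """Group hyponyms under their hypernym; each hypernym names one cluster."""
    clusters = defaultdict(list)
    for parent, child in pairs:
        if child not in clusters[parent]:
            clusters[parent].append(child)
    return dict(clusters)


example = clusters_from_hypernyms([
    ("organs", "liver"),
    ("organs", "kidney"),
    ("substances", "cortisone"),
    ("substances", "zinc"),
    ("organs", "liver"),  # duplicates are ignored
])
```

The same mapping could also serve keyword assignment, with the parent term used as the keyword for every document mentioning one of its children.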
![Page 25: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/25.jpg)
Trivial Hypernyms

| Hypernym | Hyponym |
|---|---|
| organic compounds | d-ribose |
| organic compounds | d-arabinose |
| organic compounds | l-arabinose |
| organic compounds | sucrose |
| substances | cortisone |
| substances | vitamins a and c |
| substances | zinc |
| organs | liver |
| organs | kidney |
| sugar-containing products | honey |
| sugar-containing products | jam |
| sugar-containing products | glucose |
| sugar-containing products | fruit juice concentrates |
| sugar-containing products | tomato |
| largely populated countries | china |
| largely populated countries | russia |
![Page 26: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/26.jpg)
Bad Hypernyms

| Hypernym | Hyponym |
|---|---|
| suicidal patients | appears |
| other agents | plasmin |
| other agents | plasminogen |
| such common sensations | illness |
| phenomena | founder effects |
| phenomena | migration |
| phenomena | gene flow |
| clinical manifestations | 80 |
| chemical agents | homocystine |
| no other explanation | anencephaly |
| conditions | azure a-0.5 % nahco3 solution |
| conditions | ph 8.1 |
| fewer side-effects | vegetative disfunction |
| techniques | carpentier |
| techniques | 's ring |
![Page 27: Small World Clustering Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062721/56813655550346895d9dde14/html5/thumbnails/27.jpg)
Good? Hypernyms

| Hypernym | Hyponym |
|---|---|
| entirely synthetic steroids | norgestrel and quingestanol |
| menstrual disorders | metrorrhagia |
| menstrual disorders | oligoamenorrhea |
| menstrual disorders | amenorrhea |
| mild venous disorders | swollen veins |
| mild venous disorders | heavy limbs |
| mild venous disorders | varicosities |
| obstructive pulmonary lung diseases | alveolar proteinosis |
| obstructive pulmonary lung diseases | pneumonia |
| obstructive pulmonary lung diseases | asthma |
| obstructive pulmonary lung diseases | bronchiectasis |
| obstructive pulmonary lung diseases | cystic fibrosis |
| choline analogues | n,n'-dimethylethanolamine |
| choline analogues | n-monomethylethanolamine |
| choline analogues | ethanolamine |
| 3alpha-oh-containing steroids | androsterone |