C.Watters CS6403 1
Clustering
• What
• Why
• How
• Results
Clustering
• Assign items to groups based on some calculation of degree of likeness between items
• Groups are not known beforehand
• Uses multivariate analysis techniques
• Feature set determination critical
Example
• News data
• Sports, world news, entertainment, etc.
• Short items, items with photos, items with names
Why
• Improve efficiency of retrieval
• Improve effectiveness of retrieval
• Ranking of retrieved results
• Visualization of results
• Kohonen maps, i.e. SOM (self-organizing maps)
• Discovery of content
• Discovery of relationships
How
• Put items into groups so that members have a high degree of association within the group
• AND items have low degree of association with items in other groups
• Association for IR documents?
• Feature set?
Feature Sets for IR Clustering
• Term occurrences
• Citations
• Names
• Structure (tags)
• Co-occurrences (thesaurus construction)
Problems
• Choosing the best feature set
• Choosing the similarity measure
• Evaluation of results
• Updates
• Searching clusters
Measures of Similarity
• Need to quantify the degree of association of an item with others
• Generally want a measure that is normalized by document vector length
• Not clear that weighted document terms are better than binary ones in clustering
General Measures
• Dice coefficient
• Jaccard Coefficient
• Cosine Coefficient
Dice Coefficient
• Binary weights
$S_{ij} = \dfrac{2C}{A + B}$
C = number of terms in common, A = number of terms in i, B = number of terms in j
Jaccard Coefficient
• Binary Weights
$S_{ij} = \dfrac{C}{A + B - C}$
C = number of terms in common, A = number of terms in i, B = number of terms in j
Cosine Coefficient
• Binary weights
$S_{ij} = \dfrac{C}{\sqrt{A \cdot B}}$
C = number of terms in common, A = number of terms in i, B = number of terms in j
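The three coefficients can be sketched in Python for binary weights. This is only an illustration: it assumes the usual set-based forms (Dice = 2C/(A+B), Jaccard = C/(A+B-C), cosine = C/sqrt(A·B)), and the helper names and sample documents are made up for the example.

```python
import math

def overlap_counts(terms_i, terms_j):
    """Return (C, A, B): shared terms, terms in i, terms in j (binary weights)."""
    set_i, set_j = set(terms_i), set(terms_j)
    return len(set_i & set_j), len(set_i), len(set_j)

def dice(terms_i, terms_j):
    c, a, b = overlap_counts(terms_i, terms_j)
    return 2 * c / (a + b) if a + b else 0.0

def jaccard(terms_i, terms_j):
    c, a, b = overlap_counts(terms_i, terms_j)
    return c / (a + b - c) if a + b - c else 0.0

def cosine(terms_i, terms_j):
    c, a, b = overlap_counts(terms_i, terms_j)
    return c / math.sqrt(a * b) if a and b else 0.0

# Hypothetical documents: C = 2, A = 4, B = 3
doc_i = ["cluster", "document", "term", "vector"]
doc_j = ["cluster", "term", "query"]
for measure in (dice, jaccard, cosine):
    print(measure.__name__, round(measure(doc_i, doc_j), 3))
```

All three are normalized by document length, so each lies between 0 (no terms in common) and 1 (identical term sets).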
Now what?
• Need to be able to compare any doc to any other doc
• Need?
Doc-Doc Similarity Matrix (entry ij is the similarity of doc i to doc j):
11 12 13 14 15
21 22 23 24 25
31 32 33 34 35
41 42 43 44 45
51 52 53 54 55
Generating Similarity Matrix
• Use inverted file
• Documents with no terms in common do not need similarity calculation
• Generally generate only one row at a time as needed
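One way to read the points above is the following Python sketch, which builds a single row of the similarity matrix from an inverted file. The function names, the sample documents, and the choice of the Dice coefficient are assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the set of doc ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in set(terms):
            inverted[term].add(doc_id)
    return inverted

def similarity_row(doc_id, docs, inverted):
    """Dice similarity of doc_id to every doc that shares at least one term."""
    common = defaultdict(int)  # other doc id -> number of shared terms (C)
    for term in set(docs[doc_id]):
        for other in inverted[term]:
            if other != doc_id:
                common[other] += 1
    size = len(set(docs[doc_id]))
    # Docs with no terms in common never appear in `common`, so they are skipped.
    return {other: 2 * c / (size + len(set(docs[other])))
            for other, c in common.items()}

# Hypothetical collection
docs = {
    1: ["cluster", "term", "vector"],
    2: ["cluster", "query"],
    3: ["photo", "sports"],
}
inverted = build_inverted_file(docs)
row = similarity_row(1, docs, inverted)
print(row)
```

Only the postings lists of the terms in the chosen document are visited, so doc 3, which shares no terms with doc 1, costs nothing.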
Algorithms
• Problem: sort N things into M groups, where 1 ≤ M ≤ N
• Choice of algorithm determines
– M
– membership
General Classes of Algorithms
• Hierarchical
– Nested groups
– Pairwise connections made
• Non-hierarchical
– No overlap
– Centroid
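As an illustration of the hierarchical class, here is a minimal Python sketch of single-link agglomerative clustering. Single link is one specific choice that the slides do not name; the function name, the threshold parameter, and the similarity values are hypothetical.

```python
def single_link(sim, threshold):
    """Merge clusters while the best pairwise similarity is >= threshold.

    sim: dict mapping frozenset({i, j}) -> similarity of docs i and j.
    Returns (final clusters, merge history); the history records the nesting.
    """
    docs = sorted({d for pair in sim for d in pair})
    clusters = [frozenset([d]) for d in docs]
    history = []
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                link = max(sim.get(frozenset([a, b]), 0.0)
                           for a in clusters[x] for b in clusters[y])
                if link > best:
                    best, best_pair = link, (x, y)
        if best < threshold:
            break
        x, y = best_pair
        history.append((clusters[x], clusters[y], best))
        merged = clusters[x] | clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, history

# Hypothetical similarities; missing pairs are treated as 0.0
sim = {
    frozenset([1, 2]): 0.9,
    frozenset([2, 3]): 0.6,
    frozenset([1, 3]): 0.2,
    frozenset([3, 4]): 0.1,
}
clusters, history = single_link(sim, threshold=0.5)
print(clusters)
```

Each merge joins two existing groups, so earlier groups are nested inside later ones. A non-hierarchical method would instead assign each document to exactly one group, typically the one with the closest centroid.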
Evaluation of results
• Was the method appropriate for the data set?
• Do the clusters represent the data well?
• Are the docs in the right clusters?
How to test?
• Overlap test: run a known query set and evaluate against known results
• Randomly select docs and judge relevance to group members
• Examine distribution of docs in groups
• Density test = term occurrences / (docs × unique terms)
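The density test can be sketched in Python as below. It assumes that "term occurrences" means the number of (doc, term) pairs, counting a term once per document; the function name and sample cluster are hypothetical.

```python
def density(docs):
    """Postings divided by (number of docs * number of unique terms)."""
    vocab = set()
    postings = 0
    for terms in docs:
        distinct = set(terms)
        vocab |= distinct
        postings += len(distinct)  # each term counted once per doc (assumption)
    if not docs or not vocab:
        return 0.0
    return postings / (len(docs) * len(vocab))

# Hypothetical cluster: 5 postings, 3 docs, 3 unique terms
cluster = [["cluster", "term"], ["cluster", "query"], ["term"]]
print(round(density(cluster), 3))
```

A value near 1 means most documents in the group share most of the vocabulary; a value near 0 means the group's term-document matrix is sparse.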
Concepts to keep in mind
• Cluster hypothesis
• Nearest neighbour
• Centroid
Cluster Hypothesis
• Associations between documents are related to the relevance of documents to queries
• Van Rijsbergen, 1979
Nearest Neighbour
• Find the document most similar to the given one
• That document is the most likely to be closely related
• Works with terms, citations, & clusters
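A minimal Python sketch of nearest-neighbour lookup over term sets follows. The Jaccard coefficient is used only as an example measure, and the function names and documents are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient of two term collections (binary weights)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def nearest_neighbour(doc_id, docs, similarity=jaccard):
    """Return (id, score) of the doc most similar to doc_id, or None."""
    best = None
    for other, terms in docs.items():
        if other == doc_id:
            continue
        score = similarity(docs[doc_id], terms)
        if best is None or score > best[1]:
            best = (other, score)
    return best

# Hypothetical collection
docs = {
    "a": ["cluster", "term", "vector"],
    "b": ["cluster", "term", "query"],
    "c": ["photo", "sports"],
}
print(nearest_neighbour("a", docs))
```

The same loop works for citations or for cluster representatives: only the feature sets passed to the similarity function change.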
Centroids
• Representative of a cluster
• May be a document from that cluster
• May be a composite of doc features from that cluster
• Why:
– query-centroid calculations
– higher level representations of data set
– build ontologies and thesauri
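The composite form of a centroid, and a query-centroid calculation, might look like the Python sketch below. It assumes binary term weights averaged over the cluster and a cosine match; the cluster names, documents, and function names are hypothetical.

```python
import math
from collections import Counter

def centroid(cluster_docs):
    """Composite centroid: average binary term weight across the cluster."""
    counts = Counter()
    for terms in cluster_docs:
        counts.update(set(terms))
    n = len(cluster_docs)
    return {term: k / n for term, k in counts.items()}

def cosine(vec_a, vec_b):
    """Cosine of two sparse vectors stored as term -> weight dicts."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_cluster(query_terms, clusters):
    """Compare the query to one centroid per cluster instead of every doc."""
    query = {t: 1.0 for t in set(query_terms)}
    return max(clusters, key=lambda name: cosine(query, centroid(clusters[name])))

# Hypothetical clusters
clusters = {
    "sports": [["goal", "team"], ["team", "score"]],
    "world": [["election", "vote"], ["vote", "treaty"]],
}
print(best_cluster(["team", "goal"], clusters))
```

Matching a query against centroids first, and only then against the members of the best cluster, is one way clustering can improve retrieval efficiency.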
Visualization of Clusters
• Kohonen Maps
• Star maps
• SOM (self organizing maps)
• Etc
Samples
Cluster Map
Starfield