C.Watters CS6403 1

Clustering

• What

• Why

• How

• Results

Clustering

• Assign items to groups based on some calculation of degree of likeness between items

• Groups are not known beforehand

• Uses multivariate analysis techniques

• Feature set determination critical

Example

• News data

• Sports, world news, entertainment, etc.

• Short items, items with photos, items with names

Why

• Improve efficiency of retrieval

• Improve effectiveness of retrieval

• Ranking of retrieved results

• Visualization of results

• Kohonen maps and SOM (self-organizing maps)

• Discovery of content

• Discovery of relationships

How

• Put items into groups so that members have a high degree of association within the group

• AND items have low degree of association with items in other groups

• Association for IR documents?

• Feature set?

Feature Sets for IR Clustering

• Term occurrences

• Citations

• Names

• Structure (tags)

• Co-occurrences (thesaurus construction)
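Two of these feature sets, term occurrences and co-occurrences, can be sketched in a few lines. This is a minimal illustration only: the helper names `term_set` and `cooccurrence_counts` and the whitespace tokenizer are assumptions for the example, not part of the lecture.

```python
from collections import Counter
from itertools import combinations


def term_set(text):
    """Binary term-occurrence features: the set of lowercased terms in a text.

    Assumption: naive whitespace tokenization with trailing punctuation stripped.
    """
    return {word.strip(".,;:!?").lower() for word in text.split()} - {""}


def cooccurrence_counts(docs):
    """Count in how many documents each pair of terms appears together."""
    pairs = Counter()
    for terms in docs:
        # Sort so that each pair is always stored in the same order.
        pairs.update(combinations(sorted(terms), 2))
    return pairs


texts = ["Hockey team wins", "Hockey team loses", "Film wins award"]
docs = [term_set(t) for t in texts]
counts = cooccurrence_counts(docs)
print(counts[("hockey", "team")])  # number of documents containing both terms
```

Pairs with high co-occurrence counts are the raw material a thesaurus builder would inspect.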

Problems

• Choosing the best feature set

• Choosing the similarity measure

• Evaluation of results

• Updates

• Searching clusters

Measures of Similarity

• Need to quantify the degree of association of an item with others

• Generally want a measure that is normalized by document vector length

• Not clear that weighted document terms are better than binary ones in clustering

General Measures

• Dice coefficient

• Jaccard Coefficient

• Cosine Coefficient

Dice Coefficient

• Binary weights

$$\mathrm{Dice}(i, j) = \frac{2C}{A + B}$$

where C = number of terms in common, A = number of terms in i, and B = number of terms in j

Jaccard Coefficient

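• Binary weights

$$\mathrm{Jaccard}(i, j) = \frac{C}{A + B - C}$$

where C = number of terms in common, A = number of terms in i, and B = number of terms in j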


Cosine Coefficient

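• Binary weights

$$\mathrm{Cosine}(i, j) = \frac{C}{\sqrt{A \cdot B}}$$

where C = number of terms in common, A = number of terms in i, and B = number of terms in j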
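With binary weights, each document can be treated as a set of terms, and all three coefficients reduce to counts of shared and total terms. The following is a minimal sketch under that assumption; the function names and the two sample documents are made up for illustration.

```python
import math


def dice(terms_i, terms_j):
    """Dice coefficient for binary weights: 2C / (A + B)."""
    a, b = len(terms_i), len(terms_j)
    c = len(terms_i & terms_j)  # C: terms in common
    return 2 * c / (a + b) if a + b else 0.0


def jaccard(terms_i, terms_j):
    """Jaccard coefficient for binary weights: C / (A + B - C)."""
    c = len(terms_i & terms_j)
    union = len(terms_i) + len(terms_j) - c
    return c / union if union else 0.0


def cosine(terms_i, terms_j):
    """Cosine coefficient for binary weights: C / sqrt(A * B)."""
    c = len(terms_i & terms_j)
    denom = math.sqrt(len(terms_i) * len(terms_j))
    return c / denom if denom else 0.0


# Hypothetical documents: C = 2, A = 4, B = 3.
doc_i = {"cluster", "document", "retrieval", "similarity"}
doc_j = {"cluster", "document", "ranking"}
for measure in (dice, jaccard, cosine):
    print(measure.__name__, round(measure(doc_i, doc_j), 3))
```

All three divide the overlap by some function of the document sizes, which is the length normalization asked for earlier; they differ only in how strongly they penalize a size mismatch.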


Now what?

• Need to be able to compare any doc to any other doc

• Need? A doc-doc similarity matrix (entry ij = similarity of doc i to doc j):

11 12 13 14 15
21 22 23 24 25
31 32 33 34 35
41 42 43 44 45
51 52 53 54 55

Doc-Doc Similarity Matrix
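A brute-force version of this matrix can be sketched as follows. It assumes binary weights and uses the Jaccard coefficient as the similarity measure; any other measure could be substituted. The name `similarity_matrix` is an assumption for the example.

```python
def jaccard(terms_i, terms_j):
    """Jaccard coefficient for binary weights: C / (A + B - C)."""
    c = len(terms_i & terms_j)
    union = len(terms_i) + len(terms_j) - c
    return c / union if union else 0.0


def similarity_matrix(docs):
    """Compare every document with every other document.

    The matrix is symmetric, so only the upper triangle (j >= i) is
    computed and then mirrored.
    """
    n = len(docs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = jaccard(docs[i], docs[j])
    return matrix


docs = [{"a", "b"}, {"b", "c"}, {"d"}]
for row in similarity_matrix(docs):
    print([round(value, 2) for value in row])
```

This costs on the order of N² comparisons for N documents, which is why the full matrix is rarely materialized for large collections.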

Generating Similarity Matrix

• Use inverted file

• Documents with no terms in common do not need similarity calculation

• Generally generate only one row at a time as needed
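The three points above can be sketched together: an inverted file maps each term to the documents containing it, so one row of the matrix can be produced by visiting only documents that share at least one term with the target. The function names and the choice of the Dice coefficient are assumptions for this illustration.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to the set of document ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in terms:
            inverted[term].add(doc_id)
    return inverted


def similarity_row(doc_id, docs, inverted):
    """Return one row of the similarity matrix as {other_id: Dice value}.

    Documents with no terms in common never appear in any posting list
    visited here, so no calculation is done for them.
    """
    common = defaultdict(int)  # other_id -> number of shared terms (C)
    for term in docs[doc_id]:
        for other_id in inverted[term]:
            if other_id != doc_id:
                common[other_id] += 1
    a = len(docs[doc_id])
    return {other: 2 * c / (a + len(docs[other])) for other, c in common.items()}


docs = [{"cluster", "term"}, {"term", "query"}, {"image"}]
inverted = build_inverted_file(docs)
print(similarity_row(0, docs, inverted))  # doc 2 shares no terms, so it is absent
```

Missing entries in the returned row are implicitly zero, so the row stays sparse.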

Algorithms

• Problem: sort N things into M groups, where 1 ≤ M ≤ N

• Choice of algorithm determines
– M
– membership

General Classes of Algorithms

• Hierarchical
– Nested groups
– Pairwise connections made

• Non-hierarchical
– No overlap
– Centroid
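As one illustration of the hierarchical class, the sketch below uses single-link agglomerative clustering, chosen here only as a common example and not necessarily the method covered in the lecture. It repeatedly merges the two clusters joined by the strongest pairwise connection, so the merge history forms nested groups. The function name and the sample matrix are assumptions.

```python
def single_link(sim, num_clusters):
    """Agglomerative single-link clustering over a doc-doc similarity matrix.

    Starts with one cluster per document and merges the pair of clusters
    whose most similar members are closest, until num_clusters remain.
    Returns the final clusters and the merge history.
    """
    clusters = [frozenset([i]) for i in range(len(sim))]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                link = max(sim[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or link > best[0]:
                    best = (link, x, y)
        link, x, y = best
        merges.append((set(clusters[x]), set(clusters[y]), link))
        merged = clusters[x] | clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical similarity matrix for four documents.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
clusters, merges = single_link(sim, 2)
print([sorted(c) for c in clusters])
```

Stopping at a different `num_clusters` cuts the same hierarchy at a different level, which is how the choice of algorithm and stopping rule determines both M and membership.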

Evaluation of results

• Was the method appropriate for the data set?

• Do the clusters represent the data well?

• Are the docs in the right cluster?

How to test?

• Overlap test: run a known query set and evaluate against known results

• Randomly select docs and judge relevance to group members

• Examine distribution of docs in groups

• Density test: density = term occurrences / (docs × unique terms)
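The density test can be sketched as below. One assumption is made loudly here: "term occurrences" is taken to mean the number of postings, that is, distinct document-term pairs under binary weights. The function name `density` is also an assumption.

```python
def density(docs):
    """Density = term occurrences / (docs x unique terms).

    Assumption: each document is a set of terms, so a term occurrence is
    one (document, term) posting.
    """
    postings = sum(len(terms) for terms in docs)
    unique_terms = set().union(*docs)
    if not docs or not unique_terms:
        return 0.0
    return postings / (len(docs) * len(unique_terms))


docs = [{"a", "b"}, {"b", "c"}, {"c"}]
print(density(docs))  # 5 postings over 3 docs x 3 unique terms
```

A value near 1 means most documents contain most terms; a value near 0 means the document-term matrix is sparse.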

Concepts to keep in mind

• Cluster hypothesis

• Nearest neighbour

• Centroid

Cluster Hypothesis

• Associations between documents are related to the relevance of documents to queries

• Van Rijsbergen, 1979

Nearest Neighbour

• Find the document most similar to the given one

• This one is most likely closely related

• Works with terms, citations, & clusters
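A nearest-neighbour lookup over term features can be sketched as a linear scan. The example assumes binary weights and the Dice coefficient; the function names and sample documents are hypothetical.

```python
def dice(terms_i, terms_j):
    """Dice coefficient for binary weights: 2C / (A + B)."""
    total = len(terms_i) + len(terms_j)
    return 2 * len(terms_i & terms_j) / total if total else 0.0


def nearest_neighbour(doc_id, docs):
    """Return (id, similarity) of the document most similar to docs[doc_id]."""
    best_id, best_sim = None, -1.0
    for other_id, terms in enumerate(docs):
        if other_id == doc_id:
            continue  # a document is not its own neighbour
        sim = dice(docs[doc_id], terms)
        if sim > best_sim:
            best_id, best_sim = other_id, sim
    return best_id, best_sim


docs = [
    {"cluster", "centroid", "term"},
    {"cluster", "centroid"},
    {"photo", "news"},
]
print(nearest_neighbour(0, docs))
```

The same scan works for citations or cluster representatives if each item is described by a set of features in place of terms.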

Centroids

• Representative of a cluster

• May be a document from that cluster

• May be a composite of doc features from that cluster

• Why:
– query-centroid calculations
– higher level representations of data set
– build ontologies and thesauri
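A composite centroid and a query-centroid calculation can be sketched as follows. The centroid is assumed to be the average of the binary term vectors in the cluster, and the query is matched with a cosine measure; both choices, and all names and sample data, are assumptions for illustration.

```python
import math
from collections import Counter


def composite_centroid(cluster):
    """Average the binary term vectors of the documents in a cluster.

    Each weight is the fraction of the cluster's documents containing the term.
    """
    counts = Counter(term for doc in cluster for term in doc)
    return {term: n / len(cluster) for term, n in counts.items()}


def cosine(query_terms, centroid):
    """Cosine between a binary query and a weighted centroid vector."""
    dot = sum(centroid.get(term, 0.0) for term in query_terms)
    norm = math.sqrt(len(query_terms)) * math.sqrt(
        sum(w * w for w in centroid.values())
    )
    return dot / norm if norm else 0.0


def best_cluster(query_terms, clusters):
    """Compare the query with each centroid, not with every document."""
    scores = [cosine(query_terms, composite_centroid(c)) for c in clusters]
    return max(range(len(clusters)), key=scores.__getitem__)


clusters = [
    [{"goal", "match"}, {"goal", "team"}],
    [{"film", "actor"}, {"film", "award"}],
]
print(best_cluster({"goal", "team"}, clusters))
```

Matching against one centroid per cluster replaces N query-document comparisons with M query-centroid comparisons, after which only the chosen cluster needs to be searched.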

Visualization of Clusters

• Kohonen Maps

• Star maps

• SOM (self-organizing maps)

• Etc

Samples

Cluster Map

Starfield
