
Evaluating Hierarchical Clustering of Search Results

Departamento de Lenguajes y Sistemas Informáticos, UNED, Spain

Juan Cigarrán, Anselmo Peñas, Julio Gonzalo, Felisa Verdejo

nlp.uned.es

SPIRE 2005, Buenos Aires

Overview

– Scenario
– Assumptions
– Features of a Good Hierarchical Clustering
– Evaluation Measures
  • Minimal Browsing Area (MBA)
  • Distillation Factor (DF)
  • Hierarchy Quality (HQ)
– Conclusion

Scenario

Complex information needs:
– Compile information from different sources
– Inspect the whole list of documents
  • More than 100 documents

Help the user to:
– Find the relevant topics
– Discriminate relevant from irrelevant documents

Approach (a toy sketch follows below):
– Hierarchical Clustering
– Formal Concept Analysis
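The slides give no implementation, so purely as an illustration of the Formal Concept Analysis step, here is a minimal Python sketch that enumerates the formal concepts of a small document-descriptor table by brute force. All data and helper names are invented for the example; this is not the authors' code.

```python
from itertools import combinations

# Toy incidence data: descriptor -> documents carrying it (hypothetical).
incidence = {
    "Physics":         {"d1", "d2", "d3", "d4"},
    "Nuclear physics": {"d2", "d3"},
    "Astrophysics":    {"d4"},
}
all_docs = set().union(*incidence.values())

def extent(intent_attrs):
    """Documents that carry every descriptor in `intent_attrs`."""
    docs = set(all_docs)
    for attr in intent_attrs:
        docs &= incidence[attr]
    return docs

def intent(docs):
    """Descriptors shared by every document in `docs`."""
    return {a for a, ds in incidence.items() if docs <= ds}

# Brute force: a pair (extent, intent) is a formal concept exactly when
# each side is the closure of the other.
concepts = set()
attrs = list(incidence)
for r in range(len(attrs) + 1):
    for combo in combinations(attrs, r):
        e = extent(set(combo))
        concepts.add((frozenset(e), frozenset(intent(e))))

for e, i in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(i), "->", sorted(e))
```

Ordering these concepts by extent inclusion yields the hierarchy that the following slides assume.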


Problem

How to define and measure the quality of a hierarchical clustering?

How to compare different clustering approaches?

Previous assumptions

Each cluster contains only those documents fully described by its descriptors.

                  d1   d2   d3   d4
Physics            X    X    X    X
Nuclear physics         X    X
Astrophysics                      X

Under this assumption each document is attached to its most specific node: the Physics node holds only d1, Nuclear physics holds d2 and d3, and Astrophysics holds d4 (instead of repeating d1, d2, d3, d4 at the Physics node).
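Read this way, the assumption is easy to state in code. The following sketch (again hypothetical, not the authors' code) lists a document only at the node whose descriptor set equals the document's complete description:

```python
# Full description of each document (hypothetical data from the table above).
doc_descriptors = {
    "d1": {"Physics"},
    "d2": {"Physics", "Nuclear physics"},
    "d3": {"Physics", "Nuclear physics"},
    "d4": {"Physics", "Astrophysics"},
}

# Nodes of the example hierarchy, identified by their descriptor sets.
nodes = [
    {"Physics"},
    {"Physics", "Nuclear physics"},
    {"Physics", "Astrophysics"},
]

for node in nodes:
    # Only documents whose full description equals the node's descriptors.
    docs = [d for d, attrs in sorted(doc_descriptors.items()) if attrs == node]
    print(sorted(node), "->", docs)
# {'Physics'} -> ['d1']; Nuclear physics -> ['d2', 'd3']; Astrophysics -> ['d4']
```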


Previous assumptions

‘Open world’ perspective:

                      d1   d2   d3
Physics                X         X
Jokes                       X    X
Jokes about physics              X

In the resulting lattice d1 is attached to Physics, d2 to Jokes, and d3 to the more specific node ‘Jokes about physics’, which lies below both Physics and Jokes.

Good Hierarchical Clustering

The content of the clusters:
– Clusters should not mix relevant with non-relevant information

[Figure: example clusters; good ones contain only relevant (+) or only irrelevant (-) documents, bad ones mix + and -]

Good Hierarchical Clustering

The hierarchical arrangement of the clusters:
– Relevant information should lie in the same path

[Figure: two hierarchies; in the good one the relevant (+) documents fall along a single path, in the bad one they are scattered across branches]

Good Hierarchical Clustering

The number of clusters:
– The number of clusters should be substantially lower than the number of documents

How clusters are described:
– Cognitive load of reading a cluster description
– Ability to predict the relevance of the information it contains (not addressed here)

Evaluation Measures

Criterion:
– Minimize the browsing effort needed to find ALL the relevant information

Baseline:
– The original document list returned by a search engine

Quality(lattice) = cognitive load(ranked list) / cognitive load(lattice)

Evaluation Measures

Consider:
– Content of clusters
– Hierarchical arrangement of clusters
– Size of the hierarchy
– Cognitive load of reading a document (in the baseline): Kd
– Cognitive load of reading a node descriptor (in the hierarchy): Kn

Requirement:
– Relevance assessments are available

Minimal Browsing Area (MBA)

The minimal set of nodes the user has to traverse to find ALL the relevant documents while minimising the number of irrelevant ones.

[Figure: lattice with the MBA highlighted; it reaches every relevant (+) document while taking in as few irrelevant (-) ones as possible]
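The slides define the MBA but give no algorithm. The following is a minimal brute-force sketch under a simplified reading of that definition: node contents are flattened to document sets and traversal-path constraints are ignored. All names and data are hypothetical.

```python
from itertools import combinations

def minimal_browsing_area(nodes, relevant):
    """Among node subsets whose union contains every relevant document,
    return one exposing the fewest irrelevant documents (ties broken by
    using fewer nodes). Exponential: only suitable for tiny examples."""
    best = None
    ids = list(nodes)
    for r in range(1, len(ids) + 1):
        for combo in combinations(ids, r):
            seen = set().union(*(nodes[n] for n in combo))
            if relevant <= seen:
                cost = (len(seen - relevant), r)
                if best is None or cost < best[1]:
                    best = (set(combo), cost)
    return best

# Hypothetical lattice contents: node -> documents attached to it.
nodes = {
    "n1": {"doc1"},
    "n2": {"doc3", "doc4", "doc5"},
    "n3": {"doc7"},
    "n4": {"doc2", "doc6"},
}
relevant = {"doc1", "doc3", "doc4", "doc7"}

mba, (n_irrelevant, _) = minimal_browsing_area(nodes, relevant)
print(sorted(mba), "irrelevant documents seen:", n_irrelevant)
# ['n1', 'n2', 'n3'] irrelevant documents seen: 1  -> precision(MBA) = 4/5
```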


Distillation Factor (DF)

Ability to isolate relevant information compared with the original document list (gain factor when DF > 1):

DF(L) = cognitive load(ranked list) / cognitive load(lattice)
      = (Kd · |D_ranked list|) / (Kd · |D_MBA|)

It considers only the cognitive load of reading documents. Equivalent to:

DF(L) = precision(MBA) / precision(ranked list)

Distillation Factor (DF)

Example:

Document list: Doc 1 +, Doc 2 -, Doc 3 +, Doc 4 +, Doc 5 -, Doc 6 -, Doc 7 +

precision(ranked list) = 4/7
precision(MBA) = 4/5

DF(L) = (4/5) / (4/7) = 7/5 = 1.4

[Figure: lattice for this list; the MBA covers the four relevant (+) documents and one irrelevant (-) one]
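Under the reconstruction above, DF reduces to simple arithmetic. A minimal sketch (hypothetical helper name) that reproduces the slide's numbers:

```python
def distillation_factor(n_relevant, n_list, n_mba):
    """DF(L) = precision(MBA) / precision(ranked list). Both sides contain
    all relevant documents, so this also equals |D_list| / |D_MBA|."""
    return (n_relevant / n_mba) / (n_relevant / n_list)

# Slide example: 7 documents, 4 relevant, MBA spanning 5 documents.
print(distillation_factor(4, 7, 5))  # 1.4
```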


Distillation Factor (DF)

Counterexample:

Document list: + + - - + - - +

precision(ranked list) = 4/8
precision(MBA) = 4/4

DF = (4/4) / (4/8) = 8/4 = 2

A bad clustering can still obtain a good DF. The DF measure is therefore extended with the cognitive cost of taking browsing decisions: HQ.

Hierarchy Quality (HQ)

Assumption:
– When a node (in the MBA) is explored, all its lower neighbours have to be considered: some will in turn be explored, some will be discarded
– Nview: subset of lower neighbours of each node belonging to the MBA

[Figure: lattice with the MBA marked; the lower neighbours of the explored nodes give |Nview| = 8]

Hierarchy Quality (HQ)

HQ(L) = cognitive load(ranked list) / cognitive load(lattice)
      = (Kd · |D_ranked list|) / (Kd · |D_MBA| + Kn · |Nview|)

Kn and Kd are directly related to the retrieval scenario in which the experiments take place. The researcher must tune K = Kn/Kd before conducting the experiment.

HQ > 1 indicates an improvement of the clustering over the original list.

Hierarchy Quality (HQ)

Example:

For the lattice shown earlier (|D_ranked list| = 10, |D_MBA| = 5, |Nview| = 8):

HQ(L) = (10 · Kd) / (5 · Kd + 8 · Kn)        Cutting value: Kd/Kn = 1.6

For the counterexample (+ + - - + - - +; |D_ranked list| = 8, |D_MBA| = 4, |Nview| = 12):

HQ(L) = (8 · Kd) / (4 · Kd + 12 · Kn)        Cutting value: Kd/Kn = 3
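Assuming the reconstructed formula above, HQ and its cutting value are again simple arithmetic. A minimal sketch with hypothetical helper names, checked against the counterexample's numbers:

```python
def hierarchy_quality(n_list, n_mba, n_view, kd=1.0, kn=1.0):
    """HQ(L) = (Kd*|D_ranked list|) / (Kd*|D_MBA| + Kn*|Nview|)."""
    return (kd * n_list) / (kd * n_mba + kn * n_view)

def cutting_value(n_list, n_mba, n_view):
    """Kd/Kn ratio at which HQ(L) = 1: equating numerator and denominator
    gives Kd/Kn = |Nview| / (|D_ranked list| - |D_MBA|)."""
    return n_view / (n_list - n_mba)

# Counterexample: 8 documents, |D_MBA| = 4, |Nview| = 12.
print(cutting_value(8, 4, 12))                      # 3.0
print(hierarchy_quality(8, 4, 12, kd=4.0, kn=1.0))  # 1.14... (> 1 once Kd/Kn > 3)
```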


Conclusions and Future Work

A framework for comparing different clustering approaches, taking into account:
– Content of clusters
– Hierarchical arrangement of clusters
– Cognitive load of reading document and node descriptions

Adaptable to the retrieval scenario in which the experiments take place.

Future work:
– Conduct user studies and compare their results with the automatic evaluation
  • Results will reflect the quality of the descriptors
  • They will be used to fine-tune the Kd and Kn parameters

Thank you!