A comparison between Latent Semantic Analysis and
Correspondence Analysis
Julie Séguéla, Gilbert Saporta
CNAM, Cedric Lab / Multiposting.fr
February 9th 2011 - CARME
Outline
1 Introduction
2 Latent Semantic Analysis
   Presentation
   Method
3 Application in a real context
   Presentation
   Methodology
   Results and comparisons
4 Conclusion
A comparison between LSA & CA February 9th 2011 - CARME 2 / 29
Introduction
Outline
1 Introduction
2 Latent Semantic Analysis
3 Application in a real context
4 Conclusion
Introduction
Context
Text representation for categorization task
Introduction
Objectives
Comparison of several text representation techniques through theory and application
In particular, comparison between a statistical technique, Correspondence Analysis, and an information retrieval (IR) oriented method, Latent Semantic Analysis
Is there an optimal technique for performing document clustering?
Latent Semantic Analysis
Outline
1 Introduction
2 Latent Semantic Analysis
   Presentation
   Method
3 Application in a real context
4 Conclusion
Latent Semantic Analysis Presentation
Uses of LSA
LSA was patented in 1988 (US Patent 4,839,853) by Deerwester, Dumais, Furnas, Harshman, Landauer, Lochbaum and Streeter.
Find semantic relations between terms
Helps to overcome synonymy and polysemy problems
Dimensionality reduction (from several thousand features down to 40-400 dimensions)
Applications
Document clustering and document classification
Matching queries to documents of similar topic meaning (information retrieval)
Text summarization
...
Latent Semantic Analysis Method
LSA theory
How to obtain document coordinates?

1) Document-term matrix: T = [f_ij]
2) Weighting: TW = [l_ij(f_ij) · g_j(f_ij)]
3) SVD: TW = U Σ V′
4) Document coordinates in the latent semantic space: C = U_k Σ_k

We need to find the optimal dimension k for the final representation
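Steps 1)-4) can be sketched in a few lines of NumPy. The toy document-term matrix below is made up for illustration (it is not the talk's data), and the weighting step is the identity, i.e. raw term frequency:

```python
import numpy as np

# Toy document-term matrix T (3 documents x 4 terms); values are made up.
T = np.array([[2.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 2.0],
              [1.0, 1.0, 0.0, 0.0]])

# Step 2: weighting -- here the identity (raw term frequency).
TW = T

# Step 3: SVD of the weighted matrix, TW = U Sigma V'.
U, s, Vt = np.linalg.svd(TW, full_matrices=False)

# Step 4: document coordinates in a k-dimensional latent space, C = U_k Sigma_k.
k = 2
C = U[:, :k] * s[:k]
print(C.shape)  # (3, 2): one k-dimensional coordinate vector per document
```

In practice k is a tuning parameter; the application section of the talk studies how clustering quality varies with it.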
Latent Semantic Analysis Method
Common weighting functions
Local weighting

Term frequency: l_ij(f_ij) = f_ij
Binary: l_ij(f_ij) = 1 if term j occurs in document i, else 0
Logarithm: l_ij(f_ij) = log(f_ij + 1)

Global weighting

Normalisation: g_j(f_ij) = 1 / sqrt(Σ_i f_ij²)
IDF (Inverse Document Frequency): g_j(f_ij) = 1 + log(n / n_j)
  n: number of documents
  n_j: number of documents in which term j occurs
Entropy: g_j(f_ij) = 1 − H_j / log(n), with H_j = −Σ_i (f_ij / f_.j) log(f_ij / f_.j)
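These weights can be computed directly with NumPy. The count matrix `f` below is made up for illustration; the entropy weight is coded in the equivalent form 1 + Σ_i p_ij log(p_ij) / log(n) with p_ij = f_ij / f_.j (since the entropy H_j = −Σ_i p_ij log p_ij):

```python
import numpy as np

f = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0]])   # toy counts: 3 documents x 3 terms
n = f.shape[0]                    # number of documents

# Local weights (document x term matrices)
l_tf = f                          # term frequency
l_bin = (f > 0).astype(float)     # binary
l_log = np.log(f + 1.0)           # logarithm

# Global weights (one value per term j)
g_norm = 1.0 / np.sqrt((f ** 2).sum(axis=0))   # normalisation
n_j = (f > 0).sum(axis=0)                      # docs containing term j
g_idf = 1.0 + np.log(n / n_j)                  # IDF
p = f / f.sum(axis=0)                          # p_ij = f_ij / f_.j
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
g_ent = 1.0 + plogp.sum(axis=0) / np.log(n)    # entropy weight, in [0, 1]

# A weighted matrix combines one local and one global weight, e.g. TF-IDF:
TW = l_tf * g_idf
```

Terms concentrated in one document get an entropy weight near 1; terms spread evenly over all documents get a weight near 0.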
Latent Semantic Analysis Method
LSA vs CA
Latent Semantic Analysis
1) T = [f_ij]
2) TW = [l_ij(f_ij) · g_j(f_ij)]
3) TW = U Σ V′
4) C = U_k Σ_k

Correspondence Analysis
1) T = [f_ij]
2) TW = [f_ij / sqrt(f_i. f_.j)]
3) TW = U Σ V′
3′) Ũ = diag(sqrt(f.. / f_i.)) U
4) C = Ũ_k Σ_k

CA amounts to the weighting l_ij(f_ij) = f_ij / sqrt(f_i.) and g_j(f_ij) = 1 / sqrt(f_.j), followed by the row rescaling in step 3′
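The CA column of the comparison can be sketched on a toy contingency table (made up for illustration). A useful sanity check: with this weighting the largest singular value of TW is always 1, and the corresponding rescaled coordinates are constant; this is the trivial dimension, dropped in practice:

```python
import numpy as np

F = np.array([[10.0, 2.0, 4.0],
              [3.0, 9.0, 5.0],
              [6.0, 4.0, 8.0]])   # toy contingency table

fi = F.sum(axis=1)    # row totals  f_i.
fj = F.sum(axis=0)    # column totals  f_.j
ftot = F.sum()        # grand total  f..

# Step 2: CA weighting, tw_ij = f_ij / sqrt(f_i. f_.j)
TW = F / np.sqrt(np.outer(fi, fj))

# Step 3: SVD
U, s, Vt = np.linalg.svd(TW, full_matrices=False)

# Step 3': rescale rows, U~ = diag(sqrt(f.. / f_i.)) U
U_tilde = np.sqrt(ftot / fi)[:, None] * U

# Step 4: row (document) coordinates; column 0 is the trivial dimension
C = U_tilde * s
print(s[0])  # ~1.0: the trivial CA dimension
```

Keeping dimensions 1..k of `C` (and skipping column 0) gives the usual CA row principal coordinates.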
Application in a real context
Outline
1 Introduction
2 Latent Semantic Analysis
3 Application in a real context
   Presentation
   Methodology
   Results and comparisons
4 Conclusion
Application in a real context Presentation
Objectives
Corpus of job offers
Find the best representation method to assess "job similarity" between offers in an unsupervised framework
Comparison of several representation techniques
Discussion about the optimal number of dimensions to keep
Comparison between two similarity measures
Application in a real context Presentation
Data
Offers have been manually labeled by recruiters into 8 categories during the posting procedure

Distribution among job categories:

Category                     Freq.    %    Category                 Freq.    %
Sales/Business Development    360    24    Marketing/Product         141    10
R&D/Science                    69     5    Production/Operations     127     9
Accounting/Finance            338    23    Human Resources           138     9
Logistics/Transportation      118     8    Information Systems       192    13
Total                        1483   100

We keep only the "title" + "mission description" parts ("firm description" and "profile searched" are excluded)
Application in a real context Methodology
Preprocessing of texts
Lemmatisation and tagging
Filtering according to grammatical category (we keep nouns, verbs and adjectives)
Filtering out terms occurring in fewer than 5 offers
Vector space model ("bag of words")
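The filtering and bag-of-words steps can be sketched on a made-up toy corpus (lemmatisation and POS tagging need external tools and are assumed already done; the talk's threshold is 5 offers, lowered to 3 here to fit the toy data):

```python
from collections import Counter

# Toy corpus of already-lemmatised, POS-filtered documents (made up).
docs = [["sale", "manage", "client"],
        ["client", "account", "finance"],
        ["finance", "report", "account"],
        ["sale", "client", "finance"],
        ["hr", "recruit"],
        ["client", "sale", "account"]]

min_df = 3  # keep terms occurring in at least 3 documents (5 in the talk)
df = Counter(t for d in docs for t in set(d))          # document frequency
vocab = sorted(t for t, c in df.items() if c >= min_df)

# Bag-of-words document-term matrix (rows: documents, columns: vocab terms)
T = [[d.count(t) for t in vocab] for d in docs]
print(vocab)  # ['account', 'client', 'finance', 'sale']
```

Rare terms ("manage", "report", "hr", ...) are dropped, so the vector space stays small and less noisy.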
Application in a real context Methodology
Several representations are compared
Representation method
LSA, weighting: Term Frequency
LSA, weighting: TF-IDF
LSA, weighting: Log Entropy
CA
Dissimilarity measure
Euclidean distance between documents i and i′
1 − cosine similarity between documents i and i′
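The two dissimilarities can be written in a few lines of NumPy. The toy vectors below (made up) point in the same direction but have different lengths, which is exactly the case where the two measures disagree:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two coordinate vectors."""
    return np.linalg.norm(x - y)

def cosine_dissim(x, y):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a, twice as long
print(euclidean(a, b))      # > 0: sensitive to document length
print(cosine_dissim(a, b))  # ~0: length-invariant
```

This length-invariance is the usual argument for cosine with texts: a long and a short offer on the same job should still be close.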
Application in a real context Methodology
Method of clustering
Clustering steps
Computation of the dissimilarity matrix from document coordinates in the latent semantic space
Hierarchical Agglomerative Clustering down to an 8-class partition
Computation of class centroids
K-means clustering initialized from the previous centroids
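The steps above can be sketched with SciPy's hierarchical clustering plus a hand-rolled k-means refinement. The data are made-up, well-separated 2-D points, and k = 3 here instead of the talk's 8; this is a sketch of the pipeline, not the talk's implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy "document coordinates": 3 well-separated groups of 10 points in 2-D
X = np.vstack([rng.normal(c, 0.1, size=(10, 2)) for c in (0.0, 3.0, 6.0)])

k = 3  # 8 in the talk
# Steps 1-2: hierarchical agglomerative clustering, cut at k classes
labels = fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")

# Step 3: class centroids of the HAC partition
centroids = np.array([X[labels == c].mean(axis=0) for c in range(1, k + 1)])

# Step 4: k-means refinement initialised from those centroids
for _ in range(10):
    assign = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    centroids = np.array([X[assign == c].mean(axis=0) for c in range(k)])
print(np.bincount(assign))  # cluster sizes
```

Seeding k-means from HAC centroids avoids the usual sensitivity of k-means to random initialisation.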
Application in a real context Methodology
Measures of agreement between two partitions
P1, P2: two partitions of n objects with the same number of classes k
N = [n_ij], i = 1,…,k, j = 1,…,k: corresponding contingency table

Rand index

R = (2 Σ_i Σ_j n_ij² − Σ_i n_i.² − Σ_j n_.j² + n²) / n²,  0 ≤ R ≤ 1

The Rand index is based on the number of pairs of units which belong to the same clusters. It does not depend on cluster labeling.
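This version of the index can be implemented directly from the contingency table. The two toy labelings below are made up; the first pair is the same partition with labels swapped, which the index correctly scores as perfect agreement:

```python
import numpy as np

def rand_index(labels1, labels2):
    """Rand index computed from the contingency table N = [n_ij]."""
    l1, l2 = np.asarray(labels1), np.asarray(labels2)
    n = l1.size
    _, inv1 = np.unique(l1, return_inverse=True)
    _, inv2 = np.unique(l2, return_inverse=True)
    N = np.zeros((inv1.max() + 1, inv2.max() + 1))
    np.add.at(N, (inv1, inv2), 1)          # fill the contingency table
    return (2 * (N ** 2).sum() - (N.sum(axis=1) ** 2).sum()
            - (N.sum(axis=0) ** 2).sum() + n ** 2) / n ** 2

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition, labels swapped
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.5: maximal disagreement
```

Because the formula only involves squared cell counts and margins, relabeling either partition leaves R unchanged.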
Application in a real context Methodology
Measures of agreement between two partitions
Cohen's Kappa and F-measure values depend on cluster labels. To overcome label switching, we look for their maximum values over all label allocations.

Cohen's Kappa

κ_opt = max { [(1/n) Σ_i n_ii − (1/n²) Σ_i n_i. n_.i] / [1 − (1/n²) Σ_i n_i. n_.i] },  −1 ≤ κ ≤ 1

F-measure

F_opt = max { 2 · [(1/k) Σ_i n_ii/n_i.] · [(1/k) Σ_i n_ii/n_.i] / [(1/k) Σ_i n_ii/n_i. + (1/k) Σ_i n_ii/n_.i] },  0 ≤ F ≤ 1
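The maximisation over label allocations can be done by brute force for small k (with the talk's k = 8 there are 8! = 40320 permutations, still cheap). A sketch on a made-up toy pair of labelings:

```python
import numpy as np
from itertools import permutations

def agreement(labels1, labels2, k):
    """Cohen's kappa and macro-F, maximised over label permutations."""
    n = len(labels1)
    N = np.zeros((k, k))
    for a, b in zip(labels1, labels2):
        N[a, b] += 1                         # contingency table
    best_kappa, best_f = -1.0, 0.0
    for perm in permutations(range(k)):      # relabel the second partition
        M = N[:, list(perm)]
        diag = np.diag(M)
        po = diag.sum() / n                           # observed agreement
        pe = (M.sum(axis=1) * M.sum(axis=0)).sum() / n ** 2  # chance agreement
        kappa = (po - pe) / (1 - pe)
        rec = (diag / M.sum(axis=1)).mean()           # (1/k) sum n_ii / n_i.
        prec = (diag / M.sum(axis=0)).mean()          # (1/k) sum n_ii / n_.i
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        best_kappa, best_f = max(best_kappa, kappa), max(best_f, f)
    return float(best_kappa), float(best_f)

print(agreement([0, 0, 1, 1], [1, 1, 0, 0], 2))  # (1.0, 1.0)
```

For larger k, the optimal allocation can be found in polynomial time with the Hungarian algorithm instead of enumerating permutations.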
Application in a real context Results and comparisons
Correlation between coordinates obtained with the different methods (figure not reproduced)
Application in a real context Results and comparisons
Clustering quality according to the method and the number of dimensions: Rand index (figure not reproduced)
Application in a real context Results and comparisons
Clustering quality according to the method and the number of dimensions: Cohen's Kappa (figure not reproduced)
Application in a real context Results and comparisons
Clustering quality according to the method and the number of dimensions: F-measure (figure not reproduced)
Application in a real context Results and comparisons
Clustering quality according to the dissimilarity function: LSA + Log Entropy (figure not reproduced)
Application in a real context Results and comparisons
Clustering quality according to the dissimilarity function: CA (figure not reproduced)
Conclusion
Outline
1 Introduction
2 Latent Semantic Analysis
3 Application in a real context
4 Conclusion
Conclusion
Conclusions
CA seems to be less stable than the other methods, but with cosine similarity it provides better results under 100 dimensions
As reported in the literature, cosine similarity between vectors seems better suited to textual data than the usual dot product: a slight gain in efficiency and more stable agreement measures
Optimal number of dimensions to keep? It varies with the type of text studied and the method used (around 60 dimensions with CA)
We should prefer a dissimilarity measure whose results are stable with respect to the number of dimensions kept (in the context of automated tasks, it is problematic if the optimal dimension depends too much on the collection of documents)
Conclusion
Limitations & future work
Limitations of the study
The clusters obtained are compared with categories chosen by recruiters, which are sometimes subjective and could explain some errors
We are working on a very particular type of corpus: short texts of variable length, sometimes very similar but not real duplicates
Future work
Test other clustering methods (the representation to adopt may depend on the method)
Repeat the study with a supervised classification algorithm (index values are disappointing in the unsupervised framework)
Study the effect of using the different parts of job offers for classification
Some references
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41, 391-407.
Greenacre, M. (2007). Correspondence Analysis in Practice, Second Edition. London: Chapman & Hall/CRC.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
Landauer, T. K., McNamara, D., Dennis, S., & Kintsch, W. (Eds.) (2007). Handbook of Latent Semantic Analysis. Mahwah, NJ: Erlbaum.
Picca, D., Curdy, B., & Bavaud, F. (2006). Non-linear correspondence analysis in text retrieval: a kernel view. In JADT'06, pp. 741-747.
Wild, F. (2007). An LSA package for R. In LSA-TEL'07, pp. 11-12.
Thanks!