Latent Semantic Indexing: A Probabilistic Analysis
Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala
Motivation
• Applications in several areas:
  – querying
  – clustering, identifying topics
  – other:
    • synonym recognition (e.g. the TOEFL test)
    • psychology tests
    • essay scoring
Motivation (cont.)
• Latent Semantic Indexing is
  – Latent: captures associations which are not explicit
  – Semantic: represents meaning as a function of similarity to other entities
  – Cool: lots of spiffy applications, and the potential for some good theory too
Overview
• IR and two classical problems
• How LSI works
• Why LSI is effective: A probabilistic analysis
Information Retrieval
• Text corpus with many documents (docs)
• Given a query, find relevant docs
• Classical problems:
  – synonymy: missing docs that refer to “automobile” when querying on “car”
  – polysemy: retrieving docs on the Internet when querying on “surfing”
• Solution: Represent docs (and queries) by their underlying latent concepts
Information Retrieval
• Represent each document as a word vector
• Represent corpus as term-document matrix (T-D matrix)
• A classical method (sketched in code below):
  – create a new vector from the query terms
  – find the documents with the highest dot product
[Figure: documents and a query as vectors in a 2-D word space (axes: Word 1, Word 2)]
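As a concrete illustration of the classical method, here is a minimal sketch in Python/NumPy; the toy corpus, vocabulary, and query are invented for illustration, not taken from the paper:

```python
import numpy as np

# Toy corpus: 3 docs over a 4-term vocabulary (illustrative only).
terms = ["car", "automobile", "engine", "surfing"]
docs = ["car engine car", "automobile engine", "surfing surfing"]

# Term-document matrix A: A[i, j] = count of term i in doc j.
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

# Query "car" as a vector over the same vocabulary.
q = np.array([1.0, 0.0, 0.0, 0.0])

# Rank docs by their dot product with the query vector.
scores = q @ A
print(scores)   # [2. 0. 0.]: the "automobile" doc is missed (synonymy)
```

Note how the doc containing only “automobile” scores zero for the query “car”: exactly the synonymy problem above.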
Latent Semantic Indexing (LSI)
• Process term-document (T-D) matrix to expose statistical structure
• Convert high-dimensional space to lower-dimensional space, throw out noise, keep the good stuff
• Related to principal component analysis (PCA) and multidimensional scaling (MDS)
Parameters
• U = universe of terms
• n = number of terms
• m = number of docs
• A = n x m matrix with rank r
  – columns represent docs
  – rows represent terms
Singular Value Decomposition (SVD)
• LSI uses SVD, a linear analysis method:
  A = U D V^T
  – A: n x m (rows = terms, columns = documents)
  – U: n x r
  – D: r x r
  – V^T: r x m
SVD (cont.)
• r is the rank of A
• D: diagonal matrix of the r singular values
• U and V: matrices composed of orthonormal columns
• the SVD always exists
• numerical methods for computing the SVD exist
• run time: O(mnc), where c denotes the average number of words per document
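A minimal NumPy sketch of the decomposition (the random matrix stands in for a real T-D matrix; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(6, 4)).astype(float)  # stand-in n x m T-D matrix

# Thin SVD: A = U @ diag(s) @ Vt, with the singular values in s sorted
# in decreasing order, and orthonormal columns in U (and rows in Vt).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.linalg.matrix_rank(A), s)          # rank r and singular values
assert np.allclose(U @ np.diag(s) @ Vt, A)  # the factorization is exact
```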
T-D Matrix Approximation
• Keep only the k largest singular values:
  A_k = U_k D_k V_k^T
  – A_k: n x m (rows = terms, columns = documents)
  – U_k: n x k
  – D_k: k x k
  – V_k^T: k x m
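A sketch of the rank-k truncation, again in NumPy (function name and test matrix are illustrative). By Eckart-Young, A_k is the best rank-k approximation to A in the Frobenius norm:

```python
import numpy as np

def lsi_rank_k(A, k):
    """A_k = U_k D_k V_k^T: keep only the k largest singular values
    (np.linalg.svd returns them already sorted in decreasing order)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(6, 4)).astype(float)
A2 = lsi_rank_k(A, k=2)

# The approximation error is exactly the tail of the singular values.
_, s, _ = np.linalg.svd(A, full_matrices=False)
print(np.linalg.norm(A - A2, "fro"), np.sqrt(np.sum(s[2:] ** 2)))
```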
Synonymy
• LSI is used in several ways: e.g. detecting synonymy
• A measure of similarity for two terms i and j:
  – in the original space: dot product of rows i and j of A (the (i, j) entry of A A^T)
  – better: dot product of rows i and j of A_k = U_k D_k V_k^T (the (i, j) entry of A_k A_k^T)
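A tiny worked example of why the rank-k dot product is “better”: in the toy matrix below (invented for illustration), “car” and “automobile” never co-occur, so their similarity in the original space is 0, but both co-occur with “engine”, and the rank-1 projection surfaces that latent association:

```python
import numpy as np

# Rows: car, automobile, engine. Columns: two docs.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

def rank_k(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print((A @ A.T)[0, 1])    # 0.0: no similarity in the original space
A1 = rank_k(A, 1)
print((A1 @ A1.T)[0, 1])  # 0.5: positive similarity in the rank-1 space
```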
“Semantic” Space
[Figure: terms plotted in the reduced “semantic” space; “House”, “Home”, “Domicile” cluster together, as do “Kumquat”, “Apple”, “Orange”, “Pear”]
Synonymy (intuition)
• Consider the term-term autocorrelation matrix A A^T
• If two terms co-occur (e.g. supply-demand) we get nearly identical rows in A
• The direction separating such rows yields a small eigenvalue of A A^T
• That eigenvector will likely be projected out in A_k A_k^T, as it has a weak eigenvalue
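A small numerical check of this intuition (the matrix is synthetic; the “twin” rows simulate two terms that almost always co-occur):

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.random(8)
A = np.vstack([base,                         # term 1
               base + 0.01 * rng.random(8),  # term 2: near-twin of term 1
               rng.random(8)])               # an unrelated term

# Eigenvalues of the term-term matrix A A^T, in ascending order.
w, V = np.linalg.eigh(A @ A.T)
print(w)        # the smallest eigenvalue is tiny
print(V[:, 0])  # its eigenvector roughly separates the twin terms (+/-)
```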
A Performance Evaluation
• Landauer & Dumais:
  – perform LSI on 30,000 encyclopedia articles
  – take the synonym test from TOEFL
  – choose the most similar word
• LSI: 64.4% (52.2% corrected for guessing)
• People: 64.5% (52.7% corrected for guessing)
• Correlated .44 with incorrect alternatives
A Probabilistic Analysis: Overview
• The model:
  – topics are sufficiently disjoint
  – each doc is drawn from a single (random) topic
• Result: with high probability (whp):
  – docs from the same topic will be similar
  – docs from different topics will be dissimilar
The Probabilistic Model
• k topics, each corresponding to a set of words
• The sets are mutually disjoint
• Below, all random choices are made uniformly at random
• A corpus of m docs, each doc created as follows:
The Probabilistic Model (cont.)
• Choosing a doc:
  – choose the length l of the doc
  – choose a topic T
  – repeat l times:
    • with probability 1 - ε, choose a word from topic T
    • with probability ε, choose a word from the other topics
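A generative sketch of this model in Python (the topic word sets, ε, and document length are illustrative parameters; all choices are uniform, per the model):

```python
import numpy as np

rng = np.random.default_rng(0)

# k = 2 mutually disjoint topics over a small universe of terms.
topics = [["car", "engine", "wheel"], ["surf", "wave", "beach"]]

def make_doc(length, eps):
    T = rng.integers(len(topics))  # choose a topic uniformly
    others = [w for i, t in enumerate(topics) if i != T for w in t]
    words = []
    for _ in range(length):        # repeat l times
        if rng.random() < 1 - eps:
            words.append(rng.choice(topics[T]))  # word from topic T
        else:
            words.append(rng.choice(others))     # word from other topics
    return words

print(make_doc(length=10, eps=0.1))
```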
Set up
• Let v_d be the vector assigned to doc d by the rank-k LSI performed on the corpus
• The rank-k LSI is δ-skewed if, for all docs d, d':
  » v_d · v_d' ≥ (1 - δ) |v_d| |v_d'| whenever d, d' ∈ T_i (same topic)
  » v_d · v_d' ≤ δ |v_d| |v_d'| whenever d ∈ T_i, d' ∈ T_j, i ≠ j (different topics)
• (intuition) Docs from the same topic should be similar (high dot product), docs from different topics dissimilar
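A sketch of checking δ-skewness, assuming the doc vectors v_d are taken as the columns of D_k V_k^T (a common LSI convention; the function name and loop structure are illustrative):

```python
import numpy as np

def is_delta_skewed(A, labels, k, delta):
    """Check delta-skewness of rank-k LSI: doc vectors from the same
    topic need normalized dot product >= 1 - delta; docs from
    different topics need <= delta. labels[d] is the topic of doc d."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    docs = (np.diag(s[:k]) @ Vt[:k, :]).T  # one row per doc: v_d
    norms = np.linalg.norm(docs, axis=1)
    m = len(labels)
    for d1 in range(m):
        for d2 in range(d1 + 1, m):
            cos = docs[d1] @ docs[d2] / (norms[d1] * norms[d2])
            same = labels[d1] == labels[d2]
            if (same and cos < 1 - delta) or (not same and cos > delta):
                return False
    return True
```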
The Result
• Theorem: Assume the corpus is created from the model just described (k topics, etc.). Then the rank-k LSI is O(ε)-skewed with probability at least 1 - O(1/m).
Proof Sketch
• Show that with k topics we obtain k orthogonal subspaces
  – assume strictly disjoint topics (ε = 0)
• Show that whp the k highest eigenvalues of A A^T indeed correspond to the k topics (are not intra-topic)
• (ε > 0): relax by using a matrix perturbation analysis
Extensions
• Theory should go beyond explaining (ideally)
• Potential for speed-up:
  – project the doc vectors onto a suitably small space
  – perform LSI on this space
• Yields O(m(n + c log² n)) compared to O(mnc)
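A sketch of the two-step speed-up (the projection dimension t and the Gaussian projection matrix are illustrative stand-ins for the paper's “suitably small space”):

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, k = 5000, 300, 10
A = rng.poisson(0.02, size=(n, m)).astype(float)  # stand-in T-D matrix

# Step 1: randomly project the n-dimensional doc vectors down to
# t << n dimensions (t = 200 is an arbitrary illustrative choice).
t = 200
R = rng.normal(0.0, 1.0 / np.sqrt(t), size=(t, n))
B = R @ A                                         # t x m projected matrix

# Step 2: perform rank-k LSI on the much smaller projected matrix.
U, s, Vt = np.linalg.svd(B, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T
print(doc_vectors.shape)                          # (300, 10)
```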
Future work
• Learn more abstract algebra (math)!
• Extensions:
  – docs spanning multiple topics?
  – polysemy?
  – other positive properties?
• Another important role of theory:
  – unify and generalize: spectral analysis has found applications elsewhere in IR