Latent Semantic Indexing: A Probabilistic Analysis
Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala
Motivation
• Applications in several areas:
  – querying
  – clustering, identifying topics
  – other:
    • synonym recognition (e.g. the TOEFL test)
    • psychology tests
    • essay scoring
Motivation (cont.)
• Latent Semantic Indexing is
  – Latent: captures associations which are not explicit
  – Semantic: represents meaning as a function of similarity to other entities
  – Cool: lots of spiffy applications, and the potential for some good theory too
Overview
• IR and two classical problems
• How LSI works
• Why LSI is effective: A probabilistic analysis
Information Retrieval
• Text corpus with many documents (docs)
• Given a query, find relevant docs
• Classical problems:
  – synonymy: missing docs that refer to “automobile” when querying on “car”
  – polysemy: retrieving docs on the Internet when querying on “surfing”
• Solution: Represent docs (and queries) by their underlying latent concepts
Information Retrieval
• Represent each document as a word vector
• Represent corpus as term-document matrix (T-D matrix)
• A classical method (sketched in code below):
  – create a new vector from the query terms
  – find the documents with the highest dot product
[Figure: documents and a query as vectors in a 2-D word space (axes: Word 1, Word 2)]
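As a concrete illustration of the classical method, here is a minimal sketch in Python/NumPy; the toy corpus, vocabulary, and query are invented for illustration, not taken from the paper:

```python
import numpy as np

# Toy corpus: 3 docs over a 4-term vocabulary (illustrative only).
terms = ["car", "automobile", "engine", "surfing"]
docs = ["car engine car", "automobile engine", "surfing surfing"]

# Term-document matrix A: A[i, j] = count of term i in doc j.
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

# Query "car" as a vector over the same vocabulary.
q = np.array([1.0, 0.0, 0.0, 0.0])

# Rank docs by their dot product with the query vector.
scores = q @ A
print(scores)   # [2. 0. 0.]: the "automobile" doc is missed (synonymy)
```

Note how the doc containing only “automobile” scores zero for the query “car”: exactly the synonymy problem above.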
Latent Semantic Indexing (LSI)
• Process term-document (T-D) matrix to expose statistical structure
• Convert high-dimensional space to lower-dimensional space, throw out noise, keep the good stuff
• Related to principal component analysis (PCA) and multidimensional scaling (MDS)
Parameters
• U = universe of terms
• n = number of terms
• m = number of docs
• A = n x m matrix with rank r
  – columns represent docs
  – rows represent terms
Singular Value Decomposition (SVD)
• LSI uses SVD, a linear analysis method:
  A = U D V^T
  – A: n x m (rows = terms, columns = documents)
  – U: n x r
  – D: r x r
  – V^T: r x m
SVD (cont.)
• r is the rank of A
• D: diagonal matrix of the r singular values
• U and V: matrices composed of orthonormal columns
• the SVD always exists
• numerical methods for computing the SVD exist
• run time: O(mnc), where c denotes the average number of words per document
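A minimal NumPy sketch of the decomposition (the random matrix stands in for a real T-D matrix; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(6, 4)).astype(float)  # stand-in n x m T-D matrix

# Thin SVD: A = U @ diag(s) @ Vt, with the singular values in s sorted
# in decreasing order, and orthonormal columns in U (and rows in Vt).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.linalg.matrix_rank(A), s)          # rank r and singular values
assert np.allclose(U @ np.diag(s) @ Vt, A)  # the factorization is exact
```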
T-D Matrix Approximation
• Keep only the k largest singular values:
  A_k = U_k D_k V_k^T
  – A_k: n x m (rows = terms, columns = documents)
  – U_k: n x k
  – D_k: k x k
  – V_k^T: k x m
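A sketch of the rank-k truncation, again in NumPy (function name and test matrix are illustrative). By Eckart-Young, A_k is the best rank-k approximation to A in the Frobenius norm:

```python
import numpy as np

def lsi_rank_k(A, k):
    """A_k = U_k D_k V_k^T: keep only the k largest singular values
    (np.linalg.svd returns them already sorted in decreasing order)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(6, 4)).astype(float)
A2 = lsi_rank_k(A, k=2)

# The approximation error is exactly the tail of the singular values.
_, s, _ = np.linalg.svd(A, full_matrices=False)
print(np.linalg.norm(A - A2, "fro"), np.sqrt(np.sum(s[2:] ** 2)))
```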
Synonymy
• LSI is used in several ways: e.g. detecting synonymy
• A measure of similarity for two terms i and j:
  – in the original space: dot product of rows i and j of A (the (i, j) entry of A A^T)
  – better: dot product of rows i and j of A_k = U_k D_k V_k^T (the (i, j) entry of A_k A_k^T)
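A tiny worked example of why the rank-k dot product is “better”: in the toy matrix below (invented for illustration), “car” and “automobile” never co-occur, so their similarity in the original space is 0, but both co-occur with “engine”, and the rank-1 projection surfaces that latent association:

```python
import numpy as np

# Rows: car, automobile, engine. Columns: two docs.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

def rank_k(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print((A @ A.T)[0, 1])    # 0.0: no similarity in the original space
A1 = rank_k(A, 1)
print((A1 @ A1.T)[0, 1])  # 0.5: positive similarity in the rank-1 space
```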
“Semantic” Space
[Figure: terms plotted in the reduced “semantic” space; “House”, “Home”, “Domicile” cluster together, as do “Kumquat”, “Apple”, “Orange”, “Pear”]
Synonymy (intuition)
• Consider the term-term autocorrelation matrix A A^T
• If two terms co-occur (e.g. supply-demand) we get nearly identical rows in A
• The direction separating such rows yields a small eigenvalue of A A^T
• That eigenvector will likely be projected out in A_k A_k^T, as it has a weak eigenvalue
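A small numerical check of this intuition (the matrix is synthetic; the “twin” rows simulate two terms that almost always co-occur):

```python
import numpy as np

rng = np.random.default_rng(1)
base = rng.random(8)
A = np.vstack([base,                         # term 1
               base + 0.01 * rng.random(8),  # term 2: near-twin of term 1
               rng.random(8)])               # an unrelated term

# Eigenvalues of the term-term matrix A A^T, in ascending order.
w, V = np.linalg.eigh(A @ A.T)
print(w)        # the smallest eigenvalue is tiny
print(V[:, 0])  # its eigenvector roughly separates the twin terms (+/-)
```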
A Performance Evaluation
• Landauer & Dumais:
  – perform LSI on 30,000 encyclopedia articles
  – take the synonym test from TOEFL
  – choose the most similar word
• LSI: 64.4% (52.2% corrected for guessing)
• People: 64.5% (52.7% corrected for guessing)
• Correlated .44 with incorrect alternatives
A Probabilistic Analysis: Overview
• The model:
  – topics are sufficiently disjoint
  – each doc is drawn from a single (random) topic
• Result: with high probability (whp):
  – docs from the same topic will be similar
  – docs from different topics will be dissimilar
The Probabilistic Model
• k topics, each corresponding to a set of words
• The sets are mutually disjoint
• Below, all random choices are made uniformly at random
• A corpus of m docs, each doc created as follows:
The Probabilistic Model (cont.)
• Choosing a doc:
  – choose the length l of the doc
  – choose a topic T
  – repeat l times:
    • with probability 1 - ε, choose a word from topic T
    • with probability ε, choose a word from the other topics
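A generative sketch of this model in Python (the topic word sets, ε, and document length are illustrative parameters; all choices are uniform, per the model):

```python
import numpy as np

rng = np.random.default_rng(0)

# k = 2 mutually disjoint topics over a small universe of terms.
topics = [["car", "engine", "wheel"], ["surf", "wave", "beach"]]

def make_doc(length, eps):
    T = rng.integers(len(topics))  # choose a topic uniformly
    others = [w for i, t in enumerate(topics) if i != T for w in t]
    words = []
    for _ in range(length):        # repeat l times
        if rng.random() < 1 - eps:
            words.append(rng.choice(topics[T]))  # word from topic T
        else:
            words.append(rng.choice(others))     # word from other topics
    return words

print(make_doc(length=10, eps=0.1))
```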
Set up
• Let v_d be the vector assigned to doc d by the rank-k LSI performed on the corpus
• The rank-k LSI is δ-skewed if, for all docs d, d':
  » v_d · v_d' ≥ (1 - δ) |v_d| |v_d'| whenever d, d' ∈ T_i (same topic)
  » v_d · v_d' ≤ δ |v_d| |v_d'| whenever d ∈ T_i, d' ∈ T_j, i ≠ j (different topics)
• (intuition) Docs from the same topic should be similar (high dot product), docs from different topics dissimilar
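A sketch of checking δ-skewness, assuming the doc vectors v_d are taken as the columns of D_k V_k^T (a common LSI convention; the function name and loop structure are illustrative):

```python
import numpy as np

def is_delta_skewed(A, labels, k, delta):
    """Check delta-skewness of rank-k LSI: doc vectors from the same
    topic need normalized dot product >= 1 - delta; docs from
    different topics need <= delta. labels[d] is the topic of doc d."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    docs = (np.diag(s[:k]) @ Vt[:k, :]).T  # one row per doc: v_d
    norms = np.linalg.norm(docs, axis=1)
    m = len(labels)
    for d1 in range(m):
        for d2 in range(d1 + 1, m):
            cos = docs[d1] @ docs[d2] / (norms[d1] * norms[d2])
            same = labels[d1] == labels[d2]
            if (same and cos < 1 - delta) or (not same and cos > delta):
                return False
    return True
```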
The Result
• Theorem: Assume the corpus is created from the model just described (k topics, etc.). Then the rank-k LSI is O(ε)-skewed with probability at least 1 - O(1/m).
Proof Sketch
• Show that with k topics we obtain k orthogonal subspaces
  – assume strictly disjoint topics (ε = 0)
• Show that whp the k highest eigenvalues of A A^T indeed correspond to the k topics (are not intra-topic)
• (ε > 0): relax by using a matrix perturbation analysis
Extensions
• Theory should go beyond explaining (ideally)
• Potential for speed-up:
  – project the doc vectors onto a suitably small space
  – perform LSI on this space
• Yields O(m(n + c log² n)) compared to O(mnc)
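A sketch of the two-step speed-up (the projection dimension t and the Gaussian projection matrix are illustrative stand-ins for the paper's “suitably small space”):

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, k = 5000, 300, 10
A = rng.poisson(0.02, size=(n, m)).astype(float)  # stand-in T-D matrix

# Step 1: randomly project the n-dimensional doc vectors down to
# t << n dimensions (t = 200 is an arbitrary illustrative choice).
t = 200
R = rng.normal(0.0, 1.0 / np.sqrt(t), size=(t, n))
B = R @ A                                         # t x m projected matrix

# Step 2: perform rank-k LSI on the much smaller projected matrix.
U, s, Vt = np.linalg.svd(B, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T
print(doc_vectors.shape)                          # (300, 10)
```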
Future work
• Learn more abstract algebra (math)!
• Extensions:
  – docs spanning multiple topics?
  – polysemy?
  – other positive properties?
• Another important role of theory:
  – unify and generalize: spectral analysis has found applications elsewhere in IR