Latent Semantic Analysis
Auro Tripathy
Outline
Introduction
Singular Value Decomposition
Dimensionality Reduction
LSA in Information Retrieval
Latent Semantic Analysis
Introduction
Mathematical treatment capable of inferring meaning
Measures of word-word, word-passage, & passage-passage relations that correlate well with human understanding of semantic similarity
Similarity estimates are NOT based on contiguity frequencies, co-occurrence counts, or usage correlations
Mathematical way capable of inferring deeper relationships; hence “latent”
Akin to a well-read nun dispensing sex advice
Analysis of text alone
Its knowledge does NOT come from perceived information about the physical world, NOT from instinct, NOT from feelings, NOT from emotions
Does NOT take into account word order, phrases, syntactic relationships, or logic
It takes in large amounts of text and looks for mutual interdependencies in the text
Words and Passages
LSA represents the meaning of a word as the average of the meaning of all the passages in which it appears…
…and the meaning of the passage as an average of the meaning of the words it contains
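A minimal numpy sketch of this averaging idea, using a made-up 3-word, 3-passage count matrix (the words and counts are illustrative, not from the slides):

```python
import numpy as np

# Toy word-by-passage count matrix; rows are words, columns are passages.
# The words and counts here are invented for illustration.
M = np.array([
    [1, 0, 1],   # "tree"
    [0, 1, 1],   # "graph"
    [1, 1, 0],   # "minor"
], dtype=float)

# Meaning of a word: the average of the passage (column) vectors
# it appears in, weighted by how often it appears there.
word_meanings = (M @ M.T) / M.sum(axis=1, keepdims=True)

# Meaning of a passage: the average of the word (row) vectors it contains.
passage_meanings = (M.T @ M) / M.sum(axis=0)[:, None]
```

This is only the intuition; in practice LSA obtains these mutual averages through the singular value decomposition described later.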
What is LSA?
LSA is a mathematical technique for extracting and inferring relations of expected contextual usage of words in documents
What LSA is not
Not a natural language processing program
Not an artificial intelligence program
Does NOT use dictionaries or databases
Does NOT use syntactic parsers
Does not use morphologies
Takes as input – words and text paragraphs
Example
Titles of N=9 technical memoranda
Five on human-computer interaction
Four on mathematical graph theory
Disjoint topics
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Sample Word-by-Document Matrix
Word selection criterion: the word occurs in at least two of the titles
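The matrix-building step can be sketched as follows; the titles here are invented stand-ins for the nine memoranda in the paper, but the selection criterion (a word must occur in at least two titles) is the one stated above:

```python
from collections import Counter

# Hypothetical short titles standing in for the technical memoranda.
titles = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the intersection graph of paths in trees",
    "graph minors a survey",
]

docs = [t.lower().split() for t in titles]

# Keep only words that occur in at least two of the titles.
doc_freq = Counter(w for d in docs for w in set(d))
vocab = sorted(w for w, n in doc_freq.items() if n >= 2)

# Word-by-document count matrix: one row per kept word, one column per title.
matrix = [[d.count(w) for d in docs] for w in vocab]
```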
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
How much was said about a topic
Semantic Similarity using the Spearman rank correlation coefficient
The correlation between the terms "human" and "user" is negative, -0.38
The correlation between "human" and "minor" is also negative, -0.29
Expected: the words never appear in the same passage, so there are no co-occurrences
http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
Spearman ρ (human, user) = -0.38
Spearman ρ (human, minor) = -0.29
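A sketch of how such a correlation is computed, using scipy's `spearmanr` on made-up count rows (the counts are illustrative, not those of the example; only the sign of the result mirrors the slide):

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative rows of a raw word-by-document count matrix; these are
# NOT the actual counts from the Landauer et al. example.
human = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0])
user  = np.array([0, 1, 1, 0, 1, 0, 0, 0, 0])
minor = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1])

# With no co-occurrences at all, the raw-count correlations come out negative.
rho_user, _ = spearmanr(human, user)
rho_minor, _ = spearmanr(human, minor)
```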
Singular Value Decomposition
The Term Space
[Figure: terms represented as vectors in the space of documents; axes labeled Terms and Documents]
Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
The Document Space
[Figure: documents represented as vectors in the space of terms; axes labeled Terms and Documents]
Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
The Semantic Space: one space for terms and documents
Represent terms AND documents in one space
Makes it possible to calculate similarities
Between documents
Between terms
Between terms and documents
The Decomposition
Splits the term-by-document matrix M into three matrices
A new space, the SVD space, because SVD finds new axes along which the terms and documents can be grouped
M = T S Dᵀ
where M is the term-by-document matrix (t × d), T is t × r, S is r × r, and Dᵀ is r × d
New Term Vectors, New Document Vectors, & Singular Values
T contains in its rows the term vectors scaled to a new basis
Dᵀ contains the new vectors of the documents
S is diagonal and contains the singular values σ1, σ2, …, σn
where σ1 ≥ σ2 ≥ … ≥ σn ≥ 0
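A minimal numpy sketch of the decomposition, with a made-up term-by-document matrix:

```python
import numpy as np

# A small made-up term-by-document count matrix (t = 4 terms, d = 3 docs).
M = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)

# numpy returns T (t x r), the singular values s, and Dt (r x d),
# with s already sorted in decreasing order.
T, s, Dt = np.linalg.svd(M, full_matrices=False)

# The three factors multiply back to the original matrix: M = T S Dt.
M_rebuilt = T @ np.diag(s) @ Dt
```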
Dimensionality Reduction
To reveal the latent semantic structure
Reduce to k dimensions: keep only the k largest singular values
M ≈ Tk Sk Dkᵀ
where M is the term-by-document matrix (t × d), Tk is t × k, Sk is k × k, and Dkᵀ is k × d
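A sketch of the rank-k truncation with numpy, using a made-up term-by-document matrix; `k = 2` mirrors the example in the slides:

```python
import numpy as np

# A small made-up term-by-document count matrix (t = 4 terms, d = 3 docs).
M = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(M, full_matrices=False)

# Keep only the k largest singular values and the matching
# columns of T and rows of Dt.
k = 2
M_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]
```

By the Eckart-Young theorem, `M_k` is the best rank-k approximation of `M`, which is what makes the truncation a principled way to smooth away noise.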
Example: Term Vectors Reduced to Two Dimensions
[Figure: the T, S, and D matrices of the example, truncated to k = 2]
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Reconstruction of the original matrix based on the reduced dimensions
[Figure: the original matrix alongside the matrix reconstructed from the reduced dimensions]
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Recomputed Semantic Similarity using the Spearman rank correlation coefficient
Original: Spearman ρ (human, user) = -0.38; Spearman ρ (human, minor) = -0.29
New: Spearman ρ (human, user) = +0.94; Spearman ρ (human, minor) = -0.83
The human-user correlation went up and the human-minor correlation went down
Correlation between a title and all other titles, raw data
• Correlation between the human-computer interaction titles was low
• Average correlation was 0.2; half the Spearman correlations were 0
• Correlation among the four graph-theory titles (mx/my) was mixed
• Average Spearman correlation was 0.44
• Correlation between the human-computer interaction titles and the graph-theory titles was -0.3, despite no semantic overlap
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Correlation in the reduced-dimension (k = 2) space
• Average correlation jumped from 0.2 to 0.92
• Correlation between the graph-theory titles (mx/my) was high: 1.0
• Correlation between the human-computer interaction titles and the graph-theory titles was strongly negative
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
LSA in Information Retrieval
How to treat a query
Form the term-by-document matrix
Perform SVD; reduce to 50-400 dimensions
A query is a "pseudo-document": the weighted average of the vectors of the words it contains
Use a similarity metric (such as cosine) between the query vector and the document vectors
Rank the results
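The steps above can be sketched with numpy. The fold-in formula q̂ = qᵀ Tk Sk⁻¹ is the standard one from the LSI literature (the slides do not spell it out), and the matrix and query below are made up:

```python
import numpy as np

# Made-up term-by-document matrix; rows are terms, columns are documents.
M = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(M, full_matrices=False)
k = 2
Tk, sk, Dk = T[:, :k], s[:k], Dt[:k, :].T   # Dk: one row per document

# Fold the query (raw term counts) into the k-dimensional space:
# q_hat = q^T Tk Sk^-1
q = np.array([1, 1, 0, 0], dtype=float)     # query uses terms 0 and 1
q_hat = q @ Tk / sk

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Score every document against the query and rank, best match first.
scores = [cosine(q_hat, d) for d in Dk]
ranking = np.argsort(scores)[::-1]
```

Here the query happens to equal document 0's term counts, so document 0 comes back with cosine similarity 1.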
The Query Vector
Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
Does better than literal matching between the terms in queries and documents
Superior when the query and the document use different words
References
• Latent Semantic Indexing and Information Retrieval, Johanna Geiß
• An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham