Latent Semantic Analysis
Auro Tripathy
Outline
Introduction
Singular Value Decomposition
Dimensionality Reduction
LSA in Information Retrieval
Latent Semantic Analysis
Introduction
Mathematical treatment capable of inferring meaning
Measures of word-word, word-passage, & passage-passage relations that correlate well with human understanding of semantic similarity
Similarity estimates are NOT based on contiguity frequencies, co-occurrence counts, or usage correlations
Mathematical way capable of inferring deeper relationships; hence “latent”
Akin to a well-read nun dispensing sex advice
Analysis of text alone
Its knowledge does NOT come from perceived information about the physical world, NOT from instinct, NOT from feelings, NOT from emotions
Does NOT take into account word order, phrases, syntactic relationships, or logic
It takes in large amounts of text and looks for mutual interdependencies in the text
Words and Passages
LSA represents the meaning of a word as the average of the meaning of all the passages in which it appears…
…and the meaning of the passage as an average of the meaning of the words it contains
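A minimal numpy sketch of this averaging idea, using a made-up 3-word, 3-passage count matrix (the words and counts are illustrative, not from the slides):

```python
import numpy as np

# Toy word-by-passage count matrix; rows are words, columns are passages.
# The words and counts here are invented for illustration.
M = np.array([
    [1, 0, 1],   # "tree"
    [0, 1, 1],   # "graph"
    [1, 1, 0],   # "minor"
], dtype=float)

# Meaning of a word: the average of the passage (column) vectors
# it appears in, weighted by how often it appears there.
word_meanings = (M @ M.T) / M.sum(axis=1, keepdims=True)

# Meaning of a passage: the average of the word (row) vectors it contains.
passage_meanings = (M.T @ M) / M.sum(axis=0)[:, None]
```

This is only the intuition; in practice LSA obtains these mutual averages through the singular value decomposition described later.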
What is LSA?
LSA is a mathematical technique for extracting and inferring relations of expected contextual usage of words in documents
What LSA is not
Not a natural language processing program
Not an artificial intelligence program
Does NOT use dictionaries or databases
Does NOT use syntactic parsers
Does not use morphologies
Takes as input – words and text paragraphs
Example
Titles of N=9 technical memoranda
Five on human-computer interaction
Four on mathematical graph theory
Disjoint topics
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Sample Word-by-Document Matrix
Word selection criterion: the word occurs in at least two of the titles
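The matrix-building step can be sketched as follows; the titles here are invented stand-ins for the nine memoranda in the paper, but the selection criterion (a word must occur in at least two titles) is the one stated above:

```python
from collections import Counter

# Hypothetical short titles standing in for the technical memoranda.
titles = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "the intersection graph of paths in trees",
    "graph minors a survey",
]

docs = [t.lower().split() for t in titles]

# Keep only words that occur in at least two of the titles.
doc_freq = Counter(w for d in docs for w in set(d))
vocab = sorted(w for w, n in doc_freq.items() if n >= 2)

# Word-by-document count matrix: one row per kept word, one column per title.
matrix = [[d.count(w) for d in docs] for w in vocab]
```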
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
How much was said about a topic
Semantic Similarity using the Spearman rank correlation coefficient
The correlation between the terms "human" and "user" is negative, -0.38
The correlation between "human" and "minor" is also negative, -0.29
Expected: the words never appear in the same passage, so there are no co-occurrences
http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
Spearman ρ (human, user) = -0.38
Spearman ρ (human, minor) = -0.29
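A sketch of how such a correlation is computed, using scipy's `spearmanr` on made-up count rows (the counts are illustrative, not those of the example; only the sign of the result mirrors the slide):

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative rows of a raw word-by-document count matrix; these are
# NOT the actual counts from the Landauer et al. example.
human = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0])
user  = np.array([0, 1, 1, 0, 1, 0, 0, 0, 0])
minor = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1])

# With no co-occurrences at all, the raw-count correlations come out negative.
rho_user, _ = spearmanr(human, user)
rho_minor, _ = spearmanr(human, minor)
```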
Singular Value Decomposition
The Term Space
[Figure: terms represented as vectors in the space of documents; axes labeled Terms and Documents]
Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
The Document Space
[Figure: documents represented as vectors in the space of terms; axes labeled Terms and Documents]
Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
The Semantic Space: one space for terms and documents
Represent terms AND documents in one space
Makes it possible to calculate similarities
Between documents
Between terms
Between terms and documents
The Decomposition
Splits the term-by-document matrix M into three matrices
A new space, the SVD space, because SVD finds new axes along which the terms and documents can be grouped
M = T S Dᵀ
where M is the term-by-document matrix (t × d), T is t × r, S is r × r, and Dᵀ is r × d
New Term Vectors, New Document Vectors, & Singular Values
T contains in its rows the term vectors scaled to a new basis
Dᵀ contains the new vectors of the documents
S is diagonal and contains the singular values σ1, σ2, …, σn
where σ1 ≥ σ2 ≥ … ≥ σn ≥ 0
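A minimal numpy sketch of the decomposition, with a made-up term-by-document matrix:

```python
import numpy as np

# A small made-up term-by-document count matrix (t = 4 terms, d = 3 docs).
M = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)

# numpy returns T (t x r), the singular values s, and Dt (r x d),
# with s already sorted in decreasing order.
T, s, Dt = np.linalg.svd(M, full_matrices=False)

# The three factors multiply back to the original matrix: M = T S Dt.
M_rebuilt = T @ np.diag(s) @ Dt
```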
Dimensionality Reduction
To reveal the latent semantic structure
Reduce to k dimensions: keep only the k largest singular values
M ≈ Tk Sk Dkᵀ
where M is the term-by-document matrix (t × d), Tk is t × k, Sk is k × k, and Dkᵀ is k × d
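A sketch of the rank-k truncation with numpy, using a made-up term-by-document matrix; `k = 2` mirrors the example in the slides:

```python
import numpy as np

# A small made-up term-by-document count matrix (t = 4 terms, d = 3 docs).
M = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(M, full_matrices=False)

# Keep only the k largest singular values and the matching
# columns of T and rows of Dt.
k = 2
M_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]
```

By the Eckart-Young theorem, `M_k` is the best rank-k approximation of `M`, which is what makes the truncation a principled way to smooth away noise.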
Example: Term Vectors Reduced to Two Dimensions
[Figure: the T, S, and D matrices of the example, truncated to k = 2]
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Reconstruction of the original matrix based on the reduced dimensions
[Figure: the original matrix alongside the matrix reconstructed from the reduced dimensions]
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Recomputed Semantic Similarity using the Spearman rank correlation coefficient
Original: Spearman ρ (human, user) = -0.38; Spearman ρ (human, minor) = -0.29
New: Spearman ρ (human, user) = +0.94; Spearman ρ (human, minor) = -0.83
The human-user correlation went up and the human-minor correlation went down
Correlation between a title and all other titles, raw data
• Correlation between the human-computer interaction titles was low
• Average correlation was 0.2; half the Spearman correlations were 0
• Correlation among the four graph-theory titles (mx/my) was mixed
• Average Spearman correlation was 0.44
• Correlation between the human-computer interaction titles and the graph-theory titles was -0.3, despite no semantic overlap
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
Correlation in the reduced-dimension (k = 2) space
• Average correlation jumped from 0.2 to 0.92
• Correlation between the graph-theory titles (mx/my) was high: 1.0
• Correlation between the human-computer interaction titles and the graph-theory titles was strongly negative
Source: An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham
LSA in Information Retrieval
How to treat a query
Form the term-by-document matrix
Perform SVD; reduce to 50-400 dimensions
A query is a "pseudo-document": the weighted average of the vectors of the words it contains
Use a similarity metric (such as cosine) between the query vector and the document vectors
Rank the results
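The steps above can be sketched with numpy. The fold-in formula q̂ = qᵀ Tk Sk⁻¹ is the standard one from the LSI literature (the slides do not spell it out), and the matrix and query below are made up:

```python
import numpy as np

# Made-up term-by-document matrix; rows are terms, columns are documents.
M = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(M, full_matrices=False)
k = 2
Tk, sk, Dk = T[:, :k], s[:k], Dt[:k, :].T   # Dk: one row per document

# Fold the query (raw term counts) into the k-dimensional space:
# q_hat = q^T Tk Sk^-1
q = np.array([1, 1, 0, 0], dtype=float)     # query uses terms 0 and 1
q_hat = q @ Tk / sk

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Score every document against the query and rank, best match first.
scores = [cosine(q_hat, d) for d in Dk]
ranking = np.argsort(scores)[::-1]
```

Here the query happens to equal document 0's term counts, so document 0 comes back with cosine similarity 1.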
The Query Vector
Source: Latent Semantic Indexing and Information Retrieval, Johanna Geiß
Does better than literal matching between the terms in queries and documents
Superior when the query and the document use different words
References
• Latent Semantic Indexing and Information Retrieval, Johanna Geiß
• An Introduction to Latent Semantic Analysis, Landauer, Foltz, Laham