Homework

Post on 06-Jan-2016

45 views 0 download

Tags:

description

Homework. Define a loss function that compares two matrices (say mean square error) b = svd(bellcore ) b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2] ) b3 = b$u[,1 :3] %*% diag(b$d[1 :3] ) %*% t(b$v[,1 :3]) More generally, for all possible r - PowerPoint PPT Presentation

Transcript of Homework

Homework

• Define a loss function that compares two matrices (say mean square error)

• b = svd(bellcore)• b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])• b3 = b$u[,1:3] %*% diag(b$d[1:3]) %*% t(b$v[,1:3])• More generally, for all possible r

– Let b.r = b$u[,1:r] %*% diag(b$d[1:r]) %*% t(b$v[,1:r])• Compute the loss between bellcore and b.r as a function

of r• Plot the loss as a function of r

IR Models

• Keywords (and Boolean combinations thereof)• Vector-Space ‘‘Model’’ (Salton, chap 10.1)– Represent the query and the documents as V-

dimensional vectors– Sort vectors by

• Probabilistic Retrieval Model– (Salton, chap 10.3)– Sort documents by

sim(x,y) = cos(x, y) =

x i ⋅ y i

i

∑| x |⋅ | y |

score(d) =Pr(w | rel)

Pr(w | rel)w∈d

Information Retrieval and Web SearchAlternative IR models

Instructor: Rada Mihalcea

Some of the slides were adopted from a course tought at Cornell University by William Y. Arms

Latent Semantic Indexing

Objective

Replace indexes that use sets of index terms by indexes that use concepts.

Approach

Map the term vector space into a lower dimensional space, using singular value decomposition.

Each dimension in the new space corresponds to a latent concept in the original data.

Deficiencies with Conventional Automatic Indexing

Synonymy: Various words and phrases refer to the same concept (lowers recall).

Polysemy: Individual words have more than one meaning (lowers precision)

Independence: No significance is given to two terms that frequently appear together

Latent semantic indexing addresses the first of these (synonymy), and the third (dependence)

Bellcore’s Examplehttp://en.wikipedia.org/wiki/Latent_semantic_analysis

 c1 Human machine interface for Lab ABC computer applications

 c2 A survey of user opinion of computer system response time

 c3 The EPS user interface management system

 c4 System and human system engineering testing of EPS

 c5 Relation of user-perceived response time to error measurement

m1 The generation of random, binary, unordered trees

m2 The intersection graph of paths in trees

m3 Graph minors IV: Widths of trees and well-quasi-ordering

m4 Graph minors: A survey

Term by Document Matrix

"bellcore"<-structure(.Data = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0,0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1), .Dim = c(12, 9), .Dimnames = list(c("human", "interface", "computer", "user","system", "response", "time", "EPS", "survey", "trees", "graph","minors"), c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")))

help(dump)help(source)

Query ExpansionQuery:

Find documents relevant to human computer interaction

Simple Term Matching:

Matches c1, c2, and c4Misses c3 and c5

LargeCorrel-ations

Correlations: Too Large to Ignore

How to compute correlationsround(100 * cor(bellcore)) c1 c2 c3 c4 c5 m1 m2 m3 m4c1 100 -19 0 0 -33 -17 -26 -33 -33c2 -19 100 0 0 58 -30 -45 -58 -19c3 0 0 100 47 0 -21 -32 -41 -41c4 0 0 47 100 -31 -16 -24 -31 -31c5 -33 58 0 -31 100 -17 -26 -33 -33m1 -17 -30 -21 -16 -17 100 67 52 -17m2 -26 -45 -32 -24 -26 67 100 77 26m3 -33 -58 -41 -31 -33 52 77 100 56m4 -33 -19 -41 -31 -33 -17 26 56 100

round(100 * cor(t(bellcore))) human interface computer user system response time EPS survey trees graph minorshuman 100 36 36 -38 43 -29 -29 36 -29 -38 -38 -29interface 36 100 36 19 4 -29 -29 36 -29 -38 -38 -29computer 36 36 100 19 4 36 36 -29 36 -38 -38 -29user -38 19 19 100 23 76 76 19 19 -50 -50 -38system 43 4 4 23 100 4 4 82 4 -46 -46 -35response -29 -29 36 76 4 100 100 -29 36 -38 -38 -29time -29 -29 36 76 4 100 100 -29 36 -38 -38 -29EPS 36 36 -29 19 82 -29 -29 100 -29 -38 -38 -29survey -29 -29 36 19 4 36 36 -29 100 -38 19 36trees -38 -38 -38 -50 -46 -38 -38 -38 -38 100 50 19graph -38 -38 -38 -50 -46 -38 -38 -38 19 50 100 76minors -29 -29 -29 -38 -35 -29 -29 -29 36 19 76 100

plot(hclust(as.dist(-cor(t(bellcore)))))

plot(hclust(as.dist(-cor(bellcore))))

Correcting for

Large Correlations

Thesaurus

Term by Doc Matrix:

Before & After Thesaurus

Singular Value Decomposition (SVD)X = UDVT

X = U

VTD

t x d t x m m x dm x m

• m is the rank of X < min(t, d)

• D is diagonal

– D2 are eigenvalues (sorted in descending order)

• U UT = I and V VT = I

– Columns of U are eigenvectors of X XT

– Columns of V are eigenvectors of XT X

• m is the rank of X < min(t, d)

• D is diagonal

– D2 are eigenvalues (sorted in descending order)

• U UT = I and V VT = I

– Columns of U are eigenvectors of X XT

– Columns of V are eigenvectors of XT X

Dimensionality Reduction

X =

t x d t x k k x dk x k

k is the number of latent concepts

(typically 300 ~ 500)

U

D VT

^

Dimension Reduction in R

b = svd(bellcore)b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])dimnames(b2) = dimnames(bellcore)par(mfrow=c(2,2))plot(hclust(as.dist(-cor(bellcore))))plot(hclust(as.dist(-cor(t(bellcore)))))plot(hclust(as.dist(-cor(b2))))plot(hclust(as.dist(-cor(t(b2)))))

SVDB BT = U D2 UT

BT B = V D2 VT

Latent

Term

Doc

Dimension Reduction Block Structureround(100*cor(bellcore)) c1 c2 c3 c4 c5 m1 m2 m3 m4c1 100 -19 0 0 -33 -17 -26 -33 -33c2 -19 100 0 0 58 -30 -45 -58 -19c3 0 0 100 47 0 -21 -32 -41 -41c4 0 0 47 100 -31 -16 -24 -31 -31c5 -33 58 0 -31 100 -17 -26 -33 -33m1 -17 -30 -21 -16 -17 100 67 52 -17m2 -26 -45 -32 -24 -26 67 100 77 26m3 -33 -58 -41 -31 -33 52 77 100 56m4 -33 -19 -41 -31 -33 -17 26 56 100> round(100*cor(b2)) c1 c2 c3 c4 c5 m1 m2 m3 m4c1 100 91 100 100 84 -86 -85 -85 -81c2 91 100 91 88 99 -57 -56 -56 -50c3 100 91 100 100 84 -86 -85 -85 -81c4 100 88 100 100 81 -89 -88 -88 -84c5 84 99 84 81 100 -44 -44 -43 -37m1 -86 -57 -86 -89 -44 100 100 100 100m2 -85 -56 -85 -88 -44 100 100 100 100m3 -85 -56 -85 -88 -43 100 100 100 100m4 -81 -50 -81 -84 -37 100 100 100 100

Dimension Reduction Block Structureround(100*cor(t(bellcore))) human interface computer user system response time EPS survey trees graph minorshuman 100 36 36 -38 43 -29 -29 36 -29 -38 -38 -29interface 36 100 36 19 4 -29 -29 36 -29 -38 -38 -29computer 36 36 100 19 4 36 36 -29 36 -38 -38 -29user -38 19 19 100 23 76 76 19 19 -50 -50 -38system 43 4 4 23 100 4 4 82 4 -46 -46 -35response -29 -29 36 76 4 100 100 -29 36 -38 -38 -29time -29 -29 36 76 4 100 100 -29 36 -38 -38 -29EPS 36 36 -29 19 82 -29 -29 100 -29 -38 -38 -29survey -29 -29 36 19 4 36 36 -29 100 -38 19 36trees -38 -38 -38 -50 -46 -38 -38 -38 -38 100 50 19graph -38 -38 -38 -50 -46 -38 -38 -38 19 50 100 76minors -29 -29 -29 -38 -35 -29 -29 -29 36 19 76 100> round(100*cor(t(b2))) human interface computer user system response time EPS survey trees graph minorshuman 100 100 93 94 99 82 82 100 -12 -85 -84 -83interface 100 100 95 96 100 85 85 100 -7 -82 -80 -80computer 93 95 100 100 96 98 98 93 26 -59 -57 -56user 94 96 100 100 97 97 97 94 23 -62 -60 -59system 99 100 96 97 100 88 88 100 -2 -79 -78 -77response 82 85 98 97 88 100 100 83 46 -40 -38 -37time 82 85 98 97 88 100 100 83 46 -40 -38 -37EPS 100 100 93 94 100 83 83 100 -11 -84 -83 -82survey -12 -7 26 23 -2 46 46 -11 100 63 65 66trees -85 -82 -59 -62 -79 -40 -40 -84 63 100 100 100graph -84 -80 -57 -60 -78 -38 -38 -83 65 100 100 100minors -83 -80 -56 -59 -77 -37 -37 -82 66 100 100 100

t1

t2

t3

d1 d2

The space has as many dimensions as there are terms in the word list.

The term vector space

• term

document

query

--- cosine > 0.9

Latent concept vector space