Vector Space Model
Any text object can be represented by a term vector. Examples: documents, queries, sentences, …
A query is viewed as a short document.
Similarity is determined by distance in a vector space. Example: the cosine of the angle between the vectors.
The SMART system: developed at Cornell University, 1960-1999. Still widely used.
Vector Space Model
Documents are represented as vectors in a multi-dimensional Euclidean space. Each axis = a term (token).
The coordinate of document d in the direction of term t is determined by:
Term frequency TF(d,t): the number of times t occurs in document d, scaled in a variety of ways to normalize document length.
Inverse document frequency IDF(t): used to scale down the terms that occur in many documents.
Term Frequency: Scaling
The number of times t occurs in document d: n(d,t). Two common length normalizations:

TF(d,t) = \frac{n(d,t)}{\sum_{\tau \in W} n(d,\tau)}

TF(d,t) = \frac{n(d,t)}{\max_{\tau \in W} n(d,\tau)}

The Cornell SMART system uses:

TF(d,t) = \begin{cases} 0 & \text{if } n(d,t) = 0 \\ 1 + \log(1 + \log n(d,t)) & \text{otherwise} \end{cases}
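The three TF variants above can be sketched in pure Python as follows. The function names are mine, and the logarithm base in the SMART formula is not fixed by the slide, so the natural log is assumed:

```python
import math

def tf_sum_normalized(n_d: dict, t: str) -> float:
    """TF(d,t) = n(d,t) / sum of n(d,tau) over all terms tau in d."""
    total = sum(n_d.values())
    return n_d.get(t, 0) / total if total else 0.0

def tf_max_normalized(n_d: dict, t: str) -> float:
    """TF(d,t) = n(d,t) / max of n(d,tau) over all terms tau in d."""
    peak = max(n_d.values(), default=0)
    return n_d.get(t, 0) / peak if peak else 0.0

def tf_smart(n_dt: int) -> float:
    """Cornell SMART damped TF: 0 if the term is absent,
    else 1 + log(1 + log(n(d,t))) (natural log assumed)."""
    if n_dt == 0:
        return 0.0
    return 1.0 + math.log(1.0 + math.log(n_dt))
```

Here a document is represented simply as a dict of term counts; the double log makes SMART TF grow very slowly with repeated occurrences.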
Inverse Document Frequency
Not all axes (terms) in the vector space are equally important.
IDF seeks to scale down the coordinates of terms that occur in many documents. The Cornell SMART system uses:

IDF(t) = \log \frac{1 + |D|}{|D_t|}

If |D_t| \ll |D|, the term t will enjoy a large IDF scale, and vice versa.
Other variants are also used; these are mostly dampened functions of |D_t| / |D|.
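A minimal sketch of the SMART IDF formula above (the function name is mine; natural log assumed, and |D_t| is assumed to be at least 1):

```python
import math

def idf_smart(num_docs: int, doc_freq: int) -> float:
    """Cornell SMART IDF(t) = log((1 + |D|) / |D_t|).

    num_docs: |D|, the number of documents in the corpus.
    doc_freq: |D_t| >= 1, the number of documents containing term t.
    """
    return math.log((1 + num_docs) / doc_freq)
```

A term appearing in every document gets an IDF near zero; a term appearing in one document gets the largest IDF in the corpus.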
TFIDF-space
An obvious way to combine TF and IDF: the coordinate of document d in axis t is given by

d_t = TF(d,t) \cdot IDF(t)

The general form of d_t consists of three parts:

d_t = L_{td} \cdot G_t \cdot D_d

L_{td}: local weight for term t occurring in doc d
G_t: global weight for term t occurring in the corpus
D_d: document normalization factor
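As a sketch of the L·G·D decomposition (names are mine), one common instantiation uses the raw count as the local weight, IDF as the global weight, and cosine (unit-length) normalization as the document factor:

```python
import math

def tfidf_vector(term_counts: dict, idf: dict) -> dict:
    """d_t = L_td * G_t * D_d with L_td = raw count, G_t = IDF(t),
    and D_d = cosine (unit-length) normalization of the document."""
    raw = {t: n * idf.get(t, 0.0) for t, n in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw
```

Other choices (e.g. SMART damped TF for L, or no normalization for D) drop into the same three-factor template.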
Term-by-Document Matrix
A document collection (corpus) composed of n documents that are indexed by m terms (tokens) can be represented as an m × n matrix A.
Summary
Tokenization
Removing stopwords
Stemming
Term Weighting
TF: local
IDF: global
Normalization
TF-IDF Vector Space
Term-by-Document Matrix
Reuters-21578: 21578 docs, ~27000 terms, and 135 classes
21578 documents: 1-14818 belong to the training set; 14819-21578 belong to the testing set
Reuters-21578 includes 135 categories; using the ApteMod version of the TOPICS set
results in 90 categories with 7,770 training documents and 3,019 testing documents
Preprocessing Procedures (cont.)
[Figures: term statistics after stopword elimination and after the Porter stemming algorithm]
[Figure: CRISP-DM cycle around the data — Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment]
Problems with Vector Space Model
How to define/select the 'basic concepts'? The VS model treats each term as a basis vector. E.g., q = ('microsoft', 'software'), d = ('windows_xp').
How to assign weights to different terms? Need to distinguish common words from uninformative words. Weight in a query indicates importance of the term; weight in a doc indicates how well the term characterizes the doc.
How to define similarity/distance function?
How to store the term-by-document matrix?
Choice of ‘Basic Concepts’
[Figure: document D1 represented along different candidate basis terms — 'Java', 'Microsoft', 'Starbucks'. Which one is better?]
Vector Space Model: Similarity
Given:
A query q = (q_1, q_2, …, q_n), where q_i is the term frequency of the i-th word
A document d_k = (d_{k,1}, d_{k,2}, …, d_{k,n}), where d_{k,i} is the term frequency of the i-th word
Similarity of a query q to a document dk
sim(q, d_k) = q \cdot d_k = q_1 d_{k,1} + q_2 d_{k,2} + \dots + q_n d_{k,n}

sim'(q, d_k) = \cos(\theta(q, d_k)) = \frac{q_1 d_{k,1} + q_2 d_{k,2} + \dots + q_n d_{k,n}}{\sqrt{q_1^2 + q_2^2 + \dots + q_n^2}\,\sqrt{d_{k,1}^2 + d_{k,2}^2 + \dots + d_{k,n}^2}}
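The normalized similarity sim' above can be sketched directly (the function name is mine):

```python
import math

def cosine_sim(q: list, d: list) -> float:
    """sim'(q, d_k): dot product divided by the two Euclidean norms."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0
```

Identical directions give 1.0, orthogonal vectors give 0.0, and scaling either vector leaves the score unchanged — which is exactly why cosine is preferred over the raw dot product for documents of different lengths.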
Terms                    Documents
T1: Bab(y, ies, y's)     D1: Infant & Toddler First Aid
T2: Child(ren's)         D2: Babies & Children's Room (For Your Home)
T3: Guide                D3: Child Safety at Home
T4: Health               D4: Your Baby's Health and Safety: From Infant to Toddler
T5: Home                 D5: Baby Proofing Basics
T6: Infant               D6: Your Guide to Easy Rust Proofing
T7: Proofing             D7: Beanie Babies Collector's Guide
T8: Safety
T9: Toddler
The 9 × 7 term-by-document matrix before normalization, where the element â_ij is the number of times term i appears in document title j:

Â =
| 0 1 0 1 1 0 1 |
| 0 1 1 0 0 0 0 |
| 0 0 0 0 0 1 1 |
| 0 0 0 1 0 0 0 |
| 0 1 1 0 0 0 0 |
| 1 0 0 1 0 0 0 |
| 0 0 0 0 1 1 0 |
| 0 0 1 1 0 0 0 |
| 1 0 0 1 0 0 0 |
The 9 × 7 term-by-document matrix with unit columns:

A =
| 0      0.5774 0      0.4472 0.7071 0      0.7071 |
| 0      0.5774 0.5774 0      0      0      0      |
| 0      0      0      0      0      0.7071 0.7071 |
| 0      0      0      0.4472 0      0      0      |
| 0      0.5774 0.5774 0      0      0      0      |
| 0.7071 0      0      0.4472 0      0      0      |
| 0      0      0      0      0.7071 0.7071 0      |
| 0      0      0.5774 0.4472 0      0      0      |
| 0.7071 0      0      0.4472 0      0      0      |
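The unit-column matrix A is obtained from Â by dividing each column by its Euclidean norm. A pure-Python sketch using the incidence matrix from the slide (rounded to four decimals, the entries match the values shown above):

```python
import math

# Incidence matrix A-hat from the slide (9 terms x 7 document titles).
A_hat = [
    [0, 1, 0, 1, 1, 0, 1],  # T1 Baby
    [0, 1, 1, 0, 0, 0, 0],  # T2 Child
    [0, 0, 0, 0, 0, 1, 1],  # T3 Guide
    [0, 0, 0, 1, 0, 0, 0],  # T4 Health
    [0, 1, 1, 0, 0, 0, 0],  # T5 Home
    [1, 0, 0, 1, 0, 0, 0],  # T6 Infant
    [0, 0, 0, 0, 1, 1, 0],  # T7 Proofing
    [0, 0, 1, 1, 0, 0, 0],  # T8 Safety
    [1, 0, 0, 1, 0, 0, 0],  # T9 Toddler
]

def unit_columns(M):
    """Scale every column of M to unit Euclidean length."""
    rows, cols = len(M), len(M[0])
    norms = [math.sqrt(sum(M[i][j] ** 2 for i in range(rows))) for j in range(cols)]
    return [[M[i][j] / norms[j] if norms[j] else 0.0 for j in range(cols)]
            for i in range(rows)]

A = unit_columns(A_hat)
```

For example, column D1 has two nonzeros, so each becomes 1/√2 ≈ 0.7071; column D4 has five, each 1/√5 ≈ 0.4472.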
CRS (Compressed Row Storage):
val:     0.5774 0.4472 0.7071 0.7071 0.5774 0.5774 0.7071 0.7071 0.4472 0.5774 0.5774 0.7071 0.4472 0.7071 0.7071 0.5774 0.4472 0.7071 0.4472
col_ind: 2 4 5 7 2 3 6 7 4 2 3 1 4 5 6 3 4 1 4
row_ptr: 1 5 7 9 10 12 14 16 18 20
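A sketch of how the CRS arrays above are built from a dense matrix (the function name is mine; 1-based indices to match the slide):

```python
def to_crs(M):
    """Compressed Row Storage: walk the matrix row by row, recording each
    nonzero value and its column; row_ptr[i] marks where row i starts in val."""
    val, col_ind, row_ptr = [], [], [1]
    for row in M:
        for j, x in enumerate(row):
            if x != 0:
                val.append(x)
                col_ind.append(j + 1)
        row_ptr.append(len(val) + 1)
    return val, col_ind, row_ptr
```

Note that row_ptr has one entry per row plus a final sentinel, so its last value is nnz + 1 (here 20 for 19 nonzeros).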
CCS (Compressed Column Storage):
val:     0.7071 0.7071 0.5774 0.5774 0.5774 0.5774 0.5774 0.5774 0.4472 0.4472 0.4472 0.4472 0.4472 0.7071 0.7071 0.7071 0.7071 0.7071 0.7071
row_ind: 6 9 1 2 5 2 5 8 1 4 6 8 9 1 7 3 7 1 3
col_ptr: 1 3 6 9 14 16 18 20
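CCS is the column-wise mirror of CRS. A sketch (function name mine; 1-based indices to match the slide):

```python
def to_ccs(M):
    """Compressed Column Storage: walk the matrix column by column, recording
    each nonzero value and its row; col_ptr[j] marks where column j starts."""
    rows, cols = len(M), len(M[0])
    val, row_ind, col_ptr = [], [], [1]
    for j in range(cols):
        for i in range(rows):
            if M[i][j] != 0:
                val.append(M[i][j])
                row_ind.append(i + 1)
        col_ptr.append(len(val) + 1)
    return val, row_ind, col_ptr
```

CCS is the natural layout for a term-by-document matrix, since each column (document vector) is stored contiguously.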
Short Review of Linear Algebra
The Terms that You Have to Know!
Basis, linear independence, orthogonality
Column space, row space, rank
Linear combination
Linear transformation
Inner product
Eigenvalue, eigenvector
Projection
Matrix Factorization
LU-Factorization: A = LU
Very useful for solving systems of linear equations; some row exchanges may be required.

QR-Factorization: A = QR, \quad A \in \mathbb{R}^{m \times n},\; Q \in \mathbb{R}^{m \times n},\; R \in \mathbb{R}^{n \times n}
Every matrix A \in \mathbb{R}^{m \times n} with linearly independent columns can be factored into A = QR. The columns of Q are orthonormal, and R is upper triangular and invertible. When m = n and all matrices are square, Q becomes an orthogonal matrix (Q^T Q = I).
QR Factorization Simplifies the Least Squares Problem
The normal equation for the LS problem: A^T A x = A^T b

A^T A x = R^T Q^T Q R x = R^T R x = R^T Q^T b
\Leftrightarrow R x = Q^T b \quad (R^T \text{ is invertible})

A_{:,j} = Q R_{:,j} = \sum_{k=1}^{n} R_{kj} Q_{:,k}

Note: the orthogonal matrix Q constructs the column space of matrix A.
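A pure-Python sketch of the derivation above: a thin QR via classical Gram-Schmidt (assumes linearly independent columns; production code would use Householder reflections or modified Gram-Schmidt for numerical stability), then solving R x = Q^T b by back substitution. Function names are mine:

```python
import math

def qr_gram_schmidt(A):
    """Thin QR of a row-major m x n matrix with linearly independent
    columns, via classical Gram-Schmidt."""
    m, n = len(A), len(A[0])
    Q = [[0.0] * n for _ in range(m)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [A[i][j] for i in range(m)]
        for k in range(j):
            # Project column j onto the already-built orthonormal column k.
            R[k][j] = sum(Q[i][k] * A[i][j] for i in range(m))
            v = [v[i] - R[k][j] * Q[i][k] for i in range(m)]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        for i in range(m):
            Q[i][j] = v[i] / R[j][j]
    return Q, R

def lstsq_qr(A, b):
    """Solve min ||Ax - b|| via R x = Q^T b (back substitution)."""
    Q, R = qr_gram_schmidt(A)
    n = len(R)
    qtb = [sum(Q[i][k] * b[i] for i in range(len(b))) for k in range(n)]
    x = [0.0] * n
    for j in range(n - 1, -1, -1):
        x[j] = (qtb[j] - sum(R[j][k] * x[k] for k in range(j + 1, n))) / R[j][j]
    return x
```

Because Q^T Q = I, the ill-conditioned product A^T A never has to be formed, which is the numerical payoff of the QR route.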
Motivation for Computing QR of the term-by-doc Matrix
The basis vectors of the column space of A can be used to describe the semantic content of the corresponding text collection.

Let \theta_k be the angle between a query q and the document vector A_{:,k}:

\cos\theta_k = \frac{A_{:,k}^T q}{\|A_{:,k}\|_2 \|q\|_2} = \frac{(Q R_{:,k})^T q}{\|R_{:,k}\|_2 \|q\|_2} = \frac{R_{:,k}^T (Q^T q)}{\|R_{:,k}\|_2 \|q\|_2}

That means we can keep Q and R instead of A.
QR can also be applied to dimensionality reduction.