Vector Space Model
CS 652 Information Extraction and Integration
Posted: 21-Dec-2015
Introduction
[Diagram: documents and the user's information need (the query) are both represented by index terms; matching the two produces a ranking of documents.]
Introduction
A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query
A ranking is based on fundamental premises regarding the notion of relevance, such as:
common sets of index terms
sharing of weighted terms
likelihood of relevance
Each set of premises leads to a distinct IR model
IR Models
User task: retrieval or browsing
Classic Models: Boolean, Vector (Space), Probabilistic
Set Theoretic: Fuzzy, Extended Boolean
Algebraic: Generalized Vector (Space), Latent Semantic Indexing, Neural Networks
Probabilistic: Inference Network, Belief Network
Structured Models: Non-Overlapping Lists, Proximal Nodes
Browsing: Flat, Structure Guided, Hypertext
Basic Concepts
Each document is described by a set of representative keywords or index terms
Index terms are document words (e.g., nouns) that have meaning by themselves and capture the main themes of a document
However, search engines assume that all words are index terms (full-text representation)
Basic Concepts
Not all terms are equally useful for representing the document contents
The importance of an index term is represented by a weight associated with it
Let
- ki be an index term
- dj be a document
- wij be a weight associated with the pair (ki, dj), which quantifies the importance of ki for describing the contents of dj
The Vector (Space) Model
Define:
- wij > 0 whenever ki occurs in dj
- wiq >= 0, the weight associated with the pair (ki, q)
- vec(dj) = (w1j, w2j, ..., wtj), the document vector of dj
- vec(q) = (w1q, w2q, ..., wtq), the query vector of q
The unit vectors of the t index terms are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
Queries and documents are represented as t-dimensional weighted vectors
The Vector (Space) Model
Sim(q, dj) = cos(θ) = [vec(dj) · vec(q)] / (|dj| |q|)
           = Σ_{i=1..t} wij wiq / ( sqrt(Σ_{i=1..t} wij^2) × sqrt(Σ_{i=1..t} wiq^2) )
where · is the inner product operator, |q| is the length of q, and θ is the angle between the two vectors
Since wij >= 0 and wiq >= 0, 0 <= sim(q, dj) <= 1
A document is retrieved even if it matches the query terms only partially
[Diagram: the angle θ between vec(dj) and vec(q) in term space]
The Vector (Space) Model
Sim(q, dj) = Σ_{i=1..t} wij wiq / (|dj| |q|)
How do we compute the weights wij and wiq?
A good weight must take two effects into account:
- quantification of intra-document content (similarity): the tf factor, the term frequency within a document
- quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency
wij = tf(i, j) × idf(i)
The Vector (Space) Model
Let
- N be the total number of documents in the collection
- ni be the number of documents that contain ki
- freq(i, j) be the raw frequency of ki within dj
A normalized tf factor is given by
f(i, j) = freq(i, j) / max_l freq(l, j)
where the maximum is computed over all terms l that occur within the document dj
The inverse document frequency (idf) factor is
idf(i) = log(N / ni)
The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
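These two factors can be sketched directly from their definitions (the toy document and collection sizes below are invented for illustration):

```python
from math import log

def normalized_tf(freq_ij, doc_freqs):
    """f(i, j) = freq(i, j) / max_l freq(l, j), where doc_freqs maps
    every term occurring in d_j to its raw frequency."""
    return freq_ij / max(doc_freqs.values())

def idf(N, n_i):
    """idf(i) = log(N / n_i): the rarer a term, the more information
    it carries."""
    return log(N / n_i)

# Toy document and a collection of N = 4 documents, 2 containing "model"
doc = {"vector": 3, "space": 1, "model": 2}
print(normalized_tf(doc["model"], doc))  # 2 / 3
print(idf(4, 2))                         # log 2
```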
The Vector (Space) Model
The best term-weighting schemes use weights given by
wij = f(i, j) × log(N / ni)
This strategy is called a tf-idf weighting scheme
For the query term weights, a suggested formula is
wiq = (0.5 + 0.5 × freq(i, q) / max_l freq(l, q)) × log(N / ni)
The vector model with tf-idf weights is a good ranking strategy for general collections
The VSM is usually as good as the known ranking alternatives; it is also simple and fast to compute
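The document and query weighting formulas can be sketched as follows (the collection sizes in the example are made-up numbers, not from the slides):

```python
from math import log

def doc_weight(f_ij, N, n_i):
    """Document weight: w_ij = f(i, j) * log(N / n_i)  (tf-idf)."""
    return f_ij * log(N / n_i)

def query_weight(freq_iq, max_freq_q, N, n_i):
    """Suggested query weight:
    w_iq = (0.5 + 0.5 * freq(i, q) / max_l freq(l, q)) * log(N / n_i).
    The 0.5 floor keeps every query term's tf component at least 0.5."""
    return (0.5 + 0.5 * freq_iq / max_freq_q) * log(N / n_i)

# Toy numbers: a collection of N = 10 documents, term k_i in n_i = 2 of them
print(doc_weight(1.0, N=10, n_i=2))      # full tf: weight = log 5
print(query_weight(1, 2, N=10, n_i=2))   # query term at half the max frequency
```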
The Vector (Space) Model
Advantages:
- term weighting improves the quality of the answer set
- partial matching allows retrieval of documents that approximate the query conditions
- the cosine ranking formula sorts documents according to their degree of similarity to the query
- it is a popular IR model because of its simplicity and speed
Disadvantages:
- it assumes mutual independence of index terms; it is not clear that this is a bad assumption in practice, though
Naïve Bayes Classifier
CS 652 Information Extraction and Integration
Bayes Theorem
The basic starting point for inference problems using probability theory as logic:
P(h | D) = P(D | h) P(h) / P(D)
where P(h) is the prior probability of hypothesis h, P(D | h) is the likelihood of the data D under h, and P(h | D) is the posterior probability of h given D
Bayes Theorem
Example: a lab test for cancer, with priors and likelihoods
P(cancer) = .008, P(~cancer) = .992
P(+ | cancer) = .98, P(- | cancer) = .02
P(+ | ~cancer) = .03, P(- | ~cancer) = .97
For a positive test result:
P(+ | cancer) P(cancer) = (.98)(.008) = .0078
P(+ | ~cancer) P(~cancer) = (.03)(.992) = .0298
so the maximum a posteriori hypothesis is ~cancer
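Plugging these numbers into Bayes' theorem gives the posterior (a quick check; the normalization step is implied rather than shown on the slide):

```python
# Slide numbers: P(cancer) = .008, P(+|cancer) = .98, P(+|~cancer) = .03
p_pos_and_cancer = 0.98 * 0.008    # = 0.00784, the slide's .0078
p_pos_and_nocancer = 0.03 * 0.992  # = 0.02976, the slide's .0298

# Normalizing the two joint probabilities gives P(cancer | +)
posterior = p_pos_and_cancer / (p_pos_and_cancer + p_pos_and_nocancer)
print(round(posterior, 4))  # about 0.21, so ~cancer remains more probable
```

Even with a positive result, the low prior keeps the posterior probability of cancer around 21%.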
Basic Formulas for Probabilities
Naïve Bayes Classifier
Naïve Bayes Algorithm
Naïve Bayes Subtleties
m-estimate of probability: (nc + m·p) / (n + m), where nc is the number of relevant observations, n the total number of observations, p a prior estimate of the probability, and m the equivalent sample size
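A minimal sketch of the m-estimate, assuming the standard formulation (nc + m·p) / (n + m):

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m * p) / (n + m).
    p is a prior estimate and m the 'equivalent sample size' - the
    number of virtual observations the prior is worth."""
    return (n_c + m * p) / (n + m)

# With no observations the estimate falls back to the prior p;
# as n grows it approaches the raw frequency n_c / n.
print(m_estimate(n_c=0, n=0, p=0.1, m=5))
print(m_estimate(n_c=3, n=10, p=0.1, m=5))
```

This avoids zero probability estimates for attribute values never seen with a given class, which would otherwise veto the whole product in the classifier.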
Learning to Classify Text
- Classify text into manually defined groups
- Estimate probability of class membership
- Rank by relevance
- Discover groupings and relationships
  – between texts
  – between real-world entities mentioned in text
Learn_Naïve_Bayes_Text(Example, V)
Calculate_Probability_Terms
Classify_Naïve_Bayes_Text(Doc)
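The bodies of these procedures did not survive the transcript; below is a sketch in the same spirit, a standard multinomial Naïve Bayes learner and classifier (the toy examples, the add-one smoothing choice, and the function names' exact signatures are assumptions):

```python
from collections import Counter
from math import log

def learn_naive_bayes_text(examples, vocabulary):
    """examples: list of (list_of_words, label) pairs. Returns class
    priors P(v) and word probabilities P(w | v) with add-one smoothing,
    a special case of the m-estimate with p = 1/|V| and m = |V|."""
    labels = [label for _, label in examples]
    priors = {v: labels.count(v) / len(examples) for v in set(labels)}
    cond = {}
    for v in priors:
        words = [w for doc, label in examples if label == v for w in doc]
        counts = Counter(words)
        n = len(words)
        cond[v] = {w: (counts[w] + 1) / (n + len(vocabulary))
                   for w in vocabulary}
    return priors, cond

def classify_naive_bayes_text(doc, priors, cond):
    """Return argmax_v P(v) * prod_i P(a_i | v), computed in log space
    to avoid underflow; unknown words are skipped."""
    def score(v):
        return log(priors[v]) + sum(log(cond[v][w]) for w in doc
                                    if w in cond[v])
    return max(priors, key=score)

# Tiny invented training set for illustration
examples = [(["good", "great", "fun"], "pos"),
            (["bad", "boring", "awful"], "neg")]
vocab = {w for doc, _ in examples for w in doc}
priors, cond = learn_naive_bayes_text(examples, vocab)
print(classify_naive_bayes_text(["fun", "great"], priors, cond))  # the positive class wins
```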
How to Improve
- More training data
- Better training data
- Better text representation
  – the usual IR tricks (term weighting, etc.)
  – manually constructed good predictor features
- Hand off hard cases to a human being