COMP791A: Statistical Language Processing


Transcript of COMP791A: Statistical Language Processing

Page 1: COMP791A: Statistical Language Processing

1

COMP791A: Statistical Language Processing

Information Retrieval
[M&S] 15.1-15.2

[J&M] 17.3

Page 2: COMP791A: Statistical Language Processing

2

The problem

The standard information retrieval (IR) scenario:
- The user has an information need
- The user types a query that describes the information need
- The IR system retrieves a set of documents from a document collection that it believes to be relevant
- The documents are ranked according to their likelihood of being relevant

input: a (large) set/collection of documents, and a user query

output: a (ranked) list of relevant documents

Page 3: COMP791A: Statistical Language Processing

3

Example of IR

Page 4: COMP791A: Statistical Language Processing

4

Page 5: COMP791A: Statistical Language Processing

5

IR within NLP

- IR needs to process large volumes of online text
- and (traditionally) NLP methods were not robust enough to work on thousands of real-world texts
- so IR is:
  - not based on NLP tools (ex. syntactic/semantic analysis)
  - based (mostly) on simple (shallow) techniques, mostly word frequencies
- in IR, the meaning of a document:
  - is the composition of the meanings of its individual words
  - ordering & constituency of words are not taken into account
  - bag-of-words approach

"I see what I eat." / "I eat what I see."  -->  same meaning

Page 6: COMP791A: Statistical Language Processing

6

2 major topics

Indexing
- representing the document collection using words/terms, for fast access to documents

Retrieval methods
- matching a user query to indexed documents
- 3 major models:
  1. boolean model
  2. vector-space model
  3. probabilistic model

Page 7: COMP791A: Statistical Language Processing

7

Indexing

Most IR systems use an inverted file to represent the texts in the collection

Inverted file = a table of terms, each with the list of texts that contain that term

assassination {d1, d4, d95, d5, d90…}
murder {d3, d7, d95…}
Kennedy {d24, d7, d44…}
conspiracy {d3, d55, d90, d98…}
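
To make the data structure concrete, here is a minimal sketch (my own illustration, not from the slides) of an inverted file built with a Python dictionary; the toy collection and the naive whitespace tokenization are assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace tokenization
            index[term].add(doc_id)
    return index

# toy collection (illustrative only)
docs = {
    "d1": "kennedy assassination conspiracy",
    "d2": "murder trial",
    "d3": "kennedy conspiracy theory",
}

index = build_inverted_index(docs)
print(index["kennedy"])      # {'d1', 'd3'}
print(index["conspiracy"])   # {'d1', 'd3'}
```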

Page 8: COMP791A: Statistical Language Processing

8

Example of an inverted file

For each term:
- DocCnt: how many documents the term occurs in (used to compute IDF)
- FreqCnt: how many times the term occurs in all documents

For each document:
- Freq: how many times the term occurs in this doc
- WordPosition: the offsets where these occurrences are found in the document
  - useful to search for terms within n words of each other, to approximate phrases (ex. “car insurance”)
    - but… this is a primitive notion of phrases… just word/byte position in the document: “car insurance” also matches “insurance for car”
  - useful to generate word-in-context snippets
  - useful to highlight terms in the retrieved document
  - …
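
As a rough sketch (not the course's code), a positional inverted index and a "within n words" proximity check might look like the following; the helper names and toy documents are assumptions.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def within_n_words(index, t1, t2, n):
    """Doc ids in which t1 and t2 occur within n words of each other."""
    hits = set()
    for doc_id in index[t1].keys() & index[t2].keys():
        if any(abs(p1 - p2) <= n
               for p1 in index[t1][doc_id]
               for p2 in index[t2][doc_id]):
            hits.add(doc_id)
    return hits

docs = {"d1": "cheap car insurance quotes",
        "d2": "insurance for car quotes"}
idx = build_positional_index(docs)
# both documents match: proximity alone cannot tell the phrase
# "car insurance" apart from "insurance for car"
print(within_n_words(idx, "car", "insurance", 3))
```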

Page 9: COMP791A: Statistical Language Processing

9

Basic Concept of a Retrieval Model

documents and queries are represented by vectors of <term, value> pairs
- term: all possible terms that occur in the query/document
- value: presence or absence of the term in the query/document
  - value can be binary (0, if the term is absent; 1, if the term is present)
  - or some weight (term frequency, tf.idf, or other)

[Figure: a vector of terms T1 … T10 with associated values V1 … V10]

Page 10: COMP791A: Statistical Language Processing

10

Vector-Space Model

- binary values do not tell if a term is more important than others
- so we should weight the terms by importance
- the weight of a term (for document & query) can be its raw frequency or some other measure

[Figure: a vector of terms T1 … T10 with associated weights W1 … W10]

Page 11: COMP791A: Statistical Language Processing

11

Term-by-document matrix

the collection of documents is represented by a matrix of weights, called a term-by-document matrix
- 1 column = representation of one document
- 1 row = representation of 1 term across all documents
- cell wij = weight of term i in document j
- note: the matrix is sparse !!!

        d1    d2    d3    d4    d5   …
term1   w11   w12   w13   w14   w15
term2   w21   w22   w23   w24   w25
term3   w31   w32   w33   w34   w35
…
termN   wn1   wn2   wn3   wn4   wn5
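
A minimal sketch (my own illustration, not from the slides) of such a sparse term-by-document matrix as a nested dictionary, using raw term frequencies as the weights; the toy collection is an assumption.

```python
from collections import Counter, defaultdict

def term_by_document_matrix(docs):
    """Return {term: {doc_id: raw term frequency}} -- a sparse matrix."""
    matrix = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            matrix[term][doc_id] = freq
    return matrix

docs = {"d1": "speech and language processing",
        "d2": "speech recognition speech synthesis"}
m = term_by_document_matrix(docs)
print(m["speech"])   # {'d1': 1, 'd2': 2}
```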

Page 12: COMP791A: Statistical Language Processing

12

An example

The collection:

d1 = {introduction knowledge in speech and language processing ambiguity models and algorithms language thought and understanding the state of the art and the near-term future some brief history summary}

d2 = {hmms and speech recognition speech recognition architecture overview of the hidden markov models the viterbi algorithm revisited advanced methods in decoding acoustic processing of speech computing acoustic probabilities training a speech recognizer waveform generation for speech synthesis human speech recognition summary}

d3 = {language and complexity the chomsky hierarchy how to tell if a language isn’t regular the pumping lemma are English and other languages regular languages ? is natural language context-free complexity and human processing summary}

The query: Q = {speech language processing}

Page 13: COMP791A: Statistical Language Processing

13

An example (con’t)

The collection:

d1 = {introduction knowledge in speech and language processing ambiguity models and algorithms language thought and understanding the state of the art and the near-term future some brief history summary}

d2 = {hmms and speech recognition speech recognition architecture overview of the hidden markov models the viterbi algorithm revisited advanced methods in decoding acoustic processing of speech computing acoustic probabilities training a speech recognizer waveform generation for speech synthesis human speech recognition summary}

d3 = {language and complexity the chomsky hierarchy how to tell if a language isn’t regular the pumping lemma are English and other languages regular languages ? is natural language context-free complexity and human processing summary}

The query: Q = {speech language processing}

Page 14: COMP791A: Statistical Language Processing

14

An example (con’t)

using raw term frequencies:

               d1   d2   d3   Q
introduction    …    …    …    …
knowledge       …    …    …    …
…               …    …    …    …
speech          1    6    0    1
language        2    0    5    1
processing      1    1    1    1
…               …    …    …    …

vectors for the documents and the query can be seen as points in a multi-dimensional space, where each dimension is a term from the query

[Figure: 3-D plot with axes Term 1 (speech), Term 2 (language) and Term 3 (processing), showing the points d1 (1,2,1), d2 (6,0,1), d3 (0,5,1) and q (1,1,1)]
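
To make the construction concrete, here is a small sketch (mine, not the course's code) that derives raw-term-frequency vectors over the query terms; the full chapter texts from the slide are not repeated here, only the query itself is used.

```python
def tf_vector(text, terms):
    """Raw term frequency of each query term in the given text."""
    tokens = text.lower().split()
    return [tokens.count(t) for t in terms]

query_terms = ["speech", "language", "processing"]

# On the full texts of d1, d2, d3 the slide reports the vectors:
#   d1 -> (1, 2, 1)   d2 -> (6, 0, 1)   d3 -> (0, 5, 1)   q -> (1, 1, 1)
q_vec = tf_vector("speech language processing", query_terms)
print(q_vec)   # [1, 1, 1]
```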

Page 15: COMP791A: Statistical Language Processing

15

Document similarity

The longer the document, the more chances it will be retrieved:
- this makes sense, because it may contain many of the query's terms
- but then again, it may also contain lots of non-pertinent terms…

we want to treat vector (1, 2, 1) and vector (2, 4, 2) as equivalent (same distribution of words)

so we can normalise raw term frequencies to convert all vectors to a standard length (ex. 1)
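
As a quick sketch (illustrative, not from the slides), normalizing to unit length indeed maps both of these vectors onto the same point:

```python
import math

def normalize(v):
    """Scale a vector to length 1 (L2 normalization)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

print(normalize([1, 2, 1]))   # [0.408..., 0.816..., 0.408...]
print(normalize([2, 4, 2]))   # same unit vector
```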

Page 16: COMP791A: Statistical Language Processing

16

Example

Query = speech language

original representation:

[Figure: 2-D plot with speech on the x-axis and language on the y-axis, showing the points d1 (1, 2), d2 (6, 0), d3 (0, 5) and q (1, 1)]

Normalization:
- the length of a vector does not matter,
- the angle does.

Page 17: COMP791A: Statistical Language Processing

17

The cosine measure

- the similarity between two documents (or between a doc & a query) is actually the cosine of the angle (in N dimensions) between the 2 vectors
- if 2 document-vectors are identical, they will have a cosine of 1
- if 2 document-vectors are orthogonal (i.e. share no common term), they will have a cosine of 0

[Figure: three diagrams of a Query vector (Q) and a Document vector (D): orthogonal vectors with cos(D,Q) = 0, vectors pointing in the same direction with cos(D,Q) = 1, and an intermediate angle with cos(D,Q) = 0.7]

Page 18: COMP791A: Statistical Language Processing

18

The cosine measure (con’t)

The cosine of 2 vectors (in N dimensions), also known as the normalized inner product:

cos(D, Q) = (D · Q) / (|D| × |Q|) = Σ_{i=1..N} (d_i × q_i) / ( √(Σ_{i=1..N} d_i²) × √(Σ_{i=1..N} q_i²) )

- numerator: the inner product of D and Q
- denominator: the product of the lengths of the two vectors
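
A minimal sketch of the cosine measure in code (my own illustration; the function name is arbitrary):

```python
import math

def cosine(d, q):
    """Cosine of the angle between two term-weight vectors of equal length."""
    inner = sum(di * qi for di, qi in zip(d, q))
    len_d = math.sqrt(sum(di * di for di in d))
    len_q = math.sqrt(sum(qi * qi for qi in q))
    return inner / (len_d * len_q)

print(round(cosine([1, 2, 1], [1, 1, 1]), 2))   # 0.94
```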

Page 19: COMP791A: Statistical Language Processing

19

If you want proof… in 2-D space

to have vectors of length 1 (normalized vectors), divide all of a vector's components by the length of the vector; in 2-dimensional space:

L = √(X² + Y²)                         (length of the vector)
X’ = X / √(X² + Y²) = X / L            (normalized X coordinate)
Y’ = Y / √(X² + Y²) = Y / L            (normalized Y coordinate)
L’ = √(X’² + Y’²) = 1                  (normalized length)

can be skipped

Page 20: COMP791A: Statistical Language Processing

20

Normalized vectors

Query = speech language

Q(1,1)   --> normalized Q’ (0.71, 0.71)     L = √(1² + 1²) = 1.41
d1(1,2)  --> normalized d1’ (0.45, 0.89)    L = √(1² + 2²) = 2.24
d2(6,0)  --> normalized d2’ (1, 0)          L = √(6² + 0²) = 6
d3(0,5)  --> normalized d3’ (0, 1)          L = √(0² + 5²) = 5

[Figure: 2-D plot with speech on the x-axis and language on the y-axis, showing the unit-length points Q’(0.71, 0.71), d1’(0.45, 0.89), d2’(1, 0) and d3’(0, 1)]

can be skipped

Page 21: COMP791A: Statistical Language Processing

21

Similarity between 2 vectors (2-D)

In 2-D (i.e. N = 2; nb of terms = 2)

with the original vectors:  Q = (Xq, Yq)   D = (Xd, Yd)

sim(Q, D) = Σ_{i=1..n} (w_iq × w_id) = (Xq × Xd) + (Yq × Yd)

with the normalized vectors:

Q’ = (X’q, Y’q) = (Xq / Lq , Yq / Lq)
D’ = (X’d, Y’d) = (Xd / Ld , Yd / Ld)

sim(Q, D) = (Xq / Lq)(Xd / Ld) + (Yq / Lq)(Yd / Ld)
          = (Xq Xd + Yq Yd) / (Lq Ld)
          = (Xq Xd + Yq Yd) / ( √(Xq² + Yq²) × √(Xd² + Yd²) )

i.e. the inner product of the normalized vectors is the cosine of the angle between Q and D

can be skipped

Page 22: COMP791A: Statistical Language Processing

22

Similarity in the general case (N-D)

in the general case of N dimensions (N terms):

sim(Q, D) = Σ_{i=1..N} (w_iq × w_id) / ( √(Σ_{i=1..N} w_iq²) × √(Σ_{i=1..N} w_id²) )

which is the cosine of the angle between the vector D and the vector Q in N dimensions

but for normalized vectors the denominator is 1, so sim(Q, D) reduces to the inner product Σ_{i=1..N} (w_iq × w_id)

can be skipped

Page 23: COMP791A: Statistical Language Processing

23

The example again

Q = {speech language processing}  -->  query (1,1,1)
d1 (1,2,1)   d2 (6,0,1)   d3 (0,5,1)

               d1   d2   d3   Q
introduction    1    0    0   0
knowledge       1    0    0   0
speech          1    6    0   1
language        2    0    5   1
processing      1    1    1   1

sim(D, Q) = cos(D, Q) = (D · Q) / (|D| × |Q|) = Σ_{i=1..N} (d_i × q_i) / ( √(Σ_{i=1..N} d_i²) × √(Σ_{i=1..N} q_i²) )

sim(d1, Q) = ((1×1) + (2×1) + (1×1)) / ( √(1² + 2² + 1²) × √(1² + 1² + 1²) ) = 4 / (√6 × √3) ≈ 0.94
sim(d2, Q) = ((6×1) + (0×1) + (1×1)) / ( √(6² + 0² + 1²) × √(1² + 1² + 1²) ) = 7 / (√37 × √3) ≈ 0.66
sim(d3, Q) = ((0×1) + (5×1) + (1×1)) / ( √(0² + 5² + 1²) × √(1² + 1² + 1²) ) = 6 / (√26 × √3) ≈ 0.68
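
As a sanity check, a short sketch (mine, reusing the cosine function idea from above) recomputes the three similarities directly:

```python
import math

def cosine(d, q):
    """Cosine similarity between two term-weight vectors."""
    inner = sum(di * qi for di, qi in zip(d, q))
    return inner / (math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q)))

q = [1, 1, 1]   # (speech, language, processing)
for name, vec in [("d1", [1, 2, 1]), ("d2", [6, 0, 1]), ("d3", [0, 5, 1])]:
    print(name, round(cosine(vec, q), 2))
# d1 0.94   d2 0.66   d3 0.68   -> ranking: d1, d3, d2
```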

Page 24: COMP791A: Statistical Language Processing

24

Term weights

so far, we have used term frequency as the weights

core of most weighting functions:

- tf_ij  term frequency
  - frequency of term i in document j
  - if a term appears often in a document, then it describes well the document contents
  - intra-document characterization

- df_i  document frequency
  - number of documents in the collection containing term i
  - if a term appears in many documents, then it is not useful for distinguishing a document
  - inter-document characterization
  - used to compute idf

Page 25: COMP791A: Statistical Language Processing

25

tf.idf weighting functions

the most widely used family of weighting functions

let:
- M = number of documents in the collection
- Inverse Document Frequency for term i (measures the weight of term i for the query):

  idf_i = log(M / df_i)

intuitively, if M = 1000 (log base 10):
- if df_i = 1000 --> log(1) = 0 --> term i is ignored ! (it appears in all docs)
- if df_i = 10 --> log(100) = 2 --> term i has weight of 2 in the query
- if df_i = 1 --> log(1000) = 3 --> term i has weight of 3 in the query

weight of term i in document d is:  w_id = tf_id × idf_i

family of tf.idf functions:
- w_id = log(tf_id) × log(M / df_i)
- w_id = (0.5 + 0.5 × tf_id / max_j tf_jd) × log(M / df_i)
  where max_j tf_jd is the frequency of the most frequent term j in document d
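
A small sketch (illustrative, not the course's implementation) of the basic idf and tf.idf computation, using log base 10 as in the intuition above:

```python
import math

def idf(M, df_i):
    """Inverse document frequency: log10(M / df_i)."""
    return math.log10(M / df_i)

def tf_idf(tf_id, M, df_i):
    """Basic tf.idf weight of term i in document d."""
    return tf_id * idf(M, df_i)

M = 1000
for df_i in (1000, 10, 1):
    print(df_i, idf(M, df_i))   # 0.0, 2.0, 3.0 -- matching the slide's intuition
```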

Page 26: COMP791A: Statistical Language Processing

26

Evaluation: Precision & Recall

                                 In reality, the document is…
System says…                     pertinent      non-pertinent
document is pertinent                A                B
document is not pertinent            C                D

Recall = A / (A + C)
Precision = A / (A + B)

Recall and precision measure how good a set of retrieved documents is compared with an ideal set of relevant documents

- Recall: what proportion of the relevant documents is actually retrieved?
  = (pertinent docs that were retrieved) / (all pertinent docs that should have been retrieved)
- Precision: what proportion of the retrieved documents is really relevant?
  = (pertinent docs that were retrieved) / (all docs that were retrieved)

Page 27: COMP791A: Statistical Language Processing

27

Evaluation: Example of P&R

Relevant: d3 d5 d9 d25 d39 d44 d56 d71 d123 d389

system1: d123 d84 d56

Precision: ??   Recall: ??

system2: d123 d84 d56 d6 d8 d9
Precision: ??   Recall: ??

Page 28: COMP791A: Statistical Language Processing

28

Evaluation: Example of P&R

Relevant: d3 d5 d9 d25 d39 d44 d56 d71 d123 d389

system1: d123 d84 d56
Precision: 66% (2/3)   Recall: 20% (2/10)

system2: d123 d84 d56 d6 d8 d9
Precision: 50% (3/6)   Recall: 30% (3/10)
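
These numbers can be reproduced with a short sketch (mine, not from the slides):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall of a retrieved list."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d123", "d389"}
print(precision_recall(["d123", "d84", "d56"], relevant))                    # (0.666..., 0.2)
print(precision_recall(["d123", "d84", "d56", "d6", "d8", "d9"], relevant))  # (0.5, 0.3)
```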

Page 29: COMP791A: Statistical Language Processing

29

Evaluation: Problems with P&R

P&R do not evaluate the ranking: the rankings (d123, d84) and (d84, d123) get the same P&R

so other measures are often used:
- document cutoff levels
- P&R curves
- ...

Page 30: COMP791A: Statistical Language Processing

30

Evaluation: Document cutoff levels

fix the number of documents retrieved at several levels (ex. top 5, top 10, top 20, top 100, top 500…) and measure precision at each of these levels

Ex (the relevant documents are d1–d5):

rank   system 1   system 2   system 3
 1        d1        d10        d6
 2        d2        d9         d1
 3        d3        d8         d2
 4        d4        d7         d10
 5        d5        d6         d9
 6        d6        d1         d3
 7        d7        d2         d5
 8        d8        d3         d4
 9        d9        d4         d7
10        d10       d5         d8

precision at 5:    1.0    0.0    0.4
precision at 10:   0.5    0.5    0.5
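
A sketch of precision at a document cutoff (my own helper, using the relevant set implied by the table above):

```python
def precision_at(k, ranking, relevant):
    """Precision over the top-k retrieved documents."""
    top_k = ranking[:k]
    return len([d for d in top_k if d in relevant]) / k

relevant = {"d1", "d2", "d3", "d4", "d5"}
system3 = ["d6", "d1", "d2", "d10", "d9", "d3", "d5", "d4", "d7", "d8"]
print(precision_at(5, system3, relevant))    # 0.4
print(precision_at(10, system3, relevant))   # 0.5
```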

Page 31: COMP791A: Statistical Language Processing

31

Evaluation: P&R curve

measure precision at different levels of recall

usually, precision is measured at 11 recall levels (0%, 10%, 20%, …, 100%)

[Figure: P&R curve — precision (y-axis, 0%–100%) plotted against recall (x-axis, 0%–100%)]

Page 32: COMP791A: Statistical Language Processing

32

Which system performs better?

[Figure: P&R curves of competing systems on the same axes — precision (y-axis, 0%–100%) vs. recall (x-axis, 0%–100%)]

Page 33: COMP791A: Statistical Language Processing

33

Evaluation: A Single Value Measure

cannot take the (arithmetic) mean of P&R:
- if R = 50%, P = 50%  -->  M = 50%
- if R = 100%, P = 10%  -->  M = 55%  (not fair)

take the harmonic mean instead; HM is high only when both P&R are high:

HM = 2 / (1/R + 1/P)

- if R = 50% and P = 50%  -->  HM = 50%
- if R = 100% and P = 10%  -->  HM = 18.2%

take the weighted harmonic mean (wr: weight of R, wp: weight of P, a = 1/wr, b = 1/wp):

WHM = (a + b) / (a/R + b/P)

let β² = a/b, then

WHM = (β² + 1) / (β²/R + 1/P) = (β² + 1) × P × R / (β² × P + R)

… which is called the F-measure

Page 34: COMP791A: Statistical Language Processing

34

Evaluation: the F-measure

A weighted combination of precision and recall:

F = (β² + 1) × P × R / (β² × P + R)

β represents the relative importance of precision and recall:
- when β = 1, precision & recall have the same importance
- when β > 1, recall is favored
- when β < 1, precision is favored
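
A short sketch (my own) of the F-measure, reproducing the harmonic-mean numbers from the previous slide with β = 1:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted combination of precision and recall."""
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(round(f_measure(0.5, 0.5), 3))   # 0.5   (R = P = 50%)
print(round(f_measure(0.1, 1.0), 3))   # 0.182 (R = 100%, P = 10%)
```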

Page 35: COMP791A: Statistical Language Processing

35

Evaluation: How to evaluate

Need a test collection:
- document collection (a few thousand to a few million documents)
- set of queries
- set of relevance judgements

humans must check all documents ??? --> use pooling (TREC):
- take the top 100 from every submission/system
- remove duplicates
- manually assess these only

Page 36: COMP791A: Statistical Language Processing

36

Evaluation: TREC

Text REtrieval Conference/Competition
- run by NIST (National Institute of Standards and Technology)
- 13th edition in 2004

Collection: about 3 gigabytes, > 1 million documents
- newswire & text news (AP, WSJ, …)

Queries + relevance judgments
- queries devised and judged by annotators

Participants
- various research and commercial groups compete

Tracks
- cross-lingual, Filtering Track, Genome Track, Video Track, Web Track, QA, ...

Page 37: COMP791A: Statistical Language Processing

37