Information Retrieval



Page 1: Information Retrieval

(C) 2003, The University of Michigan 1

Information Retrieval

Handout #2

February 3, 2003

Page 2: Information Retrieval

Course Information

• Instructor: Dragomir R. Radev ([email protected])

• Office: 3080, West Hall Connector

• Phone: (734) 615-5225

• Office hours: M&F 11-12

• Course page: http://tangra.si.umich.edu/~radev/650/

• Class meets on Mondays, 1-4 PM in 409 West Hall

Page 3: Information Retrieval

Queries and documents

Page 4: Information Retrieval

Queries

• Single-word queries

• Context queries
  – Phrases
  – Proximity

• Boolean queries

• Natural Language queries
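Boolean queries from the list above can be illustrated with a small inverted index. This is a minimal sketch, not the course's own implementation: the helper `build_index` and the three sample documents are assumptions made for illustration, and tokenization is plain whitespace splitting.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical sample collection.
docs = {
    "D1": "car safety minivans tests injury statistics",
    "D2": "liability tests safety",
    "D3": "car passengers injury reviews",
}
index = build_index(docs)

# Boolean operators map directly onto set operations.
safety_and_tests = index["safety"] & index["tests"]    # AND
car_or_liability = index["car"] | index["liability"]   # OR
car_not_safety = index["car"] - index["safety"]        # AND NOT
```

Representing postings as sets keeps the sketch short; a real index would typically store sorted postings lists and merge them.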

Page 5: Information Retrieval

Pattern matching

• Words, prefixes, suffixes, substrings, ranges, regular expressions

• Structured queries (e.g., XML)
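The word-level pattern types listed above can be sketched over a small vocabulary. The function names and the vocabulary are hypothetical, and a range is assumed to mean all words lexicographically between two bounds, inclusive.

```python
import re

# Hypothetical vocabulary of index terms.
vocabulary = ["retrieval", "retrieve", "information", "inform",
              "index", "indexing", "query"]

def match_prefix(words, prefix):
    return [w for w in words if w.startswith(prefix)]

def match_suffix(words, suffix):
    return [w for w in words if w.endswith(suffix)]

def match_substring(words, fragment):
    return [w for w in words if fragment in w]

def match_range(words, low, high):
    """Words that sort between low and high, inclusive."""
    return [w for w in words if low <= w <= high]

def match_regex(words, pattern):
    """Words that match the regular expression in full."""
    rx = re.compile(pattern)
    return [w for w in words if rx.fullmatch(w)]
```

For example, `match_prefix(vocabulary, "retriev")` selects both "retrieval" and "retrieve", while `match_regex(vocabulary, r"in(dex|form).*")` selects the four words beginning with "index" or "inform".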

Page 6: Information Retrieval

Relevance feedback

• Query expansion

• Term reweighting

• Pseudo-relevance feedback

• Latent semantic indexing

• Distributional clustering

Page 7: Information Retrieval

Document processing

• Lexical analysis

• Stopword elimination

• Stemming

• Index term identification

• Thesauri
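The first two steps in the list above, lexical analysis and stopword elimination, can be sketched as follows. The stopword list is a deliberately tiny, hypothetical one, and the tokenizer assumes lowercase ASCII alphanumeric tokens; stemming would then be applied to the surviving tokens.

```python
import re

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lexical analysis: lowercase the text and extract alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_terms(text):
    """Tokenize the text, then drop stopwords."""
    return [tok for tok in tokenize(text) if tok not in STOPWORDS]

terms = index_terms("The safety of the minivans is tested.")
```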

Page 8: Information Retrieval

Porter’s algorithm

1. The measure, m, of a stem is a function of sequences of vowels followed by a consonant. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form $[C](VC)^m[V]$, where the initial C and final V are optional and m is the number of VC repeats.
   m=0: free, why
   m=1: frees, whose
   m=2: prologue, compute
2. *<X> - stem ends with letter X
3. *v* - stem contains a vowel
4. *d - stem ends in a double consonant
5. *o - stem ends with a consonant-vowel-consonant sequence where the final consonant is not w, x, or y
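The measure m can be computed by counting vowel-to-consonant transitions in the stem. The sketch below is illustrative: the function names are hypothetical, and it assumes Porter's usual convention (not stated on the slide) that "y" counts as a vowel when it follows a consonant.

```python
VOWELS = set("aeiou")

def is_vowel(word, i):
    """True for a, e, i, o, u, and for 'y' when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem):
    """Count the VC repeats m in the form [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cur_vowel = is_vowel(stem, i)
        # Each vowel run that is followed by a consonant closes one VC pair.
        if prev_vowel and not cur_vowel:
            m += 1
        prev_vowel = cur_vowel
    return m
```

Applied to the slide's examples, `measure` gives 0 for "free" and "why", 1 for "frees" and "whose", and 2 for "prologue" and "compute".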

Page 9: Information Retrieval

Porter’s algorithm

• Suffix conditions take the form current_suffix == pattern

Actions are in the form old_suffix -> new_suffix. Rules are divided into steps to define the order of applying the rules. The following are some examples of the rules:

STEP  CONDITION              SUFFIX  REPLACEMENT    EXAMPLE
1a    NULL                   sses    ss             stresses -> stress
1b    *v*                    ing     NULL           making -> mak
1b1   NULL                   at      ate            inflat(ed) -> inflate
1c    *v*                    y       i              happy -> happi
2     m>0                    aliti   al             formaliti -> formal
3     m>0                    icate   ic             duplicate -> duplic
4     m>1                    able    NULL           adjustable -> adjust
5a    m>1                    e       NULL           inflate -> inflat
5b    m>1 and *d and *<L>    NULL    single letter  controll -> control
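A single rule of this kind can be sketched as a function that checks the suffix, tests the condition on the remaining stem, and rewrites the word if both hold. This is only an illustration of the rule format, not a full Porter stemmer: `apply_rule` is a hypothetical helper, only measure-based conditions are shown, and "y" after a consonant is assumed to count as a vowel.

```python
VOWELS = set("aeiou")

def is_vowel(word, i):
    """True for a, e, i, o, u, and for 'y' when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return True
    return ch == "y" and i > 0 and not is_vowel(word, i - 1)

def measure(stem):
    """Number of vowel-run-then-consonant (VC) pairs in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cur_vowel = is_vowel(stem, i)
        if prev_vowel and not cur_vowel:
            m += 1
        prev_vowel = cur_vowel
    return m

def apply_rule(word, suffix, replacement, condition):
    """Rewrite word if it ends with suffix and the remaining stem passes condition."""
    if not word.endswith(suffix):
        return word
    stem = word[: len(word) - len(suffix)]
    return stem + replacement if condition(stem) else word

# A few of the example rules, with NULL written as "" or as an always-true condition.
step_1a = apply_rule("stresses", "sses", "ss", lambda s: True)
step_2 = apply_rule("formaliti", "aliti", "al", lambda s: measure(s) > 0)
step_3 = apply_rule("duplicate", "icate", "ic", lambda s: measure(s) > 0)
step_4 = apply_rule("adjustable", "able", "", lambda s: measure(s) > 1)
step_5a = apply_rule("inflate", "e", "", lambda s: measure(s) > 1)
```

The condition is evaluated on the stem, that is, the word with the suffix removed. This is why "rate" is left unchanged by rule 5a: its stem "rat" has m=1.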

Page 10: Information Retrieval

Porter’s algorithm

Example: the word “duplicatable”

duplicat    rule 4
duplicate   rule 1b1
duplic      rule 3

Another rule in step 4, which would remove “ic”, cannot be applied, since only one rule from each step may be applied.

Page 11: Information Retrieval

Porter’s algorithm

Computable Comput

Intervention Intervent

Retrieval Retriev

Document Docum

Representing Repres

Representative Repres

Page 12: Information Retrieval

Relevance feedback

• Automatic

• Manual

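• Method: identifying feedback terms

$Q' = a_1 Q + a_2 R - a_3 N$

Here Q is the original query vector, R is the sum of the term vectors of the relevant documents, and N is the sum of the term vectors of the non-relevant documents. Often $a_1 = 1$, $a_2 = 1/|R|$ and $a_3 = 1/|N|$, where |R| and |N| are the numbers of relevant and non-relevant documents.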

Page 13: Information Retrieval

Example

• Q = “safety minivans”
• D1 = “car safety minivans tests injury statistics” - relevant
• D2 = “liability tests safety” - relevant
• D3 = “car passengers injury reviews” - non-relevant
• R = ?
• S = ?
• Q’ = ?
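One way to work this example is to treat the query and documents as raw term-count vectors and apply the feedback formula Q’ = a1Q + a2R - a3N with a1 = 1, a2 = 1/|R| and a3 = 1/|N|. The sketch below makes several assumptions: S on the slide is taken to be the non-relevant set (N in the formula), weights are plain term counts, and `vec` and `feedback` are hypothetical helper names.

```python
from collections import Counter

def vec(text):
    """Term-count vector for a whitespace-tokenized text."""
    return Counter(text.split())

def feedback(query, relevant, nonrelevant, a1=1.0):
    """Q' = a1*Q + a2*sum(relevant) - a3*sum(nonrelevant), a2 = 1/|R|, a3 = 1/|N|."""
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    new_query = {}
    for term, weight in query.items():
        new_query[term] = new_query.get(term, 0.0) + a1 * weight
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] = new_query.get(term, 0.0) + a2 * weight
    for doc in nonrelevant:
        for term, weight in doc.items():
            new_query[term] = new_query.get(term, 0.0) - a3 * weight
    return new_query

q = vec("safety minivans")
relevant_docs = [vec("car safety minivans tests injury statistics"),
                 vec("liability tests safety")]
nonrelevant_docs = [vec("car passengers injury reviews")]

q_new = feedback(q, relevant_docs, nonrelevant_docs)
```

Under these assumptions the new query weights are safety 2.0, minivans 1.5, tests 1.0, statistics 0.5, liability 0.5, car -0.5, injury -0.5, passengers -1.0 and reviews -1.0. Many systems drop or clip the negative weights; this sketch keeps them to mirror the formula.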

Page 14: Information Retrieval

Automatic query expansion

• Thesaurus-based expansion

• Distributional similarity-based expansion

Page 15: Information Retrieval

WordNet and DistSim

wn reason -hypen - hypernyms

wn reason -synsn - synsets

wn reason -simsn - synonyms

wn reason -over - overview of senses

wn reason -famln - familiarity/polysemy

wn reason -grepn - compound nouns

/clair3/tools/relatedwords/relate reason

Page 16: Information Retrieval

Related (substitutable) words

WordNet:

Book: publication, product, fact, dramatic composition, record
Computer: machine, expert, calculator, reckoner, figurer
Fruit: reproductive structure, consequence, product, bear
Politician: leader, schemer
Newspaper: press, publisher, product, paper, newsprint

Distributional clustering:

Book: autobiography, essay, biography, memoirs, novels
Computer: adobe, computing, computers, developed, hardware
Fruit: leafy, canned, fruits, flowers, grapes
Politician: activist, campaigner, politicians, intellectuals, journalist
Newspaper: daily, globe, newspapers, newsday, paper

Page 17: Information Retrieval

Indexing and searching

Page 18: Information Retrieval

Computing term salience

• Term frequency (TF)

• Document frequency (DF)

• Inverse document frequency (IDF)

$\mathrm{IDF}(w) = \log\left(\frac{N}{\mathrm{DF}(w)}\right)$
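The three quantities can be computed directly from tokenized documents, using IDF(w) = log(N / DF(w)) with N the number of documents in the collection. This is a minimal sketch: the function names and sample collection are hypothetical, and the natural logarithm is an assumption, since the slide does not fix the base.

```python
import math
from collections import Counter

def term_frequency(tokens):
    """TF: number of occurrences of each term in one document."""
    return Counter(tokens)

def document_frequency(docs):
    """DF: number of documents in which each term occurs at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def inverse_document_frequency(docs):
    """IDF(w) = log(N / DF(w)), using the natural logarithm."""
    n = len(docs)
    df = document_frequency(docs)
    return {w: math.log(n / df[w]) for w in df}

# Hypothetical three-document collection.
docs = [
    "car safety minivans tests injury statistics".split(),
    "liability tests safety".split(),
    "car passengers injury reviews".split(),
]
tf = term_frequency(docs[0])
df = document_frequency(docs)
idf = inverse_document_frequency(docs)
```

A term that occurs in every document gets IDF 0, while a term such as "liability", which occurs in one of the three documents, gets log(3).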

Page 19: Information Retrieval

Scripts to compute tf and idf

cd /clair4/class/ir-w03/hw2

./tf.pl 053.txt | sort -nr -k 2 | more

./tfs.pl 053.txt | sort -nr -k 2 | more

./stem.pl reasonableness

./build-idf.pl

./idf.pl | sort -n -k 3 | more

Page 20: Information Retrieval

Applications of TFIDF

• Cosine similarity

• Indexing

• Clustering
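As an illustration of the first application, cosine similarity between two documents can be computed over sparse TF*IDF vectors. The sketch assumes vectors stored as term-to-weight dictionaries and weights of the form tf * idf; `tfidf_vector`, `cosine` and the sample IDF table are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Sparse vector of tf * idf weights; terms missing from idf get weight 0."""
    tf = Counter(tokens)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical IDF table for a three-document collection.
idf = {"safety": math.log(3 / 2), "tests": math.log(3 / 2),
       "liability": math.log(3), "reviews": math.log(3)}

d1 = tfidf_vector("liability tests safety".split(), idf)
d2 = tfidf_vector("safety tests".split(), idf)
d3 = tfidf_vector("reviews".split(), idf)
```

Documents that share highly weighted terms score close to 1, and documents with no terms in common score 0. The same measure can rank documents against a query vector or serve as the similarity function in clustering.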