Information Retrieval
(C) 2003, The University of Michigan 1
Information Retrieval
Handout #2
February 3, 2003
Course Information
• Instructor: Dragomir R. Radev ([email protected])
• Office: 3080, West Hall Connector
• Phone: (734) 615-5225
• Office hours: M&F 11-12
• Course page: http://tangra.si.umich.edu/~radev/650/
• Class meets on Mondays, 1-4 PM in 409 West Hall
Queries and documents
Queries
• Single-word queries
• Context queries
  – Phrases
  – Proximity
• Boolean queries
• Natural Language queries
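The single-word, Boolean, phrase, and proximity query types listed above can be illustrated on a toy positional index. This is only a minimal sketch under stated assumptions: the three-document collection and the function names (`build_index`, `single`, `boolean_and`, `boolean_or`, `near`, `phrase`) are hypothetical and are not part of the course materials.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def single(index, term):
    """Single-word query: ids of the documents that contain the term."""
    return set(index.get(term, {}))

def boolean_and(index, a, b):
    """Boolean query: documents containing both terms."""
    return single(index, a) & single(index, b)

def boolean_or(index, a, b):
    """Boolean query: documents containing either term."""
    return single(index, a) | single(index, b)

def near(index, a, b, k):
    """Proximity query: a and b occur within k positions of each other."""
    hits = set()
    for doc_id in boolean_and(index, a, b):
        pa, pb = index[a][doc_id], index[b][doc_id]
        if any(abs(i - j) <= k for i in pa for j in pb):
            hits.add(doc_id)
    return hits

def phrase(index, a, b):
    """Phrase query: b occurs immediately after a."""
    hits = set()
    for doc_id in boolean_and(index, a, b):
        pb = set(index[b][doc_id])
        if any(i + 1 in pb for i in index[a][doc_id]):
            hits.add(doc_id)
    return hits

docs = {  # hypothetical toy collection
    1: "car safety minivans tests injury statistics",
    2: "liability tests safety",
    3: "car passengers injury reviews",
}
index = build_index(docs)
print(phrase(index, "safety", "minivans"))
print(near(index, "car", "injury", 2))
```

Storing positions rather than only document ids is what makes the context queries possible; a plain Boolean system would need only the sets of document ids.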
Pattern matching
• Words, prefixes, suffixes, substrings, ranges, regular expressions
• Structured queries (e.g., XML)
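The word-level pattern types listed above (prefixes, suffixes, substrings, ranges, regular expressions) can be sketched as filters over a vocabulary. The vocabulary and the helper names below are hypothetical, and a real system would typically use an index structure rather than a linear scan.

```python
import re

# Hypothetical vocabulary, used only for illustration.
vocabulary = ["compute", "computer", "computing", "retrieval", "retrieve", "stemming"]

def match_prefix(words, prefix):
    """Words that start with the given prefix."""
    return [w for w in words if w.startswith(prefix)]

def match_suffix(words, suffix):
    """Words that end with the given suffix."""
    return [w for w in words if w.endswith(suffix)]

def match_substring(words, sub):
    """Words that contain the given substring anywhere."""
    return [w for w in words if sub in w]

def match_range(words, low, high):
    """Words between low and high in lexicographic order (inclusive)."""
    return [w for w in words if low <= w <= high]

def match_regex(words, pattern):
    """Words that match the regular expression in full."""
    rx = re.compile(pattern)
    return [w for w in words if rx.fullmatch(w)]

print(match_prefix(vocabulary, "comput"))
print(match_regex(vocabulary, r"comput(e|er)"))
```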
Relevance feedback
• Query expansion
• Term reweighting
• Pseudo-relevance feedback
• Latent semantic indexing
• Distributional clustering
Document processing
• Lexical analysis
• Stopword elimination
• Stemming
• Index term identification
• Thesauri
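The first four processing steps above can be chained into a small pipeline. The sketch below is an illustration only: the stopword list is a tiny hypothetical one, and `crude_stem` is a placeholder suffix stripper, not Porter's algorithm.

```python
import re

# Tiny hypothetical stopword list; real lists contain hundreds of words.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "are"}

def tokenize(text):
    """Lexical analysis: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Stopword elimination."""
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(token):
    """Placeholder stemmer that strips a few suffixes. NOT Porter's algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Index term identification: tokenize, drop stopwords, then stem."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]

print(index_terms("The indexing of documents and queries"))
```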
Porter’s algorithm
1. The measure, m, of a stem is a function of sequences of vowels followed by a consonant. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form [C](VC)^m[V], where the initial C and final V are optional and m is the number of VC repeats.
   m=0: free, why
   m=1: frees, whose
   m=2: prologue, compute
2. *<X> - stem ends with letter X
3. *v* - stem contains a vowel
4. *d - stem ends in double consonant
5. *o - stem ends with consonant-vowel-consonant sequence where the final consonant is not w, x, or y
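The measure m can be computed by counting how many times a run of vowels is followed by a consonant. The sketch below assumes Porter's usual convention that y counts as a vowel when it follows a consonant (which is why "why" has m=0); the function names are illustrative.

```python
def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u,
    or a y that is preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y at the start, or after a vowel, is treated as a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC repeats m in the form [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_vowel = not cons
    return m

for word in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(word, measure(word))
```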
Porter’s algorithm
• Suffix conditions take the form current_suffix == pattern
• Actions are in the form old_suffix -> new_suffix
• Rules are divided into steps to define the order of applying the rules. The following are some examples of the rules:

STEP  CONDITION            SUFFIX  REPLACEMENT    EXAMPLE
1a    NULL                 sses    ss             stresses -> stress
1b    *v*                  ing     NULL           making -> mak
1b1   NULL                 at      ate            inflat(ed) -> inflate
1c    *v*                  y       i              happy -> happi
2     m>0                  aliti   al             formaliti -> formal
3     m>0                  icate   ic             duplicate -> duplic
4     m>1                  able    NULL           adjustable -> adjust
5a    m>1                  e       NULL           inflate -> inflat
5b    m>1 and *d and *<L>  NULL    single letter  controll -> control
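The sample rules above can be sketched as a small table of (step, condition, suffix, replacement) entries, with the condition tested on the stem that remains once the suffix is removed. This covers only the example rows, not the full algorithm; it ignores that step 1b1 normally runs only after 1b has fired, and `step_5b` assumes Porter's published condition for that rule (m > 1 and a final double l). All function names are illustrative.

```python
def is_consonant(word, i):
    """Consonant test; y after a consonant counts as a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC repeats m in [C](VC)^m[V]."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def has_vowel(stem):
    """Condition *v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

# (step, condition on the remaining stem, suffix, replacement)
RULES = [
    ("1a", lambda s: True, "sses", "ss"),
    ("1b", has_vowel, "ing", ""),
    ("1b1", lambda s: True, "at", "ate"),
    ("1c", has_vowel, "y", "i"),
    ("2", lambda s: measure(s) > 0, "aliti", "al"),
    ("3", lambda s: measure(s) > 0, "icate", "ic"),
    ("4", lambda s: measure(s) > 1, "able", ""),
    ("5a", lambda s: measure(s) > 1, "e", ""),
]

def apply_rule(word, step):
    """Apply the sample rule of one step; return the word unchanged if it does not fire."""
    for name, condition, suffix, replacement in RULES:
        if name == step and word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if condition(stem):
                return stem + replacement
    return word

def step_5b(word):
    """Rule 5b (assumed condition): with m > 1 and a final double l, keep one l."""
    if measure(word) > 1 and word.endswith("ll"):
        return word[:-1]
    return word

print(apply_rule("stresses", "1a"), apply_rule("formaliti", "2"), step_5b("controll"))
```

Keeping the rules as data makes the step ordering explicit: a driver would call `apply_rule` once per step, so at most one rule per step can change the word.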
Porter’s algorithm
Example: the word “duplicatable”
duplicat    rule 4
duplicate   rule 1b1
duplic      rule 3
Another rule in step 4, which would remove “ic,” cannot be applied, since only one rule from each step is allowed to be applied.
Porter’s algorithm
Computable Comput
Intervention Intervent
Retrieval Retriev
Document Docum
Representing Repres
Representative Repres
Relevance feedback
• Automatic
• Manual
• Method: identifying feedback terms
Q’ = a1Q + a2R - a3N
Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|
Example
• Q = “safety minivans”
• D1 = “car safety minivans tests injury statistics” - relevant
• D2 = “liability tests safety” - relevant
• D3 = “car passengers injury reviews” - non-relevant
• R = ?
• N = ?
• Q’ = ?
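One way to work this example through is to treat Q and each document as term-count vectors and to apply Q’ = a1Q + a2R - a3N with a1 = 1, a2 = 1/|R| and a3 = 1/|N|. The sketch below assumes that R is the sum of the relevant document vectors and N the sum of the non-relevant ones, and that terms are weighted by raw counts; other weightings would give different numbers. The function names are illustrative.

```python
from collections import Counter

def vector(text):
    """Term-count vector for a whitespace-tokenized text."""
    return Counter(text.split())

def feedback(query, relevant, nonrelevant, a1=1.0):
    """Q' = a1*Q + a2*R - a3*N with a2 = 1/|R| and a3 = 1/|N|."""
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    new_query = {}
    for term, count in query.items():
        new_query[term] = a1 * count
    for doc in relevant:       # add the averaged relevant documents
        for term, count in doc.items():
            new_query[term] = new_query.get(term, 0.0) + a2 * count
    for doc in nonrelevant:    # subtract the averaged non-relevant documents
        for term, count in doc.items():
            new_query[term] = new_query.get(term, 0.0) - a3 * count
    return new_query

Q = vector("safety minivans")
D1 = vector("car safety minivans tests injury statistics")
D2 = vector("liability tests safety")
D3 = vector("car passengers injury reviews")

Q_new = feedback(Q, relevant=[D1, D2], nonrelevant=[D3])
for term, weight in sorted(Q_new.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:+.2f}")
```

Under these assumptions "safety" receives the largest weight (2.0), "tests" enters the query with weight 1.0 even though it was not in Q, and terms that occur only in D3 ("passengers", "reviews") receive weight -1.0.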
Automatic query expansion
• Thesaurus-based expansion
• Distributional similarity-based expansion
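Thesaurus-based expansion can be sketched as a lookup that appends related words to the query. The thesaurus entries and the `expand_query` helper below are hypothetical; a distributional variant would use the same function with a table derived from corpus co-occurrence statistics instead of a hand-built one.

```python
# Hypothetical hand-built thesaurus entries, for illustration only.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "safety": ["security"],
}

def expand_query(terms, thesaurus, max_per_term=2):
    """Append up to max_per_term related words per query term, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for related in thesaurus.get(term, [])[:max_per_term]:
            if related not in expanded:
                expanded.append(related)
    return expanded

print(expand_query(["car", "safety", "minivans"], THESAURUS))
```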
WordNet and DistSim
wn reason -hypen - hypernyms
wn reason -synsn - synsets
wn reason -simsn - synonyms
wn reason -over - overview of senses
wn reason -famln - familiarity/polysemy
wn reason -grepn - compound nouns
/clair3/tools/relatedwords/relate reason
Related (substitutable) words
WordNet:
Book: publication, product, fact, dramatic composition, record
Computer: machine, expert, calculator, reckoner, figurer
Fruit: reproductive structure, consequence, product, bear
Politician: leader, schemer
Newspaper: press, publisher, product, paper, newsprint
Distributional clustering:
Book: autobiography, essay, biography, memoirs, novels
Computer: adobe, computing, computers, developed, hardware
Fruit: leafy, canned, fruits, flowers, grapes
Politician: activist, campaigner, politicians, intellectuals, journalist
Newspaper: daily, globe, newspapers, newsday, paper
Indexing and searching
Computing term salience
• Term frequency (TF)
• Document frequency (DF)
• Inverse document frequency (IDF)
$$\mathrm{IDF}(w) = \log\left(\frac{N}{\mathrm{DF}(w)}\right)$$
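The three quantities above can be computed directly from a tokenized collection. In this sketch N is taken to be the number of documents in the collection and the logarithm is the natural log; the slide does not state a base, so that choice is an assumption, as are the function names and the toy collection.

```python
import math
from collections import Counter

def term_frequencies(doc_tokens):
    """TF: how often each term occurs in one document."""
    return Counter(doc_tokens)

def document_frequencies(collection):
    """DF: in how many documents each term occurs."""
    df = Counter()
    for tokens in collection:
        df.update(set(tokens))  # count each term once per document
    return df

def inverse_document_frequencies(collection):
    """IDF(w) = log(N / DF(w)), with N the number of documents."""
    n = len(collection)
    return {w: math.log(n / d) for w, d in document_frequencies(collection).items()}

collection = [  # hypothetical toy collection
    "car safety minivans tests injury statistics".split(),
    "liability tests safety".split(),
    "car passengers injury reviews".split(),
]
df = document_frequencies(collection)
idf = inverse_document_frequencies(collection)
print(df["safety"], round(idf["safety"], 3), round(idf["minivans"], 3))
```

A term that appears in every document gets IDF 0, so rare terms such as "minivans" end up weighted more heavily than common ones such as "safety".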
Scripts to compute tf and idf
cd /clair4/class/ir-w03/hw2
./tf.pl 053.txt | sort -nr -k 2 | more
./tfs.pl 053.txt | sort -nr -k 2 | more
./stem.pl reasonableness
./build-idf.pl
./idf.pl | sort -n -k 3 | more
Applications of TFIDF
• Cosine similarity
• Indexing
• Clustering
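As an illustration of the first application, the sketch below builds TF*IDF vectors and compares documents by the cosine of the angle between them. It assumes raw term counts for TF and a natural-log IDF; the toy collection and function names are hypothetical, and practical systems use many weighting variants.

```python
import math
from collections import Counter

def tfidf_vectors(collection):
    """One sparse {term: tf * idf} vector per tokenized document."""
    n = len(collection)
    df = Counter()
    for tokens in collection:
        df.update(set(tokens))
    idf = {w: math.log(n / d) for w, d in df.items()}
    return [{w: tf * idf[w] for w, tf in Counter(tokens).items()}
            for tokens in collection]

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

collection = [  # hypothetical toy collection
    "car safety minivans tests injury statistics".split(),
    "liability tests safety".split(),
    "car passengers injury reviews".split(),
]
vectors = tfidf_vectors(collection)
print(round(cosine(vectors[0], vectors[1]), 3))
print(cosine(vectors[1], vectors[2]))
```

The second and third documents share no terms, so their similarity is exactly zero, while the first two overlap on "tests" and "safety" and score above zero.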