1
CS 430 / INFO 430 Information Retrieval
Lecture 2
Text Based Information Retrieval
2
Course Administration
Web site:
http://www.cs.cornell.edu/courses/cs430/2004fa
Notices:
See the course web site
Sign-up sheet:
If you did not sign up at the first class, please sign up now.
3
Course Administration
Please send all questions about the course to:
The message will be sent to
William Arms
All Teaching Assistants
4
Course Administration
Discussion class, Wednesday, September 1
Upson B17, 7:30 to 8:30 p.m.
Prepare for the class as instructed on the course Web site.
Participation in the discussion classes is one third of the grade, but tomorrow's class will not be included in the grade calculation.
5
Discussion Classes
Format:
Questions.
Ask a member of the class to answer.
Provide opportunity for others to comment.
When answering:
Stand up.
Give your name. Make sure that the TA hears it.
Speak clearly so that all the class can hear.
Suggestions:
Do not be shy at presenting partial answers.
Differing viewpoints are welcome.
6
Information Retrieval from Collections of Textual Documents
Major Categories of Methods
1. Exact matching (Boolean)
2. Ranking by similarity to query (vector space model)
3. Ranking of matches by importance of documents (PageRank)
4. Combination methods
Course begins with Boolean, then similarity methods, then importance methods.
7
Text Based Information Retrieval
Most matching methods are based on Boolean operators.
Most ranking methods are based on the vector space model.
Web search methods combine vector space model with ranking based on importance of documents.
Many practical systems combine features of several approaches.
In the basic form, all approaches treat words as separate tokens with minimal attempt to interpret them linguistically.
8
Documents
A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation.
The individual words and other symbols are known as tokens or terms.
A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.
[Methods of markup, e.g., XML, are covered in CS 431.]
9
Word Frequency
Observation: Some words are more common than others.
Statistics: Most large collections of text documents have similar statistical characteristics. These statistics:
• influence the effectiveness and efficiency of data structures used to index documents
• are relied on by many retrieval models
10
Word Frequency
Example
The following example is taken from:
Jamie Callan, Characteristics of Text, 1997
Sample of 19 million words
The next slide shows the 50 commonest words in rank order (r), with their frequency (f).
11
Word           f    Word          f    Word         f
the      1130021    from      96900    or       54958
of        547311    he        94585    about    53713
to        516635    million   93515    market   52110
a         464736    year      90104    they     51359
in        390819    its       86774    this     50933
and       387703    be        85588    would    50828
that      204351    was       83398    you      49281
for       199340    company   83070    which    48273
is        152483    an        76974    bank     47940
said      148302    has       74405    stock    47401
it        134323    are       74097    trade    47310
on        121173    have      73132    his      47116
by        118863    but       71887    more     46244
as        109135    will      71494    who      42142
at        101779    say       66807    one      41635
mr        101679    new       64456    their    40910
with      101210    share     63925
12
Rank Frequency Distribution
For all the words in a collection of documents, for each word w
f is the frequency that w appears
r is rank of w in order of frequency. (The most commonly occurring word has rank 1, etc.)
[Figure: plot of frequency f against rank r; word w has rank r and frequency f.]
13
Rank Frequency Example
The next slide shows the words in Callan's data normalized. In this example:
r is the rank of word w in the sample.
f is the frequency of word w in the sample.
n is the total number of word occurrences in the sample.
14
Word   rf*1000/n    Word      rf*1000/n    Word     rf*1000/n
the           59    from             92    or             101
of            58    he               95    about          102
to            82    million          98    market         101
a             98    year            100    they           103
in           103    its             100    this           105
and          122    be              104    would          107
that          75    was             105    you            106
for           84    company         109    which          107
is            72    an              105    bank           109
said          78    has             106    stock          110
it            78    are             109    trade          112
on            77    have            112    his            114
by            81    but             114    more           114
as            80    will            117    who            106
at            80    say             113    one            107
mr            86    new             112    their          108
with          91    share           114
15
Zipf's Law
If the words, w, in a collection are ranked, r, by their frequency, f, they roughly fit the relation:
r * f = c
Different collections have different constants c.
In English text, c tends to be about n / 10, where n is the number of word occurrences in the collection.
For a weird but wonderful discussion of this and many other examples of naturally occurring rank frequency distributions, see:
Zipf, G. K., Human Behaviour and the Principle of Least Effort. Addison-Wesley, 1949
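The normalized values in the rank frequency example can be checked with a few lines of Python. This is a minimal sketch that uses the five commonest words from Callan's sample. It assumes n is exactly 19 million, because the slides give only "19 million words", and the function name is illustrative. If r * f = c with c about n / 10, the printed values should lie in the neighborhood of 100.

```python
# Frequencies of the five commonest words in Callan's sample (from the table).
frequencies = {"the": 1130021, "of": 547311, "to": 516635, "a": 464736, "in": 390819}

# Assumption: the slides say "19 million words"; the exact total is not given.
n = 19_000_000


def normalized_rank_frequency(freqs, total):
    """Return {word: r * f * 1000 / total}, ranking words by descending frequency."""
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    return {
        word: rank * f * 1000 / total
        for rank, (word, f) in enumerate(ranked, start=1)
    }


values = normalized_rank_frequency(frequencies, n)
for word, value in values.items():
    print(f"{word:>4} {value:6.1f}")
```

Rounded to whole numbers, these values agree with the first five entries of the normalized table under the stated assumption about n.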
16
Methods that Build on Zipf's Law
Stop lists: Ignore the most frequent words (upper cut-off). Used by almost all systems.
Significant words: Ignore the most frequent and least frequent words (upper and lower cut-off). Rarely used.
Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
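The upper cut-off behind a stop list can be sketched as follows. This is an illustration only: it derives the stop list from word counts in a tiny made-up text, whereas practical systems often use a fixed, curated list. The function names, the sample text, and the cut-off value are all assumptions.

```python
from collections import Counter


def build_stop_list(tokens, upper_cutoff):
    """Return the `upper_cutoff` most frequent tokens as a set."""
    counts = Counter(tokens)
    return {word for word, _ in counts.most_common(upper_cutoff)}


def remove_stop_words(tokens, stop_list):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_list]


# Hypothetical sample text; "the" is its single most frequent token.
text = "the bank said the stock of the bank rose and the market rose"
tokens = text.split()

stop_list = build_stop_list(tokens, upper_cutoff=1)
filtered = remove_stop_words(tokens, stop_list)
print(stop_list)
print(filtered)
```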
17
1. Exact Matching (Boolean Model)
[Diagram labels: Query; Documents; Index database]
Mechanism for determining whether a document matches a query.
Set of hits
18
Evaluation of Matching: Recall and Precision
If information retrieval were perfect ...
Every hit would be relevant to the original query, and every relevant item in the body of information would be found.
Precision: percentage (or fraction) of the hits that are relevant, i.e., the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query.
Recall: percentage (or fraction) of the relevant items that are found by the query, i.e., the extent to which the query found all the items that satisfy the requirement.
19
Recall and Precision with Exact Matching: Example
• Collection of 10,000 documents, 50 on a specific topic
• Ideal search finds these 50 documents and rejects all others
• Actual search identifies 25 documents; 20 are relevant but 5 were on other topics
• Precision: 20/ 25 = 0.8 (80% of hits were relevant)
• Recall: 20/50 = 0.4 (40% of relevant were found)
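The arithmetic in this example can be reproduced with sets of document identifiers. The sketch below is a minimal illustration: the identifiers are hypothetical (the slides give only the counts), and the function assumes that both sets are non-empty.

```python
def precision_recall(hits, relevant):
    """Return (precision, recall) for a set of hits and a set of relevant items.

    Assumes both sets are non-empty.
    """
    hits, relevant = set(hits), set(relevant)
    relevant_hits = hits & relevant
    precision = len(relevant_hits) / len(hits)
    recall = len(relevant_hits) / len(relevant)
    return precision, recall


# Hypothetical identifiers: documents 1..50 are the 50 relevant documents.
relevant = set(range(1, 51))
# The search finds 20 relevant documents and 5 documents on other topics.
hits = set(range(1, 21)) | set(range(101, 106))

precision, recall = precision_recall(hits, relevant)
print(f"precision = {precision:.1f}, recall = {recall:.1f}")
```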
20
Measuring Precision and Recall
Precision is easy to measure:
• A knowledgeable person looks at each document that is identified and decides whether it is relevant.
• In the example, only the 25 documents that are found need to be examined.
Recall is difficult to measure:
• To know all relevant items, a knowledgeable person must go through the entire collection, looking at every object to decide if it fits the criteria.
• In the example, all 10,000 documents must be examined.
21
Query
A query is a string to match against entries in an index. The string may contain:
search terms      computation
operators         computation and parallel
fields            author = Newton
metacharacters    b[aeiou]n*g
(Metacharacters can be used to build regular expressions, which will be covered later in the course.)
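As a rough illustration, the metacharacter example can be tried with Python's re module. This sketch assumes the pattern is read as a regular expression, so that n* means zero or more occurrences of n; a retrieval system might interpret its metacharacters differently. The candidate words are made up.

```python
import re

# The pattern from the slide, read as a Python regular expression.
pattern = re.compile(r"b[aeiou]n*g")

# Hypothetical index entries to test against the pattern.
candidates = ["bag", "bang", "bing", "bunng", "bnng", "being"]

# fullmatch requires the whole entry to match, not just a substring.
matches = [word for word in candidates if pattern.fullmatch(word)]
print(matches)
```

Under this reading, "bnng" fails because the second character is not a vowel, and "being" fails because a second vowel follows the first.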
22
Boolean Queries
Boolean query: two or more search terms, related by logical operators, e.g.,
and, or, not
Examples:
abacus and actor
abacus or actor
(abacus and actor) or (abacus and atoll)
not actor
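These example queries map directly onto set operations on document numbers. The sketch below uses the document numbers for abacus, actor, and atoll from the inverted file example later in this lecture. The size of the collection is an assumption (documents 1 to 34), needed only so that not has a universe to subtract from.

```python
# Document numbers per term, as in the lecture's inverted file example.
postings = {
    "abacus": {3, 19, 22},
    "actor": {2, 19, 29},
    "atoll": {11, 34},
}

# Assumption: the collection consists of documents numbered 1..34.
all_documents = set(range(1, 35))

abacus, actor, atoll = postings["abacus"], postings["actor"], postings["atoll"]

and_result = abacus & actor                      # abacus and actor
or_result = abacus | actor                       # abacus or actor
combined = (abacus & actor) | (abacus & atoll)   # (abacus and actor) or (abacus and atoll)
not_actor = all_documents - actor                # not actor

print(sorted(and_result), sorted(or_result), sorted(combined), len(not_actor))
```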
23
Boolean Diagram
[Venn diagram of two sets, A and B, showing the regions A and B, A or B, and not (A or B).]
24
Adjacent and Near Operators
abacus adj actor
Terms abacus and actor are adjacent to each other as in the string
"abacus actor"
abacus near 4 actor
Terms abacus and actor are near to each other as in the string
"the actor has an abacus"
Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).
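The two operators can be sketched over a tokenized string. The exact semantics vary between systems, so this sketch makes two assumptions: adj requires the second term to follow the first immediately, and near k accepts the terms in either order when their positions differ by at most k. The function names are illustrative.

```python
def positions(tokens, term):
    """Return every index at which `term` occurs in `tokens`."""
    return [i for i, token in enumerate(tokens) if token == term]


def adj(tokens, first, second):
    """True if `second` occurs immediately after `first` (assumed ordered)."""
    second_positions = set(positions(tokens, second))
    return any(p + 1 in second_positions for p in positions(tokens, first))


def near(tokens, a, b, distance):
    """True if `a` and `b` occur within `distance` positions, in either order."""
    return any(
        abs(p - q) <= distance
        for p in positions(tokens, a)
        for q in positions(tokens, b)
    )


print(adj("abacus actor".split(), "abacus", "actor"))
print(near("the actor has an abacus".split(), "abacus", "actor", 4))
```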
25
Evaluation of Boolean Operators
Precedence of operators must be defined:
adj, near    high
and, not
or           low
Example
A and B or C and B
is evaluated as
(A and B) or (C and B)
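Python's own Boolean operators happen to rank and above or, so the example can be checked exhaustively. The sketch below also shows one alternative grouping that gives a different answer, which is why precedence must be defined. Note that Python's not binds more tightly than and, unlike the table above, so this is only an analogy; the function names are illustrative.

```python
from itertools import product


def implicit(a, b, c):
    """The query as written, using Python's built-in precedence."""
    return a and b or c and b


def explicit(a, b, c):
    """The grouping given on the slide."""
    return (a and b) or (c and b)


def alternative(a, b, c):
    """A different grouping, as if `or` were evaluated before `and`."""
    return a and (b or c) and b


truth_values = list(product([False, True], repeat=3))

same = all(implicit(*v) == explicit(*v) for v in truth_values)
differs = [v for v in truth_values if explicit(*v) != alternative(*v)]

print(same)
print(differs)
```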
26
Inverted File
Inverted file:
A list of search terms that are used to index a set of documents. The inverted file is organized for associative look-up, i.e., to answer the question, "In which documents does a specified search term appear?"
In practical applications, the inverted file contains related information, such as the location within the document where the search terms appear.
27
Inverted File -- Basic Concept
Word      Document
abacus    3
          19
          22
actor     2
          19
          29
aspen     5
atoll     11
          34
Stop words are removed before building the index.
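A minimal sketch of building such an inverted file is shown below. The document texts and the stop list are hypothetical, chosen so that the result reproduces the table above. Tokenization is a plain whitespace split, which a real system would replace with something more careful.

```python
from collections import defaultdict

# Hypothetical stop list, applied before indexing.
STOP_WORDS = {"the", "a", "an", "has", "and", "of"}


def build_inverted_file(documents):
    """Map each non-stop word to the sorted list of documents that contain it."""
    inverted = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            if token not in STOP_WORDS:
                inverted[token].add(doc_id)
    return {word: sorted(ids) for word, ids in sorted(inverted.items())}


# Hypothetical document texts keyed by document number.
documents = {
    2: "the actor",
    3: "an abacus",
    5: "the aspen",
    11: "an atoll",
    19: "the actor has an abacus",
    22: "abacus",
    29: "actor",
    34: "atoll",
}

inverted_file = build_inverted_file(documents)
for word, doc_ids in inverted_file.items():
    print(word, doc_ids)
```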
28
Inverted List -- Concept
Inverted List: All the entries in an inverted file that apply to a specific word, e.g.
abacus 3 19 22
Posting: Entry in an inverted list, e.g., there are three postings for "abacus".
29
Evaluating a Boolean Query
To evaluate the and operator, merge the two inverted lists with a logical AND operation.
Example: abacus and actor
Postings for abacus: 3 19 22
Postings for actor: 2 19 29
Document 19 is the only document that contains both terms, "abacus" and "actor".
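One common way to carry out this merge is a two-pointer walk over both lists. The sketch below assumes that each inverted list is sorted by document number, as in the example; the function name is illustrative.

```python
def intersect_postings(left, right):
    """Return the document numbers present in both sorted lists."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            # The document appears in both lists: keep it and advance both.
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1  # advance the list with the smaller document number
        else:
            j += 1
    return result


abacus = [3, 19, 22]
actor = [2, 19, 29]
print(intersect_postings(abacus, actor))
```

Each list is scanned once, so the work grows with the combined length of the two lists rather than with their product.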
30
Enhancements to Inverted Files -- Concept
Location: The inverted file can hold information about the location of each term within the document.
Uses
adjacency and near operators
user interface design -- highlight location of search term
Frequency: The inverted file includes the number of postings for each term.
Uses
term weighting
query processing optimization
31
Inverted File -- Concept (Enhanced)
Word      Postings    Document    Location
abacus    4           3           94
                      19          7
                      19          212
                      22          56
actor     3           2           66
                      19          213
                      29          45
aspen     1           5           43
atoll     3           11          3
                      11          70
                      34          40
32
Evaluating an Adjacency Operation
Example: abacus adj actor
Postings for abacus (document, location): 3 94, 19 7, 19 212, 22 56
Postings for actor (document, location): 2 66, 19 213, 29 45
Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent.
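The adjacency test can be sketched with the postings from this example, each stored as a (document, location) pair. The sketch assumes that locations count token positions, so that two terms are adjacent when the second term's location is exactly one more than the first's in the same document. The function name is illustrative.

```python
# Postings as (document, location) pairs, taken from the example.
abacus = [(3, 94), (19, 7), (19, 212), (22, 56)]
actor = [(2, 66), (19, 213), (29, 45)]


def adjacent(first, second):
    """Return (document, first_location, second_location) for each place where
    the second term directly follows the first in the same document."""
    second_set = set(second)
    return [
        (doc, loc, loc + 1)
        for doc, loc in first
        if (doc, loc + 1) in second_set
    ]


print(adjacent(abacus, actor))
```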