Transcript of CS 430 / INFO 430 Information Retrieval, Lecture 2: Text Based Information Retrieval (32 pages).

Page 1

CS 430 / INFO 430 Information Retrieval

Lecture 2

Text Based Information Retrieval

Page 2

Course Administration

Web site:

http://www.cs.cornell.edu/courses/cs430/2004fa

Notices:

See the course web site

Sign-up sheet:

If you did not sign up at the first class, please sign up now.

Page 3

Course Administration

Please send all questions about the course to:

[email protected]

The message will be sent to:

William Arms

All Teaching Assistants

Page 4

Course Administration

Discussion class, Wednesday, September 1
Upson B17, 7:30 to 8:30 p.m.

Prepare for the class as instructed on the course Web site.

Participation in the discussion classes is one third of the grade, but tomorrow's class will not be included in the grade calculation.

Page 5

Discussion Classes

Format:

Questions.

Ask a member of the class to answer.

Provide opportunity for others to comment.

When answering:

Stand up.

Give your name. Make sure that the TA hears it.

Speak clearly so that all the class can hear.

Suggestions:

Do not be shy about presenting partial answers.

Differing viewpoints are welcome.

Page 6

Information Retrieval from Collections of Textual Documents

Major Categories of Methods

1. Exact matching (Boolean)

2. Ranking by similarity to query (vector space model)

3. Ranking of matches by importance of documents (PageRank)

4. Combination methods

Course begins with Boolean, then similarity methods, then importance methods.

Page 7

Text Based Information Retrieval

Most matching methods are based on Boolean operators.

Most ranking methods are based on the vector space model.

Web search methods combine vector space model with ranking based on importance of documents.

Many practical systems combine features of several approaches.

In the basic form, all approaches treat words as separate tokens with minimal attempt to interpret them linguistically.

Page 8

Documents

A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation.

The individual words and other symbols are known as tokens or terms.

A textual document can be:

• Free text, also known as unstructured text, which is a continuous sequence of tokens.

• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.

[Methods of markup, e.g., XML, are covered in CS 431.]
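As a small illustration of the distinction (the field names below are invented for the example), the same content can be held as free text or as fielded text:

    # Free text: a continuous sequence of tokens.
    free_text = "Isaac Newton Principia Mathematica 1687"

    # Fielded text: the same content broken into sections distinguished by field names.
    fielded_text = {
        "author": "Isaac Newton",
        "title": "Principia Mathematica",
        "year": "1687",
    }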

Page 9

Word Frequency

Observation: Some words are more common than others.

Statistics: Most large collections of text documents have similar statistical characteristics. These statistics:

• influence the effectiveness and efficiency of the data structures used to index documents

• are relied on by many retrieval models

Page 10

Word Frequency

Example

The following example is taken from:

Jamie Callan, Characteristics of Text, 1997

Sample of 19 million words

The next slide shows the 50 commonest words in rank order (r), with their frequency (f).

Page 11

Word         f        Word        f        Word        f

the      1130021      from      96900      or       54958
of        547311      he        94585      about    53713
to        516635      million   93515      market   52110
a         464736      year      90104      they     51359
in        390819      its       86774      this     50933
and       387703      be        85588      would    50828
that      204351      was       83398      you      49281
for       199340      company   83070      which    48273
is        152483      an        76974      bank     47940
said      148302      has       74405      stock    47401
it        134323      are       74097      trade    47310
on        121173      have      73132      his      47116
by        118863      but       71887      more     46244
as        109135      will      71494      who      42142
at        101779      say       66807      one      41635
mr        101679      new       64456      their    40910
with      101210      share     63925

Page 12

Rank Frequency Distribution

For all the words in a collection of documents, for each word w:

f is the frequency with which w appears

r is the rank of w in order of frequency. (The most commonly occurring word has rank 1, etc.)

[Figure: frequency f plotted against rank r; a marked point shows that word w has rank r and frequency f.]

Page 13

Rank Frequency Example

The next slide shows the words in Callan's data normalized. In this example:

r is the rank of word w in the sample.

f is the frequency of word w in the sample.

n is the total number of word occurrences in the sample.

Page 14

Word    r*f*1000/n      Word      r*f*1000/n      Word     r*f*1000/n

the         59          from          92          or          101
of          58          he            95          about       102
to          82          million       98          market      101
a           98          year         100          they        103
in         103          its          100          this        105
and        122          be           104          would       107
that        75          was          105          you         106
for         84          company      109          which       107
is          72          an           105          bank        109
said        78          has          106          stock       110
it          78          are          109          trade       112
on          77          have         112          his         114
by          81          but          114          more        114
as          80          will         117          who         106
at          80          say          113          one         107
mr          86          new          112          their       108
with        91          share        114

Page 15

Zipf's Law

If the words, w, in a collection are ranked, r, by their frequency, f, they roughly fit the relation:

r * f = c

Different collections have different constants c.

In English text, c tends to be about n / 10, where n is the number of word occurrences in the collection.

For a weird but wonderful discussion of this and many other examples of naturally occurring rank frequency distributions, see:

Zipf, G. K., Human Behaviour and the Principle of Least Effort. Addison-Wesley, 1949
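As a quick check of this relation on Callan's sample, the following minimal Python sketch recomputes r * f * 1000 / n for a few of the words in the earlier tables, taking n to be 19 million, the quoted sample size:

    # Rank and frequency pairs taken from the page 11 table; n is the quoted sample size.
    n = 19_000_000
    sample = {"the": (1, 1130021), "of": (2, 547311), "mr": (16, 101679), "share": (34, 63925)}
    for word, (r, f) in sample.items():
        # Prints 59, 58, 86, 114, matching the normalized table on page 14;
        # most words cluster near 100, i.e. r * f is roughly n / 10.
        print(word, round(r * f * 1000 / n))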

Page 16

Methods that Build on Zipf's Law

Stop lists: Ignore the most frequent words (upper cut-off). Used by almost all systems.

Significant words: Ignore the most frequent and least frequent words (upper and lower cut-off). Rarely used.

Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
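A minimal sketch of applying an upper and a lower frequency cut-off to a token stream; the thresholds and the toy tokens below are invented for illustration (a practical stop list is usually a fixed list of common words, but the effect of the upper cut-off is similar):

    from collections import Counter

    def apply_cutoffs(tokens, upper_cutoff, lower_cutoff):
        # Drop tokens whose collection frequency is above the upper cut-off
        # (as a stop list does) or below the lower cut-off (as "significant words" does).
        freq = Counter(tokens)
        return [t for t in tokens if lower_cutoff <= freq[t] <= upper_cutoff]

    tokens = "the cat sat on the mat the cat".split()
    print(apply_cutoffs(tokens, upper_cutoff=2, lower_cutoff=2))   # ['cat', 'cat']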

Page 17

1. Exact Matching (Boolean Model)

[Diagram: a query is matched against an index database built from the documents; a mechanism determines whether a document matches the query and returns a set of hits.]

Page 18

Evaluation of Matching: Recall and Precision

If information retrieval were perfect ...

Every hit would be relevant to the original query, and every relevant item in the body of information would be found.

Precision: percentage (or fraction) of the hits that are relevant, i.e., the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query.

Recall: percentage (or fraction) of the relevant items that are found by the query, i.e., the extent to which the query found all the items that satisfy the requirement.

Page 19

Recall and Precision with Exact Matching: Example

• Collection of 10,000 documents, 50 on a specific topic

• An ideal search finds these 50 documents and rejects all others

• The actual search identifies 25 documents; 20 are relevant, but 5 are on other topics

• Precision: 20/25 = 0.8 (80% of hits were relevant)

• Recall: 20/50 = 0.4 (40% of relevant documents were found)
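The arithmetic in this example can be checked with a short Python sketch. The document ids below are hypothetical; only the counts (25 documents retrieved, 20 relevant hits, 50 relevant documents in total) come from the slide:

    def precision_recall(retrieved, relevant):
        # Compute precision and recall from sets of document ids.
        hits = retrieved & relevant
        return len(hits) / len(retrieved), len(hits) / len(relevant)

    relevant = set(range(50))                           # 50 relevant documents (hypothetical ids)
    retrieved = set(range(20)) | set(range(100, 105))   # 20 relevant hits plus 5 non-relevant hits
    print(precision_recall(retrieved, relevant))        # (0.8, 0.4)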

Page 20

Measuring Precision and Recall

Precision is easy to measure:

• A knowledgeable person looks at each document that is identified and decides whether it is relevant.

• In the example, only the 25 documents that are found need to be examined.

Recall is difficult to measure:

• To know all relevant items, a knowledgeable person must go through the entire collection, looking at every object to decide if it fits the criteria.

• In the example, all 10,000 documents must be examined.

Page 21

Query

A query is a string to match against entries in an index. The string may contain:

search terms: computation

operators: computation and parallel

fields: author = Newton

metacharacters: b[aeiou]n*g

(Metacharacters can be used to build regular expressions, which will be covered later in the course.)
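Regular expressions will be covered later in the course, but as a rough illustration of the metacharacter example above, here is a minimal sketch using Python's re module. Applying fullmatch to individual words is my assumption about how a system might use the pattern:

    import re

    pattern = re.compile(r"b[aeiou]n*g")   # 'b', one vowel, zero or more 'n's, then 'g'
    for word in ["bag", "bang", "bong", "big", "banning", "bg"]:
        # 'bg' and 'banning' do not match; the others do.
        print(word, bool(pattern.fullmatch(word)))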

Page 22

Boolean Queries

Boolean query: two or more search terms, related by logical operators, e.g.,

and, or, not

Examples:

abacus and actor

abacus or actor

(abacus and actor) or (abacus and atoll)

not actor
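In the Boolean model, each search term corresponds to the set of documents that contain it, and the logical operators map directly onto set operations: and is intersection, or is union, not is complement with respect to the whole collection. A minimal Python sketch, using hypothetical document sets (chosen to agree with the inverted file example later in the lecture):

    # Hypothetical sets of documents containing each term.
    docs_with = {
        "abacus": {3, 19, 22},
        "actor": {2, 19, 29},
        "atoll": {11, 34},
    }
    all_docs = set(range(1, 41))   # hypothetical collection of 40 documents

    print(docs_with["abacus"] & docs_with["actor"])   # abacus and actor -> {19}
    print(docs_with["abacus"] | docs_with["actor"])   # abacus or actor
    print((docs_with["abacus"] & docs_with["actor"])
          | (docs_with["abacus"] & docs_with["atoll"]))   # (abacus and actor) or (abacus and atoll)
    print(all_docs - docs_with["actor"])              # not actor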

Page 23

Boolean Diagram

[Venn diagram of two overlapping sets A and B, showing the regions A and B, A or B, and not (A or B).]

Page 24

Adjacent and Near Operators

abacus adj actor

Terms abacus and actor are adjacent to each other as in the string

"abacus actor"

abacus near 4 actor

Terms abacus and actor are near to each other as in the string

"the actor has an abacus"

Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

Page 25

Evaluation of Boolean Operators

Precedence of operators must be defined:

adj, near     (highest)

and, not

or            (lowest)

Example

A and B or C and B

is evaluated as

(A and B) or (C and B)

Page 26

Inverted File

Inverted file:

A list of search terms that are used to index a set of documents. The inverted file is organized for associative look-up, i.e., to answer the question, "In which documents does a specified search term appear?"

In practical applications, the inverted file contains related information, such as the location within the document where the search terms appear.

Page 27

Inverted File -- Basic Concept

Word      Documents

abacus    3, 19, 22
actor     2, 19, 29
aspen     5
atoll     11, 34

Stop words are removed before building the index.
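A minimal Python sketch of building such an index; the documents and the tiny stop list below are invented for illustration and are not from the lecture:

    from collections import defaultdict

    STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in"}   # illustrative stop list

    def build_inverted_file(documents):
        # documents: dict mapping document id -> text.
        # Returns a dict mapping each term to a sorted list of document ids.
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for token in text.lower().split():
                if token not in STOP_WORDS:       # stop words are removed before indexing
                    index[token].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    docs = {3: "the abacus of the trader", 19: "an actor and an abacus", 22: "abacus"}
    print(build_inverted_file(docs))   # {'abacus': [3, 19, 22], 'trader': [3], 'actor': [19]}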

Page 28

Inverted List -- Concept

Inverted List: All the entries in an inverted file that apply to a specific word, e.g.

abacus 3 19 22

Posting: Entry in an inverted list, e.g., there are three postings for "abacus".

Page 29

Evaluating a Boolean Query

Example: abacus and actor

Postings for abacus: 3, 19, 22

Postings for actor: 2, 19, 29

To evaluate the and operator, merge the two inverted lists with a logical AND operation.

Document 19 is the only document that contains both terms, "abacus" and "actor".
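When posting lists are kept in sorted order, the AND merge can be done in a single pass over the two lists. A minimal Python sketch (the function name is mine, not from the lecture):

    def intersect_postings(p1, p2):
        # Merge two sorted posting lists with a logical AND:
        # keep only document ids that appear in both lists.
        i = j = 0
        result = []
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                result.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return result

    abacus = [3, 19, 22]
    actor = [2, 19, 29]
    print(intersect_postings(abacus, actor))   # [19] -- document 19 contains both terms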

Page 30

Enhancements to Inverted Files -- Concept

Location: The inverted file can hold information about the location of each term within the document.

Uses

adjacency and near operators
user interface design -- highlight location of search term

Frequency: The inverted file includes the number of postings for each term.

Uses

term weighting
query processing optimization

Page 31

Inverted File -- Concept (Enhanced)

Word      Postings    Document    Location

abacus       4            3           94
                         19            7
                         19          212
                         22           56
actor        3            2           66
                         19          213
                         29           45
aspen        1            5           43
atoll        3           11            3
                         11           70
                         34           40
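Such an enhanced index can be built by recording, for each token, the document it occurs in and its location within that document. A minimal Python sketch, under the assumption that locations are word offsets (the lecture does not say how locations are measured); the two documents are hypothetical:

    from collections import defaultdict

    def build_positional_index(documents):
        # documents: dict mapping document id -> text.
        # Returns a dict mapping each term to a list of (document id, location) postings.
        index = defaultdict(list)
        for doc_id, text in documents.items():
            for location, token in enumerate(text.lower().split()):
                index[token].append((doc_id, location))
        return dict(index)

    docs = {2: "the actor spoke", 19: "an abacus beside the actor"}
    index = build_positional_index(docs)
    print(index["actor"])        # [(2, 1), (19, 4)]
    print(len(index["actor"]))   # number of postings for the term, usable for term weighting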

Page 32

Evaluating an Adjacency Operation

Example: abacus adj actor

Postings for abacus (document, location): (3, 94), (19, 7), (19, 212), (22, 56)

Postings for actor (document, location): (2, 66), (19, 213), (29, 45)

Document 19, at locations 212 and 213, is the only place where the terms "abacus" and "actor" occur adjacent to each other.
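Given positional postings, the adj operator can be evaluated by checking, for each posting of the first term, whether the second term appears in the same document at the next location. A minimal Python sketch using the posting data from this slide (and assuming locations are consecutive word offsets):

    def adjacent(postings_a, postings_b):
        # Return (document, location_a, location_b) triples where term A
        # appears immediately before term B. Postings are (document, location) pairs.
        positions_b = set(postings_b)
        return [(doc, loc, loc + 1) for doc, loc in postings_a if (doc, loc + 1) in positions_b]

    abacus = [(3, 94), (19, 7), (19, 212), (22, 56)]
    actor = [(2, 66), (19, 213), (29, 45)]
    print(adjacent(abacus, actor))   # [(19, 212, 213)] -- the only adjacent occurrence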