LIS618 lecture 3 Thomas Krichel 2004-02-15. Structure Revision of what was done last week. Theory:...

LIS618 lecture 3 Thomas Krichel 2004-02-15


LIS618 lecture 3

Thomas Krichel

2004-02-15


Structure

• Revision of what was done last week.

• Theory: discussion of the Boolean model

• Theory: the vector model

• Practice: introducing Nexis

• More Nexis next week


advantages of Boolean model

• supposedly easy to grasp by the user

• precise semantics of queries

• implemented in the majority of commercial systems


problems of Boolean model

• sharp distinction between relevant and irrelevant documents

• no ranking possible

• users find it difficult to formulate Boolean queries

• users find it difficult to resolve Boolean queries
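The Boolean model can be illustrated with plain set operations. The following is a minimal sketch, assuming a made-up collection of three documents (the terms and document numbers are hypothetical, not taken from any real database). It shows that a document either matches or does not, so the result carries no ranking.

```python
# Toy collection (hypothetical): document id -> set of index terms.
docs = {
    1: {"boolean", "model", "query"},
    2: {"vector", "model", "weight"},
    3: {"boolean", "vector", "ranking"},
}

def having(term):
    """Return the set of document ids whose index contains the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# boolean AND model: intersection of the two sets
both = having("boolean") & having("model")
# boolean OR vector: union of the two sets
either = having("boolean") | having("vector")
# model AND NOT vector: set difference
without = having("model") - having("vector")

# The results are sets: a document is either in or out, with no order.
print(sorted(both), sorted(either), sorted(without))
```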


vector model

• associates weights with each index term appearing in the query and in each database document.

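• relevance can be calculated as the cosine of the angle between the two vectors, i.e. their inner (dot) product divided by the product of their lengths, where the length of a vector is the square root of the sum of its squared weights. With non-negative weights this measure varies between 0 and 1.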
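As a sketch of the cosine measure: the cosine is the inner (dot) product of the query vector and the document vector, divided by the product of their lengths. The representation of vectors as Python dictionaries from term to weight, and the sample weights, are assumptions made for illustration only.

```python
import math

def cosine(query, document):
    """Cosine of the angle between two term-weight vectors given as dicts."""
    # Inner product: only terms present in both vectors contribute.
    dot = sum(w * document.get(term, 0.0) for term, w in query.items())
    # Length of each vector: square root of the sum of squared weights.
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_q * norm_d)

query = {"a": 1.0, "b": 1.0}
same_direction = {"a": 2.0, "b": 2.0}   # same term proportions as the query
no_overlap = {"c": 3.0}                 # shares no term with the query

print(cosine(query, same_direction))
print(cosine(query, no_overlap))
```

A document whose weights are proportional to the query's scores close to 1, and a document sharing no term with the query scores 0.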


tf/idf

• stands for term frequency / inverse document frequency

• This refers to a technique that gives a term a high weight in a document if
  – the term appears frequently in the document
  – the term does not appear frequently in other documents

• We will look at each component one at a time.


absolute & maximum term frequency

• Let F_t_d be the number of times term t appears in the document d. This is its absolute term frequency in the document.

• Let m_d be the maximum absolute term frequency achieved by any term in document d. Examples
  – Document 1: a b a a b c c d
    m_1 = 3, because "a" appears 3 times
  – Document 2: a b a f f f e d f a a
    m_2 = 4, because "a" and "f" each appear 4 times


relative document term frequency

• The relative term frequency f_t_d is given by

f_t_d = F_t_d / m_d

that is, the absolute term frequency of term t in document d divided by the maximum absolute term frequency of document d.

• This completes the "term frequency" part of the tf/idf formula.

• Let us look at this part through an example.


main example, part I

• Consider three documents
  – 1: a b c a f o n l p o f t y x
  – 2: a m o e e e n n n a n p l
  – 3: r a e e f n l i f f f f x l

• First, look at the maximum frequency achieved by any term in a given document.

m_1 = 2 ("a", "f" and "o" are there twice)

m_2 = 4 ("n" is there four times)

m_3 = 5 ("f" is there five times)


main example part II

• Now look at some examples of absolute term frequency

F_a_1 = 2

F_e_2 = 3

F_x_3 = 1

• and some examples of relative term frequency

f_a_1 = F_a_1 / m_1 = 2 / 2 = 1

f_e_2 = F_e_2 / m_2 = 3 / 4 = 0.75

f_x_3 = F_x_3 / m_3 = 1 / 5 = 0.2
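The term frequency figures of the main example can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and each letter is treated as one term.

```python
from collections import Counter

# The three documents of the main example, one letter per term.
documents = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}

def absolute_tf(term, doc_id):
    """F_t_d: number of times the term appears in the document."""
    return Counter(documents[doc_id])[term]

def max_tf(doc_id):
    """m_d: highest absolute frequency reached by any term in the document."""
    return max(Counter(documents[doc_id]).values())

def relative_tf(term, doc_id):
    """f_t_d = F_t_d / m_d"""
    return absolute_tf(term, doc_id) / max_tf(doc_id)

for term, doc_id in [("a", 1), ("e", 2), ("x", 3)]:
    print(term, doc_id, absolute_tf(term, doc_id), max_tf(doc_id),
          relative_tf(term, doc_id))
```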


inverse document frequency

• Let N be the number of documents in the database. N = 3 in our example.

• Let n_t be the number of documents where the term t appears. In our example n_a = 3, n_e = 2 and n_x = 2.

• N/n_t is an indication of the inverse document frequency of a term. The fewer documents in the database contain the term, the larger it is.
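As a sketch, the document counts n_t and the ratio N/n_t for the main example can be computed as follows. The function name is hypothetical.

```python
# The three documents of the main example, one letter per term.
documents = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}

N = len(documents)  # number of documents in the database

def document_frequency(term):
    """n_t: number of documents in which the term appears at least once."""
    return sum(1 for words in documents.values() if term in words)

for term in ["a", "e", "x"]:
    n_t = document_frequency(term)
    print(term, n_t, N / n_t)
```

A term found in every document, such as "a", gets the smallest possible ratio, N/n_t = 1.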


intermezzo: the logarithm

• The logarithm, written log(), is a mathematical function. You should know that
  – log() is an increasing function, i.e. the bigger x is, the bigger log(x) is.
  – log(1) = 0
  – log(x) > 0 if x > 1

• Your calculator will tell you what the logarithm of a number is.


tf/idf formula

• Term frequency and inverse document frequency have to be combined.

• The final formula for the weight combines the terms as follows

w_t_d = f_t_d * log( N / n_t )


main example part III

N = 3

w_a_1 = 1 * log(3/3) = log(1) = 0 !

w_e_2 = 0.75 * log(3/2)

w_x_3 = 0.2 * log(3/2)

where log(3/2) = 0.176, approximately, using the base-10 logarithm. This gives w_e_2 = 0.132 and w_x_3 = 0.035, approximately.
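The weights of the main example can be reproduced with a short sketch. It assumes the base-10 logarithm, which is the one that gives log(3/2) = 0.176. The function name is made up for illustration.

```python
import math
from collections import Counter

# The three documents of the main example, one letter per term.
documents = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}
N = len(documents)

def weight(term, doc_id):
    """w_t_d = f_t_d * log(N / n_t), with log taken to base 10."""
    counts = Counter(documents[doc_id])
    n_t = sum(1 for words in documents.values() if term in words)
    if n_t == 0:
        return 0.0  # the term appears nowhere, so it carries no weight
    f_t_d = counts[term] / max(counts.values())
    return f_t_d * math.log10(N / n_t)

print(round(weight("a", 1), 3))  # 0.0
print(round(weight("e", 2), 3))  # 0.132
print(round(weight("x", 3), 3))  # 0.035
```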


practical operation

• The computer will search the documents for the query term and return the documents where the weight of the term in the index for that document is strictly positive, ordered by weight from highest to lowest.

• If there are several query terms the computer will perform a more complicated operation that we will not further study here, so we limit ourselves to the case of one query term.
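The one-term case can be sketched as follows, again on the main example and assuming the base-10 logarithm. The function names are hypothetical. The query terms "x" and "n" are chosen here so that the practical tests remain open.

```python
import math
from collections import Counter

# The three documents of the main example, one letter per term.
documents = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}
N = len(documents)

def weight(term, doc_id):
    """w_t_d = f_t_d * log(N / n_t), with log taken to base 10."""
    counts = Counter(documents[doc_id])
    n_t = sum(1 for words in documents.values() if term in words)
    if n_t == 0:
        return 0.0
    return counts[term] / max(counts.values()) * math.log10(N / n_t)

def search(term):
    """Documents with strictly positive weight, highest weight first."""
    scored = [(weight(term, doc_id), doc_id) for doc_id in documents]
    hits = [(w, doc_id) for w, doc_id in scored if w > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for w, doc_id in hits]

print(search("x"))  # [1, 3]
print(search("n"))  # []
```

The term "n" appears in every document, so log(N/n_t) = log(1) = 0 and no document has a strictly positive weight.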


practical tests

• You ask the computer to query the term "a" in our example. What documents are being returned?
  – Compare with the result of the Boolean model.

• You ask the computer to query the term "e". What documents are being returned, and in what order?


advantages of vector model

• term weighting improves performance

• sorting is possible

• easy to compute, therefore fast

• results are difficult to improve without
  – query expansion
  – a user feedback cycle


Lexis/Nexis

• Lexis is a specialized legal research service

• Nexis is primarily a news service

• adds an important temporal component to all its contents

• restricts contents as compared to Dialog

• potentially bad competition from Google

• lives at http://www.nexis.com


compilation of Nexis

• Uses a number of news sources such as newspapers.

• Uses company reports databases

• Uses web sites, the URLs of which are found in the news sources. Some of the material there can be of low value (remember the comments in the first lecture)


SmartIndexing

• There is a controlled vocabulary of indexing terms

• A document is indexed
  – In full text view (except web sites)
  – With automatic addition of index terms that correspond to the document.
    • Index terms are added
    • Weight of index terms is calculated

• http://www.lexis-nexis.com/infopro/products/index/ has more on it.


equivalents

• Nexis has a number of "equivalents" where, depending on sources, it replaces one with the other. Contrary to their claims, they also work in quick search.

• First (second, third, etc.) = 1st (2nd, 3rd, etc.)

• Monday (all days except Sunday) = Mon (Tues, Weds, etc.)

• January (abbreviations work) = Jan (Feb, Mar, etc.)

• One (all numbers < 20) = 1 (2, 3, etc.)

• and = &

• company = co

• corporation = corp

• incorporated = inc
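One way to picture equivalents is as groups of interchangeable forms: a search for any member of a group also matches the other members. The sketch below is purely illustrative. It lists only one representative pair per row above, and it says nothing about how Nexis actually implements the feature.

```python
# Hypothetical equivalence groups, one representative pair per row above.
EQUIVALENTS = [
    {"first", "1st"},
    {"monday", "mon"},
    {"january", "jan"},
    {"one", "1"},
    {"and", "&"},
    {"company", "co"},
    {"corporation", "corp"},
    {"incorporated", "inc"},
]

def expand(term):
    """Return the term together with its equivalent forms, lower-cased."""
    term = term.lower()
    for group in EQUIVALENTS:
        if term in group:
            return set(group)
    return {term}  # no known equivalent: the term stands alone

print(sorted(expand("Company")))
print(sorted(expand("news")))
```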


Six interfaces to Nexis

• Quick search

• Subject directory

• Power search

• Personal news

• Search forms

• Real time news

• In the remainder of the lecture I will go through some of these


Quick search

• Implicit OR between terms

• Use quotes to require adjacency of terms

• You can select from a drop-down box of sources

• You can set the date range, though it is unclear what it means

• It seems to OR a plural to your search term.

• Sometimes returns documents with none of the search terms. Example: “she is the one”


Quick search

• It is not clear what parts of documents are being searched

• Apparently it does not search the full text.

• But it seems to prioritize
  – TERM, i.e. smart keywords extracted
  – HLEAD for news
  – TITLE for legal documents
  – WEB-SEARCH-TEXT for web pages


relevance ranking concerns

• where terms appear within the document

• how many occurrences of the terms appear in the document

• how often those search terms appear throughout the document

• apparently not how much they occur; for example, search for "the" or "the the"

• it seems that they keep the algorithm a secret


Subject directory

• you can follow the subject tree but

• there seems to be only a tiny number of documents

• categories are not particularly deep or developed

• there is a "more like this" feature of limited use, Thomas finds


Power search

• You can first create a customized set of sources to search

• Do this at the start: browse a menu, then click “done, search now”

• This is a lot more efficient than trying to build a search strategy on a large set.


power search truncation

• * represents a single character, present or absent
  – wom*n
  – labo*r

• ! truncates to the end of the word
  – bookk!
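To make the two truncation symbols concrete, here is a sketch that translates them into Python regular expressions. This is an illustration of the rules as stated above, not a description of how Nexis works internally. The function names are made up, and treating a "character" as a letter or digit and ignoring case are both assumptions.

```python
import re

def to_regex(pattern):
    """Translate a truncation pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w?")   # one character, present or absent
        elif ch == "!":
            parts.append(r"\w*")   # any number of characters to the word end
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern, word):
    """True if the whole word fits the truncation pattern."""
    return to_regex(pattern).fullmatch(word) is not None

for pattern, word in [("wom*n", "woman"), ("wom*n", "women"),
                      ("labo*r", "labor"), ("labo*r", "labour"),
                      ("bookk!", "bookkeeper"), ("bookk!", "book")]:
    print(pattern, word, matches(pattern, word))
```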


Power search connectors

• OR

• AND

• AND NOT

• PRE/n, n is a number, ordered proximity

• W/n, n is a number, unordered proximity

• W/S words in the same sentence

• W/P words in the same paragraph

• Use parentheses!

• There is no implicit OR as in quick search, so forget about the double quotes.
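The difference between ordered and unordered proximity can be sketched on a list of words. The code below is an illustration only: the function names and the sample sentence are made up, and counting distance as the difference between word positions is an assumption about how n is measured.

```python
def positions(words, term):
    """Indexes at which the term occurs in the list of words."""
    return [i for i, w in enumerate(words) if w == term]

def within(words, first, second, n):
    """Sketch of W/n: the two terms occur within n words, in either order."""
    return any(i != j and abs(i - j) <= n
               for i in positions(words, first)
               for j in positions(words, second))

def pre(words, first, second, n):
    """Sketch of PRE/n: first occurs before second, at most n words apart."""
    return any(0 < j - i <= n
               for i in positions(words, first)
               for j in positions(words, second))

# Made-up sample sentence, already split into lower-case words.
text = "the bank raised interest rates after the central bank meeting".split()

print(within(text, "rates", "bank", 3))  # unordered: "bank" is 3 words before
print(pre(text, "rates", "bank", 3))     # ordered: next "bank" is 4 words after
print(pre(text, "bank", "rates", 3))
```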


Power search expressions

• Parentheses group terms together

• * for one or no letter

• ! for any number of letters

• ATLEAST n (term), where n is a minimum number of occurrences

• PLURAL (term) only the plural of term

• SINGULAR (term) only the singular of term

• ALLCAPS (term) only capitals of term

• NOCAPS (term) no capitals of term

• CAPS (term) capitalized term only
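Three of these expressions can be pictured as simple tests on the words of a document. This is a sketch under stated assumptions, not the Nexis implementation: the function names are made up, ATLEAST is assumed to count occurrences regardless of case, and the sample sentence is invented.

```python
def atleast(n, term, words):
    """Sketch of ATLEAST n (term): the term occurs at least n times.
    Counting regardless of case is an assumption."""
    return sum(1 for w in words if w.lower() == term.lower()) >= n

def allcaps(term, words):
    """Sketch of ALLCAPS (term): the term occurs entirely in capitals."""
    return term.upper() in words

def nocaps(term, words):
    """Sketch of NOCAPS (term): the term occurs entirely in lower case."""
    return term.lower() in words

# Made-up sample sentence, split into words.
words = "Aid agencies say AIDS funding aids research on AIDS".split()

print(atleast(2, "aids", words))
print(allcaps("aids", words), allcaps("aid", words))
print(nocaps("aids", words), nocaps("agencies", words))
```

The capitalisation tests matter for terms like "AIDS" and "aids", where the spelling is the same and only the case separates two meanings.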


http://openlib.org/home/krichel

Thank you for your attention!