Boolean Retrieval - Simon Fraser University slides/L16 - Boolean retrieval.pdf · J. Pei:...


Page 1: Boolean Retrieval - Simon Fraser University slides/L16 - Boolean retrieval.pdf · J. Pei: Information Retrieval and Web Search -- Boolean Retrieval 10 The Boolean Retrieval Model

Boolean Retrieval


Information Needs and Queries

•  “What are the courses at SFU talking about document indexes?”
    –  Issue a query “course, SFU, document indexes” to a search engine
•  Information need: the topic about which the user desires to know more
    –  Unfortunately, it often cannot be fed directly into a search engine
•  Query: what the user conveys to the computer in an attempt to communicate the information need
    –  Multiple queries may be formed to capture the same information need
    –  A query may not capture the information need sufficiently


Relevance

•  Answers to a query may not all be relevant to the information need

•  A document is relevant if it is one that the user perceives as containing information of value with respect to their information need

•  How good are the returned answers?
    –  Precision: the percentage of the returned results that are relevant to the information need
    –  Recall: the percentage of the relevant documents in the collection that are returned
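The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming documents are identified by integer ids; the function name `precision_recall` and the example ids are made up for this sketch.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) from two collections of document ids."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # returned documents that are also relevant
    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 5 relevant overall, 2 in common.
p, r = precision_recall(returned=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```

Returning every document in the collection would push recall to 1.0 while precision drops to the share of relevant documents in the whole collection.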


Precision and Recall

•  Only return the exactly matched results? High precision, low recall
•  Return all documents? 100% recall, low precision
•  More often than not, we have to strike a balance between precision and recall
•  Classroom discussion: for web search, which one is more important, precision or recall? Why?
    –  Can you give an application example where 100% recall is required but precision can be traded off?


Query Answering

•  Which plays of Shakespeare contain the words “Brutus” and “Caesar” but not “Calpurnia”?
•  Scan “Shakespeare’s Collected Works” once, less than 1 million words
    –  Grepping: named after the UNIX command grep
•  Is a linear scan adequate in all situations?
    –  What if we have to search a large collection (e.g., the web) which contains billions or trillions of words?
    –  How can we search for plays which contain “Brutus” and “Caesar” in the same sentence?
    –  How can we rank the answers in descending order of relevance?
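A linear scan of the kind described above might look like the following sketch. It assumes the collection fits in memory as a dictionary from title to text; the function name `linear_scan` and the three toy "plays" are hypothetical stand-ins, not Shakespeare's actual text.

```python
import re


def linear_scan(plays, must_have, must_not_have):
    """Return titles whose text has every word in must_have and none in must_not_have."""
    hits = []
    for title, text in plays.items():
        # Every query re-reads every document: cost grows with collection size.
        words = set(re.findall(r"[a-z]+", text.lower()))
        if all(w in words for w in must_have) and not any(w in words for w in must_not_have):
            hits.append(title)
    return hits


# Toy collection, made up for illustration.
plays = {
    "Play A": "Brutus speaks to Caesar",
    "Play B": "Caesar and Calpurnia and Brutus",
    "Play C": "Caesar alone",
}
result = linear_scan(plays, ["brutus", "caesar"], ["calpurnia"])
print(result)  # ['Play A']
```

The scan answers the yes/no question per document, but it does nothing for same-sentence matching or ranking, which is what the questions above point at.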


Incidence Matrices

•  Two dimensional: documents and terms
•  Cell M(t, d) = 1 if term t appears in document d, and 0 otherwise
[Figure: term-document incidence matrix; rows are terms, columns are documents]


Term and Document Vectors

[Figure: incidence matrix (rows: terms, columns: documents); a row is a term vector, a column is a document vector]
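As a small illustration of the incidence matrix and its two kinds of vectors, the sketch below builds M from a toy collection. The three documents and all variable names are assumptions made for this example.

```python
# Toy collection; document contents are made up for illustration.
docs = [
    "caesar brutus",         # document 0
    "caesar calpurnia",      # document 1
    "brutus antony caesar",  # document 2
]

# Rows of the matrix: the distinct terms, in sorted order.
terms = sorted({t for d in docs for t in d.split()})

# M[t][d] = 1 if term t appears in document d, else 0.
M = {t: [1 if t in d.split() else 0 for d in docs] for t in terms}

term_vector = M["brutus"]              # one row: which documents contain "brutus"
doc_vector = [M[t][0] for t in terms]  # one column: which terms document 0 contains

print(terms)        # ['antony', 'brutus', 'caesar', 'calpurnia']
print(term_vector)  # [1, 0, 1]
print(doc_vector)   # [0, 1, 1, 0]
```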


Query Answering

•  Query: Brutus AND Caesar AND NOT Calpurnia
•  V_Calpurnia = 010000, NOT V_Calpurnia = 101111
•  V_Brutus AND V_Caesar AND NOT V_Calpurnia = 110100 AND 110111 AND 101111 = 100100
[Figure: term-document incidence matrix]
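The bit-vector computation above can be checked with Python integers. This is a sketch that assumes the six-document collection of the example, with the leftmost bit standing for the first document; the variable names are chosen for illustration.

```python
N_DOCS = 6
MASK = (1 << N_DOCS) - 1  # 0b111111, keeps NOT within six bits

v_brutus = 0b110100
v_caesar = 0b110111
v_calpurnia = 0b010000

# Python's ~ yields a negative number, so mask it back to N_DOCS bits.
not_calpurnia = ~v_calpurnia & MASK
answer = v_brutus & v_caesar & not_calpurnia

bits = format(answer, "06b")
matching = [i for i, b in enumerate(bits) if b == "1"]  # 0-based positions from the left

print(format(not_calpurnia, "06b"))  # 101111
print(bits)                          # 100100
print(matching)                      # [0, 3]
```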


Query Results

•  Using the term vectors, we can only find whether the documents meet the query, but cannot find which parts of the documents meet the query


The Boolean Retrieval Model

•  We can pose any query which is in the form of a Boolean expression of terms, i.e., in which terms are combined with the operators AND, OR, and NOT
    –  Each document is modeled as a set of words

•  Ad hoc retrieval: retrieve documents that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query
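Because each document is modeled as a set of words, the three operators map directly onto set operations over document ids. The sketch below assumes a tiny made-up collection; `having` is a hypothetical helper, not part of any library.

```python
# Each document is a set of words (toy data for illustration).
docs = {
    1: {"brutus", "caesar"},
    2: {"caesar", "calpurnia"},
    3: {"antony", "brutus"},
}
all_ids = set(docs)


def having(term):
    """Ids of the documents whose word set contains the term."""
    return {doc_id for doc_id, words in docs.items() if term in words}


and_result = having("brutus") & having("caesar")    # brutus AND caesar
or_result = having("brutus") | having("calpurnia")  # brutus OR calpurnia
not_result = all_ids - having("caesar")             # NOT caesar

print(and_result, or_result, not_result)  # {1} {1, 2, 3} {3}
```

Note that NOT is evaluated against the whole collection, so it needs the full set of document ids.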


Compressing Incidence Matrices

•  Suppose there are 1 million documents, each of about 1,000 words, and there are 500,000 distinct terms
    –  The incidence matrix has 500,000 rows and 1 million columns = 500 billion cells, too big to fit into main memory
•  The matrix has no more than 1,000 x 1 million = 1 billion 1’s, so at least 99.8% of the cells are zero
    –  We can save a lot of space if we store only the positions of the 1’s
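The arithmetic behind these figures can be reproduced directly; the variable names below are chosen for this sketch.

```python
n_docs = 1_000_000
words_per_doc = 1_000
n_terms = 500_000

cells = n_terms * n_docs            # size of the full incidence matrix
max_ones = words_per_doc * n_docs   # upper bound: every word in a document is a distinct term
zero_fraction = 1 - max_ones / cells

print(f"{cells:,}")            # 500,000,000,000
print(f"{max_ones:,}")         # 1,000,000,000
print(f"{zero_fraction:.1%}")  # 99.8%
```

The bound on the number of 1’s is loose, since repeated words within a document contribute only one 1, so the true matrix is even sparser.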


Inverted Indexes (Files)

Inverted lists


Building an Inverted Index

•  Sorting according to document-ids
•  Instances of the same term are grouped and split into a dictionary and postings
    –  Can use either singly linked lists or variable length arrays
•  The most efficient index for ad hoc search
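A minimal sketch of the construction follows. It assumes whitespace tokenization and lowercasing, and it sorts the (term, document-id) pairs by term and then by document id so that each postings list comes out ordered by document id; the function name and the toy documents are made up for illustration, and Python lists play the role of variable length arrays.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of the ids of documents containing it."""
    # 1. Collect one (term, doc_id) pair per token.
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.lower().split()]
    # 2. Drop duplicate pairs, then sort by term and, within a term, by doc_id.
    pairs = sorted(set(pairs))
    # 3. Group instances of the same term: dictionary key -> postings list.
    index = defaultdict(list)
    for term, doc_id in pairs:
        index[term].append(doc_id)
    return dict(index)


# Toy documents, made up for illustration.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar married Calpurnia",
    3: "Brutus and Caesar",
}
index = build_inverted_index(docs)
print(index["brutus"])  # [1, 3]
print(index["caesar"])  # [1, 2, 3]
```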


Processing Boolean Queries

•  Query: “Brutus AND Calpurnia”
•  Steps
    –  Locate Brutus in the dictionary, retrieve its postings
    –  Locate Calpurnia in the dictionary, retrieve its postings
    –  Intersect the two postings lists


Intersection of Two Postings Lists

Similar to the merge step of merge sort
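A two-pointer intersection in the style of a merge step can be sketched as below. It assumes both postings lists are sorted by document id and contain no duplicates; the function name `intersect` and the toy postings are illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by document id."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same document in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1 is behind, move it forward
        else:
            j += 1  # p2 is behind, move it forward
    return answer


# Toy postings lists, made up for illustration.
brutus = [1, 2, 4, 11, 31, 45]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each pointer only moves forward, so the work is at most linear in the combined length of the two lists.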


Conjunctive Queries of > 2 Terms
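One common way to handle a conjunctive query of more than two terms, assumed here rather than taken from the slide, is to intersect the postings lists pairwise, starting with the shortest lists so that intermediate results stay small. The helper names and the toy index below are hypothetical.

```python
def intersect(p1, p2):
    """Two-pointer intersection of two postings lists sorted by document id."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(index, terms):
    """Answer t1 AND t2 AND ... by intersecting in order of increasing list length."""
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, postings)
    return result


# Toy index, made up for illustration.
index = {
    "brutus": [1, 2, 4, 11, 31],
    "caesar": [1, 2, 4, 5, 6, 16, 31, 57],
    "calpurnia": [2, 31, 54],
}
print(intersect_many(index, ["brutus", "caesar", "calpurnia"]))  # [2, 31]
```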


Classroom Discussion

•  Why don’t we use a multi-way, merge-sort-like method to answer a conjunctive query of more than 2 terms?


Beyond the Boolean Model

•  Ranked retrieval models and free text queries
    –  A query is one or more words
    –  The system decides which documents best satisfy the query and ranks them
•  Boolean queries are precise and give more control to users


Summary

•  Information need and queries
•  Boolean retrieval model
•  Inverted index for ad hoc Boolean queries
    –  Structure
    –  Construction algorithm
    –  Query answering algorithm


To-do List

•  Read Chapter 7.1 in the textbook