CSA3080: Adaptive Hypertext Systems I

CSA3080: Adaptive Hypertext Systems I
Lecture 5: Information Retrieval I
Dr. Christopher Staff, Department of Computer Science & AI, University of Malta
© 2003- Chris Staff

Page 1: CSA3080: Adaptive Hypertext Systems I

CSA3080: Adaptive Hypertext Systems I

Dr. Christopher Staff
Department of Computer Science & AI

University of Malta

Lecture 5: Information Retrieval I

Page 2: CSA3080: Adaptive Hypertext Systems I

Aims and Objectives

• Aims and objectives of IR

• Boolean, Extended Boolean, Statistical Models

Page 3: CSA3080: Adaptive Hypertext Systems I

Aims and Objectives

• You should end up knowing the major differences between the simple matching algorithms

• And what each algorithm considers to be a relevant document…

• Bear in mind that we will use IR in AHS to find information relevant to our user so that we can present it/lead the user to it…

Page 4: CSA3080: Adaptive Hypertext Systems I

Aims and Objectives of IR

• To facilitate the identification and retrieval of documents that contain information relevant to an information need expressed by a user

• We are particularly interested in the retrieval of information from unstructured data

Page 5: CSA3080: Adaptive Hypertext Systems I

Boolean Information Retrieval

• Developed in the 1950s

• A document is represented by a collection of terms that occur in the document (index)

• The set of unique terms occurring in the collection is called the vocabulary

• A document is represented by a bit sequence, with a 1 representing a term that is present and 0 otherwise
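
The representation above can be sketched in a few lines of Python. This is only an illustration: the three-document collection, the whitespace tokenisation, and the names `docs`, `vocabulary` and `bit_vector` are assumptions, not part of the lecture.

```python
# Hypothetical three-document collection (not from the lecture).
docs = {
    "d1": "adaptive hypertext systems",
    "d2": "information retrieval systems",
    "d3": "adaptive information retrieval",
}

# The vocabulary is the set of unique terms in the collection,
# sorted so that every term has a fixed bit position.
vocabulary = sorted({term for text in docs.values() for term in text.split()})

def bit_vector(text, vocabulary):
    """Return one bit per vocabulary term: 1 if the term occurs in text, else 0."""
    terms = set(text.split())
    return [1 if term in terms else 0 for term in vocabulary]

vectors = {doc_id: bit_vector(text, vocabulary) for doc_id, text in docs.items()}
```

Each vector has one position per vocabulary term, so documents of any length map to bit sequences of the same length.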

Page 6: CSA3080: Adaptive Hypertext Systems I

Boolean Information Retrieval

• How is the query expressed?
  – User thinks of terms that describe an information need
  – Formalises query as a boolean expression
  – (Term27 OR Term46) NOT (Term30 AND Term16)
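
As a rough sketch, the example query can be evaluated against a single document represented as a set of terms. The function name `matches` is hypothetical, and reading the binary NOT as "AND NOT" is an assumption about the intended query semantics.

```python
def matches(doc_terms):
    """Evaluate (Term27 OR Term46) NOT (Term30 AND Term16) for one document.

    Assumption: the binary NOT is read as AND NOT.
    """
    left = "Term27" in doc_terms or "Term46" in doc_terms
    right = "Term30" in doc_terms and "Term16" in doc_terms
    return left and not right
```

Under this reading, a document containing Term27 and Term30 matches, while one containing Term46, Term30 and Term16 does not.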

Page 7: CSA3080: Adaptive Hypertext Systems I

Boolean Information Retrieval

• How does the matching algorithm work?
  – Each term in the vocabulary has a set (or postings list) of documents that contain the term
  – For each term in the query, the postings lists are retrieved
  – Set operations (union/intersection/difference) are applied to the postings lists
  – All documents in the results set are returned
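
The matching steps above might look like this in Python, using built-in sets as postings lists. The four-document collection and the names `build_postings` and `result` are illustrative assumptions, and NOT is again read as set difference.

```python
from collections import defaultdict

# Hypothetical collection: document id -> text (not from the lecture).
docs = {
    1: "term27 term30",
    2: "term46 term30 term16",
    3: "term27 term46",
    4: "term16",
}

def build_postings(docs):
    """Map each vocabulary term to the set of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            postings[term].add(doc_id)
    return postings

postings = build_postings(docs)

# (term27 OR term46) NOT (term30 AND term16):
# OR is set union, AND is set intersection, NOT is set difference.
result = (postings["term27"] | postings["term46"]) - (
    postings["term30"] & postings["term16"]
)
```

Every document in `result` is returned, and nothing in the result distinguishes a better match from a worse one.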

Page 8: CSA3080: Adaptive Hypertext Systems I

Boolean Information Retrieval

Page 9: CSA3080: Adaptive Hypertext Systems I

Questions Arising…

• Is this really information retrieval?
  – Just because a document contains term x, does it mean that the document is about term x?

• What about concepts?
  – What makes it possible for us to know that a fish cake is not a dessert? That “she is the apple of my eye” does not make her a piece of fruit?

Page 10: CSA3080: Adaptive Hypertext Systems I

Questions Arising…

• Can we rank the results of a boolean query?
  – All we are doing is checking the presence and absence of terms
  – On what grounds would we rank?

• And doesn’t it look suspiciously like RDBMS/SQL???

Page 11: CSA3080: Adaptive Hypertext Systems I

Does Boolean IR work?

• BIR works, and works well, when the vocabulary is reasonably small…

• … when there is no ambiguity in the meaning of terms

• … when the presence of a term in a document is significant

• … when the absence of a term from a document means that the document cannot be about that term

Page 12: CSA3080: Adaptive Hypertext Systems I

Does Boolean IR work?

• Boolean IR is typically applied to a document surrogate

• And is used with tremendous success in RDBMS

• Most general purpose IR systems in use on the Internet are derived from BIR with some extensions…

Page 13: CSA3080: Adaptive Hypertext Systems I

Vector Space Model of IR

• Briefly…
  – Documents (query) represented by vector of term weights
  – Term weight describes relative importance of term to document (query)
  – Similarity of document to query measured
  – The more similar the document to the query, the more relevant it is
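
A minimal sketch of this idea follows. The slide does not say how terms are weighted or how similarity is measured, so the raw term-frequency weights and the cosine measure used here are assumptions, as are the names `weights` and `cosine`.

```python
import math
from collections import Counter

def weights(text):
    """Term-frequency weights; a stand-in for whatever weighting scheme is used."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For example, `cosine(weights("information retrieval"), weights("information retrieval systems"))` is higher than the score for a document that shares no terms with the query, which is 0.0.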

Page 14: CSA3080: Adaptive Hypertext Systems I

Vector Space Model of IR

• VSM gives improved results over Boolean

• Can rank documents

• Can control output (limit the no. of documents returned)

• But… not as easy to construct query
  – Query does not contain any structure
  – Can’t express synonymy, etc.
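
Once every document has a similarity score, ranking and limiting the output is straightforward. The sketch below assumes the scores have already been computed; the score values and the name `top_k` are made up for illustration.

```python
import heapq

def top_k(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical similarity scores for four documents.
scores = {"d1": 0.82, "d2": 0.10, "d3": 0.47, "d4": 0.00}
best_two = top_k(scores, 2)
```

With a pure Boolean model there would be no scores to sort on, which is why this kind of cut-off is not available there.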

Page 15: CSA3080: Adaptive Hypertext Systems I

Extended Boolean Retrieval Model

• Developed to address ranking problem in BIR, using VSM-like approach, while retaining Boolean query structures

• E-BIR not as strict as BIR (fuzzy matches supported, as in VSM)

• Term features can include frequency, location, …

• Reference:

– G. Salton, E. Fox, and U. Wu. (1983). Extended Boolean information retrieval. Communications of the ACM, 26(12):1022-1036.

Page 16: CSA3080: Adaptive Hypertext Systems I

Extended Boolean Retrieval Model

• Matching is still based on presence or absence of terms, but now results can be ranked

• Terms in docs and query are weighted according to term features

• With structured documents (e.g., HTML), term features can also include structural information (title, heading, style, …)
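
One well-known way to rank Boolean results is the p-norm scoring of Salton, Fox and Wu (1983). The slides give no formulas, so the following is only a sketch of that scheme under simplifying assumptions: all query terms carry equal weight, document term weights lie in [0, 1], and the function names are hypothetical.

```python
def or_score(weights, p=2.0):
    """p-norm score of an OR query, given the document's weights for the query terms."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def and_score(weights, p=2.0):
    """p-norm score of an AND query, given the document's weights for the query terms."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)
```

A document containing only one of two query terms (weights `[1.0, 0.0]`) scores higher for OR than for AND, but neither score is 0 or 1, so partial matches can be ranked. With `p=1` both functions reduce to the mean of the weights, and larger values of `p` move the behaviour closer to strict Boolean matching.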

Page 17: CSA3080: Adaptive Hypertext Systems I

Extended Boolean Retrieval Model

• With location information, it is possible to find terms NEAR each other
  – “computer NEAR science” is not the same as “computer AND science”
  – ADJ (adjacent) refines the proximity measure
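
The difference between AND, NEAR and ADJ can be sketched with a positional index. The exact semantics vary between systems, so the choices here are assumptions: NEAR takes an explicit window size and ignores order, and ADJ means the second term immediately follows the first. The function names are hypothetical.

```python
def positions(text):
    """Map each term to the list of word positions at which it occurs."""
    index = {}
    for pos, term in enumerate(text.lower().split()):
        index.setdefault(term, []).append(pos)
    return index

def near(index, a, b, window):
    """True if a and b occur within `window` words of each other, in either order."""
    return any(abs(i - j) <= window
               for i in index.get(a, []) for j in index.get(b, []))

def adj(index, a, b):
    """True if b occurs immediately after a."""
    return any(j - i == 1
               for i in index.get(a, []) for j in index.get(b, []))
```

A text such as "the science of building a faster computer" satisfies "computer AND science", but not `near(..., "computer", "science", 3)` and not `adj(..., "computer", "science")`.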

Page 18: CSA3080: Adaptive Hypertext Systems I

Questions Arising…

• Ranked results are an improvement

• NEAR is also useful to improve the quality of results

• … as is ADJ

• Are we any closer to information retrieval?

Page 19: CSA3080: Adaptive Hypertext Systems I

Phrase Matching

• Concepts may be evidenced in text as complex/compound identifiers
  – New York, Computer Science, information retrieval, database management systems, …

• Brings us closer to information retrieval, but still only identifies documents that contain phrases

• Reference:
  – W. Bruce Croft, Howard R. Turtle, and David D. Lewis, (1991), The use of phrases and structured queries in information retrieval, ACM SIGIR, 32-45.

Page 20: CSA3080: Adaptive Hypertext Systems I

Phrase Matching

• Extended/Boolean can express phrases using AND together with proximity operator

• VSM cannot, unless the phrase has been indexed!

• When is a sequence of words a phrase?
  – Croft et al. use a probabilistic inference net model…
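
To illustrate what "indexing the phrase" could mean for a vector space model, the sketch below adds known two-word phrases to the index as extra terms. It does not decide which word sequences are phrases: the phrase list is simply assumed to be given, and this is not the inference net approach of Croft et al. The name `index_terms` is hypothetical.

```python
from collections import Counter

def index_terms(text, phrases):
    """Count single words plus any listed two-word phrases found in text."""
    words = text.lower().split()
    counts = Counter(words)
    for first, second in zip(words, words[1:]):
        bigram = f"{first} {second}"
        if bigram in phrases:
            counts[bigram] += 1
    return counts

# Hypothetical phrase list; how to obtain it is the hard part.
phrases = {"new york", "computer science"}
counts = index_terms("New York has a new computer science school", phrases)
```

With "new york" stored as a term of its own, a term-weight vector can tell a document about New York apart from one that merely contains "new" and "york" somewhere.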

Page 21: CSA3080: Adaptive Hypertext Systems I

Conclusion

• The Boolean and Extended Boolean Models give us a simple mechanism for representing documents

• If we can represent a user’s interest by the presence or absence of terms, then the user model could be used as a query to locate interesting documents

• Phrase matching allows us to recognise complex nouns: useful only if phrase is pervasive
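
As a closing illustration of using a user model as a query, the sketch below treats the model as a mapping from terms to a flag saying whether the user wants that term present or absent. This encoding, the matching rule (at least one wanted term and no unwanted term), and all names and data are assumptions made for the example.

```python
def interesting(docs, user_model):
    """Return ids of documents containing a wanted term and no unwanted term."""
    wanted = {t for t, flag in user_model.items() if flag}
    unwanted = {t for t, flag in user_model.items() if not flag}
    return sorted(
        doc_id for doc_id, terms in docs.items()
        if terms & wanted and not terms & unwanted
    )

# Hypothetical documents (as term sets) and user model.
docs = {
    "d1": {"adaptive", "hypertext"},
    "d2": {"hypertext", "cooking"},
    "d3": {"retrieval"},
    "d4": {"gardening"},
}
user_model = {"hypertext": True, "retrieval": True, "cooking": False}
found = interesting(docs, user_model)
```

Here the user model plays exactly the role of a Boolean query: (hypertext OR retrieval) NOT cooking.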