CSA3080: Adaptive Hypertext Systems I


CSA3080: Lecture 5 © 2003- Chris Staff


Dr. Christopher Staff

Department of Computer Science & AI

University of Malta

Lecture 5: Information Retrieval I



Aims and Objectives

• Aims and objectives of IR

• Boolean, Extended Boolean, Statistical Models



Aims and Objectives

• You should end up knowing the major differences between the simple matching algorithms

• And what each algorithm considers to be a relevant document…

• Bear in mind that we will use IR in AHS to find information relevant to our user so that we can present it/lead the user to it…



Aims and Objectives of IR

• To facilitate the identification and retrieval of documents that contain information relevant to an information need expressed by a user

• We are particularly interested in the retrieval of information from unstructured data



Boolean Information Retrieval

• Developed in the 1950s

• A document is represented by a collection of terms that occur in the document (index)

• The unique terms occurring in the collection are called the vocabulary

• A document is represented by a bit sequence with a 1 representing a term that is present, and 0 otherwise
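
The bit-sequence representation can be sketched in a few lines of Python. This is only an illustration: the three documents, the whitespace tokenizer, and the names `tokenize` and `bit_vector` are assumptions made for the example, not part of the lecture.

```python
# Toy collection; the documents and their text are made up for illustration.
docs = {
    "d1": "information retrieval finds relevant documents",
    "d2": "boolean retrieval uses set operations",
    "d3": "adaptive hypertext systems model the user",
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

# The vocabulary is the sorted list of unique terms in the whole collection.
vocabulary = sorted({term for text in docs.values() for term in tokenize(text)})

def bit_vector(text):
    """Return one bit per vocabulary term: 1 if the term is present, else 0."""
    present = set(tokenize(text))
    return [1 if term in present else 0 for term in vocabulary]

# Each document becomes a bit sequence aligned with the vocabulary.
vectors = {doc_id: bit_vector(text) for doc_id, text in docs.items()}
```

Every vector has the same length as the vocabulary, so position i always refers to the same term across documents.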




• How is the query expressed?

– User thinks of terms that describe an information need

– Formalises query as a boolean expression

– (Term27 OR Term46) NOT (Term30 AND Term16)




• How does the matching algorithm work?

– Each term in the vocabulary has a set (or postings list) of documents that contain the term

– For each term in the query, the postings lists are retrieved

– Set operations (union/intersection/difference) are applied to the postings lists, one per boolean operator (OR/AND/NOT)

– All documents in the results set are returned
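
A minimal sketch of this matching step, using Python sets as postings lists. The postings below are invented for the example, and the query's NOT is read as "AND NOT" (set difference), which is an assumption about the intended semantics.

```python
# Hypothetical postings lists: term -> set of IDs of documents containing it.
postings = {
    "Term16": {1, 4, 7},
    "Term27": {1, 2, 3},
    "Term30": {1, 3, 7},
    "Term46": {3, 5},
}

def docs_with(term):
    """Retrieve the postings list for a term (empty if the term is unknown)."""
    return postings.get(term, set())

# Query: (Term27 OR Term46) NOT (Term30 AND Term16)
wanted = docs_with("Term27") | docs_with("Term46")    # OR  -> union
unwanted = docs_with("Term30") & docs_with("Term16")  # AND -> intersection
results = wanted - unwanted                           # NOT -> set difference

print(sorted(results))  # prints [2, 3, 5]
```

Note that the result is an unordered set: every matching document is returned, and nothing in the computation says which match is better.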



Questions Arising…

• Is this really information retrieval?

– Just because a document contains term x, does it mean that the document is about term x?

• What about concepts?

– What makes it possible for us to know that a fish cake is not a dessert? That “she is the apple of my eye” does not make her a piece of fruit?




• Can we rank the results of a boolean query?

– All we are doing is checking the presence and absence of terms

– On what grounds would we rank?

• And doesn’t it look suspiciously like RDBMS/SQL???



Does Boolean IR work?

• BIR works, and works well, when the vocabulary is reasonably small…

• … when there is no ambiguity in the meaning of terms

• … when the presence of a term in a document is significant

• … when the absence of a term from a document means that the document cannot be about that term



Does Boolean IR work?

• Boolean IR is typically applied to a document surrogate

• And is used with tremendous success in RDBMS

• Most general purpose IR systems in use on the Internet are derived from BIR with some extensions…



Vector Space Model of IR

• Briefly…

– Documents (and the query) are represented by a vector of term weights

– Term weight describes relative importance of term to document (query)

– Similarity of document to query is measured

– The more similar the document to the query, the more relevant it is
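
As a rough sketch of the idea, the example below ranks three made-up documents against a query. The slides do not fix a weighting scheme or a similarity measure, so raw term frequency and cosine similarity are assumed here; the function names are likewise only illustrative.

```python
import math
from collections import Counter

def term_weights(text):
    """Weight each term by its raw frequency (an assumed, very simple scheme)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy collection, invented for the example.
docs = {
    "d1": "information retrieval systems retrieve information",
    "d2": "boolean retrieval uses set operations",
    "d3": "adaptive hypertext systems",
}

query = term_weights("information retrieval")
scores = {d: cosine_similarity(query, term_weights(text)) for d, text in docs.items()}

# Most similar (and so, under this model, most relevant) document first.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because each document receives a score rather than a yes/no answer, the output can be ordered and cut off after the top few results.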



Vector Space Model of IR

• VSM gives improved results over Boolean

• Can rank documents

• Can control output (limit the no. of documents returned)

• But… not as easy to construct query

– Query does not contain any structure

– Can’t express synonymy, etc.



Extended Boolean Retrieval Model

• Developed to address ranking problem in BIR, using VSM-like approach, while retaining Boolean query structures

• E-BIR not as strict as BIR (fuzzy matches supported, as in VSM)

• Term features can include frequency, location, …

• Reference:

– G. Salton, E. Fox, and H. Wu. (1983). Extended Boolean information retrieval. Communications of the ACM, 26(11):1022-1036.




• Matching is still based on presence or absence of terms, but now results can be ranked

• Terms in docs and query are weighted according to term features

• With structured documents (e.g., HTML), term features can also include structural information (title, heading, style, …)
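
One way to see how weighted terms allow Boolean queries to be ranked is the p-norm formulation commonly associated with the Salton, Fox and Wu paper. The sketch below assumes document term weights in the range 0 to 1 and query terms of equal importance; the function names and the two example documents are made up for illustration.

```python
def pnorm_or(weights, p=2.0):
    """Score for 'A OR B OR ...': high if any term weight is high."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def pnorm_and(weights, p=2.0):
    """Score for 'A AND B AND ...': penalises any term weight far from 1."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

# Hypothetical weights of the terms "computer" and "science" in two documents.
doc_a = [0.9, 0.0]  # strongly about "computer", no "science" at all
doc_b = [0.5, 0.5]  # moderately about both terms

and_scores = {"a": pnorm_and(doc_a), "b": pnorm_and(doc_b)}
or_scores = {"a": pnorm_or(doc_a), "b": pnorm_or(doc_b)}
```

Strict Boolean AND would reject `doc_a` outright. Here it still receives a small score, while `doc_b` ranks higher for the AND query and lower for the OR query.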




• With location information it is possible to find terms NEAR each other

– “computer NEAR science” is not the same as “computer AND science”

– ADJ (adjacent) refines the proximity measure
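
The difference between AND, NEAR and ADJ can be sketched with a positional index, which stores the word positions of each term in each document. The documents, the window size of 3, and the reading of ADJ as "second term immediately follows the first" are all assumptions made for this example.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> doc_id -> list of word positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index

def both(index, a, b):
    """AND: documents containing both terms, wherever they occur."""
    return set(index.get(a, {}).keys() & index.get(b, {}).keys())

def near(index, a, b, window):
    """NEAR: both terms occur within `window` positions, in either order."""
    return {
        d for d in both(index, a, b)
        if any(abs(pa - pb) <= window for pa in index[a][d] for pb in index[b][d])
    }

def adj(index, a, b):
    """ADJ (assumed ordered): term b appears immediately after term a."""
    return {
        d for d in both(index, a, b)
        if any(pa + 1 in index[b][d] for pa in index[a][d])
    }

# Toy collection, invented for the example.
docs = {
    "d1": "computer science is the study of computation",
    "d2": "the science museum bought a new computer",
    "d3": "science and the computer",
}
index = build_positional_index(docs)

and_hits = both(index, "computer", "science")
near_hits = near(index, "computer", "science", window=3)
adj_hits = adj(index, "computer", "science")
```

All three documents satisfy the AND query, only `d1` and `d3` satisfy NEAR, and only `d1` satisfies ADJ, so each operator narrows the result set further.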




• Ranked results are an improvement

• NEAR is also useful to improve the quality of results

• … as is ADJ

• Are we any closer to information retrieval?



Phrase Matching

• Concepts may be evidenced in text as complex/compound identifiers

– New York, Computer Science, information retrieval, database management systems, …

• Brings us closer to information retrieval, but still only identifies documents that contain phrases

• Reference:

– W. Bruce Croft, Howard R. Turtle, and David D. Lewis. (1991). The use of phrases and structured queries in information retrieval. ACM SIGIR, 32-45.



Phrase Matching

• Extended/Boolean can express phrases using AND together with proximity operator

• VSM cannot, unless the phrase has been indexed!

• When is a sequence of words a phrase?

– Croft et al. use a probabilistic inference net model…



Conclusion

• The Boolean and Extended Boolean Models give us a simple mechanism for representing documents

• If we can represent a user’s interest by the presence or absence of terms, then the user model could be used as a query to locate interesting documents

• Phrase matching allows us to recognise complex nouns: useful only if phrase is pervasive
