Web Search - Summer Term 2006
II. Information Retrieval (Models, Cont.)
(c) Wolfgang Hürst, Albert-Ludwigs-University
Classic Retrieval Models
1. Boolean Model (set theoretic)
2. Vector Model (algebraic)
3. Probabilistic Models (probabilistic)
Probabilistic IR Models

Based on probability theory.

Basic idea: Given a document d and a query q, estimate the likelihood of d being relevant for the information need represented by q, i.e. P(R|q,d).

Compared to previous models:
- Boolean and Vector Models: Ranking based on a relevance value which is interpreted as a similarity measure between q and d
- Probabilistic Models: Ranking based on the estimated likelihood of d being relevant for query q
Probabilistic Modeling

Given: Documents dj = (t1, t2, ..., tn), queries qi
(n = number of index terms, not the number of documents)

We assume a similar dependence between d and q as before, i.e. relevance depends on the term distribution. (Note: Slightly different notation here than before!)

Estimating P(R|d,q) directly is often impossible in practice. Instead, use Bayes' theorem, i.e.

  P(R|d,q) = P(d|R,q) * P(R|q) / P(d|q)

or, equivalently, rank documents by the odds

  O(R|d,q) = P(R|d,q) / P(NR|d,q)
Probabilistic Modeling as Decision Strategy

The decision about which docs should be returned is based on a threshold calculated with a cost function Cj.

Example: Cj(R, dec)

                  Retrieved   Not retrieved
  Relevant doc.       0             1
  Non-rel. doc.       2             0

Decision based on a risk function that minimizes expected costs: retrieve d iff

  Cj(NR, retrieved) * P(NR|q,d) <= Cj(R, not retrieved) * P(R|q,d)

With the costs above: retrieve d iff 2 * (1 - P(R|q,d)) <= P(R|q,d), i.e. iff P(R|q,d) >= 2/3.
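This expected-cost decision rule can be sketched in a few lines of Python. The cost values default to the ones from the example table; everything else (function name, inputs) is illustrative, not from the course material.

```python
def decide_retrieve(p_rel, cost_ret_nonrel=2.0, cost_notret_rel=1.0):
    """Retrieve d iff the expected cost of retrieving it is not higher
    than the expected cost of withholding it (costs as in the example)."""
    ec_retrieve = cost_ret_nonrel * (1.0 - p_rel)  # cost incurred only if d is non-relevant
    ec_withhold = cost_notret_rel * p_rel          # cost incurred only if d is relevant
    return ec_retrieve <= ec_withhold

print(decide_retrieve(0.7))  # True  (0.7 >= 2/3)
print(decide_retrieve(0.5))  # False
```

With these costs the rule reduces to the threshold P(R|q,d) >= 2/3; changing the cost table shifts the threshold accordingly.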
Probability Estimation

Different approaches to estimate P(d|R) exist:
- Binary Independence Retrieval Model (BIR)
- Binary Independence Indexing Model (BII)
- Darmstadt Indexing Approach (DIA)

Generally we assume stochastic independence between the terms of one document, i.e.

  P(d|R) = prod over i of P(ti|R)
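Under this independence assumption, P(d|R) factors into per-term probabilities. A minimal sketch for a binary document representation (the probability values below are made up for illustration):

```python
from functools import reduce

def p_doc_given_rel(term_present, p_term_rel):
    """P(d|R) under term independence: multiply P(t_i|R) for terms present
    in d and (1 - P(t_i|R)) for terms absent from d."""
    factors = [p if present else 1.0 - p
               for present, p in zip(term_present, p_term_rel)]
    return reduce(lambda a, b: a * b, factors, 1.0)

# Hypothetical three-term vocabulary: d contains t1 and t3, but not t2.
print(p_doc_given_rel([True, False, True], [0.8, 0.4, 0.5]))  # 0.8 * 0.6 * 0.5 = 0.24
```

In practice the per-term probabilities P(ti|R) are what the models below (BIR, BII, DIA) estimate from training data.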
Binary Independence Retrieval Model (BIR)

Learning: Estimation of the probability distribution based on
- a query qk
- a set of documents dj
- respective relevance judgments

Application: Generalization to different documents from the collection (but restricted to the same query and the terms from training)

[Figure: docs/terms/queries space, showing the learning region and the application region for BIR - generalization along the document dimension.]
Binary Independence Indexing Model (BII)

Learning: Estimation of the probability distribution based on
- a document dj
- a set of queries qk
- respective relevance judgments

Application: Generalization to different queries (but restricted to the same document and the terms from training)

[Figure: docs/terms/queries space, showing the learning region and the application region for BII - generalization along the query dimension.]
Darmstadt Indexing Approach (DIA)

Learning: Estimation of the probability distribution based on
- a set of queries qk
- an abstract description of a set of documents dj
- respective relevance judgments

Application: Generalization to different queries and documents

[Figure: docs/terms/queries space, showing the learning region and the application region for DIA - generalization along both the query and the document dimension.]
DIA - Description Step

Basic idea: Instead of term-document pairs, consider relevance descriptions x(ti, dm).

These contain the values of certain attributes of term ti, of document dm, and of their relation to each other.

Examples:
- Dictionary information about ti (e.g. IDF)
- Parameters describing dm (e.g. length or number of unique terms)
- Information about the appearance of ti in dm (e.g. in title or abstract), its frequency, the distance between two query terms, etc.

REFERENCE: FUHR, BUCKLEY [4]
DIA - Decision Step

Estimation of the probability P(R | x(ti, dm)):
P(R | x(ti, dm)) is the probability of a document dm being relevant to an arbitrary query, given that a term common to both document and query has the relevance description x(ti, dm).

Advantages:
- Abstraction from specific term-document pairs and thus generalization to arbitrary docs and queries
- Enables individual, application-specific relevance descriptions
DIA - (Very) Simple Example

RELEVANCE DESCRIPTION: x(ti, dm) = (x1, x2) with

  x1 = 1, if ti occurs in the title of dm
       0, otherwise

  x2 = 1, if ti occurs in dm exactly once
       2, if ti occurs in dm at least twice

TRAINING SET: q1, q2, d1, d2, d3

EVENT SPACE:

  QUERY  DOC.  REL.       TERM  x
  q1     d1    REL.       t1    (1,1)
                          t2    (0,1)
                          t3    (1,2)
  q1     d2    NOT REL.   t1    (0,2)
                          t3    (1,1)
                          t4    (0,1)
  q2     d1    REL.       t2    (0,2)
                          t5    (0,2)
                          t6    (1,1)
                          t7    (1,2)
  q2     d3    NOT REL.   t5    (0,1)
                          t7    (0,1)

Resulting estimates:

  x      E(x)
  (0,1)  1/4
  (0,2)  2/3
  (1,1)  2/3
  (1,2)  1
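The estimates E(x) in the example are simply the relative frequencies of relevant pairs per relevance description. A short script reproducing the table (the data encoding is mine; the values are the ones from the example):

```python
from collections import defaultdict

# Training observations: (relevance description x, was the (query, doc) pair relevant?)
observations = [
    # q1, d1 (relevant): t1, t2, t3
    ((1, 1), True), ((0, 1), True), ((1, 2), True),
    # q1, d2 (not relevant): t1, t3, t4
    ((0, 2), False), ((1, 1), False), ((0, 1), False),
    # q2, d1 (relevant): t2, t5, t6, t7
    ((0, 2), True), ((0, 2), True), ((1, 1), True), ((1, 2), True),
    # q2, d3 (not relevant): t5, t7
    ((0, 1), False), ((0, 1), False),
]

def estimate(obs):
    """E(x) = P(R | x) estimated as the fraction of relevant observations
    among all observations with relevance description x."""
    rel, total = defaultdict(int), defaultdict(int)
    for x, relevant in obs:
        total[x] += 1
        rel[x] += int(relevant)
    return {x: rel[x] / total[x] for x in total}

est = estimate(observations)
print(est[(0, 1)], est[(0, 2)], est[(1, 1)], est[(1, 2)])
```

Running this yields 1/4, 2/3, 2/3, and 1, matching the table above.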
DIA - Indexing Function

Because of the relevance descriptions: Generalization to arbitrary docs and queries.

Another advantage: Instead of probabilities, we can also use a general indexing function e(x(ti, dm)).

Note: We have a typical pattern recognition problem here, i.e.
- Given: A set of features / parameters and different classes (here: relevant and not relevant)
- Goal: Classification based on these features
Approaches such as neural networks, SVMs, etc. can be used.
Models for IR - Taxonomy

Classic models:
- Boolean model (based on set theory)
- Vector space model (based on algebra)
- Probabilistic models (based on probability theory)

Set theoretic extensions: Fuzzy set model, Extended Boolean model
Algebraic extensions: Generalized vector model, Latent semantic indexing, Neural networks
Probabilistic extensions: Inference networks, Belief network

SOURCE: R. BAEZA-YATES [1], PAGES 20+21

Further models: Structured models, Models for browsing, Filtering
References & Recommended Reading

[1] R. Baeza-Yates, B. Ribeiro-Neto: Modern Information Retrieval, Addison Wesley, 1999. Chapters 2-2.5 (IR models), Chapter 5 (relevance feedback)

[2] N. Fuhr: Skriptum zur Vorlesung Information Retrieval (lecture notes), available online at the course home page http://www.is.informatik.uni-duisburg.de/courses/ir_ss06/index.html or directly at http://www.is.informatik.uni-duisburg.de/courses/ir_ss06/folien/irskall.pdf. Chapters 5.1-5.3, 5.5, 6 (IR models)

[3] F. Crestani, M. Lalmas, C.J. van Rijsbergen, I. Campbell: Is This Document Relevant? ... Probably: A Survey of Probabilistic Models in Information Retrieval, ACM Computing Surveys, Vol. 30, No. 4, Dec. 1998. Chapters 1-3.4 (probabilistic models)

[4] N. Fuhr, C. Buckley: A Probabilistic Learning Approach for Document Indexing, ACM Transactions on Information Systems, Vol. 9, No. 3, July 1991. Chapters 2 and 4 (probabilistic models)
II. Information Retrieval (Basics: Relevance Feedback)
Relevance Feedback

Motivation: Formulating a good query is often difficult.

Idea: Improve the search result by letting the user indicate the relevance of the initially returned docs.

Possible usage:
- Get better search results
- Re-train the current IR model

Different approaches based on
- User feedback
- Local information in the initial result set
- Global information in the whole document collection
Relevance Feedback based on User Input

Procedure:
- User enters an initial query
- System returns a result based on this query
- User marks relevant documents
- System selects important terms from the marked docs
- System returns a new result based on these terms

Two approaches:
- Query Expansion
- Term Re-weighting

Advantages:
- Breaks the search task down into smaller steps
- Relevance judgments are easier to make than the (re-)formulation of a query
- Controlled process to emphasize relevant terms and de-emphasize irrelevant ones
Query Expansion & Term Re-Weighting for the Vector Model

Vector Space Model: Representation of documents and queries as weighted vectors of terms.

Assumption:
- Large overlap of the term sets from relevant documents
- Small overlap of the term sets from irrelevant docs

Basic idea: Re-formulate the query in order to move the query vector closer to the documents marked as relevant.

Optimum Query Vector

  Dr = set of returned docs marked as relevant
  Dn = set of returned docs marked as irrelevant
  Cr = set of all relevant docs in the document collection
  N  = number of docs in the collection
  |Dr|, |Dn|, |Cr| = number of docs in the respective sets
  alpha, beta, gamma = constant factors (for fine-tuning)

Best query vector to distinguish relevant from non-relevant docs:

  q_opt = (1/|Cr|) * sum over dj in Cr of dj  -  (1/(N - |Cr|)) * sum over dj not in Cr of dj

Query Expansion & Term Re-Weighting

Based on the relevance feedback from the user, we incrementally change the initial query vector q to create a better query vector qm.

Goal: Approximation of the optimum query vector q_opt.

Standard_Rocchio approach:

  qm = alpha * q  +  (beta/|Dr|) * sum over dj in Dr of dj  -  (gamma/|Dn|) * sum over dj in Dn of dj

Other approaches exist, e.g. Ide_Regular, Ide_Dec_Hi.
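The Rocchio update is straightforward to implement on term-weight vectors. A minimal sketch with NumPy; the default values for alpha, beta, and gamma are commonly used settings, not values prescribed by the course:

```python
import numpy as np

def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Standard Rocchio: move the query vector toward the centroid of the
    relevant docs and away from the centroid of the non-relevant docs."""
    q_m = alpha * np.asarray(q, dtype=float)
    if rel_docs:
        q_m += beta * np.mean(np.asarray(rel_docs, dtype=float), axis=0)
    if nonrel_docs:
        q_m -= gamma * np.mean(np.asarray(nonrel_docs, dtype=float), axis=0)
    return np.maximum(q_m, 0.0)  # negative term weights are usually clipped to 0

# Toy three-term vocabulary: one relevant and one non-relevant judged doc.
q0 = [1.0, 0.0, 1.0]
q1 = rocchio(q0, rel_docs=[[1.0, 1.0, 0.0]], nonrel_docs=[[0.0, 0.0, 1.0]])
print(q1)  # [1.75 0.75 0.85]
```

Note how the second component (present only in the relevant doc) gains weight, while the third (present in the non-relevant doc) loses weight.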
Relevance Feedback without User Input

Different approaches based on
- User feedback
- Local information in the initial result set
- Global information in the whole document collection

Basic idea of relevance feedback: Clustering, i.e. the docs marked as relevant contain additional terms which describe a larger cluster of relevant docs.

So far: Get user feedback to create this term set.
Now: Approaches to get these term sets automatically.

Two approaches:
- Local strategies (based on the returned result set)
- Global strategies (based on the whole document collection)
Query Expansion Through Local Clustering

Motivation: Given a query q, there exists a local relationship between relevant documents.

Basic idea: Expand query q with additional terms based on a clustering of the documents from the initial result set.

Different approaches exist:
- Association Clusters: Assume a correlation between terms co-occurring in different docs
- Metric Clusters: Assume a correlation between terms close to each other (in a document)
- Scalar Clusters: Assume a correlation between terms with a similar neighborhood
Metric Clusters

Note: In the following we consider word stems s instead of terms (analogous to the literature; works similarly with terms).

  r(ti, tj) = distance between two terms ti and tj in document d (in number of terms)
  s(t) = root (stem) of term t
  V(s) = set of all words with root s

Define a local stem-stem correlation matrix with elements su,v based on the correlation

  c(u,v) = sum over ti in V(su), tj in V(sv) of 1 / r(ti, tj)

or normalized:

  c(u,v) / (|V(su)| * |V(sv)|)
Query Expansion with Metric Clusters

Clusters based on the metric correlation matrix: For a given stem su, return the n stems sv with the highest values su,v.

Use these clusters for query expansion.

Comments:
- Clusters do not necessarily contain synonyms
- Non-normalized clusters often contain high-frequency terms
- Normalized clusters often group terms that appear less often
- Therefore: Combined approaches exist (i.e. using both normalized and non-normalized clusters)
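The non-normalized metric correlation can be sketched directly from its definition: sum 1/distance over all occurrence pairs of two stems. This simplified version works on a single document already reduced to a list of stems (tokenization and stemming are assumed to have happened elsewhere):

```python
from collections import defaultdict
from itertools import product

def metric_correlations(stems):
    """Non-normalized stem-stem correlation c(u,v) = sum over occurrence
    pairs (i, j) of 1 / r(i, j), with r the token distance in the document."""
    positions = defaultdict(list)
    for i, s in enumerate(stems):
        positions[s].append(i)
    c = defaultdict(float)
    for u, v in product(positions, positions):
        if u == v:
            continue  # only correlations between different stems
        for i in positions[u]:
            for j in positions[v]:
                c[(u, v)] += 1.0 / abs(i - j)
    return dict(c)

# Toy document, already stemmed: "search" occurs at positions 1 and 3.
doc = ["web", "search", "engine", "search"]
c = metric_correlations(doc)
print(c[("web", "search")])  # 1/1 + 1/3 = 1.333...
```

For the normalized variant, divide each c(u,v) by |V(su)| * |V(sv)|, i.e. by the number of occurrences (word forms) contributing to each stem.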
Overview of Approaches

Based on user feedback

Based on local information in the initial result set:
- Local clustering
- Local context analysis (combines local and global information)

Based on global information in the whole document collection, examples:
- Query expansion using a similarity thesaurus
- Query expansion using a statistical thesaurus
References (Books)

R. Baeza-Yates, B. Ribeiro-Neto: Modern Information Retrieval, Addison Wesley, 1999

William B. Frakes, Ricardo Baeza-Yates (Eds.): Information Retrieval - Data Structures and Algorithms, P T R Prentice Hall, 1992

C. J. van Rijsbergen: Information Retrieval, 1979, http://www.dcs.gla.ac.uk/Keith/Preface.html

C. Manning, P. Raghavan, H. Schütze: Introduction to Information Retrieval (to appear 2007), http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html

I. Witten, A. Moffat, T. Bell: Managing Gigabytes, Morgan Kaufmann Publishing, 1999

N. Fuhr: Skriptum zur Vorlesung Information Retrieval (lecture notes), SS 2006

And many more!

References (Articles)

G. Salton: A Blueprint for Automatic Indexing, ACM SIGIR Forum, Vol. 16, Issue 2, Fall 1981

F. Crestani, M. Lalmas, C.J. van Rijsbergen, I. Campbell: Is This Document Relevant? ... Probably: A Survey of Probabilistic Models in Information Retrieval, ACM Computing Surveys, Vol. 30, No. 4, Dec. 1998

N. Fuhr, C. Buckley: A Probabilistic Learning Approach for Document Indexing, ACM Transactions on Information Systems, Vol. 9, No. 3, July 1991

Further Sources

IR-related conferences:
- ACM SIGIR International Conference on Information Retrieval
- ACM / IEEE Joint Conference on Digital Libraries (JCDL)
- ACM Conference on Information and Knowledge Management (CIKM)
- Text REtrieval Conference (TREC), http://trec.nist.gov
Recap: IR System & Tasks Involved

[Figure: IR system architecture - the information need enters via the user interface and query processing (parsing & term processing) to form the logical view of the information need; documents are selected for indexing, parsed, term-processed, and indexed; searching and ranking over the index produce the results, which are returned via the result representation; performance evaluation spans the whole process.]
Schedule

Introduction
IR-Basics (Lectures)
- Overview, terms and definitions
- Index (inverted files)
- Term processing
- Query processing
- Ranking (TF*IDF, ...)
- Evaluation
- IR-Models (Boolean, vector, probabilistic)
IR-Basics (Exercises)
Web Search (Lectures and exercises)
Organizational Remarks

Exercises: Please register for the exercises by sending me ([email protected]) an email containing
- Your name,
- Matrikelnummer (student ID number),
- Studiengang (degree program: BA, MSc, Diploma, ...),
- Plans for the exam (yes, no, undecided)
This is just to organize the exercises, i.e. there are no consequences if you decide to drop this course.
Registrations should be done before the exercises start. Later registration might be possible under certain circumstances (contact me).