
Query operations

1- Introduction

2- Relevance feedback with user relevance information

3- Relevance feedback without user relevance information

- Local analysis (pseudo-relevance feedback)

- Global analysis (thesaurus)

4- Evaluation

5- Issues

Introduction (1)

No detailed knowledge of the collection and the retrieval environment makes it difficult to formulate queries well designed for retrieval; many query formulations are often needed for effective retrieval.

The first formulation is often a naïve attempt to retrieve relevant information. The documents initially retrieved can be:

- examined for relevance information (by the user, or automatically)
- used to improve the query formulation so that additional relevant documents are retrieved

Query reformulation:

- expanding the original query with new terms
- reweighting the terms in the expanded query

Introduction (2)

Approaches based on feedback from users (relevance feedback)

Approaches based on information derived from set of initially retrieved documents (local set of documents)

Approaches based on global information derived from document collection

Relevance feedback with user relevance information (1)

Most popular query reformulation strategy. Cycle:

- user is presented with a list of retrieved documents
- user marks those which are relevant (in practice, the top 10-20 ranked documents are examined)
- the process is incremental

Select important terms from the documents assessed relevant by the user, and enhance the importance of these terms in a new query.

Expected: the new query moves towards the relevant documents and away from the non-relevant documents.

Relevance feedback with user relevance information (2)

Two basic techniques

- Query expansion: add new terms from relevant documents
- Term reweighting: modify term weights based on user relevance judgements

Advantages

- shields users from the details of the query reformulation process
- search is broken down into a sequence of small steps
- controlled process: emphasise some terms (the relevant ones), de-emphasise others (the non-relevant ones)

Relevance feedback with user relevance information (3)

Query expansion and term reweighting in the vector space model

Term reweighting in the probabilistic model

Query expansion and term reweighting in the vector space model

Term weight vectors of documents assessed relevant show similarities among themselves.

Term weight vectors of documents assessed non-relevant are dissimilar from those of the relevant documents.

Reformulated query: closer to the term weight vectors of the relevant documents.

Query expansion and term reweighting in the vector space model

For a query q:

- Dr: set of relevant documents among the retrieved documents
- Dn: set of non-relevant documents among the retrieved documents
- Cr: set of relevant documents among all documents in the collection
- α, β, γ: tuning constants

Assume that Cr is known (unrealistic!)

Best query vector for distinguishing relevant documents from non-relevant documents

q_opt = (1 / |Cr|) · Σ_{dj ∈ Cr} dj − (1 / (N − |Cr|)) · Σ_{dj ∉ Cr} dj

Query expansion and term reweighting in the vector space model

Problem: |Cr| is unknown. Approach:

- formulate an initial query
- incrementally change the initial query vector, using |Dr| and |Dn| instead

Rocchio formula; Ide formula

Rocchio formula

Direct application of the previous formula, with the original query added. Initial formulation: α = 1.

Usually the information in relevant documents is more important than that in non-relevant documents (γ << β).

Positive relevance feedback: γ = 0.

q_{i+1} = α · q_i + (β / |Dr|) · Σ_{dj ∈ Dr} dj − (γ / |Dn|) · Σ_{dj ∈ Dn} dj
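As a rough sketch (not from the slides; the function name, the default constants, and the clipping of negative weights are illustrative choices), the Rocchio update q_{i+1} = α·q_i + (β/|Dr|)·Σ dj − (γ/|Dn|)·Σ dj over term-weight vectors could look like:

```python
# Illustrative sketch of the Rocchio update; the name and the default
# alpha/beta/gamma values are my choices, not from the slides.
# q and each document are equal-length term-weight vectors (lists).

def rocchio_update(q, d_rel, d_non, alpha=1.0, beta=0.75, gamma=0.15):
    new_q = [alpha * w for w in q]
    for d in d_rel:                 # + (beta/|Dr|) * sum of relevant docs
        for t, w in enumerate(d):
            new_q[t] += beta * w / len(d_rel)
    for d in d_non:                 # - (gamma/|Dn|) * sum of non-relevant docs
        for t, w in enumerate(d):
            new_q[t] -= gamma * w / len(d_non)
    # negative weights are usually dropped (as in the SMART system)
    return [max(0.0, w) for w in new_q]
```

With two terms, one relevant document [0, 1] and one non-relevant document [0.5, 0], the query [1, 0] moves weight onto the second term and away from the first, as the formula intends.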

Rocchio formula in practice (SMART)

α = β = γ = 1. Terms used:

- the original query terms
- terms that appear in more relevant documents than non-relevant documents
- terms that appear in more than half of the relevant documents

Negative weights are ignored.

q_{i+1} = q_i + (1 / |Dr|) · Σ_{dj ∈ Dr} dj − (1 / |Dn|) · Σ_{dj ∈ Dn} dj

Ide formula

Initial formulation: α = β = γ = 1. Same comments as for the Rocchio formula.

Both Ide and Rocchio: no optimality criterion.

q_{i+1} = q_i + Σ_{dj ∈ Dr} dj − Σ_{dj ∈ Dn} dj

Term reweighting for the probabilistic model

(see note on the BIR model)

Use idf to rank documents for the original query.

Calculate the term weights c_i from the relevance information.

Predict relevance → improved (optimal) retrieval function:

g(D) = Σ_{i=1..n} c_i · d_i

Term reweighting for the probabilistic model

Independence assumptions

- I1: terms are distributed independently in the relevant documents, and independently in all documents
- I2: terms are distributed independently in the relevant documents, and independently in the non-relevant documents

Ordering principles

- O1: probable relevance is based only on the presence of search terms in documents
- O2: probable relevance is based on both the presence of search terms in documents and their absence from documents

Term reweighting for the probabilistic model

Various combinations

                        Independence I1   Independence I2
Ordering principle O1         F1                F2
Ordering principle O2         F3                F4

Term reweighting for the probabilistic model

F1 formula: ratio of the proportion of relevant documents in which the query term ti occurs to the proportion of all documents in which ti occurs.

ri = number of relevant documents containing ti
ni = number of documents containing ti
R = number of relevant documents
N = number of documents in the collection

c_i = log( (ri / R) / (ni / N) )

Term reweighting for the probabilistic model

F2 formula: ratio of the proportion of relevant documents in which ti occurs to the proportion of non-relevant documents in which ti occurs.

ri = number of relevant documents containing ti
ni = number of documents containing ti
R = number of relevant documents
N = number of documents in the collection

c_i = log( (ri / R) / ((ni − ri) / (N − R)) )

Term reweighting for the probabilistic model

F3 formula: ratio of the “relevance odds” (ratio of relevant documents containing ti to relevant documents not containing ti) to the “collection odds” (ratio of documents containing ti to documents not containing ti).

ri = number of relevant documents containing ti
ni = number of documents containing ti
R = number of relevant documents
N = number of documents in the collection

c_i = log( (ri / (R − ri)) / (ni / (N − ni)) )

Term reweighting for the probabilistic model

F4 formula: ratio of the “relevance odds” to the “non-relevance odds” (ratio of non-relevant documents containing ti to non-relevant documents not containing ti).

ri = number of relevant documents containing ti
ni = number of documents containing ti
R = number of relevant documents
N = number of documents in the collection

c_i = log( (ri / (R − ri)) / ((ni − ri) / (N − ni − R + ri)) )
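As an illustrative sketch (function names are mine; practical implementations usually add a small smoothing constant to avoid zero counts, omitted here), the four weights can be computed directly from the counts ri, ni, R, N:

```python
import math

# Illustrative sketch (names are mine): term weights c_i under F1-F4,
# using the slides' counts: r = relevant docs containing t_i,
# n = all docs containing t_i, R = relevant docs, N = collection size.
# A 0.5 smoothing term is often added in practice; omitted here.

def f1(r, n, R, N):
    # proportion in relevant docs vs proportion in all docs
    return math.log((r / R) / (n / N))

def f2(r, n, R, N):
    # proportion in relevant docs vs proportion in non-relevant docs
    return math.log((r / R) / ((n - r) / (N - R)))

def f3(r, n, R, N):
    # relevance odds vs collection odds
    return math.log((r / (R - r)) / (n / (N - n)))

def f4(r, n, R, N):
    # relevance odds vs non-relevance odds
    return math.log((r / (R - r)) / ((n - r) / (N - n - R + r)))
```

For example, with r = 8, n = 20, R = 10, N = 1000, F1 reduces to log((8/10)/(20/1000)) = log 40, and the same counts give progressively larger weights under F2, F3 and F4.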

Experiments

F1, F2, F3 and F4 all outperform no relevance weighting and idf weighting.

F1 and F2 perform in the same range; so do F3 and F4.

F3 and F4 outperform F1 and F2; F4 is slightly better than F3.

O2 is the correct ordering principle (looking at both presence and absence of terms).

No conclusion with respect to I1 and I2, although I2 seems the more realistic assumption.

Relevance feedback without user relevance information

Relevance feedback with user relevance information relies on the clustering hypothesis: known relevant documents contain terms which can be used to describe a larger cluster of relevant documents. The description of the cluster is built interactively, with user assistance.

Relevance feedback without user relevance information: obtain the cluster description automatically, by identifying terms related to the query terms (e.g. synonyms, stemming variations, terms close to the query terms in the text).

Local strategies; Global strategies

Local analysis (pseudo-relevance feedback)

Examine documents retrieved for query to determine query expansion

No user assistance

Clustering techniques

Risk of query “drift” (the expanded query moving away from the user's information need)

Clusters (1)

Synonymy association (one example): terms that frequently co-occur inside the local set of documents Dl.

Term-term (e.g. stem-stem) association matrix (normalised):

c_{i,j} = Σ_{d ∈ Dl} tf(ti, d) · tf(tj, d)

m_{i,j} = c_{i,j} / (c_{i,i} + c_{j,j} − c_{i,j})

Clusters (2)

For a term ti:

- take the n largest values m_{i,j}
- the resulting terms tj form the cluster for ti

For a query q:

- find clusters for the |q| query terms
- keep the clusters small
- expand the original query
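A toy sketch of this clustering step, assuming documents in the local set are given as lists of stemmed terms (the function name and data layout are my assumptions):

```python
from collections import Counter

# Sketch (names are illustrative): association clusters over a local
# document set, where each document is a list of (stemmed) terms.

def association_clusters(docs, n=3):
    tfs = [Counter(d) for d in docs]
    vocab = sorted(set().union(*tfs))
    # c[i,j] = sum over local docs of tf(t_i, d) * tf(t_j, d)
    c = {(ti, tj): sum(tf[ti] * tf[tj] for tf in tfs)
         for ti in vocab for tj in vocab}
    # normalised association: m[i,j] = c[i,j] / (c[i,i] + c[j,j] - c[i,j])
    m = {(ti, tj): c[ti, tj] / (c[ti, ti] + c[tj, tj] - c[ti, tj])
         for ti in vocab for tj in vocab}
    # cluster for t_i: the n terms t_j (j != i) with the largest m[i,j]
    return {ti: [tj for tj in sorted(vocab, key=lambda tj: -m[ti, tj])
                 if tj != ti][:n]
            for ti in vocab}
```

The denominator c_{i,i} + c_{j,j} − c_{i,j} is always positive here, since every vocabulary term occurs in at least one local document and c_{i,j} cannot exceed the geometric mean of c_{i,i} and c_{j,j}.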

Global analysis

Expand query using information from whole set of documents in collection

Thesaurus-like structure using all documents

Approaches to automatically build a thesaurus (e.g. a similarity thesaurus based on co-occurrence frequencies)

Approaches to select terms for query expansion

Evaluation of relevance feedback strategies

Use q_i and compute a precision-recall graph; use q_{i+1} and compute a precision-recall graph, over all documents in the collection.

This yields spectacular improvements, but they are partly due to documents already known to the user being ranked higher. Evaluation must therefore be done with respect to the documents not seen by the user.

Three techniques

Evaluation of relevance feedback strategies

Freezing

Full freezing:

- the top n documents (the ones used in relevance feedback) are frozen in their original rank positions; the remaining documents are re-ranked
- precision-recall is computed on the whole ranking, so changes in effectiveness come from the unseen documents
- with many iterations, the growing contribution of the frozen documents may lead to a decrease in measured effectiveness

Modified freezing:

- freeze the ranking only up to the rank position of the last document marked relevant

Evaluation of relevance feedback strategies

Test and control group

Random splitting of the documents into a test group and a control group:

- query reformulation (relevance feedback) is performed on the test documents
- the new query is then run against the control documents, so evaluation uses only the control group

Difficulty: splitting the collection so that relevant documents are distributed evenly between the two groups.

Evaluation of relevance feedback strategies

Residual ranking

The documents used in assessing relevance are removed, and precision-recall is computed on the “residual collection”.

This considers only the effect on unseen documents, but the results are not comparable with the original ranking (fewer relevant documents remain).
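A toy sketch of residual-collection evaluation (the function name and example data are my assumptions):

```python
# Toy sketch: precision at k on the residual collection, i.e. after
# removing the documents the user already judged during feedback.

def residual_precision_at_k(ranking, relevant, seen, k):
    residual = [d for d in ranking if d not in seen]
    top = residual[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in relevant) / len(top)
```

As the slide cautions, these numbers are not comparable with precision on the full collection, since relevant documents have been removed along with the judged ones.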

Issues

Interface: allow the user to quickly identify relevant and non-relevant documents. What happens with 2D and 3D visualisations?

Global analysis: on the web? (e.g. Yahoo!)

Local analysis: computational cost (performed on-line)

Interactive query expansion: the user chooses the terms to be added

Negative relevance feedback

Documents explicitly marked as non-relevant by users

Implementation

Clarity

Usability