6/25/2015 4:26 PMCopyright © 2001 S. Kambhampati 8/24 --Are you getting e-mails sent to class...


04/18/23 21:44 Copyright © 2001 S. Kambhampati

8/24

--Are you getting e-mails sent to class mailing list? (if not, send me email)

Agenda: --Intro wrap up --Start on text retrieval

Link du jour: a9.com

Queries that can’t be answered well?

Course Overview(take 2)

Is it needed??

Web as a collection of information

• Web viewed as a large collection of __________
  – Text, structured data, semi-structured data
  – (Multimedia, updates, transactions, etc. ignored for now)

• So what do we want to do with it?
  – Search, directed browsing, aggregation, integration, pattern finding

• How do we do it?
  – Depends on your model (text / structured / semi-structured)

Structure helps querying: if there is structure, exploit it. If not, extract it (Information Extraction: mining; clustering; tagging).

Extracting structure

• There are broadly two approaches to extracting structure
  – Information Extraction (NLP-lite): analyze the text pages to look for formal data/knowledge tuples
    » Extraction is easier if the “text” was produced by translating tuples from a backend database into pidgin English; you then use “wrapper” software to unwrap it back to its tuple state. It is harder on non-templated text.
    » Extraction can be buggy (e.g. the L. Back, T. Stock, G. Forward example ;-), but could still help a great deal
      • Citeseer extracts citations from PDF files
      • CBioC, a project at ASU, attempts to extract gene interaction knowledge from biological abstracts
  – Semantic Web view: get webpage writers to annotate their pages with tags (e.g. The capital of <country>India</country> is <city>New Delhi</city>)
    » People may not want to tag, and the tags may differ across pages (e.g. someone might use <capital-city> instead of <city> as the tag)
    » Can intercede through page-maker software such as FrontPage

Structure

• How will search and querying on these three types of data differ?

[Figure: three kinds of data. A generic web page containing text (unstructured, [English]); a movie review (semi-structured, [XML]); an employee record (structured, [SQL]).]

Structure helps querying

• Expressive queries
  – Give me all pages that have the keywords “Get Rich Quick”
  – Give me the social security numbers of all the employees who have stayed with the company for more than 5 years, and whose yearly salaries are three standard deviations away from the average salary
  – Give me all mails from people from ASU written this year, which are relevant to “get rich quick”

• Efficient searching
  – Equality vs. “similarity”

Does the Web have structured data?

• Isn’t the web all text?
  – The invisible web
    • Most web servers have back-end database servers
    • They dynamically convert (wrap) the structured data into readable English
      – <India, New Delhi> => The capital of India is New Delhi.
      – So, if we can “unwrap” the text, we have structured data!
        » (un)wrappers, learning wrappers, etc.
      – Note also that such dynamic pages cannot be crawled...
  – The semi-structured web
    • Most pages are at least “semi”-structured
    • The XML standard is expected to ease the presentation/on-the-wire transfer of such pages. (BUT…)

Adapting old disciplines for the Web age

• Information (text) retrieval
  – Scale of the web
  – Hypertext / link structure
  – Authority/hub computations

• Databases
  – Multiple databases
    • Heterogeneous, access limited, partially overlapping
  – Network (un)reliability

• Data mining [Machine Learning / Statistics / Databases]
  – Learning patterns from large-scale data

How does one do clustering? (in response to a question)

• Two important issues
  – How to define distance measures
    • Distance measures capture what we think are similarities
      – Two orthogonal similarity/distance measures for scientific papers
        » Textual similarity measure (degree of common words in the papers)
        » Co-citation similarity measure (number of papers that refer to both papers)
    • How to define inter-cluster distance
      » Distance between the centroids of the clusters
      » Distance between the two nearest members belonging to each of the clusters
  – How to use the distance measures to find clusters such that
    • Intra-cluster distances are minimized
    • Inter-cluster distances are maximized
  – Many algorithms
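
The two inter-cluster distances named on this slide can be compared on a toy example. This is a minimal sketch, assuming points are plain coordinate tuples and the underlying point distance is Euclidean; the function names and the sample clusters are hypothetical.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def centroid_distance(c1, c2):
    """Inter-cluster distance as the distance between centroids."""
    return euclidean(centroid(c1), centroid(c2))

def nearest_member_distance(c1, c2):
    """Inter-cluster distance as the distance between the two closest members."""
    return min(euclidean(p, q) for p in c1 for q in c2)

# Hypothetical clusters, chosen only for illustration.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 1.0), (5.0, 1.0)]

print(centroid_distance(cluster_a, cluster_b))        # distance between (0, 1) and (4, 1)
print(nearest_member_distance(cluster_a, cluster_b))  # closest pair across the clusters
```

The two definitions can disagree about which clusters are closest, which is one reason different clustering algorithms give different results on the same data.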

Information Integration: Database Style Retrieval

• Traditional model (relational)
  – Given:
    • A single relational database
      – Schema
      – Instances
    • A relational (SQL) query
  – Return:
    • All tuples satisfying the query

• Evaluation
  – Soundness/completeness
  – Efficiency

• Web-induced headaches
  – Many databases
  – All are partially complete
  – Overlapping
  – Heterogeneous schemas
  – Access limitations
  – Network (un)reliability

• Consequently
  – Newer models of DB
  – Newer notions of completeness
  – Newer approaches for query planning

Learning Patterns (Web/DB mining)

• Traditional classification learning (supervised)
  – Given
    • A set of structured instances of a pattern (concept)
  – Induce the description of the pattern

• Evaluation:
  – Accuracy of classification on the test data
  – (Efficiency of learning)

• Mining headaches
  – Training data is not obvious
  – Training data is massive
  – Training instances are noisy and incomplete

• Consequently
  – Primary emphasis on fast classification
    • Even at the expense of accuracy
  – 80% of the work is “data cleaning”

Week by Week (from Spring 2004)

• Introduction (1/20)
• Text retrieval; vector space ranking
• Indexing/Retrieval (1/22)
• Correlation analysis; LSI (2/3; 2/5)
• Search engine technology (2/10; 2/12; 2/16)
• Page rank computation; Crawling; Anatomy of a search engine (2/19)
• Clustering (2/26)
• Collaborative and Content-based Filtering (3/4; 3/11); Classification Learning (NBC); Text classification (vector vs. unigram models of text); Spam mail classification (3/22; 3/25)
• A web-oriented review of Databases (given by Ullas Nambiar)
• XML
• Semantic web and its standards...
• Data/Information Integration
• Learning Sources Stats (BibFinder)
• The DB/IR intersection
• Final class

Outline of IR topics

• Background: definitions, etc.
• The Problem: 100,000+ pages
• The Solution: ranking docs; vector space
• Extensions: relevance feedback, clustering, query expansion, etc.

Information Retrieval: Traditional Model

• Given
  – A set of documents
  – A query expressed as a set of keywords
• Return
  – A ranked set of documents most relevant to the query
• Evaluation
  – Precision: fraction of returned documents that are relevant
  – Recall: fraction of relevant documents that are returned
  – Efficiency
• Web-induced headaches
  – Scale (billions of documents)
  – Hypertext (inter-document connections)
• Consequently
  – Ranking that takes link structure into account (Authority/Hub)
  – Indexing and retrieval algorithms that are ultra fast

What is Information Retrieval?

• Given a large repository of documents, how do I get at the ones that I want?
  – Examples: Lexis/Nexis, medical reports, AltaVista
  – Keyword based [can’t handle synonymy, polysemy]
• Different from databases
  – Unstructured (or semi-structured) data
  – Information is (typically) text
  – Requests are (typically) word-based

In principle, this requires NLP!
--NLP too hard as yet
--IR tries to get by with syntactic methods

What is IR (cont.)

• IR: representation, storage, organization of, and access to information items
• Focus is on the user information need
• User information need:
  – Find all docs containing information on college tennis teams which (1) are maintained by a USA university and (2) participate in the NCAA tournament.
• Emphasis is on the retrieval of information (not data)

Information vs. Data

• Data retrieval
  – Which docs contain a set of keywords?
  – Well-defined semantics
  – A single erroneous object implies failure!
    • A single missed object implies failure too..
• Information retrieval
  – Information about a subject or topic
  – Semantics is frequently loose
  – Small errors are tolerated
• IR system:
  – Interprets contents of information items
  – Generates a ranking which reflects relevance
  – Notion of relevance is most important

IR - Past and Present

• IR at the center of the stage
• IR in the last 20 years:
  – Classification and categorization
  – Systems and languages
  – User interfaces and visualization
• The Web has renewed focus on IR
  – Universal repository of knowledge
  – Free (low cost) universal access
  – No central editorial board
  – Many problems though: IR seen as key to finding the solutions!

Classic IR Models - Basic Concepts

• Each document is represented by a set of representative keywords or index terms
• Query is seen as a “mini” document
• An index term is a document word useful for remembering the document’s main themes
• Usually, index terms are nouns, because nouns have meaning by themselves
  – [However, search engines assume that all words are index terms (full text representation)]

[Figure: docs and the information need are each reduced to index terms (doc, query); the two are matched to produce a ranking.]

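Measuring Performance

• Precision
  – Proportion of selected items that are correct: $\text{Precision} = \frac{tp}{tp + fp}$
• Recall
  – Proportion of target items that were selected: $\text{Recall} = \frac{tp}{tp + fn}$
• Precision-Recall curve
  – Shows tradeoff

[Figure: Venn diagram of the documents the system returned (tp + fp) and the actual relevant docs (tp + fn); tn lies outside both sets.]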
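
The two definitions above reduce to set arithmetic on document identifiers. The sketch below is illustrative only; the function name and the sample document ids are hypothetical.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned docs and a set of relevant docs."""
    returned, relevant = set(returned), set(relevant)
    tp = len(returned & relevant)   # returned and relevant
    fp = len(returned - relevant)   # returned but not relevant
    fn = len(relevant - returned)   # relevant but missed
    precision = tp / (tp + fp) if returned else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 docs returned, 3 docs actually relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(p, r)
```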

Why don’t we use precision/recall measurements for databases?

1.0 precision ~ Soundness ~ nothing but the truth
1.0 recall ~ Completeness ~ whole truth

Analogy: swearing in witnesses in courts

Why can’t search engines have 100% precision and 100% recall?

• Because relevance is in the eye of the beholder…
  – I think that a page pointing to the culture of Kalahari Bushmen is highly relevant to my query “bush”
  – The campus Republicans might find that it is a lousy answer..

Precision/Recall Curves

An 11-point recall-precision curve plots precision at recall levels 0, 0.1, 0.2, 0.3, …, 1.0.

Example: Suppose that, for a given query, 10 documents are relevant. Suppose that when all documents are ranked in descending order of similarity, we have

d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31 …

[Figure: recall-precision plot; x-axis recall (0.1, 0.3, …, 1.0), y-axis precision.]

0.2 recall happens at the third doc; here the precision is 2/3 = 0.66. 0.3 recall happens at the 6th doc; here the precision is 3/6 = 0.5.
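
The points of such a curve can be read off a ranked list by recording precision each time another relevant document is reached. This is a sketch under an assumption: the slide does not say which of d1…d31 are relevant, so the `relevant` set below is hypothetical, chosen only so that the second and third relevant documents fall at ranks 3 and 6 as in the example.

```python
def recall_precision_points(ranking, relevant):
    """Return (recall, precision) pairs, one per relevant document found in the ranking."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

ranking = [f"d{i}" for i in range(1, 32)]
# Hypothetical relevance judgments (10 relevant docs), not taken from the slide.
relevant = {"d1", "d3", "d6", "d10", "d15", "d20", "d24", "d27", "d29", "d31"}

points = recall_precision_points(ranking, relevant)
for recall, precision in points:
    print(f"recall {recall:.1f}  precision {precision:.2f}")
```

Precision at recall 0 is not produced by this loop; it is usually filled in by interpolation.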

Precision Recall Curves…

When evaluating the retrieval effectiveness of a text retrieval system or method, a large number of queries are used and their average 11-point recall-precision curve is plotted.

• Methods 1 and 2 are better than method 3.
• Method 1 is better than method 2 for high recalls.

[Figure: average recall-precision curves for Method 1, Method 2 and Method 3; x-axis recall, y-axis precision.]

Combining precision and recall into a single measure

• We can consider a weighted summation of precision and recall into a single quantity
• What is the best way to combine?
  – Arithmetic mean?
  – Geometric mean?
  – Harmonic mean?

$$f = \frac{2}{\frac{1}{p} + \frac{1}{r}} = \frac{2pr}{p + r} \qquad f_\beta = \frac{(1 + \beta^2)\,pr}{\beta^2 p + r}$$

F-measure (aka F1-measure): the harmonic mean of precision and recall

If you travel at 40 mph on the way out and 60 mph on the return, what is your average speed?
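
The travel puzzle has the same shape as the F-measure: the two legs cover equal distances, so the average speed is the harmonic mean of 40 and 60, which is 48 mph rather than 50. A small sketch follows; the function names are hypothetical, and `beta` follows the usual weighted F-measure convention, with `beta=1` giving the plain harmonic mean.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers."""
    return 2 * a * b / (a + b)

def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(harmonic_mean(40, 60))   # average speed for the round trip
print(f_measure(0.9, 0.1))     # far below the arithmetic mean of 0.5
```

Unlike the arithmetic mean, the harmonic mean stays low unless both precision and recall are reasonably high.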

Sophie’s choice: Web version

• If you can have either precision or recall but not both, which would you rather keep?
  – If you are a medical doctor trying to find the right paper on a disease?
  – If you are Joe Schmoe surfing on the web?

8/29

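The Retrieval Process

[Figure: the retrieval process. The user need enters through the User Interface and passes through Text Operations (stemming, noun phrase detection, etc.) to give a logical view. Query Operations (elaboration, relevance feedback) turn the logical view into a query; Searching (hash tables, etc.) runs the query against the Index (an inverted file) to get retrieved docs; Ranking (vector models, etc.) turns them into ranked docs, and user feedback flows back into Query Operations. On the document side, the DB Manager Module supplies text from the Text Database to Indexing, which builds the Index.]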

A quick glimpse at inverted files

Sorted term list:

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2

Dictionary and postings:

Term      N docs  Tot Freq  Postings (Doc #, Freq)
a         1       1         (2, 1)
aid       1       1         (1, 1)
all       1       1         (1, 1)
and       1       1         (2, 1)
come      1       1         (1, 1)
country   2       2         (1, 1) (2, 1)
dark      1       1         (2, 1)
for       1       1         (1, 1)
good      1       1         (1, 1)
in        1       1         (2, 1)
is        1       1         (1, 1)
it        1       1         (2, 1)
manor     1       1         (2, 1)
men       1       1         (1, 1)
midnight  1       1         (2, 1)
night     1       1         (2, 1)
now       1       1         (1, 1)
of        1       1         (1, 1)
past      1       1         (2, 1)
stormy    1       1         (2, 1)
the       2       4         (1, 2) (2, 2)
their     1       1         (1, 1)
time      2       2         (1, 1) (2, 1)
to        1       2         (1, 2)
was       1       2         (2, 2)
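
A dictionary-plus-postings structure like the one on this slide can be built in a few lines. This is a minimal sketch, not a production index. The two document texts are an assumption: the slide shows only the term counts, and these sentences were chosen because they reproduce those counts exactly.

```python
from collections import Counter, defaultdict

# Assumed document texts, consistent with the term/frequency table on the slide.
docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

def build_inverted_index(docs):
    """Map each term to its postings: {doc_id: frequency of the term in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return index

index = build_inverted_index(docs)

# Print the dictionary view: term, number of docs, total frequency, postings.
for term in sorted(index):
    postings = index[term]
    print(term, len(postings), sum(postings.values()), sorted(postings.items()))
```

Looking up a query term is then a single dictionary access, which is what makes retrieval fast regardless of how many documents do not contain the term.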

Generating keywords (index terms) in traditional IR

[Figure: docs → structure recognition → accents, spacing → stopwords → noun groups → stemming → manual indexing; the outputs are structure, full text and index terms.]

• Stop-word elimination
• Noun phrase detection
  – “data structure”, “computer architecture”
• Stemming (Porter Stemmer for English)
  – If the suffix of a word is “IZATION” and the prefix contains at least one vowel followed by a consonant, then replace the suffix with “IZE” (e.g. Binarization → Binarize)

• Generating index terms
• Improving quality of terms (e.g. synonyms, co-occurrence detection, latent semantic indexing..)

Example of Stemming and Stopword Elimination

The number of Web pages on the World Wide Web was estimated to be over 800 million in 1999.

[Figure: the sentence above after stop word elimination and after stemming.]
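
As a rough illustration of these two text operations, the sketch below removes stop words and applies only the single IZATION → IZE rule quoted on the slide. It is not the Porter stemmer: the real algorithm has many more rules and treats “y” specially. The stop-word list is a hypothetical one picked for this sentence.

```python
import re

# Hypothetical stop-word list, just large enough for the example sentence.
STOPWORDS = {"the", "of", "on", "was", "to", "be", "in", "over"}
VOWELS = set("aeiou")

def has_vowel_then_consonant(stem):
    """True if some vowel in the stem is immediately followed by a consonant."""
    return any(a in VOWELS and b.isalpha() and b not in VOWELS
               for a, b in zip(stem, stem[1:]))

def stem_ization(word):
    """Apply only the IZATION -> IZE rule; every other word is returned unchanged."""
    suffix = "ization"
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if has_vowel_then_consonant(stem):
            return stem + "ize"
    return word

def index_terms(text):
    """Lowercase, tokenize, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem_ization(t) for t in tokens if t not in STOPWORDS]

sentence = ("The number of Web pages on the World Wide Web was "
            "estimated to be over 800 million in 1999.")
print(index_terms(sentence))
print(stem_ization("binarization"))
```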

So does Google use stemming? All kinds of stemming?

Stopword elimination? Any non-obvious stop-words?

Why don’t search engines do much text-ops?

• User population is too large and is easily impressed with reasonably relevant answers
  – We are not talking of medical doctors looking for the most relevant paper describing the cure for the symptoms of their patient
  – A search engine can do well even if all the doctors give it low marks
  – Corollary: all of these text-ops may well be relevant for “vertical” (topic-specific) search engines
• Some of the text-ops were put in place as a way of dealing with computational limitations
  – E.g. indexing in terms of only a few keywords
  – These are not as relevant in the era of current-day computers…

Ranking

• A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query
• A ranking is based on fundamental premises regarding the notion of relevance, such as:
  – common sets of index terms
  – sharing of weighted terms
  – likelihood of relevance
• Each set of premises leads to a distinct IR model

The biggie

Difficulties in designing ranking methods

• We want a ranking algorithm that captures the user’s relevance metric
• Only, the user’s relevance metric is not fully captured by the short keyword query
  – Worse when the query has a 10-word limit (as in most search engines)
• So, we hypothesize what might be underlying the user’s relevance judgment
  – Similarity of words
  – Similarity of co-citation
  – Popularity of the document
• ..and hope that our hypotheses are good

We dance round in a ring and suppose,
But the Secret sits in the middle and knows.
-- Robert Frost

IR Models

[Figure: taxonomy of IR models by user task.]

• User task: Retrieval (ad hoc, filtering)
  – Classic models: Boolean, vector, probabilistic
    • Set theoretic: fuzzy, extended Boolean
    • Algebraic: generalized vector, latent semantic indexing, neural networks
    • Probabilistic: inference network, belief network
  – Structured models: non-overlapping lists, proximal nodes
• User task: Browsing
  – Flat, structure guided, hypertext

Digression: Similarity vs. Duplicate detection

• Duplicate detection (as used in plagiarism detection) is different from similarity computation
  – Highly similar documents may not necessarily be plagiarized versions of each other
• Often, duplicate detection may require comparing documents at the level of “shingles”
  – A shingle is a contiguous chunk of text
    • A plagiarized document may have many of the shingles of the original document, but rearranged
    • See http://www-db.stanford.edu/~shiva/Pubs/DlMag/dlmag.html
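
One way to make the shingle idea concrete is to take every run of k consecutive words as a shingle and measure how many of the original’s shingles reappear in the suspect document. The choice of word-level shingles with k = 3, the overlap measure and the sample sentences are all assumptions for illustration; the slide only says that a shingle is a contiguous chunk of text.

```python
def shingles(text, k=3):
    """Set of all runs of k consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def shingle_overlap(original, suspect, k=3):
    """Fraction of the original's shingles that also occur in the suspect text."""
    a, b = shingles(original, k), shingles(suspect, k)
    return len(a & b) / len(a) if a else 0.0

# Hypothetical texts: the second is the first with its two halves swapped.
original = "the quick brown fox jumps over the lazy dog near the river bank"
rearranged = "near the river bank the quick brown fox jumps over the lazy dog"

print(shingle_overlap(original, rearranged))
```

Because the shingles are compared as a set, moving whole passages around leaves most of them intact, which is the property the slide points to.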

The Boolean Model

• Simple model based on set theory
  – Documents as sets of keywords
• Queries specified as Boolean expressions
  – $q = k_a \wedge (k_b \vee \neg k_c)$
  – Precise semantics
• Terms are either present or absent. Thus, $w_{ij} \in \{0, 1\}$
• Consider
  – $q = k_a \wedge (k_b \vee \neg k_c)$
  – $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$
  – $\vec{q}_{cc} = (1,1,0)$ is a conjunctive component

AI folks: this is DNF, as against the CNF which you used in 471.

$$sim(q, d_j) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}$$

[Figure: Venn diagram of the sets Ka, Kb and Kc with the regions (1,0,0), (1,1,0) and (1,1,1) marked.]
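
The definition of sim(q, dj) can be checked mechanically for the example query: enumerate every 0/1 assignment to (ka, kb, kc), keep those that satisfy the query, and test whether a document’s term vector is one of them. This is a sketch with hypothetical helper names; it hard-codes the query ka AND (kb OR NOT kc), which is the expression whose DNF is (1,1,1), (1,1,0), (1,0,0).

```python
from itertools import product

def query(ka, kb, kc):
    """The example query: ka AND (kb OR NOT kc)."""
    return bool(ka and (kb or not kc))

# Conjunctive components of the query's disjunctive normal form.
QDNF = {bits for bits in product((1, 0), repeat=3) if query(*bits)}

def sim(doc_terms):
    """1 if the document's (ka, kb, kc) pattern matches some conjunctive component, else 0."""
    vec = tuple(int(k in doc_terms) for k in ("ka", "kb", "kc"))
    return 1 if vec in QDNF else 0

print(sorted(QDNF, reverse=True))
print(sim({"ka", "kb"}), sim({"ka", "kc"}), sim({"kb", "kc"}))
```

The result is always 0 or 1, which is exactly the “no partial matching, no ranking” limitation of the model.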

Drawbacks of the Boolean Model

• Retrieval is based on binary decision criteria with no notion of partial matching
• No ranking of the documents is provided (absence of a grading scale)
• The information need has to be translated into a Boolean expression, which most users find awkward
• The Boolean queries formulated by users are most often too simplistic
• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
  – Keyword (vector model) is not necessarily better; it just annoys the users somewhat less

Documents as bags of words

a: System and human system engineering testing of EPS
b: A survey of user opinion of computer system response time
c: The EPS user interface management system
d: Human machine interface for ABC computer applications
e: Relation of user perceived response time to error measurement
f: The generation of random, binary, ordered trees
g: The intersection graph of paths in trees
h: Graph minors IV: Widths of trees and well-quasi-ordering
i: Graph minors: A survey

           a  b  c  d  e  f  g  h  i
Interface  0  0  1  1  0  0  0  0  0
User       0  1  1  0  1  0  0  0  0
System     2  1  1  0  0  0  0  0  0
Human      1  0  0  1  0  0  0  0  0
Computer   0  1  0  1  0  0  0  0  0
Response   0  1  0  0  1  0  0  0  0
Time       0  1  0  0  1  0  0  0  0
EPS        1  0  1  0  0  0  0  0  0
Survey     0  1  0  0  0  0  0  0  1
Trees      0  0  0  0  0  1  1  1  0
Graph      0  0  0  0  0  0  1  1  1
Minors     0  0  0  0  0  0  0  1  1

Documents as bags of keywords (another example)

[Figure: documents represented over the terms t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear.]

Jaccard Similarity Metric

• Although the vector similarity measure is widely used, another similarity measure with useful properties is the Jaccard similarity metric
  – Estimates the degree of overlap between sets (or bags)
• For bags, intersection and union are defined in terms of min and max, respectively
  – If A has 5 oranges and 8 apples and B has 3 oranges and 12 apples
    • A .intersection. B is 3 oranges and 8 apples
    • A .union. B is 5 oranges and 12 apples
    • Jaccard similarity is (3 + 8)/(5 + 12) = 11/17

[Figure: documents as bags of keywords (another example), over the terms t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear.]

Similarity(d1, d2) = (24 + 10 + 5)/(32 + 21 + 9 + 3 + 3) = 39/68 = 0.57

What about d1 and d1d1 (which is a twice-concatenated version of d1)?
--Need to normalize the bags (e.g. divide coeffs by bag size)
--Can also better differentiate the coeffs (tf/idf metrics)
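
Python’s `collections.Counter` already defines `&` as element-wise minimum and `|` as element-wise maximum, so the bag version of Jaccard similarity is short to write. The sketch below reproduces the oranges-and-apples example and also shows the d1 versus d1d1 problem: without normalization, a document and its doubled copy score only 0.5. The function name is hypothetical.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity of two bags: sum of element-wise mins over sum of element-wise maxes."""
    a, b = Counter(a), Counter(b)
    union = sum((a | b).values())          # | takes the max count per item
    return sum((a & b).values()) / union if union else 0.0   # & takes the min count

A = Counter(oranges=5, apples=8)
B = Counter(oranges=3, apples=12)
print(jaccard(A, B))       # (3 + 8) / (5 + 12)

# A bag compared with its doubled copy (the d1 vs. d1d1 case).
print(jaccard(A, A + A))
```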

8/31

The Vector Model

• Document/query bags are seen as vectors over keyword space
  – $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$, $\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$
    • $w_{iq} \ge 0$ is associated with the pair $(k_i, q)$
    • $w_{ij} > 0$ whenever $k_i \in d_j$
• To each term $k_i$ is associated a unitary vector $\vec{i}$
• The unitary vectors $\vec{i}$ and $\vec{j}$ are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
  – Is this reasonable??????
• The t unitary vectors $\vec{i}$ form an orthonormal basis for a t-dimensional space

Each vector holds a place for every term in the collection. Therefore, most vectors are sparse.

Similarity Function

The similarity or closeness of a document d = (w1, …, wi, …, wn) with respect to a query (or another document) q = (q1, …, qi, …, qn) is computed using a similarity (distance) function.

Many similarity functions exist: Euclidean distance, dot product, normalized dot product (cosine-theta).

Euclidean distance

Given two document vectors d1 and d2:

$$Dist(d_1, d_2) = \sqrt{\sum_i (w_{i1} - w_{i2})^2}$$

Dot product distance

sim(q, d) = dot(q, d) = q1 * w1 + … + qn * wn

Example: Suppose d = (0.2, 0, 0.3, 1) and q = (0.75, 0.75, 0, 1); then

sim(q, d) = 0.15 + 0 + 0 + 1 = 1.15

Observations on the dot product function:

• Documents having more terms in common with a query tend to have higher similarities with the query.
• For terms that appear in both q and d, those with higher weights contribute more to sim(q, d) than those with lower weights.
• It favors long documents over short documents.
• The computed similarities have no clear upper bound.
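
The example and the “favors long documents” observation can both be checked directly. In this sketch the doubled vector stands in for a document concatenated with itself, which is an illustrative assumption (raw term counts double when a text is repeated).

```python
def dot(q, d):
    """Dot-product similarity of a query vector and a document vector."""
    return sum(qi * wi for qi, wi in zip(q, d))

d = (0.2, 0, 0.3, 1)
q = (0.75, 0.75, 0, 1)

print(dot(q, d))                 # 0.15 + 0 + 0 + 1

# Doubling every weight, as if the document were repeated, doubles the score.
d_twice = tuple(2 * w for w in d)
print(dot(q, d_twice))
```

The score grows with the document’s length even though its content is unchanged, and nothing bounds it from above.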

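A normalized similarity metric

$$sim(q, d_j) = \cos(\theta) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j|\,|\vec{q}|} = \frac{\sum_i w_{ij}\, w_{iq}}{|\vec{d}_j|\,|\vec{q}|}$$

• Since $w_{ij} \ge 0$ and $w_{iq} \ge 0$, $0 \le sim(q, d_j) \le 1$
• A document is retrieved even if it matches the query terms only partially

[Figure: the vectors $\vec{d}_j$ and $\vec{q}$ with the angle $\theta$ between them; in general $\cos(\theta_{AB}) = \frac{A \cdot B}{|A|\,|B|}$.]

           a  b  c
Interface  0  0  1
User       0  1  1
System     2  1  1

[Figure: documents a, b and c plotted on the axes interface, user and system.]

[Figure: comparison of Euclidean and cosine distance metrics over the terms t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear. Whiter => more similar.]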
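
Cosine similarity for the three small documents a, b and c (over the terms interface, user, system) can be computed as below. This is a minimal sketch; returning 0 for an all-zero vector is a convention assumed here, since the formula is undefined in that case.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Columns of the small term-document table: (interface, user, system).
docs = {"a": (0, 0, 2), "b": (0, 1, 1), "c": (1, 1, 1)}

for x, y in [("a", "b"), ("a", "c"), ("b", "c")]:
    print(x, y, round(cosine(docs[x], docs[y]), 3))

# Scaling a vector does not change its direction, so the similarity stays 1.
print(cosine(docs["a"], (0, 0, 4)))
```

Normalizing by the vector lengths removes the bias toward long documents that the plain dot product has.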

Answering Queries

• Represent the query as a vector
• Compute distances to all documents
• Rank according to distance
• Example: “database index”
  – Given Q = {database, index} = {1,0,1,0,0,0}

[Figure: document vectors over the terms t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear.]

Term Weights in the Vector Model

• $sim(q, d_j) = \frac{\sum_i w_{ij}\, w_{iq}}{|\vec{d}_j|\,|\vec{q}|}$. How do we compute the weights $w_{ij}$ and $w_{iq}$?
• Simple keyword frequencies tend to favor common words
  – E.g. query: The Computer Tomography
• A good weight must take into account two effects:
  – Quantification of intra-document contents (similarity)
    • tf factor, the term frequency within a document
  – Quantification of inter-document separation (dissimilarity)
    • idf factor, the inverse document frequency
  – $w_{ij} = tf(i,j) \cdot idf(i)$

TF-IDF

• Let
  – N be the total number of docs in the collection
  – $n_i$ be the number of docs which contain $k_i$
  – freq(i,j) be the raw frequency of $k_i$ within $d_j$
• A normalized tf factor is given by
  – $f(i,j) = freq(i,j) / \max_l freq(l,j)$
  – where the maximum is computed over all terms which occur within the document $d_j$
• The idf factor is computed as
  – $idf(i) = \log(N / n_i)$
  – The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term $k_i$.

Document/Query Representation using TF-IDF

• The best term-weighting schemes use weights which are given by
  – $w_{ij} = f(i,j) \cdot \log(N / n_i)$
  – This strategy is called a tf-idf weighting scheme
• For the query term weights, there are several possibilities:
  – $w_{iq} = (0.5 + 0.5 \cdot freq(i,q) / \max_l freq(l,q)) \cdot \log(N / n_i)$
  – Alternatively, just use the IDF weights (to give preference to rare words)
  – Let the user give the weights to the keywords to reflect her *real* preferences
    • Easier said than done... Users are often dunderheads..
    • Help them with “relevance feedback” techniques.
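
The document weights w_ij = f(i,j) * log(N/n_i) can be computed as in the sketch below. Several things here are assumptions rather than part of the slides: the three toy documents, whitespace tokenization, and the natural logarithm (the slides do not fix the base of the log).

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with w = (freq / max freq) * log(N / n_i)."""
    n_docs = len(docs)
    counts = [Counter(doc.lower().split()) for doc in docs]
    # n_i: number of documents that contain each term.
    doc_freq = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        if not c:                       # empty document: no terms, no weights
            weights.append({})
            continue
        max_freq = max(c.values())      # most frequent term in this document
        weights.append({
            term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
            for term, freq in c.items()
        })
    return weights

# Hypothetical toy collection over some of the terms used in the figures.
docs = [
    "database index database sql",
    "regression likelihood linear regression",
    "database regression",
]

weights = tf_idf(docs)
for w in weights:
    print({term: round(value, 3) for term, value in w.items()})
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the ranking, which is the intended effect for very common words.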

[Figure: document vectors over the terms t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear.]

Given Q = {database, index} = {1,0,1,0,0,0}

Note: In this case, the weights used in the query were 1 for t1 and t3, and 0 for the rest.

The Vector Model: Summary

• The vector model with tf-idf weights is a good ranking strategy with general collections
• The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
• Advantages:
  – Term-weighting improves the quality of the answer set
  – Partial matching allows retrieval of docs that approximate the query conditions
  – The cosine ranking formula sorts documents according to degree of similarity to the query
• Disadvantages:
  – Assumes independence of index terms
  – Does not handle synonymy/polysemy
  – Query weighting may not reflect user relevance criteria.

What is missing?

Reasons that ideal effectiveness is hard to achieve:

1. The similarity function used may not be good enough.
2. The importance/weight of a term in representing a document and query may be inaccurate.
3. Document representation loses information.
4. Users’ inability to describe queries precisely.
5. The same term may have multiple meanings, and different terms may have similar meanings.

[Figure callouts: query expansion; relevance feedback; LSI; co-occurrence analysis.]

Some improvements

• Query expansion techniques (for 1)
  – Relevance feedback
  – Co-occurrence analysis (local and global thesauri)
• Improving the quality of terms [(2), (3) and (5)]
  – Latent Semantic Indexing
  – Phrase detection

Relevance Feedback for Vector Model

In the “ideal” case where we know the relevant documents a priori:

$$Q_{opt} = \frac{1}{|C_r|} \sum_{d_j \in C_r} d_j \;-\; \frac{1}{N - |C_r|} \sum_{d_j \notin C_r} d_j$$

Cr = set of documents that are truly relevant to Q
N = total number of documents

Rocchio Method

$$Q_1 = \alpha\, Q_0 + \frac{\beta}{|D_r|} \sum_{d_j \in D_r} d_j \;-\; \frac{\gamma}{|D_n|} \sum_{d_j \in D_n} d_j$$

• Q0 is the initial query. Q1 is the query after one iteration
• Dr is the set of relevant docs
• Dn is the set of irrelevant docs
• Alpha = 1, Beta = 0.75, Gamma = 0.25 typically.

Other variations are possible, but performance is similar.

Rocchio/Vector Illustration

[Figure: Q0, D1, D2, Q’ and Q” plotted on the axes Retrieval and Information, each running from 0 to 1.0.]

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)

Q’ = ½ * Q0 + ½ * D1 = (0.45, 0.55)
Q” = ½ * Q0 + ½ * D2 = (0.80, 0.20)

Example Rocchio Calculation

Relevant docs:
R1 = (.030, .00, .00, .025, .025, .050, .00, .00, .120)
R2 = (.020, .009, .020, .002, .050, .025, .100, .100, .120)

Non-rel doc:
S1 = (.030, .010, .020, .00, .005, .025, .00, .020, .00)

Original query:
Q = (.00, .00, .00, .00, .500, .00, .450, .00, .950)

Constants:
α = 1, β = 0.75, γ = 0.25

Rocchio calculation:

$$Q_{new} = \alpha\, Q + \frac{\beta}{2}(R_1 + R_2) - \frac{\gamma}{1} S_1$$

Resulting feedback query:
Q_new = (0.011, 0.000875, 0.002, 0.01, 0.527, 0.022, 0.488, 0.033, 1.04)
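
The calculation on this slide can be replayed in code. The sketch below assumes the vectors read as R1 = (.030, .00, .00, .025, .025, .050, .00, .00, .120), R2 = (.020, .009, .020, .002, .050, .025, .100, .100, .120), S1 = (.030, .010, .020, .00, .005, .025, .00, .020, .00) and Q = (.00, .00, .00, .00, .500, .00, .450, .00, .950), with α = 1, β = 0.75, γ = 0.25. The function name is hypothetical.

```python
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """One Rocchio step: alpha*Q0 + beta*mean(relevant) - gamma*mean(nonrelevant)."""
    dim = len(q0)

    def mean(vectors):
        # Average of a list of vectors; a zero vector if the list is empty.
        if not vectors:
            return [0.0] * dim
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    rel, non = mean(relevant), mean(nonrelevant)
    return [alpha * q + beta * r - gamma * s for q, r, s in zip(q0, rel, non)]

R1 = (.030, .00, .00, .025, .025, .050, .00, .00, .120)
R2 = (.020, .009, .020, .002, .050, .025, .100, .100, .120)
S1 = (.030, .010, .020, .00, .005, .025, .00, .020, .00)
Q = (.00, .00, .00, .00, .500, .00, .450, .00, .950)

q_new = rocchio(Q, [R1, R2], [S1])
print([round(x, 6) for x in q_new])
```

Terms that were absent from the original query, such as the first component, pick up small positive weights from the relevant documents; a component can also go negative if it is prominent only in non-relevant documents, which is why negative terms need care.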

Rocchio Method

• Rocchio automatically
  – re-weights terms
  – adds in new terms (from relevant docs)
    • have to be careful when using negative terms
• Rocchio is not a machine learning algorithm
• Most methods perform similarly
  – results heavily dependent on test collection
• Machine learning methods are proving to work better than standard IR approaches like Rocchio

Using Relevance Feedback

• Known to improve results
  – in TREC-like conditions (no user involved)
• What about with a user in the loop?
  – How might you measure this?
  – Precision/recall figures for the unseen documents need to be computed