
Introduction to Information Retrieval

Probabilistic Information Retrieval
Christopher Manning and Pandu Nayak


FROM BOOLEAN TO RANKED RETRIEVAL … IN TWO STEPS


Ranked retrieval
§ Thus far, our queries have all been Boolean
  § Documents either match or don't
  § Can be good for expert users with a precise understanding of their needs and the collection
  § Can also be good for applications: applications can easily consume 1000s of results
§ Not good for the majority of users
  § Most users are incapable of writing Boolean queries
  § Or they are, but they think it's too much work
  § Most users don't want to wade through 1000s of results
  § This is particularly true of web search

Ch. 6

Problem with Boolean search: feast or famine
§ Boolean queries often result in either too few (= 0) or too many (1000s) results.
  § Query 1: "standard user dlink 650" → 200,000 hits
  § Query 2: "standard user dlink 650 no card found" → 0 hits
§ It takes a lot of skill to come up with a query that produces a manageable number of hits.
  § AND gives too few; OR gives too many

Ch. 6


Who are these people?

Stephen Robertson, Keith van Rijsbergen, Karen Spärck Jones


Why probabilities in IR?

[Diagram: the user's Information Need yields a Query Representation; the Documents yield a Document Representation; the system must match the two. Our understanding of the user's need is uncertain, as is our guess of whether a document has relevant content. How to match?]

In traditional IR systems, matching between each document and query is attempted in a semantically imprecise space of index terms.

Probabilities provide a principled foundation for uncertain reasoning. Can we use probabilities to quantify our uncertainties?


Probabilistic IR topics

1. Classical probabilistic retrieval model
   § Probability ranking principle, etc.
   § Binary independence model (≈ Naïve Bayes text categorization)
   § (Okapi) BM25
2. Bayesian networks for text retrieval
3. Language model approach to IR
   § An important emphasis in recent work

Probabilistic methods are one of the oldest but also one of the currently hot topics in IR
§ Traditionally: neat ideas, but didn't win on performance
§ It seems to be different now


The document ranking problem
§ We have a collection of documents
§ User issues a query
§ A list of documents needs to be returned
§ Ranking method is the core of modern IR systems:
  § In what order do we present documents to the user?
  § We want the "best" document to be first, second best second, etc. …
§ Idea: Rank by probability of relevance of the document w.r.t. the information need
  § $P(R = 1 \mid \text{document}_i, \text{query})$

Recall a few probability basics

§ For events A and B:

$p(A, B) = p(A \cap B) = p(A \mid B)\,p(B) = p(B \mid A)\,p(A)$

§ Bayes' Rule:

$p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)} = \frac{p(B \mid A)\,p(A)}{\sum_{X \in \{A, \overline{A}\}} p(B \mid X)\,p(X)}$

where $p(A)$ is the prior and $p(A \mid B)$ is the posterior.

§ Odds:

$O(A) = \frac{p(A)}{p(\overline{A})} = \frac{p(A)}{1 - p(A)}$
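A quick numeric check of these identities (a toy sketch in Python; the numbers are made up):

```python
# Hedged illustration of Bayes' Rule and odds: only the algebra matters here.
p_A = 0.3              # prior p(A)
p_B_given_A = 0.8      # likelihood p(B|A)
p_B_given_notA = 0.2   # p(B|~A)

# Total probability: p(B) = p(B|A)p(A) + p(B|~A)p(~A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' Rule: posterior p(A|B)
p_A_given_B = p_B_given_A * p_A / p_B

# Odds: O(A) = p(A) / (1 - p(A))
odds_A = p_A / (1 - p_A)

print(p_A_given_B)  # ~0.632
print(odds_A)       # ~0.429
```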


The Probability Ranking Principle

"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data."

§ [1960s/1970s] S. Robertson, W. S. Cooper, M. E. Maron; van Rijsbergen (1979:113); Manning & Schütze (1999:538)


Probability Ranking Principle

Let x represent a document in the collection. Let R represent relevance of a document w.r.t. a given (fixed) query, with R = 1 for relevant and R = 0 for not relevant.

Need to find p(R = 1 | x) – the probability that a document x is relevant.

$p(R=1 \mid x) = \frac{p(x \mid R=1)\,p(R=1)}{p(x)}$

$p(R=0 \mid x) = \frac{p(x \mid R=0)\,p(R=0)}{p(x)}$

§ p(R=1), p(R=0): prior probability of retrieving a relevant or non-relevant document
§ p(x | R=1), p(x | R=0): probability that if a relevant (non-relevant) document is retrieved, it is x

$p(R=0 \mid x) + p(R=1 \mid x) = 1$


Probability Ranking Principle (PRP)
§ Simple case: no selection costs or other utility concerns that would differentially weight errors
§ PRP in action: rank all documents by p(R = 1 | x)
§ Theorem: Using the PRP is optimal, in that it minimizes the loss (Bayes risk) under 1/0 loss
  § Provable if all probabilities are correct, etc. [e.g., Ripley 1996]
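A minimal sketch of the PRP in action, assuming the probabilities p(R=1|x) have already been estimated somehow (document IDs and values are hypothetical):

```python
# Given estimated probabilities of relevance (however obtained),
# the PRP says: rank documents by decreasing p(R=1|x) -- just a sort.
estimated_p_relevant = {"d1": 0.20, "d2": 0.85, "d3": 0.55}

ranking = sorted(estimated_p_relevant, key=estimated_p_relevant.get, reverse=True)
print(ranking)  # ['d2', 'd3', 'd1']
```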


Probability Ranking Principle

§ More complex case: retrieval costs.
  § Let d be a document
  § C – cost of not retrieving a relevant document
  § C′ – cost of retrieving a non-relevant document
§ Probability Ranking Principle: if

$C' \cdot p(R=0 \mid d) - C \cdot p(R=1 \mid d) \;\le\; C' \cdot p(R=0 \mid d') - C \cdot p(R=1 \mid d')$

for all d′ not yet retrieved, then d is the next document to be retrieved
§ We won't further consider cost/utility from now on


Probability Ranking Principle
§ How do we compute all those probabilities?
  § We do not know the exact probabilities; we have to use estimates
  § The Binary Independence Model (BIM) – which we discuss next – is the simplest model
§ Questionable assumptions
  § "Relevance" of each document is independent of the relevance of other documents.
    § Really, it's bad to keep on returning duplicates
  § Boolean model of relevance
  § That one has a single-step information need
    § Seeing a range of results might let the user refine the query


Probabilistic Retrieval Strategy

§ Estimate how terms contribute to relevance
  § How do other things like term frequency and document length influence your judgments about document relevance?
    § Not at all in BIM
    § A more nuanced answer is the Okapi (BM25) formulae [next time]
  § Spärck Jones / Robertson
§ Combine to find document relevance probability
§ Order documents by decreasing probability


Probabilistic Ranking

Basic concept: "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents.

By making assumptions about the distribution of terms and applying Bayes' Theorem, it is possible to derive weights theoretically."

– van Rijsbergen


Binary Independence Model
§ Traditionally used in conjunction with the PRP
§ "Binary" = Boolean: documents are represented as binary incidence vectors of terms (cf. IIR Chapter 1):
  § $\vec{x} = (x_1, \ldots, x_n)$
  § $x_i = 1$ iff term i is present in document x.
§ "Independence": terms occur in documents independently
§ Different documents can be modeled as the same vector
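A small sketch of the binary incidence representation (the vocabulary and document are made up; note that counts are discarded):

```python
# Represent a document as a binary term incidence vector over a fixed vocabulary.
vocabulary = ["antony", "brutus", "caesar", "calpurnia"]

def incidence_vector(doc_tokens, vocabulary):
    """x_i = 1 iff term i is present in the document (term counts are ignored)."""
    present = set(doc_tokens)
    return [1 if term in present else 0 for term in vocabulary]

doc = "antony and caesar spoke to caesar".split()
print(incidence_vector(doc, vocabulary))  # [1, 0, 1, 0]
```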


Binary Independence Model
§ Queries: binary term incidence vectors
§ Given query q,
  § for each document d we need to compute p(R | q, d).
  § replace with computing p(R | q, x), where x is the binary term incidence vector representing d.
§ Interested only in ranking
§ Will use odds and Bayes' Rule:

$O(R \mid q, \vec{x}) = \frac{p(R=1 \mid q, \vec{x})}{p(R=0 \mid q, \vec{x})} = \frac{\;\frac{p(R=1 \mid q)\,p(\vec{x} \mid R=1, q)}{p(\vec{x} \mid q)}\;}{\;\frac{p(R=0 \mid q)\,p(\vec{x} \mid R=0, q)}{p(\vec{x} \mid q)}\;}$


Binary Independence Model

$O(R \mid q, \vec{x}) = \frac{p(R=1 \mid q, \vec{x})}{p(R=0 \mid q, \vec{x})} = \frac{p(R=1 \mid q)}{p(R=0 \mid q)} \cdot \frac{p(\vec{x} \mid R=1, q)}{p(\vec{x} \mid R=0, q)}$

The first factor is constant for a given query; the second needs estimation.

• Using the independence assumption:

$\frac{p(\vec{x} \mid R=1, q)}{p(\vec{x} \mid R=0, q)} = \prod_{i=1}^{n} \frac{p(x_i \mid R=1, q)}{p(x_i \mid R=0, q)}$

• So:

$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{i=1}^{n} \frac{p(x_i \mid R=1, q)}{p(x_i \mid R=0, q)}$


Binary Independence Model

• Since each $x_i$ is either 0 or 1:

$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{x_i=1} \frac{p(x_i=1 \mid R=1, q)}{p(x_i=1 \mid R=0, q)} \cdot \prod_{x_i=0} \frac{p(x_i=0 \mid R=1, q)}{p(x_i=0 \mid R=0, q)}$

• Let $p_i = p(x_i=1 \mid R=1, q)$ and $r_i = p(x_i=1 \mid R=0, q)$.

• Assume, for all terms not occurring in the query ($q_i = 0$), that $p_i = r_i$. Then:

$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{\substack{x_i=1 \\ q_i=1}} \frac{p_i}{r_i} \cdot \prod_{\substack{x_i=0 \\ q_i=1}} \frac{1-p_i}{1-r_i}$
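A toy sketch of this product form, with hypothetical estimates for O(R|q), p_i, and r_i:

```python
# Compute O(R|q,x) for one document: for each query term, multiply in
# p_i/r_i if the term is present, (1-p_i)/(1-r_i) if absent.
O_prior = 0.05             # O(R|q), constant for the query (made up)
p = [0.6, 0.4, 0.5]        # p_i = p(x_i=1 | R=1, q) for three query terms
r = [0.1, 0.2, 0.5]        # r_i = p(x_i=1 | R=0, q)
x = [1, 0, 1]              # incidence of the query terms in this document

odds = O_prior
for pi, ri, xi in zip(p, r, x):
    odds *= (pi / ri) if xi == 1 else ((1 - pi) / (1 - ri))
print(odds)  # 0.05 * 6 * 0.75 * 1 = 0.225
```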


document                relevant (R=1)    not relevant (R=0)
term present, x_i = 1   p_i               r_i
term absent,  x_i = 0   1 − p_i           1 − r_i


Binary Independence Model

Insert a factor of 1 over the matching terms, then regroup so that one product runs over all query terms:

$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{x_i=q_i=1} \frac{p_i}{r_i} \cdot \prod_{\substack{x_i=0 \\ q_i=1}} \frac{1-p_i}{1-r_i}$

$= O(R \mid q) \cdot \prod_{x_i=q_i=1} \left( \frac{p_i}{r_i} \cdot \frac{1-r_i}{1-p_i} \right) \cdot \prod_{x_i=q_i=1} \frac{1-p_i}{1-r_i} \cdot \prod_{\substack{x_i=0 \\ q_i=1}} \frac{1-p_i}{1-r_i}$

$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{x_i=q_i=1} \frac{p_i(1-r_i)}{r_i(1-p_i)} \cdot \prod_{q_i=1} \frac{1-p_i}{1-r_i}$

(The first product is over all matching query terms; the second is over all query terms.)


Binary Independence Model

$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{q_i=1} \frac{1-p_i}{1-r_i} \cdot \prod_{x_i=q_i=1} \frac{p_i(1-r_i)}{r_i(1-p_i)}$

The first product is constant for each query; the second is the only quantity that needs to be estimated for ranking.

Retrieval Status Value:

$RSV = \log \prod_{x_i=q_i=1} \frac{p_i(1-r_i)}{r_i(1-p_i)} = \sum_{x_i=q_i=1} \log \frac{p_i(1-r_i)}{r_i(1-p_i)}$


Binary Independence Model

All boils down to computing the RSV:

$RSV = \log \prod_{x_i=q_i=1} \frac{p_i(1-r_i)}{r_i(1-p_i)} = \sum_{x_i=q_i=1} \log \frac{p_i(1-r_i)}{r_i(1-p_i)}$

$RSV = \sum_{x_i=q_i=1} c_i, \qquad c_i = \log \frac{p_i(1-r_i)}{r_i(1-p_i)}$

The $c_i$ are log odds ratios. They function as the term weights in this model.

So, how do we compute the $c_i$'s from our data?
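Before turning to that question, a minimal sketch of the scoring side: with the c_i in hand, ranking a document is just a sum over matching query terms (weights and document below are hypothetical):

```python
# RSV = sum of c_i over query terms present in the document.
c = {"standard": 0.4, "user": 0.1, "dlink": 2.3, "650": 1.7}  # hypothetical term weights
doc_terms = {"user", "dlink", "650", "manual"}

rsv = sum(ci for term, ci in c.items() if term in doc_terms)
print(rsv)  # 0.1 + 2.3 + 1.7 ≈ 4.1
```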


Binary Independence Model
• Estimating RSV coefficients in theory
• For each term i, look at this table of document counts:

            Relevant   Non-relevant    Total
x_i = 1     s          n − s           n
x_i = 0     S − s      N − n − S + s   N − n
Total       S          N − S           N

• Estimates:

$p_i \approx \frac{s}{S} \qquad r_i \approx \frac{n-s}{N-S}$

$c_i \approx K(N, n, S, s) = \log \frac{s\,(N-n-S+s)}{(S-s)\,(n-s)}$

• For now, assume no zero terms. Remember smoothing.
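A hedged sketch of these estimates in Python; the add-0.5 smoothing is the standard fix alluded to by "remember smoothing", and all counts are hypothetical:

```python
import math

def c_i(N, n, S, s):
    """Log odds ratio c_i from the contingency table, with add-0.5 smoothing
    so that zero cells don't make the log blow up."""
    p_i = (s + 0.5) / (S + 1.0)            # ~ s / S
    r_i = (n - s + 0.5) / (N - S + 1.0)    # ~ (n - s) / (N - S)
    return math.log(p_i * (1 - r_i) / (r_i * (1 - p_i)))

# Hypothetical counts: 100,000 docs, 250 contain term i; of 10 judged
# relevant docs, 6 contain term i.
print(c_i(N=100_000, n=250, S=10, s=6))
```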


Estimation – key challenge

§ If non-relevant documents are approximated by the whole collection, then $r_i$ (prob. of occurrence in non-relevant documents for the query) is $n/N$ and

$\log \frac{1-r_i}{r_i} = \log \frac{N-n-S+s}{n-s} \approx \log \frac{N-n}{n} \approx \log \frac{N}{n} = \text{IDF}!$


Collection vs. Document frequency

§ Collection frequency of t is the total number of occurrences of t in the collection (incl. multiples)
§ Document frequency is the number of docs t is in
§ Example:

Word        Collection frequency    Document frequency
insurance   10440                   3997
try         10422                   8760

§ Which word is a better search term (and should get a higher weight)? Insurance: its occurrences are concentrated in far fewer documents, so it discriminates better, which is exactly what document-frequency-based (idf) weighting captures.

Sec. 6.2.1

Estimation – key challenge

§ $p_i$ (probability of occurrence in relevant documents) cannot be approximated as easily
§ $p_i$ can be estimated in various ways:
  § from relevant documents if you know some
    § Relevance weighting can be used in a feedback loop
  § as a constant (Croft and Harper combination match) – then we just get idf weighting of terms (with $p_i = 0.5$):
    $RSV = \sum_{x_i=q_i=1} \log \frac{N}{n_i}$
  § proportional to prob. of occurrence in collection
    § Greiff (SIGIR 1998) argues for $1/3 + 2/3 \cdot df_i/N$
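A toy sketch of the Croft and Harper special case, where the RSV reduces to summing idf-like weights (N here is hypothetical; the document frequencies reuse the example above):

```python
import math

# With p_i = 0.5, the c_i cancel down to log(N / n_i): pure idf weighting.
N = 100_000
df = {"insurance": 3997, "try": 8760}   # n_i, borrowed from the earlier table

def rsv_idf(query_terms, doc_terms):
    return sum(math.log(N / df[t]) for t in query_terms
               if t in doc_terms and t in df)

print(rsv_idf({"insurance", "try"}, {"insurance", "policy"}))  # only "insurance" matches
```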


Probabilistic Relevance Feedback
1. Guess a preliminary probabilistic description of the R=1 documents; use it to retrieve a set of documents
2. Interact with the user to refine the description: learn some definite members with R=1 and R=0
3. Re-estimate $p_i$ and $r_i$ on the basis of these
   § If $V_i$ is the subset of the document set V in which term i appears: $p_i = |V_i| / |V|$
   § Or can combine the new information with the original guess (use a Bayesian prior):
     $p_i^{(2)} = \frac{|V_i| + \kappa \, p_i^{(1)}}{|V| + \kappa}$  where $\kappa$ is the prior weight
4. Repeat, thus generating a succession of approximations to the relevant documents
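A one-function sketch of the Bayesian combination in step 3 (the choice κ = 5 is arbitrary, and the counts are hypothetical):

```python
def update_p(V_i, V, p_prev, kappa=5.0):
    """Bayesian-prior combination: p_i^(2) = (|V_i| + kappa*p_i^(1)) / (|V| + kappa)."""
    return (V_i + kappa * p_prev) / (V + kappa)

# Hypothetical: term i occurs in 7 of 10 known-relevant docs; previous guess was 0.5.
print(update_p(V_i=7, V=10, p_prev=0.5))  # (7 + 2.5) / 15 ≈ 0.633
```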


Iteratively estimating $p_i$ and $r_i$ (= pseudo-relevance feedback)
1. Assume that $p_i$ is constant over all $x_i$ in the query, and $r_i$ as before
   § $p_i = 0.5$ (even odds) for any given doc
2. Determine a guess of the relevant document set:
   § V is a fixed-size set of the highest-ranked documents on this model
3. We need to improve our guesses for $p_i$ and $r_i$, so
   § Use the distribution of $x_i$ in the docs in V. Let $V_i$ be the set of documents containing $x_i$
     § $p_i = |V_i| / |V|$
   § Assume if not retrieved then not relevant
     § $r_i = (n_i - |V_i|) / (N - |V|)$
4. Go to 2. until it converges, then return the ranking
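A compact sketch of this loop, assuming documents are sets of terms; the smoothing constants are an added assumption to keep the logs finite:

```python
import math

def prf_rank(query, docs, V_size=5, max_iters=10):
    """Pseudo-relevance feedback: start with p_i = 0.5 and collection-based r_i,
    then re-estimate both from the top-|V| documents until the ranking stabilizes."""
    N = len(docs)
    n = {t: sum(t in d for d in docs) for t in query}   # doc frequency per query term
    p = {t: 0.5 for t in query}                         # step 1: even odds
    r = {t: n[t] / N for t in query}

    def score(doc):
        # RSV with the current estimates (log odds ratio weights c_i)
        return sum(math.log(p[t] * (1 - r[t]) / (r[t] * (1 - p[t])))
                   for t in query if t in doc and 0 < r[t] < 1 and 0 < p[t] < 1)

    ranking = None
    for _ in range(max_iters):
        new_ranking = sorted(range(N), key=lambda i: score(docs[i]), reverse=True)
        if new_ranking == ranking:      # step 4: stop once the ranking converges
            break
        ranking = new_ranking
        V = ranking[:V_size]            # step 2: top-|V| docs assumed relevant
        for t in query:                 # step 3: re-estimate, smoothed to avoid zeros
            Vi = sum(t in docs[i] for i in V)
            p[t] = (Vi + 0.5) / (len(V) + 1)
            r[t] = (n[t] - Vi + 0.5) / (N - len(V) + 1)
    return ranking

docs = [{"dlink", "650"}, {"dlink"}, {"650", "card"}, {"router"}]
print(prf_rank({"dlink", "650"}, docs, V_size=2))  # [0, 1, 2, 3]
```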


PRP and BIM

§ Getting reasonable approximations of probabilities is possible.
§ Requires restrictive assumptions:
  § Term independence
  § Terms not in the query don't affect the outcome
  § Boolean representation of documents/queries/relevance
  § Document relevance values are independent
§ Some of these assumptions can be removed
§ Problem: either require partial relevance information or seemingly can only derive somewhat inferior term weights


Removing term independence
§ In general, index terms aren't independent
§ Dependencies can be complex
§ van Rijsbergen (1979) proposed a simple model of dependencies as a tree
  § Exactly Friedman and Goldszmidt's Tree Augmented Naive Bayes (AAAI 1996)
§ Each term dependent on one other
§ In the 1970s, estimation problems held back the success of this model


Second step: term frequency
§ Right in the first lecture, we said that a page should rank higher if it mentions a word more
  § Perhaps modulated by things like page length
§ We might want a model with term frequency in it.
§ We'll see a probabilistic one next time – BM25
§ Quick summary of the vector space model


Summary – vector space ranking

§ Represent the query as a weighted term frequency / inverse document frequency (tf-idf) vector
§ Represent each document as a weighted tf-idf vector
§ Compute the cosine similarity score for the query vector and each document vector
§ Rank documents with respect to the query by score
§ Return the top K (e.g., K = 10) to the user
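A toy sketch of this pipeline end to end (the four-document collection is made up; tf uses 1 + log10 weighting, one common variant):

```python
import math
from collections import Counter

docs = ["new home sales top forecasts",
        "home sales rise in july",
        "increase in home sales in july",
        "july new home sales rise"]
tokenized = [d.split() for d in docs]
N = len(tokenized)
df = Counter(t for d in tokenized for t in set(d))   # document frequency

def tfidf(tokens):
    """Weighted tf-idf vector: (1 + log10 tf) * log10(N / df)."""
    tf = Counter(tokens)
    return {t: (1 + math.log10(tf[t])) * math.log10(N / df[t])
            for t in tf if t in df}

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(u[t] * v[t] for t in u if t in v) / (nu * nv)

query_vec = tfidf("home sales july".split())
doc_vecs = [tfidf(d) for d in tokenized]
K = 3
top = sorted(range(N), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)[:K]
print(top)   # indices of the top-K documents by cosine score
```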

Cosine similarity

$\cos(\vec{q}, \vec{d}\,) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}\,|\,|\vec{d}\,|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\,\sqrt{\sum_{i=1}^{|V|} d_i^2}}$

Sec. 6.3


tf-idf weighting has many variants

Sec. 6.4

Resources

S. E. Robertson and K. Spärck Jones. 1976. Relevance Weighting of Search Terms. Journal of the American Society for Information Sciences 27(3): 129–146.

C. J. van Rijsbergen. 1979. Information Retrieval. 2nd ed. London: Butterworths, chapter 6. [Most details of the math] http://www.dcs.gla.ac.uk/Keith/Preface.html

N. Fuhr. 1992. Probabilistic Models in Information Retrieval. The Computer Journal 35(3): 243–255. [Easiest read, with BNs]

F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell. 1998. Is This Document Relevant? … Probably: A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys 30(4): 528–552. http://www.acm.org/pubs/citations/journals/surveys/1998-30-4/p528-crestani/ [Adds very little material that isn't in van Rijsbergen or Fuhr]