Lecture1 Intro

Introduction to Information Retrieval Introducing Information Retrieval and Web Search

Transcript of Lecture1 Intro

Introduction to Information Retrieval
Introducing Information Retrieval and Web Search

Information Retrieval
- Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
- These days we frequently think first of web search, but there are many other cases:
  - E-mail search
  - Searching your laptop
  - Corporate knowledge bases
  - Legal information retrieval

[Chart: Unstructured (text) vs. structured (database) data in the mid-nineties]

[Chart: Unstructured (text) vs. structured (database) data today]

Basic assumptions of Information Retrieval (Sec. 1.1)
- Collection: A set of documents
  - Assume it is a static collection for the moment
- Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task

The classic search model
- User task: Get rid of mice in a politically correct way
  - Misconception?
- Info need: Info about removing mice without killing them
  - Misformulation?
- Query: how trap mice alive
- Search: the search engine runs the query against the collection and returns results
- Query refinement: results feed back into a revised query

How good are the retrieved docs?
- Precision: Fraction of retrieved docs that are relevant to the user's information need
- Recall: Fraction of relevant docs in collection that are retrieved
- More precise definitions and measurements to follow later

Term-document incidence matrices

Unstructured data in 1620
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia?
- Why is that not the answer?
  - Slow (for large corpora)
  - NOT Calpurnia is non-trivial
  - Other operations (e.g., find the word Romans near countrymen) not feasible
  - Ranked retrieval (best documents to return): later lectures
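As a rough illustration of the grep-style approach above, the following sketch answers Brutus AND Caesar AND NOT Calpurnia by scanning every document at query time. The one-line play "texts" and the function name `linear_scan` are made up for the example and are not from the lecture; the point is only that each query re-reads all of the text, which is what makes this slow for large corpora.

```python
# Toy collection: the titles are real plays, but the one-line "texts" are
# invented placeholders for illustration only.
plays = {
    "Antony and Cleopatra": "Antony mourned Caesar and wept for Brutus",
    "Julius Caesar": "Calpurnia warned Caesar while Brutus plotted",
    "The Tempest": "Prospero raised a storm at sea",
}

def linear_scan(docs, required, excluded):
    """Return titles whose text has every required word and no excluded word.

    Every document is re-read on every query, so the cost grows with the
    total size of the collection.
    """
    hits = []
    for title, text in docs.items():
        words = set(text.lower().split())
        has_all = all(w in words for w in required)
        has_none = not any(w in words for w in excluded)
        if has_all and has_none:
            hits.append(title)
    return hits

result = linear_scan(plays, required=["brutus", "caesar"], excluded=["calpurnia"])
print(result)  # ['Antony and Cleopatra']
```

Note that the exclusion is applied per document, not per line: stripping out only the lines that mention Calpurnia would still leave the play in the result.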
Term-document incidence matrices (Sec. 1.1)

Term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony      1                     1              0            0       0        1
Brutus      1                     1              0            1       0        0
Caesar      1                     1              0            1       1        1
Calpurnia   0                     1              0            0       0        0
Cleopatra   1                     0              0            0       0        0
mercy       1                     0              1            1       1        1
worser      1                     0              1            1       1        0

1 if play contains word, 0 otherwise
Query: Brutus AND Caesar BUT NOT Calpurnia

Incidence vectors
- So we have a 0/1 vector for each term.
- To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND.
  - 110100 AND 110111 AND 101111 = 100100

Answers to query
- Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.
- Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Bigger collections
- Consider N = 1 million documents, each with about 1000 words.
- Avg 6 bytes/word including spaces/punctuation
  - 6GB of data in the documents.
- Say there are M = 500K distinct terms among these.

Can't build the matrix
- 500K x 1M matrix has half-a-trillion 0's and 1's.
- But it has no more than one billion 1's. (Why?)
  - matrix is extremely sparse.
- What's a better representation?
  - We only record the 1 positions.

The Inverted Index
The key data structure underlying modern IR

Inverted index
- For each term t, we must store a list of all documents that contain t.
  - Identify each doc by a docID, a document serial number
- Can we use fixed-size arrays for this?
- What happens if the word Caesar is added to document 14?
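The incidence-vector query above (Brutus AND Caesar AND NOT Calpurnia) can be checked with ordinary integer bit operations. This is a minimal sketch, assuming each term's 0/1 row is stored as a Python integer whose leftmost bit stands for Antony and Cleopatra; the names `vectors`, `mask`, and `answer` are illustrative only.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Rows of the term-document incidence matrix, written as 6-bit strings.
vectors = {
    "Brutus":    int("110100", 2),
    "Caesar":    int("110111", 2),
    "Calpurnia": int("010000", 2),
}

# Six 1 bits; keeps the complement from spilling past the six plays.
mask = (1 << len(plays)) - 1

# Brutus AND Caesar AND NOT Calpurnia
answer = vectors["Brutus"] & vectors["Caesar"] & (~vectors["Calpurnia"] & mask)
bits = format(answer, "06b")
print(bits)     # 100100

# Map the 1 bits back to play titles (leftmost bit = first play).
matches = [play for play, bit in zip(plays, bits) if bit == "1"]
print(matches)  # ['Antony and Cleopatra', 'Hamlet']
```

The mask matters because Python integers are unbounded, so a bare `~` yields a negative number rather than a 6-bit complement.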
Inverted index (Sec. 1.2)
- We need variable-size postings lists
  - On disk, a continuous run of postings is normal and best
  - In memory, can use linked lists or variable length arrays
    - Some tradeoffs in size/ease of insertion

Dictionary     Postings
Brutus     ->  1 2 4 11 31 45 173 174
Caesar     ->  1 2 4 5 6 16 57 132
Calpurnia  ->  2 31 54 101

- Each entry in a postings list is a Posting.
- Sorted by docID (more later on why).

Inverted index construction
- Documents to be indexed: Friends, Romans, countrymen.
- Tokenizer -> Token stream: Friends Romans Countrymen
- Linguistic modules -> Modified tokens: friend roman countryman
- Indexer -> Inverted index:
  friend      ->  2 4
  roman       ->  1 2
  countryman  ->  13 16

Initial stages of text processing
- Tokenization
  - Cut character sequence into word tokens
  - Deal with "John's", a state-of-the-art solution
- Normalization
  - Map text and query term to same form
  - You want U.S.A. and USA to match
- Stemming
  - We may wish different forms of a root to match
  - authorize, authorization
- Stop words
  - We may omit very common words (or not)
  - the, a, to, of

Indexer steps: Token sequence
- Sequence of (Modified token, Document ID) pairs.
- Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
- Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Indexer steps: Sort
- Sort by terms
  - And then docID
- Core indexing step

Indexer steps: Dictionary & Postings
- Multiple term entries in a single document are merged.
- Split into Dictionary and Postings
- Doc. frequency information is added.
- Why frequency? Will discuss later.

Where do we pay in storage?
- Terms and counts
- Pointers
- Lists of docIDs
- IR system implementation
  - How do we index efficiently?
  - How much storage do we need?

Query processing with an inverted index

The index we just built
- How do we process a query? (Our focus)
  - Later - what kinds of queries can we process?
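The indexer steps above (token sequence, sort, then merging into a dictionary and postings) can be sketched in a few lines. This is a simplified illustration rather than the lecture's implementation: the `tokenize` helper only lowercases and strips punctuation, with no stemming or stop-word removal, and all function names are assumptions made for the example.

```python
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    """Crude stand-in for the tokenizer and linguistic modules: lowercase words only."""
    return re.findall(r"[a-z']+", text.lower())

def build_index(documents):
    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(tok, doc_id)
             for doc_id, text in documents.items()
             for tok in tokenize(text)]
    # Step 2: sort by term, then by docID (the core indexing step).
    pairs.sort()
    # Step 3: merge repeated (term, docID) entries and split into
    # a dictionary mapping each term to its postings list.
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index

index = build_index(docs)
for term in ("brutus", "caesar", "capitol"):
    # Document frequency is the length of the postings list.
    print(term, len(index[term]), index[term])
# brutus 2 [1, 2]
# caesar 2 [1, 2]
# capitol 1 [1]
```

Because the pairs are sorted before merging, each postings list comes out already ordered by docID, which is the property the query-processing step relies on.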
Query processing: AND (Sec. 1.3)
- Consider processing the query: Brutus AND Caesar
  - Locate Brutus in the Dictionary;
    - Retrieve its postings.
  - Locate Caesar in the Dictionary;
    - Retrieve its postings.
  - "Merge" the two postings (intersect the document sets):

Brutus  ->  2 4 8 16 32 64 128
Caesar  ->  1 2 3 5 8 13 21 34

The merge
- Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus  ->  2 4 8 16 32 64 128
Caesar  ->  1 2 3 5 8 13 21 34
Intersection: 2 8

- If the list lengths are x and y, the merge takes O(x+y) operations.
- Crucial: postings sorted by docID.

Intersecting two postings lists (a "merge" algorithm)

The Boolean Retrieval Model & Extended Boolean Models

Boolean queries: Exact match
- The Boolean retrieval model is being able to ask a query that is a Boolean expression:
  - Boolean Queries are queries using AND, OR and NOT to join query terms
    - Views each document as a set of words
    - Is precise: document matches condition or not.
  - Perhaps the simplest model to build an IR system on
- Primary commercial retrieval tool for 3 decades.
- Many search systems you still use are Boolean:
  - Email, library catalog, Mac OS X Spotlight

Example: WestLaw http://www.westlaw.com/
- Largest commercial (paying subscribers) legal search service (started 197