Introduction to Information Retrieval (disi.unitn.it/moschitti/Teaching-slides/NLP-IR/lecture02...)
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 2: The term vocabulary and postings lists
Recap of the previous lecture
- Basic inverted indexes:
  - Structure: Dictionary and Postings
  - Key step in construction: Sorting
- Boolean query processing
  - Intersection by linear time “merging”
  - Simple optimizations
- Overview of course topics
Ch. 1
Plan for this lecture
Elaborate basic indexing
- Preprocessing to form the term vocabulary
  - Documents
  - Tokenization
  - What terms do we put in the index?
- Postings
  - Faster merges: skip lists
  - Positional postings and phrase queries
Recall the basic indexing pipeline
[Figure: the basic indexing pipeline]
- Documents to be indexed: Friends, Romans, countrymen.
- Tokenizer → token stream: Friends Romans Countrymen
- Linguistic modules → modified tokens: friend roman countryman
- Indexer → inverted index:
  - friend → 2, 4
  - roman → 1, 2
  - countryman → 13, 16
Parsing a document
- What format is it in?
  - pdf/word/excel/html?
- What language is it in?
- What character set is in use?
Each of these is a classification problem, which we will study later in the course.
But these tasks are often done heuristically …
Sec. 2.1
Complications: Format/language
- Documents being indexed can include docs from many different languages
  - A single index may have to contain terms of several languages.
- Sometimes a document or its components can contain multiple languages/formats
  - French email with a German pdf attachment.
- What is a unit document?
  - A file?
  - An email? (Perhaps one of many in an mbox.)
  - An email with 5 attachments?
  - A group of files (PPT or LaTeX as HTML pages)
Sec. 2.1
TOKENS AND TERMS
Tokenization
- Input: “Friends, Romans, Countrymen”
- Output: Tokens
  - Friends
  - Romans
  - Countrymen
- A token is a sequence of characters in a document
- Each such token is now a candidate for an index entry, after further processing
  - Described below
- But what are valid tokens to emit?
Sec. 2.2.1
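
A minimal sketch of the tokenization step above, assuming the simplest possible rule: a token is any maximal run of word characters. The function name `tokenize` and the regular expression are illustrative choices, not something the lecture prescribes.

```python
import re

def tokenize(text):
    """Return maximal runs of word characters as tokens (an assumed, deliberately naive rule)."""
    return re.findall(r"\w+", text)

print(tokenize("Friends, Romans, Countrymen"))  # ['Friends', 'Romans', 'Countrymen']
print(tokenize("Finland's capital"))            # ['Finland', 's', 'capital']
```

The second call hints at why the question of valid tokens is not trivial: the naive rule splits the possessive into a stray `s` token.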
Tokenization
- Issues in tokenization:
  - Finland’s capital → Finland? Finlands? Finland’s?
  - Hewlett-Packard → Hewlett and Packard as two tokens?
    - state-of-the-art: break up hyphenated sequence.
    - co-education
    - lowercase, lower-case, lower case?
    - It can be effective to get the user to put in possible hyphens
  - San Francisco: one token or two?
    - How do you decide it is one token?
Sec. 2.2.1
Numbers
- 3/12/91, Mar. 12, 1991, 12/3/91
- 55 B.C.
- B-52
- My PGP key is 324a3df234cb23e
- (800) 234-2333
  - Often have embedded spaces
  - Older IR systems may not index numbers
    - But often very useful: think about things like looking up error codes/stacktraces on the web
    - (One answer is using n-grams: Lecture 3)
  - Will often index “meta-data” separately
    - Creation date, format, etc.
Sec. 2.2.1
Tokenization: language issues
- French
  - L'ensemble → one token or two?
    - L? L’? Le?
    - Want l’ensemble to match with un ensemble
      - Until at least 2003, it didn’t on Google
      - Internationalization!
- German noun compounds are not segmented
  - Lebensversicherungsgesellschaftsangestellter
  - ‘life insurance company employee’
  - German retrieval systems benefit greatly from a compound splitter module
    - Can give a 15% performance boost for German
Sec. 2.2.1
Tokenization: language issues
- Chinese and Japanese have no spaces between words:
  - 莎拉波娃现在居住在美国东南部的佛罗里达。
  - Not always guaranteed a unique tokenization
- Further complicated in Japanese, with multiple alphabets intermingled
  - Dates/amounts in multiple formats
  - フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
  - Scripts used in this example: Katakana, Hiragana, Kanji, Romaji
End-user can express query entirely in hiragana!
Sec. 2.2.1
Tokenization: language issues
- Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
- Words are separated, but letter forms within a word form complex ligatures
- [Figure: Arabic example sentence with reading direction marked ← → ← → ← start]
  - ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
- With Unicode, the surface presentation is complex, but the stored form is straightforward
Sec. 2.2.1
Stop words
- With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  - They have little semantic content: the, a, and, to, be
  - There are a lot of them: ~30% of postings for top 30 words
- But the trend is away from doing this:
  - Good compression techniques (lecture 5) mean the space for including stop words in a system is very small
  - Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words.
  - You need them for:
    - Phrase queries: “King of Denmark”
    - Various song titles, etc.: “Let it be”, “To be or not to be”
    - “Relational” queries: “flights to London”
Sec. 2.2.2
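
The cost of a stop list can be seen in a small sketch. The stop list below is a hypothetical eight-word list chosen for illustration, and `remove_stop_words` is an assumed helper name.

```python
# Hypothetical stop list for illustration; real lists are longer and language-specific.
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "or", "not"}

def remove_stop_words(tokens):
    """Drop every token whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("King of Denmark".split()))     # ['King', 'Denmark']
print(remove_stop_words("flights to London".split()))   # ['flights', 'London']
print(remove_stop_words("To be or not to be".split()))  # []
```

With this list the phrase query loses its connecting word, the relational query loses its direction, and the last query disappears entirely.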
Normalization to terms
- We need to “normalize” words in indexed text as well as query words into the same form
  - We want to match U.S.A. and USA
- Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
- We most commonly implicitly define equivalence classes of terms by, e.g.,
  - deleting periods to form a term
    - U.S.A., USA → USA
  - deleting hyphens to form a term
    - anti-discriminatory, antidiscriminatory → antidiscriminatory
Sec. 2.2.3
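
A minimal sketch of the two deletion rules above, applied identically to indexed tokens and to query words. The function name `normalize` and the three-document toy index are assumptions made for the example.

```python
def normalize(token):
    """Map a token to its equivalence-class term by deleting periods and hyphens."""
    return token.replace(".", "").replace("-", "")

# Tiny hypothetical index built from normalized document tokens.
index = {}
for doc_id, token in [(1, "U.S.A."), (2, "USA"), (3, "anti-discriminatory")]:
    index.setdefault(normalize(token), []).append(doc_id)

# Query words go through the same function before lookup.
print(index[normalize("USA")])                 # [1, 2]
print(index[normalize("antidiscriminatory")])  # [3]
```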
Normalization: other languages
- Accents: e.g., French résumé vs. resume.
- Umlauts: e.g., German: Tuebingen vs. Tübingen
  - Should be equivalent
- Most important criterion:
  - How are your users likely to write their queries for these words?
- Even in languages that standardly have accents, users often may not type them
  - Often best to normalize to a de-accented term
    - Tuebingen, Tübingen, Tubingen → Tubingen
Sec. 2.2.3
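
One way to de-accent terms, sketched with Python's standard `unicodedata` module: decompose each character and drop the combining marks. The helper name `strip_accents` is an assumption, and this is only one possible approach.

```python
import unicodedata

def strip_accents(token):
    """Decompose characters (NFD) and drop combining marks such as accents and umlauts."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("résumé"))     # resume
print(strip_accents("Tübingen"))   # Tubingen
print(strip_accents("Tuebingen"))  # Tuebingen
```

The transliterated spelling Tuebingen is left unchanged, so mapping it to Tubingen would need an extra, language-specific rule that this sketch does not attempt.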
Normalization: other languages
- Normalization of things like date forms
  - 7月30日 vs. 7/30
  - Japanese use of kana vs. Chinese characters
- Tokenization and normalization may depend on the language and so is intertwined with language detection
  - Morgen will ich in MIT … (Is this German “mit”?)
- Crucial: Need to “normalize” indexed text as well as query terms into the same form
Sec. 2.2.3
Case folding
- Reduce all letters to lower case
  - exception: upper case in mid-sentence?
    - e.g., General Motors
    - Fed vs. fed
    - SAIL vs. sail
  - Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…
- Google example:
  - Query C.A.T.
  - #1 result was for “cat” (well, Lolcats) not Caterpillar Inc.
Sec. 2.2.3
Normalization to terms
- An alternative to equivalence classing is to do asymmetric expansion
- An example of where this may be useful
  - Enter: window → Search: window, windows
  - Enter: windows → Search: Windows, windows, window
  - Enter: Windows → Search: Windows
- Potentially more powerful, but less efficient
Sec. 2.2.3
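
Asymmetric expansion can be sketched as a lookup table from the term the user enters to the list of index terms to search. The table reproduces the window/windows/Windows example; the names `EXPANSIONS` and `expand` are assumptions.

```python
# Hand-built expansion table copied from the example; lookups are case-sensitive.
EXPANSIONS = {
    "window": ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def expand(query_term):
    """Return the index terms to search for a query term (itself if no rule exists)."""
    return EXPANSIONS.get(query_term, [query_term])

print(expand("windows"))  # ['Windows', 'windows', 'window']
print(expand("Windows"))  # ['Windows']
```

Each expanded term costs an extra postings lookup at query time, which is one reading of "more powerful, but less efficient".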
Thesauri and soundex
- Do we handle synonyms and homonyms?
  - E.g., by hand-constructed equivalence classes
    - car = automobile, color = colour
  - We can rewrite to form equivalence-class terms
    - When the document contains automobile, index it under car-automobile (and vice-versa)
  - Or we can expand a query
    - When the query contains automobile, look under car as well
- What about spelling mistakes?
  - One approach is soundex, which forms equivalence classes of words based on phonetic heuristics
- More in lectures 3 and 9
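
The slide only names soundex, so the sketch below follows one common simplified formulation: keep the first letter, map the remaining letters to digit classes, collapse repeated digits, drop zeros, and pad to four characters. Other Soundex variants treat the first letter and the letters H and W differently, so treat the details as assumptions.

```python
# Letter classes of a simplified Soundex; "0" marks vowels and vowel-like letters.
CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    """Return a 4-character phonetic code: first letter plus up to three digits."""
    word = "".join(ch for ch in word.upper() if ch in CODES)
    if not word:
        return ""
    collapsed = []
    for digit in (CODES[ch] for ch in word[1:]):
        if not collapsed or collapsed[-1] != digit:  # collapse runs of identical digits
            collapsed.append(digit)
    tail = "".join(d for d in collapsed if d != "0")  # remove the zeros
    return (word[0] + tail + "000")[:4]               # pad with zeros, keep four characters

print(soundex("Herman"), soundex("Hermann"))  # H655 H655
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```

Words that receive the same code fall into one equivalence class, so a misspelled query term can still reach the postings of the intended word.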
Lemmatization
- Reduce inflectional/variant forms to base form
- E.g.,
  - am, are, is → be
  - car, cars, car's, cars' → car
- the boy's cars are different colors → the boy car be different color
- Lemmatization implies doing “proper” reduction to dictionary headword form
Sec. 2.2.4
Stemming
- Reduce terms to their “roots” before indexing
- “Stemming” suggests crude affix chopping
  - language dependent
  - e.g., automate(s), automatic, automation all reduced to automat.
- Original text: for example compressed and compression are both accepted as equivalent to compress.
- Stemmed text: for exampl compress and compress ar both accept as equival to compress
Sec. 2.2.4
Porter’s algorithm
- Commonest algorithm for stemming English
  - Results suggest it’s at least as good as other stemming options
- Conventions + 5 phases of reductions
  - phases applied sequentially
  - each phase consists of a set of commands
  - sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.
Sec. 2.2.4
Typical rules in Porter
- sses → ss
- ies → i
- ational → ate
- tional → tion
- Rules sensitive to the measure of words
  - (m>1) EMENT →
    - replacement → replac
    - cement → cement
Sec. 2.2.4
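
The rules above, together with the longest-suffix convention, can be sketched as follows. This covers only the five listed rules, not the full Porter stemmer. The (m>0) conditions on the `ational` and `tional` rules follow Porter's published rule list rather than the slide, and the grouping into three commands is a simplification. All function and variable names are assumptions.

```python
def is_consonant(word, i):
    """Porter's definition: not a, e, i, o, u, and not a 'y' preceded by a consonant."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Porter's measure m: the number of vowel-run to consonant-run transitions in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

# Each rule is (suffix, replacement, k) and fires only if measure(stem) > k.
RULE_GROUPS = [
    [("sses", "ss", -1), ("ies", "i", -1)],
    [("ational", "ate", 0), ("tional", "tion", 0)],
    [("ement", "", 1)],
]

def apply_group(word, rules):
    """Pick the rule with the longest matching suffix; if its condition fails, change nothing."""
    for suffix, replacement, k in sorted(rules, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            return stem + replacement if measure(stem) > k else word
    return word

print(apply_group("caresses", RULE_GROUPS[0]))     # caress
print(apply_group("relational", RULE_GROUPS[1]))   # relate
print(apply_group("replacement", RULE_GROUPS[2]))  # replac
print(apply_group("cement", RULE_GROUPS[2]))       # cement
```

The last two calls show the measure at work: the stem `replac` has m = 2, so `ement` is removed, while the stem `c` has m = 0, so `cement` is left alone.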
Other stemmers
- Other stemmers exist, e.g., Lovins stemmer
  - http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
  - Single-pass, longest suffix removal (about 250 rules)
- Full morphological analysis: at most modest benefits for retrieval
- Do stemming and other normalizations help?
  - English: very mixed results. Helps recall but harms precision
    - operative (dentistry) ⇒ oper
    - operational (research) ⇒ oper
    - operating (systems) ⇒ oper
  - Definitely useful for Spanish, German, Finnish, …
    - 30% performance gains for Finnish!
Sec. 2.2.4
Language-specificity
- Many of the above features embody transformations that are
  - Language-specific and
  - Often, application-specific
- These are “plug-in” addenda to the indexing process
- Both open source and commercial plug-ins are available for handling these
Sec. 2.2.4
Dictionary entries: first cut
ensemble.french
時間.japanese
MIT.english
mit.german
guaranteed.english
entries.english
sometimes.english
tokenization.english
These may be grouped by language (or not…). More on this in ranking/query processing.
Sec. 2.2
FASTER POSTINGS MERGES: SKIP POINTERS/SKIP LISTS
Recall basic merge
- Walk through the two postings simultaneously, in time linear in the total number of postings entries
  - Brutus → 2 4 8 41 48 64 128
  - Caesar → 1 2 3 8 11 17 21 31
  - Intersection → 2 8
If the list lengths are m and n, the merge takes O(m+n) operations.
Can we do better? Yes (if index isn’t changing too fast).
Sec. 2.3
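
The linear merge can be sketched with two cursors that advance through sorted postings lists. The function name `intersect` is an assumption; the two lists are the Brutus and Caesar postings from the slide.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(m + n) comparisons."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the cursor that points at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))  # [2, 8]
```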
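Augment postings with skip pointers (at indexing time)
- Upper list: 2 4 8 41 48 64 128 (skip pointers: 2 → 41, 41 → 128)
- Lower list: 1 2 3 8 11 17 21 31 (skip pointers: 1 → 11, 11 → 31)
- Why?
  - To skip postings that will not figure in the search results.
- How?
- Where do we place skip pointers?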
Sec. 2.3
Query processing with skip pointers
- Upper list: 2 4 8 41 48 64 128 (skip pointers: 2 → 41, 41 → 128)
- Lower list: 1 2 3 8 11 17 21 31 (skip pointers: 1 → 11, 11 → 31)
Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance.
We then have 41 and 11 on the lower. 11 is smaller.
But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings.
Sec. 2.3
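
The walk-through above can be sketched by storing skip pointers as a dictionary from list index to target index. This representation, the helper `advance`, and the function name are assumptions. The pointer positions are read off the slide's figure (2 → 41 → 128 on the upper list, 1 → 11 → 31 on the lower).

```python
def advance(postings, skips, pos, target):
    """Follow skip pointers while they do not overshoot target; otherwise step by one."""
    if pos in skips and postings[skips[pos]] <= target:
        while pos in skips and postings[skips[pos]] <= target:
            pos = skips[pos]
        return pos
    return pos + 1

def intersect_with_skips(p1, skips1, p2, skips2):
    """Intersect sorted postings lists, using skip pointers to jump over non-matching runs."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = advance(p1, skips1, i, p2[j])
        else:
            j = advance(p2, skips2, j, p1[i])
    return answer

upper = [2, 4, 8, 41, 48, 64, 128]
upper_skips = {0: 3, 3: 6}   # 2 -> 41, 41 -> 128
lower = [1, 2, 3, 8, 11, 17, 21, 31]
lower_skips = {0: 4, 4: 7}   # 1 -> 11, 11 -> 31
print(intersect_with_skips(upper, upper_skips, lower, lower_skips))  # [2, 8]
```

After the match on 8, the lower cursor sits on 11 and its skip target 31 is still below 41, so the cursor jumps straight to 31 without visiting 17 and 21.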
Where do we place skips?
- Tradeoff:
  - More skips → shorter skip spans ⇒ more likely to skip. But lots of comparisons to skip pointers.
  - Fewer skips → few pointer comparisons, but then long skip spans ⇒ few successful skips.
Sec. 2.3
Placing skips
- Simple heuristic: for postings of length L, use √L evenly-spaced skip pointers.
- This ignores the distribution of query terms.
- Easy if the index is relatively static; harder if L keeps changing because of updates.
- This definitely used to help; with modern hardware it may not (Bahle et al. 2002) unless you’re memory-based
  - The I/O cost of loading a bigger postings list can outweigh the gains from quicker in-memory merging!
Sec. 2.3
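
A small sketch of the √L heuristic, assuming skip pointers are stored as a dictionary from list index to target index and spaced about √L postings apart. The function name `place_skips` and the rounding are illustrative choices.

```python
import math

def place_skips(length):
    """Return {from_index: to_index} for evenly spaced skips, spacing about sqrt(length)."""
    step = max(1, round(math.sqrt(length)))
    return {i: i + step for i in range(0, length - step, step)}

print(place_skips(16))  # {0: 4, 4: 8, 8: 12}
```

Because the spacing depends on L, every update that changes the list length can shift the ideal pointer positions, which is why the heuristic is easier to apply to a relatively static index.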
PHRASE QUERIES AND POSITIONAL INDEXES
Phrase queries
- Want to be able to answer queries such as “stanford university”, as a phrase
- Thus the sentence “I went to university at Stanford” is not a match.
  - The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that works
  - Many more queries are implicit phrase queries
- For this, it no longer suffices to store only <term : docs> entries
Sec. 2.4
A first attempt: Biword indexes
- Index every consecutive pair of terms in the text as a phrase
- For example the text “Friends, Romans, Countrymen” would generate the biwords
  - friends romans
  - romans countrymen
- Each of these biwords is now a dictionary term
- Two-word phrase query-processing is now immediate.
Sec. 2.4.1
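
Generating biwords amounts to pairing each token with its successor. A minimal sketch, assuming the text has already been tokenized and lowercased (the function name `biwords` is an assumption):

```python
def biwords(tokens):
    """Pair each token with its successor to form biword dictionary terms."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# ['friends romans', 'romans countrymen']
```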
Longer phrase queries
- Longer phrases are processed as we did with wild-cards:
- stanford university palo alto can be broken into the Boolean query on biwords:
  - stanford university AND university palo AND palo alto
- Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.
- Can have false positives!
Sec. 2.4.1
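
The false-positive problem can be reproduced with a toy biword index. The two documents below are invented for the example, and the function names are assumptions. Document 2 contains all three biwords of the query but not the four-word phrase itself.

```python
from collections import defaultdict

def build_biword_index(docs):
    """Map each (term, next_term) pair to the set of docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for pair in zip(tokens, tokens[1:]):
            index[pair].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """AND together the postings of every biword in the phrase."""
    tokens = phrase.lower().split()
    result = None
    for pair in zip(tokens, tokens[1:]):
        docs = index.get(pair, set())
        result = docs if result is None else result & docs
    return result or set()

# Hypothetical documents: doc 2 has every biword but never the full phrase.
docs = {
    1: "stanford university palo alto",
    2: "palo alto hosts stanford university palo campus",
}
index = build_biword_index(docs)
print(sorted(phrase_candidates(index, "stanford university palo alto")))  # [1, 2]
```

Document 2 is returned even though the phrase never occurs in it, so confirming a match would require going back to the document text.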
Extended biwords
- Parse the indexed text and perform part-of-speech-tagging (POST).
- Bucket the terms into (say) Nouns (N) and articles/prepositions (X).
- Call any string of terms of the form NX*N an extended biword.
  - Each such extended biword is now made a term in the dictionary.
- Example: catcher in the rye, tagged N X X N
- Query processing: parse it into N’s and X’s
  - Segment query into enhanced biwords
  - Look up in index: catcher rye
Sec. 2.4.1
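
Once terms carry N/X tags, extended biwords can be found by matching the pattern NX*N against the tag string. The sketch assumes the tags are supplied by hand, since a real part-of-speech tagger is outside its scope. The function name and the placeholder tag `O` for other word classes are assumptions.

```python
import re

def extended_biwords(tagged_terms):
    """tagged_terms: list of (term, tag) pairs. Return (first noun, last noun) per NX*N match."""
    tags = "".join(tag if tag in ("N", "X") else "O" for _, tag in tagged_terms)
    pairs = []
    # The lookahead lets matches overlap, so a noun can end one biword and start the next.
    for match in re.finditer(r"(?=(NX*N))", tags):
        first, last = match.start(1), match.end(1) - 1
        pairs.append((tagged_terms[first][0], tagged_terms[last][0]))
    return pairs

tagged = [("catcher", "N"), ("in", "X"), ("the", "X"), ("rye", "N")]
print(extended_biwords(tagged))  # [('catcher', 'rye')]
```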
Issues for biword indexes
- False positives, as noted before
- Index blowup due to bigger dictionary
  - Infeasible for more than biwords, big even for them
- Biword indexes are not the standard solution (for all biwords) but can be part of a compound strategy
Sec. 2.4.1
Solution 2: Positional indexes
- In the postings, store for each term the position(s) in which tokens of it appear:
  - <term, number of docs containing term; doc1: position1, position2 …; doc2: position1, position2 …; etc.>
Sec. 2.4.2
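
A positional index of this shape can be sketched as a nested dictionary, term → docID → list of positions. Whitespace tokenization, 1-based positions, and the two toy documents are assumptions made for the example.

```python
def build_positional_index(docs):
    """Return {term: {doc_id: [positions]}} with 1-based token positions."""
    index = {}
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id].lower().split(), start=1):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index

docs = {1: "to be or not to be", 2: "be quick"}
index = build_positional_index(docs)
print(index["be"])       # {1: [2, 6], 2: [1]}
print(len(index["be"]))  # 2
```

The second value is the number of docs containing the term, which corresponds to the count stored at the head of the postings entry.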
Positional index example
- For phrase queries, we use a merge algorithm recursively at the document level
- But we now need to deal with more than just equality
- <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>
- Which of docs 1, 2, 4, 5 could contain “to be or not to be”?
Sec. 2.4.2
Processing a phrase query
- Extract inverted index entries for each distinct term: to, be, or, not.
- Merge their doc:position lists to enumerate all positions with “to be or not to be”.
  - to:
    - 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191; ...
  - be:
    - 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; ...
- Same general method for proximity searches
Sec. 2.4.2
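
The positional merge can be sketched for the two-term phrase “to be” using only the postings listed above (the slide's lists are truncated, so anything past the `...` is not represented). For brevity the sketch uses set membership instead of a pointer-by-pointer merge. The function name and data layout are assumptions.

```python
def phrase_matches(term_postings):
    """term_postings: one {doc_id: [positions]} dict per phrase term, in phrase order.
    Return {doc_id: [start positions]} where the terms occur at consecutive positions."""
    common_docs = set(term_postings[0])
    for postings in term_postings[1:]:
        common_docs &= set(postings)
    matches = {}
    for doc_id in sorted(common_docs):
        later_terms = [set(postings[doc_id]) for postings in term_postings[1:]]
        starts = [
            start for start in term_postings[0][doc_id]
            if all(start + offset in positions
                   for offset, positions in enumerate(later_terms, start=1))
        ]
        if starts:
            matches[doc_id] = starts
    return matches

to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
print(phrase_matches([to, be]))  # {4: [16, 190, 429, 433]}
```

Only document 4 contains both terms, and the starts 429 and 433 are four positions apart, which is what the full phrase “to be or not to be” would require. Confirming it needs the postings for or and not, which the slide does not list.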
Proximity queries
- LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
  - Again, here, /k means “within k words of”.
- Clearly, positional indexes can be used for such queries; biword indexes cannot.
- Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
  - This is a little tricky to do correctly and efficiently
  - See Figure 2.12 of IIR
  - There’s likely to be a problem on it!
Sec. 2.4.2
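
One possible answer to the exercise, sketched for two terms: walk the positions of the first term in order and keep a sliding window of second-term positions that lie within k of the current position. This is written in the spirit of a linear merge and is not a transcription of Figure 2.12. The names and toy postings are assumptions.

```python
from collections import deque

def positional_intersect(p1, p2, k):
    """p1, p2: {doc_id: sorted positions}. Return (doc_id, pos1, pos2) with |pos1 - pos2| <= k."""
    answer = []
    for doc_id in sorted(set(p1) & set(p2)):
        positions2 = p2[doc_id]
        window = deque()  # second-term positions that may still be within k of pos1
        j = 0
        for pos1 in p1[doc_id]:
            while j < len(positions2) and positions2[j] <= pos1 + k:
                window.append(positions2[j])
                j += 1
            while window and window[0] < pos1 - k:
                window.popleft()  # too far behind every later pos1 as well
            answer.extend((doc_id, pos1, pos2) for pos2 in window)
    return answer

# Hypothetical postings for two terms in one document.
first = {1: [3, 10]}
second = {1: [5, 20]}
print(positional_intersect(first, second, 2))  # [(1, 3, 5)]
print(positional_intersect(first, second, 1))  # []
```

Each position enters and leaves the window at most once, so apart from the cost of emitting matches the work per document stays linear in the two position lists for any k.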
Positional index size
- You can compress position values/offsets: we’ll talk about that in lecture 5
- Nevertheless, a positional index expands postings storage substantially
- Nevertheless, a positional index is now standardly used because of the power and usefulness of phrase and proximity queries … whether used explicitly or implicitly in a ranking retrieval system.
Sec. 2.4.2
Positional index size
- Need an entry for each occurrence, not just once per document
- Index size depends on average document size (Why?)
  - Average web page has <1000 terms
  - SEC filings, books, even some epic poems … easily 100,000 terms
- Consider a term with frequency 0.1%

| Document size | Postings | Positional postings |
|---|---|---|
| 1000 | 1 | 1 |
| 100,000 | 1 | 100 |
Sec. 2.4.2
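
The table's arithmetic, as a tiny sketch: a term with frequency 0.1% is expected about once per 1000 tokens, so a document contributes one non-positional posting but one positional entry per occurrence. The function name and the integer rounding are assumptions.

```python
def postings_per_document(doc_length, occurrences_per_1000_terms=1):
    """Return (non-positional, positional) posting counts one document contributes for one term."""
    occurrences = doc_length * occurrences_per_1000_terms // 1000
    if occurrences == 0:
        return 0, 0
    return 1, occurrences

for doc_length in (1000, 100_000):
    print(doc_length, postings_per_document(doc_length))
# 1000 (1, 1)
# 100000 (1, 100)
```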
Rules of thumb
- A positional index is 2–4 times as large as a non-positional index
- Positional index size 35–50% of volume of original text
- Caveat: all of this holds for “English-like” languages
Sec. 2.4.2
Combination schemes
- These two approaches can be profitably combined
  - For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists
    - Even more so for phrases like “The Who”
- Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme
  - A typical web query mixture was executed in ¼ of the time of using just a positional index
  - It required 26% more space than having a positional index alone
Sec. 2.4.3
Resources for today’s lecture
- IIR 2
- MG 3.6, 4.3; MIR 7.2
- Porter’s stemmer: http://www.tartarus.org/~martin/PorterStemmer/
- Skip Lists theory: Pugh (1990)
  - Multilevel skip lists give same O(log n) efficiency as trees
- H.E. Williams, J. Zobel, and D. Bahle. 2004. “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems. http://www.seg.rmit.edu.au/research/research.php?author=4
- D. Bahle, H. Williams, and J. Zobel. Efficient phrase querying with an auxiliary index. SIGIR 2002, pp. 215-221.