INFORMATION RETRIEVAL AND WEB SEARCH CC437 (Based on original material by Udo Kruschwitz)
INFORMATION RETRIEVAL
GOAL: Find the documents most relevant to a certain QUERY
Latest development: WEB SEARCH
– Use the Web as the collection of documents
Related:
– QUESTION-ANSWERING
– DOCUMENT CLASSIFICATION
INFORMATION RETRIEVAL: SUBTASKS
INDEX the documents in the collection
– (offline)
PROCESS the query
EVALUATE SIMILARITY and find RANKs
– Find documents most closely matching the query
DISPLAY results / enter a DIALOGUE
– E.g., user may refine the query
DOCUMENTS AS BAGS OF WORDS
DOCUMENT:
broad tech stock rally may signal trend - traders.
technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.
INDEX:
broad, may, rally, rallied, signal, stock, stocks, tech, technology, traders, traders, trend
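A minimal sketch of the bag-of-words idea, assuming whitespace tokenization and a hypothetical helper name `bag_of_words` (neither comes from the slides). Word order is thrown away and only the terms and their counts are kept.

```python
from collections import Counter

def bag_of_words(text):
    """Return a multiset (term -> count) of the tokens in text."""
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    tokens = (tok.strip(".,-") for tok in text.lower().split())
    # Drop tokens that were pure punctuation (e.g. the lone "-").
    return Counter(tok for tok in tokens if tok)

doc = "broad tech stock rally may signal trend - traders."
bag = bag_of_words(doc)
print(sorted(bag))
```

Sorting the keys gives an alphabetical term list like the one on the slide; the counts are what later weighting schemes build on.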
SUBTASKS I: INDEXING
PREPROCESSING
Deletion of STOPWORDS
STEMMING
Selection of INDEX TERMS
INDEXING I: PREPROCESSING
PUNCTUATION REMOVAL
– (Crestani et al)
CASE FOLDING
– London → london
– LONDON → london
DIGIT REMOVAL
– But: SPARCStation 5
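The three preprocessing steps can be sketched as follows. The function name `preprocess` and the `remove_digits` switch are illustrative assumptions; digit removal is left optional because, as the SPARCStation 5 example suggests, digits sometimes carry meaning.

```python
import string

def preprocess(text, remove_digits=False):
    """Case-fold, remove punctuation, optionally remove digits; return tokens."""
    # Case folding: "London" and "LONDON" both become "london".
    text = text.lower()
    # Punctuation removal: replace every punctuation character with a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    if remove_digits:
        # Digit removal: delete the characters 0-9.
        text = text.translate(str.maketrans("", "", string.digits))
    return text.split()

print(preprocess("London, LONDON!"))
print(preprocess("SPARCStation 5", remove_digits=True))
```

With `remove_digits=True` the model number in "SPARCStation 5" is lost, which is the drawback the slide points to.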
INDEXING II: STOPWORD REMOVAL
Very frequent words are not good discriminators
– Many of these are CLOSED CLASS words
INQUERY’s list of stop words beginning with letter “a”:
a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, already, also, although, always, among, amongst, am, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anywhere, apart, are, around, as, at
Domain-specific stopwords
– search, webmaster, copyright, www
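Stopword removal is a simple set-membership filter. The sketch below uses only a small subset of the list above, plus the domain-specific words; the names `STOPWORDS`, `DOMAIN_STOPWORDS` and `remove_stopwords` are assumptions made for illustration.

```python
# A small subset of the general stopword list (illustrative, not complete).
STOPWORDS = {"a", "about", "above", "across", "after", "all", "also",
             "an", "and", "are", "as", "at"}
# Domain-specific stopwords for a web collection.
DOMAIN_STOPWORDS = {"search", "webmaster", "copyright", "www"}

def remove_stopwords(tokens, stopwords=STOPWORDS | DOMAIN_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]

print(remove_stopwords(["all", "about", "stock", "search"]))  # ['stock']
```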
INDEXING III: STEMMING
Simplest: suffix stripping
PORTER STEMMER: inflectional & derivational morphology
– develop → develop
– developing → develop
– development → develop
– developments → develop
– BUT: photography → photographi
The effectiveness of stemming:
– For English: the increase in recall doesn’t compensate for the loss in precision
– For other languages: necessary
  E.g., Abdul Goweder’s dissertation
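To make "suffix stripping" concrete, here is a deliberately naive sketch. It is not the Porter stemmer: the suffix list, the minimum stem length and the name `strip_suffix` are all assumptions chosen to reproduce the "develop" examples above.

```python
# Suffixes are tried longest first, so "ments" wins over "s".
SUFFIXES = ("ments", "ment", "ing", "s")

def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix, if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["develop", "developing", "development", "developments"]:
    print(w, "->", strip_suffix(w))
```

A real stemmer applies ordered rewrite rules with conditions on the stem, which is why Porter produces non-words such as "photographi"; this sketch leaves "photography" unchanged.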
STORAGE
Requirements
– Huge amounts of data
– Lots of redundancy
– Quick random access necessary
Indexing techniques:
– Inverted index files
– Suffix trees / suffix arrays
– Signature files
STORAGE TECHNIQUES: INVERTED INDEX
DOCUMENT 1:
broad tech stock rally may signal trend - traders.
DOCUMENT 2:
technology stocks rallied on tuesday, with gains scored broadly across many sectors, amid what some traders called a recovery from recent doldrums.
INVERTED INDEX:
broad {1}
gain {2}
rally {1,2}
score {2}
signal {1}
stock {1,2}
tech {1}
technology {2}
traders {1,2}
trend {1}
tuesday {2}
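An inverted index maps each index term to the documents that contain it. The sketch below assumes the documents have already been preprocessed, stemmed and reduced to index terms (so "stocks" is stored as "stock"); the name `build_inverted_index` is illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    # Sort terms and postings so the index prints in a stable order.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

# Index terms for the two example documents (selection assumed, after stemming).
docs = {
    1: ["broad", "tech", "stock", "rally", "signal", "trend", "traders"],
    2: ["technology", "stock", "rally", "tuesday", "gain", "score", "traders"],
}
index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```

Looking up a query term is then a single dictionary access, which is what makes quick random access possible.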
SIMILARITY MODELS
Boolean model
Probabilistic model
Vector-space model
THE BOOLEAN MODEL
Each index term is either present or absent
Documents are either RELEVANT or NOT RELEVANT (no grading of results)
Advantages
– Clean formalism, simple to implement
Disadvantages
– Exact matching only
– All index terms equal weight
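Boolean retrieval reduces to set operations on postings. The following sketch uses a small hand-written index and a hypothetical helper `docs_with`; AND, OR and NOT become set intersection, union and difference.

```python
# A tiny inverted index: term -> set of document ids (illustrative data).
index = {
    "rally": {1, 2},
    "stock": {1, 2},
    "tech": {1},
    "technology": {2},
    "tuesday": {2},
}

def docs_with(term):
    """Documents containing the term; an unknown term matches nothing."""
    return index.get(term, set())

# stock AND rally AND NOT tuesday
print((docs_with("stock") & docs_with("rally")) - docs_with("tuesday"))  # {1}
# tech OR technology
print(docs_with("tech") | docs_with("technology"))  # {1, 2}
```

The result is an unranked set: a document either satisfies the expression or it does not, and a query for "tech" alone misses the document that only says "technology".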
THE VECTOR SPACE MODEL
Query and documents represented as vectors of index terms, assigned non-binary WEIGHTS
Similarity calculated using vector algebra: COSINE (cf. lexical similarity models)
– RANKED similarity
Most popular of all models (cf. Salton and Lesk’s SMART)
SIMILARITY IN VECTOR SPACE MODELS: THE COSINE MEASURE
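\[ \cos\theta = \frac{\vec{q_k} \cdot \vec{d_j}}{|\vec{q_k}| \, |\vec{d_j}|} \]

\[ sim(\vec{q_k}, \vec{d_j}) = \frac{\sum_{i=1}^{N} w_{i,k} \, w_{i,j}}{\sqrt{\sum_{i=1}^{N} w_{i,k}^{2}} \; \sqrt{\sum_{i=1}^{N} w_{i,j}^{2}}} \]

(θ is the angle between the document vector d_j and the query vector q_k)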
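A minimal sketch of the cosine measure for two weight vectors of equal length. The function name `cosine` and the convention of returning 0.0 for an all-zero vector are assumptions; the measure itself is undefined in that case.

```python
import math

def cosine(q, d):
    """Cosine of the angle between weight vectors q and d (same length)."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_q * norm_d)

print(cosine([1, 0], [0, 1]))  # no shared terms: 0.0
print(cosine([3, 4], [3, 4]))  # identical vectors: 1.0
```

Because the vectors are length-normalized, a long document is not favoured merely for containing more words, and documents can be ranked by the resulting score.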
TERM WEIGHTING IN VECTOR SPACE MODELS: THE TF.IDF MEASURE
\[ tfidf_{i,k} = f_{i,k} \cdot \log\frac{N}{df_i} \]
– f_{i,k}: FREQUENCY of term i in document k
– df_i: number of documents with term i
– N: total number of documents in the collection
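The weighting can be sketched directly from the formula. The natural logarithm is an assumption (the slide does not fix the base), as are the name `tf_idf` and the choice to return 0.0 for a term that occurs in no document.

```python
import math

def tf_idf(term, doc, collection):
    """tf.idf weight of term in doc; collection is a list of token lists."""
    f = doc.count(term)                               # frequency of term in doc
    df = sum(1 for d in collection if term in d)      # documents with the term
    if df == 0:
        return 0.0                                    # assumed convention
    return f * math.log(len(collection) / df)

collection = [["stock", "rally", "stock"], ["rally"], ["trend"]]
print(tf_idf("stock", collection[0], collection))  # 2 * log(3/1)
print(tf_idf("rally", collection[0], collection))  # 1 * log(3/2)
```

A term that appears in every document gets weight 0, since log(N/N) = 0: it does not help to discriminate between documents.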
EVALUATION
One of the most important contributions of IR to NLE has been the development of better ways of evaluating systems than simple accuracy
Simplest quantitative evaluation metrics
ACCURACY: percentage correct (against some gold standard)
– e.g., tagger gets 96.7% of tags correct when evaluated using the Penn Treebank
– Problem with accuracy: only really useful when classes are of approximately equal size (not the case in IR)
ERROR: percentage wrong
– ERROR REDUCTION most typical metric in ASR
A more general form of evaluation: precision & recall
Positives and negatives
[Figure: TRUE POSITIVES (TP), FALSE POSITIVES (FP), FALSE NEGATIVES, TRUE NEGATIVES]
Precision and recall
PRECISION: proportion correct AMONG SELECTED ITEMS
RECALL: proportion of correct items selected
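In terms of positives and negatives, precision is TP / (TP + FP) and recall is TP / (TP + FN). A minimal sketch over sets of document ids, with an assumed function name and an assumed convention of 0.0 when a denominator is empty:

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) for a set of selected and a set of relevant items."""
    selected, relevant = set(selected), set(relevant)
    tp = len(selected & relevant)                            # true positives
    precision = tp / len(selected) if selected else 0.0     # TP / (TP + FP)
    recall = tp / len(relevant) if relevant else 0.0        # TP / (TP + FN)
    return precision, recall

# Four documents returned, three relevant in the collection, two in common.
p, r = precision_recall({1, 2, 3, 4}, {1, 2, 5})
print(p, r)
```

Here precision is 2/4 and recall is 2/3. True negatives do not enter either measure, which is why both remain informative when relevant documents are a tiny fraction of the collection.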
The tradeoff between precision and recall
Easy to get high precision: select hardly anything (only the surest items)
Easy to get high recall: return everything
Really need to report BOTH, or F-measure
\[ F = \frac{2PR}{R + P} \]
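The balanced F-measure is the harmonic mean of precision and recall. A short sketch, assuming the name `f_measure` and returning 0.0 when both inputs are zero:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)

# "Return everything": perfect recall, but precision is tiny.
print(f_measure(0.01, 1.0))
# Balanced system.
print(f_measure(0.5, 0.5))  # 0.5
```

The harmonic mean stays close to the smaller of the two values, so neither trivial strategy (returning everything, or returning almost nothing) scores well.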
WEB SEARCH
In many senses, just a form of IR
But:
– Further information one has to take into account
  Markup
  Hyperlinks
  Meta tags
– Extra problems
  Documents highly heterogeneous
  Multimedia
  Quality of data
Key aspects of Google’s search algorithm (as far as we know!)
– Analyze link structure: PAGE RANK
– Exploit visual presentation
Page Rank used to rank retrieved documents in addition to similarity measures
Page Rank motivations:
– Most important papers are those cited most often
– Not all sources of citations are equally reliable
PAGE RANK
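\[ PageRank(p) = q + (1 - q) \cdot \sum_{i=1}^{k} \frac{PageRank(p_i)}{C(p_i)} \]
– p: Page p
– q: Probability q of randomly jumping to that page
– p_1 ... p_k: Pages pointing to p
– C(p_i): number of links going out of page p_i (as in Brin and Page, 1998)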
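Page Rank is usually computed by iterating the formula until the values settle. The sketch below is an illustration under stated assumptions, not Google's implementation: q = 0.15, a fixed number of iterations, a hypothetical function name `page_rank`, and a toy graph in which every page has at least one outgoing link.

```python
def page_rank(links, q=0.15, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 for p in pages}  # initial guess
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Pages pointing to p, each sharing its rank over its C(p_i) out-links.
            incoming = [s for s in pages if p in links[s]]
            new_rank[p] = q + (1 - q) * sum(rank[s] / len(links[s]) for s in incoming)
        rank = new_rank
    return rank

# Toy web: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(links)
for page, score in sorted(ranks.items()):
    print(page, round(score, 3))
```

C ends up with the highest score because two pages point to it, and A benefits from being the only page C links to: a citation from an important page counts for more.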
READINGS AND REFERENCES
Jurafsky and Martin, chapter 10.1-10.4
Other references
– Brin, S. and Page, L., 1998, “The anatomy of a large-scale hypertextual web search engine”, in Proc. of the 7th WWW Conference (WWW7), Brisbane
– Crestani, F. et al., 1998, “Is this document relevant? …probably”, ACM Computing Surveys, 30(4):528-552
– Goweder, A., 2004, The role of stemming in IR: the case of Arabic, PhD dissertation, University of Essex
– Porter, M. F., 1980, “An algorithm for suffix stripping”, Program, 14(3):130-137
– Salton, G. and Lesk, M. E., 1968, “Computer evaluation of indexing and text processing”, Journal of the ACM, 15(1):8-36