Information Retrieval and Text Processing
• Huge literature dating back to the 1950s!
• SIGIR/TREC - home for much of this
• Readings:
– Salton, Wong, Yang, “A Vector Space Model for Automatic Indexing,” CACM, Nov. 1975, Vol. 18, No. 11
– Turtle, Croft, “Inference Networks for Document Retrieval,” ???, [OPTIONAL]
IR/TP applications
• Search
• Filtering
• Summarization
• Classification
• Clustering
• Information extraction
• Knowledge management
• Author identification
• …and more...
Types of search
• Recall -- finding documents one knows exist, e.g., an old e-mail message or RFC
• Discovery -- finding “interesting” documents given a high-level goal
• Classic IR search is focused on discovery
Classic discovery problem
• Corpus: fixed collection of documents, typically “nice” docs (e.g., NYT articles)
• Problem: retrieve documents relevant to user’s information need
Definitions
• Task: what the user wants to accomplish, e.g., write a Web crawler
• Information need: perception of documents needed to accomplish task, e.g., Web specs
• Query: sequence of characters given to a search engine that one hopes will return the desired documents
Conception
• Translating task into information need
• Mis-conception: identify too little (tips on high-bandwidth DNS lookups) and/or too much (TCP spec) as relevant to task
• Sometimes a little extra breadth in results can tip the user off to the need to refine the info need, but there is not much research into handling this automatically
Translation
• Translating info need into query syntax of particular search engine
• Mis-translation: get this wrong
– Operator error (is “a b” == a&b or a|b ?)
– Polysemy -- same word, different meanings
– Synonymy -- different words, same meaning
• Automation: “NLP”, “easy syntax”, “query expansion”, “Q&A”
Refinement
• Modification of query, typically in light of particular results, to better meet info need
• Lots of work on refining queries automatically (often with some input from the user, e.g., “relevance feedback”)
Precision and recall
• Classic metrics of search-result “goodness”
• Recall = fraction of all good docs retrieved
– |relevant results| / |all relevant docs in corpus|
• Precision = fraction of results that are good
– |relevant results| / |result-set size| (see the sketch below)
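As a concrete check of the two fractions, a minimal Python sketch (the doc ids are made up for illustration):

```python
# Minimal sketch of set-based precision and recall; doc ids are hypothetical.
def precision_recall(results, relevant):
    """results: doc ids returned; relevant: all relevant doc ids in the corpus."""
    hits = set(results) & set(relevant)
    precision = len(hits) / len(results) if results else 0.0  # |relevant results| / |result-set size|
    recall = len(hits) / len(relevant) if relevant else 0.0   # |relevant results| / |all relevant docs|
    return precision, recall

# 3 of the 5 returned docs are relevant; the corpus holds 6 relevant docs in total.
print(precision_recall(["d1", "d2", "d3", "d4", "d5"],
                       ["d1", "d3", "d5", "d7", "d8", "d9"]))  # -> (0.6, 0.5)
```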
Precision and recall
• Recall/precision trade-off:
– Return everything ==> great recall, bad precision
– Return nothing ==> great precision, bad recall
• Precision curves
– Search engine produces total ranking
– Plot precision at 10%, 20%, ..., 100% recall (see the sketch below)
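One way to realize this: walk down the ranking and record precision each time recall increases; the 10%...100% points can then be read off (or interpolated from) that curve. A sketch over a made-up ranking:

```python
# Sketch of a precision/recall curve over a total ranking.
# ranked_labels marks each result as relevant (1) or not (0), in rank order;
# the labels below are hypothetical.
def precision_recall_curve(ranked_labels):
    total_relevant = sum(ranked_labels)
    points, seen = [], 0
    for rank, rel in enumerate(ranked_labels, start=1):
        seen += rel
        if rel:  # recall only changes when a relevant doc is retrieved
            points.append((seen / total_relevant, seen / rank))  # (recall, precision)
    return points

print(precision_recall_curve([1, 0, 1, 1, 0, 0, 1]))
# -> [(0.25, 1.0), (0.5, 0.67), (0.75, 0.75), (1.0, 0.57)]  (precision rounded)
```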
Other metrics
• Novelty / anti-redundancy
– Information content of result set is disjoint
• Comprehensible
– Returned documents can be understood by user
• Accurate / authoritative
– Citation ranking!!
• Freshness
Term vector basics
Document Id | Automobile | Carburetor | Feline | Jaguar | …
Doc 1       |     2      |     3      |   0    |   2    |
Doc 2       |     0      |     0      |   2    |   2    |
Doc 3       |     2      |     0      |   0    |   2    |
…
• Basic abstraction for information retrieval
• Useful for measuring “semantic” similarity of text
• A row in the above table is a “term vector”
• Columns are word stems and phrases
• Trying to capture “meaning”
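One common concrete encoding (an assumption here, not prescribed by the slides) is a map from term to count per document; the dicts below mirror the table above, and a toy builder shows where the counts come from:

```python
# Term vectors as term -> count maps, mirroring the table above
# (zero entries shown for clarity; sparse representations would omit them).
doc1 = {"automobile": 2, "carburetor": 3, "feline": 0, "jaguar": 2}
doc2 = {"automobile": 0, "carburetor": 0, "feline": 2, "jaguar": 2}
doc3 = {"automobile": 2, "carburetor": 0, "feline": 0, "jaguar": 2}

# Toy vector builder from raw text: real systems also stem words,
# detect phrases, and drop stop words.
from collections import Counter

def term_vector(text):
    return Counter(text.lower().split())
```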
Everything’s a vector!!
• Documents are vectors
• Document collections are vectors
• Queries are vectors
• Topics are vectors
Cosine measurement of similarity
• cos(E1, E2) = E1 · E2 / (|E1| * |E2|)
• Rank docs against queries, measure similarity of docs, etc.
• In the example:
– cos(doc1, doc2) ~ 1/3
– cos(doc1, doc3) ~ 2/3
– cos(doc2, doc3) ~ 1/2
– So: doc1 & doc3 are closest (see the sketch below)
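A minimal sketch that reproduces these numbers from the table’s term vectors (dict-of-counts encoding assumed, as above):

```python
# Cosine similarity between two term vectors (term -> weight dicts);
# doc1/doc2/doc3 repeat the table's counts (zero entries omitted).
import math

doc1 = {"automobile": 2, "carburetor": 3, "jaguar": 2}
doc2 = {"feline": 2, "jaguar": 2}
doc3 = {"automobile": 2, "jaguar": 2}

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(round(cosine(doc1, doc2), 2))  # 0.34, ~1/3
print(round(cosine(doc1, doc3), 2))  # 0.69, ~2/3
print(round(cosine(doc2, doc3), 2))  # 0.5,  ~1/2
```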
Weighting of terms in vectors
• Salton’s “TF*IDF”
– TF = term frequency in document
– DF = doc frequency of term (# docs with term)
– IDF = inverse doc freq. = 1/DF
– Weight of term = TF * IDF (see the sketch below)
• “Importance” of term determined by:
– Count of term in doc (high ==> important)
– Number of docs with term (low ==> important)
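A sketch of this weighting over the three example documents; IDF is taken literally as 1/DF per the slide (log(N/DF) is a common variant), and the doc dicts are the same toy vectors as above:

```python
# TF*IDF weighting over a small corpus of term-count vectors.
doc1 = {"automobile": 2, "carburetor": 3, "jaguar": 2}
doc2 = {"feline": 2, "jaguar": 2}
doc3 = {"automobile": 2, "jaguar": 2}

def tf_idf(doc_vectors):
    """doc_vectors: dict of doc_id -> {term: raw count}."""
    df = {}
    for vec in doc_vectors.values():
        for term, count in vec.items():
            if count > 0:
                df[term] = df.get(term, 0) + 1
    return {doc_id: {t: c / df[t] for t, c in vec.items() if c > 0}
            for doc_id, vec in doc_vectors.items()}

weighted = tf_idf({"doc1": doc1, "doc2": doc2, "doc3": doc3})
# "carburetor" appears in 1 doc (DF=1) so it keeps full weight 3;
# "jaguar" appears in all 3 (DF=3) so it is scaled down to 2/3.
print(weighted["doc1"])
```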
Relevance-feedback in VSM
• Rocchio formula:
– Q’ = F[Q, Relevant, Irrelevant]
– Where F is a weighted sum, such as:
Q’[t] = a*Q[t] + b*sum_i R_i[t] + c*sum_i I_i[t]
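A minimal sketch of this update over term-vector dicts. The slide leaves F as a generic weighted sum; in the classic Rocchio formulation the irrelevant-document term is subtracted (i.e., c is effectively negative), and the coefficient values and the dropping of negative weights below are illustrative assumptions:

```python
# Rocchio-style query update over term -> weight dicts; a, b, c are illustrative.
doc2 = {"feline": 2, "jaguar": 2}
doc3 = {"automobile": 2, "jaguar": 2}

def rocchio(query, relevant, irrelevant, a=1.0, b=0.75, c=0.15):
    terms = set(query) | {t for v in relevant for t in v} | {t for v in irrelevant for t in v}
    new_q = {}
    for t in terms:
        w = (a * query.get(t, 0.0)
             + b * sum(v.get(t, 0.0) for v in relevant)
             - c * sum(v.get(t, 0.0) for v in irrelevant))
        if w > 0:
            new_q[t] = w  # negative weights are commonly dropped
    return new_q

# User marks doc3 relevant and doc2 irrelevant for the query {"jaguar": 1}.
print(rocchio({"jaguar": 1.0}, relevant=[doc3], irrelevant=[doc2]))
# -> {'jaguar': 2.2, 'automobile': 1.5}  (order may vary)
```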
Remarks on VSM
• Principled way of solving many IR/text processing problems, not just search
• Tons of variations on VSM
– Different term weighting schemes
– Different similarity formulas
• Normalization itself is a huge sub-industry
All of this goes out the window on the Web
• Very small, unrefined queries
• Recall not an issue
– Quality is the issue (want most relevant)
– Precision-at-ten matters (how many total losers)
• Scale precludes heavy VSM techniques
• Corpus assumptions (e.g., unchanging, uniform quality) do not hold
• “Adversarial IR” - new challenge on Web
• Still, VSM important tool for Web Archeology