CSE 5243 INTRO. TO DATA MINING (web.cse.ohio-state.edu/~sun.397/courses/au2017/FPM...)
CSE 5243 INTRO. TO DATA MINING
Slides adapted from Prof. Srinivasan Parthasarathy @OSU
Graph Data & Introduction to Information Retrieval
Huan Sun, CSE@The Ohio State University 11/21/2017
2
GRAPH BASICS AND A GENTLE INTRODUCTION TO PAGERANK
Chapter 4 Graph Data: http://www.dataminingbook.info/pmwiki.php/Main/BookPathUploads?action=downloadman&upname=book-20160121.pdf , http://www.dataminingbook.info/pmwiki.php
3
Background
Besides the keywords, what other evidence can one use to rate the importance of a webpage?
Solution: Use the hyperlink structure
E.g., a webpage linked to by many webpages is probably important, but counting inlinks alone is not a global (comprehensive) measure.
PageRank was developed by Larry Page in 1998.
4
Idea
A graph represents the WWW: a node is a webpage; a directed edge is a hyperlink.
A user randomly clicks hyperlinks to surf the WWW. The probability that the user stops at a particular webpage is that page's PageRank value.
A node that is linked to by many nodes with high PageRank values receives a high rank itself; if there are no links to a node, there is no support for that page.
5
Formal Formulation
6
Formal Formulation
7
Iterative Computation
8
Example 1
PageRank Calculation: first iteration
= the transpose of A (the adjacency matrix)
9
Example 1
PageRank Calculation: second iteration
10
Example 1
Convergence after some iterations
11
A simple version
u: a webpage
B_u: the set of u's backlinks
N_v: the number of forward links of page v
Initially, R(u) is 1/N for every webpage. Iteratively update each webpage's PR value until convergence:

R(u) = ∑_{v ∈ B_u} R(v) / N_v
12
A little more advanced version
Adding a damping factor d: imagine that a surfer stops clicking hyperlinks with probability 1 − d.
R(u) is then at least (1 − d)/N, where N is the total number of nodes:

R(u) = (1 − d)/N + d · ∑_{v ∈ B_u} R(v) / N_v
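A minimal Python sketch of this damped iteration (all names here are illustrative, not from the slides; it assumes every node has at least one forward link, since dangling nodes would need extra handling, and setting d = 1 recovers the simple version):

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=100):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}              # initialize R(u) = 1/N
    out_deg = {u: len(links[u]) for u in nodes}     # N_v = forward links of v
    for _ in range(max_iter):
        # R(u) = (1 - d)/N + d * sum over backlinks v of u of R(v)/N_v
        new_rank = {u: (1 - d) / n +
                       d * sum(rank[v] / out_deg[v] for v in nodes if u in links[v])
                    for u in nodes}
        if max(abs(new_rank[u] - rank[u]) for u in nodes) < tol:
            return new_rank
        rank = new_rank
    return rank
```

For example, on a 3-node cycle a → b → c → a every page ends up with the same rank, 1/3.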
13
Other applications
Social network (Facebook, Twitter, etc.): Node: person; Edge: follower/followee/friend. Higher PR value: celebrity.
Citation network: Node: paper; Edge: citation. Higher PR values: important papers.
Protein-protein interaction network: Node: protein; Edge: two proteins bind together. Higher PR values: essential proteins.
SEARCH ENGINES
INFORMATION RETRIEVAL IN PRACTICE
BOOK: HTTP://CIIR.CS.UMASS.EDU/DOWNLOADS/SEIRIP.PDF
SLIDES: HTTP://WWW.SEARCH-ENGINES-BOOK.COM/SLIDES/
All slides ©Addison Wesley, 2008
Slides adapted from Prof. W. Bruce Croft @UMASS
15
Search Engines and Information Retrieval
Information Retrieval in Practice
16
Search and Information Retrieval
Search on the Web is a daily activity for many people throughout the world.
Search and communication are the most popular uses of the computer. Applications involving search are everywhere. The field of computer science most involved with R&D for search is information retrieval (IR).
17
Information Retrieval
“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)
General definition that can be applied to many types of information and search applications
Primary focus of IR since the 50s has been on text and documents
18
What is a Document?
Examples: web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, IM sessions, etc.
Common properties: significant text content; some structure (e.g., title, author, date for papers; subject, sender, destination for email).
19
Documents vs. Database Records
Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes), e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
It is easy to compare fields with well-defined semantics to queries in order to find matches.
Text is more difficult
20
Documents vs. Records
Example bank database query: Find records with balance > $50,000 in branches located in Amherst, MA. Matches are easily found by comparison with field values of records.
Example search engine query: bank scandals in western mass. This text must be compared to the text of entire news stories.
21
Comparing Text
Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval.
Exact matching of words is not enough: there are many different ways to write the same thing in a “natural language” like English. E.g., does a news story containing the text “bank director in Amherst steals funds” match the query? Some stories will be better matches than others.
22
Dimensions of IR
IR is more than just text, and more than just web search although these are central
People doing IR work with different media, different types of search applications, and different tasks
23
Other Media
New applications increasingly involve new media, e.g., video, photos, music, speech.
Like text, such content is difficult to describe and compare; text may be used to represent it (e.g., tags).
IR approaches to search and evaluation are appropriate.
24
Dimensions of IR
Content        Applications        Tasks
Text           Web search          Ad hoc search
Images         Vertical search     Filtering
Video          Enterprise search   Classification
Scanned docs   Desktop search      Question answering
Audio          Forum search
Music          P2P search
               Literature search
25
IR Tasks
Ad-hoc search: find relevant documents for an arbitrary text query
Filtering: identify relevant user profiles for a new document
Classification: identify relevant labels for documents
Question answering: give a specific answer to a question
26
Big Issues in IR
Relevance: what is it? A simple (and simplistic) definition: a relevant document contains the information that a person was looking for when they submitted a query to the search engine.
Many factors influence a person's decision about what is relevant: e.g., task, context, novelty, style.
Topical relevance (same topic) vs. user relevance (everything else).
27
Big Issues in IR
Relevance: retrieval models define a view of relevance. Ranking algorithms used in search engines are based on retrieval models.
Most models describe statistical properties of text rather than linguistic ones, i.e., counting simple text features such as words instead of parsing and analyzing the sentences. The statistical approach to text processing started with Luhn in the 50s. Linguistic features can be part of a statistical model.
28
Big Issues in IR
Evaluation: experimental procedures and measures for comparing system output with user expectations. Originated in the Cranfield experiments in the 60s.
IR evaluation methods are now used in many fields. They typically use a test collection of documents, queries, and relevance judgments; the most commonly used are the TREC collections.
Recall and precision are two examples of effectiveness measures.
29
Big Issues in IR
Users and Information Needs: search evaluation is user-centered. Keyword queries are often poor descriptions of actual information needs. Interaction and context are important for understanding user intent. Query refinement techniques such as query expansion, query suggestion, and relevance feedback improve ranking.
30
IR and Search Engines
A search engine is the practical application of information retrieval techniques to large-scale text collections.
Web search engines are the best-known examples, but there are many others. Open source search engines are important for research and development, e.g., Lucene, Lemur/Indri, Galago.
Big issues include main IR issues but also some others
31
IR and Search Engines
Information Retrieval - Search Engines
Relevance - Effective ranking
Evaluation - Testing and measuring
Information needs - User interaction
Performance - Efficient search and indexing
Incorporating new data - Coverage and freshness
Scalability - Growing with data and users
Adaptability - Tuning for applications
Specific problems - e.g., Spam
32
Search Engine Issues
Performance: measuring and improving the efficiency of search, e.g., reducing response time, increasing query throughput, increasing indexing speed.
Indexes are data structures designed to improve search efficiency; designing and implementing them are major issues for search engines.
33
Search Engine Issues
Dynamic data: the “collection” for most real applications is constantly changing in terms of updates, additions, and deletions, e.g., web pages.
Acquiring or “crawling” the documents is a major task. Typical measures are coverage (how much has been indexed) and freshness (how recently it was indexed).
Updating the indexes while processing queries is also a design issue.
34
Search Engine Issues
Scalability: making everything work with millions of users every day and many terabytes of documents. Distributed processing is essential.
Adaptability: changing and tuning search engine components such as the ranking algorithm, indexing strategy, and interface for different applications.
35
Architecture of a Search Engine
Information Retrieval in Practice
36
Search Engine Architecture
A software architecture consists of software components, the interfaces provided by those components, and the relationships between them. It describes a system at a particular level of abstraction.
The architecture of a search engine is determined by two requirements: effectiveness (quality of results) and efficiency (response time and throughput).
37
Indexing Process
38
Indexing Process
Text acquisition: identifies and stores documents for indexing.
Text transformation: transforms documents into index terms or features.
Index creation: takes index terms and creates data structures (indexes) to support fast searching.
39
Query Process
40
Query Process
User interaction: supports creation and refinement of the query and display of results.
Ranking: uses the query and indexes to generate a ranked list of documents.
Evaluation: monitors and measures effectiveness and efficiency (primarily offline).
41
Details: Text Acquisition
Crawler: identifies and acquires documents for the search engine. Many types: web, enterprise, desktop.
Web crawlers follow links to find documents. They must efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness). Single-site crawlers are used for site search; topical or focused crawlers for vertical search.
Document crawlers for enterprise and desktop search follow links and scan directories.
42
Text Acquisition
Feeds: real-time streams of documents, e.g., web feeds for news, blogs, video, radio, TV. RSS is a common standard; an RSS “reader” can provide new XML documents to the search engine.
Conversion: convert a variety of documents into a consistent text-plus-metadata format, e.g., HTML, XML, Word, PDF, etc. → XML. Convert text encoding for different languages using a Unicode standard like UTF-8.
43
Text Acquisition
Document data store: stores text, metadata, and other related content for documents. Metadata is information about a document, such as type and creation date; other content includes links and anchor text.
Provides fast access to document contents for search engine components, e.g., result list generation.
A relational database system could be used; more typically, a simpler, more efficient storage system is used due to the huge numbers of documents.
44
Text Transformation
Parser: processes the sequence of text tokens in the document to recognize structural elements, e.g., titles, links, headings, etc.
The tokenizer recognizes “words” in the text; it must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, and separators.
Markup languages such as HTML and XML are often used to specify structure; tags are used to specify document elements, e.g., <h2> Overview </h2>. The document parser uses the syntax of the markup language (or other formatting) to identify structure.
45
Text Transformation
Stopping: remove common words, e.g., “and”, “or”, “the”, “in”. Some impact on efficiency and effectiveness; can be a problem for some queries.
Stemming: group words derived from a common stem, e.g., “computer”, “computers”, “computing”, “compute”. Usually effective, but not for all queries; benefits vary for different languages.
46
Text Transformation
Link Analysis: makes use of links and anchor text in web pages. Link analysis identifies popularity and community information, e.g., PageRank. Anchor text can significantly enhance the representation of pages pointed to by links.
Significant impact on web search; less important in other applications.
47
Text Transformation
Information Extraction: identify classes of index terms that are important for some applications, e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.
Classifier: identifies class-related metadata for documents, i.e., assigns labels to documents, e.g., topics, reading levels, sentiment, genre. Use depends on the application.
48
Index Creation
Document Statistics: gathers counts and positions of words and other features; used in the ranking algorithm.
Weighting: computes weights for index terms, used in the ranking algorithm, e.g., the tf.idf weight, a combination of term frequency in the document and inverse document frequency in the collection.
49
Index Creation
Inversion: the core of the indexing process. Converts document-term information to term-document information for indexing. Difficult for very large numbers of documents.
The format of the inverted file is designed for fast query processing; it must also handle updates. Compression is used for efficiency.
50
Index Creation
Index Distribution: distributes indexes across multiple computers and/or multiple sites. Essential for fast query processing with large numbers of documents. Many variations: document distribution, term distribution, replication.
P2P and distributed IR involve search across multiple sites.
51
User Interaction
Query input: provides an interface and parser for a query language. Most web queries are very simple; other applications may use forms.
A query language is used to describe more complex queries and the results of query transformation, e.g., Boolean queries; the Indri and Galago query languages are similar to the SQL language used in database applications. IR query languages also allow content and structure specifications, but focus on content.
52
User Interaction
Query transformation: improves the initial query, both before and after the initial search. Includes the text transformation techniques used for documents. Spell checking and query suggestion provide alternatives to the original query. Query expansion and relevance feedback modify the original query with additional terms.
53
User Interaction
Results output Constructs the display of ranked documents for a query Generates snippets to show how queries match documents Highlights important words and passages Retrieves appropriate advertising in many applications May provide clustering and other visualization tools
54
Ranking
Scoring: calculates scores for documents using a ranking algorithm; the core component of the search engine. The basic form of the score is ∑_i q_i · d_i, where q_i and d_i are the query and document term weights for term i.
Many variations of ranking algorithms and retrieval models
55
Ranking
Performance optimization: designing ranking algorithms for efficient processing. Term-at-a-time vs. document-at-a-time processing; safe vs. unsafe optimizations.
Distribution: processing queries in a distributed environment. A query broker distributes queries and assembles results. Caching is a form of distributed searching.
56
Evaluation
Logging: logging user queries and interactions is crucial for improving search effectiveness and efficiency. Query logs and clickthrough data are used for query suggestion, spell checking, query caching, ranking, advertising search, and other components.
Ranking analysis: measuring and tuning ranking effectiveness.
Performance analysis: measuring and tuning system efficiency.
57
How Does It Really Work?
The course* explains these components of a search engine in more detail. There are often many possible approaches and techniques for a given component.
The focus is on the most important alternatives, i.e., explaining a small number of approaches in detail rather than many approaches. “Importance” is based on research results and use in actual search engines. Alternatives are described in the references.
* http://www.search-engines-book.com/slides/
RETRIEVAL MODELS
Information Retrieval in Practice
59
Retrieval Models
Retrieval models provide a mathematical framework for defining the search process. This includes an explanation of assumptions; they are the basis of many ranking algorithms and can be implicit.
Progress in retrieval models has corresponded with improvements in effectiveness.
They are theories about relevance.
60
Relevance
Relevance is a complex concept that has been studied for some time. There are many factors to consider, and people often disagree when making relevance judgments.
Retrieval models make various assumptions about relevance to simplify the problem, e.g., topical vs. user relevance, binary vs. multi-valued relevance.
61
Retrieval Model Overview
Older models: Boolean retrieval, Vector Space model
Probabilistic models: BM25, Language models
Combining evidence: Inference networks, Learning to Rank
62
Boolean Retrieval
Two possible outcomes for query processing: TRUE and FALSE. “Exact-match” retrieval; the simplest form of ranking.
Queries are usually specified using Boolean operators: AND, OR, NOT. Proximity operators are also used.
63
Boolean Retrieval
Advantages: results are predictable and relatively easy to explain; many different features can be incorporated; efficient processing, since many documents can be eliminated from the search.
Disadvantages: effectiveness depends entirely on the user; simple queries usually don't work well; complex queries are difficult.
64
Searching by Numbers
A sequence of queries driven by the number of retrieved documents, e.g., a “lincoln” search of news articles:
president AND lincoln
president AND lincoln AND NOT (automobile OR car)
president AND lincoln AND biography AND life AND birthplace AND gettysburg AND NOT (automobile OR car)
president AND lincoln AND (biography OR life OR birthplace OR gettysburg) AND NOT (automobile OR car)
65
Vector Space Model
Documents and queries are represented by vectors of term weights; the collection is represented by a matrix of term weights.
66
Vector Space Model
67
Vector Space Model
3-d pictures useful, but can be misleading for high-dimensional space
68
Vector Space Model
Documents are ranked by the distance between the points representing the query and the documents. A similarity measure is more common than a distance or dissimilarity measure, e.g., cosine correlation.
69
Similarity Calculation
Consider two documents D1, D2 and a query Q D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
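The similarity calculation for these vectors can be sketched in a few lines of Python (a minimal illustration, not the slide's original worked solution):

```python
import math

def cosine(x, y):
    """Cosine correlation between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

D1 = (0.5, 0.8, 0.3)
D2 = (0.9, 0.4, 0.2)
Q = (1.5, 1.0, 0)
# cosine(D1, Q) ≈ 0.87 and cosine(D2, Q) ≈ 0.97, so D2 is ranked first.
```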
70
Term Weights
tf.idf weight:
Term frequency weight measures importance in the document.
Inverse document frequency measures importance in the collection.
Some heuristic modifications exist.
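The slide's weight formulas did not survive extraction; one common formulation (a reconstruction, not necessarily the exact variant on the original slide) is:

```latex
% Term frequency, normalized by document length, and inverse document frequency:
tf_{ik} = \frac{f_{ik}}{\sum_{j} f_{ij}}, \qquad
idf_{k} = \log \frac{N}{n_k}
% f_{ik}: occurrences of term k in document i; N: number of documents;
% n_k: number of documents containing term k. The tf.idf weight is their product.
```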
71
Relevance Feedback
Rocchio algorithm: the optimal query maximizes the difference between the average vector representing the relevant documents and the average vector representing the non-relevant documents.
The query is modified accordingly; α, β, and γ are parameters, with typical values 8, 16, 4.
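A minimal sketch of the Rocchio update over term-weight vectors (illustrative names; the common convention of clipping negative weights to zero is assumed):

```python
def rocchio(query, relevant, nonrelevant, alpha=8, beta=16, gamma=4):
    """query: list of term weights; relevant/nonrelevant: lists of such vectors."""
    n = len(query)
    rel_avg = [sum(d[i] for d in relevant) / len(relevant) for i in range(n)]
    nonrel_avg = [sum(d[i] for d in nonrelevant) / len(nonrelevant) for i in range(n)]
    # q' = alpha*q + beta*avg(relevant) - gamma*avg(nonrelevant), clipped at 0
    return [max(0.0, alpha * q + beta * r - gamma * s)
            for q, r, s in zip(query, rel_avg, nonrel_avg)]
```

The update pushes the query toward the average relevant vector and away from the average non-relevant vector.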
72
Vector Space Model
Advantages: a simple computational framework for ranking; any similarity measure or term weighting scheme could be used.
Disadvantages: assumption of term independence; no predictions about techniques for effective ranking.
73
Probability Ranking Principle
Robertson (1977): “If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”
74
IR as Classification
75
Bayes Classifier
Bayes Decision Rule: a document D is relevant if P(R|D) > P(NR|D).
Probabilities are estimated using Bayes' Rule; a document is classified as relevant if the likelihood ratio P(D|R)/P(D|NR) exceeds P(NR)/P(R). The left-hand side is the likelihood ratio.
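Written out (a standard reconstruction; the slide's original equations were images):

```latex
% Bayes' Rule: P(R|D) = P(D|R)P(R)/P(D). The condition P(R|D) > P(NR|D)
% then rearranges into the likelihood-ratio test:
\frac{P(D|R)}{P(D|NR)} > \frac{P(NR)}{P(R)}
```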
76
Estimating P(D|R)
Assume term independence.
Binary independence model: a document is represented by a vector of binary features indicating term occurrence (or non-occurrence). p_i is the probability that term i occurs (i.e., has value 1) in a relevant document; s_i is the probability of occurrence in a non-relevant document.
77
Binary Independence Model
78
Binary Independence Model
Scoring function is
The query provides information about relevant documents. If we assume p_i is constant and s_i is approximated by the entire collection, we get an idf-like weight.
79
Contingency Table
Gives scoring function:
80
BM25
A popular and effective ranking algorithm based on the binary independence model; adds document and query term weights.
k1, k2, and K are parameters whose values are set empirically. dl is the document length. The typical TREC value for k1 is 1.2; k2 varies from 0 to 1000; b = 0.75.
81
BM25 Example
Query with two terms, “president lincoln” (qf = 1)
No relevance information (r and R are zero)
N = 500,000 documents
“president” occurs in 40,000 documents (n1 = 40,000)
“lincoln” occurs in 300 documents (n2 = 300)
“president” occurs 15 times in the doc (f1 = 15)
“lincoln” occurs 25 times (f2 = 25)
document length is 90% of the average length (dl/avdl = 0.9)
k1 = 1.2, b = 0.75, and k2 = 100
K = 1.2 · (0.25 + 0.75 · 0.9) = 1.11
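The numbers above can be plugged into the standard BM25 scoring function; a minimal sketch (natural log, per-term function; names are illustrative):

```python
import math

def bm25_term(n, f, qf, N, dl_over_avdl, k1=1.2, k2=100, b=0.75, r=0, R=0):
    """One term's BM25 contribution; r = R = 0 means no relevance information."""
    K = k1 * ((1 - b) + b * dl_over_avdl)
    idf_part = math.log(((r + 0.5) / (R - r + 0.5)) /
                        ((n - r + 0.5) / (N - n - R + r + 0.5)))
    doc_part = (k1 + 1) * f / (K + f)         # document term weight
    query_part = (k2 + 1) * qf / (k2 + qf)    # query term weight
    return idf_part * doc_part * query_part

N = 500_000
score = (bm25_term(n=40_000, f=15, qf=1, N=N, dl_over_avdl=0.9) +   # "president"
         bm25_term(n=300, f=25, qf=1, N=N, dl_over_avdl=0.9))       # "lincoln"
# score ≈ 20.6; "lincoln" contributes most, since it is much rarer
```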
82
BM25 Example
83
BM25 Example
Effect of term frequencies
84
Language Model
Unigram language model: a probability distribution over the words in a language. Generation of text consists of pulling words out of a “bucket” according to the probability distribution and replacing them.
N-gram language model: some applications use bigram and trigram language models, where probabilities depend on previous words.
85
Language Model
A topic in a document or query can be represented as a language model, i.e., words that tend to occur often when discussing a topic will have high probabilities in the corresponding language model.
Multinomial distribution over words: text is modeled as a finite sequence of words, where there are t possible words at each point in the sequence. Commonly used, but not the only possibility; it doesn't model burstiness.
86
LMs for Retrieval
Three possibilities:
probability of generating the query text from a document language model
probability of generating the document text from a query language model
comparing the language models representing the query and document topics
These are models of topical relevance.
87
Query-Likelihood Model
Rank documents by the probability that the query could be generated by the document model (i.e., on the same topic).
Given a query, start with P(D|Q); apply Bayes' Rule; assume the prior is uniform and use a unigram model.
88
Estimating Probabilities
The obvious estimate for unigram probabilities is the maximum likelihood estimate, P(q_i|D) = f_{q_i,D}/|D|, which makes the observed value of f_{q_i,D} most likely.
If query words are missing from the document, the score will be zero; missing 1 out of 4 query words is the same as missing 3 out of 4.
89
Smoothing
Document texts are a sample from the language model; missing words should not have zero probability of occurring.
Smoothing is a technique for estimating probabilities for missing (or unseen) words: lower (or discount) the probability estimates for words that are seen in the document text, and assign that “left-over” probability to the estimates for the words that are not seen in the text.
90
Estimating Probabilities
The estimate for unseen words is α_D P(q_i|C), where P(q_i|C) is the probability for query word i in the collection language model for collection C (the background probability) and α_D is a parameter.
The estimate for words that occur is (1 − α_D) P(q_i|D) + α_D P(q_i|C).
Different forms of estimation come from different choices of α_D.
91
Jelinek-Mercer Smoothing
α_D is a constant, λ. This gives the smoothed estimate and a corresponding ranking score.
Logs are used for convenience: there are accuracy problems when multiplying many small numbers.
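The Jelinek-Mercer estimate and score, written out (a standard reconstruction of the slide's missing equations):

```latex
% Mixture of the document MLE and the collection (background) model:
p(q_i|D) = (1-\lambda)\,\frac{f_{q_i,D}}{|D|} + \lambda\,\frac{c_{q_i}}{|C|}
% Ranking score: sum of log probabilities over the n query terms:
\log P(Q|D) = \sum_{i=1}^{n} \log p(q_i|D)
```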
92
Where is tf.idf Weight?
- proportional to the term frequency and inversely proportional to the collection frequency
93
Dirichlet Smoothing
α_D depends on the document length: α_D = μ/(|D| + μ).
This gives the probability estimate P(q_i|D) = (f_{q_i,D} + μ P(q_i|C))/(|D| + μ) and a corresponding document score.
94
Query Likelihood Example
For the term “president”: f_{q_i,D} = 15, c_{q_i} = 160,000
For the term “lincoln”: f_{q_i,D} = 25, c_{q_i} = 2,400
The number of word occurrences in the document, |d|, is assumed to be 1,800
The number of word occurrences in the collection is 10^9 (500,000 documents times an average of 2,000 words)
μ = 2,000
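With these numbers, the Dirichlet-smoothed query likelihood score can be computed directly (a minimal sketch; names are illustrative):

```python
import math

def dirichlet_term(f_qD, c_q, doc_len, coll_len, mu=2000):
    # p(q|D) = (f_{q,D} + mu * c_q / |C|) / (|D| + mu)
    return math.log((f_qD + mu * c_q / coll_len) / (doc_len + mu))

coll_len = 10**9   # 500,000 documents x ~2,000 words
doc_len = 1800
score = (dirichlet_term(15, 160_000, doc_len, coll_len) +   # "president"
         dirichlet_term(25, 2_400, doc_len, coll_len))      # "lincoln"
# score ≈ -10.54: negative, because it sums logs of small probabilities
```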
95
Query Likelihood Example
Negative number, because we are summing logs of small numbers
96
Query Likelihood Example
97
Relevance Models
Relevance model: a language model representing the information need; the query and relevant documents are samples from this model.
P(D|R): the probability of generating the text in a document given a relevance model; the document likelihood model. Less effective than query likelihood due to difficulties comparing across documents of different lengths.
98
Pseudo-Relevance Feedback
Estimate a relevance model from the query and the top-ranked documents. Rank documents by the similarity of the document model to the relevance model. Kullback-Leibler divergence (KL-divergence) is a well-known measure of the difference between two probability distributions.
99
KL-Divergence
Given the true probability distribution P and another distribution Q that is an approximation to P, KL-divergence measures how badly Q approximates P.
Use negative KL-divergence for ranking, and assume the relevance model R is the true distribution (KL-divergence is not symmetric).
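The definition and its use in ranking, written out (a standard reconstruction of the slide's missing equations):

```latex
% KL-divergence of approximation Q from the true distribution P:
KL(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
% Ranking by negative KL-divergence with P = relevance model R and
% Q = document model D, after dropping the term that depends only on
% the query, is rank-equivalent to: \sum_{w} P(w|R) \log P(w|D)
```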
100
KL-Divergence
Given a simple maximum likelihood estimate for P(w|R), based on the frequency in the query text, the ranking score is rank-equivalent to the query likelihood score.
The query likelihood model is thus a special case of retrieval based on the relevance model.
101
Estimating the Relevance Model
The probability of pulling a word w out of the “bucket” representing the relevance model depends on the n query words we have just pulled out. By definition, P(w|R) can be written in terms of the joint probability of observing w together with q_1 ... q_n.
102
Estimating the Relevance Model
Joint probability is
Assume
Gives
103
Estimating the Relevance Model
P(D) is usually assumed to be uniform. P(w, q_1 ... q_n) is then simply a weighted average of the language model probabilities for w in a set of documents, where the weights are the query likelihood scores for those documents.
This is a formal model for pseudo-relevance feedback, a query expansion technique.
104
Pseudo-Feedback Algorithm
105
Example from Top 10 Docs
106
Example from Top 50 Docs
107
Combining Evidence
Effective retrieval requires the combination of many pieces of evidence about a document's potential relevance. So far we have focused on simple word-based evidence, but there are many other types: structure, PageRank, metadata, even scores from different models.
The inference network model is one approach to combining evidence; it uses the Bayesian network formalism.
108
Inference Network
109
Inference Network
The document node (D) corresponds to the event that a document is observed.
Representation nodes (r_i) are document features (evidence). The probabilities associated with those features are based on language models θ estimated using the parameters μ; there is one language model for each significant document structure. r_i nodes can represent proximity features or other types of evidence (e.g., date).
110
Inference Network
Query nodes (q_i) are used to combine evidence from representation nodes and other query nodes. They represent the occurrence of more complex evidence and document features; a number of combination operators are available.
The information need node (I) is a special query node that combines all of the evidence from the other query nodes. The network computes P(I|D, μ).
111
Example: AND Combination
a and b are parent nodes for q
112
Example: AND Combination
Combination must consider all possible states of parents
Some combinations can be computed efficiently
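For the AND operator this reduces to a single product (a standard reconstruction of the slide's missing derivation):

```latex
% With binary parents a, b (true with probabilities p_a, p_b) and a strict AND,
% P(q = true | a, b) is 1 only when both parents are true, so summing over
% the four parent states leaves a single term:
bel_{and}(q) = p_a \, p_b
```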
113
Inference Network Operators
Backup slides
114
115
Galago Query Language
A document is viewed as a sequence of text that may contain arbitrary tags.
A single context is generated for each unique tag name. An extent is a sequence of text that appears within a single begin/end tag pair of the same type as the context.
116
Galago Query Language
117
Galago Query Language
118
Galago Query Language
119
Galago Query Language
120
Galago Query Language
121
Galago Query Language
122
Galago Query Language
123
Galago Query Language
124
Web Search
The most important, but not the only, search application. Major differences from TREC news:
size of collection, connections between documents, range of document types, importance of spam, volume of queries, range of query types.
125
Search Taxonomy
Informational: finding information about some topic, which may be on one or more web pages; topical search.
Navigational: finding a particular web page that the user has either seen before or is assumed to exist.
Transactional: finding a site where a task such as shopping or downloading music can be performed.
126
Web Search
For effective navigational and transactional search, we need to combine features that reflect user relevance.
Commercial web search engines combine evidence from hundreds of features to generate a ranking score for a web page: page content, page metadata, anchor text, links (e.g., PageRank), and user behavior (click logs). Page metadata includes, e.g., “age”, how often the page is updated, the URL of the page, the domain name of its site, and the amount of text content.
127
Search Engine Optimization
SEO: understanding the relative importance of features used in search and how they can be manipulated to obtain better search rankings for a web page. E.g., improve the text used in the title tag, improve the text in heading tags, make sure that the domain name and URL contain important keywords, and try to improve the anchor text and link structure.
Some of these techniques are regarded as inappropriate by search engine companies.
128
Web Search
In TREC evaluations, the most effective features for navigational search are: text in the title, body, and heading (h1, h2, h3, and h4) parts of the document; the anchor text of all links pointing to the document; the PageRank number; and the inlink count.
Given the size of the Web, many pages will contain all query terms, so the ranking algorithm focuses on discriminating between these pages. Word proximity is important.
129
Term Proximity
Many term proximity models have been developed. N-grams are commonly used in commercial web search. A dependence model based on the inference net has been effective in TREC.
130
Example Web Query
131
Machine Learning and IR
There is considerable interaction between these fields. The Rocchio algorithm (60s) is a simple learning approach. In the 80s and 90s, the focus was on learning ranking algorithms based on user feedback; in the 2000s, on text categorization.
These approaches were limited by the amount of training data. Web query logs have generated a new wave of research, e.g., “Learning to Rank”.
132
Generative vs. Discriminative
All of the probabilistic retrieval models presented so far fall into the category of generative models. A generative model assumes that documents were generated from some underlying model (in this case, usually a multinomial distribution) and uses training data to estimate the parameters of the model. The probability of belonging to a class (i.e., the relevant documents for a query) is then estimated using Bayes' Rule and the document model.
133
Generative vs. Discriminative
A discriminative model estimates the probability of belonging to a class directly from the observed features of the document, based on the training data.
Generative models perform well with low numbers of training examples; discriminative models usually have the advantage given enough training data, and can also easily incorporate many features.
134
Discriminative Models for IR
Discriminative models can be trained using explicit relevance judgments or click data in query logs. Click data is much cheaper, but noisier.
E.g., the Ranking Support Vector Machine (SVM) takes as input partial rank information for queries: partial information about which documents should be ranked higher than others.
135
Ranking SVM
Training data is
r is partial rank information: if document d_a should be ranked higher than d_b, then (d_a, d_b) ∈ r_i.
Partial rank information comes from relevance judgments (which allow multiple levels of relevance) or click data. E.g., if d_1, d_2, and d_3 are the documents in the first, second, and third rank of the search output and only d_3 is clicked on, then (d_3, d_1) and (d_3, d_2) will be in the desired ranking for this query.
136
Ranking SVM
Learning a linear ranking function w · d_a, where w is a weight vector that is adjusted by learning and d_a is the vector representation of the features of document d_a; non-linear functions are also possible.
Weights represent the importance of features and are learned using training data.
137
Ranking SVM
Learn w that satisfies as many of the following conditions as possible: for every pair (d_a, d_b) in the partial ranking, w · d_a > w · d_b.
This can be formulated as an optimization problem.
138
Ranking SVM
ξ, known as a slack variable, allows for misclassification of difficult or noisy training examples, and C is a parameter that is used to prevent overfitting
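The standard Ranking SVM optimization, written out (a reconstruction of the slide's missing formula, with one slack variable per document pair):

```latex
\min_{w,\,\xi} \;\; \tfrac{1}{2}\,\|w\|^2 + C \sum_{a,b} \xi_{a,b}
\quad \text{s.t.} \quad
w \cdot d_a \ge w \cdot d_b + 1 - \xi_{a,b}, \qquad \xi_{a,b} \ge 0
```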
139
Ranking SVM
Software is available to do the optimization. Each pair of documents in our training data can be represented by the vector d_a − d_b, and the score for this pair is w · (d_a − d_b).
The SVM classifier will find a w that makes the smallest such score as large as possible, i.e., it makes the differences in scores as large as possible for the pairs of documents that are hardest to rank.
140
Topic Models
Improved representations of documents can also be viewed as improved smoothing techniques: improve the estimates for words that are related to the topic(s) of the document instead of just using background probabilities.
Approaches: Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Indexing (pLSI), Latent Dirichlet Allocation (LDA).
141
LDA
Model document as being generated from a mixture of topics
142
LDA
LDA gives language model probabilities.
These are used to smooth the document representation by mixing them with the query likelihood probability.
143
LDA
If the LDA probabilities are used directly as the document representation, effectiveness is significantly reduced because the features are too smoothed; e.g., in a typical TREC experiment, only 400 topics are used for the entire collection. Generating LDA topics is also expensive.
When used for smoothing, effectiveness is improved.
144
LDA Example
Top words from 4 LDA topics from TREC news
145
Summary
Best retrieval model depends on application and data available
Evaluation corpus (or test collection), training data, and user data are all critical resources
Open source search engines can be used to find effective ranking algorithms Galago query language makes this particularly easy
Language resources (e.g., thesaurus) can make a big difference