Structured Prediction and Active Learning for Information Retrieval

Transcript of "Structured Prediction and Active Learning for Information Retrieval", presented at Microsoft Research Asia, August 21st, 2008, by Yisong Yue (Cornell University).

Page 1:

Structured Prediction and Active Learning for Information Retrieval

Presented at Microsoft Research Asia, August 21st, 2008

Yisong Yue, Cornell University

Joint work with: Thorsten Joachims (advisor), Filip Radlinski, Thomas Finley, Robert Kleinberg, Josef Broder

Page 2: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Outline

• Structured Prediction
  – Complex Retrieval Goals
  – Structural SVMs (Supervised Learning)

• Active Learning
  – Learning From Real Users
  – Multi-armed Bandit Problems

Page 3: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Supervised Learning

• Find a function from input space X to output space Y such that the prediction error is low.

Example input–output pairs (x, y):

x: "Microsoft announced today that they acquired Apple for the amount equal to the gross national product of Switzerland. Microsoft officials stated that they first wanted to buy Switzerland, but eventually were turned off by the mountains and the snowy winters…"  →  y = 1

x: "GATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAGATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCACATTTA"  →  y = −1

x: …  →  y = 7.3

Page 4: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Examples of Complex Output Spaces

• Natural Language Parsing
  – Given a sequence of words x, predict the parse tree y.
  – Dependencies from structural constraints, since y has to be a tree.

x: "The dog chased the cat"  →  y: parse tree  (S → NP VP;  NP → Det N;  VP → V NP;  NP → Det N)

Page 5: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

• Part-of-Speech Tagging
  – Given a sequence of words x, predict sequence of tags y.
  – Dependencies from tag-tag transitions in Markov model.

Similarly for other sequence labeling problems, e.g., RNA Intron/Exon Tagging.

x: "The rain wet the cat"  →  y: Det N V Det N

Examples of Complex Output Spaces

Page 6: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Examples of Complex Output Spaces

• Multi-class Labeling

• Sequence Alignment

• Grammar Trees & POS Tagging

• Markov Random Fields

• Clustering

• Information Retrieval (Rankings)

– Average Precision & NDCG

– Listwise Approaches

– Diversity

– More Complex Goals

Page 7: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Information Retrieval

• Input: x (feature representation of a document/query pair)

• Conventional Approach
  – Real valued retrieval functions f(x)
  – Sort by f(xi) to obtain ranking

• Training Method
  – Human-labeled data (documents labeled by relevance)
  – Learn f(x) using relatively simple criterion
  – Computationally convenient
  – Works pretty well (but we can do better)

Page 8: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Conventional SVMs

• Input: x (high dimensional point)

• Target: y (either +1 or -1)

• Prediction: sign(wTx)

• Training:

  $\arg\min_{w,\,\xi \ge 0}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N}\xi_i$

  subject to:  $\forall i:\ y_i\,(w^T x_i) \ge 1 - \xi_i$

• The sum of slacks $\sum_i \xi_i$ upper bounds the accuracy loss
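To make the objective concrete, here is a minimal sketch (not from the original slides; the toy data, learning rate and epoch count are assumptions) that minimizes the regularized hinge loss above by subgradient descent:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01):
    """Minimize 0.5*||w||^2 + (C/N) * sum_i max(0, 1 - y_i * w.x_i) by subgradient descent."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        viol = margins < 1                              # examples with nonzero slack
        grad = w - (C / N) * (y[viol, None] * X[viol]).sum(axis=0)
        w -= lr * grad
    return w

# Toy usage: two separable clusters; prediction is sign(w.x).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 0.3, (20, 2)), rng.normal(-1, 0.3, (20, 2))])
y = np.array([+1] * 20 + [-1] * 20)
w = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```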

Page 9: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Pairwise Preferences SVM

$\arg\min_{w,\,\xi \ge 0}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i,j}\xi_{i,j}$

Such that:

$\forall (i, j)\ \text{with}\ y_i > y_j:\ \ w^T x_i \ge w^T x_j + 1 - \xi_{i,j}$

Large Margin Ordinal Regression [Herbrich et al., 1999]

Can be solved in O(n log n) time [Joachims, 2005]

Pairs can be reweighted to more closely model IR goals [Cao et al., 2006]
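As a concrete (and hedged) illustration of the pairwise formulation — my own sketch, not the [Herbrich et al., 1999] or [Joachims, 2005] solver — the constraints above reduce to a binary SVM over difference vectors x_i − x_j for pairs with y_i > y_j:

```python
import numpy as np
from itertools import combinations

def pairwise_differences(X, y):
    """One difference vector per pair with y_i > y_j, encoding w.(x_i - x_j) >= 1 - slack."""
    diffs = []
    for i, j in combinations(range(len(y)), 2):
        if y[i] != y[j]:
            hi, lo = (i, j) if y[i] > y[j] else (j, i)
            diffs.append(X[hi] - X[lo])
    return np.array(diffs)

def train_ranking_svm(X, y, C=1.0, epochs=300, lr=0.01):
    D = pairwise_differences(X, y)                      # every difference has implicit label +1
    n, d = D.shape
    w = np.zeros(d)
    for _ in range(epochs):
        viol = (D @ w) < 1
        w -= lr * (w - (C / n) * D[viol].sum(axis=0))
    return w

# Toy usage: rank documents by the learned score w.x.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = (X[:, 0] > 0).astype(int) + (X[:, 0] > 1).astype(int)   # assumed 3-level relevance labels
w = train_ranking_svm(X, y)
print("predicted ranking:", np.argsort(-(X @ w)))
```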

Page 10: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Mean Average Precision

• Consider the rank position of each relevant document: K1, K2, … KR

• Compute Precision@K for each K1, K2, … KR

• Average precision = average of P@K

• Ex: a ranking with relevant documents at positions 1, 3 and 5 has AvgPrec = (1/3)·(1/1 + 2/3 + 3/5) ≈ 0.76

• MAP is Average Precision averaged across multiple queries/rankings
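A small helper (my own code, not from the slides) that computes Average Precision as defined above and reproduces the 0.76 example:

```python
def average_precision(relevance):
    """relevance: 0/1 labels in ranked order; averages Precision@K over relevant positions."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """MAP: mean of Average Precision over multiple queries/rankings."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Relevant documents at ranks 1, 3 and 5: (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
print(round(average_precision([1, 0, 1, 0, 1]), 2))
```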

Page 11: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Optimization Challenges

• Rank-based measures are multivariate
  – Cannot decompose (additively) into document pairs
  – Need to exploit other structure

• Defined over rankings
  – Rankings do not vary smoothly
  – Discontinuous w.r.t. model parameters
  – Need some kind of relaxation/approximation

Page 12: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

[Y & Burges; 2007]

Page 13: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Optimization Approach

• Approximations / Smoothing
  – Directly define gradient
    • LambdaRank [Burges et al., 2006]
  – Gaussian smoothing
    • SoftRank GP [Guiver & Snelson, 2008]

• Upper bound relaxations
  – Exponential Loss w/ Boosting
    • AdaRank [Xu et al., 2007]
  – Hinge Loss w/ Structural SVMs
    • [Chapelle et al., 2007]
    • SVM-map [Yue et al., 2007]

Page 14: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Structured Prediction

• Let x be a structured input (candidate documents)
• Let y be a structured output (ranking)

• Use a joint feature map $\Psi(x, y) \in \mathbb{R}^F$ to encode the compatibility of predicting y for a given x.
  – Captures all the structure of the prediction problem

• Consider linear models: after learning w, we can make predictions via

  $\hat{y} = \arg\max_y\ w^T \Psi(x, y)$

Page 15: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Linear Discriminant for Ranking

• Let x = (x1,…xn) denote candidate documents (features)

• Let y_jk ∈ {+1, −1} encode pairwise rank orders

• Feature map is a linear combination of documents:

  $\Psi(x, y) = \sum_{j:\,\text{rel}} \sum_{k:\,\text{non-rel}} y_{jk}\,(x_j - x_k)$

• Prediction made by sorting on document scores wᵀxᵢ:

  $\hat{y} = \arg\max_y\ w^T \Psi(x, y)$

Page 16: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Linear Discriminant for Ranking

• Using pairwise preferences is common in IR

• So far, just reformulated using structured prediction notation.

• But we won’t decompose into independent pairs
  – Treat the entire ranking as a structured object
  – Allows for optimizing average precision

  $w^T \Psi(x, y) = \sum_{j:\,\text{rel}} \sum_{k:\,\text{non-rel}} y_{jk}\,(w^T x_j - w^T x_k)$
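The claim that prediction is just sorting can be checked directly: for the pairwise feature map above, wᵀΨ(x, y) is maximized by setting y_jk = sign(wᵀx_j − wᵀx_k), which is exactly what sorting by document score produces. A brute-force sketch (toy data and names are my assumptions):

```python
import numpy as np
from itertools import permutations

def joint_score(w, X, rel, order):
    """w'Psi(x,y) = sum over (relevant j, non-relevant k) of y_jk * (w.x_j - w.x_k),
    with y_jk = +1 if j is ranked above k in `order`, else -1."""
    pos = {doc: r for r, doc in enumerate(order)}
    s = X @ w
    return sum((1 if pos[j] < pos[k] else -1) * (s[j] - s[k])
               for j in range(len(rel)) if rel[j]
               for k in range(len(rel)) if not rel[k])

rng = np.random.default_rng(2)
X, rel, w = rng.normal(size=(6, 3)), [1, 0, 1, 0, 0, 1], rng.normal(size=3)

sort_order = list(np.argsort(-(X @ w)))                 # ranking obtained by sorting on w.x_i
best = max(joint_score(w, X, rel, p) for p in permutations(range(6)))
print(np.isclose(joint_score(w, X, rel, sort_order), best))   # True: sorting attains the max
```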

Page 17: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Structural SVM

• Let x denote a structured input (candidate documents)
• Let y denote a structured output (ranking)

• Standard objective function:

  $\min_{w,\,\xi \ge 0}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i}\xi_i$

• Constraints are defined for each incorrect labeling y′ over the set of documents x:

  $\forall i,\ \forall y' \ne y^{(i)}:\ \ w^T \Psi(x^{(i)}, y^{(i)}) \ge w^T \Psi(x^{(i)}, y') + \Delta(y') - \xi_i$

[Y, Finley, Radlinski, Joachims; SIGIR 2007]

Page 18: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Structural SVM for MAP

• Minimize

  $\frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i}\xi_i$

  subject to

  $\forall i,\ \forall y' \ne y^{(i)}:\ \ w^T \Psi(x^{(i)}, y^{(i)}) \ge w^T \Psi(x^{(i)}, y') + \Delta(y') - \xi_i$

  where ( y_jk ∈ {−1, +1} )

  $\Psi(x^{(i)}, y) = \sum_{j:\,\text{rel}} \sum_{k:\,\text{non-rel}} y_{jk}\,(x^{(i)}_j - x^{(i)}_k)$

  and

  $\Delta(y') = 1 - \text{AvgPrec}(y')$

• Sum of slacks is a smooth upper bound on MAP loss.

[Y, Finley, Radlinski, Joachims; SIGIR 2007]

Page 19: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Too Many Constraints!

• For Average Precision, the true labeling is a ranking where the relevant documents are all ranked in the front, e.g.,

• An incorrect labeling would be any other ranking, e.g.,

• This ranking has Average Precision of about 0.8, with Δ(y′) ≈ 0.2

• Intractable number of rankings, thus an intractable number of constraints!

Page 20: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Structural SVM Training

Original SVM Problem
• Intractable number of constraints
• Most are dominated by a small set of “important” constraints

Structural SVM Approach
• Repeatedly finds the next most violated constraint…
• …until the set of constraints is a good approximation.

[Tsochantaridis et al., 2005]

(Pages 21–23 repeat the same slide as incremental animation steps.)

Page 24: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Finding Most Violated Constraint

• A constraint is violated when

  $\Delta(y') + w^T\Psi(x, y') - w^T\Psi(x, y) - \xi_i > 0$

• Finding most violated constraint reduces to

  $\hat{y}' = \arg\max_{y'}\ w^T\Psi(x, y') + \Delta(y')$

• Highly related to inference/prediction:

  $\hat{y} = \arg\max_{y}\ w^T\Psi(x, y)$

Page 25: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Finding Most Violated Constraint

Observations
• MAP is invariant to the order of documents within a relevance class
  – Swapping two relevant or two non-relevant documents does not change MAP.

• Joint SVM score is optimized by sorting by document score, wᵀxⱼ

• Reduces to finding an interleaving between two sorted lists of documents

  $\arg\max_{y'}\ \Delta(y') + \sum_{j:\,\text{rel}} \sum_{k:\,\text{non-rel}} y'_{jk}\,(w^T x_j - w^T x_k)$

Page 26: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Finding Most Violated Constraint

• Start with perfect ranking
• Consider swapping adjacent relevant/non-relevant documents

  $\arg\max_{y'}\ \Delta(y') + \sum_{j:\,\text{rel}} \sum_{k:\,\text{non-rel}} y'_{jk}\,(w^T x_j - w^T x_k)$

Page 27: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Finding Most Violated Constraint

• Start with perfect ranking
• Consider swapping adjacent relevant/non-relevant documents
• Find the best feasible ranking of the non-relevant document

  $\arg\max_{y'}\ \Delta(y') + \sum_{j:\,\text{rel}} \sum_{k:\,\text{non-rel}} y'_{jk}\,(w^T x_j - w^T x_k)$

Page 28: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Finding Most Violated Constraint

• Start with perfect ranking
• Consider swapping adjacent relevant/non-relevant documents
• Find the best feasible ranking of the non-relevant document
• Repeat for next non-relevant document

  $\arg\max_{y'}\ \Delta(y') + \sum_{j:\,\text{rel}} \sum_{k:\,\text{non-rel}} y'_{jk}\,(w^T x_j - w^T x_k)$

Page 29: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Finding Most Violated Constraint

• Start with perfect ranking
• Consider swapping adjacent relevant/non-relevant documents
• Find the best feasible ranking of the non-relevant document
• Repeat for next non-relevant document
• Never want to swap past previous non-relevant document

  $\arg\max_{y'}\ \Delta(y') + \sum_{j:\,\text{rel}} \sum_{k:\,\text{non-rel}} y'_{jk}\,(w^T x_j - w^T x_k)$

Page 30: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Finding Most Violated Constraint

• Start with perfect ranking
• Consider swapping adjacent relevant/non-relevant documents
• Find the best feasible ranking of the non-relevant document
• Repeat for next non-relevant document
• Never want to swap past previous non-relevant document
• Repeat until all non-relevant documents have been considered

  $\arg\max_{y'}\ \Delta(y') + \sum_{j:\,\text{rel}} \sum_{k:\,\text{non-rel}} y'_{jk}\,(w^T x_j - w^T x_k)$
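The slides give an efficient greedy procedure; purely as a sanity check (a brute-force sketch that only works for toy-sized candidate sets, with assumed scores and labels), one can enumerate every interleaving of the two score-sorted lists and take the argmax of Δ(y′) + wᵀΨ(x, y′):

```python
import numpy as np
from itertools import combinations

def avg_prec(ranking, rel):
    hits, precs = 0, []
    for k, doc in enumerate(ranking, start=1):
        if rel[doc]:
            hits += 1
            precs.append(hits / k)
    return sum(precs) / len(precs)

def discriminant(scores, ranking, rel):
    """w'Psi(x,y) = sum over (rel j, non-rel k) of y_jk * (score_j - score_k)."""
    pos = {doc: r for r, doc in enumerate(ranking)}
    return sum((1 if pos[j] < pos[k] else -1) * (scores[j] - scores[k])
               for j in range(len(rel)) if rel[j]
               for k in range(len(rel)) if not rel[k])

def most_violated_constraint(scores, rel):
    """argmax over interleavings of the sorted relevant / non-relevant lists of
    Delta(y') + w'Psi(x,y'), with Delta(y') = 1 - AvgPrec(y')."""
    docs = range(len(rel))
    rel_sorted = sorted([d for d in docs if rel[d]], key=lambda d: -scores[d])
    non_sorted = sorted([d for d in docs if not rel[d]], key=lambda d: -scores[d])
    n, best, best_val = len(rel), None, -np.inf
    for rel_pos in combinations(range(n), len(rel_sorted)):
        ranking, ri, ni = [None] * n, iter(rel_sorted), iter(non_sorted)
        for p in range(n):
            ranking[p] = next(ri) if p in rel_pos else next(ni)
        val = (1 - avg_prec(ranking, rel)) + discriminant(scores, ranking, rel)
        if val > best_val:
            best, best_val = ranking, val
    return best, best_val

scores = np.array([2.0, 1.5, 1.0, 0.8, 0.5, 0.2])        # assumed document scores w.x_i
rel = [1, 0, 1, 0, 0, 1]                                  # assumed relevance labels
print(most_violated_constraint(scores, rel))
```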

Page 31: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Comparison with other SVM methods

[Bar chart: Mean Average Precision (y-axis from 0.10 to 0.30) of SVM-MAP, SVM-ROC, SVM-ACC, SVM-ACC2, SVM-ACC3 and SVM-ACC4 on six datasets: TREC 9 Indri, TREC 10 Indri, TREC 9 Submissions, TREC 10 Submissions, TREC 9 Submissions (without best), TREC 10 Submissions (without best).]

Page 32: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Structural SVM for MAP

• Treats rankings as structured objects

• Optimizes hinge-loss relaxation of MAP
  – Provably minimizes the empirical risk
  – Performance improvement over conventional SVMs

• Relies on subroutine to find most violated constraint
  – Computationally compatible with linear discriminant

  $\hat{y}' = \arg\max_{y'}\ w^T\Psi(x, y') + \Delta(y')$

Page 33: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Need for Diversity (in IR)

• Ambiguous Queries
  – Users with different information needs issuing the same textual query
  – “Jaguar”
  – At least one relevant result for each information need

• Learning Queries
  – User interested in “a specific detail or entire breadth of knowledge available” [Swaminathan et al., 2008]
  – Results with high information diversity

Page 34: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

[Screenshot: search results for the query “Jaguar” from 11/27/2007 — top of first page, bottom of first page, and result #18.]

Page 35: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Learning to Rank

• Current methods
  – Real valued retrieval functions f(q,d)
  – Sort by f(q,di) to obtain ranking

• Benefits:
  – Know how to perform learning
  – Can optimize for rank-based performance measures
  – Outperforms traditional IR models

• Drawbacks:
  – Cannot account for diversity
  – During prediction, considers each document independently

Page 36: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Example

• Choose K documents with maximal information coverage.
• For K = 3, the optimal set is {D1, D2, D10}

Page 37: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Diversity via Set Cover

• Documents cover information
  – Assume information is partitioned into discrete units.

• Documents overlap in the information covered.

• Selecting K documents with maximal coverage is a set cover problem
  – NP-complete in general
  – Greedy has (1 − 1/e) approximation [Khuller et al., 1997]

Page 38: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Diversity via Subtopics

• Current datasets use manually determined subtopic labels
  – E.g., “Use of robots in the world today”
    • Nanorobots
    • Space mission robots
    • Underwater robots
  – Manual partitioning of the total information
  – Relatively reliable
  – Use as training data

Page 39: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Weighted Word Coverage

• Use words to represent units of information

• More distinct words = more information

• Weight word importance

• Does not depend on human labeling

• Goal: select K documents which collectively cover as many distinct (weighted) words as possible
  – Greedy selection yields (1 − 1/e) bound.
  – Need to find good weighting function (learning problem).

Page 40: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Example

Word Benefit:
  V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5

Document Word Counts:
        V1   V2   V3   V4   V5
  D1               X    X    X
  D2          X         X    X
  D3     X    X    X    X

Marginal Benefit:
           D1   D2   D3   Best
  Iter 1   12   11   10   D1
  Iter 2

Page 41: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Example

Word Benefit:
  V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5

Document Word Counts:
        V1   V2   V3   V4   V5
  D1               X    X    X
  D2          X         X    X
  D3     X    X    X    X

Marginal Benefit:
           D1   D2   D3   Best
  Iter 1   12   11   10   D1
  Iter 2   --    2    3   D3
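A small sketch (my own code) that reproduces the marginal-benefit table above, using the word benefits and the document word sets implied by the numbers in the example:

```python
def greedy_coverage(doc_words, word_benefit, K):
    """Greedily pick K documents maximizing the benefit of distinct covered words
    (a (1 - 1/e)-approximation for this monotone submodular objective)."""
    covered, chosen = set(), []
    for _ in range(K):
        marginal = {d: sum(word_benefit[v] for v in doc_words[d] - covered)
                    for d in doc_words if d not in chosen}
        best = max(marginal, key=marginal.get)
        print(marginal, "->", best)
        chosen.append(best)
        covered |= doc_words[best]
    return chosen

word_benefit = {"V1": 1, "V2": 2, "V3": 3, "V4": 4, "V5": 5}
doc_words = {"D1": {"V3", "V4", "V5"},         # marginal benefit 12 in iteration 1
             "D2": {"V2", "V4", "V5"},         # 11
             "D3": {"V1", "V2", "V3", "V4"}}   # 10, then 3 in iteration 2
print(greedy_coverage(doc_words, word_benefit, K=2))   # picks D1, then D3
```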

Page 42: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Related Work Comparison

• Essential Pages [Swaminathan et al., 2008]
  – Uses fixed function of word benefit
  – Depends on word frequency in candidate set

• Our goals
  – Automatically learn a word benefit function
    • Learn to predict set covers
    • Use training data
    • Minimize subtopic loss
  – No prior ML approach (to our knowledge)

Page 43: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Linear Discriminant

• x = (x1, x2, …, xn) – candidate documents
• y – subset of x
• V(y) – union of words from documents in y

• Discriminant Function:

  $w^T\Psi(x, y) = \sum_{v \in V(y)} w^T \phi(v, x)$

• φ(v, x) – frequency features (e.g., ≥10%, ≥20%, etc.)

• Benefit of covering word v is then wᵀφ(v, x)

  $\hat{y} = \arg\max_y\ w^T\Psi(x, y)$

[Y, Joachims; ICML 2008]

Page 44: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Linear Discriminant

• Does NOT reward redundancy
  – Benefit of each word only counted once

• Greedy has (1 − 1/e)-approximation bound

• Linear (joint feature space)
  – Allows for SVM optimization

  $w^T\Psi(x, y) = \sum_{v \in V(y)} w^T \phi(v, x)$

[Y, Joachims; ICML 2008]

Page 45: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

More Sophisticated Discriminant

• Documents “cover” words to different degrees
  – A document with 5 copies of “Microsoft” might cover it better than another document with only 2 copies.

• Use multiple word sets, V1(y), V2(y), … , VL(y)

• Each Vi(y) contains only words satisfying certain importance criteria.

[Y, Joachims; ICML 2008]

Page 46: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

More Sophisticated Discriminant

$\Psi(x, y) = \Big[\ \sum_{v \in V_1(y)} \phi_1(v, x)\ ;\ \cdots\ ;\ \sum_{v \in V_L(y)} \phi_L(v, x)\ \Big]$

• Separate φᵢ for each importance level i.
• Joint feature map is the vector composition of all φᵢ.

  $\hat{y} = \arg\max_y\ w^T\Psi(x, y)$

• Greedy has (1 − 1/e)-approximation bound.
• Still uses linear feature space.

[Y, Joachims; ICML 2008]

Page 47: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Weighted Subtopic Loss

• Example:
  – x1 covers t1
  – x2 covers t1, t2, t3
  – x3 covers t1, t3

• Motivation
  – Higher penalty for not covering popular subtopics
  – Mitigates effects of label noise in tail subtopics

        # Docs   Loss
  t1      3       1/2
  t2      1       1/6
  t3      2       1/3
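A sketch of this loss (helper names are mine): each subtopic is weighted by its share of document coverings, and the loss of a predicted set is the total weight of the subtopics it leaves uncovered, matching the table above:

```python
def subtopic_weights(doc_subtopics):
    """Weight each subtopic by its share of all (document, subtopic) coverings."""
    counts = {}
    for topics in doc_subtopics.values():
        for t in topics:
            counts[t] = counts.get(t, 0) + 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def weighted_subtopic_loss(predicted, doc_subtopics, weights):
    """Total weight of subtopics not covered by the predicted set of documents."""
    covered = set().union(*(doc_subtopics[d] for d in predicted)) if predicted else set()
    return sum(w for t, w in weights.items() if t not in covered)

doc_subtopics = {"x1": {"t1"}, "x2": {"t1", "t2", "t3"}, "x3": {"t1", "t3"}}
weights = subtopic_weights(doc_subtopics)                 # {'t1': 1/2, 't2': 1/6, 't3': 1/3}
print(weighted_subtopic_loss({"x1"}, doc_subtopics, weights))   # misses t2, t3: loss = 1/2
```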

Page 48: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Structural SVM

• Input: x (candidate set of documents)
• Target: y (subset of x of size K)

• Same objective function:

  $\min_{w,\,\xi \ge 0}\ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_i \xi_i$

• Constraints for each incorrect labeling y′:

  $\forall i,\ \forall y' \ne y^{(i)}:\ \ w^T \Psi(x^{(i)}, y^{(i)}) \ge w^T \Psi(x^{(i)}, y') + \Delta(y') - \xi_i$

• Score of best y at least as large as incorrect y′ plus loss
• Finding the most violated constraint is also a set cover problem

Page 49: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

TREC Experiments

• TREC 6–8 Interactive Track Queries
• Documents labeled into subtopics.

• 17 queries used
  – considered only relevant docs
  – decouples relevance problem from diversity problem

• 45 docs/query, 20 subtopics/query, 300 words/doc

Page 50: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

TREC Experiments

• 12/4/1 train/valid/test split
  – Approx 500 documents in training set

• Permuted until all 17 queries were tested once

• Set K=5 (some queries have very few documents)

• SVM-div – uses term frequency thresholds to define importance levels

• SVM-div2 – in addition uses TFIDF thresholds

Page 51: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

TREC Results

  Method              Loss
  Random              0.469
  Okapi               0.472
  Unweighted Model    0.471
  Essential Pages     0.434
  SVM-div             0.349
  SVM-div2            0.382

  Methods                       W / T / L
  SVM-div  vs Ess. Pages        14 / 0 / 3 **
  SVM-div2 vs Ess. Pages        13 / 0 / 4
  SVM-div  vs SVM-div2           9 / 6 / 2

Page 52: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Can expect further benefit from having more training data.

Page 53: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

IR as Structured Prediction

• As a general approach:
  – Structured prediction encapsulates goals of IR
  – Recent work has demonstrated the benefit of using structured prediction.

• Future directions:
  – Apply to more general retrieval paradigms
  – XML retrieval

Page 54: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

XML Retrieval

• Can retrieve information at different scopes
  – Individual documents
  – Smaller components of documents
  – Larger clusters of documents

• Issues of objective function & diversity still apply
  – Complex performance measures
  – Inter-component dependencies

[Image src: http://www.dcs.bbk.ac.uk/~ptw/teaching/html/]

Page 55: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Blog Retrieval

• Special case of XML retrieval

• Query: “High Energy Physics”
  – Return a blog feed?
  – Return blog front page?
  – Return individual blog posts?

  – Optimizing for MAP?
  – Diversity?

Page 56: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Active Learning

• Batch Learning
  – Learns a model using pre-collected training data
  – Assumes training data representative of unseen data
  – Most studied machine learning paradigm
    • Very successful in wide range of applications
  – Includes most work on structured prediction

• Active Learning:
  – Can be applied directly to live users
    • Representative of real users
  – Removes cost of human-labeled training data
    • Time / Money / Reliability

Page 57: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Implicit Feedback

• Users provide feedback while searching
  – What results they click on
  – How they reformulate queries
  – The length of time from issuing query to clicking on a result
  – Geographical
  – User-specific data
    • Personal search history
    • Age / gender / profession / etc.

Page 58: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Presentation Bias in Click Results

[Granka et al., 2004]

Page 59: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Biased Implicit Feedback

• Users biased towards top of rankings
  – Passive collection results in very biased training data
  – No feedback for relevant documents outside top 10
  – Most prior work focuses on passive collection

• Our goals
  – Use active learning methods to gather unbiased implicit feedback
  – Still present good results to users while learning
    • Learn “on the fly”

Page 60: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Preferences Between Rankings

• Interleave two rankings into one ranking
  – Users click more on documents from the better ranking.

[Radlinski et al., 2008]

Page 61: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Active Learning Approach

• Leverage ranking preference test
  – Compare relative quality of competing retrieval functions
  – Not biased towards any specific rank position

• Avoid showing users poor results
  – Quickly determine bad rankings
  – Algorithm must learn “online”

• Formulate as a multi-armed bandit problem

Page 62: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Dueling Bandits Problem

• Given bandits (retrieval functions) r1, …, rN

• Each time step compares two bandits

• Comparison is noisy
  – Some probability of saying the worse bandit is better
  – Each comparison independent

• Choose pair (r_t, r_t′) to minimize regret:

  $R_T = \sum_{t=1}^{T} \left[ P(r^* > r_t) + P(r^* > r_t') - 1 \right]$

  (% of users who prefer the best bandit over the chosen ones)

[Broder, Kleinberg, Y; work in progress]
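A tiny sketch of this regret definition (the preference matrix and comparison history are assumptions): each comparison of the pair (r_t, r_t′) contributes P(r* ≻ r_t) + P(r* ≻ r_t′) − 1, so comparing the best bandit against itself costs nothing:

```python
def dueling_regret(P, best, history):
    """P[i][j] = probability bandit i beats bandit j; history = list of compared pairs (i, j)."""
    return sum(P[best][i] + P[best][j] - 1 for i, j in history)

P = [[0.5, 0.7, 0.9],       # assumed 3-bandit preference matrix, bandit 0 is best
     [0.3, 0.5, 0.6],
     [0.1, 0.4, 0.5]]
print(dueling_regret(P, best=0, history=[(1, 2), (0, 1), (0, 0)]))   # 0.6 + 0.2 + 0.0 = 0.8
```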

Page 63: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Regret Minimization

• If regret is sublinear in T (e.g., log T)
  – Then average regret (over time) tends to 0

• Want average regret to approach 0 quickly
  – R_T should be as small as possible

  $R_T = \sum_{t=1}^{T} \left[ P(r^* > r_t) + P(r^* > r_t') - 1 \right]$

[Broder, Kleinberg, Y; work in progress]

Page 64: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Results

• Let ε be the distinguishability of the top two bandits (r*, r**): P(r* > r**) = ½ + ε

• Known lower bound:

  $R_T = \Omega\!\left( \frac{N}{\epsilon}\,\log T \right)$

• Interleaved Filter achieves regret

  $R_T = O\!\left( \frac{N}{\epsilon}\,\log T \right)$

• Information-theoretically optimal (up to constant factors)

[Broder, Kleinberg, Y; work in progress]

Page 65: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Assumptions

• Strong Stochastic Transitivity
  – For three bandits rᵢ > rⱼ > rₖ:  $\epsilon_{ik} \ge \max\{\epsilon_{ij}, \epsilon_{jk}\}$

• Stochastic Triangle Inequality
  – For three bandits rᵢ > rⱼ > rₖ:  $\epsilon_{ik} \le K\,(\epsilon_{ij} + \epsilon_{jk})$

• Satisfied by many standard generative models
  – E.g., Logistic / Bradley-Terry (K = 2)

Page 66: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Interleaved Filter

• Choose candidate bandit at random ►

Page 67: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Interleaved Filter

• Choose candidate bandit at random

• Make noisy comparisons (Bernoulli trial) against all other bandits in turn
  – Maintain mean and confidence interval

Page 68: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Interleaved Filter

• Choose candidate bandit at random

• Make noisy comparisons (Bernoulli trial) against all other bandits in turn…
  – Maintain mean and confidence interval

• …until another bandit is better
  – With confidence 1 − δ

Page 69: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Interleaved Filter

• Choose candidate bandit at random

• Make noisy comparisons (Bernoulli trial) against all other bandits in turn…
  – Maintain mean and confidence interval

• …until another bandit is better
  – With confidence 1 − δ

• Repeat process with new candidate
  – Remove all empirically worse bandits

Page 70: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Interleaved Filter

• Choose candidate bandit at random

• Make noisy comparisons (Bernoulli trial) against all other bandits in turn…
  – Maintain mean and confidence interval

• …until another bandit is better
  – With confidence 1 − δ

• Repeat process with new candidate
  – Remove all empirically worse bandits

• Continue until 1 candidate left
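A simplified simulation of this procedure (my own sketch; the confidence-interval constants are illustrative assumptions, not the exact ones from the analysis):

```python
import math, random

def interleaved_filter(P, T, delta):
    """P[i][j] = probability bandit i beats bandit j in one noisy comparison."""
    candidate = random.randrange(len(P))
    remaining = [b for b in range(len(P)) if b != candidate]
    t = 0
    while remaining and t < T:
        wins, plays, challenger = {b: 0 for b in remaining}, 0, None
        while challenger is None and remaining and t < T:
            for b in list(remaining):                      # compare candidate vs each remaining bandit
                wins[b] += random.random() < P[b][candidate]
                t += 1
            plays += 1
            conf = math.sqrt(math.log(1 / delta) / plays)  # confidence radius around the mean
            for b in list(remaining):
                mean = wins[b] / plays
                if mean + conf < 0.5:                      # confidently worse than candidate: prune
                    remaining.remove(b)
                elif mean - conf > 0.5:                    # confidently better: it wins the round
                    challenger = b
        if challenger is not None:                         # new round: winner becomes candidate,
            candidate = challenger                         # empirically worse bandits are dropped
            remaining = [b for b in remaining if b != challenger and wins[b] / plays > 0.5]
    return candidate

P = [[0.5, 0.65, 0.7, 0.8],                                # assumed preferences, bandit 0 is best
     [0.35, 0.5, 0.6, 0.7],
     [0.3, 0.4, 0.5, 0.6],
     [0.2, 0.3, 0.4, 0.5]]
print(interleaved_filter(P, T=100_000, delta=1e-4))        # usually prints 0
```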

Page 71: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Regret Analysis

• Stops comparing at 1 − δ confidence
  – Concludes one bandit is better

• Appropriate choice of δ (δ = N⁻²T⁻¹) leads to 1 − 1/T probability of finding r*

• Regret is 0 whenever we choose r*
  – Only accumulate regret when finding r*

  $R_T = \sum_{t=1}^{T} \left[ P(r^* > r_t) + P(r^* > r_t') - 1 \right]$

Page 72: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Naïve Approach

• In deterministic case, O(N) comparisons to find max

• Extend to noisy case:
  – Maintain current candidate
  – Run comparisons against 1 other bandit until 1 − δ confidence
  – Take better bandit as candidate
  – Repeat until all bandits considered

• Problem:
  – If current candidate awful
  – Many comparisons to determine which awful bandit is better
  – Incur high regret for each comparison

  $E[R_T] = O\!\left( \frac{N}{\epsilon^2}\,\log T \right)$

Page 73: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Naïve vs Interleaved Filter

• Naïve performs poorly due to matches between two awful bandits
  – Too many comparisons
  – Accumulates high regret

• Interleaved Filter bounds matches using bounds on current candidate vs best
  – Stops when better bandit found
  – Regret bounded

  $R_T = \sum_{t=1}^{T} \left[ P(r^* > r_t) + P(r^* > r_t') - 1 \right]$

Page 74: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Naïve vs Interleaved Filter

• But Naïve concentrates on 2 bandits at any point in time

  Naïve:  $E[R_T] = O\!\left( \frac{N}{\epsilon^2}\,\log T \right)$

• Interleaved Filter compares 1 bandit vs the rest simultaneously
  – Should experience N² blowup in regret
  – … or at least N log N

  Interleaved Filter:  $E[R_T] = O\!\left( \frac{N}{\epsilon}\,\log T \right)$

Page 75: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Regret Analysis

• Define a round to be all the time steps for a particular candidate bandit
  – O(log N) rounds total w.h.p.

• Define a match to be all the comparisons between two bandits in a round
  – O(N) matches in each round
  – At most O(N log N) total matches

• End of each round
  – Remove empirically inferior bandits
  – “Constant fraction” of bandits removed after each round

Page 76: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Regret Analysis

• O(log N) rounds played

• “Constant fraction” of bandits removed at end of each round
  – O(N) total matches w.h.p.

• Each match incurs regret $O\!\left( \frac{1}{\epsilon}\,\log T \right)$

• Expected regret:

  $E[R_T] = O(N) \cdot O\!\left( \frac{1}{\epsilon}\,\log T \right) = O\!\left( \frac{N}{\epsilon}\,\log T \right)$

Page 77: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Dueling Bandits Problem

• Uses a natural (and simple) regret formulation

  – Captures preference for the best possible retrieval function
  – Consistent with unbiased ranking preference feedback [Radlinski et al., 2008]
  – Online/bandit formulation of finding the max with noisy comparisons

  $R_T = \sum_{t=1}^{T} \left[ P(r^* > r_t) + P(r^* > r_t') - 1 \right]$

• Interleaved Filter achieves best possible regret bound
  – Logarithmic in T
  – Linear in N

  $E[R_T] = O\!\left( \frac{N}{\epsilon}\,\log T \right)$

Page 78: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Related Work

• Other forms of implicit feedback
  – Preferences between documents within a ranking

• Other active learning techniques
  – Bandit algorithm for minimizing abandonment [Radlinski et al., 2008]
  – Active exploration of pairwise document preferences [Radlinski et al., 2007]
  – These approaches cannot generalize across queries

• Most learning approaches use passive collection
  – Susceptible to presentation bias

Page 79: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Moving Forward

• Limitations
  – Assumes users’ preferences are static
    • Interleaved Filter first explores, then commits
  – Assumes a finite set of ranking functions
    • Should assume a continuous parameter space

• Future directions
  – Use Interleaved Filter as an optimization engine
    • Collect finite sample from continuous parameter space
  – Look at completely new problem formulations
  – Progress towards live user studies

Page 80: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Summary

• Structured prediction for complex retrieval problems
  – Rank-based performance measures
  – Diversity
  – Potentially much more!

• Active learning using unbiased implicit feedback
  – Learn directly from users (cheaper & more accurate)
  – Active learning for structured prediction models?

http://www.yisongyue.com

• Thanks to Tie-Yan Liu and Hang Li

Page 81: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Extra Slides

Page 82: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

SVM-map

Page 83: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Experiments

• Used TREC 9 & 10 Web Track corpus.

• Features of document/query pairs computed from outputs of existing retrieval functions.

(Indri Retrieval Functions & TREC Submissions)

• Goal is to learn a recombination of outputs which improves mean average precision.

Page 84: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Comparison with Best Base methods

[Bar chart: Mean Average Precision (y-axis from 0.10 to 0.30) of SVM-MAP vs. Base 1, Base 2 and Base 3 on: TREC 9 Indri, TREC 10 Indri, TREC 9 Submissions, TREC 10 Submissions, TREC 9 Submissions (without best), TREC 10 Submissions (without best).]

Page 85: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Comparison with other SVM methods

[Bar chart (same as Page 31): Mean Average Precision (y-axis from 0.10 to 0.30) of SVM-MAP, SVM-ROC, SVM-ACC, SVM-ACC2, SVM-ACC3 and SVM-ACC4 on: TREC 9 Indri, TREC 10 Indri, TREC 9 Submissions, TREC 10 Submissions, TREC 9 Submissions (without best), TREC 10 Submissions (without best).]

Page 86: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

SVM-div

Page 87: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Essential Pages

[Swaminathan et al., 2008]

x = (x1, x2, …, xn) – set of candidate documents for a query
y – a subset of x of size K (our prediction)

Benefit of covering word v with document xi:

  $C(v, x_i, \mathbf{x}) = TF(v, x_i) \cdot \gamma(v, \mathbf{x})$

Importance of covering word v:

  $\gamma(v, \mathbf{x}) = r(v, \mathbf{x}) \cdot \log_2\!\frac{1}{r(v, \mathbf{x})}$

Intuition:
  – Frequent words cannot encode information diversity.
  – Infrequent words do not provide significant information.

  $\hat{y} = \arg\max_y \sum_v C(v, y, \mathbf{x})$
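A sketch of this scoring rule as reconstructed above (the exact functional form is our reading of the garbled slide, and the toy documents are assumptions), combined with the greedy selection used throughout this part of the talk:

```python
import math

def importance(v, docs):
    """gamma(v, x) = r * log2(1/r), r = fraction of candidate documents containing v."""
    r = sum(v in d for d in docs) / len(docs)
    return 0.0 if r == 0 else r * math.log2(1 / r)

def benefit(v, doc, docs):
    """C(v, x_i, x) = TF(v, x_i) * gamma(v, x)."""
    return (doc.count(v) / len(doc)) * importance(v, docs)

def essential_pages(docs, K):
    """Greedily select K documents maximizing the summed benefit of newly covered words."""
    covered, chosen = set(), []
    for _ in range(K):
        gain = {i: sum(benefit(v, docs[i], docs) for v in set(docs[i]) - covered)
                for i in range(len(docs)) if i not in chosen}
        i = max(gain, key=gain.get)
        chosen.append(i)
        covered |= set(docs[i])
    return chosen

docs = ["jaguar car speed engine".split(),
        "jaguar cat habitat jungle".split(),
        "jaguar car price dealer".split()]
print(essential_pages(docs, K=2))   # picks the "cat" page plus one "car" page; "jaguar" itself
                                    # appears everywhere, so it contributes no diversity benefit
```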

Page 88: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Finding Most Violated Constraint

• Encode each subtopic as an additional “word” to be covered.

• Use greedy prediction to find approximate most violated constraint.

  $\hat{y}' = \arg\max_{y'}\ w^T \Psi'(x, y')$

  where Ψ′ is the joint feature map from before, augmented with coverage features for the subtopic “words”.

Page 89: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Approximate Constraint Generation

• Theoretical guarantees no longer hold.
  – Might not find an epsilon-close approximation to the feasible region boundary.

• Performs well in practice.

Page 90: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Approximate constraint generation seems to perform well.

Page 91: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Synthetic Dataset

• TREC dataset very small
• Synthetic dataset so we can vary retrieval size K

• 100 queries
• 100 docs/query, 25 subtopics/query, 300 words/doc

• 15/10/75 train/valid/test split

Page 92: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Consistently outperforms Essential Pages

Page 93: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Interleaved Filter

Page 94: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Lower Bound Example

• All suboptimal bandits roughly equivalent
  – ε₁₂ = ε₁₃ = … = ε₁N

• $\Omega\!\left( \frac{1}{\epsilon^2}\,\log T \right)$ comparisons to differentiate best from suboptimal

• Pay Θ(ε) regret for each comparison

• Accumulated regret over all comparisons is at least

  $\Omega\!\left( \frac{N}{\epsilon}\,\log T \right)$

Page 95: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Per-Match Regret

• Number of comparisons in match rᵢ vs rⱼ:

  $O\!\left( \frac{1}{\max\{\epsilon_{1i}, \epsilon_{ij}\}^2}\,\log T \right)$

  – ε₁ᵢ > εᵢⱼ: round ends before concluding rᵢ > rⱼ
  – ε₁ᵢ < εᵢⱼ: conclude rᵢ > rⱼ before round ends, remove rⱼ

• Pay ε₁ᵢ + ε₁ⱼ regret for each comparison
  – By triangle inequality ε₁ᵢ + ε₁ⱼ ≤ (2K+1) max{ε₁ᵢ, εᵢⱼ}
  – Thus by stochastic transitivity accumulated regret is

  $O\!\left( \frac{1}{\max\{\epsilon_{1i}, \epsilon_{ij}\}}\,\log T \right) = O\!\left( \frac{1}{\epsilon_{1i}}\,\log T \right)$

Page 96: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Number of Rounds

• Assume all superior bandits have equal prob of defeating candidate
  – Worst case scenario under transitivity

• Model this as a random walk
  – rⱼ transitions to each rᵢ (i < j) with equal probability
  – Compute total number of steps before reaching r₁ (i.e., r*)

• Can show O(log N) w.h.p. using Chernoff bound
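A quick simulation of this random-walk argument (a sketch; it starts from the worst bandit, matching the worst case described above): from candidate index j the next candidate is uniform over the better bandits i < j, and the number of rounds needed to reach bandit 1 concentrates around log N:

```python
import math, random

def rounds_to_best(N, trials=10_000):
    """Average number of random-walk steps from candidate N down to candidate 1."""
    total = 0
    for _ in range(trials):
        j = N
        while j > 1:
            j = random.randint(1, j - 1)   # jump to a uniformly random better bandit
            total += 1
    return total / trials

for N in (10, 100, 1000):
    print(N, round(rounds_to_best(N), 2), round(math.log(N), 2))
# The averages follow the harmonic number H_{N-1} ~ ln N, consistent with O(log N) rounds.
```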

Page 97: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Total Matches Played

• O(N) matches played in each round
  – Naïve analysis yields O(N log N) total

• However, all empirically worse bandits are also removed at the end of each round
  – Will not participate in future rounds

• Assume worst case that inferior bandits have ½ chance of being empirically worse
  – Can show w.h.p. that O(N) total matches are played over O(log N) rounds

Page 98: Structured Prediction and Active Learning for Information Retrieval Presented at Microsoft Research Asia August 21 st, 2008 Yisong Yue Cornell University.

Removing Inferior Bandits

• At conclusion of each round
  – Remove any empirically worse bandits

• Intuition:
  – High confidence that winner is better than incumbent candidate
  – Empirically worse bandits cannot be “much better” than incumbent candidate
  – Can show that winner is also better than empirically worse bandits with high confidence
  – Preserves 1 − 1/T confidence overall that we’ll find the best bandit