Page 1

Batch Learning from Bandit Feedback

CS7792 Counterfactual Machine Learning – Fall 2018

Thorsten Joachims

Departments of Computer Science and Information Science, Cornell University

• A. Swaminathan, T. Joachims. Batch Learning from Logged Bandit Feedback through Counterfactual Risk Minimization. JMLR Special Issue in Memory of Alexey Chervonenkis, 16(1):1731–1755, 2015.

• T. Joachims, A. Swaminathan, M. de Rijke. Deep Learning with Logged Bandit Feedback. In ICLR, 2018.

Page 2

Interactive Systems

• Examples

  – Ad Placement
  – Search engines
  – Entertainment media
  – E-commerce
  – Smart homes

• Log Files
  – Measure and optimize performance
  – Gathering and maintenance of knowledge
  – Personalization

Page 3

Batch Learning from Bandit Feedback

• Data: $S = ((x_1, y_1, \delta_1, p_1), \dots, (x_n, y_n, \delta_n, p_n))$ — "bandit" feedback

• Properties
  – Contexts $x_i$ drawn i.i.d. from unknown $P(X)$
  – Actions $y_i$ selected by existing system $\pi_0: X \to Y$
  – Losses $\delta_i$ drawn i.i.d. from unknown $P(\delta \mid x_i, y_i)$
  – Propensities $p_i = \pi_0(y_i \mid x_i)$ recorded by the logger

• Goal of Learning
  – Find new system $\pi$ that selects $y$ with better $\delta$

[Figure: logging policy $\pi_0$ maps a context $x$ to an action $y$ and observes the reward/loss $\delta$ with propensity $p$]

[Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]

Page 4

Learning Settings

                  | Full-Information (Labeled) Feedback | Partial-Information (e.g. Bandit) Feedback
Online Learning   | Perceptron, Winnow, etc.            | EXP3, UCB1, etc.
Batch Learning    | SVM, Random Forests, etc.           | ?

Page 5

Comparison with Supervised Learning

                    | Batch Learning from Bandit Feedback                        | Conventional Supervised Learning
Train example       | $(x, y, \delta)$                                           | $(x, y^*)$
Context $x$         | drawn i.i.d. from unknown $P(X)$                           | drawn i.i.d. from unknown $P(X)$
Action $y$          | selected by existing system $\pi_0: X \to Y$               | N/A
Feedback $\delta$   | observe $\delta(x, y)$ only for the $y$ chosen by $\pi_0$  | known loss function $\Delta(y, y^*)$, so feedback $\delta(x, y)$ is known for every possible $y$

Page 6

Outline of Lecture

• Batch Learning from Bandit Feedback (BLBF)
  $S = ((x_1, y_1, \delta_1), \dots, (x_n, y_n, \delta_n))$
  Find new policy $\pi$ that selects $y$ with better $\delta$

• Learning Principle for BLBF
  – Hypothesis Space, Risk, Empirical Risk, and Overfitting
  – Learning Principle: Counterfactual Risk Minimization

• Learning Algorithms for BLBF
  – POEM: Bandit training of CRF policies for structured outputs
  – BanditNet: Bandit training of deep network policies

Page 7

Hypothesis Space

Definition [Stochastic Hypothesis / Policy]: Given context $x$, hypothesis/policy $\pi$ selects action $y$ with probability $\pi(y \mid x)$.

Note: stochastic prediction rules ⊃ deterministic prediction rules

πœ‹πœ‹1(π‘Œπ‘Œ|π‘₯π‘₯) πœ‹πœ‹2(π‘Œπ‘Œ|π‘₯π‘₯)

π‘Œπ‘Œ|π‘₯π‘₯

Page 8

Risk

Definition [Expected Loss (i.e. Risk)]: The expected loss / risk $R(\pi)$ of policy $\pi$ is

R πœ‹πœ‹ = ��𝛿𝛿 π‘₯π‘₯,𝑦𝑦 πœ‹πœ‹ 𝑦𝑦 π‘₯π‘₯ 𝑃𝑃 π‘₯π‘₯ 𝑑𝑑π‘₯π‘₯ 𝑑𝑑𝑦𝑦

πœ‹πœ‹1(π‘Œπ‘Œ|π‘₯π‘₯) πœ‹πœ‹2(π‘Œπ‘Œ|π‘₯π‘₯)

π‘Œπ‘Œ|π‘₯π‘₯

Page 9

Evaluating Online Metrics Offline

• Online: On-policy A/B Test
  Draw $S_1$ from $\pi_1$ to get $\hat{U}(\pi_1)$; draw $S_2$ from $\pi_2$ to get $\hat{U}(\pi_2)$; ...; draw $S_7$ from $\pi_7$ to get $\hat{U}(\pi_7)$ — one experiment per policy.

• Offline: Off-policy Counterfactual Estimates
  Draw a single $S$ from $\pi_0$ and compute counterfactual estimates for many policies from the same log, e.g. $\hat{U}(\pi_1), \hat{U}(\pi_2), \dots, \hat{U}(\pi_{30})$.

Page 10

Approach 1: Direct Method

• Data: $S = ((x_1, y_1, \delta_1), \dots, (x_n, y_n, \delta_n))$

1. Learn reward predictor $\hat{\delta}: X \times Y \to \Re$
   – Represent via features $\Psi(x, y)$
   – Learn regression based on $\Psi(x, y)$ from $S$ collected under $\pi_0$

2. Derive policy: $\pi(x) := \operatorname{argmax}_{y} \hat{\delta}(x, y)$

[Figure: true $\delta(x, y)$ vs. regression fit $\hat{\delta}(x, y)$ in feature space $(\Psi_1, \Psi_2)$]
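A minimal sketch of the direct method on synthetic data. The feature map, regressor, and problem sizes below are illustrative assumptions, not the lecture's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 5, 4                      # log size, context dim, #actions (illustrative)

# Logged bandit data (x_i, y_i, delta_i) collected under a uniform logging policy pi_0.
X = rng.normal(size=(n, d))
Y = rng.integers(k, size=n)
true_w = rng.normal(size=(d, k))
delta = (X @ true_w)[np.arange(n), Y] + 0.1 * rng.normal(size=n)   # observed feedback

# Step 1: learn a reward regressor delta_hat on joint features Psi(x, y)
# (here: context features copied into the block belonging to action y).
Psi = np.zeros((n, d * k))
for i in range(n):
    Psi[i, Y[i] * d:(Y[i] + 1) * d] = X[i]
w_hat, *_ = np.linalg.lstsq(Psi, delta, rcond=None)

# Step 2: derive the deterministic policy pi(x) = argmax_y delta_hat(x, y).
def pi(x):
    scores = [w_hat[y * d:(y + 1) * d] @ x for y in range(k)]
    return int(np.argmax(scores))
```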

Page 11

Approach 2: Off-Policy Risk Evaluation

Given $S = ((x_1, y_1, \delta_1), \dots, (x_n, y_n, \delta_n))$ collected under $\pi_0$:

Unbiased estimate of risk, if the propensity is nonzero everywhere (where it matters).

�𝑅𝑅 πœ‹πœ‹ =1𝑛𝑛�𝑖𝑖=1

𝑛𝑛

π›Ώπ›Ώπ‘–π‘–πœ‹πœ‹ 𝑦𝑦𝑖𝑖 π‘₯π‘₯π‘–π‘–πœ‹πœ‹0 𝑦𝑦𝑖𝑖 π‘₯π‘₯𝑖𝑖

[Horvitz & Thompson, 1952] [Rubin, 1983] [Zadrozny et al., 2003] [Langford & Li, 2009]

Propensity: $p_i = \pi_0(y_i \mid x_i)$

[Figure: logging policy $\pi_0(Y \mid x)$ vs. target policy $\pi(Y \mid x)$]

Page 12

Partial Information Empirical Risk Minimization

• Setup
  – Stochastic logging using $\pi_0$ with $p_i = \pi_0(y_i \mid x_i)$
  – Data: $S = ((x_1, y_1, \delta_1, p_1), \dots, (x_n, y_n, \delta_n, p_n))$
  – Stochastic prediction rules $\pi \in H$: $\pi(y_i \mid x_i)$

• Training

[Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]

οΏ½πœ‹πœ‹ ≔ argminπœ‹πœ‹βˆˆπ»π»οΏ½π‘–π‘–

π‘›π‘›πœ‹πœ‹ 𝑦𝑦𝑖𝑖 π‘₯π‘₯𝑖𝑖

𝑝𝑝𝑖𝑖𝛿𝛿𝑖𝑖

πœ‹πœ‹0(π‘Œπ‘Œ|π‘₯π‘₯) πœ‹πœ‹1(π‘Œπ‘Œ|π‘₯π‘₯) πœ‹πœ‹0(π‘Œπ‘Œ|π‘₯π‘₯) πœ‹πœ‹237(π‘Œπ‘Œ|π‘₯π‘₯)

Page 13

Generalization Error Bound for BLBF

• Theorem [Generalization Error Bound]
  – For any hypothesis space $H$ with capacity $C$, and for all $\pi \in H$, with probability $1 - \eta$:

�𝑅𝑅 πœ‹πœ‹ = �𝑀𝑀𝑀𝑀𝑀𝑀𝑛𝑛 πœ‹πœ‹ 𝑦𝑦𝑖𝑖 π‘₯π‘₯𝑖𝑖𝑝𝑝𝑖𝑖

𝛿𝛿𝑖𝑖

�𝑉𝑉𝑀𝑀𝑉𝑉 πœ‹πœ‹ = �𝑉𝑉𝑀𝑀𝑉𝑉 πœ‹πœ‹ 𝑦𝑦𝑖𝑖 π‘₯π‘₯𝑖𝑖𝑝𝑝𝑖𝑖

𝛿𝛿𝑖𝑖

The bound accounts for the fact that the variance of the risk estimator can vary greatly between different $\pi \in H$.

R πœ‹πœ‹ ≀ �𝑅𝑅 πœ‹πœ‹ + 𝑂𝑂 �𝑉𝑉𝑀𝑀𝑉𝑉 πœ‹πœ‹ /𝑛𝑛 + 𝑂𝑂(𝐢𝐢)

[Swaminathan & Joachims, 2015]

(The three terms of the bound: unbiased estimator, variance control, capacity control.)

Page 14

Counterfactual Risk Minimization

• Theorem [Generalization Error Bound]

Constructive principle for designing learning algorithms

[Swaminathan & Joachims, 2015]

R πœ‹πœ‹ ≀ �𝑅𝑅 πœ‹πœ‹ + 𝑂𝑂 �𝑉𝑉𝑀𝑀𝑉𝑉 πœ‹πœ‹ /𝑛𝑛 + 𝑂𝑂(𝐢𝐢)

πœ‹πœ‹π‘π‘π‘π‘π‘π‘ = argminπœ‹πœ‹βˆˆπ»π»π‘–π‘–

�𝑅𝑅 πœ‹πœ‹ + πœ†πœ†1 �𝑉𝑉𝑀𝑀𝑉𝑉 πœ‹πœ‹ /𝑛𝑛 + πœ†πœ†2𝐢𝐢(𝐻𝐻𝑖𝑖)

�𝑅𝑅 πœ‹πœ‹ =1𝑛𝑛�𝑖𝑖

π‘›π‘›πœ‹πœ‹ 𝑦𝑦𝑖𝑖 π‘₯π‘₯𝑖𝑖)

𝑝𝑝𝑖𝑖𝛿𝛿𝑖𝑖 �𝑉𝑉𝑀𝑀𝑉𝑉(πœ‹πœ‹) =

1𝑛𝑛�𝑖𝑖

π‘›π‘›πœ‹πœ‹ 𝑦𝑦𝑖𝑖 π‘₯π‘₯𝑖𝑖

𝑝𝑝𝑖𝑖𝛿𝛿𝑖𝑖

2

βˆ’ �𝑅𝑅 πœ‹πœ‹ 2

Page 15

Outline of Lecture

• Batch Learning from Bandit Feedback (BLBF)
  $S = ((x_1, y_1, \delta_1), \dots, (x_n, y_n, \delta_n))$
  Find new policy $\pi$ that selects $y$ with better $\delta$

• Learning Principle for BLBF
  – Hypothesis Space, Risk, Empirical Risk, and Overfitting
  – Learning Principle: Counterfactual Risk Minimization

• Learning Algorithms for BLBF
  – POEM: Bandit training of CRF policies for structured outputs
  – BanditNet: Bandit training of deep network policies

Page 16

POEM Hypothesis Space

Hypothesis Space: Stochastic policies

πœ‹πœ‹π‘€π‘€ 𝑦𝑦 π‘₯π‘₯ =1

𝑍𝑍(π‘₯π‘₯)exp 𝑀𝑀 β‹… Ξ¦ π‘₯π‘₯, 𝑦𝑦

with– 𝑀𝑀: parameter vector to be learned– Ξ¦ π‘₯π‘₯,𝑦𝑦 : joint feature map between input and output– Z(x): partition function

Note: same form as CRF or Structural SVM
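A minimal sketch of such an exponential-model policy over a finite action set. The joint feature map `phi` below (one block of context features per action) is an assumption chosen to resemble the multi-label setup later, not prescribed by the slide:

```python
import numpy as np

def softmax_policy(w, x, actions, phi):
    """pi_w(y | x) = exp(w . phi(x, y)) / Z(x), computed over a finite action set."""
    scores = np.array([w @ phi(x, y) for y in actions])
    scores -= scores.max()                       # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()                   # dividing by the sum implements 1/Z(x)

def phi(x, y, k=4):
    """Assumed joint feature map: context features placed in the block of action y."""
    out = np.zeros(len(x) * k)
    out[y * len(x):(y + 1) * len(x)] = x
    return out

w = np.zeros(3 * 4)                              # d=3 context features, k=4 actions
print(softmax_policy(w, np.array([1.0, 0.0, 2.0]), range(4), phi))   # uniform at w = 0
```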

Page 17

POEM Learning Method

• Policy Optimizer for Exponential Models (POEM)
  – Data: $S = ((x_1, y_1, \delta_1, p_1), \dots, (x_n, y_n, \delta_n, p_n))$
  – Hypothesis space: $\pi_w(y \mid x) = \exp(w \cdot \phi(x, y)) / Z(x)$
  – Training objective: let $z_i(w) = \pi_w(y_i \mid x_i)\, \delta_i / p_i$

[Swaminathan & Joachims, 2015]

$\hat{w} = \operatorname{argmin}_{w \in \Re^N} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} z_i(w)}_{\text{unbiased risk estimator}} \;+\; \lambda_1 \underbrace{\sqrt{ \frac{1}{n} \sum_{i=1}^{n} z_i(w)^2 - \Big( \frac{1}{n} \sum_{i=1}^{n} z_i(w) \Big)^{\!2} }}_{\text{variance control}} \;+\; \lambda_2 \underbrace{\|w\|^2}_{\text{capacity control}}$
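A minimal sketch of evaluating this objective for a fixed $w$ (an optimizer such as AdaGrad or L-BFGS would be wrapped around it). It reuses `phi` and the softmax form from the sketch above; the concrete $\lambda_1, \lambda_2$ values are placeholders:

```python
import numpy as np

def poem_objective(w, log, phi, actions, lam1=0.1, lam2=0.01):
    """CRM objective: mean of z_i(w) + lam1 * empirical std of z_i(w) + lam2 * ||w||^2.

    Assumes actions == range(k) so that probs[y] indexes the logged action y.
    """
    z = []
    for x, y, d, p in log:
        scores = np.array([w @ phi(x, a) for a in actions])
        scores -= scores.max()
        probs = np.exp(scores) / np.exp(scores).sum()
        z.append(probs[y] * d / p)               # z_i(w) = pi_w(y_i | x_i) * delta_i / p_i
    z = np.array(z)
    var = np.mean(z ** 2) - np.mean(z) ** 2      # empirical variance of the z_i(w)
    return np.mean(z) + lam1 * np.sqrt(max(var, 0.0)) + lam2 * float(w @ w)
```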

Page 18

POEM Experiment: Multi-Label Text Classification

• Data: $S = ((x_1, y_1, \delta_1, p_1), \dots, (x_n, y_n, \delta_n, p_n))$ from Reuters LYRL RCV1 (top 4 categories)

• Results: POEM with $H$ isomorphic to a CRF with one weight vector per label

[Figure: Hamming loss (0.28–0.58) vs. $|S|$ = quantity (in epochs, 1–128) of training interactions from $\pi_0$; curves for $\pi_0$ (log data), CoStA, CRF (supervised), and POEM]

[Swaminathan & Joachims, 2015]

[Figure: bandit feedback example — logging policy $\pi_0$ observes a document $x_i$, predicts labels $y_i = \{\text{sport, politics}\}$ with propensity $p_i = 0.3$, and incurs loss $\delta_i = 1$]

Learning from Logged Interventions: every time a system places an ad, presents a search ranking, or makes a recommendation, we can think of this as an intervention for which we can observe the user's response (e.g. click, dwell time, purchase). Such logged intervention data is one of the most plentiful types of data available, as it can be recorded from a variety of systems.

Page 19

Does Variance Regularization Improve Generalization?

• IPS: $\hat{w} = \operatorname{argmin}_{w \in \Re^N} \hat{R}(w) + \lambda_2 \|w\|^2$

• POEM: $\hat{w} = \operatorname{argmin}_{w \in \Re^N} \hat{R}(w) + \lambda_1 \sqrt{\widehat{\mathrm{Var}}(w)/n} + \lambda_2 \|w\|^2$

Hamming Loss | Scene  | Yeast  | TMC     | LYRL
$\pi_0$      | 1.543  | 5.547  | 3.445   | 1.463
IPS          | 1.519  | 4.614  | 3.023   | 1.118
POEM         | 1.143  | 4.517  | 2.522   | 0.996

# examples   | 4*1211 | 4*1500 | 4*21519 | 4*23149
# features   | 294    | 103    | 30438   | 47236
# labels     | 6      | 14     | 22      | 4

Page 20

POEM Efficient Training Algorithm

• Training Objective:

• Idea: First-order Taylor Majorization
  – Majorize the $\sqrt{\cdot}$ term at the current value
  – Majorize the $-\big(\tfrac{1}{n}\sum_i z_i(w)\big)^2$ term at the current value

• Algorithm:
  – Majorize the objective at the current $w_t$
  – Solve the majorizing objective via AdaGrad to get $w_{t+1}$

$OPT = \min_{w \in \Re^N} \frac{1}{n}\sum_{i=1}^{n} z_i(w) + \lambda_1 \sqrt{ \frac{1}{n}\sum_{i=1}^{n} z_i(w)^2 - \Big(\frac{1}{n}\sum_{i=1}^{n} z_i(w)\Big)^{\!2} }$

$OPT \le \min_{w \in \Re^N} \frac{1}{n}\sum_{i=1}^{n} \big[ A_i\, z_i(w) + B_i\, z_i(w)^2 \big]$

[De Leeuw, 1977+] [Groenen et al., 2008] [Swaminathan & Joachims, 2015]
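As a concrete instance of these bounds (standard first-order majorizations, stated here as an assumption about the symbols lost in the slide), both problematic terms are replaced by their tangents at the current iterate:

$\sqrt{u} \;\le\; \sqrt{u_t} + \frac{u - u_t}{2\sqrt{u_t}} \qquad\text{and}\qquad -v^2 \;\le\; -2\,v_t\,v + v_t^2$

where $u$ denotes the empirical variance term, $v = \frac{1}{n}\sum_{i} z_i(w)$, and $u_t, v_t$ are their values at the current iterate $w_t$. Substituting both bounds leaves an objective that is linear in the $z_i(w)$ and $z_i(w)^2$ terms, i.e. of the form $\frac{1}{n}\sum_i \big[A_i\, z_i(w) + B_i\, z_i(w)^2\big]$ plus constants.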

Page 21

Counterfactual Risk Minimization

• Theorem [Generalization Error Bound]

Constructive principle for designing learning algorithms

[Swaminathan & Joachims, 2015]

R πœ‹πœ‹ ≀ �𝑅𝑅 πœ‹πœ‹ + 𝑂𝑂 �𝑉𝑉𝑀𝑀𝑉𝑉 πœ‹πœ‹ /𝑛𝑛 + 𝑂𝑂(𝐢𝐢)

πœ‹πœ‹π‘π‘π‘π‘π‘π‘ = argminπœ‹πœ‹βˆˆπ»π»π‘–π‘–

�𝑅𝑅 πœ‹πœ‹ + πœ†πœ†1 �𝑉𝑉𝑀𝑀𝑉𝑉 πœ‹πœ‹ /𝑛𝑛 + πœ†πœ†2𝐢𝐢(𝐻𝐻𝑖𝑖)

�𝑅𝑅 πœ‹πœ‹ =1𝑛𝑛�𝑖𝑖

π‘›π‘›πœ‹πœ‹ 𝑦𝑦𝑖𝑖 π‘₯π‘₯𝑖𝑖)

𝑝𝑝𝑖𝑖𝛿𝛿𝑖𝑖 �𝑉𝑉𝑀𝑀𝑉𝑉(πœ‹πœ‹) =

1𝑛𝑛�𝑖𝑖

π‘›π‘›πœ‹πœ‹ 𝑦𝑦𝑖𝑖 π‘₯π‘₯𝑖𝑖

𝑝𝑝𝑖𝑖𝛿𝛿𝑖𝑖

2

βˆ’ �𝑅𝑅 πœ‹πœ‹ 2

Page 22

Propensity Overfitting Problem

• Example
  – Instance space $X = \{1, \dots, k\}$
  – Label space $Y = \{1, \dots, k\}$
  – Loss $\delta(x, y) = \begin{cases} -2 & \text{if } y = x \\ -1 & \text{otherwise} \end{cases}$
  – Training data: uniform $(x, y)$ sample ($p_i = \tfrac{1}{k}$)
  – Hypothesis space: all deterministic functions
  – $\pi^{\mathrm{opt}}(x) = x$ with risk $R(\pi^{\mathrm{opt}}) = -2$

𝑅𝑅 οΏ½πœ‹πœ‹ = minπœ‹πœ‹βˆˆπ»π»

1𝑛𝑛�𝑖𝑖

π‘›π‘›πœ‹πœ‹ 𝑦𝑦𝑖𝑖 π‘₯π‘₯𝑖𝑖)

𝑝𝑝𝑖𝑖𝛿𝛿𝑖𝑖 =

1𝑛𝑛�𝑖𝑖

𝑛𝑛1

1/π‘˜π‘˜π›Ώπ›Ώπ‘–π‘– ≀ βˆ’π‘˜π‘˜

Problem 1: Unbounded risk estimate!

[Figure: grid of $X \times Y$ with the logged $(x_i, y_i)$ samples marked]

Page 23

Propensity Overfitting Problem

• Example
  – Instance space $X = \{1, \dots, k\}$
  – Label space $Y = \{1, \dots, k\}$
  – Loss $\delta(x, y) = \begin{cases} -2 & \text{if } y = x \\ -1 & \text{otherwise} \end{cases}$
  – Training data: uniform $(x, y)$ sample ($p_i = \tfrac{1}{k}$)
  – Hypothesis space: all deterministic functions
  – $\pi^{\mathrm{opt}}(x) = x$ with risk $R(\pi^{\mathrm{opt}}) = -2$

𝑅𝑅 οΏ½πœ‹πœ‹ = minπœ‹πœ‹βˆˆπ»π»

1𝑛𝑛�𝑖𝑖

π‘›π‘›πœ‹πœ‹ 𝑦𝑦𝑖𝑖 π‘₯π‘₯𝑖𝑖)

𝑝𝑝𝑖𝑖𝛿𝛿𝑖𝑖 =

1𝑛𝑛�𝑖𝑖

𝑛𝑛0

1/π‘˜π‘˜π›Ώπ›Ώπ‘–π‘– = 0

Problem 2: Lack of equivariance!

[Figure: grid of $X \times Y$ with the logged $(x_i, y_i)$ samples; loss values shown translated to $\{0, 1\}$]

The IPS estimate is not equivariant to translating the loss: after adding a constant to all $\delta_i$ (e.g. shifting $\{-2, -1\}$ to $\{0, 1\}$), any policy that avoids every logged action achieves estimate $0$ and looks optimal, even though adding a constant does not change which policy is truly best.

Page 24

Control Variate

• Idea: Inform the estimate when the expectation of a correlated random variable is known.
  – Estimator:

    $\hat{S}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(y_i \mid x_i)}{p_i}$

  – Correlated RV with known expectation:

    $E\big[\hat{S}(\pi)\big] = \frac{1}{n} \sum_{i=1}^{n} \int\!\!\int \frac{\pi(y_i \mid x_i)}{\pi_0(y_i \mid x_i)}\, \pi_0(y_i \mid x_i)\, P(x_i)\, dy_i\, dx_i = 1$

Alternative Risk Estimator: Self-normalized estimator

$\hat{R}^{\mathrm{SN}}(\pi) = \frac{\hat{R}(\pi)}{\hat{S}(\pi)}, \qquad \hat{R}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(y_i \mid x_i)}{p_i}\, \delta_i$

[Hesterberg, 1995] [Swaminathan & Joachims, 2015]
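A minimal sketch of the self-normalized (SNIPS) estimator, in the same assumed log format as the IPS sketch earlier; the closing comment notes the translation-equivariance property that plain IPS lacks:

```python
import numpy as np

def snips_risk(log, target_prob):
    """Self-normalized IPS: weighted losses divided by the sum of importance weights."""
    w = np.array([target_prob(x, y) / p for (x, y, _, p) in log])   # importance weights
    d = np.array([d for (_, _, d, _) in log])
    return float((w @ d) / w.sum())              # = R_hat(pi) / S_hat(pi)

# Adding a constant c to every delta_i shifts the SNIPS estimate by exactly c,
# so (unlike plain IPS) the minimizer does not change under loss translation.
```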

Page 25

SNIPS Learning Objective

• Method:
  – Data: $S = ((x_1, y_1, \delta_1, p_1), \dots, (x_n, y_n, \delta_n, p_n))$
  – Hypothesis space: $\pi_w(y \mid x) = \exp(w \cdot \phi(x, y)) / Z(x)$
  – Training objective:

[Swaminathan & Joachims, 2015]

$\hat{w} = \operatorname{argmin}_{w \in \Re^N} \; \underbrace{\hat{R}^{\mathrm{SNIPS}}(w)}_{\text{self-normalized risk estimator}} + \lambda_1 \underbrace{\sqrt{\widehat{\mathrm{Var}}\big(\hat{R}^{\mathrm{SNIPS}}(w)\big)}}_{\text{variance control}} + \lambda_2 \underbrace{\|w\|^2}_{\text{capacity control}}$

Page 26

How well does NormPOEM generalize?

Hamming Loss  | Scene  | Yeast  | TMC     | LYRL
$\pi_0$       | 1.511  | 5.577  | 3.442   | 1.459
POEM (IPS)    | 1.200  | 4.520  | 2.152   | 0.914
POEM (SNIPS)  | 1.045  | 3.876  | 2.072   | 0.799

# examples    | 4*1211 | 4*1500 | 4*21519 | 4*23149
# features    | 294    | 103    | 30438   | 47236
# labels      | 6      | 14     | 22      | 4

Page 27

Outline of Lecture

• Batch Learning from Bandit Feedback (BLBF)
  $S = ((x_1, y_1, \delta_1), \dots, (x_n, y_n, \delta_n))$
  Find new policy $\pi$ that selects $y$ with better $\delta$

• Learning Principle for BLBF
  – Hypothesis Space, Risk, Empirical Risk, and Overfitting
  – Learning Principle: Counterfactual Risk Minimization

• Learning Algorithms for BLBF
  – POEM: Bandit training of CRF policies for structured outputs
  – BanditNet: Bandit training of deep network policies

Page 28

BanditNet: Hypothesis Space

Hypothesis Space: Stochastic policies

πœ‹πœ‹π‘€π‘€ 𝑦𝑦 π‘₯π‘₯ =1

𝑍𝑍(π‘₯π‘₯)exp π·π·π‘€π‘€π‘€π‘€π‘π‘π·π·π‘€π‘€π‘œπ‘œ(π‘₯π‘₯,𝑦𝑦|𝑀𝑀)

with– 𝑀𝑀: parameter tensors to be learned– Z(x): partition function

Note: same form as Deep Net with softmax output

[Joachims et al., 2017]

Page 29

BanditNet: Learning Method

• Method:
  – Data: $S = ((x_1, y_1, \delta_1, p_1), \dots, (x_n, y_n, \delta_n, p_n))$
  – Hypotheses: $\pi_w(y \mid x) = \exp\!\big(\mathrm{DeepNet}(x, y \mid w)\big) / Z(x)$
  – Training objective:

$\hat{w} = \operatorname{argmin}_{w \in \Re^N} \; \underbrace{\hat{R}^{\mathrm{SNIPS}}(w)}_{\text{self-normalized risk estimator}} + \lambda_1 \underbrace{\sqrt{\widehat{\mathrm{Var}}\big(\hat{R}^{\mathrm{SNIPS}}(w)\big)}}_{\text{variance control}} + \lambda_2 \underbrace{\|w\|^2}_{\text{capacity control}}$

[Joachims et al., 2017]

Page 30

BanditNet: Learning Method

• Method:
  – Data: $S = ((x_1, y_1, \delta_1, p_1), \dots, (x_n, y_n, \delta_n, p_n))$
  – Representation: Deep Network Policies

    $\pi_w(y \mid x) = \frac{1}{Z(x, w)} \exp\!\big( \mathrm{DeepNet}(y \mid x, w) \big)$

– SNIPS Training Objective:

$\hat{w} = \operatorname{argmin}_{w \in \Re^N} \hat{R}^{\mathrm{SNIPS}}(w) + \lambda \|w\|^2 = \operatorname{argmin}_{w \in \Re^N} \frac{ \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i}\, \delta_i }{ \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i} } + \lambda \|w\|^2$

Page 31

Optimization via SGD

• Problem: The SNIPS objective is a ratio of sums, so it is not directly suitable for SGD.

• Step 1: Discretize over the possible values $S_j$ of the denominator

• Step 2: View as a series of constrained optimization problems

• Step 3: Eliminate the constraint via a Lagrangian

$\hat{w}_j = \operatorname{argmin}_{w} \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i}\, \delta_i \quad \text{subject to} \quad \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i} = S_j$

$\hat{w} = \operatorname{argmin}_{S_j} \operatorname{argmin}_{w} \frac{1}{S_j} \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i}\, \delta_i \quad \text{subject to} \quad \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i} = S_j$

$\hat{w}_j = \operatorname{argmin}_{w} \max_{\lambda} \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i} (\delta_i - \lambda) + \lambda\, S_j$

Page 32

Optimization via SGD

• Step 4: Search a grid over $\lambda$ instead of $S_j$
  – Hard: Given $S_j$, find $\lambda_j$.
  – Easy: Given $\lambda_j$, find $S_j$.

Solve:

$\hat{w}_j = \operatorname{argmin}_{w} \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i} (\delta_i - \lambda_j) + \lambda_j S_j$

Compute:

$S_j = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_{\hat{w}_j}(y_i \mid x_i)}{p_i}$

Page 33

BanditNet: Training Algorithm

• Given:
  – Data: $S = ((x_1, y_1, \delta_1, p_1), \dots, (x_n, y_n, \delta_n, p_n))$
  – Lagrange multipliers: $\lambda_j \in \{\lambda_1, \dots, \lambda_k\}$

• Compute:
  – For each $\lambda_j$, solve:

    $\hat{w}_j = \operatorname{argmin}_{w} \sum_{i=1}^{n} \frac{\pi_w(y_i \mid x_i)}{p_i} (\delta_i - \lambda_j)$

  – For each $\hat{w}_j$, compute:

    $S_j = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_{\hat{w}_j}(y_i \mid x_i)}{p_i}$

  – Find the overall $\hat{w}$:

    $\hat{w} = \operatorname{argmin}_{\hat{w}_j, S_j} \frac{1}{S_j} \sum_{i=1}^{n} \frac{\pi_{\hat{w}_j}(y_i \mid x_i)}{p_i}\, \delta_i$
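A schematic sketch of this procedure. `train_lagrangian` and `policy_prob` are assumed helpers standing in for ordinary SGD training of the deep network on the reweighted loss and for querying its softmax probabilities; they are not APIs from the paper:

```python
import numpy as np

def banditnet_train(log, lambdas, train_lagrangian, policy_prob):
    """Grid search over Lagrange multipliers for the SNIPS objective.

    log: list of (x_i, y_i, delta_i, p_i).
    train_lagrangian(log, lam) -> model trained by SGD on sum_i pi_w(y_i|x_i)/p_i * (delta_i - lam).
    policy_prob(model, x, y) -> pi_w(y | x) under the trained model.
    """
    best_model, best_obj = None, np.inf
    for lam in lambdas:
        model = train_lagrangian(log, lam)                          # solve the per-lambda OP
        w = np.array([policy_prob(model, x, y) / p for (x, y, _, p) in log])
        d = np.array([d for (_, _, d, _) in log])
        obj = (w @ d) / w.sum()                                     # SNIPS estimate R_hat / S_hat
        if obj < best_obj:
            best_model, best_obj = model, obj
    return best_model
```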

Page 34

Object Recognition: Data and Setup

• Data: CIFAR-10 (fully labeled) $S^* = ((x_1, y_1^*), \dots, (x_c, y_c^*))$

• Bandit feedback generation:
  – Draw image $x_i$
  – Use logging policy $\pi_0(Y \mid x_i)$ to predict $y_i$
  – Record propensity $p_i = \pi_0(Y = y_i \mid x_i)$
  – Observe loss $\delta_i = [y_i \neq y_i^*]$

  $S = ((x_1, y_1, \delta_1, p_1), \dots, (x_n, y_n, \delta_n, p_n))$

• Network architecture: ResNet20 [He et al., 2016]

[Beygelzimer & Langford, 2009] [Joachims et al., 2017]

[Figure: example — the logging policy $\pi_0$ sees an image, predicts $y_i = \text{dog}$ with propensity $p_i = 0.3$, and observes loss $\delta_i = 1$]
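A minimal sketch of this supervised-to-bandit conversion ([Beygelzimer & Langford, 2009]). `pi0_probs` is an assumed stand-in for whatever logging policy is used, e.g. the softmax output of a weak classifier over the 10 classes:

```python
import numpy as np

def make_bandit_dataset(images, true_labels, pi0_probs, seed=0):
    """Turn fully labeled examples (x_i, y_i*) into bandit feedback (x_i, y_i, delta_i, p_i).

    pi0_probs(x) -> probability vector over the label set for the logging policy pi_0.
    """
    rng = np.random.default_rng(seed)
    log = []
    for x, y_star in zip(images, true_labels):
        probs = pi0_probs(x)
        y = int(rng.choice(len(probs), p=probs))                  # sample the logged action from pi_0
        log.append((x, y, float(y != y_star), float(probs[y])))   # 0/1 loss and propensity
    return log
```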

Page 35

Bandit Feedback vs. Test Error

Logging policy $\pi_0$: 49% error rate. Bandit-ResNet with naïve IPS: >49% error rate.

[Joachims et al., 2017]

Page 36

Lagrange Multiplier vs. Test Error

Large basin of optimality far away from naïve IPS.

Page 37

Analysis of SNIPS Estimate

The control variate responds to the Lagrange multiplier monotonically. The SNIPS training error resembles the test error.

Page 38

Conclusions and Future

• Batch Learning from Bandit Feedback
  – Feedback only for the presented action: $S = ((x_1, y_1, \delta_1, p_1), \dots, (x_n, y_n, \delta_n, p_n))$
  – Goal: Find a new system $\pi$ that selects $y$ with better $\delta$
  – Learning principle for BLBF: Counterfactual Risk Minimization

• Learning from Logged Interventions: BLBF and Beyond
  – POEM: [Swaminathan & Joachims, 2015c]
  – NormPOEM: [Swaminathan & Joachims, 2015c]
  – BanditNet: [Joachims et al., 2018]
  – SVM PropRank: [Joachims et al., 2017a]
  – DeepPropDCG: [Agarwal et al., 2018]
  – Unbiased Matrix Factorization: [Schnabel et al., 2016]

• Future Research
  – Other learning algorithms? Other partial-information settings?
  – How to handle the new bias-variance trade-off in risk estimators?
  – Applications

• Software, Papers, SIGIR Tutorial, Data: www.joachims.org