Multi-Armed Bandits (MABs), CS57300 - Data Mining, Purdue University, Fall 2016
} So far we have seen how to:
◦ Test a hypothesis in batches (A/B testing)
◦ Test multiple hypotheses (Paul the Octopus-style)
◦ Stop a hypothesis test before the experiment is over
Recap last two classes
© 2016 Bruno Ribeiro
} Select 50% of users to see headline A
◦ Titanic Sinks
} Select 50% of users to see headline B
◦ Ship Sinks Killing Thousands
} Do people click more on headline A or B?
} If A is much better than B, we could do better…
} We refer to decision A or B as choosing an arm
The New York Times Daily Dilemma
} Sometimes we don’t only want to quickly find whether hypothesis A is better than hypothesis B
} We really want to use the best-looking hypothesis at any point in time
} Deciding if H0 should be rejected is irrelevant
Truth is…
} The web is in a perpetual state of feature testing
} Goal: acquire just enough information about suboptimal arms to confirm that they are suboptimal
(arm A) Titanic Sinks
(arm B) Ship Sinks Killing Thousands
Real-world Problem
$$X^{(i)}_k = \begin{cases} 1 & \text{if the $k$-th user seeing headline $i$ clicks} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(reward)}$$
$$X^{(1)}_k = \begin{cases} 1 & \text{with probability } p_1 \\ 0 & \text{otherwise} \end{cases} \qquad X^{(2)}_k = \begin{cases} 1 & \text{with probability } p_2 \\ 0 & \text{otherwise} \end{cases}$$
} Play t times
} Each time choose arm i∈{1,2}
} Goal: Maximize total expected reward
Multi-armed Bandit Dynamics
[Figure: two slot machines, arm A and arm B]
$$X^{(1)}_k = \begin{cases} 1 & \text{with probability } p_1 \\ 0 & \text{otherwise} \end{cases} \qquad X^{(2)}_k = \begin{cases} 1 & \text{with probability } p_2 \\ 0 & \text{otherwise} \end{cases}$$
$\pi$ is the vector of the arms we play; $\pi_t$ is the $t$-th played arm.
$$R_T = \sum_{t=1}^{T} X^{(\pi_t)}_{n_{\pi_t}(t)}, \qquad \text{where } n_i(t) = \sum_{h=1}^{t} \mathbf{1}\{\pi_h = i\}$$
(the subscript $n_{\pi_t}(t)$ says this is, e.g., the $k$-th time arm A is played)
} Exploration-exploitation trade-off
◦ Play the arm with the highest (empirical) average reward so far?
◦ Play arms just to get a better estimate of their expected rewards?
} Classical model that dates back multiple decades [Thompson ‘33, Wald ‘47, Arrow et al. ‘49, Robbins ‘50, …, Gittins & Jones ‘72, ... ]
Problem Characteristics
Formal Bandit Definition
• $K \ge 2$ arms
• Pulling arm $i$ $n_i$ times produces rewards $X^{(i)}_1, \ldots, X^{(i)}_{n_i}$ with (unknown) joint distribution $f(x_1, \ldots, x_{n_i} \mid \theta_i)$, $\theta_i \in \Theta$
• At time $t \ge n_i(t)$ we know $X^{(i)}_1, \ldots, X^{(i)}_{n_i(t)}$, where $n_i(t)$ is the number of pulls of arm $i$ up to time $t$
• Many formulations assume $X^{(i)}_1, \ldots, X^{(i)}_{n_i(t)}$ form a Markov chain:
$$P[X^{(i)}_k \mid X^{(i)}_{k-1}, X^{(i)}_{k-2}, \ldots] = P[X^{(i)}_k \mid X^{(i)}_{k-1}]$$
• Most applications assume i.i.d. draws: $P[X^{(i)}_k \mid X^{(i)}_{k-1}, \ldots] = P[X^{(i)}_k]$
(A1) only one arm is operated each time
(A2) rewards in arms not used remain frozen
(A3) arms are independent
(A4) frozen arms contribute no reward
Assumptions (can be easily violated in practice)
I.i.d. Stochastic Bandit Definition
• Assumption: conditional independence, $P[X^{(i)}_k \mid X^{(i)}_{k-1}, \ldots, \theta_i] = P[X^{(i)}_k \mid \theta_i]$
• $K \ge 2$ arms
• Pulling arm $i$ $n_i$ times produces rewards $X^{(i)}_1, \ldots, X^{(i)}_{n_i(t)}$ i.i.d. with distribution $f(x \mid \theta_i)$, $\theta_i \in \Theta$
• At time $t \ge n_i(t)$ we know $X^{(i)}_1, \ldots, X^{(i)}_{n_i(t)}$
Relies on a model $f(x \mid \theta_i)$, $\theta_i \in \Theta$. Depending on the model, this can be more general than the i.i.d. and Markov assumptions.
Goal
Future actions depend on past actions. The policy takes past values into consideration.
Regret
$$R_t = \max_{j^\star=1,\ldots,K} \sum_{h=1}^{t} X^{(j^\star)}_{n_{j^\star}(h)} - \sum_{h=1}^{t} X^{(\pi_h)}_{n_{\pi_h}(h)},$$
where $\pi$ is the sequence of arm choices (the way we choose arms, i.e., the policy) and $n_{\pi_h}(h)$ is the number of times we have pulled arm $\pi_h$ up to time $h$.
We can seek to minimize the average regret,
$$\min_{\pi} E[R_t],$$
or to minimize the regret with high probability,
$$P[R_t \ge \epsilon] \le \delta.$$
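To make the definition concrete, here is a small simulation sketch (not from the slides; `run_policy`, the uniform policy, and the click rates 0.05/0.04 are all hypothetical) estimating the regret of a uniformly random policy against the best fixed arm:

```python
import random

def run_policy(p, policy, t, seed=0):
    """Pull Bernoulli arms with means p for t steps; policy(history, rng) -> arm index."""
    rng = random.Random(seed)
    history, total = [], 0.0
    for _ in range(t):
        arm = policy(history, rng)
        reward = 1.0 if rng.random() < p[arm] else 0.0
        history.append((arm, reward))
        total += reward
    return total

# Hypothetical click-through rates for headlines A and B.
p = [0.05, 0.04]
t = 100_000
uniform = lambda history, rng: rng.randrange(len(p))
reward = run_policy(p, uniform, t, seed=0)
# Realized reward vs. the expected reward of always playing the best arm
# (an estimate of the pseudo-regret).
regret = t * max(p) - reward
```

With these rates the random policy earns about 0.045 clicks per step versus 0.05 for the best arm, so the regret grows linearly, roughly 0.005·t.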
Regret Growth with i.i.d. Rewards
(first sum below: optimal policy; second sum: chosen policy)
• The standard deviation of the empirical sum $\sum_{k=1}^{n_i} X^{(i)}_k$ grows like $\sqrt{t}$
• Thus, at best $E[R_t] \propto \sqrt{t}$
• Rather, we minimize over $\pi$ w.r.t. the best policy (pseudo-regret):
$$\bar{R}_t = \max_{i^\star=1,\ldots,K} E\left[\sum_{h=1}^{t} X^{(i^\star)}_{n_{i^\star}(h)} - \sum_{h=1}^{t} X^{(\pi_h)}_{n_{\pi_h}(h)}\right]$$
Lower Bound on Expected Pseudo-Regret
*The KL-divergence of two distributions can be thought of as a measure of their statistical distinguishability
(Theorem 2, Lai & Robbins, 1985; valid for large values of $n$)
• Recall
$$\bar{R}_t = \max_{i^\star=1,\ldots,K} E\left[\sum_{h=1}^{t} X^{(i^\star)}_{n_{i^\star}(h)} - \sum_{h=1}^{t} X^{(\pi_h)}_{n_{\pi_h}(h)}\right] = t \max_{i^\star=1,\ldots,K} \mu_{i^\star} - E\left[\sum_{h=1}^{t} X^{(\pi_h)}_{n_{\pi_h}(h)}\right] = \sum_{k=1}^{K} E[n_k(t)]\,\Delta_k,$$
where $\Delta_k = \max_{i^\star} \mu_{i^\star} - \mu_k$.
• Asymptotically,
$$E[n_i(t)] \gtrsim \frac{\log t}{D_{KL}\big(f(x \mid \theta_i),\, f(x \mid \theta_{i^\star})\big)},$$
where $D_{KL}$ is the KL divergence.
} Algorithm
◦ Let arm $i$ be the arm with the maximum average reward at step $t$
◦ Play arm $i$
} Play-the-winner does not work well
◦ Worst case: $E[R_t] \propto t$
Play-the-winner
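The failure mode can be seen in a short simulation (a sketch with made-up win rates 0.6 vs. 0.5, not the lecture's example): with constant probability, one bad initial pull locks play-the-winner onto the suboptimal arm for good.

```python
import random

def play_the_winner(p, t, seed):
    """Always play the arm with the highest empirical mean so far
    (ties go to the lowest index); p holds hypothetical Bernoulli means."""
    rng = random.Random(seed)
    k = len(p)
    pulls, wins = [0] * k, [0] * k
    for step in range(t):
        if step < k:                        # initialize: pull each arm once
            arm = step
        else:
            arm = max(range(k), key=lambda i: wins[i] / pulls[i])
        if rng.random() < p[arm]:
            wins[arm] += 1
        pulls[arm] += 1
    return pulls

# Count how often the suboptimal arm ends up with most of the pulls.
locked_wrong = 0
for seed in range(200):
    pulls = play_the_winner([0.6, 0.5], 1000, seed)
    if pulls[1] > pulls[0]:
        locked_wrong += 1
```

Roughly 20% of runs start with arm A losing its first pull while arm B wins one; arm A's empirical mean is then 0 and it is never tried again, giving regret linear in t.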
} Assume rewards in $[0,1]$
} ε-greedy: at time $t$
◦ with probability $1-\epsilon_t$ play the best arm so far
◦ with probability $\epsilon_t$ play a random arm
} Theoretical guarantee (Auer, Cesa-Bianchi, Fischer 2002)
◦ In practice we use larger values for Δ than the minimum reward gap (too conservative)
ε-greedy
• Let $\Delta = \min_{i:\Delta_i>0} \Delta_i$ and let $\epsilon_t = \min\left(\frac{12}{\Delta^2 t},\, 1\right)$
• If $t \ge \frac{12}{\Delta^2}$, the probability of choosing a suboptimal arm $i$ is bounded by $\frac{C}{\Delta^2 t}$ for some constant $C > 0$
• Then we have logarithmic regret, since $E[n_i(t)] \le \frac{C}{\Delta^2} \log t$ and $R_t \le \sum_{i:\Delta_i>0} \frac{C \Delta_i}{\Delta^2} \log t$
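The schedule can be sketched as follows (assuming, purely for illustration, that the gap Δ is known; the arm means 0.6/0.5 and the helper name `epsilon_greedy` are made up):

```python
import random

def epsilon_greedy(p, t, delta, seed=0):
    """Epsilon-greedy with the decaying schedule eps_t = min(12/(delta^2 * t), 1).
    p holds hypothetical Bernoulli means; delta is the assumed reward gap."""
    rng = random.Random(seed)
    k = len(p)
    pulls, wins = [0] * k, [0.0] * k
    for step in range(1, t + 1):
        eps = min(12.0 / (delta ** 2 * step), 1.0)
        if rng.random() < eps or 0 in pulls:
            arm = rng.randrange(k)          # explore
        else:                               # exploit the best empirical mean
            arm = max(range(k), key=lambda i: wins[i] / pulls[i])
        if rng.random() < p[arm]:
            wins[arm] += 1.0
        pulls[arm] += 1
    return pulls

# True gap is 0.1; in practice delta is unknown and must be guessed.
pulls = epsilon_greedy([0.6, 0.5], t=20_000, delta=0.1, seed=1)
```

Since the exploration probability decays like 1/t, the number of forced random pulls grows only logarithmically, which is where the logarithmic regret comes from.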
} For $K > 2$ arms we play all suboptimal arms with the same probability
} Very sensitive to high-variance rewards
} Real-world performance is worse than the next algorithm (UCB1)
Problems of ε-greedy
} Using a probabilistic argument we can provide an upper bound on the expected reward of each arm $i=1,\ldots,K$ with a given level of confidence
} Strategy: play arm with largest upper bound
} Algorithm known as Upper Confidence Bound 1 (UCB1)
Optimism in Face of Uncertainty (UCB)
} UCB1 algorithm
◦ $t$ = total number of arm plays so far
◦ Play the arm $i$ with the largest index
$$\frac{1}{n_i(t)} \sum_{k=1}^{n_i(t)} X_k + \sqrt{\frac{2 \log t}{n_i(t)}}$$

Using the Chernoff-Hoeffding Theorem

Let $X_1, \ldots, X_{n_i}$ be i.i.d. rewards from arm $i$ with distribution bounded in $[0,1]$; then for any $\epsilon > 0$,
$$P\left[\sum_{k=1}^{n_i} X_k \le n_i E[X_1] - \epsilon\right] \le \exp\left(\frac{-2\epsilon^2}{n_i}\right).$$
Choosing $\epsilon = \sqrt{2\, n_i(t) \log t}$ gives
$$P\left[\frac{1}{n_i(t)} \sum_{k=1}^{n_i(t)} X_k + \sqrt{\frac{2 \log t}{n_i(t)}} \le E[X_1]\right] \le t^{-4},$$
so the index is a valid upper confidence bound on the expected reward with high probability.
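A minimal UCB1 sketch using the index above (hypothetical Bernoulli arms; `ucb1` is an illustrative helper, not course code):

```python
import math
import random

def ucb1(p, t, seed=0):
    """UCB1: play the arm maximizing empirical mean + sqrt(2 log t / n_i).
    p holds hypothetical Bernoulli means."""
    rng = random.Random(seed)
    k = len(p)
    pulls, wins = [0] * k, [0.0] * k
    for step in range(1, t + 1):
        if step <= k:                       # play every arm once first
            arm = step - 1
        else:
            arm = max(range(k), key=lambda i: wins[i] / pulls[i]
                      + math.sqrt(2.0 * math.log(step) / pulls[i]))
        if rng.random() < p[arm]:
            wins[arm] += 1.0
        pulls[arm] += 1
    return pulls

pulls = ucb1([0.6, 0.5, 0.3], t=50_000, seed=2)
# Suboptimal arms get only O(log t / gap^2) pulls; the larger the gap,
# the fewer pulls the arm receives.
```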
} Each suboptimal arm $i$ is pulled on average at most
$$E[n_i(t)] \le \frac{8 \log t}{\Delta_i^2} + \frac{\pi^2}{3}$$
times.
} Note that the MAB lower bound is $\Omega(\log t)$

UCB1 Regret Bound
} Use the empirical Bernstein inequality
} Play arm $i$ at time $t$ if $i$ has the maximum index
Improving Bound ➡ Smaller Regret
} Optimal solutions via stochastic dynamic programming
◦ Gittins index
} Suffers from incomplete learning (Brezzi and Lai 2000, Kumar and Varaiya 1986)
◦ Playing the wrong arm forever with non-zero probability
◦ One more reason to be wary of average rewards
} Only applicable to infinite horizons & complex to compute
} Poor performance on real-world problems
Forever Wrong
Optimal Solution?
} Thompson sampling
◦ Strategy: select arm $i$ according to its posterior probability of being the best arm
◦ Can be used in complex problems (dependent priors, complex actions, dependent rewards)
◦ Great real-world performance
◦ Great regret bounds
Bayesian Bandits (continuing from last class)
Bernoulli Bandits
• Let $\mu_i \in (0, 1)$
• The reward of arm $i = 1, \ldots, K$ at step $k$ is $X^{(i)}_k \sim \text{Bernoulli}(\mu_i)$
• Strategy:
– Uniform prior $\mu_i \sim U(0,1)$
– Play arm $i$ at time $t$ so as to maximize the posterior $P[\mu_i = \mu^\star \mid X^{(i)}_1, \ldots, X^{(i)}_{n_i(t)}]$
– Typically a simpler randomized strategy: play $i$ with probability $P[\mu_i = \mu^\star \mid X^{(i)}_1, \ldots, X^{(i)}_{n_i(t)}]$
Thompson (1933)
Bernoulli rewards + Beta priors
[Figure: Beta distribution PDF (source: Wikipedia)]
$$Y_t \mid I_t \sim \text{Bernoulli}(\mu_{I_t})$$
Prior: Beta distribution
$$P[\mu_i \mid \alpha, \beta] = \frac{\mu_i^{\alpha-1} (1-\mu_i)^{\beta-1}}{\int_0^1 p^{\alpha-1} (1-p)^{\beta-1}\, dp}$$
Posterior:
$$\mu_i \sim \text{Beta}\left(\alpha + \sum_{k=1}^{t} Y_k \mathbf{1}\{I_k = i\},\; \beta + \sum_{k=1}^{t} (1-Y_k) \mathbf{1}\{I_k = i\}\right)$$
Thompson Algorithm (for Bernoulli rewards)
[Figure: Binomial distribution of the number of correct predictions (density plot)]
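The Beta-Bernoulli updates above can be sketched as follows (hypothetical arm means; `thompson_bernoulli` is an illustrative name, not course code):

```python
import random

def thompson_bernoulli(p, t, seed=0):
    """Beta-Bernoulli Thompson sampling with independent Beta(1,1) priors.
    p holds hypothetical Bernoulli means."""
    rng = random.Random(seed)
    k = len(p)
    alpha = [1.0] * k                       # 1 + number of observed clicks
    beta = [1.0] * k                        # 1 + number of observed non-clicks
    pulls = [0] * k
    for _ in range(t):
        # Sample one mean per arm from its posterior and play the argmax:
        # this plays each arm with its posterior probability of being best.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        if rng.random() < p[arm]:
            alpha[arm] += 1.0
        else:
            beta[arm] += 1.0
        pulls[arm] += 1
    return pulls

pulls = thompson_bernoulli([0.6, 0.4], t=10_000, seed=3)
```

As the posteriors concentrate, the sampled means for suboptimal arms rarely exceed the best arm's, so exploration fades automatically.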
} Theorem (Agrawal and Goyal, 2012)
TS: Bernoulli Reward Regret
Proof idea
} The posterior gets concentrated as more samples are obtained
For all $\mu_1, \ldots, \mu_K$ there is a constant $C$ such that $\forall \epsilon > 0$,
$$\bar{R}_t \le (1+\epsilon) \sum_{i:\Delta_i>0} \frac{\Delta_i \log t}{D_{KL}(\mu_i, \mu^\star)} + \frac{CK}{\epsilon^2}$$
} Facebook & LinkedIn use two-sample hypothesis tests instead of MABs. Why?
} More generally, which MAB assumptions often do not hold in real-life applications?
Some Shortcomings of MAB
(A2) rewards in arms not used remain frozen: recommendations influence users
(A3) arms are independent: users are influenced by what they see and by each other
Assumptions most violated in practice
} Bandits with side information
} We know the reader subscribes to the magazine
} Headline A may be more successful in this subpopulation
◦ Titanic Sinks
} Headline B better for the general population
◦ Ship Sinks Killing Thousands
Contextual Bandits
} Consider a hash (random function) with deterministic projection $h: \{0,1\}^n \to \mathbb{R}^m$
} At each play, we make some useful assumptions about $f$
Contextual Bandits: Problem Formulation
(in $Y_t = f(h(x_t, z_{i_t}) \mid \theta) + \epsilon_t$: $x_t$ are user features, $z_{i_t}$ are headline features (the context), and $\theta$ is learned from observations, e.g., by recursive least squares)
1. Observe features $x_t \in \mathcal{X}$
2. Choose arm $i_t \in \{1, \ldots, K\}$
3. Receive reward $Y_t = f(h(x_t, z_{i_t}) \mid \theta) + \epsilon_t$
Thompson Sampling for Linear Contextual Bandits 1/2
Assume the noise is Gaussian,
$$\epsilon_t \sim N(0, \sigma^2),$$
and that $\theta$ has prior
$$\theta \sim N(0, \sigma_0^2 I).$$
The posterior distribution of $\theta$ is then
$$p(\theta \mid X_t, \vec{Y}_t) = N(\hat{\theta}_t, \Sigma_t),$$
where $X_t$ stacks the played contexts $h(x_h, z_{i_h})$ as rows and
$$\hat{\theta}_t = (\lambda I + X_t^T X_t)^{-1} X_t^T \vec{Y}_t, \qquad \Sigma_t = \sigma^2 (\lambda I + X_t^T X_t)^{-1}, \qquad \lambda = \frac{\sigma^2}{\sigma_0^2}.$$
Thompson Sampling for Linear Contextual Bandits 2/2
} From [Russo, Van Roy 2014] the pseudo-regret is
$$\bar{R}_t = O(d\sqrt{t})$$
} This sampling scheme draws each context with probability proportional to its posterior probability of being optimal
Thompson Sampling heuristic: sample
$$\tilde{\theta}_t \sim N(\hat{\theta}_t, \Sigma_t)$$
and play the best arm under the sampled parameter,
$$i^\star = \arg\max_{i \in \{1,\ldots,K\}} h(x_{t+1}, z_i)^T \tilde{\theta}_t$$
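A toy 2-dimensional sketch of this heuristic (all names and numbers are hypothetical; to stay dependency-free it samples each coordinate of θ independently, ignoring posterior correlations, which is a simplification of drawing from the full N(θ̂_t, Σ_t)):

```python
import math
import random

def solve2(A, b):
    """Solve the 2x2 linear system A x = b."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (-A[1][0] * b[0] + A[0][0] * b[1]) / det]

def linear_ts(theta_true, arms, t, sigma=0.1, lam=1.0, seed=0):
    """Thompson sampling for a 2-d linear bandit Y = theta^T z + Gaussian noise.
    Posterior mean solves (lam*I + X^T X) theta = X^T Y; the posterior sample
    adds independent per-coordinate noise (a simplification)."""
    rng = random.Random(seed)
    A = [[lam, 0.0], [0.0, lam]]            # lam*I + X^T X
    b = [0.0, 0.0]                          # X^T Y
    picks = [0] * len(arms)
    for _ in range(t):
        theta_hat = solve2(A, b)
        theta_s = [theta_hat[i] + rng.gauss(0.0, sigma / math.sqrt(A[i][i]))
                   for i in range(2)]
        i = max(range(len(arms)),
                key=lambda j: arms[j][0] * theta_s[0] + arms[j][1] * theta_s[1])
        z = arms[i]
        y = z[0] * theta_true[0] + z[1] * theta_true[1] + rng.gauss(0.0, sigma)
        b[0] += z[0] * y
        b[1] += z[1] * y
        for r in range(2):
            for c in range(2):
                A[r][c] += z[r] * z[c]
        picks[i] += 1
    return picks

# Hypothetical arm feature vectors z_i; arm 2 has the highest theta^T z (0.56).
arms = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
picks = linear_ts(theta_true=(0.3, 0.5), arms=arms, t=3000, seed=4)
```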
} Example:
◦ Chapelle et al. (2012)
} The features used:
◦ $\Omega_{t+1}$: set of sparse binary entries
◦ Concatenate categorical features of the user with features of all bandits (ad, headline)
◦ Use a hash function $h()$ that maps from the categorical space to a lower-dimensional $\mathbb{R}^m$ space
Response Prediction for Display Advertising
Algorithm for Display Advertising
Goal: maximize the number of clicks or conversions
Model: logistic regression
$$P[Y_t = 1 \mid x_t, z_{I_t}, \theta] = \frac{1}{1 + \exp(-\theta^T h(x_t, z_{I_t}))}$$
Response prediction is based on the training set $\Omega'_t = \{(x_k, z_{I_k}, y_k)\}_{k=1,\ldots,t}$:
$$\hat{\theta} = \arg\min_{\alpha \in \mathbb{R}^m} \frac{\lambda}{2} \|\alpha\|^2 + \sum_{k=1}^{t} \log\left(1 + \exp(-y_k \alpha^T h(x_k, z_{I_k}))\right)$$
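A minimal gradient-descent sketch of this MAP fit (the toy data and the helper name `fit_logistic_map` are invented for illustration; labels use the y ∈ {-1, +1} convention from the objective):

```python
import math

def fit_logistic_map(data, lam=0.1, lr=0.1, iters=2000):
    """Gradient descent on (lam/2)*||w||^2 + sum_k log(1 + exp(-y_k * w.x_k))."""
    m = len(data[0][0])
    w = [0.0] * m
    for _ in range(iters):
        grad = [lam * wi for wi in w]       # gradient of the L2 penalty
        for x, y in data:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            coef = -y / (1.0 + math.exp(margin))   # d/dw of log(1+exp(-y w.x))
            for i in range(m):
                grad[i] += coef * x[i]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Toy click data: each row is (h(x_k, z_{I_k}), label), with a bias feature
# and one context feature; clicks (+1) occur when the feature is large.
data = [((1.0, 0.9), 1), ((1.0, 0.8), 1), ((1.0, 0.7), 1),
        ((1.0, 0.1), -1), ((1.0, 0.2), -1), ((1.0, 0.3), -1)]
w = fit_logistic_map(data)
p_hi = 1.0 / (1.0 + math.exp(-(w[0] + 0.9 * w[1])))
p_lo = 1.0 / (1.0 + math.exp(-(w[0] + 0.1 * w[1])))
```

On this toy set the learned weight on the context feature is positive, so the predicted click probability is higher for large-feature contexts.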
Display Ads (cont)
If the prior is $\theta \sim N(0, \frac{1}{\lambda} I)$, the posterior $P[\theta \mid D]$ has no closed-form expression, but we can use the Laplace approximation of the integral,
$$P[\theta \mid D] \approx N(\hat{\theta}, \text{diag}(q_i)^{-1}),$$
where
$$q_i = \lambda + \sum_{j=1}^{t} w_{j,i}^2\, p_j (1-p_j), \quad \text{with } p_j = \left(1 + \exp(-\hat{\theta}^T h(x_j, z_{I_j}))\right)^{-1} \text{ and } w_j = h(x_j, z_{I_j}).$$
Using Thompson Sampling Algorithm for Ad Display
1. A new user arrives at time $t+1$
2. Form the set $\Omega_{t+1} = \{(x_{t+1}, z_i) : i \in \{1, \ldots, K\}\}$ of contexts corresponding to the different items that can be recommended to the user
3. Sample a vector from the current (approximate) posterior:
$$\tilde{\theta}_t \sim N(\hat{\theta}, \text{diag}(q_i)^{-1})$$
4. Choose the context $(x_{t+1}, z_i)$ that maximizes the probability of a positive response:
$$i = \arg\max_{i=1,\ldots,K} \frac{1}{1 + \exp(-\tilde{\theta}_t^T h(x_{t+1}, z_i))}$$
5. Recommend the item and get the response $Y_{t+1}$
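Steps 3-4 of the loop above can be sketched as one function (the posterior mean θ̂, the precisions q_i, and the two candidate contexts below are all hand-picked hypothetical values, chosen so the second feature strongly predicts clicks):

```python
import math
import random

def ts_choose_ad(theta_hat, q, contexts, rng):
    """One Thompson step: sample theta ~ N(theta_hat, diag(q)^-1), then pick
    the context with the highest predicted click probability."""
    theta_s = [mu + rng.gauss(0.0, 1.0 / math.sqrt(qi))
               for mu, qi in zip(theta_hat, q)]
    def p_click(ctx):
        s = sum(th * c for th, c in zip(theta_s, ctx))
        return 1.0 / (1.0 + math.exp(-s))
    return max(range(len(contexts)), key=lambda i: p_click(contexts[i]))

rng = random.Random(5)
theta_hat = [0.0, 2.0]                     # hypothetical Laplace posterior mean
q = [100.0, 100.0]                         # high precision -> little exploration noise
contexts = [(1.0, 0.1), (1.0, 0.9)]       # h(x_{t+1}, z_i) for two candidate ads
picks = [ts_choose_ad(theta_hat, q, contexts, rng) for _ in range(100)]
```

With a tight posterior the sampled θ barely varies, so the ad with the stronger predicted response is chosen essentially every time; with a diffuse posterior (small q_i) the same code would explore both ads.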