Multi-armed Bandit Problems with Dependent Arms
Sandeep Pandey ([email protected])
Deepayan Chakrabarti ([email protected])
Deepak Agarwal ([email protected])
Background: Bandits
Bandit “arms” with unknown reward probabilities μ1, μ2, μ3
Pull arms sequentially so as to maximize the total expected reward
• Show ads on a webpage to maximize clicks
• Product recommendation to maximize sales
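For concreteness, here is a minimal sketch (not from the talk) of one traditional bandit policy, UCB1, run against hypothetical Bernoulli arms with the reward probabilities used in the ad example later in the deck:

import math
import random

def ucb1(mu, T):
    """Run UCB1 for T rounds against Bernoulli arms with (unknown) means mu."""
    n = len(mu)
    pulls = [0] * n   # times each arm has been pulled
    succ = [0] * n    # successes observed on each arm
    total = 0
    for t in range(1, T + 1):
        if t <= n:
            i = t - 1  # initialization: pull each arm once
        else:
            # pull the arm maximizing empirical mean + exploration bonus
            i = max(range(n),
                    key=lambda a: succ[a] / pulls[a]
                                  + math.sqrt(2 * math.log(t) / pulls[a]))
        reward = 1 if random.random() < mu[i] else 0
        pulls[i] += 1
        succ[i] += reward
        total += reward
    return total

print(ucb1([0.3, 0.28, 1e-6], T=10000))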
Dependent Arms
Reward probabilities μi are generally assumed to be independent of each other.
What if they are dependent? E.g., ads on similar topics, using similar text/phrases, should have similar rewards:
“Skiing, snowboarding”: μ1 = 0.3
“Skiing, snowshoes”: μ2 = 0.28
“Get Vonage!”: μ3 = 10⁻⁶
“Snowshoe rental”: μ4 = 0.31
A click on one ad suggests that other “similar” ads may generate clicks as well.
Can we increase total reward using this dependency?
Cluster Model of Dependence
[Figure: Arms 1 and 2 form Cluster 1; Arms 3 and 4 form Cluster 2]
μi ~ f(π[i]), where f is some known distribution and π[i] is the (unknown) cluster-specific parameter of arm i’s cluster
Successes si ~ Bin(ni, μi), where ni = # pulls of arm i
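A sketch of this generative model; the choice of f as a Beta distribution with mean π is an assumption for illustration (the talk only requires f to be known):

import random

def sample_cluster_model(pi, arms_per_cluster, n_pulls, concentration=50.0):
    """mu_i ~ f(pi[c]) for each arm i in cluster c; successes s_i ~ Bin(n_i, mu_i)."""
    arms = []
    for c, p in enumerate(pi):  # pi[c] = unknown cluster-specific parameter
        for _ in range(arms_per_cluster):
            # assumed f: Beta with mean p; concentration controls cohesiveness
            mu = random.betavariate(concentration * p, concentration * (1 - p))
            s = sum(random.random() < mu for _ in range(n_pulls))  # Bin(n_i, mu_i)
            arms.append({"cluster": c, "mu": mu, "successes": s, "pulls": n_pulls})
    return arms

arms = sample_cluster_model(pi=[0.3, 0.05], arms_per_cluster=2, n_pulls=100)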
Cluster Model of Dependence
[Figure: Cluster 1 (Arms 1, 2) with μi ~ f(π1); Cluster 2 (Arms 3, 4) with μi ~ f(π2)]
Total reward:
Discounted: ∑_{t=0}^∞ αᵗ · E[R(t)], where α = discounting factor
Undiscounted: ∑_{t=0}^T E[R(t)]
Discounted Reward
[Figure: two per-cluster MDPs over belief states x; pulling Arm 1 transitions Cluster 1’s MDP, pulling Arm 3 transitions Cluster 2’s MDP]
The optimal policy can be computed using per-cluster MDPs only.
Optimal Policy:
• Compute an (“index”, arm) pair for each cluster
• Pick the cluster with the largest index, and pull the corresponding arm
• Reduces the problem to smaller state spaces
• Reduces to Gittins’ Theorem [1979] for independent bandits
• Approximation bounds on the index for k-step lookahead
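A minimal sketch of the k-step lookahead idea, simplified to a single Bernoulli arm under a Beta belief; the actual per-cluster index also exploits the dependence among arms in the cluster through the shared parameter π, which this simplification drops:

from functools import lru_cache

@lru_cache(maxsize=None)
def lookahead(a, b, k, alpha=0.9):
    """Expected discounted reward of pulling an arm with Beta(a, b) belief k more times."""
    if k == 0:
        return 0.0
    p = a / (a + b)  # posterior mean of the reward probability
    on_success = 1 + alpha * lookahead(a + 1, b, k - 1, alpha)  # reward now, belief updated
    on_failure = alpha * lookahead(a, b + 1, k - 1, alpha)
    return p * on_success + (1 - p) * on_failure

def cluster_index(beliefs, k=5):
    """Approximate index of a cluster: best k-step value over its arms' beliefs."""
    return max(lookahead(a, b, k) for (a, b) in beliefs)

# Pick the cluster with the largest index, then pull its best arm
print(cluster_index(beliefs=((3, 2), (1, 1)), k=5))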
Undiscounted Reward
[Figure: Arms 1, 2 grouped into “Cluster arm” 1; Arms 3, 4 grouped into “Cluster arm” 2]
All arms in a cluster are similar, so they can be grouped into one hypothetical “cluster arm”.
Undiscounted Reward
[Figure: same grouping of arms into “cluster arms”]
Two-Level Policy. In each iteration:
1. Pick a “cluster arm” using a traditional bandit policy
2. Pick an arm within that cluster using a traditional bandit policy
Each “cluster arm” must have some estimated reward probability (see the sketch below).
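A sketch of the Two-Level Policy with UCB1 as the traditional bandit policy at both levels, using the MEAN estimate (∑si / ∑ni, defined on the next slide) as each “cluster arm”’s reward probability; the clustered Bernoulli simulator is assumed for illustration:

import math
import random

def ucb_pick(succ, pulls, t):
    """UCB1 choice given success/pull counts for each option."""
    for i, n in enumerate(pulls):
        if n == 0:
            return i  # try every option once first
    return max(range(len(pulls)),
               key=lambda i: succ[i] / pulls[i]
                             + math.sqrt(2 * math.log(t) / pulls[i]))

def two_level(clusters, T):
    """clusters[c] = list of true reward probabilities (unknown to the policy)."""
    cs, cn = [0] * len(clusters), [0] * len(clusters)  # per-cluster totals => MEAN
    s = [[0] * len(c) for c in clusters]
    n = [[0] * len(c) for c in clusters]
    total = 0
    for t in range(1, T + 1):
        c = ucb_pick(cs, cn, t)      # level 1: pick a "cluster arm"
        i = ucb_pick(s[c], n[c], t)  # level 2: pick an arm within that cluster
        r = 1 if random.random() < clusters[c][i] else 0
        s[c][i] += r; n[c][i] += 1
        cs[c] += r; cn[c] += 1       # MEAN: r_c = sum(s_i) / sum(n_i)
        total += r
    return total

# Ski-themed ads cluster together; the unrelated ad sits alone
print(two_level([[0.3, 0.28, 0.31], [1e-6]], T=10000))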
Issues
What is the reward probability of a “cluster arm”?
How do cluster characteristics affect performance?
Reward probability of a “cluster arm”
What is the reward probability r of a “cluster arm”?
MEAN: r = ∑si / ∑ni, i.e., the average success rate, summing over all arms in the cluster [Kocsis+/2006, Pandey+/2007]
Initially, r = μavg = average μ of the arms in the cluster
Finally, r = μmax = max μ among the arms in the cluster
⇒ “Drift” in the reward probability of the “cluster arm”
Reward probability drift causes problems
Drift ⇒ non-optimal clusters might temporarily look better, so the optimal arm is explored only O(log T) times
[Figure: Cluster 1 (Arms 1, 2) and Cluster 2 (Arms 3, 4); the best (optimal) arm, with reward probability μopt, lies in the “opt cluster”]
Reward probability of a “cluster arm”
What is the reward probability r of a “cluster arm”?
MEAN: r = ∑si / ∑ni
MAX: r = maxᵢ E[μi], over all arms i in the cluster
PMAX: r = E[maxᵢ μi], over all arms i in the cluster
Both MAX and PMAX aim to estimate μmax and thus reduce drift (sketched below).
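A sketch of the three estimators, assuming a Beta(1 + si, 1 + ni − si) posterior for each arm (uniform prior; the talk does not commit to a specific posterior) and Monte Carlo sampling for PMAX:

import random

def mean_est(s, n):
    """MEAN: pooled success rate across the cluster's arms."""
    return sum(s) / max(1, sum(n))

def max_est(s, n):
    """MAX: largest posterior mean E[mu_i] among the arms."""
    return max((1 + si) / (2 + ni) for si, ni in zip(s, n))

def pmax_est(s, n, samples=1000):
    """PMAX: E[max_i mu_i], estimated by sampling each arm's Beta posterior."""
    total = 0.0
    for _ in range(samples):
        total += max(random.betavariate(1 + si, 1 + ni - si)
                     for si, ni in zip(s, n))
    return total / samples

s, n = [3, 10], [20, 40]  # successes and pulls for two arms in one cluster
print(mean_est(s, n), max_est(s, n), pmax_est(s, n))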
Reward probability of a “cluster arm”
MEAN: r = ∑si / ∑ni
MAX: r = maxᵢ E[μi]
PMAX: r = E[maxᵢ μi]
Both MAX and PMAX aim to estimate μmax and thus reduce drift.

         Bias in estimating μmax | Variance of estimator
MAX      High                    | Low
PMAX     Unbiased                | High
Comparison of schemes
[Plot: total reward of MEAN, MAX, PMAX over time; 10 clusters, 11.3 arms/cluster on average. MAX performs best.]
Effects of cluster characteristics
We analytically study the effects of cluster characteristics on the “crossover time”.
Crossover time Tc: the time when the expected reward probability of the optimal cluster becomes the highest among all “cluster arms”.
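The definition suggests a direct way to measure Tc empirically; a sketch, assuming a simulation logs every “cluster arm” estimate at every step:

def crossover_time(estimates, opt):
    """estimates[t][c] = estimated reward probability of cluster arm c at time t.
    Returns the first time after which the optimal cluster's estimate stays highest,
    or None if it never does."""
    Tc = None
    for t, row in enumerate(estimates):
        if max(range(len(row)), key=row.__getitem__) == opt:
            if Tc is None:
                Tc = t        # optimal cluster takes the lead
        else:
            Tc = None         # it lost the lead; reset the candidate
    return Tc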
Effects of cluster characteristics
Crossover time Tc for MEAN depends on:
• Cluster separation Δ = μopt − μmax outside the opt cluster: Δ increases ⇒ Tc decreases
• Cluster size Aopt: Aopt increases ⇒ Tc increases
• Cohesiveness of the opt cluster, 1 − avg(μopt − μi): cohesiveness increases ⇒ Tc decreases
Experiments (effect of separation)
[Plot: Δ increases ⇒ Tc decreases ⇒ higher reward]

Experiments (effect of size)
[Plot: Aopt increases ⇒ Tc increases ⇒ lower reward]

Experiments (effect of cohesiveness)
[Plot: cohesiveness increases ⇒ Tc decreases ⇒ higher reward]
Related Work
• Typical multi-armed bandit problems: do not consider dependencies; very few arms
• Bandits with side information: cannot handle dependencies among arms
• Active learning: emphasis on the number of examples required to achieve a given prediction accuracy
Conclusions
We analyze bandits where dependencies are encapsulated within clusters.
• Discounted reward: the optimal policy is an index scheme on the clusters
• Undiscounted reward: a Two-Level Policy with MEAN, MAX, and PMAX, plus an analysis of the effect of cluster characteristics on performance for MEAN
Discounted Reward
[Figure: belief-state MDP. Each state holds the estimated reward probabilities (x1, x2, x3, x4) of arms 1–4; from a state there is a “Pull Arm i” action for each arm, branching on success/failure. Pulling Arm 1 changes the belief for both arms 1 and 2, since they share a cluster]
• Create a belief-state MDP
• Each state contains the estimated reward probabilities for all arms
• Solve for the optimal policy
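A minimal sketch of this construction for Bernoulli arms: a state is a tuple of per-arm Beta beliefs, and the optimal discounted value recurses over success/failure transitions. Finite lookahead stands in for an exact infinite-horizon solution, and this sketch updates only the pulled arm's belief, whereas in the talk a pull also shifts the beliefs of the other arms in the same cluster:

from functools import lru_cache

@lru_cache(maxsize=None)
def V(state, depth, alpha=0.9):
    """Optimal discounted value of belief state = ((a1, b1), (a2, b2), ...)."""
    if depth == 0:
        return 0.0
    best = 0.0
    for i, (a, b) in enumerate(state):
        p = a / (a + b)  # predictive probability of success for arm i
        succ = state[:i] + ((a + 1, b),) + state[i + 1:]  # belief after a success
        fail = state[:i] + ((a, b + 1),) + state[i + 1:]  # belief after a failure
        value = (p * (1 + alpha * V(succ, depth - 1, alpha))
                 + (1 - p) * alpha * V(fail, depth - 1, alpha))
        best = max(best, value)
    return best

# Four arms with uniform Beta(1, 1) beliefs, 6-step lookahead
print(V(((1, 1),) * 4, depth=6))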
Background: Bandits
Bandit “arms” with unknown payoff probabilities p1, p2, p3
Regret = optimal payoff − actual payoff
Reward probability of a “cluster arm”
What is the reward probability of a “cluster arm”?
Eventually, every “cluster arm” must converge to the most rewarding arm (μmax) within that cluster, since a bandit policy is used within each cluster. However, “drift” causes problems.
Experiments
Simulation based on one week’s worth of data from a large-scale ad-matching application
10 clusters, with 11.3 arms/cluster on average
Comparison of schemes
[Plot: 10 clusters, 11.3 arms/cluster; cluster separation Δ = 0.08; cluster size Aopt = 31; cohesiveness = 0.75. MAX performs best.]
Reward probability drift causes problems
Intuitively, to reduce regret, we must quickly converge to the optimal “cluster arm”, and then to the best arm within that cluster.
[Figure: Cluster 1 (Arms 1, 2) and Cluster 2 (Arms 3, 4); the best (optimal) arm, with reward probability μopt, lies in the “opt cluster”]