Bandit algorithms

Post on 11-Apr-2017

Multi Armed Bandit Algorithms

By Shrinivas Vasala

Overview

- K Slot Machine
- Multi Armed Bandit Problems
- A/B Testing
- MAB Algorithms
- Summary

K Slot Machines

- Choose a machine and receive a reward
- T turns (chances)
- What will be your goal?
- Maximize the cumulative rewards
- How do you choose the machines (arms)?
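
The K-slot-machine setup above can be simulated in a few lines. This is a minimal sketch, assuming each machine pays 1 with a fixed hidden probability and 0 otherwise; the class name `BernoulliBandit`, the helper `play_uniformly`, and the example probabilities are hypothetical and not from the slides.

```python
import random


class BernoulliBandit:
    """K slot machines; arm i pays 1 with probability means[i], else 0."""

    def __init__(self, means, seed=0):
        self.means = list(means)        # true mean rewards, hidden from the player
        self.rng = random.Random(seed)  # seeded so runs are reproducible

    def pull(self, arm):
        """Play one arm and return its reward (0 or 1)."""
        return 1 if self.rng.random() < self.means[arm] else 0


def play_uniformly(bandit, turns):
    """Baseline strategy: pick a machine uniformly at random on every turn."""
    total = 0
    for _ in range(turns):
        arm = bandit.rng.randrange(len(bandit.means))
        total += bandit.pull(arm)
    return total


bandit = BernoulliBandit([0.2, 0.5, 0.8])  # assumed payout probabilities
print(play_uniformly(bandit, turns=1000))  # cumulative reward over T = 1000 turns
```

A bandit algorithm should collect more cumulative reward than this uniform baseline by learning which machine pays best.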

Multi Armed Bandit Problem (MAB)

- Goal: two-fold
- Try different arms (Exploration)
- Play the seemingly most rewarding arm (Exploitation)
- Explore – Exploit trade-off
- Multi Armed Bandit Algorithms
- Reward distribution (unknown)
- Mean reward: <µ1, . . . , µK>
- Standard deviation of reward: <σ1, . . . , σK>
- Regret: the reward lost by not always playing the best arm
- Maximize cumulative rewards = minimize regret
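
One common way to write the regret, given here as a sketch rather than the slide's own formula: the symbols µ* (the best mean reward) and a_t (the arm played on turn t) are notation assumed for illustration, while µ1, . . . , µK and T come from the slides.

```latex
\mu^{*} = \max_{1 \le i \le K} \mu_i,
\qquad
R_T = T\,\mu^{*} - \sum_{t=1}^{T} \mu_{a_t}
```

Because the term T µ* does not depend on the player's choices, minimizing R_T is the same as maximizing the expected cumulative reward.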

A/B Testing

- Advertisement selection for a request from a pool of advertisements
- Rewards: CTR/AR or CPM
- Recommendation of news articles to users
- Product pricing and promotional offers
- MAB is used to measure the performance of A/B Testing experiments

MAB Algorithms

- Epsilon-greedy
- Softmax
- Pursuit
- Upper Confidence Bound (UCB1)
- UCB1-Tuned

Epsilon-greedy Algorithm

- Choose epsilon (Ɛ): exploration factor
- Play the best arm with probability (1 – Ɛ): Exploitation
- Play a random arm with probability Ɛ: Exploration

Note: a typical value of Ɛ is 0.10 (10%)
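
The epsilon-greedy steps can be sketched as follows. This is an illustrative implementation, not the author's code: the function name `epsilon_greedy`, the `pull` callback, and the simulated reward probabilities are assumptions, and the "best arm" is taken to be the one with the highest average reward so far.

```python
import random


def epsilon_greedy(pull, n_arms, turns, epsilon=0.10, seed=0):
    """Run epsilon-greedy; pull(arm) must return the reward for that arm."""
    rng = random.Random(seed)
    counts = [0] * n_arms    # how often each arm was played
    values = [0.0] * n_arms  # running average reward of each arm
    total = 0.0
    for _ in range(turns):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # exploration
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploitation
        reward = pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
        total += reward
    return counts, values, total


# Usage on a simulated 3-armed bandit (payout probabilities are made up).
env_rng = random.Random(1)
means = [0.2, 0.5, 0.8]


def pull(arm):
    return 1 if env_rng.random() < means[arm] else 0


counts, values, total = epsilon_greedy(pull, n_arms=3, turns=5000)
print(counts, [round(v, 2) for v in values], total)
```

With a fixed Ɛ the algorithm keeps exploring at the same rate forever, which is why the choice of Ɛ matters in practice.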

Softmax Algorithm
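
A common form of softmax (Boltzmann) selection picks each arm with probability proportional to exp(average reward / temperature). The sketch below assumes that form; the function names, the `temperature` parameter and its default value, and the `pull` callback are illustrative choices, not taken from the slides.

```python
import math
import random


def softmax_probabilities(values, temperature):
    """Turn average rewards into selection probabilities."""
    top = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - top) / temperature) for v in values]
    norm = sum(weights)
    return [w / norm for w in weights]


def softmax_bandit(pull, n_arms, turns, temperature=0.1, seed=0):
    """Run softmax selection; pull(arm) must return the reward for that arm."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running average reward of each arm
    for _ in range(turns):
        probs = softmax_probabilities(values, temperature)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


# A high temperature makes the choice nearly uniform (more exploration);
# a low temperature concentrates on the best-looking arm (more exploitation).
print(softmax_probabilities([0.2, 0.5, 0.8], temperature=1.0))
print(softmax_probabilities([0.2, 0.5, 0.8], temperature=0.05))
```

Here the temperature plays the role of the tuning parameter, in the same way Ɛ does for epsilon-greedy.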

Pursuit Algorithm

- Keeps a selection probability for each arm, with an exploration part and an exploitation part
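
A standard pursuit scheme keeps one selection probability per arm and, after every play, moves the probability of the currently best-looking arm toward 1 and all others toward 0. The sketch below assumes that scheme; the learning rate `beta`, the function names, and the `pull` callback are hypothetical.

```python
import random


def pursuit_update(probs, greedy, beta):
    """Shift probability mass toward the greedy arm; the sum stays 1."""
    return [
        p + beta * ((1.0 if arm == greedy else 0.0) - p)
        for arm, p in enumerate(probs)
    ]


def pursuit_bandit(pull, n_arms, turns, beta=0.01, seed=0):
    """Run pursuit; pull(arm) must return the reward for that arm."""
    rng = random.Random(seed)
    probs = [1.0 / n_arms] * n_arms  # start with a uniform policy
    counts = [0] * n_arms
    values = [0.0] * n_arms          # running average reward of each arm
    for _ in range(turns):
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        greedy = max(range(n_arms), key=lambda a: values[a])
        probs = pursuit_update(probs, greedy, beta)
    return probs, counts


print(pursuit_update([0.5, 0.5], greedy=1, beta=0.1))
```

A larger `beta` commits to the leading arm faster, at the risk of locking onto an arm that only looked good early on.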

Upper Confidence Bound 1 (UCB1)

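- At each iteration, choose the arm with the maximum score
- The score is the sum of an exploitation term (the arm's average reward so far) and an exploration term (a bonus that is larger for arms played less often)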
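
The UCB1 score as it is usually stated is the arm's average reward plus sqrt(2 ln t / n), where t is the total number of plays and n is the number of plays of that arm. The sketch below assumes that standard form and rewards in [0, 1]; the function names and the `pull` callback are illustrative.

```python
import math


def ucb1_score(mean, n_arm, t):
    """Exploitation term (mean) plus exploration bonus."""
    return mean + math.sqrt(2.0 * math.log(t) / n_arm)


def ucb1(pull, n_arms, turns):
    """Run UCB1; pull(arm) must return a reward in [0, 1]."""
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running average reward of each arm
    for t in range(1, turns + 1):
        if t <= n_arms:
            arm = t - 1      # play every arm once to initialize
        else:
            arm = max(
                range(n_arms),
                key=lambda a: ucb1_score(values[a], counts[a], t),
            )
        reward = pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


# Deterministic toy rewards, made up for illustration.
counts, values = ucb1(lambda a: [0.2, 0.5, 0.8][a], n_arms=3, turns=1000)
print(counts)
```

The bonus shrinks as an arm is played more, so arms that look worse are still revisited occasionally without any random choice being involved.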

UCB1- Tuned

- The score is again the sum of an exploitation term and an exploration term
- The exploration term takes the variance of the reward into account
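
In its usual formulation, UCB1-Tuned replaces the constant in the UCB1 bonus with min(1/4, V), where V is an upper estimate of the arm's reward variance. The sketch below assumes that formulation and rewards in [0, 1]; the function names and the `pull` callback are hypothetical.

```python
import math


def ucb1_tuned_score(mean, mean_sq, n_arm, t):
    """Mean reward plus a variance-aware exploration bonus."""
    variance = max(0.0, mean_sq - mean * mean)  # guard against rounding
    v = variance + math.sqrt(2.0 * math.log(t) / n_arm)
    return mean + math.sqrt((math.log(t) / n_arm) * min(0.25, v))


def ucb1_tuned(pull, n_arms, turns):
    """Run UCB1-Tuned; pull(arm) must return a reward in [0, 1]."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms     # sum of rewards per arm
    sums_sq = [0.0] * n_arms  # sum of squared rewards per arm
    for t in range(1, turns + 1):
        if t <= n_arms:
            arm = t - 1       # play every arm once to initialize
        else:
            arm = max(
                range(n_arms),
                key=lambda a: ucb1_tuned_score(
                    sums[a] / counts[a], sums_sq[a] / counts[a], counts[a], t
                ),
            )
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        sums_sq[arm] += reward * reward
    return counts


# Deterministic toy rewards, made up for illustration.
print(ucb1_tuned(lambda a: [0.2, 0.5, 0.8][a], n_arms=3, turns=1000))
```

The cap of 1/4 is the largest variance a reward in [0, 1] can have, so low-variance arms receive a smaller bonus than under plain UCB1.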

Advanced Bandits

- Adversarial Bandits
- Contextual Bandits
- Infinite Armed Bandits
- Thompson Sampling Bandits

Summary

- Each algorithm has an upper bound on regret
- The bound is a function of the average reward distribution
- Each algorithm has a tuning parameter
- Parameter tuning is a function of the reward function
- Choose the right MAB algorithm based on simulations/historical data
- All these algorithms keep learning automatically over their lifetime

Thank You