CS 4803 / 7643: Deep Learning
Zsolt Kira, Georgia Tech
Topics: – (Deep) Reinforcement Learning
Administrativia
• Projects!
• Project Check-in due April 11th
– Will be graded pass/fail; if fail, you can address the issues
– Counts for 5 points of the project score
• Poster due date moved to April 23rd (last day of class)
– No presentations
• Final submission due date: April 30th
Project Check-in/Progress Report
Administrativia
• Send me requests on Piazza for topics to cover after reinforcement learning
Types of Learning
• Supervised learning
– Learning from a “teacher”
– Training data includes desired outputs
• Unsupervised learning
– Discover structure in data
– Training data does not include desired outputs
• Reinforcement learning
– Learning to act under evaluative feedback (rewards)
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
What is Reinforcement Learning?
Agent-oriented learning: learning by interacting with an environment to achieve a goal
○ more realistic and ambitious than other kinds of machine learning
Learning by trial and error, with only delayed evaluative feedback (reward)
○ the kind of machine learning most like natural learning
○ learning that can tell for itself when it is right or wrong
Slide Credit: Rich Sutton
David Silver 2015
Example: Hajime Kimura’s RL Robots
[Video stills: Before and After learning; backward walking; new robot, same algorithm]
Slide Credit: Rich Sutton
● Environment may be unknown, nonlinear, stochastic and complex
● Agent learns a policy mapping states to actions
○ Seeking to maximize its cumulative reward in the long run
[Diagram: the agent sends an action (response, control) to the environment (world); the environment returns a state (stimulus, situation) and a reward (gain, payoff, cost)]
Slide Credit: Rich Sutton
RL API
Slide Credit: David Silver
Signature challenges of RL
• Evaluative feedback (reward)
• Sequentiality, delayed consequences
• Need for trial and error, to explore as well as exploit
• Non-stationarity
• The fleeting nature of time and online data
Slide Credit: Rich Sutton
Robot Locomotion
Objective: Make the robot move forward
State: Angle and position of the joints
Action: Torques applied on joints
Reward: 1 at each time step upright + forward movement
Figures copyright John Schulman et al., 2016. Reproduced with permission.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Atari Games
Objective: Complete the game with the highest score
State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step
Figures copyright Volodymyr Mnih et al., 2013. Reproduced with permission.
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Go
Objective: Win the game!
State: Position of all pieces
Action: Where to put the next piece down
Reward: 1 if win at the end of the game, 0 otherwise
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Demo
• http://projects.rajivshah.com/rldemo/
• https://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html
Markov Decision Process
- Mathematical formulation of the RL problem
- Life is a trajectory: s0, a0, r0, s1, a1, r1, …
Defined by (S, A, R, P, γ):
- S: set of possible states
- A: set of possible actions
- R: distribution of reward given (state, action) pair
- P: transition probability, i.e. distribution over next state given (state, action) pair
- γ: discount factor
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
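To make the tuple concrete, here is a minimal sketch of how an MDP can be encoded as plain Python data; the two-state example and its numbers are made up for illustration, not from the slides.

```python
# Hypothetical two-state MDP, encoded as plain Python data.
S = ["s0", "s1"]                      # states
A = ["stay", "go"]                    # actions
# P[(s, a)] lists (next_state, probability) pairs.
P = {
    ("s0", "stay"): [("s0", 1.0)],
    ("s0", "go"):   [("s0", 0.2), ("s1", 0.8)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "go"):   [("s0", 1.0)],
}
# R[(s, a)] is the expected reward for taking action a in state s.
R = {("s0", "stay"): 0.0, ("s0", "go"): -1.0,
     ("s1", "stay"): 1.0, ("s1", "go"): -1.0}
gamma = 0.9                           # discount factor
```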
What is Markov about MDPs?
• “Markov” generally means that given the present state, the future and the past are independent
• For Markov decision processes, “Markov” means action outcomes depend only on the current state
Andrey Markov (1856‐1922)
Slide Credit: http://ai.berkeley.edu
Example: Grid World
• The agent receives rewards each time step
– Small “living” reward each step (can be negative)
– Big rewards come at the end (good or bad)
• This is now our supervision! Different from supervised learning:
– Rewards can be sparse: how can we assign credit/blame to actions far into the past that caused our large positive/negative reward?
– The world is uncertain
• If we have a model of it, how can we solve it? If we don’t, how can we solve it? Should we build a model of it?
Slide Credit: http://ai.berkeley.edu
Example: Grid World
• A maze-like problem
– The agent lives in a grid
– Walls block the agent’s path
• Noisy movement: actions do not always go as planned
– 80% of the time, the action North takes the agent North (if there is no wall there)
– 10% of the time, North takes the agent West; 10% East
– If there is a wall in the direction the agent would have been taken, the agent stays put
• The agent receives rewards each time step
– Small “living” reward each step (can be negative)
– Big rewards come at the end (good or bad)
• Goal: maximize sum of rewards
Slide Credit: http://ai.berkeley.edu
Grid World Actions
Deterministic Grid World Stochastic Grid World
Slide Credit: http://ai.berkeley.edu
Solving MDPs
Assume for now:
• We know the transition function
• We have a state space defined
Slide Credit: http://ai.berkeley.edu
Components of an RL Agent
• Value function
– How good is each state and/or state-action pair?
• Policy
– How does an agent behave?
• Model
– Agent’s representation of the environment
Optimal Quantities
The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally
The value (utility) of a q-state (s, a):
Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally
The optimal policy:
π*(s) = optimal action from state s
[Diagram: s is a state; (s, a) is a q-state; (s, a, s’) is a transition]
Slide Credit: http://ai.berkeley.edu
Policies
Optimal policy when R(s, a, s’) = -0.03 for all non-terminals s
• In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from start to a goal
• For MDPs, we want an optimal policy π*: S → A
– A policy π gives an action for each state
– An optimal policy is one that maximizes expected utility if followed
– An explicit policy defines a reflex agent
Slide Credit: http://ai.berkeley.edu
What’s a good policy?
Maximizes current reward? Sum of all future rewards?
Discounted future rewards!
The optimal policy π*
Utilities of Sequences
Slide Credit: http://ai.berkeley.edu
Utilities of Sequences
• What preferences should an agent have over reward sequences?
• More or less? [1, 2, 2] or [2, 3, 4]
• Now or later? [0, 0, 1] or [1, 0, 0]
Slide Credit: http://ai.berkeley.edu
Discounting
• It’s reasonable to maximize the sum of rewards
• It’s also reasonable to prefer rewards now to rewards later
• One solution: values of rewards decay exponentially
[Figure: a reward is worth 1 now, γ one step from now, γ² two steps from now]
Slide Credit: http://ai.berkeley.edu
Discounting
• How to discount?
– Each time we descend a level, we multiply in the discount once
• Why discount?
– Sooner rewards probably do have higher utility than later rewards
– Also helps our algorithms converge
• Example: discount of 0.5
– U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3
– U([1,2,3]) < U([3,2,1])
Slide Credit: http://ai.berkeley.edu
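As a quick check of the arithmetic on this slide, a one-line discounted-utility helper (a sketch; the function name is ours):

```python
def discounted_utility(rewards, gamma):
    """Sum rewards, multiplying in the discount once per time step."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

assert discounted_utility([1, 2, 3], 0.5) == 1 * 1 + 0.5 * 2 + 0.25 * 3  # 2.75
assert discounted_utility([1, 2, 3], 0.5) < discounted_utility([3, 2, 1], 0.5)
```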
What’s a good policy?
Maximizes current reward? Sum of all future rewards?
Discounted future rewards!
Formally: π* = argmax_π E[ Σ_{t≥0} γ^t r_t | π ]
with s0 ~ p(s0), a_t ~ π(·|s_t), s_{t+1} ~ p(·|s_t, a_t)
The optimal policy π*
Optimal Policies
[Figure: optimal policies for different rewards R(s) at every non-terminal state (living reward/penalty): R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, R(s) = -2.0]
Slide Credit: http://ai.berkeley.edu
Components of an RL Agent
• Value function
– How good is each state and/or state-action pair?
• Policy
– How does an agent behave?
• Model
– Agent’s representation of the environment
Value Function
• A value function is a prediction of future reward
• “State Value Function” or simply “Value Function”
– How good is a state?
– Am I in trouble? Am I winning this game?
• “Action Value Function” or Q-function
– How good is a state-action pair?
– Should I do this now?
Values of States
• Fundamental operation: compute the value of a state
– Expected utility under optimal action
– Average sum of (discounted) rewards
• Recursive definition of value (the Bellman Equation):
V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
V*(s) = max_a Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
Slide Credit: http://ai.berkeley.edu
Snapshot of Demo – Gridworld V Values
Noise = 0.2
Discount = 0.9
Living reward = 0
Slide Credit: http://ai.berkeley.edu
Snapshot of Demo – Gridworld Q Values
Noise = 0.2
Discount = 0.9
Living reward = 0
Slide Credit: http://ai.berkeley.edu
Definitions: Value function and Q-value function
Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …
How good is a state? The value function at state s is the expected cumulative reward from state s (and following the policy thereafter):
V^π(s) = E[ Σ_{t≥0} γ^t r_t | s0 = s, π ]
How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s (and following the policy thereafter):
Q^π(s,a) = E[ Σ_{t≥0} γ^t r_t | s0 = s, a0 = a, π ]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Components of an RL Agent
• Value function
– How good is each state and/or state-action pair?
• Policy
– How does an agent behave?
• Model
– Agent’s representation of the environment
Model
• Model predicts what the world will do next
Slide Credit: David Silver
Components of an RL Agent
• Value function
– How good is each state and/or state-action pair?
• Policy
– How does an agent behave?
• Model
– Agent’s representation of the environment
Solving MDPs
Assume for now:
• We know the transition function
• We have a state space defined
Slide Credit: http://ai.berkeley.edu
Value Iteration
• Start with V0(s) = 0: no time steps left means an expected reward sum of zero
• Given vector of Vk(s) values, do one update from each state:
Vk+1(s) ← max_a Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]
• Repeat until convergence
• Complexity of each iteration: O(S²A)
• Theorem: will converge to unique optimal values
– Basic idea: approximations get refined towards optimal values
– Policy may converge long before values do
Slide Credit: http://ai.berkeley.edu
Example: Racing
• A robot car wants to travel far, quickly
• Three states: Cool, Warm, Overheated
• Two actions: Slow, Fast
• Going faster gets double reward
Transitions and rewards:
– Cool, Slow → Cool (prob 1.0), reward +1
– Cool, Fast → Cool (0.5) or Warm (0.5), reward +2
– Warm, Slow → Cool (0.5) or Warm (0.5), reward +1
– Warm, Fast → Overheated (1.0), reward -10
– Overheated is terminal
Slide Credit: http://ai.berkeley.edu
Racing Search Tree
Slide Credit: http://ai.berkeley.edu
Example: Value Iteration
(columns: Cool, Warm, Overheated; assume no discount!)
V0: 0 0 0
V1: 2 1 0
V2: 3.5 2.5 0
Slide Credit: http://ai.berkeley.edu
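A minimal sketch of value iteration on the racing MDP above (the transition encoding is ours); with no discount it reproduces the table on this slide, V1 = (2, 1, 0) and V2 = (3.5, 2.5, 0):

```python
# T[(s, a)] lists (next_state, probability, reward) triples for the racing MDP.
T = {
    ("cool", "slow"): [("cool", 1.0, 1)],
    ("cool", "fast"): [("cool", 0.5, 2), ("warm", 0.5, 2)],
    ("warm", "slow"): [("cool", 0.5, 1), ("warm", 0.5, 1)],
    ("warm", "fast"): [("overheated", 1.0, -10)],
}
states = ["cool", "warm", "overheated"]
actions = {"cool": ["slow", "fast"], "warm": ["slow", "fast"], "overheated": []}
gamma = 1.0  # "Assume no discount!"

V = {s: 0.0 for s in states}  # V0(s) = 0
for k in range(2):
    # Vk+1(s) = max_a sum_s' T(s,a,s') [R(s,a,s') + gamma Vk(s')]
    V = {s: max((sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)])
                 for a in actions[s]), default=0.0)
         for s in states}
    print("V%d =" % (k + 1), V)  # V1: cool 2.0, warm 1.0; V2: cool 3.5, warm 2.5
```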
Computing Actions from Values
• Let’s imagine we have the optimal values V*(s)
• How should we act?
– It’s not obvious!
• We need to do a one-step calculation:
π*(s) = argmax_a Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V*(s’) ]
• This is called policy extraction, since it gets the policy implied by the values
Slide Credit: http://ai.berkeley.edu
Computing Actions from Q-Values
• Let’s imagine we have the optimal q-values Q*(s,a)
• How should we act?
– Completely trivial to decide: π*(s) = argmax_a Q*(s,a)
• Important lesson: actions are easier to select from q-values than values!
Slide Credit: http://ai.berkeley.edu
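A sketch contrasting the two extraction rules, reusing the `T`/`actions` encoding from the value-iteration sketch above (helper names are ours): extracting a policy from values needs the model for a one-step look-ahead, while extracting it from q-values is a bare argmax.

```python
def policy_from_values(V, T, actions, gamma):
    # pi*(s) = argmax_a sum_s' T(s,a,s') [R(s,a,s') + gamma V*(s')]
    return {s: max(acts, key=lambda a: sum(p * (r + gamma * V[s2])
                                           for s2, p, r in T[(s, a)]))
            for s, acts in actions.items() if acts}

def policy_from_q_values(Q, actions):
    # pi*(s) = argmax_a Q*(s, a) -- no model needed
    return {s: max(acts, key=lambda a: Q[(s, a)])
            for s, acts in actions.items() if acts}
```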
Reinforcement Learning
• Still assume a Markov decision process (MDP):
– A set of states s ∈ S
– A set of actions (per state) A
– A model T(s,a,s’)
– A reward function R(s,a,s’)
• Still looking for a policy π(s)
• New twist: don’t know T or R
– I.e. we don’t know which states are good or what the actions do
– Must actually try actions and states out to learn
Slide Credit: http://ai.berkeley.edu
Model-Based Learning
• Model-Based Idea:
– Learn an approximate model based on experiences
– Solve for values as if the learned model were correct
• Step 1: Learn empirical MDP model
– Count outcomes s’ for each s, a
– Normalize to give an estimate of T̂(s,a,s’)
– Discover each R̂(s,a,s’) when we experience (s, a, s’)
• Step 2: Solve the learned MDP
– For example, use value iteration, as before
Slide Credit: http://ai.berkeley.edu
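A minimal sketch of Step 1, estimating T̂ and R̂ by counting (the interface is ours; Step 2 would run value iteration on these estimates):

```python
from collections import Counter, defaultdict

counts = defaultdict(Counter)   # counts[(s, a)][s'] = times s' followed (s, a)
reward_sum = Counter()          # summed rewards observed for (s, a, s')

def record(s, a, r, s2):
    counts[(s, a)][s2] += 1
    reward_sum[(s, a, s2)] += r

def T_hat(s, a, s2):            # normalize counts into probabilities
    return counts[(s, a)][s2] / sum(counts[(s, a)].values())

def R_hat(s, a, s2):            # average reward seen for this transition
    return reward_sum[(s, a, s2)] / counts[(s, a)][s2]
```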
The Story So Far: MDPs and RL
Known MDP: Offline Solution
– Compute V*, Q*, π*: value / policy iteration
– Evaluate a fixed policy π: policy evaluation
Unknown MDP: Model-Based
– Compute V*, Q*, π*: VI/PI on approx. MDP
– Evaluate a fixed policy π: PE on approx. MDP
Unknown MDP: Model-Free
– Compute V*, Q*, π*: Q-learning
– Evaluate a fixed policy π: value learning
Slide Credit: http://ai.berkeley.edu
Model-Free Learning
Slide Credit: http://ai.berkeley.edu
Example: Expected Age
Goal: Compute expected age of students
Known P(A): E[A] = Σ_a P(a) · a
Unknown P(A), “Model Based”: from samples [a1, a2, … aN], estimate P̂(a) = num(a)/N, then E[A] ≈ Σ_a P̂(a) · a
– Why does this work? Because eventually you learn the right model.
Unknown P(A), “Model Free”: without P(A), instead collect samples [a1, a2, … aN] and average: E[A] ≈ (1/N) Σ_i ai
– Why does this work? Because samples appear with the right frequencies.
Slide Credit: http://ai.berkeley.edu
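A sketch of the two estimators side by side (the age distribution is made up, not from the slide); both converge to the exact expectation as N grows:

```python
import random

ages, probs = [20, 21, 22], [0.35, 0.35, 0.30]           # assumed P(A)
samples = random.choices(ages, weights=probs, k=10_000)  # [a1, ..., aN]

exact = sum(p * a for p, a in zip(probs, ages))           # known P(A)
model_based = sum((samples.count(a) / len(samples)) * a   # learn P-hat(a),
                  for a in set(samples))                  # then take expectation
model_free = sum(samples) / len(samples)                  # just average the samples
```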
Reinforcement Learning
• We still assume an MDP:
– A set of states s ∈ S
– A set of actions (per state) A
– A model T(s,a,s’)
– A reward function R(s,a,s’)
• Still looking for a policy π(s)
• New twist: don’t know T or R, so must try out actions
• Big idea: Compute all averages over T using sample outcomes
Slide Credit: http://ai.berkeley.edu
Model-Free Learning
• Model-free (temporal difference) learning
– Experience world through episodes: s, a, r, s’, a’, r’, s’’, …
– Update estimates each transition (s, a, r, s’)
– Over time, updates will mimic Bellman updates
Slide Credit: http://ai.berkeley.edu
Passive Reinforcement Learning
• Simplified task: policy evaluation
– Input: a fixed policy π(s)
– You don’t know the transitions T(s,a,s’)
– You don’t know the rewards R(s,a,s’)
– Goal: learn the state values
• In this case:
– Learner is “along for the ride”
– No choice about what actions to take
– Just execute the policy and learn from experience
– This is NOT offline planning! You actually take actions in the world.
Slide Credit: http://ai.berkeley.edu
Direct Evaluation
• Goal: Compute values for each state under π
• Idea: Average together observed sample values
– Act according to π
– Every time you visit a state, write down what the sum of discounted rewards turned out to be
– Average those samples
• This is called direct evaluation
Slide Credit: http://ai.berkeley.edu
Example: Direct Evaluation
Input Policy π (assume γ = 1)
Grid: A; B C D; E
Observed Episodes (Training):
Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10
Output Values: A = -10, B = +8, C = +4, D = +10, E = -2
Slide Credit: http://ai.berkeley.edu
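A sketch of direct evaluation on these four episodes (the encoding is ours): accumulate the return-to-go from each visited state and average; it reproduces the output values above.

```python
# Each episode is a list of (state, reward received on leaving it) pairs.
episodes = [
    [("B", -1), ("C", -1), ("D", +10)],
    [("B", -1), ("C", -1), ("D", +10)],
    [("E", -1), ("C", -1), ("D", +10)],
    [("E", -1), ("C", -1), ("A", -10)],
]
returns = {}  # state -> observed sums of (undiscounted, gamma = 1) rewards
for ep in episodes:
    g = 0.0
    for state, r in reversed(ep):               # accumulate return from the back
        g += r
        returns.setdefault(state, []).append(g)
V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
# V == {"D": 10.0, "C": 4.0, "B": 8.0, "A": -10.0, "E": -2.0}
```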
Problems with Direct Evaluation
• What’s good about direct evaluation?
– It’s easy to understand
– It doesn’t require any knowledge of T, R
– It eventually computes the correct average values, using just sample transitions
• What’s bad about it?
– It wastes information about state connections
– Each state must be learned separately
– So, it takes a long time to learn
Output Values (from before): A = -10, B = +8, C = +4, D = +10, E = -2
If B and E both go to C under this policy, how can their values be different?
Slide Credit: http://ai.berkeley.edu
Why Not Use Policy Evaluation?
• Simplified Bellman updates calculate V for a fixed policy:
– Each round, replace V with a one-step-look-ahead layer over V:
V^π_{k+1}(s) ← Σ_{s’} T(s,π(s),s’) [ R(s,π(s),s’) + γ V^π_k(s’) ]
– This approach fully exploited the connections between the states
– Unfortunately, we need T and R to do it!
• Key question: how can we do this update to V without knowing T and R?
– In other words, how do we take a weighted average without knowing the weights?
Slide Credit: http://ai.berkeley.edu
Sample‐Based Policy Evaluation?
• We want to improve our estimate of V by computing these averages:
V^π_{k+1}(s) ← Σ_{s’} T(s,π(s),s’) [ R(s,π(s),s’) + γ V^π_k(s’) ]
• Idea: Take samples of outcomes s’ (by doing the action!) and average:
sample_1 = R(s,π(s),s’_1) + γ V^π_k(s’_1)
sample_2 = R(s,π(s),s’_2) + γ V^π_k(s’_2)
…
V^π_{k+1}(s) ← (1/n) Σ_i sample_i
• What’s the difficulty of this algorithm? Almost works! But we can’t rewind time to get sample after sample from state s.
Slide Credit: http://ai.berkeley.edu
Temporal Difference Learning
• Big idea: learn from every experience!
– Update V(s) each time we experience a transition (s, a, s’, r)
– Likely outcomes s’ will contribute updates more often
• Temporal difference learning of values
– Policy still fixed, still doing evaluation!
– Move values toward value of whatever successor occurs: running average
Sample of V(s): sample = R(s,π(s),s’) + γ V^π(s’)
Update to V(s): V^π(s) ← (1-α) V^π(s) + α · sample
Same update: V^π(s) ← V^π(s) + α (sample − V^π(s))
Slide Credit: http://ai.berkeley.edu
Exponential Moving Average
• Exponential moving average
– The running interpolation update: x̄_n = (1-α) · x̄_{n-1} + α · x_n
– Makes recent samples more important
– Forgets about the past
• Why do we want to forget about the past? (distant past values were wrong anyway)
• Decreasing learning rate (alpha) can give converging averages
Slide Credit: http://ai.berkeley.edu
Example: Temporal Difference Learning
Assume: γ = 1, α = 1/2
States (grid): A; B C D; E
Observed transitions: B, east, C, -2 then C, east, D, -2
Values (A; B C D; E):
Initially: 0; 0 0 8; 0
After B, east, C, -2: 0; -1 0 8; 0
After C, east, D, -2: 0; -1 3 8; 0
Slide Credit: http://ai.berkeley.edu
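A sketch of the TD(0) update reproducing this slide's two updates (state D starts at 8, as in the table):

```python
gamma, alpha = 1.0, 0.5
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}

def td_update(s, r, s2):
    sample = r + gamma * V[s2]                  # value of what actually happened
    V[s] = (1 - alpha) * V[s] + alpha * sample  # running average toward it

td_update("B", -2, "C")  # V["B"]: 0 -> -1.0
td_update("C", -2, "D")  # V["C"]: 0 ->  3.0
```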
Problems with TD Value Learning
• TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
• However, if we want to turn values into a (new) policy, we’re sunk:
π(s) = argmax_a Q(s,a), where Q(s,a) = Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V(s’) ]
• Idea: learn Q-values, not values
• Makes action selection model-free too!
Slide Credit: http://ai.berkeley.edu
Active Reinforcement Learning
• Full reinforcement learning: optimal policies (like value iteration)
– You don’t know the transitions T(s,a,s’)
– You don’t know the rewards R(s,a,s’)
– You choose the actions now
– Goal: learn the optimal policy / values
• In this case:
– Learner makes choices!
– Fundamental tradeoff: exploration vs. exploitation
– This is NOT offline planning! You actually take actions in the world and find out what happens…
Slide Credit: http://ai.berkeley.edu
Detour: Q‐Value Iteration
• Value iteration: find successive (depth-limited) values
– Start with V0(s) = 0, which we know is right
– Given Vk, calculate the depth k+1 values for all states:
Vk+1(s) ← max_a Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ Vk(s’) ]
• But Q-values are more useful, so compute them instead
– Start with Q0(s,a) = 0, which we know is right
– Given Qk, calculate the depth k+1 q-values for all q-states:
Qk+1(s,a) ← Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ max_{a’} Qk(s’,a’) ]
Slide Credit: http://ai.berkeley.edu
Q-Learning
• We’d like to do Q-value updates to each Q-state:
Qk+1(s,a) ← Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ max_{a’} Qk(s’,a’) ]
– But can’t compute this update without knowing T, R
• Instead, compute average as we go
– Receive a sample transition (s, a, r, s’)
– This sample suggests Q(s,a) ≈ r + γ max_{a’} Q(s’,a’)
– But we want to average over results from (s, a)
– So keep a running average: Q(s,a) ← (1-α) Q(s,a) + α [ r + γ max_{a’} Q(s’,a’) ]
Slide Credit: http://ai.berkeley.edu
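A minimal tabular Q-learning sketch; epsilon-greedy exploration is one standard way to "try actions out" (the interface and constants are assumptions, not from the slide):

```python
import random
from collections import defaultdict

gamma, alpha, epsilon = 0.9, 0.1, 0.1
Q = defaultdict(float)  # Q[(s, a)], implicitly 0 for unseen pairs

def choose_action(s, actions):
    if random.random() < epsilon:                 # explore
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])  # exploit

def q_update(s, a, r, s2, next_actions):
    # sample = r + gamma * max_a' Q(s', a'); keep a running average toward it
    sample = r + gamma * max((Q[(s2, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```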
Q-Learning Properties
• Amazing result: Q-learning converges to the optimal policy -- even if you’re acting suboptimally!
• This is called off-policy learning
• Caveats:
– You have to explore enough
– You have to eventually make the learning rate small enough
– … but not decrease it too quickly
– Basically, in the limit, it doesn’t matter how you select actions (!)
Slide Credit: http://ai.berkeley.edu
Generalizing Across States
• Basic Q-Learning keeps a table of all q-values
• In realistic situations, we cannot possibly learn about every single state!
– Too many states to visit them all in training
– Too many states to hold the q-tables in memory
• Instead, we want to generalize:
– Learn about some small number of training states from experience
– Generalize that experience to new, similar situations
– This is the fundamental idea in machine learning!
[demo – RL pacman]
Slide Credit: http://ai.berkeley.edu
Example: Pacman
Let’s say we discover through experience that this state is bad:
In naïve q-learning, we know nothing about this state:
Or even this one!
Slide Credit: http://ai.berkeley.edu
Feature-Based Representations
• Solution: describe a state using a vector of features (properties)
– Features are functions from states to real numbers (often 0/1) that capture important properties of the state
– Example features:
• Distance to closest ghost
• Distance to closest dot
• Number of ghosts
• 1 / (dist to dot)²
• Is Pacman in a tunnel? (0/1)
• … etc.
• Is it the exact state on this slide?
– Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
Slide Credit: http://ai.berkeley.edu
Linear Value Functions
• Using a feature representation, we can write a q function (or value function) for any state using a few weights:
V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)
• Advantage: our experience is summed up in a few powerful numbers
• Disadvantage: states may share features but actually be very different in value!
Slide Credit: http://ai.berkeley.edu
Approximate Q-Learning
• Q-learning with linear Q-functions:
transition = (s, a, r, s’)
difference = [ r + γ max_{a’} Q(s’,a’) ] − Q(s,a)
Exact Q’s: Q(s,a) ← Q(s,a) + α · difference
Approximate Q’s: wi ← wi + α · difference · fi(s,a)
• Intuitive interpretation:
– Adjust weights of active features
– E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features
• Formal justification: online least squares
Slide Credit: http://ai.berkeley.edu
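A sketch of the approximate update with a linear Q-function; `features(s, a)` is an assumed problem-specific function returning the list of fi(s, a) values.

```python
alpha, gamma = 0.01, 0.9

def q_value(w, s, a, features):
    # Q(s,a) = w1 f1(s,a) + ... + wn fn(s,a)
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))

def approx_q_update(w, s, a, r, s2, next_actions, features):
    target = r + gamma * max(q_value(w, s2, a2, features) for a2 in next_actions)
    difference = target - q_value(w, s, a, features)
    # wi <- wi + alpha * difference * fi(s,a): credit/blame the active features
    return [wi + alpha * difference * fi for wi, fi in zip(w, features(s, a))]
```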
Minimizing Error*
Approximate q update explained:
Imagine we had only one point x, with features f(x), target value y, and weights w:
error(w) = ½ ( y − Σ_k wk fk(x) )²
∂error/∂wm = −( y − Σ_k wk fk(x) ) fm(x)
wm ← wm + α ( y − Σ_k wk fk(x) ) fm(x)
Here y is the “target” and Σ_k wk fk(x) is the “prediction”
Slide Credit: http://ai.berkeley.edu
Approaches to RL
• Value-based RL
– Estimate the optimal action-value function Q*(s,a)
• Policy-based RL
– Search directly for the optimal policy
• Model
– Build a model of the world
• State transition, reward probabilities
– Plan (e.g. by look-ahead) using model
Deep RL
• Value-based RL
– Use neural nets to represent the Q function
• Policy-based RL
– Use neural nets to represent the policy
• Model
– Use neural nets to represent and learn the model
Q-Networks
Slide Credit: David Silver
Case Study: Playing Atari Games
Objective: Complete the game with the highest score
State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step
Figures copyright Volodymyr Mnih et al., 2013. Reproduced with permission.
[Mnih et al. NIPS Workshop 2013; Nature 2015]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
Q-network Architecture
Q(s, a; θ): neural network with weights θ
Current state st: 84x84x4 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)
Architecture (input: state st):
– 16 8x8 conv, stride 4
– 32 4x4 conv, stride 2
– FC-256
– FC-4 (Q-values)
Familiar conv layers, FC layer.
Last FC layer has 4-d output (if 4 actions), corresponding to Q(st, a1), Q(st, a2), Q(st, a3), Q(st, a4). Number of actions between 4-18 depending on Atari game.
A single feedforward pass computes Q-values for all actions from the current state => efficient!
[Mnih et al. NIPS Workshop 2013; Nature 2015]
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
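A sketch of this architecture in PyTorch (layer sizes follow the slide; the choice of activation and other details are assumptions):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, num_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 -> 16x20x20
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # -> 32x9x9
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),                  # FC-256
            nn.Linear(256, num_actions),                            # FC-4: one Q per action
        )

    def forward(self, x):      # x: (batch, 4, 84, 84) stacked frames
        return self.net(x)     # one pass -> Q-values for all actions

q_values = QNetwork()(torch.zeros(1, 4, 84, 84))  # shape (1, 4)
```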
Deep Q-learning
Remember: want to find a Q-function that satisfies the Bellman Equation:
Q*(s,a) = E_{s’}[ r + γ max_{a’} Q*(s’,a’) | s, a ]
Forward Pass
Loss function: L_i(θ_i) = E_{s,a}[ ( y_i − Q(s,a;θ_i) )² ]
where y_i = E_{s’}[ r + γ max_{a’} Q(s’,a’;θ_{i−1}) | s, a ]
Backward Pass
Gradient update (with respect to Q-function parameters θ):
∇_{θ_i} L_i(θ_i) = E_{s,a,s’}[ ( y_i − Q(s,a;θ_i) ) ∇_{θ_i} Q(s,a;θ_i) ]
Iteratively try to make the Q-value close to the target value (y_i) it should have, if the Q-function corresponds to the optimal Q* (and optimal policy π*)
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
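A sketch of one forward/backward pass of this loss in PyTorch on a batch of transitions; holding the old parameters θ_{i−1} in a separate `target_net` follows the Nature 2015 version, and the names and `done` masking are our assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, s, a, r, s2, done, gamma=0.99):
    # Q(s, a; theta_i) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # target y_i uses the older parameters theta_{i-1}
        y = r + gamma * target_net(s2).max(dim=1).values * (1 - done)
    return F.mse_loss(q_sa, y)  # L_i = E[(y_i - Q(s, a; theta_i))^2]
```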
Training the Q-network: Experience Replay
Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning
- Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops
Address these problems using experience replay:
- Continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played
- Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples
[Mnih et al. NIPS Workshop 2013; Nature 2015]
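A minimal replay-memory sketch (the interface is ours): push transitions as the game is played, then train on random minibatches instead of consecutive samples.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall out

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)  # random -> decorrelated
        return tuple(map(list, zip(*batch)))  # columns: states, actions, ...
```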
Experience Replay
Slide Credit: David Silver
Video by Károly Zsolnai-Fehér. Reproduced with permission.
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n