# Lecture 14: Reinforcement Learning


Fei-Fei Li & Justin Johnson & Serena Yeung · Lecture 14 · May 23, 2017

## Administrative

Grades:
- Midterm grades released last night; see Piazza for more information and statistics
- A2 and milestone grades scheduled for later this week


Projects:
- All teams must register their project; see Piazza for the registration form
- The Tiny ImageNet evaluation server is online


Survey:
- Please fill out the course survey! Link on Piazza or https://goo.gl/forms/eQpVW7IPjqapsDkB2

## So far: Supervised Learning

Data: (x, y), where x is data and y is the label

Goal: Learn a function to map x -> y

Examples: classification, regression, object detection, semantic segmentation, image captioning, etc.

*(Slide figure: a kitten photo classified as "Cat" as a classification example; image is CC0 public domain.)*

## So far: Unsupervised Learning

Data: x. Just data, no labels!

Goal: Learn some underlying hidden structure of the data

Examples: clustering, dimensionality reduction, feature learning, density estimation, etc.

*(Slide figures: 1-d and 2-d density estimation examples; images are CC0 public domain.)*

## Today: Reinforcement Learning

Problems involving an agent interacting with an environment, which provides numeric reward signals

Goal: Learn how to take actions in order to maximize reward

## Overview

- What is Reinforcement Learning?
- Markov Decision Processes
- Q-Learning
- Policy Gradients

## Reinforcement Learning

*(Slide figure, built up over several slides: the agent-environment loop.)*

At each time step, the environment sends the agent a state s_t; the agent responds with an action a_t; the environment then returns a reward r_t together with the next state s_{t+1}, and the loop repeats.

## Cart-Pole Problem

Objective: Balance a pole on top of a movable cart

State: angle, angular speed, position, horizontal velocity
Action: horizontal force applied on the cart
Reward: 1 at each time step if the pole is upright

*(Slide figure: cart-pendulum diagram; image is CC0 public domain.)*
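To make the interaction loop concrete, here is a minimal random-agent sketch for this environment. It assumes the classic OpenAI Gym interface (`CartPole-v1`); the exact `reset`/`step` return signatures vary across Gym versions, so treat this as illustrative rather than definitive:

```python
import gym  # assumes the classic OpenAI Gym API; signatures vary by version

env = gym.make("CartPole-v1")
state = env.reset()  # [cart position, cart velocity, pole angle, pole angular speed]
total_reward = 0.0
done = False

while not done:
    action = env.action_space.sample()         # random policy: push left or right
    state, reward, done, _ = env.step(action)  # reward is +1 per upright step
    total_reward += reward

print("Episode return:", total_reward)
```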

## Robot Locomotion

Objective: Make the robot move forward

State: angle and position of the joints
Action: torques applied on the joints
Reward: 1 at each time step for staying upright, plus reward for forward movement

## Atari Games

Objective: Complete the game with the highest score

State: raw pixel inputs of the game state
Action: game controls, e.g. left, right, up, down
Reward: score increase/decrease at each time step

## Go

Objective: Win the game!

State: positions of all pieces
Action: where to put the next piece down
Reward: 1 if we win at the end of the game, 0 otherwise

*(Slide figure: a Go board mid-game; image is CC0 public domain.)*

Back to the agent-environment loop: how can we mathematically formalize the RL problem?

## Markov Decision Process

- Mathematical formulation of the RL problem
- Markov property: the current state completely characterises the state of the world

Defined by $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)$:

- $\mathcal{S}$: set of possible states
- $\mathcal{A}$: set of possible actions
- $\mathcal{R}$: distribution of reward given (state, action) pair
- $\mathbb{P}$: transition probability, i.e. distribution over next state given (state, action) pair
- $\gamma$: discount factor

- At time step t = 0, environment samples initial state $s_0 \sim p(s_0)$
- Then, for t = 0 until done:
    - Agent selects action $a_t$
    - Environment samples reward $r_t \sim R(\,\cdot \mid s_t, a_t)$
    - Environment samples next state $s_{t+1} \sim P(\,\cdot \mid s_t, a_t)$
    - Agent receives reward $r_t$ and next state $s_{t+1}$
- A policy $\pi$ is a function from $\mathcal{S}$ to $\mathcal{A}$ that specifies what action to take in each state
- Objective: find the policy $\pi^*$ that maximizes the cumulative discounted reward $\sum_{t \geq 0} \gamma^t r_t$ (a rollout sketch follows this list)
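A minimal sketch of this sampling loop in Python. The three `sample_*` functions are hypothetical stand-ins for $p(s_0)$, $R$, and $P$; a real problem would define them from its own dynamics:

```python
import random

# Hypothetical stand-ins for the MDP's distributions p(s0), R(.|s,a), P(.|s,a).
def sample_initial_state():
    return 0

def sample_reward(state, action):
    return random.gauss(0.0, 1.0)

def sample_next_state(state, action):
    return state + action

def rollout(policy, gamma=0.9, max_steps=100):
    """Run one episode and accumulate the discounted return sum_t gamma^t * r_t."""
    state = sample_initial_state()                # s0 ~ p(s0)
    discounted_return = 0.0
    for t in range(max_steps):
        action = policy(state)                    # agent selects a_t
        reward = sample_reward(state, action)     # r_t ~ R(. | s_t, a_t)
        state = sample_next_state(state, action)  # s_{t+1} ~ P(. | s_t, a_t)
        discounted_return += gamma**t * reward
    return discounted_return

print(rollout(policy=lambda s: random.choice([-1, +1])))
```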

## A simple MDP: Grid World

Objective: reach one of the terminal states (greyed out) in the least number of actions

actions = {right, left, up, down}

Set a negative reward for each transition (e.g. r = -1)

*(Slide figure: the grid of states, with greyed-out terminal cells; a code sketch follows.)*
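A minimal sketch of such a grid world as code. The 4x4 size and the choice of terminal corners are illustrative assumptions, not taken from the slide:

```python
# A minimal grid-world MDP sketch. The 4x4 size and the two terminal corners
# are illustrative assumptions.
SIZE = 4
TERMINAL = {(0, 0), (SIZE - 1, SIZE - 1)}
ACTIONS = {"right": (1, 0), "left": (-1, 0), "up": (0, -1), "down": (0, 1)}

def step(state, action):
    """Deterministic transition: move if possible, pay r = -1 per step."""
    if state in TERMINAL:
        return state, 0.0          # episode is over; no further reward
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 0), SIZE - 1)   # clamp at the walls
    y = min(max(state[1] + dy, 0), SIZE - 1)
    return (x, y), -1.0

# One transition from the centre of the grid:
print(step((2, 2), "up"))          # -> ((2, 1), -1.0)
```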

*(Slide figures: the grid world under a random policy vs. under the optimal policy, with arrows marking the action(s) taken in each cell.)*

## The optimal policy π*

We want to find the optimal policy π* that maximizes the sum of rewards.

How do we handle the randomness (initial state, transition probability, ...)? Maximize the expected sum of rewards!

Formally: $\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid \pi\right]$, with $s_0 \sim p(s_0)$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$
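Since this is an expectation over random trajectories, for a fixed policy it can be estimated by averaging many rollouts. A Monte Carlo sketch, assuming the `rollout` function from the MDP sketch above is in scope:

```python
import random

def expected_return(policy, num_episodes=1000):
    """Monte Carlo estimate of E[ sum_t gamma^t r_t | pi ]: average many rollouts."""
    return sum(rollout(policy) for _ in range(num_episodes)) / num_episodes

# Candidate policies can now be compared by estimated expected return;
# pi* is the policy that maximizes this quantity.
print(expected_return(lambda s: random.choice([-1, +1])))
```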

## Definitions: Value function and Q-value function

Following a policy $\pi$ produces sample trajectories (or paths) $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$

How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, \pi\right]$$

How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$$
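Not stated on the slide, but worth noting: the two definitions are connected by the standard identity that a state's value is the expected Q-value of the action the policy picks there:

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[\, Q^{\pi}(s, a) \,\right]$$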

## Bellman equation

The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:

$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$$

Q* satisfies the following Bellman equation:

$$Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$$

Intuition: if the optimal state-action values for the next time step, $Q^*(s', a')$, are known, then the optimal strategy is to take the action that maximizes the expected value of $r + \gamma Q^*(s', a')$.
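The Bellman equation can be used directly as an update rule: repeatedly apply the backup until Q stops changing (value iteration). A tabular sketch on the hypothetical grid world above, reusing its `SIZE`, `TERMINAL`, `ACTIONS`, and `step` definitions (transitions there are deterministic, so the expectation over s' drops out):

```python
# Tabular Q-value iteration on the grid-world sketch: repeatedly apply the
# Bellman backup Q(s,a) <- r + gamma * max_a' Q(s',a').
GAMMA = 0.9
states = [(x, y) for x in range(SIZE) for y in range(SIZE)]
Q = {(s, a): 0.0 for s in states for a in ACTIONS}

for _ in range(100):                      # iterate until (approximate) convergence
    for s in states:
        for a in ACTIONS:
            s_next, r = step(s, a)
            if s_next in TERMINAL:
                Q[(s, a)] = r             # terminal states have zero future value
            else:
                Q[(s, a)] = r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)

# The greedy policy reads the argmax action out of the Q-table:
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in states}
print(greedy[(2, 2)])
```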
