Lecture 17: Reinforcement Learning
CS231n, Stanford University (cs231n.stanford.edu/slides/2020/lecture_17.pdf)
Fei-Fei Li, Ranjay Krishna, Danfei Xu · June 4, 2020


Administrative


Final project report due 6/7

Video due 6/9

Both are optional. See Piazza post @1875.


So far… Supervised Learning


Data: (x, y), where x is data and y is the label

Goal: Learn a function to map x -> y

Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.

(Figure: classification example labeled “Cat”. Image is CC0 public domain.)


So far… Unsupervised Learning


Data: x (just data, no labels!)

Goal: Learn some underlying hidden structure of the data

Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.

(Figures: 1-d and 2-d density estimation. The 2-d density images are CC0 public domain; the 1-d density estimation figure is copyright Ian Goodfellow, 2016, reproduced with permission.)


Today: Reinforcement Learning


Problems involving an agent interacting with an environment, which provides numeric reward signals

Goal: Learn how to take actions in order to maximize reward

Atari games figure copyright Volodymyr Mnih et al., 2013. Reproduced with permission.


Overview

- What is Reinforcement Learning?
- Markov Decision Processes
- Q-Learning
- Policy Gradients
- SOTA Deep RL


What is reinforcement learning?


Reinforcement Learning

(Diagram: the agent-environment loop.) The environment gives the agent a state s_t; the agent takes an action a_t; the environment returns a reward r_t and the next state s_{t+1}, and the loop repeats.
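To make the loop concrete, here is a minimal sketch of the interaction in code, assuming a hypothetical `env` object with `reset()` and `step(action)` methods and an `agent` with an `act(state)` method (the names are illustrative, not from the slides):

```python
# Minimal sketch of the agent-environment loop (hypothetical env/agent API).
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                # environment gives initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                      # agent picks a_t given s_t
        next_state, reward, done = env.step(action)    # env returns r_t and s_{t+1}
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```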


Cart-Pole Problem


Objective: Balance a pole on top of a movable cart

State: angle, angular speed, position, horizontal velocity
Action: horizontal force applied on the cart
Reward: 1 at each time step if the pole is upright

This image is CC0 public domain


Robot Locomotion


Objective: Make the robot move forward

State: angle and position of the joints
Action: torques applied on the joints
Reward: 1 at each time step upright + forward movement

Figures copyright John Schulman et al., 2016. Reproduced with permission.

Schulman et al., “High-dimensional continuous control using generalized advantage estimation”, 2015


Atari Games


Objective: Complete the game with the highest score

State: raw pixel inputs of the game state
Action: game controls, e.g. Left, Right, Up, Down
Reward: score increase/decrease at each time step

Figures copyright Volodymyr Mnih et al., 2013. Reproduced with permission.

Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.


Go


Objective: Win the game!

State: position of all pieces
Action: where to put the next piece down
Reward: 1 if win at the end of the game, 0 otherwise

This image is CC0 public domain.
Silver, David, et al. "Mastering the game of Go without human knowledge." Nature 550.7676 (2017): 354-359.


Markov decision processes (MDPs)



(Recall the agent-environment loop: the environment gives a state s_t, the agent takes an action a_t, and the environment returns a reward r_t and next state s_{t+1}.)

How can we mathematically formalize the RL problem?


Markov Decision Process


- Mathematical formulation of the RL problem
- Markov property: the current state completely characterises the state of the world

Defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)$:
- $\mathcal{S}$: set of possible states
- $\mathcal{A}$: set of possible actions
- $\mathcal{R}$: distribution of reward given a (state, action) pair
- $\mathbb{P}$: transition probability, i.e. distribution over the next state given a (state, action) pair
- $\gamma$: discount factor


Markov Decision Process

- At time step t=0, the environment samples an initial state $s_0 \sim p(s_0)$
- Then, for t=0 until done:
  - Agent selects action $a_t$
  - Environment samples reward $r_t \sim R(\cdot \mid s_t, a_t)$
  - Environment samples next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
  - Agent receives reward $r_t$ and next state $s_{t+1}$

- A policy 𝝿 is a function from S to A that specifies what action to take in each state

- Objective: find the policy 𝝿* that maximizes the cumulative discounted reward $\sum_{t \ge 0} \gamma^t r_t$

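As a small illustration of the objective, here is a sketch of computing the discounted return $\sum_{t \ge 0} \gamma^t r_t$ for one sampled episode (plain Python, no assumptions beyond a list of rewards):

```python
def discounted_return(rewards, gamma=0.99):
    # Computes sum_{t>=0} gamma^t * r_t by accumulating from the end of the episode.
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```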


A simple MDP: Grid World


Objective: reach one of the terminal states (greyed out) in the least number of actions

actions = {

1. right

2. left

3. up

4. down

}

Set a negative “reward” for each transition

(e.g. r = -1)


(Figures: the grid world under a Random Policy vs. the Optimal Policy.)

The optimal policy 𝝿*

We want to find the optimal policy 𝝿* that maximizes the sum of rewards.

How do we handle the randomness (initial state, transition probabilities, …)? Maximize the expected sum of rewards!

Formally:
$\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi\right]$
with $s_0 \sim p(s_0)$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.

Definitions: Value function and Q-value function

Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:
$V^\pi(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$

How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
$Q^\pi(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$


Q-learning


Bellman equation

The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:
$Q^*(s, a) = \max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$

Q* satisfies the following Bellman equation:
$Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$

Intuition: if the optimal state-action values for the next time step, Q*(s’, a’), are known, then the optimal strategy is to take the action that maximizes the expected value of $r + \gamma Q^*(s', a')$.

The optimal policy 𝝿* corresponds to taking the best action in any state as specified by Q*.

Solving for the optimal policy

Value iteration algorithm: use the Bellman equation as an iterative update:
$Q_{i+1}(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q_i(s', a') \mid s, a\right]$

Q_i will converge to Q* as i -> infinity.

What’s the problem with this? Not scalable. We must compute Q(s, a) for every state-action pair. If the state is, e.g., the current game-state pixels, it is computationally infeasible to compute this for the entire state space!

Solution: use a function approximator to estimate Q(s, a), e.g. a neural network!
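A minimal tabular sketch of this Q-value iteration update on a toy MDP (the transition table and rewards below are made up for illustration and the transitions are deterministic, so the expectation is trivial; this is not the slides' grid world):

```python
import numpy as np

# Toy MDP: 3 states, 2 actions. P[s, a] = next state, R[s, a] = reward (illustrative values).
P = np.array([[1, 2],
              [0, 2],
              [2, 2]])            # state 2 is absorbing
R = np.array([[-1.0, -1.0],
              [-1.0,  0.0],
              [ 0.0,  0.0]])
gamma = 0.9

Q = np.zeros((3, 2))
for i in range(100):
    # Bellman backup: Q_{i+1}(s, a) = R(s, a) + gamma * max_a' Q_i(s', a')
    Q = R + gamma * Q[P].max(axis=-1)

print(Q)                  # converged Q* for the toy MDP
print(Q.argmax(axis=1))   # greedy (optimal) action in each state
```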

Solving for the optimal policy: Q-learning

Q-learning: use a function approximator to estimate the action-value function, $Q(s, a; \theta) \approx Q^*(s, a)$, where θ are the function parameters (weights).

If the function approximator is a deep neural network => deep Q-learning!

Solving for the optimal policy: Q-learning

Remember: we want to find a Q-function that satisfies the Bellman equation:
$Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$

Forward pass. Loss function:
$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right]$
where $y_i = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$

Backward pass. Gradient update (with respect to the Q-function parameters θ):
$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s'}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)\nabla_{\theta_i} Q(s, a; \theta_i)\right]$

Iteratively try to make the Q-value close to the target value (y_i) it should have if the Q-function corresponds to the optimal Q* (and the optimal policy 𝝿*).
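A sketch of this forward/backward pass in PyTorch, assuming hypothetical `q_net` and (frozen) `q_target` networks that map a batch of states to per-action Q-values:

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, q_target, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a; theta) for the actions that were actually taken.
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Target y = r + gamma * max_a' Q(s', a'; theta_old), with no gradient through the target.
    with torch.no_grad():
        q_next = q_target(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * q_next
    return F.mse_loss(q_pred, y)

# loss = q_learning_loss(...); loss.backward(); optimizer.step()
```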


Case Study: Playing Atari Games


Objective: Complete the game with the highest score

State: raw pixel inputs of the game state
Action: game controls, e.g. Left, Right, Up, Down
Reward: score increase/decrease at each time step

Figures copyright Volodymyr Mnih et al., 2013. Reproduced with permission.

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Q-network Architecture

$Q(s, a; \theta)$: a neural network with weights θ.

Input: current state s_t, an 84x84x4 stack of the last 4 frames (after RGB->grayscale conversion, downsampling, and cropping).

Familiar conv layers and FC layers:
- 16 8x8 conv filters, stride 4
- 32 4x4 conv filters, stride 2
- FC-256
- FC-4 (Q-values)

The last FC layer has a 4-d output (if 4 actions), corresponding to Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4). The number of actions is between 4 and 18 depending on the Atari game.

A single feedforward pass computes the Q-values for all actions from the current state => efficient!

[Mnih et al. NIPS Workshop 2013; Nature 2015]
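A PyTorch sketch of this architecture (a reading of the layer list above, not the authors' original code):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """84x84x4 frame stack -> Q-values for each action (layer sizes taken from the slide)."""
    def __init__(self, n_actions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 16 8x8 filters, stride 4 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 4x4 filters, stride 2 -> 32x9x9
            nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # FC-n_actions (Q-values)
        )

    def forward(self, x):            # x: (batch, 4, 84, 84)
        return self.fc(self.conv(x))

q_values = QNetwork()(torch.zeros(1, 4, 84, 84))   # shape (1, 4): Q(s, a_1..a_4)
```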

Training the Q-network: Loss function (from before)

$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right]$, where $y_i = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$

Problem: Nonstationary! The “target” for Q(s, a) depends on the current weights θ!

Solution: use a slow-moving “target” Q-network, delayed in its parameter updates, to generate the target values for the Q-network.

[Mnih et al. NIPS Workshop 2013; Nature 2015]

Training the Q-network: Experience Replay

Learning from batches of consecutive samples is problematic:
- Samples are correlated => inefficient learning
- The current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

Address these problems using experience replay:
- Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}) as game (experience) episodes are played
- Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples

Each transition can also contribute to multiple weight updates => greater data efficiency.

[Mnih et al. NIPS Workshop 2013; Nature 2015]
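A minimal replay-memory sketch (a generic implementation for illustration, not the authors' code):

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions (s_t, a_t, r_t, s_{t+1}, done) and samples random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)   # random minibatch breaks temporal correlation
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```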

Putting it together: Deep Q-Learning with Experience Replay

(The full pseudocode is shown on the slide; the annotated steps are:)
- Initialize the replay memory and the Q-network
- Play M episodes (full games)
- Initialize the state (starting game-screen pixels) at the beginning of each episode
- For each timestep t of the game:
  - With small probability select a random action (explore); otherwise select the greedy action from the current policy
  - Take the action a_t, and observe the reward r_t and next state s_{t+1}
  - Store the transition (s_t, a_t, r_t, s_{t+1}) in the replay memory
  - Experience replay: sample a random minibatch of transitions from the replay memory and perform a gradient descent step

[Mnih et al. NIPS Workshop 2013; Nature 2015]
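Putting the pieces together, a compressed sketch of such a training loop (epsilon-greedy exploration, replay memory, periodic target-network sync). It reuses the earlier `QNetwork`, `ReplayMemory`, and `q_learning_loss` sketches and assumes a hypothetical environment whose `reset()`/`step()` return torch-tensor states:

```python
import copy
import random
import torch

def train_dqn(env, n_actions, M=100, epsilon=0.1, batch_size=32, sync_every=1000):
    q_net = QNetwork(n_actions)
    q_target = copy.deepcopy(q_net)                 # slow-moving target network
    memory = ReplayMemory()
    optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)
    step = 0
    for episode in range(M):                        # play M episodes
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore with small probability, otherwise act greedily.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    action = q_net(state.unsqueeze(0)).argmax(dim=1).item()
            next_state, reward, done = env.step(action)
            memory.push(state, action, reward, next_state, float(done))
            state = next_state
            if len(memory) >= batch_size:
                # Experience replay: gradient step on a random minibatch.
                s, a, r, s2, d = memory.sample(batch_size)
                loss = q_learning_loss(q_net, q_target,
                                       torch.stack(s), torch.tensor(a),
                                       torch.tensor(r), torch.stack(s2), torch.tensor(d))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            step += 1
            if step % sync_every == 0:
                q_target.load_state_dict(q_net.state_dict())   # sync the target network
    return q_net
```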


Video by Károly Zsolnai-Fehér. Reproduced with permission.

https://www.youtube.com/watch?v=V1eYniJ0Rnk

What is a problem with Q-learning?

The Q-function can be very complicated!

Example: a robot grasping an object has a very high-dimensional state => hard to learn the exact value of every (state, action) pair.

But the policy can be much simpler: just close your hand.
Can we learn a policy directly, e.g. by finding the best policy from a collection of policies?


Policy gradients


Policy Gradients

Formally, let’s define a class of parameterized policies: $\Pi = \{\pi_\theta,\ \theta \in \mathbb{R}^m\}$

For each policy, define its value: $J(\theta) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi_\theta\right]$

We want to find the optimal policy $\theta^* = \arg\max_\theta J(\theta)$. How can we do this?

Gradient ascent on the policy parameters!

REINFORCE algorithm

Mathematically, we can write the expected reward:
$J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\theta)}[r(\tau)] = \int_\tau r(\tau)\, p(\tau;\theta)\, d\tau$
where $r(\tau)$ is the reward of a trajectory $\tau = (s_0, a_0, r_0, s_1, \ldots)$.

Now let’s differentiate to calculate the gradients:
$\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau;\theta)\, d\tau$

Intractable! The gradient of an expectation is problematic when p depends on θ.

However, we can use a nice trick (the log-derivative trick):
$\nabla_\theta p(\tau;\theta) = p(\tau;\theta)\, \frac{\nabla_\theta p(\tau;\theta)}{p(\tau;\theta)} = p(\tau;\theta)\, \nabla_\theta \log p(\tau;\theta)$

Injecting this back into the integral:
$\nabla_\theta J(\theta) = \int_\tau \left(r(\tau)\, \nabla_\theta \log p(\tau;\theta)\right) p(\tau;\theta)\, d\tau = \mathbb{E}_{\tau \sim p(\tau;\theta)}\left[r(\tau)\, \nabla_\theta \log p(\tau;\theta)\right]$

We can estimate this with Monte Carlo sampling!

Can we compute gradients of a trajectory without knowing the transition probabilities? We have:
$p(\tau;\theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$

Thus:
$\log p(\tau;\theta) = \sum_{t \ge 0} \log p(s_{t+1} \mid s_t, a_t) + \sum_{t \ge 0} \log \pi_\theta(a_t \mid s_t)$

And when differentiating:
$\nabla_\theta \log p(\tau;\theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Doesn’t depend on the transition probabilities!

Therefore, when sampling a trajectory 𝜏, we can estimate the gradient of J(𝜃) with
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$

Intuition

Gradient estimator: $\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$

Interpretation:
- If r(𝜏) is high, push up the probabilities of the actions seen
- If r(𝜏) is low, push down the probabilities of the actions seen

Might seem simplistic to say that if a trajectory is good then all its actions were good. But in expectation, it averages out!

However, this estimator also suffers from high variance because credit assignment is really hard. Can we help the estimator?
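A sketch of one REINFORCE update in PyTorch, assuming a hypothetical `policy` network that outputs action logits for each state (collection of the rollout itself is omitted):

```python
import torch
import torch.nn.functional as F

def reinforce_loss(policy, states, actions, trajectory_reward):
    # log pi_theta(a_t | s_t) for the actions taken along the trajectory.
    logits = policy(states)                               # (T, n_actions)
    log_probs = F.log_softmax(logits, dim=1)
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Gradient ascent on J(theta) == gradient descent on -r(tau) * sum_t log pi(a_t | s_t).
    return -(trajectory_reward * taken).sum()

# loss = reinforce_loss(policy, states, actions, r_tau); loss.backward(); optimizer.step()
```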

Variance reduction

Gradient estimator: $\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$

First idea: push up the probabilities of an action seen only by the cumulative future reward from that state:
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(\sum_{t' \ge t} r_{t'}\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$

Second idea: use a discount factor 𝛾 to ignore delayed effects:
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(\sum_{t' \ge t} \gamma^{t'-t} r_{t'}\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$


Variance reduction: Baseline

Problem: The raw value of a trajectory isn’t necessarily meaningful. For example, if rewards are all positive, you keep pushing up probabilities of actions.

What is important then? Whether a reward is better or worse than what you expect to get

Idea: introduce a baseline function dependent on the state. Concretely, the estimator is now:
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(\sum_{t' \ge t} \gamma^{t'-t} r_{t'} - b(s_t)\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$

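A sketch of computing the per-timestep weights with a simple baseline (here a crude constant baseline, purely for illustration; the slides' b(s_t) is a state-dependent function):

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    # w_t = sum_{t'>=t} gamma^(t'-t) * r_t'
    out, G = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

def centered_weights(rewards, gamma=0.99):
    togo = rewards_to_go(np.asarray(rewards, dtype=float), gamma)
    baseline = togo.mean()   # crude stand-in for b(s_t): reduces variance, keeps the estimator unbiased
    return togo - baseline
```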

How to choose the baseline?

A better baseline: we want to push up the probability of an action from a state if this action was better than the expected value of what we should get from that state.

Q: What does this remind you of?

A: The Q-function and value function!

Intuitively, we are happy with an action a_t in a state s_t if $Q^\pi(s_t, a_t) - V^\pi(s_t)$ is large. On the contrary, we are unhappy with an action if it’s small.

Using this, we get the estimator:
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \left(Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)$


Actor-Critic Algorithm


Problem: we don’t know Q and V. Can we learn them?

Yes, using Q-learning! We can combine Policy Gradients and Q-learning by training both an actor (the policy) and a critic (the Q-function).

- The actor decides which action to take, and the critic tells the actor how good its action was and how it should adjust

- Also alleviates the task of the critic as it only has to learn the values of (state, action) pairs generated by the policy

- Can also incorporate Q-learning tricks, e.g. experience replay
- Remark: we can define the advantage function $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$: how much an action was better than expected

Actor-Critic Algorithm (pseudocode sketch)

Initialize policy parameters 𝜃, critic parameters 𝜙
For iteration = 1, 2, … do
    Sample m trajectories under the current policy
    For i = 1, …, m do
        For t = 1, …, T do
            (compute the advantage and accumulate the policy and critic gradient updates; the per-step equations are shown on the slide)
End for
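A sketch of one actor-critic update in PyTorch, assuming a hypothetical `actor` network that outputs action logits and a `critic` network $V_\phi(s)$ that outputs one scalar value per state:

```python
import torch
import torch.nn.functional as F

def actor_critic_losses(actor, critic, states, actions, returns):
    # Critic estimates V_phi(s_t); the advantage is A_t = return_t - V_phi(s_t).
    values = critic(states).squeeze(-1)               # (T,)
    advantages = returns - values
    # Actor: policy-gradient step weighted by the advantage (detached so the
    # critic is not updated through the actor loss).
    log_probs = F.log_softmax(actor(states), dim=1)
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    actor_loss = -(advantages.detach() * taken).mean()
    # Critic: regress V_phi(s_t) toward the observed returns.
    critic_loss = F.mse_loss(values, returns)
    return actor_loss, critic_loss
```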

REINFORCE in action: Recurrent Attention Model (RAM)

Objective: image classification. Take a sequence of “glimpses” selectively focusing on regions of the image to predict the class.

- Inspiration from human perception and eye movements
- Saves computational resources => scalability
- Able to ignore clutter / irrelevant parts of the image

State: glimpses seen so far
Action: (x, y) coordinates (center of the glimpse) of where to look next in the image
Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise

Glimpsing is a non-differentiable operation => learn the policy for how to take glimpse actions using REINFORCE. Given the state of glimpses seen so far, use an RNN to model the state and output the next action.

[Mnih et al. 2014]

(RAM in action: figures copyright Daniel Levy, 2017, reproduced with permission. [Mnih et al. 2014])

Has also been used in many other tasks including fine-grained image recognition, image captioning, and visual question-answering!


SOTA Deep Reinforcement Learning


SOTA Deep RL: Proximal Policy Optimization

Recall Policy Gradient with baseline:
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} \hat{A}_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$

Objective from Conservative Policy Iteration [Kakade et al. 2002]:
$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\right]$

Relaxation for the PPO objective (KL-penalized):
$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t - \beta\, \mathrm{KL}\!\left[\pi_{\theta_{old}}(\cdot \mid s_t)\,\|\, \pi_\theta(\cdot \mid s_t)\right]\right]$

Key idea: tune the KL penalty weight β adaptively based on KL violation!

[“Proximal Policy Optimization Algorithms”, Schulman et al. 2017]
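A sketch of the adaptive KL-penalty schedule described in the PPO paper (the 1.5 and 2 constants are the heuristics reported there; `d_target` is the desired KL per update):

```python
def update_kl_penalty(beta, observed_kl, d_target):
    # If the new policy moved too far from the old one, penalize KL more strongly;
    # if it barely moved, relax the penalty. (Heuristic from Schulman et al. 2017.)
    if observed_kl > 1.5 * d_target:
        beta *= 2.0
    elif observed_kl < d_target / 1.5:
        beta /= 2.0
    return beta

# After each policy update:
# beta = update_kl_penalty(beta, kl_between(pi_old, pi_new), d_target=0.01)
```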


PPO at scale: OpenAI Five and Dota 2


- 180 years of self-play per day

- 256 GPUs and 128,000 CPU cores

[“OpenAI Five”, OpenAI 2018]

SOTA Deep RL: Soft Actor-Critic

Entropy-regularized RL: reward the policy for producing multi-modal actions.

Learn a soft Q-function with entropy regularization, and the best policy under that Q-function. The best policy is given by a Boltzmann (softmax) distribution over Q-values instead of the maximum!

[“Soft Actor-Critic - Deep Reinforcement Learning with Real-World Robots”, Haarnoja et al. 2018]
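For reference, the entropy-regularized objective this family of methods maximizes is usually written as follows (standard form from the SAC papers; α is a temperature that trades off reward against entropy):

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \Big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
```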


SAC in action: Robotic Manipulation


Learn policies from scratch on a real robot in hours!

[“Soft Actor-Critic - Deep Reinforcement Learning with Real-World Robots”, Haarnoja et al. 2018]


Policy Learning in Robotics


Why is it hard?

- High-dimensional state space (e.g. visuomotor control)

- Continuous action space

- Hard exploration

- Sample efficiency


Addressing Sample Efficiency in Robotics


Sim2Real

- Train a policy in simulation, and then transfer it to real world.

[“Solving Rubik’s Cube with a Robotic Hand”, OpenAI, 2019]


Addressing Sample Efficiency in Robotics

Imitation Learning

- Learn a policy from human demonstrations (provided via VR, phone, etc.)

[“Deep Imitation Learning for Complex Manipulation Tasks from Virtual Reality Teleoperation”, Zhang et al., 2017]
[“Learning to Generalize Across Long-Horizon Tasks from Human Demonstrations”, Mandlekar*, Xu* et al., 2020]


Addressing Sample Efficiency in Robotics

Batch Reinforcement Learning

- Learn a policy from offline data collected by other policies (not necessarily expert).

[“Scaling data-driven robotics with reward sketching and batch reinforcement learning”, Cabi et al., 2020]
[“Implicit Reinforcement without Interaction at Scale for Learning Control from Offline Robot Manipulation Data”, Mandlekar et al., 2020]


Competing against humans in game play



AlphaGo [DeepMind, Nature 2016]:
- Required many engineering tricks
- Bootstrapped from human play
- Beat 18-time world champion Lee Sedol

AlphaGo Zero [Nature 2017]:
- Simplified and elegant version of AlphaGo
- No longer bootstrapped from human play
- Beat (at the time) #1 world-ranked Ke Jie

AlphaZero (Dec. 2017):
- Generalized to beat world-champion programs at chess and shogi as well

MuZero (November 2019):
- Plans through a learned model of the game

Silver et al., “Mastering the game of Go with deep neural networks and tree search”, Nature 2016
Silver et al., “Mastering the game of Go without human knowledge”, Nature 2017
Silver et al., “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play”, Science 2018
Schrittwieser et al., “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model”, arXiv 2019


Competing against humans in game play

November 2019: Lee Sedol announces his retirement.

“With the debut of AI in Go games, I've realized that I'm not at the top even if I become the number one through frantic efforts”

“Even if I become the number one, there is an entity that cannot be defeated”

(Quotes from linked article; image of Lee Sedol is licensed under CC BY 2.0.)


Summary

- Policy gradients: very general, but suffer from high variance, so they require a lot of samples. Challenge: sample efficiency
- Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration
- Guarantees:
  - Policy gradients: converge to a local optimum of J(𝜃), which is often good enough!
  - Q-learning: no guarantees, since you are approximating the Bellman equation with a complicated function approximator



Next Time: Scene Graphs


Figure copyright Krishna et al. 2017. Reproduced with permission.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen et al. "Visual genome: Connecting language and vision using crowdsourced dense image annotations." International Journal of Computer Vision 123, no. 1 (2017): 32-73.