# Reinforcement Learning


Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture 14, May 23, 2017

Reinforcement Learning

from http://cs231n.stanford.edu/slides/2019/


So far… Supervised Learning


Data: (x, y) x is data, y is label

Goal: Learn a function to map x -> y

Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.

(Figure: image classification, label "Cat". Image is CC0 public domain: https://pixabay.com/en/kitten-cute-feline-kitty-domestic-1246693/)


So far… Unsupervised Learning


Data: x Just data, no labels!

Goal: Learn some underlying hidden structure of the data

Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.

(Figures: 1-d and 2-d density estimation. Images are CC0 public domain: https://commons.wikimedia.org/wiki/File:Bivariate_example.png, https://www.flickr.com/photos/omegatron/8533520357)


Today: Reinforcement Learning


Problems involving an agent interacting with an environment, which provides numeric reward signals

Goal: Learn how to take actions in order to maximize reward


Overview

- What is Reinforcement Learning?
- Markov Decision Processes
- Q-Learning
- Policy Gradients


Reinforcement Learning

(Diagram: the agent-environment loop.) At each time step t, the environment sends the agent a state s_t; the agent selects an action a_t; the environment then returns a reward r_t and the next state s_{t+1}, and the loop repeats.
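This loop can be sketched directly in code. A minimal sketch with a made-up toy environment; the `Environment` and `RandomAgent` classes and their method names are illustrative assumptions, not anything defined in the lecture:

```python
import random

class Environment:
    """Toy environment: the state is a step counter; the episode ends at t = 5.
    (Hypothetical example for illustration only.)"""
    def reset(self):
        self.t = 0
        return 0  # initial state s_0

    def step(self, action):
        # Environment samples reward r_t and next state s_{t+1}
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        next_state = self.t
        done = self.t >= 5
        return next_state, reward, done

class RandomAgent:
    def act(self, state):
        return random.choice([0, 1])  # agent selects action a_t

# The agent-environment interaction loop:
env, agent = Environment(), RandomAgent()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = agent.act(state)               # agent acts on state s_t
    state, reward, done = env.step(action)  # env returns r_t and s_{t+1}
    total_reward += reward
```

The same `reset`/`step` shape is what most RL environment interfaces expose; only the contents of the state, action, and reward change between problems.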


Cart-Pole Problem


Objective: Balance a pole on top of a movable cart

State: angle, angular speed, position, horizontal velocity
Action: horizontal force applied on the cart
Reward: 1 at each time step if the pole is upright

(Image is CC0 public domain: https://commons.wikimedia.org/wiki/File:Cart-pendulum.svg)
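To make the state/action/reward interface concrete, here is a sketch of one Euler-integration step of the classic cart-pole dynamics (following the commonly used Barto, Sutton & Anderson formulation; the constants, angle threshold, and function name are assumptions for illustration, not from the slides):

```python
import math

# Physical constants (standard cart-pole values; illustrative assumptions)
GRAVITY = 9.8
CART_MASS = 1.0
POLE_MASS = 0.1
POLE_HALF_LENGTH = 0.5
DT = 0.02  # integration time step in seconds

def cartpole_step(state, force):
    """One Euler step. State: (x, x_dot, theta, theta_dot).
    The action is the horizontal force on the cart; reward is +1 while the
    pole stays within 12 degrees of vertical."""
    x, x_dot, theta, theta_dot = state
    total_mass = CART_MASS + POLE_MASS
    pm_l = POLE_MASS * POLE_HALF_LENGTH
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    temp = (force + pm_l * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass))
    x_acc = temp - pm_l * theta_acc * cos_t / total_mass

    # Euler integration of the four state variables
    x += DT * x_dot
    x_dot += DT * x_acc
    theta += DT * theta_dot
    theta_dot += DT * theta_acc

    reward = 1.0 if abs(theta) < 12 * math.pi / 180 else 0.0
    return (x, x_dot, theta, theta_dot), reward

state = (0.0, 0.0, 0.0, 0.0)  # cart centered, pole upright, everything at rest
state, reward = cartpole_step(state, force=10.0)  # push the cart to the right
```

Pushing the cart right from rest accelerates the cart rightward and tips the pole leftward, which is exactly the coupling a balancing policy has to manage.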


Robot Locomotion


Objective: Make the robot move forward

State: Angle and position of the joints
Action: Torques applied on the joints
Reward: 1 at each time step for staying upright + forward movement


Atari Games


Objective: Complete the game with the highest score

State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step


Go


Objective: Win the game!

State: Position of all pieces
Action: Where to put the next piece down
Reward: 1 if win at the end of the game, 0 otherwise

(Image is CC0 public domain: https://upload.wikimedia.org/wikipedia/commons/a/ab/Go_game_Kobayashi-Kato.png)



How can we mathematically formalize the RL problem?


Markov Decision Process


- Mathematical formulation of the RL problem
- Markov property: the current state completely characterises the state of the world

Defined by the tuple (S, A, R, P, γ):

- S: set of possible states
- A: set of possible actions
- R: distribution of reward given (state, action) pair
- P: transition probability, i.e. distribution over next state given (state, action) pair
- γ: discount factor


Markov Decision Process

- At time step t=0, the environment samples the initial state s_0 ~ p(s_0)
- Then, for t=0 until done:
    - Agent selects action a_t
    - Environment samples reward r_t ~ R(· | s_t, a_t)
    - Environment samples next state s_{t+1} ~ P(· | s_t, a_t)
    - Agent receives reward r_t and next state s_{t+1}
- A policy π is a function from S to A that specifies what action to take in each state
- Objective: find the policy π* that maximizes the cumulative discounted reward Σ_{t≥0} γ^t r_t
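The cumulative discounted reward in the objective is easy to compute for a finite trajectory; a minimal sketch (the function name is illustrative):

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward: sum over t >= 0 of gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# e.g. constant reward 1 for three steps with gamma = 0.5:
# 1 + 0.5 + 0.25 = 1.75
G = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With γ < 1, rewards far in the future count for less, which also keeps the infinite-horizon sum finite.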



A simple MDP: Grid World


Objective: reach one of the terminal states (greyed out) in the fewest number of actions

States: cells of the grid (★ marks the terminal states)
Actions: { right, left, up, down }
Set a negative "reward" for each transition (e.g. r = -1)
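This grid world can be written out concretely. A minimal sketch, assuming a 3x4 grid with terminal states in two corners (the exact layout is an illustrative assumption):

```python
# A tiny grid world: 3x4 grid, terminal states at two corners.
ROWS, COLS = 3, 4
TERMINALS = {(0, 0), (2, 3)}
ACTIONS = {"right": (0, 1), "left": (0, -1), "up": (-1, 0), "down": (1, 0)}

def step(state, action):
    """Deterministic transition; r = -1 per move, so shorter paths score higher."""
    if state in TERMINALS:
        return state, 0.0  # episode is over; no more reward
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < ROWS and 0 <= c < COLS):
        r, c = state  # bumping into a wall leaves you in place
    return (r, c), -1.0

# From (1, 1), two moves reach the terminal at (0, 0) for a return of -2.
s, total = (1, 1), 0.0
for a in ["up", "left"]:
    s, rew = step(s, a)
    total += rew
```

The -1 per-step reward is what makes "reach a terminal state quickly" and "maximize total reward" the same objective.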


A simple MDP: Grid World


(Figure: arrows over the grid comparing a random policy with the optimal policy, which heads for the nearest terminal state ★.)


The optimal policy π*

We want to find the optimal policy π* that maximizes the sum of rewards.

How do we handle the randomness (initial state, transition probability, …)? Maximize the expected sum of rewards!

Formally: π* = argmax_π E[ Σ_{t≥0} γ^t r_t | π ], with s_0 ~ p(s_0), a_t ~ π(· | s_t), s_{t+1} ~ p(· | s_t, a_t)


Definitions: Value function and Q-value function

Following a policy π produces sample trajectories (or paths) s_0, a_0, r_0, s_1, a_1, r_1, …

How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:

V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]

How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:

Q^π(s, a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
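Both expectations can be estimated by Monte Carlo: sample many trajectories under π and average their discounted returns. A minimal sketch on a made-up two-state chain MDP (all probabilities and rewards here are illustrative assumptions):

```python
import random

GAMMA = 0.9

def rollout(state, policy, steps=50):
    """Sample one trajectory under `policy` and return its discounted return.
    Toy chain MDP (illustrative): from state 0, action "go" reaches the
    absorbing state 1 with probability 0.8, paying reward 1 on entry."""
    G, discount = 0.0, 1.0
    for _ in range(steps):
        if state == 1:
            break  # absorbing state: no further reward
        action = policy(state)
        if action == "go" and random.random() < 0.8:
            state, r = 1, 1.0
        else:
            state, r = 0, 0.0
        G += discount * r
        discount *= GAMMA
    return G

def always_go(state):
    return "go"

random.seed(0)
# V^pi(0) is approximated by the average return over many sampled trajectories;
# the exact value here is 0.8 / (1 - 0.2 * 0.9) ~= 0.976.
V0 = sum(rollout(0, always_go) for _ in range(2000)) / 2000
```

Averaging returns like this is the simplest estimator of V^π; Q^π(s, a) is estimated the same way, fixing the first action to a before following π.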


Bellman equation

The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:

Q*(s, a) = max_π E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]

Q* satisfies the following Bellman equation:

Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

Intuition: if the optimal state-action values for the next time-step Q*(s', a') are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ Q*(s', a').
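The Bellman equation suggests an iterative solver: treat it as an update rule and apply it until Q stops changing. A minimal sketch of this Q-value iteration on a deterministic five-state corridor with r = -1 per move (the setup is an illustrative assumption echoing the grid world above):

```python
# States 0..4; state 4 is terminal. Actions move one cell left or right.
N, GAMMA = 5, 1.0
ACTIONS = {"left": -1, "right": +1}

def step(s, a):
    """Deterministic transition with walls at both ends; r = -1 per move."""
    s2 = min(max(s + ACTIONS[a], 0), N - 1)
    return s2, -1.0

# Q-value iteration: repeatedly apply Q(s,a) <- r + gamma * max_a' Q(s',a').
# (Deterministic dynamics, so the expectation over s' collapses.)
Q = {(s, a): 0.0 for s in range(N - 1) for a in ACTIONS}
for _ in range(100):
    for (s, a) in Q:
        s2, r = step(s, a)
        future = 0.0 if s2 == N - 1 else max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] = r + GAMMA * future

# Acting greedily with respect to Q* recovers the optimal policy:
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N - 1)}
```

At the fixed point, Q(s, "right") = -(4 - s), i.e. minus the number of steps to the terminal state, and the greedy policy heads right from every state.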
