
Lecture 14: Reinforcement Learning

Fei-Fei Li & Justin Johnson & Serena Yeung
May 23, 2017

Administrative


Grades:
- Midterm grades released last night; see Piazza for more information and statistics
- A2 and milestone grades scheduled for later this week


Projects:
- All teams must register their project; see Piazza for the registration form
- Tiny ImageNet evaluation server is online


Survey:
- Please fill out the course survey! Link on Piazza or https://goo.gl/forms/eQpVW7IPjqapsDkB2


So far: Supervised Learning

Data: (x, y), where x is data and y is the label

Goal: Learn a function to map x -> y

Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.

[Figure: classification example, a cat image labeled "Cat". Image: https://pixabay.com/en/kitten-cute-feline-kitty-domestic-1246693/, CC0 public domain]

So far: Unsupervised Learning

Data: x. Just data, no labels!

Goal: Learn some underlying hidden structure of the data

Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.

[Figures: 1-d and 2-d density estimation. Images: https://commons.wikimedia.org/wiki/File:Bivariate_example.png and https://www.flickr.com/photos/omegatron/8533520357, CC0 public domain]

Today: Reinforcement Learning

Problems involving an agent interacting with an environment, which provides numeric reward signals

Goal: Learn how to take actions in order to maximize reward

Overview

- What is Reinforcement Learning?
- Markov Decision Processes
- Q-Learning
- Policy Gradients

Reinforcement Learning

[Figure: agent-environment loop. At each time step, the environment gives the agent a state s_t; the agent takes an action a_t; the environment returns a reward r_t and the next state s_{t+1}.]

Cart-Pole Problem

Objective: Balance a pole on top of a movable cart

State: angle, angular speed, position, horizontal velocity
Action: horizontal force applied on the cart
Reward: 1 at each time step if the pole is upright

Image: https://commons.wikimedia.org/wiki/File:Cart-pendulum.svg, CC0 public domain
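For concreteness, here is a minimal sketch of this state/action/reward loop with a random agent, assuming the OpenAI Gym `CartPole-v1` environment and its classic step API (Gym is not part of the lecture; it is used here only as an illustration):

```python
import gym

# Random agent on CartPole: observe the state, push the cart left or right,
# and collect +1 reward for every step the pole stays upright.
env = gym.make("CartPole-v1")
state = env.reset()                 # [cart position, cart velocity, pole angle, pole angular velocity]
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()            # random horizontal push: 0 = left, 1 = right
    state, reward, done, info = env.step(action)  # classic Gym API: (obs, reward, done, info)
    total_reward += reward
print("Episode return:", total_reward)
```

A learned policy would replace the random action with one chosen based on the current state.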

Robot Locomotion

Objective: Make the robot move forward

State: Angle and position of the joints
Action: Torques applied on the joints
Reward: 1 at each time step for being upright + forward movement

Atari Games

Objective: Complete the game with the highest score

State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step

Go

Objective: Win the game!

State: Position of all pieces
Action: Where to put the next piece down
Reward: 1 if win at the end of the game, 0 otherwise

Image: https://upload.wikimedia.org/wikipedia/commons/a/ab/Go_game_Kobayashi-Kato.png, CC0 public domain


How can we mathematically formalize the RL problem?

Markov Decision Process

- Mathematical formulation of the RL problem
- Markov property: the current state completely characterises the state of the world

Defined by $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma)$:
- $\mathcal{S}$: set of possible states
- $\mathcal{A}$: set of possible actions
- $\mathcal{R}$: distribution of reward given a (state, action) pair
- $\mathbb{P}$: transition probability, i.e. distribution over the next state given a (state, action) pair
- $\gamma$: discount factor


- At time step t=0, the environment samples an initial state $s_0 \sim p(s_0)$
- Then, for t=0 until done:
  - Agent selects action $a_t$
  - Environment samples reward $r_t \sim R(\cdot \mid s_t, a_t)$
  - Environment samples next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
  - Agent receives reward $r_t$ and next state $s_{t+1}$
- A policy $\pi$ is a function from $\mathcal{S}$ to $\mathcal{A}$ that specifies what action to take in each state
- Objective: find the policy $\pi^*$ that maximizes the cumulative discounted reward $\sum_{t \geq 0} \gamma^t r_t$ (see the code sketch after this list)
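To make this loop concrete, here is a minimal sketch in Python of rolling out one episode and computing the cumulative discounted reward; the `mdp` and `policy` objects and their method names are hypothetical, used only for illustration:

```python
def run_episode(mdp, policy, gamma=0.9, max_steps=1000):
    """Roll out one episode and return the discounted return sum_t gamma^t * r_t."""
    s = mdp.sample_initial_state()            # s_0 ~ p(s_0)
    total_return, discount = 0.0, 1.0
    for t in range(max_steps):
        a = policy(s)                         # agent selects action a_t
        r, s_next, done = mdp.step(s, a)      # r_t ~ R(.|s_t, a_t), s_{t+1} ~ P(.|s_t, a_t)
        total_return += discount * r          # accumulate gamma^t * r_t
        discount *= gamma
        if done:
            break
        s = s_next
    return total_return
```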


A simple MDP: Grid World


Objective: reach one of the terminal states (greyed out) in the least number of actions

actions = {right, left, up, down}

Set a negative reward for each transition (e.g. r = -1).

[Figure: grid of states, with terminal states greyed out]

[Figure: Grid World under a random policy vs. the optimal policy]
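As a rough illustration, the Grid World above can be written as a tiny MDP in Python; the grid size and terminal cells below are assumptions, not the lecture's exact layout. A random policy wanders and accumulates many -1 rewards, while the optimal policy reaches a terminal state in the fewest moves:

```python
import random

# A tiny deterministic Grid World MDP (illustrative layout).
ROWS, COLS = 3, 4
TERMINALS = {(0, 0), (2, 3)}          # assumed terminal cells
ACTIONS = {"right": (0, 1), "left": (0, -1), "up": (-1, 0), "down": (1, 0)}

def step(state, action):
    """Move to the neighbouring cell if it is on the grid; reward is -1 per move."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    next_state = (r, c) if 0 <= r < ROWS and 0 <= c < COLS else state
    return next_state, -1.0

def random_policy_return(start=(1, 2), max_steps=200):
    """Follow the random policy until a terminal state; return the total reward."""
    state, total = start, 0.0
    for _ in range(max_steps):
        if state in TERMINALS:
            break
        state, reward = step(state, random.choice(list(ACTIONS)))
        total += reward
    return total
```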

The optimal policy $\pi^*$

We want to find the optimal policy $\pi^*$ that maximizes the sum of rewards.

How do we handle the randomness (initial state, transition probability)?


Maximize the expected sum of rewards!

Formally: $\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid \pi\right]$, with $s_0 \sim p(s_0)$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.

Definitions: Value function and Q-value function

Following a policy $\pi$ produces sample trajectories (or paths) $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$


How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, \pi\right]$$


How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$$
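Operationally, both definitions are expectations over trajectories, so they can be estimated by averaging sampled returns. A minimal Monte Carlo sketch, where `rollout_return` is a hypothetical function that runs one episode under $\pi$ from a given start state and returns its discounted sum of rewards:

```python
def mc_value_estimate(rollout_return, state, num_episodes=1000):
    """Monte Carlo estimate of V^pi(s): average the discounted returns of
    episodes that start in `state` and then follow the policy pi."""
    returns = [rollout_return(state) for _ in range(num_episodes)]
    return sum(returns) / len(returns)
```

Estimating $Q^{\pi}(s, a)$ works the same way, except the first action is fixed to $a$ before the policy takes over.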

Bellman equation

The optimal Q-value function $Q^*$ is the maximum expected cumulative reward achievable from a given (state, action) pair:

$$Q^*(s, a) = \max_{\pi} \mathbb{E}\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$$


$Q^*$ satisfies the following Bellman equation:

$$Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$$

Intuition: if the optimal state-action values for the next time step, $Q^*(s', a')$, are known, then the optimal strategy is to take the action that maximizes the expected value of $r + \gamma Q^*(s', a')$.

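To connect this to the Q-Learning discussion that follows, here is a minimal sketch of tabular Q-value iteration, which repeatedly applies the Bellman backup to a small MDP whose dynamics are known; the `transitions` format and all names are hypothetical, used only to illustrate the update:

```python
# Tabular Q-value iteration: repeatedly apply the Bellman optimality backup
#   Q(s, a) <- sum_{s'} P(s' | s, a) * (r + gamma * max_{a'} Q(s', a'))
# `transitions[(s, a)]` is a list of (prob, next_state, reward, done) tuples.
def q_value_iteration(states, actions, transitions, gamma=0.9, num_iters=100):
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(num_iters):
        new_Q = {}
        for s in states:
            for a in actions:
                backup = 0.0
                for prob, s_next, reward, done in transitions[(s, a)]:
                    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
                    backup += prob * (reward + gamma * best_next)
                new_Q[(s, a)] = backup
        Q = new_Q
    return Q
```

Reading off the greedy action $\arg\max_a Q(s, a)$ from the converged table gives an optimal policy; Q-Learning estimates the same quantity from sampled experience instead of a known transition model.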
