Reinforcement Learning. Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture 14


  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Reinforcement Learning

    from http://cs231n.stanford.edu/slides/2019/

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    So far… Supervised Learning

    5

    Data: (x, y) x is data, y is label

    Goal: Learn a function to map x -> y

    Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.

    Cat

    Classification

    This image is CC0 public domain: https://pixabay.com/en/kitten-cute-feline-kitty-domestic-1246693/

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    So far… Unsupervised Learning

    6

    Data: x Just data, no labels!

    Goal: Learn some underlying hidden structure of the data

    Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.

    [Figures: 1-d density estimation; 2-d density estimation. Images left and right are CC0 public domain: https://commons.wikimedia.org/wiki/File:Bivariate_example.png, https://www.flickr.com/photos/omegatron/8533520357]

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Today: Reinforcement Learning

    7

    Problems involving an agent interacting with an environment, which provides numeric reward signals

    Goal: Learn how to take actions in order to maximize reward

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Overview

    - What is Reinforcement Learning? - Markov Decision Processes - Q-Learning - Policy Gradients

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Agent

    Environment

    Reinforcement Learning

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Agent

    Environment

    State st

    Reinforcement Learning

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Agent

    Environment

    Action at State st

    Reinforcement Learning

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Agent

    Environment

    Action at State st Reward rt

    Reinforcement Learning

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Agent

    Environment

    Action at State st Reward rt

    Next state st+1

    Reinforcement Learning
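
    The loop on this slide translates directly into code. Below is a minimal sketch, not from the lecture, where Environment and Agent are hypothetical stand-ins: the environment hands the agent a state st, the agent replies with an action at, and the environment returns reward rt and next state st+1.

    # Minimal agent-environment interaction loop (illustrative sketch; the
    # Environment and Agent classes are hypothetical, not a real library API).
    import random

    class Environment:
        def reset(self):
            """Return an initial state s_0."""
            return 0

        def step(self, action):
            """Apply action a_t; return (reward r_t, next state s_{t+1}, done flag)."""
            reward = random.random()
            next_state = random.randint(0, 9)
            done = random.random() < 0.05
            return reward, next_state, done

    class Agent:
        def act(self, state):
            """Choose an action a_t given the current state s_t."""
            return random.choice([0, 1])

    env, agent = Environment(), Agent()
    state, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = agent.act(state)               # agent -> environment: action a_t
        reward, state, done = env.step(action)  # environment -> agent: r_t, s_{t+1}
        total_reward += reward
    print("episode return:", total_reward)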

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Cart-Pole Problem

    14

    Objective: Balance a pole on top of a movable cart

    State: angle, angular speed, position, horizontal velocity
    Action: horizontal force applied on the cart
    Reward: 1 at each time step if the pole is upright

    This image is CC0 public domain: https://commons.wikimedia.org/wiki/File:Cart-pendulum.svg
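
    This problem is available as a ready-made environment in OpenAI Gym. A minimal random-policy sketch, assuming the classic Gym API in which reset() returns the observation and step() returns a 4-tuple (newer gym/gymnasium releases changed these signatures slightly):

    # Cart-Pole with a random policy (sketch; assumes the classic OpenAI Gym API).
    import gym

    env = gym.make("CartPole-v1")
    obs = env.reset()      # observation: [cart position, cart velocity, pole angle, pole angular velocity]
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()           # random horizontal push: left or right
        obs, reward, done, info = env.step(action)   # reward is +1 for every step the pole stays up
        total_reward += reward
    print("episode return:", total_reward)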

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Robot Locomotion

    15

    Objective: Make the robot move forward

    State: Angle and position of the joints
    Action: Torques applied on joints
    Reward: 1 at each time step the robot is upright and moving forward

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Atari Games

    16

    Objective: Complete the game with the highest score

    State: Raw pixel inputs of the game state
    Action: Game controls, e.g. Left, Right, Up, Down
    Reward: Score increase/decrease at each time step

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Go

    17

    Objective: Win the game!

    State: Position of all pieces
    Action: Where to put the next piece down
    Reward: 1 if win at the end of the game, 0 otherwise

    This image is CC0 public domain: https://upload.wikimedia.org/wikipedia/commons/a/ab/Go_game_Kobayashi-Kato.png

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Agent

    Environment

    Action at State st Reward rt

    Next state st+1

    How can we mathematically formalize the RL problem?

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Markov Decision Process

    19

    - Mathematical formulation of the RL problem
    - Markov property: the current state completely characterises the state of the world

    Defined by the tuple (S, A, R, P, γ):

    S: set of possible states
    A: set of possible actions
    R: distribution of reward given (state, action) pair
    P: transition probability, i.e. distribution over next state given (state, action) pair
    γ: discount factor
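
    To make the five ingredients concrete, they can be written out as plain Python data for a made-up two-state MDP (all of the states, actions, rewards, and probabilities below are illustrative, not from the lecture):

    # The MDP tuple (S, A, R, P, gamma) for a made-up two-state example (illustrative values).
    S = {"s0", "s1"}                       # set of possible states
    A = {"stay", "go"}                     # set of possible actions
    gamma = 0.9                            # discount factor

    # R[(s, a)]: expected reward for the (state, action) pair
    R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
         ("s1", "stay"): 0.0, ("s1", "go"): -1.0}

    # P[(s, a)]: distribution over the next state, as {s': probability}
    P = {("s0", "stay"): {"s0": 1.0},
         ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
         ("s1", "stay"): {"s1": 1.0},
         ("s1", "go"):   {"s0": 0.8, "s1": 0.2}}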

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Markov Decision Process

    - At time step t=0, environment samples initial state s0 ~ p(s0)
    - Then, for t=0 until done:
        - Agent selects action at
        - Environment samples reward rt ~ R( . | st, at)
        - Environment samples next state st+1 ~ P( . | st, at)
        - Agent receives reward rt and next state st+1

    - A policy π is a function from S to A that specifies what action to take in each state

    - Objective: find the policy π* that maximizes the cumulative discounted reward Σ_{t≥0} γ^t rt

    20
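
    The sampling process above maps almost line-for-line onto code. Below is a sketch of a single episode and its discounted return; the helper names (sample_initial_state, sample_reward, sample_next_state) and the terminal-state convention are assumptions for illustration:

    # One episode of the MDP process and its discounted return (sketch; helper
    # names and the "None means terminal" convention are illustrative only).
    def rollout(sample_initial_state, policy, sample_reward, sample_next_state,
                gamma=0.9, max_steps=100):
        s = sample_initial_state()            # s0 ~ p(s0)
        ret, discount = 0.0, 1.0
        for t in range(max_steps):
            a = policy(s)                     # agent selects action a_t
            r = sample_reward(s, a)           # r_t ~ R( . | s_t, a_t)
            s = sample_next_state(s, a)       # s_{t+1} ~ P( . | s_t, a_t)
            ret += discount * r               # accumulate gamma^t * r_t
            discount *= gamma
            if s is None:                     # stop at a terminal state
                break
        return ret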

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    A simple MDP: Grid World

    21

    Objective: reach one of terminal states (greyed out) in least number of actions

    actions = {

    1. right

    2. left

    3. up

    4. down

    }

    Set a negative “reward” for each transition

    (e.g. r = -1)

    [Figure: grid of states, with terminal states greyed out]
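
    The grid world can be coded as a tiny MDP with deterministic transitions, a reward of -1 per move, and absorbing terminal states. A minimal sketch; the 4x4 grid size and the choice of terminal cells are assumptions for illustration:

    # A small deterministic grid-world MDP (sketch; grid size and terminal cells are assumptions).
    ROWS, COLS = 4, 4
    TERMINAL = {(0, 0), (3, 3)}               # greyed-out terminal states
    ACTIONS = {"right": (0, 1), "left": (0, -1), "up": (-1, 0), "down": (1, 0)}

    def step(state, action):
        """Apply an action; return (reward, next_state). Each transition costs -1."""
        if state in TERMINAL:
            return 0, state                   # terminal states are absorbing
        dr, dc = ACTIONS[action]
        r = min(max(state[0] + dr, 0), ROWS - 1)   # moves that would leave the grid
        c = min(max(state[1] + dc, 0), COLS - 1)   # keep the agent on the border
        return -1, (r, c)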

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    A simple MDP: Grid World

    22

    [Figure: Random Policy (left) vs. Optimal Policy (right) on the grid world]

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    The optimal policy π*

    23

    We want to find the optimal policy π* that maximizes the sum of rewards.

    How do we handle the randomness (initial state, transition probability…)?

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    The optimal policy π*

    24

    We want to find the optimal policy π* that maximizes the sum of rewards.

    How do we handle the randomness (initial state, transition probability…)? Maximize the expected sum of rewards!

    Formally: π* = argmax_π E[ Σ_{t≥0} γ^t rt | π ], with s0 ~ p(s0), at ~ π(· | st), st+1 ~ p(· | st, at)
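
    In practice this expectation can be approximated by Monte Carlo sampling: run many episodes under a candidate policy and average their discounted returns. A short sketch; run_episode is an illustrative stand-in for sampling one episode's return under the environment's randomness:

    # Monte Carlo estimate of the expected discounted return of a policy (sketch;
    # run_episode(policy) is an illustrative stand-in for one sampled episode).
    def estimate_expected_return(policy, run_episode, num_episodes=1000):
        total = 0.0
        for _ in range(num_episodes):
            total += run_episode(policy)      # one sample of sum_t gamma^t * r_t
        return total / num_episodes           # approximates E[ sum_t gamma^t * r_t ]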

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Definitions: Value function and Q-value function

    25

    Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Definitions: Value function and Q-value function

    26

    Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

    How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:

    V^π(s) = E[ Σ_{t≥0} γ^t rt | s0 = s, π ]

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Definitions: Value function and Q-value function

    27

    Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …

    How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:

    V^π(s) = E[ Σ_{t≥0} γ^t rt | s0 = s, π ]

    How good is a state-action pair? The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:

    Q^π(s, a) = E[ Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π ]
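
    Both quantities can be estimated by sampling: V^π(s) averages returns of episodes that start in s and follow π throughout, while Q^π(s, a) forces the first action to be a and follows π afterwards. A sketch assuming a deterministic policy; episode_return_from(s, a, policy) is an illustrative helper that samples one discounted return starting with (s, a):

    # Monte Carlo estimates of V^pi(s) and Q^pi(s, a) (sketch; episode_return_from
    # is an illustrative helper, and the policy is assumed deterministic).
    def estimate_V(s, policy, episode_return_from, n=1000):
        # V^pi(s): start in s and follow pi from the very first step.
        return sum(episode_return_from(s, policy(s), policy) for _ in range(n)) / n

    def estimate_Q(s, a, policy, episode_return_from, n=1000):
        # Q^pi(s, a): force the first action to be a, then follow pi.
        return sum(episode_return_from(s, a, policy) for _ in range(n)) / n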

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Bellman equation

    28

    The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:

    Q*(s, a) = max_π E[ Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π ]

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Bellman equation

    29

    Q* satisfies the following Bellman equation:

    Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

    Intuition: if the optimal state-action values for the next time-step Q*(s', a') are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ Q*(s', a')

    The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:

    Q*(s, a) = max_π E[ Σ_{t≥0} γ^t rt | s0 = s, a0 = a, π ]

  • Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 14 - May 23, 2017

    Bellman equation

    30

    Q* satisfies the following Bellman equation:

    Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

    Intuition: if the optimal state-action values for the next time-step Q*(s', a') are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ Q*(s', a')
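
    The Bellman equation can be turned directly into an iterative update: repeatedly replace Q(s, a) by the expected value of r + γ max_{a'} Q(s', a') under the transition distribution. A tabular sketch; the table formats for R and P (expected reward and next-state distribution per (state, action) pair) are assumptions matching the toy example sketched earlier:

    # Tabular Q-value iteration via repeated Bellman backups (sketch; R[(s, a)] is an
    # expected reward and P[(s, a)] a dict {s': probability} -- assumed table formats).
    def q_value_iteration(S, A, R, P, gamma=0.9, sweeps=100):
        Q = {(s, a): 0.0 for s in S for a in A}
        for _ in range(sweeps):
            Q = {(s, a): R[(s, a)] + gamma * sum(
                     p * max(Q[(s2, a2)] for a2 in A)   # E_{s'}[ max_a' Q(s', a') ]
                     for s2, p in P[(s, a)].items())
                 for s in S for a in A}
        return Q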