
Transcript of Machine Learning for Robotics, Intelligent Systems Series, Lecture 6 (public.hronopik.de/files/ML4Rob/lecture6.pdf)


Machine Learning for Robotics
Intelligent Systems Series

Lecture 6

Georg Martius
Slides adapted from David Silver, DeepMind

MPI for Intelligent Systems, Tübingen, Germany

May 31, 2017


Reminder: Branches of Machine Learning


Many Types and Areas of Reinforcement Learning


Characteristics of Reinforcement Learning

What makes reinforcement learning different from other machine learning paradigms?

• There is no supervisor, only a reward signal
• Feedback is delayed, not instantaneous
• Time really matters (sequential, non-i.i.d. data)
• Agent’s actions affect the subsequent data it receives


Examples of Reinforcement Learning

• Fly stunt manoeuvres in a helicopter

• Defeat the world champion at Backgammon
• Manage an investment portfolio
• Control a power station
• Make a humanoid robot walk
• Play many different Atari games better than humans
• Beat the best human player in Go


Examples – Helicopter Manoeuvres

https://www.youtube.com/watch?v=0JL04JJjocc


Examples – Bipedal Robots

https://www.youtube.com/watch?v=No-JwwPbSLA


Examples – Atari Games

https://www.youtube.com/watch?v=Vr5MR5lKOc8


Rewards

• A reward Rt is a scalar feedback signal
• Indicates how well the agent is doing at step t
• The agent’s job is to maximize cumulative reward

Reinforcement learning is based on the reward hypothesis

Definition (Reward Hypothesis)
All goals can be described by the maximization of expected cumulative reward

Do you agree with this statement?


Examples of Rewards

• Fly stunt manoeuvres in a helicopter
  +ve reward for following desired trajectory
  −ve reward for crashing
• Defeat the world champion at Backgammon
  +/−ve reward for winning/losing a game
• Manage an investment portfolio
  +ve reward for each $ in bank
• Control a power station
  +ve reward for producing power
  −ve reward for exceeding safety thresholds
• Make a humanoid robot walk
  +ve reward for forward motion
  −ve reward for falling over
• Play many different Atari games better than humans
  +/−ve reward for increasing/decreasing score


Sequential Decision Making

• Goal: select actions to maximize total future reward

• Actions may have long-term consequences
• Reward may be delayed
• It may be better to sacrifice immediate reward to gain more long-term reward
• Examples:
  A financial investment (may take months to mature)
  Refueling a helicopter (might prevent a crash in several hours)
  Blocking opponent moves (might help winning chances many moves from now)


Agent and Environment

• At each step t the agent:
  Executes action At
  Receives observation Ot
  Receives scalar reward Rt
• The environment:
  Receives action At
  Emits observation Ot+1
  Emits scalar reward Rt+1
• t increments at env. step
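This interaction loop can be made concrete with a minimal sketch; the Environment/Agent interfaces, observations and rewards below are invented for illustration and are not part of the slides:

```python
# Minimal agent-environment interaction loop (illustrative sketch).
import random

class Environment:
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return 0.0                                 # initial observation O_1
    def step(self, action):
        self.t += 1                                # t increments at the env. step
        observation = random.random()              # emits O_{t+1}
        reward = 1.0 if action == 1 else 0.0       # emits R_{t+1}
        done = self.t >= 10
        return observation, reward, done

class Agent:
    def act(self, observation):
        return random.choice([0, 1])               # picks the next action A_t

env, agent = Environment(), Agent()
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = agent.act(obs)                        # agent executes A_t
    obs, reward, done = env.step(action)           # receives O_{t+1} and R_{t+1}
    total_reward += reward                         # goal: maximize cumulative reward
print("cumulative reward:", total_reward)
```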


History and State

• The history is the sequence of observations, actions, rewards

Ht = O1, R1, A1, ..., At−1, Ot, Rt

• i.e. all observable variables up to time t
• i.e. the sensorimotor stream of a robot or embodied agent
• What happens next depends on the history:
  The agent selects actions
  The environment selects observations/rewards
• State is the information used to determine what happens next
• Formally, state is a function of the history:

St = f(Ht)


Environment/World State

• The environment state S^e_t is the environment’s private representation
• i.e. whatever data the environment uses to pick the next observation/reward
• The environment state is not usually visible to the agent
• Even if S^e_t is visible, it may contain irrelevant information


Agent State

• The agent state S^a_t is the agent’s internal representation
• i.e. whatever information the agent uses to pick the next action
• i.e. it is the information used by reinforcement learning algorithms
• It can be any function of the history:

S^a_t = f(Ht)


Information State

An information state (a.k.a. Markov state) contains all useful information from the history.

Definition
A state St is Markov if and only if

P(St+1 | St) = P(St+1 | S1, ..., St)

• The future is independent of the past given the present

H1:t → St → Ht+1:∞

• Once the state is known, the history may be thrown away
• i.e. the state is a sufficient statistic of the future
• The environment state S^e_t is Markov
• The history Ht is Markov
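A small empirical illustration of the definition, assuming a hand-picked two-state Markov chain (not from the slides): once St is known, conditioning on additional history does not change the distribution of St+1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary 2-state Markov chain; row s of P is P(S_{t+1} | S_t = s).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

s, traj = 0, [0]
for _ in range(200_000):                        # simulate a long trajectory
    s = rng.choice(2, p=P[s])
    traj.append(s)
traj = np.array(traj)

t = np.arange(1, len(traj) - 1)
cur1 = traj[t] == 1                             # steps where S_t = 1
prev0 = traj[t - 1] == 0                        # ... and additionally S_{t-1} = 0
p_next1_given_cur = traj[t + 1][cur1].mean()
p_next1_given_more = traj[t + 1][cur1 & prev0].mean()
print(p_next1_given_cur, p_next1_given_more)    # both approx. 0.7: extra history adds nothing
```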


Rat Example

• What if agent state = last 3 items in sequence?
• What if agent state = counts for lights, bells and levers?
• What if agent state = complete sequence?
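These three choices can be written as three different functions f(Ht) of the observation history; a small sketch, where the item sequence is made up since the slide's figure is not reproduced here:

```python
from collections import Counter

# Hypothetical observation history of the rat experiment (lights, bells, levers).
history = ["light", "bell", "lever", "light", "light", "bell", "lever", "lever"]

state_last3 = tuple(history[-3:])      # agent state = last 3 items in the sequence
state_counts = Counter(history)        # agent state = counts for lights, bells and levers
state_full = tuple(history)            # agent state = complete sequence

print(state_last3, dict(state_counts), state_full)
```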


Fully Observable Environments

Full observability: agent directly observes environment state

Ot = S^a_t = S^e_t

• Agent state = environment state = information state
• Formally, this is a Markov decision process (MDP)


Partially Observable Environments

• Partial observability: agent indirectly observes environment:
  A robot with camera vision isn’t told its absolute location
  A trading agent only observes current prices
  A poker playing agent only observes public cards
• Now agent state ≠ environment state
• Formally this is a partially observable Markov decision process (POMDP)
• Agent must construct its own state representation S^a_t, e.g.
  Complete history: S^a_t = Ht
  Beliefs of environment state: S^a_t = (P(S^e_t = s1), ..., P(S^e_t = sn))
  Recurrent neural network: S^a_t = σ(S^a_{t−1} Ws + Ot Wo)
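A sketch of the last two constructions; the dimensions, weights and observation/likelihood values below are invented, and tanh stands in for the σ of the recurrent update:

```python
import numpy as np

rng = np.random.default_rng(0)

# (2) Belief over n possible environment states: keep a probability vector and
#     reweight it by how likely the new observation is under each state.
def belief_update(belief, obs_likelihood):      # obs_likelihood[i] ~ p(O_t | S^e_t = s_i)
    posterior = belief * obs_likelihood
    return posterior / posterior.sum()

# (3) Recurrent state update  S^a_t = sigma(S^a_{t-1} Ws + O_t Wo)
def rnn_state(prev_state, obs, Ws, Wo):
    return np.tanh(prev_state @ Ws + obs @ Wo)

n, d_obs, d_state = 3, 4, 8
Ws = rng.normal(size=(d_state, d_state))
Wo = rng.normal(size=(d_obs, d_state))
belief = np.ones(n) / n                         # start from a uniform belief
state = np.zeros(d_state)
for _ in range(5):                              # a few steps with fake observations
    obs = rng.normal(size=d_obs)
    belief = belief_update(belief, rng.random(n))
    state = rnn_state(state, obs, Ws, Wo)
print(belief, state[:3])
```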


Major Components of an RL Agent

An RL agent may include one or more of these components:
• Policy: agent’s behaviour function
• Value function: how good is each state and/or action
• Model: agent’s representation of the environment


Policy

• A policy defines the agent’s behaviour

• It is a map from state to action, e.g.
• Deterministic policy: a = π(s)
• Stochastic policy: π(a|s) = P(At = a | St = s)
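A small sketch of both kinds of policy for a made-up discrete problem; the states, actions and probabilities are placeholders:

```python
import random

# Deterministic policy a = pi(s): a simple lookup table.
pi_det = {"s0": "E", "s1": "N", "s2": "W"}

def act_deterministic(state):
    return pi_det[state]

# Stochastic policy pi(a|s) = P(A_t = a | S_t = s): explicit probabilities per state.
pi_stoch = {
    "s0": {"N": 0.1, "E": 0.7, "S": 0.1, "W": 0.1},
    "s1": {"N": 0.4, "E": 0.2, "S": 0.2, "W": 0.2},
}

def act_stochastic(state):
    probs = pi_stoch[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(act_deterministic("s0"), act_stochastic("s1"))
```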


Value Function

• Value function is a prediction of future reward

• Used to evaluate the goodness/badness of states
• Used to select which action to take
• e.g. models the expected discounted future reward

vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + · · · | St = s]
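A sketch of the quantity inside this expectation (the discounted return) and a naive Monte Carlo estimate of vπ(s) obtained by averaging returns over sampled episodes; the reward sequences below are invented:

```python
def discounted_return(rewards, gamma=0.9):
    # G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Monte Carlo estimate of v_pi(s): average the return over episodes started in s.
episodes_from_s = [
    [0.0, 0.0, 1.0],              # rewards R_{t+1}, R_{t+2}, ... of one episode
    [0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
v_estimate = sum(discounted_return(ep) for ep in episodes_from_s) / len(episodes_from_s)
print(round(v_estimate, 3))
```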


Model

• A model predicts what the environment will do next

• P predicts the next state
• R predicts the next (immediate) reward, e.g.

P^a_{ss′} = P(St+1 = s′ | St = s, At = a)
R^a_s = E[Rt+1 | St = s, At = a]

• Can be used for planning without actually performing actions
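A sketch of a tabular model, with P^a_{ss′} and R^a_s stored as dictionaries and used for a one-step lookahead without touching the real environment; the tiny MDP and its numbers are invented:

```python
# Tabular model of a made-up two-state MDP: P[s][a][s'] and R[s][a].
P = {
    "s0": {"go": {"s1": 0.8, "s0": 0.2}},
    "s1": {"go": {"s1": 1.0}},
}
R = {
    "s0": {"go": 1.0},
    "s1": {"go": 0.0},
}

def one_step_lookahead(s, a, value, gamma=0.9):
    # Expected immediate reward plus discounted value of the predicted next state,
    # computed purely from the model (planning, no real interaction).
    return R[s][a] + gamma * sum(p * value[s2] for s2, p in P[s][a].items())

value = {"s0": 0.0, "s1": 0.0}
print(one_step_lookahead("s0", "go", value))
```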


Maze Example

Environment

• Rewards: −1 per time-step
• Actions: N, E, S, W
• States: Agent’s location
• End at Goal state
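A sketch of such an environment as a tiny grid world; the layout, start and goal are invented, while the −1-per-step reward, the N/E/S/W actions and the location-as-state follow the slide:

```python
# Tiny grid world in the spirit of the maze example.
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

class GridWorld:
    def __init__(self, rows=3, cols=4, goal=(0, 3)):
        self.rows, self.cols, self.goal = rows, cols, goal
        self.pos = (rows - 1, 0)                            # start in the bottom-left corner

    def step(self, action):
        dr, dc = MOVES[action]
        r = min(max(self.pos[0] + dr, 0), self.rows - 1)    # bump into walls: stay in grid
        c = min(max(self.pos[1] + dc, 0), self.cols - 1)
        self.pos = (r, c)
        done = self.pos == self.goal                        # episode ends at the goal state
        return self.pos, -1.0, done                         # reward: -1 per time-step

env = GridWorld()
state, reward, done = env.step("N")
print(state, reward, done)
```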


Maze Example: Policy and Value Function

Arrows represent policy π(s)

Numbers represent value vπ(s)


Maze Example: Model

• Agent may have an internal model of the environment
• Dynamics: how actions change the state
• Rewards: how much reward from each state
• The model may be imperfect (most likely is)
• Grid layout represents transition model P^a_{ss′}
• Numbers represent immediate reward R^a_s from each state s (same for all a)


Categorization of RL agents

• Value Based
  No Policy (Implicit)
  Value Function
• Policy Based
  Policy
  No Value Function
• Actor Critic
  Policy
  Value Function
• Model Free
  Policy and/or Value Function
  No Model
• Model Based
  Policy and/or Value Function
  Model


RL Agent Taxonomy


Learning and Planning

Two fundamental problems in sequential decision making
• Reinforcement Learning:
  The environment is initially unknown
  The agent interacts with the environment
  The agent improves its policy
• Planning:
  A model of the environment is known
  The agent performs computations with its model (without any external interaction)
  The agent improves its policy
  a.k.a. deliberation, reasoning, introspection, pondering, thought, search


Learning and Planning

Two fundamental problems in sequential decision making• Reinforcement Learning:

The environment is initially unknownThe agent interacts with the environmentThe agent improves its policy

• Planning:A model of the environment is knownThe agent performs computations with its model (without any externalinteraction)The agent improves its policy

a.k.a. deliberation, reasoning, introspection, pondering, thought, search

Georg Martius Machine Learning for Robotics May 31, 2017 30 / 47

Page 104: MachineLearningforRobotics IntelligentSystemsSeries Lecture6public.hronopik.de/files/ML4Rob/lecture6.pdf · Lecture6 GeorgMartius SlidesadaptedfromDavidSilver,Deepmind MPI for Intelligent

Learning and Planning

Two fundamental problems in sequential decision making• Reinforcement Learning:

The environment is initially unknownThe agent interacts with the environmentThe agent improves its policy

• Planning:A model of the environment is knownThe agent performs computations with its model (without any externalinteraction)The agent improves its policya.k.a. deliberation, reasoning, introspection, pondering, thought, search

Georg Martius Machine Learning for Robotics May 31, 2017 30 / 47

Page 105:

Atari Example: Reinforcement Learning

• Rules of the game are unknown

• Learn directly from interactive game-play

• Pick actions on joystick, see pixels and scores

Georg Martius Machine Learning for Robotics May 31, 2017 31 / 47

Page 108:

Atari Example: Planning

• Rules of the game are known

• Can query the emulator: a perfect model inside the agent's brain

• If I take action a from state s:
  what would the next state be?
  what would the score be?

• Plan ahead to find the optimal policy, e.g. by tree search

Georg Martius Machine Learning for Robotics May 31, 2017 32 / 47
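
As a sketch of the planning idea (not the actual search used in published Atari results), the snippet below does a depth-limited lookahead over a hypothetical emulator interface with get_state(), set_state() and step(action) methods; the interface and names are assumptions for illustration:

def lookahead_value(emulator, actions, depth):
    """Best score reachable within `depth` imagined steps (exhaustive lookahead)."""
    if depth == 0:
        return 0.0
    snapshot = emulator.get_state()          # remember where we are
    best = float("-inf")
    for a in actions:
        reward, done = emulator.step(a)      # ask the model: "what if?"
        value = reward
        if not done:
            value += lookahead_value(emulator, actions, depth - 1)
        best = max(best, value)
        emulator.set_state(snapshot)         # undo the imagined step
    return best

def plan_action(emulator, actions, depth=3):
    """Pick the action whose imagined future looks best."""
    snapshot = emulator.get_state()
    scores = {}
    for a in actions:
        reward, done = emulator.step(a)
        scores[a] = reward + (0.0 if done else lookahead_value(emulator, actions, depth - 1))
        emulator.set_state(snapshot)
    return max(scores, key=scores.get)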

Page 114:

Exploration and Exploitation

• Reinforcement learning is like trial-and-error learning

• The agent should discover a good policy
• From its experiences of the environment
• Without losing too much reward along the way

• Exploration finds more information about the environment
• Exploitation exploits known information to maximize reward
• It is usually important to explore as well as exploit

Georg Martius Machine Learning for Robotics May 31, 2017 33 / 47
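
A standard way to balance the two is ε-greedy action selection; the sketch below (with a made-up table of action values) is one common recipe, not the only one:

import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the current estimates."""
    if random.random() < epsilon:
        return random.choice(actions)                  # exploration: try something else
    return max(actions, key=lambda a: q_values[a])     # exploitation: current best

# Example with made-up action values: mostly picks "b", occasionally tries the others.
q = {"a": 0.2, "b": 1.0, "c": 0.5}
print(epsilon_greedy(q, list(q)))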

Page 121:

Examples

• Restaurant Selection
  Exploitation: go to your favorite restaurant
  Exploration: try a new restaurant

• Online Banner Advertisements
  Exploitation: show the most successful advert
  Exploration: show a different advert

• Game Playing
  Exploitation: play the move you believe is best
  Exploration: play an experimental move

• Robot Control
  Exploitation: do the movement you know works best
  Exploration: try a different movement

Georg Martius Machine Learning for Robotics May 31, 2017 34 / 47

Page 125:

Prediction and Control

• Prediction: evaluate the future (how well do I do under a given policy?)

• Control: optimize the future (find the best policy)

Georg Martius Machine Learning for Robotics May 31, 2017 35 / 47

Page 127:

Gridworld Example: Prediction

• What is the value function for the uniform random policy?

Georg Martius Machine Learning for Robotics May 31, 2017 36 / 47
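
Iterative policy evaluation answers this prediction question; the sketch below evaluates the uniform random policy on an assumed small gridworld (4x4, reward -1 per step, two terminal corners, undiscounted), which may differ in detail from the gridworld in the figure:

# Sketch: iterative policy evaluation of the uniform random policy
# on an assumed 4x4 gridworld (reward -1 per step, terminal corners, gamma = 1).

N = 4
terminals = {(0, 0), (N - 1, N - 1)}
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    r, c = state
    nr, nc = r + action[0], c + action[1]
    if not (0 <= nr < N and 0 <= nc < N):  # bumping into a wall: stay put
        nr, nc = r, c
    return (nr, nc), -1.0

V = {(r, c): 0.0 for r in range(N) for c in range(N)}
for _ in range(200):  # sweep until (approximately) converged
    for s in V:
        if s in terminals:
            continue
        total = 0.0
        for a in actions:  # Bellman expectation backup under the random policy
            next_s, reward = step(s, a)
            total += reward + V[next_s]
        V[s] = total / len(actions)

print(round(V[(0, 1)], 1))  # value of a state next to a terminal corner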

Page 129:

Gridworld Example: Control

• What is the optimal value function over all possible policies?
• What is the optimal policy?

Georg Martius Machine Learning for Robotics May 31, 2017 37 / 47
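
For the control question, the same kind of backup can take a max over actions instead of an average (value iteration); a minimal, self-contained sketch on the same kind of assumed 4x4 gridworld as above:

# Sketch: value iteration (Bellman optimality backup) on an assumed 4x4 gridworld.

N = 4
terminals = {(0, 0), (N - 1, N - 1)}
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(state, action):
    r, c = state
    nr, nc = r + action[0], c + action[1]
    if not (0 <= nr < N and 0 <= nc < N):
        nr, nc = r, c
    return (nr, nc), -1.0

V = {(r, c): 0.0 for r in range(N) for c in range(N)}
for _ in range(100):
    for s in V:
        if s not in terminals:
            # max over actions instead of a policy average: optimality backup
            V[s] = max(step(s, a)[1] + V[step(s, a)[0]] for a in actions)

# The optimal policy is then greedy with respect to the optimal values.
policy = {}
for s in V:
    if s not in terminals:
        policy[s] = max(actions, key=lambda a: step(s, a)[1] + V[step(s, a)[0]])

print(V[(0, 3)], policy[(0, 3)])  # optimal value and greedy action of one corner state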

Page 131:

Markov Decision Processes

Georg Martius Machine Learning for Robotics May 31, 2017 38 / 47

Page 132:

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, . . . with the Markov property.

Reminder: Markov property
A state St is Markov if and only if

P (St+1 | St) = P (St+1 | S1, . . . , St)

Definition (Markov Process / Markov Chain)
A Markov Process (or Markov Chain) is a tuple (S, P)
• S is a (finite) set of states
• P is a state transition probability matrix,

Pss′ = P (St+1 = s′ | St = s)

Georg Martius Machine Learning for Robotics May 31, 2017 39 / 47

Page 135:

Example: Student Markov Chain

Georg Martius Machine Learning for Robotics May 31, 2017 40 / 47

Page 136:

Example: Student Markov Chain Episodes

Sample episodes starting from S1 = C1

S1, S2, . . . , ST

• C1 C2 C3 Pass Sleep

• C1 FB FB C1 C2 Sleep
• C1 C2 C3 Pub C2 C3 Pass Sleep
• C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep

Georg Martius Machine Learning for Robotics May 31, 2017 41 / 47

Page 140:

Example: Student Markov Chain Transition Matrix

Georg Martius Machine Learning for Robotics May 31, 2017 42 / 47
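
To connect the transition matrix to the sampled episodes on the earlier slide, here is a small sketch with an illustrative transition table for the student chain; the exact probabilities are assumptions for demonstration and may differ from the numbers in the figure:

import random

# Illustrative transition probabilities P for the student Markov chain
# (assumed values for demonstration only).
P = {
    "C1":    {"C2": 0.5, "FB": 0.5},
    "C2":    {"C3": 0.8, "Sleep": 0.2},
    "C3":    {"Pass": 0.6, "Pub": 0.4},
    "Pass":  {"Sleep": 1.0},
    "Pub":   {"C1": 0.2, "C2": 0.4, "C3": 0.4},
    "FB":    {"C1": 0.1, "FB": 0.9},
    "Sleep": {},  # terminal state
}

def sample_episode(start="C1"):
    """Sample S1, S2, ..., ST by repeatedly drawing the next state from P."""
    state, episode = start, [start]
    while P[state]:
        state = random.choices(list(P[state]), weights=list(P[state].values()))[0]
        episode.append(state)
    return episode

print(sample_episode())  # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']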

Page 141:

Markov Reward Process

A Markov reward process is a Markov chain with values.

Definition (MRP)
A Markov Reward Process is a tuple (S, P, R, γ)
• S is a finite set of states
• P is a state transition probability matrix,

Pss′ = P (St+1 = s′ | St = s)

• R is a reward function, Rs = E[Rt+1 | St = s]
• γ is a discount factor, γ ∈ [0, 1]

Georg Martius Machine Learning for Robotics May 31, 2017 43 / 47

Page 144:

Example: Student MRP

Georg Martius Machine Learning for Robotics May 31, 2017 44 / 47

Page 145:

Return

Definition
The return Gt is the total discounted reward from time-step t.

Gt = Rt+1 + γ Rt+2 + · · · = ∑_{k=0}^{∞} γ^k Rt+k+1

• The discount γ ∈ [0, 1] is the present value of future rewards

• The value of receiving reward R after k + 1 time-steps is γ^k R.
• This values immediate reward above delayed reward.

  γ close to 0 leads to “myopic” evaluation
  γ close to 1 leads to “far-sighted” evaluation

Georg Martius Machine Learning for Robotics May 31, 2017 45 / 47
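
As a small sketch of this definition, the function below computes the discounted return of a finite reward sequence; the rewards are made-up numbers for illustration:

def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma^k * R_{t+k+1}, for rewards R_{t+1}, R_{t+2}, ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [-2.0, -2.0, -2.0, 10.0]        # made-up rewards following time-step t
print(discounted_return(rewards, 0.0))    # myopic: only the immediate reward, -2.0
print(discounted_return(rewards, 1.0))    # far-sighted: the plain sum, 4.0
print(discounted_return(rewards, 0.9))    # in between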

Page 150:

Why discount?

Most Markov reward and decision processes are discounted. Why?
• Mathematically convenient to discount rewards
• Avoids infinite returns in cyclic Markov processes
• Uncertainty about the future may not be fully represented
• If the reward is financial, immediate rewards may earn more interest than delayed rewards
• Animal/human behaviour shows preference for immediate reward
• It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.

Georg Martius Machine Learning for Robotics May 31, 2017 46 / 47
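
As a small worked illustration (not from the slides): with a constant reward of 1 per step in a process that never terminates, the discounted return is ∑_{k=0}^{∞} γ^k = 1/(1 − γ), e.g. 10 for γ = 0.9 and 100 for γ = 0.99, whereas for γ = 1 the sum diverges; this is the sense in which discounting avoids infinite returns in cyclic Markov processes.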

Page 156:

Next time

• Continue with MDPs and the Bellman equation
• Dynamic Programming and Q-Learning

Georg Martius Machine Learning for Robotics May 31, 2017 47 / 47