Reinforcement Learning 2 - courses.cit.cornell.edu · Reinforcement Learning 2 Pantelis P. Analytis...
Reinforcement Learning 2
Pantelis P. Analytis
Introduction
Temporal difference learning
Q-learning
Applications
Midterm revision
Reinforcement Learning 2
Pantelis P. Analytis
March 24, 2018
1 Introduction
2 Temporal difference learning
3 Q-learning
4 Applications
5 Midterm revision
Different types of learning
Characteristics of reinforcement learning
Evaluative feedback.
Sequentiality, delayed rewards.
Need for trial and error, to explore as well as to exploit.
Non-stationary world.
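The trade-off between exploring and exploiting can be sketched with a small multi-armed bandit. This is a minimal illustration under stated assumptions, not material from the slides: the function name `epsilon_greedy_bandit`, the arm means, the Gaussian reward noise, and the values of `epsilon` and `alpha` are all hypothetical. The agent only ever sees the reward of the arm it pulled (evaluative feedback), and the constant step size weights recent rewards more heavily, which is one common way to cope with a non-stationary world.

```python
import random


def epsilon_greedy_bandit(reward_means, steps=5000, epsilon=0.1, alpha=0.1, seed=0):
    """Estimate arm values while balancing exploration and exploitation.

    reward_means, epsilon, alpha and the Gaussian noise are illustrative assumptions.
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_means)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(len(estimates))
        else:
            # Exploit: pull the arm that currently looks best.
            arm = max(range(len(estimates)), key=estimates.__getitem__)
        # Evaluative feedback: only the chosen arm's reward is observed.
        reward = rng.gauss(reward_means[arm], 0.5)
        # Constant step size: recent rewards count more than old ones.
        estimates[arm] += alpha * (reward - estimates[arm])
    return estimates


estimates = epsilon_greedy_bandit([0.0, 2.0, 0.5])
best_arm = max(range(len(estimates)), key=estimates.__getitem__)
print(best_arm, [round(q, 2) for q in estimates])
```

With `epsilon = 0` the agent would never leave the first arm that looked acceptable, which is the failure that trial and error is meant to avoid.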
Temporal difference learning
Broadly used to predict future rewards.
It appears to be how the brain's reward system works.
It is learning a prediction from another, later, learned prediction.
The TD error is the difference between two predictions, the temporal difference.
Temporal difference learning
$$V(s) \leftarrow V(s) + \alpha \big( \overbrace{r + \gamma V(s')}^{\text{the TD target}} - V(s) \big)$$
$r + \gamma V(s')$ is known as the TD target.
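The update rule can be written as a few lines of code. This is a minimal sketch, assuming a tabular value function stored in a dictionary. The function name `td_update`, the state labels, and the default values of `alpha` and `gamma` are hypothetical choices for illustration.

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to the value table V (a dict) and return the TD error."""
    td_target = r + gamma * V[s_next]   # the TD target
    td_error = td_target - V[s]         # difference between two predictions
    V[s] += alpha * td_error            # move V(s) a step toward the target
    return td_error


V = {"A": 0.0, "B": 1.0}
delta = td_update(V, s="A", r=0.5, s_next="B")
print(delta, V["A"])
```

Here the TD target is 0.5 + 0.9 · 1.0 = 1.4, so the TD error is 1.4 and V(A) moves from 0 to about 0.14.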
Temporal difference learning in the brain (Schultz, Dayan, Montague, 1997)
Temporal difference learning in the brain
$$V(s) \leftarrow V(s) + \alpha \big( \overbrace{r + \gamma V(s')}^{\text{the TD target}} - V(s) \big)$$
$r + \gamma V(s')$ is known as the TD target.
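One way to read the update in this context is as a reward prediction error. The error is large when a reward is unexpected and fades as the reward comes to be predicted. The sketch below is an illustrative toy, not the analysis in Schultz, Dayan and Montague: a single cue state is always followed by a reward of 1, and the name `reward_prediction_errors` and all parameter values are assumptions.

```python
def reward_prediction_errors(trials=30, alpha=0.2, gamma=1.0, reward=1.0):
    """Return the TD error at reward delivery on each trial of a cue -> reward pairing."""
    V_cue = 0.0        # learned prediction for the cue state
    V_terminal = 0.0   # nothing follows the reward
    errors = []
    for _ in range(trials):
        delta = reward + gamma * V_terminal - V_cue   # TD error at reward time
        V_cue += alpha * delta                        # the cue learns to predict the reward
        errors.append(delta)
    return errors


errors = reward_prediction_errors()
print(round(errors[0], 3), round(errors[-1], 3))
```

On the first trial the reward is fully unexpected, so the error equals the reward. After repeated pairings the prediction for the cue approaches 1 and the error at reward time shrinks toward zero.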
Temporal difference learning: example
Predicting the outcome of a game like chess or backgammon.
Long-term predictions by simulation are complex, and even small errors in one-step predictions might be amplified.
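A toy stand-in for predicting a game's outcome is a five-state random walk. The walk ends in a loss on the left and a win on the right, and the value of each state is the probability of winning from it. The sketch below is an assumption-laden illustration (the environment, the name `td_random_walk`, and the parameter values are not from the slides). It shows TD(0) updating each prediction from the next prediction instead of simulating every game to its end.

```python
import random


def td_random_walk(episodes=20000, alpha=0.02, gamma=1.0, seed=0):
    """Learn win probabilities on a 5-state random walk with TD(0).

    States 1..5 are non-terminal; 0 is a loss (reward 0), 6 is a win (reward 1).
    """
    rng = random.Random(seed)
    V = [0.0] + [0.5] * 5 + [0.0]   # terminal states keep value 0
    for _ in range(episodes):
        s = 3                        # every game starts in the middle
        while 0 < s < 6:
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == 6 else 0.0
            # Update the prediction for s from the prediction for s_next.
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V[1:6]


values = td_random_walk()
print([round(v, 2) for v in values])
```

For this walk the true win probabilities are 1/6, 2/6, ..., 5/6, so the learned values can be checked against them.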
Temporal difference learning
Q-learning
Q-learning converges to the optimal action values even if you are acting sub-optimally (it is an off-policy method).
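The off-policy property can be sketched on a small chain of states. The agent below acts uniformly at random, which is deliberately sub-optimal, yet the learned action values still favour the optimal action. The environment, the name `q_learning_chain`, and the parameter values are illustrative assumptions, not taken from the slides.

```python
import random


def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a chain; reaching the right end gives reward 1.

    The behaviour policy is uniformly random (deliberately sub-optimal).
    """
    rng = random.Random(seed)
    goal = n_states - 1
    Q = [[0.0, 0.0] for _ in range(n_states)]   # Q[s][a], a: 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.randrange(2)                 # act at random
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # Off-policy target: value of the best next action, not the one taken next.
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q


Q = q_learning_chain()
greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(len(Q) - 1)]
print(greedy)
```

The key line is the target, which uses `max(Q[s_next])` regardless of which action the random behaviour policy takes next.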
Model based and model free learning
Many situations involve conflict between a model-free system like TD-learning and a model-based system that plans ahead.
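The distinction can be made concrete with a sketch of the model-based side. A model-free learner such as TD-learning adjusts its values from sampled experience. A model-based planner is instead given (or has learned) the transition and reward rules and computes values by looking ahead. The example below is a minimal illustration using value iteration on a hypothetical five-state chain. The function names and the dynamics in `model` are assumptions.

```python
def plan_with_model(n_states=5, gamma=0.9, sweeps=50):
    """Value iteration on a known chain model: plan ahead instead of learning from samples."""
    goal = n_states - 1

    def model(s, a):
        """Assumed known dynamics: a = 0 moves left, a = 1 moves right; reward 1 at the goal."""
        s_next = max(s - 1, 0) if a == 0 else s + 1
        return s_next, (1.0 if s_next == goal else 0.0)

    V = [0.0] * n_states
    for _ in range(sweeps):
        for s in range(goal):   # the goal is terminal, so its value stays 0
            # One-step lookahead through the model for both actions.
            V[s] = max(r + gamma * V[s2] for s2, r in (model(s, a) for a in (0, 1)))
    return V


planned = plan_with_model()
print([round(v, 3) for v in planned])
```

No experience is collected here. The values come entirely from querying the model, which is what makes planning flexible when the model is right and misleading when it is wrong.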
Samuel’s checkers program
Inspired by Shannon’s paper on chess-playing computers.
It achieved a good, but not expert, level of play.
Used a learning process that was similar to TD-learning.
Tesauro’s TD-Gammon
Developed in 1992 by Gerald Tesauro. After playing 300,000 games against itself, it performed approximately at the level of human world-class players.
Atari breakthrough
Google DeepMind trained an agent that learned 49 Atari games, receiving only the pixels of the screen as input and evaluating the rewards from different positions of the joystick. It reached human-level play in about half of them.
AlphaGo
AlphaGo plans much deeper in the game tree.
It uses reinforcement learning to evaluate which paths are worth searching.
Attention allocation in online interfaces
Music lab experiment
Learning from others
Clinical vs. actuarial decision making
Exploration-exploitation dilemma
Iowa gambling task