Reinforcement Learning: Elementary Solution Methods
Lecturer: 虞台文 (Tai-Wen Yue), Intelligent Multimedia Lab, Department of Computer Science and Engineering, Tatung University (大同大學資工所智慧型多媒體研究室)
Content
- Introduction
- Dynamic Programming
- Monte Carlo Methods
- Temporal Difference Learning
Introduction
Basic Methods
- Dynamic programming: mathematically well developed, but requires a complete and accurate model of the environment.
- Monte Carlo methods: require no model and are conceptually simple, but are not suited to step-by-step incremental computation.
- Temporal-difference learning: requires no model and is fully incremental, but is more complex to analyze.
- Q-learning
Dynamic Programming
Dynamic programming (DP) is a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment, e.g., as a Markov decision process (MDP).
DP is theoretically important:
- It is an essential foundation for understanding the other methods.
- The other methods attempt to achieve much the same effect as DP, only with less computation and without assuming a perfect model of the environment.
Finite MDP Environments
An MDP consists of:
- a finite set of states S (S+ including the terminal state),
- a finite set of actions A(s),
- transition probabilities
$$\mathcal{P}^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s,\ a_t = a\}, \qquad s, s' \in S^+,\ a \in A(s),$$
- expected immediate rewards
$$\mathcal{R}^a_{ss'} = E[\,r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s'\,].$$
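As a concrete sketch, these quantities can be stored in plain tables. The tiny two-state MDP below is invented for illustration; its state names, actions, and numbers are not from the slides:

```python
# P[s][a] maps next-state -> probability; R[s][a] maps next-state -> expected reward.
# A made-up two-state MDP for illustration.
P = {
    "low":  {"wait":     {"low": 1.0},
             "recharge": {"high": 1.0}},
    "high": {"wait":     {"high": 0.9, "low": 0.1}},
}
R = {
    "low":  {"wait":     {"low": 1.0},
             "recharge": {"high": 0.0}},
    "high": {"wait":     {"high": 2.0, "low": 2.0}},
}

def is_valid_mdp(P):
    """Each transition distribution P[s][a] must sum to 1."""
    return all(abs(sum(dist.values()) - 1.0) < 1e-9
               for actions in P.values()
               for dist in actions.values())
```

Every DP backup in the sections that follow is just arithmetic over tables of this shape.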
Review
The return:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
The state-value function for policy π:
$$V^\pi(s) = E_\pi[R_t \mid s_t = s] = E_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right]$$
The Bellman equation for $V^\pi$:
$$V^\pi(s) = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^\pi(s')\right]$$
The Bellman optimality equation:
$$V^*(s) = \max_{a \in A(s)} \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^*(s')\right]$$
Methods of Dynamic Programming
- Policy Evaluation
- Policy Improvement
- Policy Iteration
- Value Iteration
- Asynchronous DP
Policy Evaluation
Given a policy π, compute its state-value function $V^\pi$.
The Bellman equation for $V^\pi$,
$$V^\pi(s) = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^\pi(s')\right],$$
is a system of |S| linear equations. It can be solved directly, but that is tedious; we will use an iterative method instead.
Iterative Policy Evaluation
$$V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi$$
Each arrow is a "sweep": one application of the backup operation to every state.
The full backup:
$$V_{k+1}(s) = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V_k(s')\right]$$
The Algorithm: Iterative Policy Evaluation
Input the policy π to be evaluated; initialize V(s) = 0 for all s ∈ S+.
Repeat:
- Δ ← 0
- For each s ∈ S:
  - v ← V(s)
  - $V(s) \leftarrow \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V(s')\right]$
  - Δ ← max(Δ, |v − V(s)|)
until Δ < θ (a small positive number).
Output V ≈ V^π.
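The loop above can be sketched directly in code. This is a minimal illustration on an invented two-state, one-action MDP (all names and numbers are hypothetical), not the lecture's grid world:

```python
def policy_evaluation(states, P, R, pi, gamma, theta=1e-10):
    """Iterative policy evaluation: repeat full backups over all
    states until the largest change Delta falls below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = sum(pi[s][a] * sum(p * (R[s][a][s2] + gamma * V[s2])
                                      for s2, p in P[s][a].items())
                       for a in pi[s])
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V

# Two states that lead to each other; a single action "a".
states = ["s0", "s1"]
P = {"s0": {"a": {"s1": 1.0}}, "s1": {"a": {"s0": 1.0}}}
R = {"s0": {"a": {"s1": 1.0}}, "s1": {"a": {"s0": 0.0}}}
pi = {"s0": {"a": 1.0}, "s1": {"a": 1.0}}
V = policy_evaluation(states, P, R, pi, gamma=0.5)
# Fixed point: V(s0) = 1 + 0.5 V(s1), V(s1) = 0.5 V(s0)  =>  V(s0) = 4/3, V(s1) = 2/3
```

Note that the sketch updates V in place during a sweep (Gauss-Seidel style), which converges to the same fixed point as the two-array version.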
Example (Grid World)
- Possible actions from any state s: A = {up, down, left, right}.
- Terminal states in the top-left and bottom-right corners (treated as a single state).
- Reward is −1 on all transitions until the terminal state is reached.
- All values are initialized to 0.
- Moving out of bounds leaves the agent in the same state.
Example (Grid World)
We start with the equiprobable random policy; in the end we obtain the optimal policy.
Policy Improvement
Consider $V^\pi$ for a deterministic policy π. Under what condition would it be better to take an action a ≠ π(s) when in state s?
The action-value of taking a in s (and following π thereafter) is:
$$Q^\pi(s,a) = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s,\ a_t = a\} = \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^\pi(s')\right]$$
Is it better to switch to action a if $Q^\pi(s,a) > V^\pi(s)$?
Policy Improvement
Let π′ be a policy identical to π except in state s. Suppose π′(s) = a and
$$Q^\pi(s,a) > V^\pi(s).$$
Then $V^{\pi'}(s) \ge V^\pi(s)$.
Given a policy and its value function, we can easily evaluate a change in the policy at a single state to a particular action.
Greedy Policy π′
Select at each state the action that appears best according to $Q^\pi(s,a)$:
$$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^\pi(s')\right]$$
Then $V^{\pi'}(s) \ge V^\pi(s)$ for every state s.
Greedy Policy π′
$$\pi'(s) = \arg\max_a \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^\pi(s')\right]$$
What if $V^{\pi'} = V^\pi$? Then for all states,
$$V^{\pi'}(s) = \max_a \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^{\pi'}(s')\right].$$
But this is exactly the Bellman optimality equation,
$$V^*(s) = \max_{a \in A(s)} \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^*(s')\right].$$
What can you say about this? Both π and π′ must then be optimal.
Policy Iteration
$$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} V^*$$
where E is policy evaluation and I is policy improvement ("greedification").
Policy Iteration
Alternate the two steps until the policy is stable.
Policy evaluation:
$$V_{k+1}(s) = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V_k(s')\right]$$
Policy improvement:
$$\pi'(s) = \arg\max_a \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V^\pi(s')\right]$$
When the policy stops changing, it is the optimal policy.
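The two alternating steps can be sketched compactly. The one-decision MDP below is invented for illustration (all names and numbers are hypothetical):

```python
def q_value(s, a, P, R, V, gamma):
    """One-step lookahead: sum over s' of P [R + gamma V(s')]."""
    return sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())

def policy_iteration(states, actions, P, R, gamma, theta=1e-10):
    pi = {s: actions[0] for s in states}        # arbitrary initial policy
    while True:
        # Policy evaluation (iterative, as in the previous slides)
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v = V[s]
                V[s] = q_value(s, pi[s], P, R, V, gamma)
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # Policy improvement ("greedification")
        new_pi = {s: max(actions, key=lambda a: q_value(s, a, P, R, V, gamma))
                  for s in states}
        if new_pi == pi:                         # policy stable -> optimal
            return pi, V
        pi = new_pi

# One real decision: from "s", action "right" earns 1 and "left" earns 0;
# both lead to the absorbing terminal state "T".
states, actions = ["s", "T"], ["left", "right"]
P = {"s": {"left": {"T": 1.0}, "right": {"T": 1.0}},
     "T": {"left": {"T": 1.0}, "right": {"T": 1.0}}}
R = {"s": {"left": {"T": 0.0}, "right": {"T": 1.0}},
     "T": {"left": {"T": 0.0}, "right": {"T": 0.0}}}
pi, V = policy_iteration(states, actions, P, R, gamma=0.9)
```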
Value Iteration
Policy iteration alternates the evaluation backup
$$V_{k+1}(s) = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V_k(s')\right]$$
with the greedy improvement
$$\pi'(s) = \arg\max_a \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V(s')\right].$$
Value iteration combines these two steps into a single backup:
$$V_{k+1}(s) = \max_a \sum_{s'} \mathcal{P}^a_{ss'}\left[\mathcal{R}^a_{ss'} + \gamma V_k(s')\right]$$
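Combining the backup and the maximization gives the classic loop. A minimal sketch on an invented toy chain (names and numbers hypothetical):

```python
def value_iteration(states, actions, P, R, gamma, theta=1e-10):
    """Value iteration: a single max-backup both evaluates and improves."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = V[s]
            V[s] = max(sum(p * (R[s][a][s2] + gamma * V[s2])
                           for s2, p in P[s][a].items())
                       for a in actions)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    # Extract the greedy policy from the converged values.
    pi = {s: max(actions,
                 key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                   for s2, p in P[s][a].items()))
          for s in states}
    return V, pi

# A 3-state chain s0 -> s1 -> T, with a useless "stay" action.
states, actions = ["s0", "s1", "T"], ["go", "stay"]
P = {"s0": {"go": {"s1": 1.0}, "stay": {"s0": 1.0}},
     "s1": {"go": {"T": 1.0},  "stay": {"s1": 1.0}},
     "T":  {"go": {"T": 1.0},  "stay": {"T": 1.0}}}
R = {"s0": {"go": {"s1": 0.0}, "stay": {"s0": 0.0}},
     "s1": {"go": {"T": 10.0}, "stay": {"s1": 0.0}},
     "T":  {"go": {"T": 0.0},  "stay": {"T": 0.0}}}
V, pi = value_iteration(states, actions, P, R, gamma=0.9)
# V(s1) = 10, V(s0) = 0.9 * 10 = 9; the greedy policy moves toward T.
```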
Asynchronous DP
All the DP methods described so far require exhaustive sweeps of the entire state set. Asynchronous DP does not use sweeps. Instead, it works like this:
- Repeat until the convergence criterion is met: pick a state at random and apply the appropriate backup.
It still needs lots of computation, but it does not get locked into hopelessly long sweeps.
Can you select states to back up intelligently? YES: an agent's experience can act as a guide.
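The random-state backup loop can be sketched as follows; the toy chain and all numbers are invented for illustration, and a fixed seed makes the run reproducible:

```python
import random

def async_value_iteration(states, actions, P, R, gamma, n_backups, seed=0):
    """Asynchronous DP: instead of sweeping, repeatedly pick a state
    at random and apply one full (max) backup to it in place."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in states}
    for _ in range(n_backups):
        s = rng.choice(states)
        V[s] = max(sum(p * (R[s][a][s2] + gamma * V[s2])
                       for s2, p in P[s][a].items())
                   for a in actions)
    return V

# Deterministic chain s0 -> s1 -> T with a single action.
states, actions = ["s0", "s1", "T"], ["go"]
P = {"s0": {"go": {"s1": 1.0}}, "s1": {"go": {"T": 1.0}}, "T": {"go": {"T": 1.0}}}
R = {"s0": {"go": {"s1": 0.0}}, "s1": {"go": {"T": 1.0}}, "T": {"go": {"T": 0.0}}}
V = async_value_iteration(states, actions, P, R, gamma=0.9, n_backups=200)
# Once each state has been backed up often enough, V(s1) = 1 and
# V(s0) = 0.9, without any systematic sweep.
```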
Generalized Policy Iteration (GPI)
Two interacting processes:
- Evaluation: $V \to V^\pi$
- Improvement: $\pi \to \mathrm{greedy}(V)$
Their interplay drives both the policy and the value function toward optimality.
Efficiency of DP
- Finding an optimal policy is polynomial in the number of states...
- BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
- In practice, classical DP can be applied to problems with a few million states.
- Asynchronous DP can be applied to larger problems and is well suited to parallel computation.
- It is surprisingly easy to come up with MDPs for which DP methods are not practical.
Monte Carlo Methods
What Are Monte Carlo Methods?
- Monte Carlo methods are random sampling methods.
- They do not assume complete knowledge of the environment.
- They learn from actual experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment.
Monte Carlo methods vs. Reinforcement Learning
Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns.
To ensure that well-defined returns are available, we define Monte Carlo methods only for episodic tasks.
Incremental in an episode-by-episode sense, but not in a step-by-step sense.
Monte Carlo Methods for Policy Evaluation V(s)
In the GPI loop, Monte Carlo methods supply the evaluation step $V \to V^\pi$, estimating values from sample returns rather than from a model.
Monte Carlo Methods for Policy Evaluation V(s)
- Goal: learn $V^\pi(s)$.
- Given: some number of episodes under π which contain s.
- Idea: average the returns observed after visits to s.
Within one episode, s may be visited several times; we distinguish the first visit to s from later visits, and record Return(s) after each visit.
Monte Carlo Methods for Policy Evaluation V(s)
- Every-visit MC: average the returns following every visit to s in an episode.
- First-visit MC: average the returns following only the first visit to s in an episode.
Both converge asymptotically.
First-Visit MC Algorithm
Initialize:
- π ← the policy to be evaluated
- V ← an arbitrary state-value function
- Returns(s) ← an empty list, for all s ∈ S
Repeat forever:
- Generate an episode using π.
- For each state s occurring in the episode:
  - R ← the return following the first occurrence of s
  - Append R to Returns(s)
  - V(s) ← average of Returns(s)
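The algorithm above can be sketched as follows. The episode format (a list of (state, reward-received-on-leaving-it) pairs) and the fixed test episode are assumptions for illustration:

```python
def first_visit_mc(generate_episode, n_episodes, gamma=1.0):
    """First-visit MC policy evaluation: V(s) is the average of the
    returns that followed the first visit to s in each episode."""
    returns = {}                       # s -> list of first-visit returns
    for _ in range(n_episodes):
        episode = generate_episode()   # [(state, reward after leaving it), ...]
        # Compute the return following each step, working backwards.
        G = 0.0
        tail_returns = []
        for s, r in reversed(episode):
            G = r + gamma * G
            tail_returns.append((s, G))
        tail_returns.reverse()         # back to time order
        seen = set()
        for s, G in tail_returns:
            if s not in seen:          # first visit only
                seen.add(s)
                returns.setdefault(s, []).append(G)
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}

# A degenerate "policy" that always produces the same episode.
V = first_visit_mc(lambda: [("A", 1.0), ("B", 2.0)], n_episodes=3)
# Return following A is 1 + 2 = 3; following B it is 2.
```

Switching to every-visit MC would only mean dropping the `seen` check.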
Example: Blackjack
- Object: have your card sum be greater than the dealer's without exceeding 21.
- States (200 of them): current sum (12-21); dealer's showing card (ace-10); do I have a usable ace?
- Reward: +1 for winning, 0 for a draw, −1 for losing.
- Actions: stick (stop receiving cards), hit (receive another card).
- Policy: stick if my sum is 20 or 21, else hit.
Monte Carlo Estimation of Action Values Q(s, a)
- If a model is not available, it is particularly useful to estimate action values rather than state values.
- By action values we mean the expected return when starting in state s, taking action a, and thereafter following policy π.
- The every-visit MC method estimates the value of a state-action pair as the average of the returns that have followed every visit to that pair.
- The first-visit MC method is similar, but averages only the returns following the first visit in each episode (as before).
Maintaining Exploration
Many relevant state-action pairs may never be visited.
Exploring starts:
- The first step of each episode starts at a state-action pair.
- Every such pair has a nonzero probability of being selected as the start.
But this is not a great idea in practice. It is better to choose a policy that has a nonzero probability of selecting all actions.
Monte Carlo Control to Approximate the Optimal Policy
GPI with action values:
- Evaluation: $Q \to Q^\pi$
- Improvement: $\pi \to \mathrm{greedy}(Q)$
Together these converge toward the optimal policy.
Monte Carlo Control to Approximate the Optimal Policy
$$\pi_0 \xrightarrow{E} Q^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} Q^{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi^* \xrightarrow{E} Q^*$$
- E: complete policy evaluation
- I: policy improvement
$$\pi_{k+1}(s) = \arg\max_a Q^{\pi_k}(s,a)$$
Monte Carlo Control to Approximate the Optimal Policy
With the greedy update $\pi_{k+1}(s) = \arg\max_a Q^{\pi_k}(s,a)$,
$$V^{\pi_k}(s) = Q^{\pi_k}(s, \pi_k(s)) \le \max_a Q^{\pi_k}(s,a) = Q^{\pi_k}(s, \pi_{k+1}(s)),$$
so $V^{\pi_{k+1}} \ge V^{\pi_k}$.
What if $V^{\pi_{k+1}} = V^{\pi_k}$? Ans. Then $V^{\pi_{k+1}} = V^{\pi_k} = V^*$.
Monte Carlo Control to Approximate the Optimal Policy
If $V^{\pi_{k+1}} = V^{\pi_k}$, then $V^{\pi_{k+1}} = V^{\pi_k} = V^*$.
This, however, requires that:
- episodes have exploring starts, with each state-action pair having a nonzero probability of being selected as the start;
- an infinite number of episodes is available.
A Monte Carlo Control Algorithm Assuming Exploring Starts
Initialize, for all s and a:
- Q(s, a) ← arbitrary
- π(s) ← arbitrary
- Returns(s, a) ← empty list
Repeat forever:
- Generate an episode using π (with exploring starts).
- For each pair (s, a) appearing in the episode:
  - R ← return following the first occurrence of (s, a)
  - Append R to Returns(s, a)
  - Q(s, a) ← average of Returns(s, a)
- For each s in the episode: π(s) ← arg max_a Q(s, a)
Example: Blackjack
- Exploring starts.
- Initial policy as described before.
On-Policy Monte Carlo Control
- On-policy: learning about the policy that is currently being executed.
- What if we don't have exploring starts? We must adopt some method of exploring states that would not be explored otherwise.
- We will introduce the ε-greedy method.
ε-Soft and ε-Greedy
ε-soft policy:
$$\pi(s,a) \ge \frac{\epsilon}{|A(s)|} \qquad \text{for all } s \in S,\ a \in A(s)$$
ε-greedy policy:
$$\pi(s,a) = \begin{cases} \dfrac{\epsilon}{|A(s)|} & a \text{ is a non-greedy action} \\[6pt] 1 - \epsilon + \dfrac{\epsilon}{|A(s)|} & a \text{ is the greedy action} \end{cases}$$
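The ε-greedy case split above translates directly into code; a minimal sketch (the action names and Q values are invented):

```python
def epsilon_greedy_probs(q, epsilon):
    """pi(s, a) = eps/|A(s)| for every action, plus an extra 1 - eps
    for the greedy action, matching the case split above."""
    n = len(q)
    greedy = max(q, key=q.get)
    probs = {a: epsilon / n for a in q}
    probs[greedy] += 1.0 - epsilon
    return probs

probs = epsilon_greedy_probs({"hit": 1.0, "stick": 0.0}, epsilon=0.2)
# greedy "hit": 1 - 0.2 + 0.2/2 = 0.9; non-greedy "stick": 0.2/2 = 0.1
```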
ε-Greedy Algorithm
Initialize, for all states s and actions a:
- Q(s, a) ← arbitrary
- Returns(s, a) ← empty list
- π ← an arbitrary ε-soft policy
Repeat forever:
- Generate an episode using π.
- For each (s, a) appearing in the episode:
  - R ← return following the first occurrence of (s, a)
  - Append R to Returns(s, a)
  - Q(s, a) ← average of Returns(s, a)
- For each state s in the episode:
  - a* ← arg max_a Q(s, a)
  - For all a ∈ A(s):
$$\pi(s,a) \leftarrow \begin{cases} 1 - \epsilon + \dfrac{\epsilon}{|A(s)|} & a = a^* \\[6pt] \dfrac{\epsilon}{|A(s)|} & a \ne a^* \end{cases}$$
Evaluating One Policy While Following Another
Goal: $V^\pi(s)$, given episodes generated using a different policy π′. How can we evaluate $V^\pi(s)$ from the episodes generated by π′?
Assumption: $\pi(s,a) > 0 \Rightarrow \pi'(s,a) > 0$.
Evaluating One Policy While Following Another
Index the possible episode tails from s by i; let $p_i(s)$ and $p'_i(s)$ be their probabilities under π and π′, and $E[R_i(s)]$ the expected return of tail i. Then
$$V^\pi(s) = \sum_{i=1}^{m_s} p_i(s)\,E[R_i(s)] = \sum_{i=1}^{m_s} p'_i(s)\,\frac{p_i(s)}{p'_i(s)}\,E[R_i(s)],$$
so returns sampled under π′ can be reweighted by the ratio $p_i(s)/p'_i(s)$ to estimate $V^\pi(s)$.
Evaluating One Policy While Following Another
Suppose $n_s$ sample episodes through s are generated using π′, with observed returns $R_i(s)$. Reweighting and normalizing gives the estimate
$$V^\pi(s) \approx \frac{\displaystyle\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}\,R_i(s)}{\displaystyle\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}}$$
Evaluating One Policy While Following Another
How do we compute the probabilities $p_i(s)$ and $p'_i(s)$, and in particular their ratio $p_i(s)/p'_i(s)$? Let the ith first visit to state s occur at time t.
Evaluating One Policy While Following Another
If the ith first visit to s occurs at time t and the episode terminates at time $T_i(s)$, then
$$p_i(s_t) = \prod_{k=t}^{T_i(s)-1} \pi(s_k, a_k)\,\mathcal{P}^{a_k}_{s_k s_{k+1}}, \qquad p'_i(s_t) = \prod_{k=t}^{T_i(s)-1} \pi'(s_k, a_k)\,\mathcal{P}^{a_k}_{s_k s_{k+1}}.$$
The transition probabilities cancel in the ratio:
$$\frac{p_i(s_t)}{p'_i(s_t)} = \prod_{k=t}^{T_i(s)-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)},$$
so the weights depend only on the two policies, not on the environment's dynamics.
Summary
$$V^\pi(s) \approx \frac{\displaystyle\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}\,R_i(s)}{\displaystyle\sum_{i=1}^{n_s} \frac{p_i(s)}{p'_i(s)}}, \qquad \frac{p_i(s_t)}{p'_i(s_t)} = \prod_{k=t}^{T_i(s)-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)}$$
How can we approximate Q(s, a)?
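The weight ratio can be computed without knowing the transition probabilities, since they cancel. A minimal sketch (the episode and the policy tables are invented for illustration):

```python
def importance_weight(steps, pi, b):
    """Product over an episode tail of pi(s,a) / pi'(s,a); the
    P^a_{ss'} factors cancel between numerator and denominator."""
    w = 1.0
    for s, a in steps:
        w *= pi[(s, a)] / b[(s, a)]
    return w

def weighted_estimate(episodes, pi, b):
    """V(s) ~ sum_i w_i R_i / sum_i w_i over episodes from s; each
    episode is (list of (state, action) steps, observed return)."""
    ws = [importance_weight(steps, pi, b) for steps, _ in episodes]
    rs = [R for _, R in episodes]
    return sum(w * R for w, R in zip(ws, rs)) / sum(ws)

# Target policy pi always picks "a1"; behaviour policy b picks it half the time.
pi = {("x", "a1"): 1.0, ("y", "a1"): 1.0}
b  = {("x", "a1"): 0.5, ("y", "a1"): 0.5}
w = importance_weight([("x", "a1"), ("y", "a1")], pi, b)   # (1/0.5) * (1/0.5) = 4
```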
Evaluating One Policy While Following Another
How can we approximate Q(s, a)? Treat the pair (s, a) itself as the starting point: the first action is taken as given (equivalently, set π(s, a) = 1), and the importance-sampling weights are accumulated over the remainder of the episode:
$$Q^\pi(s,a) \approx \frac{\displaystyle\sum_{i=1}^{n_{sa}} \frac{p_i(s)}{p'_i(s)}\,R_i(s,a)}{\displaystyle\sum_{i=1}^{n_{sa}} \frac{p_i(s)}{p'_i(s)}}, \qquad \frac{p_i(s)}{p'_i(s)} = \prod_{k=t+1}^{T_i(s)-1} \frac{\pi(s_k, a_k)}{\pi'(s_k, a_k)},$$
where the ith first visit to (s, a) occurs at time t; the product starts at k = t + 1 because the first action is fixed rather than chosen by either policy.
Off-Policy Monte Carlo Control
Requires two policies:
- the estimation policy (deterministic), e.g., the greedy policy;
- the behaviour policy (stochastic), e.g., an ε-soft policy.
Off-Policy Monte Carlo Control
Alternate policy evaluation of the estimation policy (from episodes generated by the behaviour policy) with greedy policy improvement.
Incremental Implementation
- MC can be implemented incrementally, which saves memory.
- Compute the weighted average of the returns incrementally.
Incremental Implementation
Non-incremental weighted average of returns:
$$V_n(s) = \frac{\sum_{i=1}^{n} w_i R_i(s)}{\sum_{i=1}^{n} w_i}$$
Equivalent incremental form:
$$V_{n+1} = V_n + \frac{w_{n+1}}{W_{n+1}}\left[R_{n+1}(s) - V_n\right], \qquad W_{n+1} = W_n + w_{n+1}, \quad V_0 = 0,\ W_0 = 0$$
In update form:
$$V(s_t) \leftarrow V(s_t) + \alpha_t\left[R_t - V(s_t)\right]$$
If $\alpha_t$ is held constant, this is called constant-α MC.
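The incremental form above is a one-line update; a minimal sketch:

```python
def incremental_update(V, W, w, R):
    """One step of the weighted running average:
    W_{n+1} = W_n + w_{n+1};
    V_{n+1} = V_n + (w_{n+1} / W_{n+1}) * (R_{n+1} - V_n)."""
    W = W + w
    V = V + (w / W) * (R - V)
    return V, W

# Feeding equally weighted returns reproduces the plain average.
V, W = 0.0, 0.0
for R in [2.0, 4.0, 6.0]:
    V, W = incremental_update(V, W, 1.0, R)
# V is now (2 + 4 + 6) / 3 = 4
```

Only V and W are stored, so no list of past returns is needed; replacing w/W with a fixed α gives constant-α MC.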
Summary
MC has several advantages over DP:
- It can learn directly from interaction with the environment.
- No need for full models.
- No need to learn about ALL states.
- Less harm from violations of the Markov property.
MC methods provide an alternate policy evaluation process.
One issue to watch for: maintaining sufficient exploration (exploring starts, soft policies).
There is no bootstrapping (as opposed to DP).
Temporal Difference Learning
Temporal Difference Learning
- TD learning combines the ideas of Monte Carlo methods and dynamic programming (DP).
- Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics.
- Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
Monte Carlo Methods
$$V(s_t) \leftarrow V(s_t) + \alpha\left[R_t - V(s_t)\right]$$
The backup extends from $s_t$ all the way to the end of the episode: the target $R_t$ is the complete observed return.
Dynamic Programming
$$V(s_t) = E_\pi\left\{r_{t+1} + \gamma V(s_{t+1})\right\}$$
The backup is only one step deep, but it spans all possible successor states $s_{t+1}$, weighted by the model.
Basic Concept of TD(0)
Monte Carlo methods use the true return as the target:
$$V(s_t) \leftarrow V(s_t) + \alpha\left[R_t - V(s_t)\right]$$
Dynamic programming uses a one-step expectation:
$$V(s_t) = E_\pi\left\{r_{t+1} + \gamma V(s_{t+1})\right\}$$
TD(0) combines the two, replacing the true return with the predicted value at time t + 1:
$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$$
The bracketed quantity is the temporal difference.
Basic Concept of TD(0)
$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$$
The TD(0) backup is one step deep and follows the single sampled successor $s_{t+1}$.
TD(0) Algorithm
Initialize V(s) arbitrarily, and let π be the policy to be evaluated
Repeat (for each episode):
  – Initialize s
  – Repeat (for each step of episode):
      a ← action given by π for s
      Take action a; observe reward r and next state s'
      V(s) ← V(s) + α [ r + γ V(s') − V(s) ]
      s ← s'
  – until s is terminal
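The boxed procedure translates directly into tabular code. A minimal sketch follows; the `policy`, `env_reset`, and `env_step` interfaces are assumptions for illustration, not part of the slides.

```python
from collections import defaultdict

def td0_prediction(policy, env_reset, env_step,
                   episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation, following the boxed algorithm."""
    V = defaultdict(float)                    # V(s) initialized to 0
    for _ in range(episodes):
        s = env_reset()                       # Initialize s
        done = False
        while not done:                       # for each step of episode
            a = policy(s)                     # a <- action given by pi for s
            s2, r, done = env_step(s, a)      # take a; observe r and s'
            v_next = 0.0 if done else V[s2]   # V(terminal) = 0
            V[s] += alpha * (r + gamma * v_next - V[s])
            s = s2                            # s <- s'
    return dict(V)
```

On any episodic task this converges (for suitably decreasing α, or in the mean for small constant α) to V^π for the evaluated policy.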
![Page 74: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/74.jpg)
Example (Driving Home)
State                 Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
Leaving office                 0                   30                     30
Reach car, raining             5                   35                     40
Exit highway                  20                   15                     35
Behind truck                  30                   10                     40
Home street                   40                    3                     43
Arrive home                   43                    0                     43
![Page 76: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/76.jpg)
TD Bootstraps and Samples
Bootstrapping: the update involves an existing estimate
– MC does not bootstrap
– DP bootstraps
– TD bootstraps

Sampling: the update does not involve an expected value
– MC samples
– DP does not sample
– TD samples
![Page 77: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/77.jpg)
Example (Random Walk)
(figure: a five-state random walk)

  A — B — C — D — E ,  start at C

Each step moves left or right with equal probability; episodes terminate off either end. All rewards are 0, except +1 for terminating off the right end past E.

True values: V(A) = 1/6, V(B) = 2/6, V(C) = 3/6, V(D) = 4/6, V(E) = 5/6
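A quick TD(0) run on this walk recovers the true values. This is a sketch: the episode count, step size α, and the 0.5 initialization are arbitrary choices, not taken from the slides.

```python
import random

def random_walk_td0(episodes=5000, alpha=0.05, gamma=1.0, seed=0):
    """TD(0) prediction on the five-state random walk A..E."""
    rng = random.Random(seed)
    states = "ABCDE"
    V = {s: 0.5 for s in states}          # intermediate initial guess
    for _ in range(episodes):
        i = 2                             # start at C
        while 0 <= i < 5:
            j = i + rng.choice((-1, 1))   # move left or right
            if j < 0:                     # terminate left: reward 0
                target = 0.0
            elif j >= 5:                  # terminate right: reward +1
                target = 1.0
            else:                         # interior step: reward 0
                target = gamma * V[states[j]]
            s = states[i]
            V[s] += alpha * (target - V[s])
            i = j
    return V

V = random_walk_td0()
```

The estimates fluctuate around 1/6, 2/6, …, 5/6 for states A through E.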
![Page 78: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/78.jpg)
Example (Random Walk)
(same five-state random walk; figure: values learned by TD(0) after various numbers of episodes, approaching the true values)
![Page 79: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/79.jpg)
Example (Random Walk)
(same five-state random walk; figure: learning-curve data averaged over 100 sequences of episodes)
![Page 80: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/80.jpg)
Optimality of TD(0)

Batch updating: train completely on a finite amount of data; e.g., train repeatedly on 10 episodes until convergence.
– Compute updates according to TD(0), but only update the estimates after each complete pass through the data.

For any finite Markov prediction task, under batch updating:
– TD(0) converges for sufficiently small α.
– Constant-α MC also converges under these conditions, but to a different answer!
![Page 81: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/81.jpg)
Example: Random Walk under Batch Updating
After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. The whole experiment was repeated 100 times.
![Page 82: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/82.jpg)
Why is TD better at generalizing in the batch update?
MC is susceptible to poor state sampling and to unusual episodes.

TD is less affected by unusual episodes and sampling because each estimate is linked to the estimates of other states, which may be better sampled
– i.e., the estimates are smoothed across states.

TD converges to the correct value function for the maximum-likelihood model of the environment (the certainty-equivalence estimate).
![Page 83: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/83.jpg)
Example: You are the predictor
Suppose you observe the following 8 episodes from an MDP:
The eight episodes (state, reward sequences):

  A, 0, B, 0
  B, 1
  B, 1
  B, 1
  B, 1
  B, 1
  B, 1
  B, 0

(diagram: A moves to B with probability 100% and reward 0; from B the episode terminates, with reward 1 on 75% of episodes and reward 0 on 25%)

V(A) = ?    V(B) = ?

What does batch TD(0) say? What does constant-α MC say? What would you say?
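The two batch answers can be worked out directly. A sketch follows, assuming γ = 1 and the batch-convergence facts from the previous slides: batch MC converges to the mean observed return per state, while batch TD(0) converges to the certainty-equivalence estimate of the maximum-likelihood model.

```python
# The eight observed episodes as (state, reward) sequences.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

def mc_values(eps):
    """Batch constant-alpha MC: average observed return following each state."""
    returns = {}
    for ep in eps:
        rewards = [r for _, r in ep]
        for i, (s, _) in enumerate(ep):
            returns.setdefault(s, []).append(sum(rewards[i:]))
    return {s: sum(g) / len(g) for s, g in returns.items()}

V_mc = mc_values(episodes)            # V(B) = 0.75, but V(A) = 0

# Batch TD(0): build the maximum-likelihood model (A always moves to B
# with reward 0) and solve it, so V(A) = 0 + V(B).
V_td = {"B": V_mc["B"], "A": 0.0 + V_mc["B"]}   # V(A) = 0.75
```

MC answers V(A) = 0 (the only episode through A returned 0); TD answers V(A) = 0.75, which generalizes better to future data.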
![Page 84: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/84.jpg)
Learning An Action-Value Function
(trajectory: s_t, a_t → r_{t+1} → s_{t+1}, a_{t+1} → r_{t+2} → s_{t+2}, a_{t+2} → …)

Estimate Q(s, a) for the current policy.

After every transition from a nonterminal state s_t, do:

  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
![Page 85: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/85.jpg)
Sarsa: On-Policy TD Control
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
  – Initialize s
  – Choose a from s using policy derived from Q (e.g., ε-greedy)
  – Repeat (for each step of episode):
      Take action a; observe reward r and next state s'
      Choose a' from s' using policy derived from Q (e.g., ε-greedy)
      Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]
      s ← s'; a ← a'
  – until s is terminal
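A minimal Sarsa run on a toy corridor illustrates the on-policy update. The environment (states 0..4, goal at state 4, reward −1 per step) and all parameter values are illustrative assumptions, not part of the slides.

```python
import random

def corridor_step(s, a):
    s2 = max(0, s + (1 if a == 1 else -1))    # a=1: right, a=0: left
    return s2, -1.0, s2 == 4                  # next state, reward, done

def eps_greedy(Q, s, eps, rng):
    if rng.random() < eps:
        return rng.randrange(2)
    return max((0, 1), key=lambda a: Q[(s, a)])

def sarsa(episodes=1000, alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        a = eps_greedy(Q, s, eps, rng)        # choose a before the step loop
        done = False
        while not done:
            s2, r, done = corridor_step(s, a)
            a2 = eps_greedy(Q, s2, eps, rng)  # on-policy: a' from same policy
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

Q = sarsa()          # greedy policy: move right in every state
```

Because the update uses the action a' actually chosen by the ε-greedy policy, Sarsa evaluates and improves the policy it is following.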
![Page 86: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/86.jpg)
Example (Windy World)
undiscounted, episodic, reward = –1 until goal
(figure: the windy gridworld, shown with standard moves and with King's moves)
Example (Windy World)
Applying ε-greedy Sarsa to this task, with ε = 0.1, α = 0.1, and initial values Q(s, a) = 0 for all s, a.
![Page 88: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/88.jpg)
Q-Learning: Off-Policy TD Control
One-step Q-learning:

  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

(figure: backup diagrams for the deterministic-policy and stochastic-policy cases)
![Page 89: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/89.jpg)
Q-Learning: Off-Policy TD Control
Initialize Q (s, a) arbitrarily
Repeat (for each episode):
  – Initialize s
  – Repeat (for each step of episode):
      Choose a from s using policy derived from Q (e.g., ε-greedy)
      Take action a; observe reward r and next state s'
      Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
      s ← s'
  – until s is terminal
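Q-learning can be sketched on the cliff-walking task of the next slide (a 4×12 grid, start at the bottom-left, goal at the bottom-right; stepping into the cliff costs −100 and resets to the start). The grid layout and parameters here are the usual textbook choices, assumed for illustration.

```python
import random

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

def cliff_step(s, a):
    r, c = s
    dr, dc = MOVES[a]
    r2 = min(max(r + dr, 0), ROWS - 1)
    c2 = min(max(c + dc, 0), COLS - 1)
    if r2 == 3 and 1 <= c2 <= 10:             # fell off the cliff
        return START, -100.0, False
    return (r2, c2), -1.0, (r2, c2) == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {((r, c), a): 0.0 for r in range(ROWS)
         for c in range(COLS) for a in range(4)}
    for _ in range(episodes):
        s, done = START, False
        while not done:
            a = (rng.randrange(4) if rng.random() < eps
                 else max(range(4), key=lambda x: Q[(s, x)]))
            s2, rew, done = cliff_step(s, a)
            best = 0.0 if done else max(Q[(s2, x)] for x in range(4))
            Q[(s, a)] += alpha * (rew + gamma * best - Q[(s, a)])
            s = s2                            # off-policy: no a' is carried
    return Q

Q = q_learning()
```

The max over a' in the target means Q approximates the optimal action-value function regardless of the ε-greedy behavior policy: the greedy path found runs right along the cliff edge.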
![Page 90: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/90.jpg)
Example (Cliff Walking)
![Page 91: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/91.jpg)
Actor-Critic Methods
(figure: actor-critic architecture — the actor holds the policy and emits actions; the critic holds the value function and emits a TD error; state and reward come from the environment, and the TD error drives learning in both actor and critic)
Explicit representation of policy as well as value function
Minimal computation to select actions
Can learn an explicit stochastic policy
Can put constraints on policies
Appealing as psychological and neural models
![Page 92: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/92.jpg)
Actor-Critic Methods
(actor-critic diagram, as above)

Policy parameters (action preferences): p(s, a)

Policy (softmax over preferences):

  π(s, a) = Pr(a_t = a | s_t = s) = e^{p(s, a)} / Σ_b e^{p(s, b)}

TD error:

  δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
![Page 93: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/93.jpg)
Actor-Critic Methods
(actor-critic diagram and softmax-policy / TD-error definitions, as on the previous slide)

– The critic updates the state-value function using TD(0).
– How to update the policy parameters?
![Page 94: Reinforcement Learning Elementary Solution Methods 主講人:虞台文 大同大學資工所 智慧型多媒體研究室.](https://reader033.fdocuments.net/reader033/viewer/2022061520/5697bf711a28abf838c7dd21/html5/thumbnails/94.jpg)
Actor-Critic Methods
(actor-critic diagram, as above)

TD error:

  δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

The sign of δ_t (> 0, = 0, or < 0) tells whether the chosen action turned out better or worse than expected; since we aim to maximize value, the preference for that action is adjusted in the direction of δ_t.

How to update the policy parameters?

Method 1:

  p(s_t, a_t) ← p(s_t, a_t) + β δ_t

Method 2:

  p(s_t, a_t) ← p(s_t, a_t) + β δ_t [1 − π(s_t, a_t)]
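A one-step actor-critic with the softmax policy and Method 1 update can be sketched on a toy corridor (states 0..4, goal at 4, reward −1 per step). The environment and the step sizes α, β are illustrative assumptions, not from the slides.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)                            # subtract max for stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def actor_critic(episodes=2000, alpha=0.1, beta=0.5, gamma=1.0, seed=0):
    rng = random.Random(seed)
    V = [0.0] * 5                             # critic: state-value estimates
    p = [[0.0, 0.0] for _ in range(5)]        # actor: preferences (left, right)
    for _ in range(episodes):
        s = 0
        while s != 4:
            pi = softmax(p[s])
            a = 0 if rng.random() < pi[0] else 1
            s2 = max(0, s + (1 if a == 1 else -1))
            r = -1.0
            v_next = 0.0 if s2 == 4 else V[s2]
            delta = r + gamma * v_next - V[s]  # TD error
            V[s] += alpha * delta              # critic: TD(0) update
            p[s][a] += beta * delta            # actor: Method 1
            s = s2
    return V, p

V, p = actor_critic()     # the actor comes to prefer "right" in every state
```

Actions that produce a positive TD error have their preferences raised and so become more probable under the softmax, while actions that disappoint are suppressed.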