Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP)

[Page 1]
Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
Lecturer: Tai-Wen Yu
Intelligent Multimedia Research Lab, Department of Computer Science and Engineering, Tatung University

[Page 2]
Content
- Introduction
- Value Iteration for MDP
- Belief States & Infinite-State MDP
- Value Function of POMDP
- The PWLC Property of the Value Function
[Page 3]
Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
Introduction

[Page 4]
Definition (MDP)
A Markov decision process is a tuple $\langle S, A, T, R \rangle$:
- $S$: a finite set of states of the world
- $A$: a finite set of actions
- $T: S \times A \to \Pi(S)$: the state-transition function, with $T(s, a, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a)$
- $R: S \times A \to \mathbb{R}$: the reward function
[Page 5]
Complete Observability
- Solution procedures for MDPs give values or policies for each state.
- Using these solutions requires that the agent be able to detect its current state with complete reliability.
- Therefore, such a process is called a CO-MDP (completely observable MDP).

[Page 6]
Partial Observability
- Instead of directly measuring the current state, the agent makes an observation to get a hint about which state it is in.
- How does it get this hint (guess the state)?
  - By taking an action and then making an observation.
  - The observation is probabilistic, i.e., it provides only a hint.
  - The 'state' will therefore be defined in a probabilistic sense.

[Page 7]
Observation Model
- $\Omega$: a finite set of observations the agent can experience of its world.
- $O: S \times A \to \Pi(\Omega)$: the observation function, with $O(s', a, o) = P(o_{t+1} = o \mid s_{t+1} = s', a_t = a)$, the probability of getting observation $o$ given that the agent took action $a$ and landed in state $s'$.
[Page 8]
Definition (POMDP)
A POMDP is a tuple $\langle S, A, T, R, \Omega, O \rangle$, where
- $\langle S, A, T, R \rangle$ describes an MDP, and
- $O: S \times A \to \Pi(\Omega)$ is the observation function.
How do we find an optimal policy in such an environment?

[Page 9]
Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
Value Iteration for MDP
[Page 10]
Acting Optimally
- Finite-Horizon Model: maximize $E\!\left[\sum_{t=0}^{k} r_t\right]$, the expected total reward of the next $k$ steps.
- Infinite-Horizon Discounted Model: maximize $E\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, $0 \le \gamma < 1$, the expected discounted total reward.
Is there any difference in the nature of their optimal policies?
[Page 11]
Stationary vs. Non-Stationary Policies
- Finite-Horizon Model: the optimal policy depends on the number of time steps remaining; use a non-stationary policy $\pi_t: S \to A$.
- Infinite-Horizon Discounted Model: the optimal policy is independent of the number of time steps remaining; use a stationary policy $\pi: S \to A$.

[Page 12]
Stationary vs. Non-Stationary Policies (cont.)
The subscript $t$ in $\pi_t: S \to A$ denotes the number of remaining time steps.
[Page 13]
Value Functions
Finite-Horizon Model (non-stationary policy):
$V_{\pi,t}(s) = R(s, \pi_t(s)) + \sum_{s' \in S} T(s, \pi_t(s), s')\, V_{\pi,t-1}(s')$
Infinite-Horizon Discounted Model (stationary policy):
$V_{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V_{\pi}(s')$

[Page 14]
Optimal Policies
Finite-Horizon Model (non-stationary policy):
$\pi_t^*(s) = \arg\max_a \left[ R(s, a) + \sum_{s' \in S} T(s, a, s')\, V_{t-1}^*(s') \right]$, with $\pi_1^*(s) = \arg\max_a R(s, a)$
Infinite-Horizon Discounted Model (stationary policy):
$\pi^*(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s') \right]$

[Page 15]
Optimal Policies (cont.)
Finite-Horizon Model:
$V_t^*(s) = \max_a \left[ R(s, a) + \sum_{s' \in S} T(s, a, s')\, V_{t-1}^*(s') \right]$
Infinite-Horizon Discounted Model:
$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s') \right]$

[Page 16]
Optimal Policies (cont.)
Finite-Horizon Model (non-stationary policy):
$V_t^*(s) = \max_a \left[ R(s, a) + \sum_{s' \in S} T(s, a, s')\, V_{t-1}^*(s') \right]$, $\pi_1^*(s) = \arg\max_a R(s, a)$
- What happens as $t \to \infty$?
- What if $V_t(s) = V_{t-1}(s)$ for all $s$?
- What about $\pi_t$ if $V_t(s) = V_{t-1}(s)$ for all $s$?
- To find an optimal policy, do we need to spend infinite time?
[Page 17]
Value Iteration
The MDP has a finite number of states.
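The value-iteration recursion above translates directly into code. The following is a minimal sketch, not part of the original slides; the 2-state transition and reward numbers are hypothetical, chosen only to exercise the loop.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-8):
    """Value iteration for a finite-state MDP.
    T: transitions, shape (|S|, |A|, |S|), T[s, a, s'] = P(s' | s, a).
    R: rewards, shape (|S|, |A|).
    Returns (V*, greedy policy)."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
        Q = R + gamma * np.einsum('sap,p->sa', T, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Tiny 2-state, 2-action example (hypothetical numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V, policy = value_iteration(T, R)
```

Because the state set is finite, the table `V` has one entry per state; this is exactly the table-based method that fails once the state space becomes the (uncountable) belief space.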
[Page 18]
Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
Belief States & Infinite-State MDP
[Page 19]
POMDP Framework
[Diagram] The agent consists of a state estimator (SE) and a policy $\pi$. The world (an MDP) emits observations; the SE combines the action taken and the observation received into a belief state $b$, which the policy maps to the next action.

[Page 20]
Belief States
$\mathbf{b} = \langle b(s_1), b(s_2), \ldots \rangle^T$, $s_i \in S$, $b(s_i) \ge 0$, $\sum_{s \in S} b(s) = 1$
There are uncountably many belief states.
[Page 21]
State Space
$\mathbf{b} = \langle b(s_1), b(s_2), \ldots \rangle^T$, $s_i \in S$, $b(s_i) \ge 0$, $\sum_{s \in S} b(s) = 1$
There are uncountably many belief states.
[Diagram] For a 2-state POMDP the belief space is the segment $0 \le b(s_1) \le 1$; for a 3-state POMDP it is a triangle (a 2-dimensional simplex).

[Page 22]
State Estimation
Given $\mathbf{b}_t$, $a_t$, and $o_{t+1}$, what is $\mathbf{b}_{t+1}$?
[Page 23]
State Estimation
$b_{t+1}(s') = P(s' \mid o, a, \mathbf{b}_t)$
$= \dfrac{P(o \mid s', a, \mathbf{b}_t)\, P(s' \mid a, \mathbf{b}_t)}{P(o \mid a, \mathbf{b}_t)}$
$= \dfrac{P(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b_t(s)}{P(o \mid a, \mathbf{b}_t)}$
$= \dfrac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b_t(s)}{P(o \mid a, \mathbf{b}_t)}$
The denominator $P(o \mid a, \mathbf{b}_t)$ is a normalization factor.

[Page 24]
State Estimation (cont.)
Write the update compactly as
$\mathbf{b}_{t+1} = SE(\mathbf{b}_t, a, o) = \dfrac{T^{a,o}\, \mathbf{b}_t}{P(o \mid a, \mathbf{b}_t)}$
Remember these.
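The belief update $SE(\mathbf{b}, a, o)$ above can be sketched as a few lines of array code. A minimal sketch under hypothetical model numbers; `belief_update` is an illustrative helper name, not notation from the slides.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One POMDP belief update b' = SE(b, a, o).
    T[s, a, s'] = P(s' | s, a); O[s', a, o] = P(o | s', a).
    Returns (b', P(o | a, b))."""
    # Unnormalized: b'(s') = O(s', a, o) * sum_s T(s, a, s') b(s)
    unnorm = O[:, a, o] * (T[:, a, :].T @ b)
    p_o = unnorm.sum()  # normalization factor P(o | a, b)
    if p_o == 0.0:
        raise ValueError("observation o has zero probability under (b, a)")
    return unnorm / p_o, p_o

# Hypothetical 2-state, 1-action, 2-observation model.
T = np.array([[[0.7, 0.3]], [[0.4, 0.6]]])   # shape (|S|, |A|, |S|)
O = np.array([[[0.8, 0.2]], [[0.3, 0.7]]])   # shape (|S'|, |A|, |Omega|)
b = np.array([0.5, 0.5])
b_next, p_o = belief_update(b, a=0, o=0, T=T, O=O)
```

The returned normalization factor is exactly the $P(o \mid a, \mathbf{b}_t)$ that reappears later in the PWLC proof, where it cancels against the denominator of $SE$.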
[Page 25]
State Estimation (cont.)
The normalization factor expands as
$P(o \mid a, \mathbf{b}_t) = \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b_t(s)$
It is linear with respect to $\mathbf{b}_t$.

[Page 26]
State Transition Function
With $\mathbf{b}_{t+1} = SE(\mathbf{b}_t, a, o)$,
$\tau(\mathbf{b}, a, \mathbf{b}') = P(\mathbf{b}' \mid \mathbf{b}, a) = \sum_{o} P(\mathbf{b}' \mid \mathbf{b}, a, o)\, P(o \mid \mathbf{b}, a)$
where $P(\mathbf{b}' \mid \mathbf{b}, a, o) = 1$ if $\mathbf{b}' = SE(\mathbf{b}, a, o)$ and $0$ otherwise.
[Page 27]
State Transition Function (cont.)
Suppose that $SE(\mathbf{b}, a, o_i) \ne SE(\mathbf{b}, a, o_j)$ for $i \ne j$. Then
$\tau(\mathbf{b}, a, \mathbf{b}') = \begin{cases} P(o \mid a, \mathbf{b}) & \text{if } \mathbf{b}' = SE(\mathbf{b}, a, o) \\ 0 & \text{otherwise} \end{cases}$

[Page 28]
POMDP = Infinite-State MDP
A POMDP is an MDP with tuple $\langle B, A, \tau, \rho \rangle$:
- $B$: the set of belief states
- $A$: the finite set of actions (the same as in the original MDP)
- $\tau: B \times A \to \Pi(B)$: the state-transition function, $\tau(\mathbf{b}_t, a, \mathbf{b}_{t+1}) = P(\mathbf{b}_{t+1} \mid \mathbf{b}_t, a)$
- $\rho: B \times A \to \mathbb{R}$: the reward function
What is the reward function?

[Page 29]
Reward Function
$\rho(\mathbf{b}, a) = \sum_{s \in S} b(s)\, R(s, a)$
where $R$ is the reward function of the original MDP.
Good news: it is linear.
[Page 30]
Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
Value Function of POMDP

[Page 31]
Value Function over Belief Space
Consider a 2-state POMDP. How do we obtain the value function $V(\mathbf{b})$ over the belief space? Can we use the table-based method?

[Page 32]
Finding Optimal Policy
- POMDP = Infinite-State MDP.
- The general method for MDPs: determine the value function, then perform policy improvement.
- Value functions: the state value function and the action value function.

[Page 33]
Review: Value Iteration
- It is based on the finite-horizon value function.
- It finds $\pi_t^*$ on each iteration.
- What is $\pi_1^*$?
[Page 34]
The $\pi_1^*$ and $V_1^*$
Immediate reward: $\rho(\mathbf{b}, a) = \sum_{s \in S} b(s)\, R(s, a)$
$Q_1^{a}(\mathbf{b}) = \sum_{s \in S} b(s)\, R(s, a)$
$V_1^*(\mathbf{b}) = \max_a Q_1^{a}(\mathbf{b})$, $\pi_1^*(\mathbf{b}) = \arg\max_a Q_1^{a}(\mathbf{b})$

[Page 35]
The $\pi_1^*$ and $V_1^*$ (cont.)
Consider a 2-state POMDP with two actions ($a_1$, $a_2$) and three observations ($o_1$, $o_2$, $o_3$).
[Diagram] Each $Q_1^{a}$ is a line over $\mathbf{b} \in [0, 1]$; $V_1^*$ is the upper surface of the lines for $a_1$ and $a_2$.

[Page 36]
Horizon-1 Policy Trees
[Diagram] $\pi_1^*$ partitions the belief space into regions, each labeled with the action ($a_2$ or $a_1$) that is optimal there; each region corresponds to a one-node policy tree.

[Page 37]
Horizon-1 Policy Trees (cont.)
$V_1^*$ is piecewise linear and convex (PWLC).
![Page 38: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)](https://reader035.fdocuments.net/reader035/viewer/2022081420/56813adb550346895da32442/html5/thumbnails/38.jpg)
s1
s2
(0,0)(1,0)
(1,0)
The and *1 *
1V*1
( ) (( ) , )s S
a R sbQ as
b
**
1 1( ) arg m (ax )
a
aQV bb
How about 3-state POMDP and more?
It is PWLC.
What is the policy?
![Page 39: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)](https://reader035.fdocuments.net/reader035/viewer/2022081420/56813adb550346895da32442/html5/thumbnails/39.jpg)
The and *1 *
1V*1
( ) (( ) , )s S
a R sbQ as
b
**
1 1( ) arg m (ax )
a
aQV bb
How about 3-state POMDP and more?
What is the policy?
[Page 40]
The PWLC Property
A piecewise-linear function consists of linear (hyperplane) segments:
- Linear function: $\alpha_0 x_0 + \alpha_1 x_1 + \cdots + \alpha_N x_N$
- $k$th linear segment: $\sum_{i=0}^{N} \alpha_i^k x_i$
- The $\alpha$-vector: $\boldsymbol{\alpha}^k = [\alpha_0^k, \alpha_1^k, \ldots, \alpha_N^k]^T$
- Each segment can be represented as $(\boldsymbol{\alpha}^k)^T \mathbf{x}$

[Page 41]
The PWLC Property (cont.)
$f(\mathbf{x}) = \max_k (\boldsymbol{\alpha}^k)^T \mathbf{x}$ is PWLC.
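A max-of-linear-segments function is trivial to evaluate once the $\alpha$-vectors are stored. A minimal sketch with hypothetical $\alpha$-vectors over a 2-state belief space; `pwlc_value` is an illustrative helper name.

```python
import numpy as np

def pwlc_value(alphas, b):
    """Evaluate f(b) = max_k (alpha^k)^T b, the PWLC form of a
    value function represented by a set of alpha-vectors."""
    return max(alpha @ b for alpha in alphas)

# Two alpha-vectors over a 2-state belief space (hypothetical numbers).
alphas = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
# At the corners of the belief space, one segment or the other dominates.
assert pwlc_value(alphas, np.array([1.0, 0.0])) == 1.0
assert pwlc_value(alphas, np.array([0.0, 1.0])) == 2.0
```

Convexity follows because a pointwise maximum of linear functions is convex: at the midpoint belief $(0.5, 0.5)$ the value is $1.0$, which lies below the chord value $1.5$ between the corners.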
[Page 42]
The $\pi_t^*$ and $V_t^*$
$Q_t^{a}(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_{\mathbf{b}'} \tau(\mathbf{b}, a, \mathbf{b}')\, V_{t-1}^*(\mathbf{b}')$
$= \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a)\, V_{t-1}^*(SE(\mathbf{b}, a, o))$
- $\rho(\mathbf{b}, a)$: the immediate reward.
- $V_{t-1}^*(SE(\mathbf{b}, a, o))$: the value of observation $o$ after doing action $a$ in the current belief state $\mathbf{b}$.
- $P(o \mid \mathbf{b}, a)$: the probability of observation $o$ after doing action $a$ in the current belief state $\mathbf{b}$.
Define $V_{t-1}^{a,o}(\mathbf{b}) = P(o \mid \mathbf{b}, a)\, V_{t-1}^*(SE(\mathbf{b}, a, o))$.

[Page 43]
The $\pi_t^*$ and $V_t^*$ (cont.)
$\rho(\mathbf{b}, a)$ is PWLC (in fact linear). Is $V_{t-1}^*(SE(\mathbf{b}, a, o))$ PWLC as well? Yes, it is; but the proof is deferred.
[Page 44]
The $\pi_2^*$ and $V_2^*$
$Q_2^{a}(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_{\mathbf{b}'} \tau(\mathbf{b}, a, \mathbf{b}')\, V_1^*(\mathbf{b}')$
$= \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a)\, V_1^*(SE(\mathbf{b}, a, o))$
$= \rho(\mathbf{b}, a) + \gamma \sum_{o} V_1^{a,o}(\mathbf{b})$
$V_2^*(\mathbf{b}) = \max_a Q_2^{a}(\mathbf{b})$, $\pi_2^*(\mathbf{b}) = \arg\max_a Q_2^{a}(\mathbf{b})$
[Page 45]
The $\pi_2^*$ and $V_2^*$ (cont.)
$Q_2^{a}(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a)\, V_1^*(SE(\mathbf{b}, a, o))$
[Diagram] Computing $Q_2^{a_1}$: from belief $\mathbf{b}$, taking action $a_1$ leads, depending on the observation ($o_1$, $o_2$, or $o_3$), to a successor belief $\mathbf{b}' = SE(\mathbf{b}, a_1, o)$, each evaluated under $V_1^*$.

[Page 46]
The $\pi_2^*$ and $V_2^*$ (cont.)
[Diagram, as above] What action will you take if the observation is $o_i$ after $a_1$ is taken?

[Page 47]
The $\pi_2^*$ and $V_2^*$ (cont.)
Consider an individual observation $o$ after action $a$ is taken:
$V_1^{a,o}(\mathbf{b}) = P(o \mid \mathbf{b}, a)\, V_1^*(\mathbf{b}') = P(o \mid \mathbf{b}, a)\, V_1^*(SE(\mathbf{b}, a, o))$
Then
$Q_2^{a}(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_{o} V_1^{a,o}(\mathbf{b})$

[Page 48]
The $\pi_2^*$ and $V_2^*$ (cont.)
[Diagram] $V_1^{a,o}(\mathbf{b})$ is a transformed value function: the PWLC function $V_1^*$ mapped back into the original belief space.
[Page 49]
The $\pi_2^*$ and $V_2^*$ (cont.)
[Diagram] The transformed value functions $V_1^{a_1,o_1}(\mathbf{b})$, $V_1^{a_1,o_2}(\mathbf{b})$, and $V_1^{a_1,o_3}(\mathbf{b})$, each PWLC over the belief space, together with $\rho(\mathbf{b}, a_1)$.

[Page 50]
The $\pi_2^*$ and $V_2^*$ (cont.)
[Diagram] Summing $\rho(\mathbf{b}, a_1)$ and the transformed value functions for $o_1$, $o_2$, $o_3$ yields $Q_2^{a_1}$.

[Page 51]
Horizon-2 Tree for Action 1
[Diagram] $Q_2^{a_1}$ partitions the belief space into regions labeled by observation strategies such as $(a_1, a_2, a_2)$ and $(a_1, a_1, a_2)$, i.e., the horizon-1 action chosen for each of $o_1$, $o_2$, $o_3$; each region corresponds to a horizon-2 policy tree with root action $a_1$ and one $\pi_1^*$ subtree per observation.

[Page 52]
Horizon-2 Tree for Action 1 (cont.)
[Diagram] Likewise, $Q_2^{a_2}$ partitions the belief space into regions labeled by observation strategies.
[Page 53]
The $\pi_2^*$ and $V_2^*$ (cont.)
[Diagram] $Q_2^{a_1}$ and $Q_2^{a_2}$ plotted over the belief space.

[Page 54]
The $\pi_2^*$ and $V_2^*$ (cont.)
[Diagram] $V_2^*(\mathbf{b}) = \max(Q_2^{a_1}(\mathbf{b}), Q_2^{a_2}(\mathbf{b}))$; the belief space is partitioned into regions labeled $a_1$, $a_2$, $a_1$, $a_2$.

[Page 55]
Horizon-2 Policy Tree
[Diagram] $\pi_2^*$: a root action followed, for each observation $o_1$, $o_2$, $o_3$, by a horizon-1 policy $\pi_1^*$.
Can you figure out how to determine the value function for horizon 3 from the above discussion?

[Page 56]
The $\pi_3^*$ and $V_3^*$
[Diagram] From $V_2^*$, form the transformed value functions $V_2^{a_1,o_1}(\mathbf{b})$, $V_2^{a_1,o_2}(\mathbf{b})$, $V_2^{a_1,o_3}(\mathbf{b})$ and $V_2^{a_2,o_1}(\mathbf{b})$, $V_2^{a_2,o_2}(\mathbf{b})$, $V_2^{a_2,o_3}(\mathbf{b})$; summing them with the immediate reward gives $Q_3^{a_1}(\mathbf{b})$ and $Q_3^{a_2}(\mathbf{b})$, and $V_3^*(\mathbf{b}) = \max_a Q_3^{a}(\mathbf{b})$.
[Page 57]
The $\pi_3^*$ and $V_3^*$ (cont.)
[Diagram] $Q_3^{a_1}(\mathbf{b})$ and $Q_3^{a_2}(\mathbf{b})$, each built from the branches for $o_1$, $o_2$, $o_3$.

[Page 58]
The $\pi_3^*$ and $V_3^*$ (cont.)
[Diagram] $V_3^* = \max_a Q_3^{a}$. How do we obtain $\pi_t^*$ and $V_t^*$ in general?

[Page 59]
Horizon-3 Policy Tree
[Diagram] $\pi_3^*$: a root action followed, for each observation $o_1$, $o_2$, $o_3$, by a horizon-2 policy $\pi_2^*$, each of which in turn branches into horizon-1 policies $\pi_1^*$.

[Page 60]
Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
The PWLC Property of Value Function
[Page 61]
Value Function for POMDP
$Q_t^{a}(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_{\mathbf{b}'} \tau(\mathbf{b}, a, \mathbf{b}')\, V_{t-1}^*(\mathbf{b}')$
$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a)\, V_{t-1}^*(SE(\mathbf{b}, a, o)) \right]$
$V_1^*(\mathbf{b}) = \max_a \rho(\mathbf{b}, a) = \max_a \sum_i b(s_i)\, R(s_i, a)$

[Page 62]
Value Function for POMDP (cont.)
Let $\mathbf{r}_a = [R(s_1, a), R(s_2, a), \ldots]^T$. Then
$V_1^*(\mathbf{b}) = \max_a \mathbf{r}_a^T \mathbf{b}$

[Page 63]
Value Function for POMDP (cont.)
Let $\boldsymbol{\alpha}^{k,1} = \mathbf{r}_{a_k}$. Then
$V_1^*(\mathbf{b}) = \max_k (\boldsymbol{\alpha}^{k,1})^T \mathbf{b}$
[Page 64]
Theorem
$V_t^*(\mathbf{b})$ is PWLC.

[Page 65]
Proof ($V_t^*(\mathbf{b})$ is PWLC)
By induction:
- We already know that the claim holds for $V_1^*(\mathbf{b})$.
- Assume that it holds for $V_{t-1}^*(\mathbf{b})$.
- We then show that it must hold for $V_t^*(\mathbf{b})$.
[Page 66]
Proof (cont.)
$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a)\, V_{t-1}^*(SE(\mathbf{b}, a, o)) \right]$
From the assumption, we have
$V_{t-1}^*(SE(\mathbf{b}, a, o)) = \max_k (\boldsymbol{\alpha}^{k,t-1})^T SE(\mathbf{b}, a, o)$
and hence
$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a) \max_k (\boldsymbol{\alpha}^{k,t-1})^T SE(\mathbf{b}, a, o) \right]$

[Page 67]
Proof (cont.)
Let $\boldsymbol{\alpha}^{a,o}(\mathbf{b}) = \arg\max_{\boldsymbol{\alpha}^{k,t-1}} (\boldsymbol{\alpha}^{k,t-1})^T SE(\mathbf{b}, a, o)$. Then
$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a)\, (\boldsymbol{\alpha}^{a,o}(\mathbf{b}))^T SE(\mathbf{b}, a, o) \right]$
Since $SE(\mathbf{b}, a, o) = \dfrac{T^{a,o}\, \mathbf{b}}{P(o \mid a, \mathbf{b})}$, the factor $P(o \mid a, \mathbf{b})$ cancels:
$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_{o} (\boldsymbol{\alpha}^{a,o}(\mathbf{b}))^T T^{a,o}\, \mathbf{b} \right]$

[Page 68]
Proof (cont.)
With $\rho(\mathbf{b}, a) = \mathbf{r}_a^T \mathbf{b}$, this becomes
$V_t^*(\mathbf{b}) = \max_a \left[ \mathbf{r}_a^T + \gamma \sum_{o} (\boldsymbol{\alpha}^{a,o}(\mathbf{b}))^T T^{a,o} \right] \mathbf{b}$
[Page 69]
Proof (cont.)
$V_t^*(\mathbf{b}) = \max_a \left[ \mathbf{r}_a^T + \gamma \sum_{o} (\boldsymbol{\alpha}^{a,o}(\mathbf{b}))^T T^{a,o} \right] \mathbf{b}$
Let $(\boldsymbol{\alpha}^{k,t})^T = \mathbf{r}_{a_i}^T + \gamma \sum_{o} (\boldsymbol{\alpha}^{a_i,o})^T T^{a_i,o}$, with one such vector for each action $a_i$ and each choice of the $\boldsymbol{\alpha}^{a_i,o}$. Then
$V_t^*(\mathbf{b}) = \max_k (\boldsymbol{\alpha}^{k,t})^T \mathbf{b}$
so $V_t^*(\mathbf{b})$ is PWLC. $\blacksquare$