Reinforcement Learning: Partially Observable Markov Decision Processes (POMDP)

[Page 1]
Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
Lecturer: Tai-Wen Yu
Intelligent Multimedia Research Lab, Department of Computer Science and Engineering, Tatung University

[Page 2]
Content
- Introduction
- Value Iteration for MDP
- Belief States & Infinite-State MDP
- Value Function of POMDP
- The PWLC Property of the Value Function
[Page 3]
Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
Introduction

[Page 4]
Definition (MDP)
A Markov decision process is a tuple $\langle S, A, T, R \rangle$:
- $S$: a finite set of states of the world
- $A$: a finite set of actions
- $T: S \times A \to \Pi(S)$: the state-transition function, with $T(s, a, s') = P(s_{t+1} = s' \mid s_t = s, a_t = a)$
- $R: S \times A \to \mathbb{R}$: the reward function
[Page 5]
Complete Observability
- Solution procedures for MDPs give values or policies for each state.
- Using these solutions requires that the agent be able to detect its current state with complete reliability.
- Therefore, such a process is called a CO-MDP (completely observable MDP).

[Page 6]
Partial Observability
- Instead of directly measuring the current state, the agent makes an observation to get a hint about which state it is in.
- How does it get this hint (guess the state)?
  - By taking an action and then making an observation.
  - The observation is probabilistic, i.e., it provides only a hint.
  - The 'state' will therefore be defined in a probabilistic sense.

[Page 7]
Observation Model
- $\Omega$: a finite set of observations the agent can experience of its world.
- $O: S \times A \to \Pi(\Omega)$: the observation function, with $O(s', a, o) = P(o_{t+1} = o \mid s_{t+1} = s', a_t = a)$, the probability of getting observation $o$ given that the agent took action $a$ and landed in state $s'$.
[Page 8]
Definition (POMDP)
A POMDP is a tuple $\langle S, A, T, R, \Omega, O \rangle$, where
- $\langle S, A, T, R \rangle$ describes an MDP, and
- $O: S \times A \to \Pi(\Omega)$ is the observation function.
How do we find an optimal policy in such an environment?

[Page 9]
Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
Value Iteration for MDP
[Page 10]
Acting Optimally
- Finite-Horizon Model: maximize $E\!\left[\sum_{t=0}^{k} r_t\right]$, the expected total reward of the next $k$ steps.
- Infinite-Horizon Discounted Model: maximize $E\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, $0 \le \gamma < 1$, the expected discounted total reward.
Is there any difference in the nature of their optimal policies?
[Page 11]
Stationary vs. Non-Stationary Policies
- Finite-Horizon Model: the optimal policy depends on the number of time steps remaining; use a non-stationary policy $\pi_t: S \to A$.
- Infinite-Horizon Discounted Model: the optimal policy is independent of the number of time steps remaining; use a stationary policy $\pi: S \to A$.

[Page 12]
Stationary vs. Non-Stationary Policies (cont.)
The subscript $t$ in $\pi_t: S \to A$ denotes the number of remaining time steps.
[Page 13]
Value Functions
Finite-Horizon Model (non-stationary policy):
$V_{\pi,t}(s) = R(s, \pi_t(s)) + \sum_{s' \in S} T(s, \pi_t(s), s')\, V_{\pi,t-1}(s')$
Infinite-Horizon Discounted Model (stationary policy):
$V_{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V_{\pi}(s')$

[Page 14]
Optimal Policies
Finite-Horizon Model (non-stationary policy):
$\pi_t^*(s) = \arg\max_a \left[ R(s, a) + \sum_{s' \in S} T(s, a, s')\, V_{t-1}^*(s') \right]$, with $\pi_1^*(s) = \arg\max_a R(s, a)$
Infinite-Horizon Discounted Model (stationary policy):
$\pi^*(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s') \right]$

[Page 15]
Optimal Policies (cont.)
Finite-Horizon Model:
$V_t^*(s) = \max_a \left[ R(s, a) + \sum_{s' \in S} T(s, a, s')\, V_{t-1}^*(s') \right]$
Infinite-Horizon Discounted Model:
$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s') \right]$

[Page 16]
Optimal Policies (cont.)
Finite-Horizon Model (non-stationary policy):
$V_t^*(s) = \max_a \left[ R(s, a) + \sum_{s' \in S} T(s, a, s')\, V_{t-1}^*(s') \right]$, $\pi_1^*(s) = \arg\max_a R(s, a)$
- What happens as $t \to \infty$?
- What if $V_t(s) = V_{t-1}(s)$ for all $s$?
- What about $\pi_t$ if $V_t(s) = V_{t-1}(s)$ for all $s$?
- To find an optimal policy, do we need to spend infinite time?
[Page 17]
Value Iteration
The MDP has a finite number of states.
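The value-iteration recursion above translates directly into code. The following is a minimal sketch, not part of the original slides; the 2-state transition and reward numbers are hypothetical, chosen only to exercise the loop.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-8):
    """Value iteration for a finite-state MDP.
    T: transitions, shape (|S|, |A|, |S|), T[s, a, s'] = P(s' | s, a).
    R: rewards, shape (|S|, |A|).
    Returns (V*, greedy policy)."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
        Q = R + gamma * np.einsum('sap,p->sa', T, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Tiny 2-state, 2-action example (hypothetical numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V, policy = value_iteration(T, R)
```

Because the state set is finite, the table `V` has one entry per state; this is exactly the table-based method that fails once the state space becomes the (uncountable) belief space.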
[Page 18]
Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
Belief States & Infinite-State MDP
[Page 19]
POMDP Framework
[Diagram] The agent consists of a state estimator (SE) and a policy $\pi$. The world (an MDP) emits observations; the SE combines the action taken and the observation received into a belief state $b$, which the policy maps to the next action.

[Page 20]
Belief States
$\mathbf{b} = \langle b(s_1), b(s_2), \ldots \rangle^T$, $s_i \in S$, $b(s_i) \ge 0$, $\sum_{s \in S} b(s) = 1$
There are uncountably many belief states.
[Page 21]
State Space
$\mathbf{b} = \langle b(s_1), b(s_2), \ldots \rangle^T$, $s_i \in S$, $b(s_i) \ge 0$, $\sum_{s \in S} b(s) = 1$
There are uncountably many belief states.
[Diagram] For a 2-state POMDP the belief space is the segment $0 \le b(s_1) \le 1$; for a 3-state POMDP it is a triangle (a 2-dimensional simplex).

[Page 22]
State Estimation
Given $\mathbf{b}_t$, $a_t$, and $o_{t+1}$, what is $\mathbf{b}_{t+1}$?
[Page 23]
State Estimation
$b_{t+1}(s') = P(s' \mid o, a, \mathbf{b}_t)$
$= \dfrac{P(o \mid s', a, \mathbf{b}_t)\, P(s' \mid a, \mathbf{b}_t)}{P(o \mid a, \mathbf{b}_t)}$
$= \dfrac{P(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b_t(s)}{P(o \mid a, \mathbf{b}_t)}$
$= \dfrac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b_t(s)}{P(o \mid a, \mathbf{b}_t)}$
The denominator $P(o \mid a, \mathbf{b}_t)$ is a normalization factor.

[Page 24]
State Estimation (cont.)
Write the update compactly as
$\mathbf{b}_{t+1} = SE(\mathbf{b}_t, a, o) = \dfrac{T^{a,o}\, \mathbf{b}_t}{P(o \mid a, \mathbf{b}_t)}$
Remember these.
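The belief update $SE(\mathbf{b}, a, o)$ above can be sketched as a few lines of array code. A minimal sketch under hypothetical model numbers; `belief_update` is an illustrative helper name, not notation from the slides.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One POMDP belief update b' = SE(b, a, o).
    T[s, a, s'] = P(s' | s, a); O[s', a, o] = P(o | s', a).
    Returns (b', P(o | a, b))."""
    # Unnormalized: b'(s') = O(s', a, o) * sum_s T(s, a, s') b(s)
    unnorm = O[:, a, o] * (T[:, a, :].T @ b)
    p_o = unnorm.sum()  # normalization factor P(o | a, b)
    if p_o == 0.0:
        raise ValueError("observation o has zero probability under (b, a)")
    return unnorm / p_o, p_o

# Hypothetical 2-state, 1-action, 2-observation model.
T = np.array([[[0.7, 0.3]], [[0.4, 0.6]]])   # shape (|S|, |A|, |S|)
O = np.array([[[0.8, 0.2]], [[0.3, 0.7]]])   # shape (|S'|, |A|, |Omega|)
b = np.array([0.5, 0.5])
b_next, p_o = belief_update(b, a=0, o=0, T=T, O=O)
```

The returned normalization factor is exactly the $P(o \mid a, \mathbf{b}_t)$ that reappears later in the PWLC proof, where it cancels against the denominator of $SE$.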
[Page 25]
State Estimation (cont.)
The normalization factor expands as
$P(o \mid a, \mathbf{b}_t) = \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b_t(s)$
It is linear with respect to $\mathbf{b}_t$.

[Page 26]
State Transition Function
With $\mathbf{b}_{t+1} = SE(\mathbf{b}_t, a, o)$,
$\tau(\mathbf{b}, a, \mathbf{b}') = P(\mathbf{b}' \mid \mathbf{b}, a) = \sum_{o} P(\mathbf{b}' \mid \mathbf{b}, a, o)\, P(o \mid \mathbf{b}, a)$
where $P(\mathbf{b}' \mid \mathbf{b}, a, o) = 1$ if $\mathbf{b}' = SE(\mathbf{b}, a, o)$ and $0$ otherwise.
[Page 27]
State Transition Function (cont.)
Suppose that $SE(\mathbf{b}, a, o_i) \ne SE(\mathbf{b}, a, o_j)$ for $i \ne j$. Then
$\tau(\mathbf{b}, a, \mathbf{b}') = \begin{cases} P(o \mid a, \mathbf{b}) & \text{if } \mathbf{b}' = SE(\mathbf{b}, a, o) \\ 0 & \text{otherwise} \end{cases}$

[Page 28]
POMDP = Infinite-State MDP
A POMDP is an MDP with tuple $\langle B, A, \tau, \rho \rangle$:
- $B$: the set of belief states
- $A$: the finite set of actions (the same as in the original MDP)
- $\tau: B \times A \to \Pi(B)$: the state-transition function, $\tau(\mathbf{b}_t, a, \mathbf{b}_{t+1}) = P(\mathbf{b}_{t+1} \mid \mathbf{b}_t, a)$
- $\rho: B \times A \to \mathbb{R}$: the reward function
What is the reward function?

[Page 29]
Reward Function
$\rho(\mathbf{b}, a) = \sum_{s \in S} b(s)\, R(s, a)$
where $R$ is the reward function of the original MDP.
Good news: it is linear.
[Page 30]
Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
Value Function of POMDP

[Page 31]
Value Function over Belief Space
Consider a 2-state POMDP. How do we obtain the value function $V(\mathbf{b})$ over the belief space? Can we use the table-based method?

[Page 32]
Finding Optimal Policy
- POMDP = Infinite-State MDP.
- The general method for MDPs: determine the value function, then perform policy improvement.
- Value functions: the state value function and the action value function.

[Page 33]
Review: Value Iteration
- It is based on the finite-horizon value function.
- It finds $\pi_t^*$ on each iteration.
- What is $\pi_1^*$?
[Page 34]
The $\pi_1^*$ and $V_1^*$
Immediate reward: $\rho(\mathbf{b}, a) = \sum_{s \in S} b(s)\, R(s, a)$
$Q_1^{a}(\mathbf{b}) = \sum_{s \in S} b(s)\, R(s, a)$
$V_1^*(\mathbf{b}) = \max_a Q_1^{a}(\mathbf{b})$, $\pi_1^*(\mathbf{b}) = \arg\max_a Q_1^{a}(\mathbf{b})$

[Page 35]
The $\pi_1^*$ and $V_1^*$ (cont.)
Consider a 2-state POMDP with two actions ($a_1$, $a_2$) and three observations ($o_1$, $o_2$, $o_3$).
[Diagram] Each $Q_1^{a}$ is a line over $\mathbf{b} \in [0, 1]$; $V_1^*$ is the upper surface of the lines for $a_1$ and $a_2$.

[Page 36]
Horizon-1 Policy Trees
[Diagram] $\pi_1^*$ partitions the belief space into regions, each labeled with the action ($a_2$ or $a_1$) that is optimal there; each region corresponds to a one-node policy tree.

[Page 37]
Horizon-1 Policy Trees (cont.)
$V_1^*$ is piecewise linear and convex (PWLC).
![Page 38: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)](https://reader035.fdocuments.net/reader035/viewer/2022081420/56813adb550346895da32442/html5/thumbnails/38.jpg)
s1
s2
(0,0)(1,0)
(1,0)
The and *1 *
1V*1
( ) (( ) , )s S
a R sbQ as
b
**
1 1( ) arg m (ax )
a
aQV bb
How about 3-state POMDP and more?
It is PWLC.
What is the policy?
![Page 39: Reinforcement Learning Partially Observable Markov Decision Processes (POMDP)](https://reader035.fdocuments.net/reader035/viewer/2022081420/56813adb550346895da32442/html5/thumbnails/39.jpg)
The and *1 *
1V*1
( ) (( ) , )s S
a R sbQ as
b
**
1 1( ) arg m (ax )
a
aQV bb
How about 3-state POMDP and more?
What is the policy?
[Page 40]
The PWLC Property
A piecewise-linear function consists of linear (hyperplane) segments:
- Linear function: $\alpha_0 x_0 + \alpha_1 x_1 + \cdots + \alpha_N x_N$
- $k$th linear segment: $\sum_{i=0}^{N} \alpha_i^k x_i$
- The $\alpha$-vector: $\boldsymbol{\alpha}^k = [\alpha_0^k, \alpha_1^k, \ldots, \alpha_N^k]^T$
- Each segment can be represented as $(\boldsymbol{\alpha}^k)^T \mathbf{x}$

[Page 41]
The PWLC Property (cont.)
$f(\mathbf{x}) = \max_k (\boldsymbol{\alpha}^k)^T \mathbf{x}$ is PWLC.
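A max-of-linear-segments function is trivial to evaluate once the $\alpha$-vectors are stored. A minimal sketch with hypothetical $\alpha$-vectors over a 2-state belief space; `pwlc_value` is an illustrative helper name.

```python
import numpy as np

def pwlc_value(alphas, b):
    """Evaluate f(b) = max_k (alpha^k)^T b, the PWLC form of a
    value function represented by a set of alpha-vectors."""
    return max(alpha @ b for alpha in alphas)

# Two alpha-vectors over a 2-state belief space (hypothetical numbers).
alphas = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
# At the corners of the belief space, one segment or the other dominates.
assert pwlc_value(alphas, np.array([1.0, 0.0])) == 1.0
assert pwlc_value(alphas, np.array([0.0, 1.0])) == 2.0
```

Convexity follows because a pointwise maximum of linear functions is convex: at the midpoint belief $(0.5, 0.5)$ the value is $1.0$, which lies below the chord value $1.5$ between the corners.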
[Page 42]
The $\pi_t^*$ and $V_t^*$
$Q_t^{a}(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_{\mathbf{b}'} \tau(\mathbf{b}, a, \mathbf{b}')\, V_{t-1}^*(\mathbf{b}')$
$= \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a)\, V_{t-1}^*(SE(\mathbf{b}, a, o))$
- $\rho(\mathbf{b}, a)$: the immediate reward.
- $V_{t-1}^*(SE(\mathbf{b}, a, o))$: the value of observation $o$ after doing action $a$ in the current belief state $\mathbf{b}$.
- $P(o \mid \mathbf{b}, a)$: the probability of observation $o$ after doing action $a$ in the current belief state $\mathbf{b}$.
Define $V_{t-1}^{a,o}(\mathbf{b}) = P(o \mid \mathbf{b}, a)\, V_{t-1}^*(SE(\mathbf{b}, a, o))$.

[Page 43]
The $\pi_t^*$ and $V_t^*$ (cont.)
$\rho(\mathbf{b}, a)$ is PWLC (in fact linear). Is $V_{t-1}^*(SE(\mathbf{b}, a, o))$ PWLC as well? Yes, it is; but the proof is deferred.
[Page 44]
The $\pi_2^*$ and $V_2^*$
$Q_2^{a}(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_{\mathbf{b}'} \tau(\mathbf{b}, a, \mathbf{b}')\, V_1^*(\mathbf{b}')$
$= \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a)\, V_1^*(SE(\mathbf{b}, a, o))$
$= \rho(\mathbf{b}, a) + \gamma \sum_{o} V_1^{a,o}(\mathbf{b})$
$V_2^*(\mathbf{b}) = \max_a Q_2^{a}(\mathbf{b})$, $\pi_2^*(\mathbf{b}) = \arg\max_a Q_2^{a}(\mathbf{b})$
[Page 45]
The $\pi_2^*$ and $V_2^*$ (cont.)
$Q_2^{a}(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a)\, V_1^*(SE(\mathbf{b}, a, o))$
[Diagram] Computing $Q_2^{a_1}$: from belief $\mathbf{b}$, taking action $a_1$ leads, depending on the observation ($o_1$, $o_2$, or $o_3$), to a successor belief $\mathbf{b}' = SE(\mathbf{b}, a_1, o)$, each evaluated under $V_1^*$.

[Page 46]
The $\pi_2^*$ and $V_2^*$ (cont.)
[Diagram, as above] What action will you take if the observation is $o_i$ after $a_1$ is taken?

[Page 47]
The $\pi_2^*$ and $V_2^*$ (cont.)
Consider an individual observation $o$ after action $a$ is taken:
$V_1^{a,o}(\mathbf{b}) = P(o \mid \mathbf{b}, a)\, V_1^*(\mathbf{b}') = P(o \mid \mathbf{b}, a)\, V_1^*(SE(\mathbf{b}, a, o))$
Then
$Q_2^{a}(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_{o} V_1^{a,o}(\mathbf{b})$

[Page 48]
The $\pi_2^*$ and $V_2^*$ (cont.)
[Diagram] $V_1^{a,o}(\mathbf{b})$ is a transformed value function: the PWLC function $V_1^*$ mapped back into the original belief space.
[Page 49]
The $\pi_2^*$ and $V_2^*$ (cont.)
[Diagram] The transformed value functions $V_1^{a_1,o_1}(\mathbf{b})$, $V_1^{a_1,o_2}(\mathbf{b})$, and $V_1^{a_1,o_3}(\mathbf{b})$, each PWLC over the belief space, together with $\rho(\mathbf{b}, a_1)$.

[Page 50]
The $\pi_2^*$ and $V_2^*$ (cont.)
[Diagram] Summing $\rho(\mathbf{b}, a_1)$ and the transformed value functions for $o_1$, $o_2$, $o_3$ yields $Q_2^{a_1}$.

[Page 51]
Horizon-2 Tree for Action 1
[Diagram] $Q_2^{a_1}$ partitions the belief space into regions labeled by observation strategies such as $(a_1, a_2, a_2)$ and $(a_1, a_1, a_2)$, i.e., the horizon-1 action chosen for each of $o_1$, $o_2$, $o_3$; each region corresponds to a horizon-2 policy tree with root action $a_1$ and one $\pi_1^*$ subtree per observation.

[Page 52]
Horizon-2 Tree for Action 1 (cont.)
[Diagram] Likewise, $Q_2^{a_2}$ partitions the belief space into regions labeled by observation strategies.
[Page 53]
The $\pi_2^*$ and $V_2^*$ (cont.)
[Diagram] $Q_2^{a_1}$ and $Q_2^{a_2}$ plotted over the belief space.

[Page 54]
The $\pi_2^*$ and $V_2^*$ (cont.)
[Diagram] $V_2^*(\mathbf{b}) = \max(Q_2^{a_1}(\mathbf{b}), Q_2^{a_2}(\mathbf{b}))$; the belief space is partitioned into regions labeled $a_1$, $a_2$, $a_1$, $a_2$.

[Page 55]
Horizon-2 Policy Tree
[Diagram] $\pi_2^*$: a root action followed, for each observation $o_1$, $o_2$, $o_3$, by a horizon-1 policy $\pi_1^*$.
Can you figure out how to determine the value function for horizon 3 from the above discussion?

[Page 56]
The $\pi_3^*$ and $V_3^*$
[Diagram] From $V_2^*$, form the transformed value functions $V_2^{a_1,o_1}(\mathbf{b})$, $V_2^{a_1,o_2}(\mathbf{b})$, $V_2^{a_1,o_3}(\mathbf{b})$ and $V_2^{a_2,o_1}(\mathbf{b})$, $V_2^{a_2,o_2}(\mathbf{b})$, $V_2^{a_2,o_3}(\mathbf{b})$; summing them with the immediate reward gives $Q_3^{a_1}(\mathbf{b})$ and $Q_3^{a_2}(\mathbf{b})$, and $V_3^*(\mathbf{b}) = \max_a Q_3^{a}(\mathbf{b})$.
[Page 57]
The $\pi_3^*$ and $V_3^*$ (cont.)
[Diagram] $Q_3^{a_1}(\mathbf{b})$ and $Q_3^{a_2}(\mathbf{b})$, each built from the branches for $o_1$, $o_2$, $o_3$.

[Page 58]
The $\pi_3^*$ and $V_3^*$ (cont.)
[Diagram] $V_3^* = \max_a Q_3^{a}$. How do we obtain $\pi_t^*$ and $V_t^*$ in general?

[Page 59]
Horizon-3 Policy Tree
[Diagram] $\pi_3^*$: a root action followed, for each observation $o_1$, $o_2$, $o_3$, by a horizon-2 policy $\pi_2^*$, each of which in turn branches into horizon-1 policies $\pi_1^*$.

[Page 60]
Reinforcement Learning
Partially Observable Markov Decision Processes (POMDP)
The PWLC Property of Value Function
[Page 61]
Value Function for POMDP
$Q_t^{a}(\mathbf{b}) = \rho(\mathbf{b}, a) + \gamma \sum_{\mathbf{b}'} \tau(\mathbf{b}, a, \mathbf{b}')\, V_{t-1}^*(\mathbf{b}')$
$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a)\, V_{t-1}^*(SE(\mathbf{b}, a, o)) \right]$
$V_1^*(\mathbf{b}) = \max_a \rho(\mathbf{b}, a) = \max_a \sum_i b(s_i)\, R(s_i, a)$

[Page 62]
Value Function for POMDP (cont.)
Let $\mathbf{r}_a = [R(s_1, a), R(s_2, a), \ldots]^T$. Then
$V_1^*(\mathbf{b}) = \max_a \mathbf{r}_a^T \mathbf{b}$

[Page 63]
Value Function for POMDP (cont.)
Let $\boldsymbol{\alpha}^{k,1} = \mathbf{r}_{a_k}$. Then
$V_1^*(\mathbf{b}) = \max_k (\boldsymbol{\alpha}^{k,1})^T \mathbf{b}$
[Page 64]
Theorem
$V_t^*(\mathbf{b})$ is PWLC.

[Page 65]
Proof ($V_t^*(\mathbf{b})$ is PWLC)
By induction:
- We already know that the claim holds for $V_1^*(\mathbf{b})$.
- Assume that it holds for $V_{t-1}^*(\mathbf{b})$.
- We then show that it must hold for $V_t^*(\mathbf{b})$.
[Page 66]
Proof (cont.)
$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a)\, V_{t-1}^*(SE(\mathbf{b}, a, o)) \right]$
From the assumption, we have
$V_{t-1}^*(SE(\mathbf{b}, a, o)) = \max_k (\boldsymbol{\alpha}^{k,t-1})^T SE(\mathbf{b}, a, o)$
and hence
$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a) \max_k (\boldsymbol{\alpha}^{k,t-1})^T SE(\mathbf{b}, a, o) \right]$

[Page 67]
Proof (cont.)
Let $\boldsymbol{\alpha}^{a,o}(\mathbf{b}) = \arg\max_{\boldsymbol{\alpha}^{k,t-1}} (\boldsymbol{\alpha}^{k,t-1})^T SE(\mathbf{b}, a, o)$. Then
$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_{o} P(o \mid \mathbf{b}, a)\, (\boldsymbol{\alpha}^{a,o}(\mathbf{b}))^T SE(\mathbf{b}, a, o) \right]$
Since $SE(\mathbf{b}, a, o) = \dfrac{T^{a,o}\, \mathbf{b}}{P(o \mid a, \mathbf{b})}$, the factor $P(o \mid a, \mathbf{b})$ cancels:
$V_t^*(\mathbf{b}) = \max_a \left[ \rho(\mathbf{b}, a) + \gamma \sum_{o} (\boldsymbol{\alpha}^{a,o}(\mathbf{b}))^T T^{a,o}\, \mathbf{b} \right]$

[Page 68]
Proof (cont.)
With $\rho(\mathbf{b}, a) = \mathbf{r}_a^T \mathbf{b}$, this becomes
$V_t^*(\mathbf{b}) = \max_a \left[ \mathbf{r}_a^T + \gamma \sum_{o} (\boldsymbol{\alpha}^{a,o}(\mathbf{b}))^T T^{a,o} \right] \mathbf{b}$
[Page 69]
Proof (cont.)
$V_t^*(\mathbf{b}) = \max_a \left[ \mathbf{r}_a^T + \gamma \sum_{o} (\boldsymbol{\alpha}^{a,o}(\mathbf{b}))^T T^{a,o} \right] \mathbf{b}$
Let $(\boldsymbol{\alpha}^{k,t})^T = \mathbf{r}_{a_i}^T + \gamma \sum_{o} (\boldsymbol{\alpha}^{a_i,o})^T T^{a_i,o}$, with one such vector for each action $a_i$ and each choice of the $\boldsymbol{\alpha}^{a_i,o}$. Then
$V_t^*(\mathbf{b}) = \max_k (\boldsymbol{\alpha}^{k,t})^T \mathbf{b}$
so $V_t^*(\mathbf{b})$ is PWLC. $\blacksquare$