Model-based Bayesian Reinforcement Learning in Partially Observable Domains
by Pascal Poupart and Nikos Vlassis
(2008 International Symposium on Artificial Intelligence and Mathematics)

Presented by Lihan He, ECE, Duke University
Oct 3, 2008
Outline

- Introduction
- POMDP represented as a dynamic decision network (DDN)
- Partially observable reinforcement learning
  - Belief update
  - Value function and optimal action
- Partially observable BEETLE
  - Offline policy optimization
  - Online policy execution
- Conclusion
Introduction
Final objective: learn optimal actions (a policy) that achieve the best reward.

POMDP: a partially observable Markov decision process, defined by the tuple $\langle S, A, T, O, R \rangle$, modeling sequential decision-making problems.

Reinforcement learning for POMDPs: solve the decision-making problem given feedback from the environment, when the dynamics of the environment ($T$ and $O$) are unknown and only the action-observation sequence (the history) is given.

- Model-based: explicitly model the environment.
- Model-free: avoid explicitly modeling the environment.
- Online learning: policy learning and execution happen at the same time.
- Offline learning: learn the policy first from training data, then execute it without further modification.
Introduction
This paper:

- Bayesian model-based approach.
- The prior belief is set as a mixture of products of Dirichlets.
- The posterior belief is again a mixture of products of Dirichlets.
- The value function (through its α-functions) is likewise represented by mixtures of products of Dirichlets.
- The number of mixture components increases exponentially with the time step.
- The PO-BEETLE algorithm addresses this growth.
POMDP and DDN
Redefine the POMDP as a dynamic decision network (DDN):
$G = \langle X, X', E \rangle$, where $X$ and $X'$ are the variables at two consecutive time steps.

For example, $X = \{A, B, C, D, R\}$, with state variables $S = \{B, C, D, R\}$, observation variables $O = \{D\} \subseteq S$, and reward variables $R = \{R\} \subseteq S$.
The observation and reward variables are subsets of the state variables. The conditional probability distributions $\Pr(s' \mid \mathrm{pa}_{s'})$ of the state variables jointly encode the transition, observation, and reward models $T$, $O$, and $R$.
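To make the encoding concrete (this is standard DDN semantics, not spelled out on the slide): the one-step dynamics factorize over the next-step variables, each conditioned on its parents in the graph,

$$\Pr(X' \mid X) \;=\; \prod_{x' \in X'} \Pr\big(x' \mid \mathrm{pa}_{x'}\big),$$

and since the observation and reward variables are contained among the state variables, $T$, $O$, and $R$ can all be read off this single product of local conditionals.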
POMDP and DDN
Given $X$, $S$, $R$, $O$, $A$, the edges $E$, and the dynamics $\Pr(s' \mid \mathrm{pa}_{s'})$:

- Belief update: after taking action $a$ and receiving observation $o'$, the belief over states is updated by Bayes' rule.
- Objective: find a policy that maximizes the expected total reward.
- The optimal value function satisfies Bellman's equation; value iteration algorithms optimize the value function by iteratively computing the right-hand side of Bellman's equation (both are written out below).
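In standard POMDP notation (the slide's equation images are not legible in the transcript; the discount factor $\gamma$ is our addition), these read:

$$b^{a,o'}(s') \;\propto\; \Pr(o' \mid s') \sum_{s} \Pr(s' \mid s, a)\, b(s),$$

$$V^*(b) \;=\; \max_{a}\Big[\, R(b,a) + \gamma \sum_{o'} \Pr(o' \mid b, a)\, V^*\big(b^{a,o'}\big) \Big], \qquad R(b,a) = \sum_s b(s)\, R(s,a).$$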
POMDP and DDN
For reinforcement learning, assume $X$, $S$, $R$, $O$, $A$ and the edges $E$ are known, but the dynamics $\Pr(s' \mid \mathrm{pa}_{s'})$ are unknown.

We augment the graph: the unknown dynamics are included in the graph, denoted by the parameters $\Theta$.

If the unknown model is static, the parameters do not change over time ($\theta' = \theta$).

The belief over $s$ becomes a joint belief over $s$ and $\theta$.
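Concretely (in our notation, not the slides'): with a static $\theta$, the joint belief update mirrors the ordinary belief update, carrying $\theta$ along:

$$b^{a,o'}(s',\theta) \;\propto\; \sum_{s} \Pr(s', o' \mid s, a;\, \theta)\; b(s,\theta),$$

where $\Pr(s', o' \mid s, a; \theta)$ is the product of the local conditionals $\Pr(x' \mid \mathrm{pa}_{x'})$ specified by $\theta$ (recall that $o'$ is part of the next-step state variables).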
PORL: belief update
Prior setting for the belief: a mixture of products of Dirichlets.

The posterior belief (after taking action $a$ and receiving observation $o'$) is again a mixture of products of Dirichlets.

Problem: the number of mixture components increases by a factor of $|S|$ per step (exponential growth with time); see the sketch below.
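A sketch of why this family is closed under the update (the component index $i$ and count vectors $n^i_k$ are our notation): take a prior of the form

$$b(s,\theta) \;=\; \sum_{i} c_{s,i} \prod_{k} \mathrm{Dir}\big(\theta_k;\, n^{i}_{k}\big).$$

Each Bayes update multiplies $b$ by terms that are monomials in $\theta$, and a monomial times a Dirichlet is again a rescaled Dirichlet; however, marginalizing out the hidden previous state $s$ sums over $|S|$ such products, so every component spawns $|S|$ offspring, which is exactly the factor-of-$|S|$ growth noted above.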
PORL: value function and optimal action
The augmented POMDP is hybrid, with discrete state variables S and continuous model variables Θ
Discrete-state POMDP: $V^*(b) = \max_{\alpha} \alpha(b)$, with $\alpha(b) = \sum_{s} \alpha(s)\, b(s)$.

Continuous-state POMDP [1]: $\alpha(b) = \int \alpha(s)\, b(s)\, ds$.

[1] Porta, J. M.; Vlassis, N. A.; Spaan, M. T. J.; and Poupart, P. 2006. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research 7:2329–2367.

Hybrid-state POMDP: $\alpha(b) = \sum_{s} \int \alpha(s,\theta)\, b(s,\theta)\, d\theta$.

The α-function $\alpha(s,\theta)$ can also be represented as a mixture of products of Dirichlets.
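One reason this representation is convenient (a sketch in our notation): when both $\alpha$ and $b$ are mixtures of products of Dirichlets, the inner products needed for $\alpha(b)$ reduce to integrals of products of Dirichlet densities, which have closed form:

$$\int \mathrm{Dir}(\theta; m)\,\mathrm{Dir}(\theta; n)\, d\theta \;=\; \frac{B(m + n - \mathbf{1})}{B(m)\,B(n)},$$

where $B(\cdot)$ is the multivariate Beta function and $\mathbf{1}$ the all-ones vector, so $\alpha(b)$ can be evaluated analytically, component by component.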
PORL: value function and optimal action
Assume the value function with $k$ steps to go is $V^k(b)$; then the value function with $k+1$ steps to go, $V^{k+1}(b)$, is computed by a Bellman backup decomposed into 3 steps (reconstructed below):

1) back up the value at $b$ to obtain $V^{k+1}(b)$;
2) find the optimal action for belief $b$;
3) find the corresponding α-function.

Problem: the number of mixture components again increases by a factor of $|S|$ (exponential growth with time).
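In standard point-based backup notation (our reconstruction; the slides' formulas are not legible in the transcript, and the discount factor $\gamma$ is assumed):

1) $V^{k+1}(b) = \max_a \big[\, R(b,a) + \gamma \sum_{o'} \Pr(o' \mid b, a)\, V^k(b^{a,o'}) \big]$;
2) $a^* = \arg\max_a$ of the bracketed expression;
3) $\alpha_b(s,\theta) = R(s,a^*) + \gamma \sum_{o'} \sum_{s'} \Pr(s', o' \mid s, a^*;\, \theta)\, \alpha^{a^*,o'}(s',\theta)$,

where $\alpha^{a^*,o'}$ is the $k$-step α-function that maximizes the value at the updated belief $b^{a^*,o'}$. Because step 3 multiplies Dirichlet mixtures by monomials in $\theta$ and sums over $s'$, each backup again multiplies the number of components by $|S|$.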
PO-BEETLE: offline policy optimization
Policy learning is performed offline, given sufficient training data (action-observation sequences).
PO-BEETLE: offline policy optimization
Keep the number of mixture components of the α-functions bounded:

- Approach 1: approximation using basis functions.
- Approach 2: approximation by keeping the most important components (see the sketch below).
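The slides give no formulas for either approach; as a minimal sketch of Approach 2, under the assumption that "important" means largest mixture weight, one can simply truncate the mixture (function and variable names here are ours):

```python
import numpy as np

def truncate_mixture(weights, components, k):
    """Keep the k components with the largest weights and renormalize.

    weights:    1-D array of non-negative mixture weights.
    components: list of component parameters (e.g., Dirichlet count vectors),
                aligned with `weights`.
    k:          number of components to keep.
    """
    weights = np.asarray(weights, dtype=float)
    top = np.argsort(weights)[::-1][:k]              # indices of the k largest weights
    new_weights = weights[top] / weights[top].sum()  # renormalize the kept mass
    new_components = [components[i] for i in top]
    return new_weights, new_components

# Example: shrink a 5-component mixture to its 2 most important components.
w, comps = truncate_mixture([0.4, 0.05, 0.3, 0.2, 0.05],
                            [f"Dir_{i}" for i in range(5)], k=2)
print(w, comps)   # [0.571... 0.428...] ['Dir_0', 'Dir_2']
```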
PO-BEETLE: online policy execution
Given the policy, the agent executes it and updates the belief online.

Keep the number of mixture components of the belief $b$ bounded:

- Approach 1: approximation using importance sampling.
PO-BEETLE: online policy execution
- Approach 2: particle filtering, which simultaneously updates the belief and reduces the number of mixture components (a toy sketch follows):
  - sample one updated component (after taking $a$ and receiving $o'$) per particle;
  - represent the updated belief by $k$ particles.
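The transcript preserves only the bullet text; the following toy illustration of the idea is ours, not the paper's code (in particular, the observation model is assumed known for simplicity, and only transition counts are learned). Each particle is one mixture component, a (state, Dirichlet-count) pair, and after each (a, o') step a single successor component is sampled per particle instead of keeping all |S| offspring:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, O = 3, 2, 2                                 # toy problem sizes (ours)

# True (unknown) dynamics, used only to simulate the environment.
T_true = rng.dirichlet(np.ones(S), size=(S, A))   # T_true[s, a] -> dist over s'
Z_true = rng.dirichlet(np.ones(O), size=S)        # Z_true[s'] -> dist over o'

# k particles: (current state, Dirichlet transition counts).
k = 20
particles = [(rng.integers(S), np.ones((S, A, S))) for _ in range(k)]

def update(particles, a, o_obs):
    """After taking action a and observing o_obs, sample ONE updated
    component per particle, then resample k particles by weight."""
    new_particles, weights = [], []
    for s, n in particles:
        t_mean = n[s, a] / n[s, a].sum()          # expected transition probs
        lik = t_mean * Z_true[:, o_obs]           # joint weight of each s'
        s2 = rng.choice(S, p=lik / lik.sum())     # sample one successor state
        n2 = n.copy()
        n2[s, a, s2] += 1                         # Bayesian count update
        new_particles.append((s2, n2))
        weights.append(lik.sum())
    w = np.array(weights) / np.sum(weights)
    idx = rng.choice(len(new_particles), size=k, p=w)
    return [new_particles[i] for i in idx]

# Simulate a few steps under a random policy.
s_env = 0
for _ in range(5):
    a = rng.integers(A)
    s_env = rng.choice(S, p=T_true[s_env, a])
    o = rng.choice(O, p=Z_true[s_env])
    particles = update(particles, a, o)

print("posterior state histogram:",
      np.bincount([p[0] for p in particles], minlength=S))
```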
Conclusion
- Bayesian model-based reinforcement learning.
- The prior belief is a mixture of products of Dirichlets.
- The posterior belief is also a mixture of products of Dirichlets, with the number of mixture components growing exponentially with time.
- α-functions (associated with value functions) are also represented as mixtures of products of Dirichlets that grow exponentially with time.
- The partially observable BEETLE (PO-BEETLE) algorithm.