Active Learning in POMDPs

Transcript of Active Learning in POMDPs
Active Learning in POMDPs
Robin JAULMES
Supervisors: Doina PRECUP and Joelle PINEAU
McGill University
Outline
1) Partially Observable Markov Decision Processes (POMDPs)
2) Active Learning in POMDPs
3) The MEDUSA algorithm.
Markov Decision Processes (MDPs)

A Markov Decision Process has:
States S
Actions A
Probabilistic transitions P(s'|s,a)
Immediate rewards R(s,a)
A discount factor γ
The current state is always perfectly observed.
Partially Observable Markov Decision Processes (POMDPs)

A POMDP has:
States S
Actions A
Probabilistic transitions T(s,a,s')
Immediate rewards R(s,a)
A discount factor γ
Observations Z
Observation probabilities O(s',a,z)
An initial belief b0
Applications of POMDPs
POMDPs can model environments in which the state is not fully observed, which enables applications in:
Dialogue management
Vision
Robot navigation
High-level control of robots
Medical diagnosis
Network maintenance
A POMDP example: The Tiger Problem
The Tiger Problem
Description:
2 states: Tiger_Left, Tiger_Right
3 actions: Listen, Open_Left, Open_Right
2 observations: Hear_Left, Hear_Right
Rewards are:
-1 for the Listen action
-100 for the Open_Left action in the Tiger_Left state
+10 for the Open_Right action in the Tiger_Left state
The Tiger Problem
Furthermore:
The Listen action does not change the state.
The Open actions put the tiger behind either door with 50% chance.
The Open actions lead to a useless observation (50% Hear_Left, 50% Hear_Right).
The Listen action gives the correct information 85% of the time.
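The Tiger model above can be written down directly as arrays. A minimal sketch in numpy; the state/action/observation index choices, the symmetric reward entries, and the array layouts are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# states: 0 = Tiger_Left, 1 = Tiger_Right
# actions: 0 = Listen, 1 = Open_Left, 2 = Open_Right
# observations: 0 = Hear_Left, 1 = Hear_Right

# T[a, s, s'] = P(s' | s, a)
T = np.zeros((3, 2, 2))
T[0] = np.eye(2)        # Listen does not change the state
T[1] = T[2] = 0.5       # opening a door resets the tiger uniformly

# O[a, s', z] = P(z | s', a)
O = np.zeros((3, 2, 2))
O[0] = [[0.85, 0.15],   # Listen reports the correct side 85% of the time
        [0.15, 0.85]]
O[1] = O[2] = 0.5       # opening yields an uninformative observation

# R[s, a]; the Tiger_Right row mirrors the rewards given for Tiger_Left
R = np.array([[-1.0, -100.0,   10.0],   # Tiger_Left
              [-1.0,   10.0, -100.0]])  # Tiger_Right
```

Each transition and observation row is a proper probability distribution, which is easy to check by summing over the last axis.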
Solving a POMDP
To solve a POMDP is to find, for any action/observation history, the action that maximizes the expected discounted reward E[Σ_t γ^t r_t].
The belief state
Instead of maintaining the complete action/observation history, we maintain a belief state b.
The belief is a probability distribution over the states, so dim(b) = |S| − 1 (the probabilities sum to 1).
The belief space

Here is a representation of the belief space when we have two states (s0,s1).
Here is a representation of the belief space when we have three states (s0,s1,s2).
Here is a representation of the belief space when we have four states (s0,s1,s2,s3).

The belief space is continuous, but we only visit a countable number of belief points.
The Bayesian update
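The update equation on this slide did not survive extraction; the standard Bayesian belief update is b'(s') ∝ O(s',a,z) Σ_s T(s,a,s') b(s). A minimal numpy sketch, where the T[a, s, s'] / O[a, s', z] array layout is an illustrative assumption:

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayesian belief update: b'(s') ∝ O(s',a,z) * sum_s T(s,a,s') b(s)."""
    b_next = O[a, :, z] * (b @ T[a])   # predict, then weight by observation
    return b_next / b_next.sum()       # normalize (Bayes' denominator)

# Tiger example: listen (a=0) from the uniform belief and hear the tiger left (z=0).
T = np.zeros((3, 2, 2)); T[0] = np.eye(2); T[1] = T[2] = 0.5
O = np.zeros((3, 2, 2)); O[0] = [[0.85, 0.15], [0.15, 0.85]]; O[1] = O[2] = 0.5
b = belief_update(np.array([0.5, 0.5]), a=0, z=0, T=T, O=O)
print(b)  # → [0.85 0.15]
```

One Listen/Hear_Left step moves the uniform belief to exactly the 85/15 split given by the observation accuracy.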
Value Function in POMDPs
We will compute the value function over the belief space.
Hard: the belief space is continuous!
But we can use a property of the optimal value function for a finite horizon: it is piecewise-linear and convex.
We can represent any finite-horizon solution by a finite set of alpha-vectors.
V(b) = max_α[Σ_s α(s)b(s)]
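Evaluating V(b) over a set of alpha-vectors is just a max of dot products. A small sketch; the two alpha-vectors below are made-up placeholders, not the Tiger solution:

```python
import numpy as np

def value(b, alphas):
    """V(b) = max over alpha-vectors of alpha . b: the upper surface
    of a set of hyperplanes over the belief simplex."""
    return max(float(np.dot(alpha, b)) for alpha in alphas)

# Two illustrative alpha-vectors over a 2-state belief space.
alphas = [np.array([10.0, -5.0]), np.array([-5.0, 10.0])]
print(value(np.array([0.5, 0.5]), alphas))  # → 2.5
```

At a corner of the simplex the value is just the best single entry, since the belief puts all mass on one state.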
Alpha-Vectors
Alpha-vectors are a set of hyperplanes which define the value function. At each belief point, the value function is equal to the hyperplane with the highest value.
Value Iteration in POMDPs
Value iteration: initialize the value function with the horizon-1 value:

V(b) = max_a Σ_s R(s,a) b(s)

This produces one alpha-vector per action.
Compute the value function at the next iteration using Bellman's equation:

V(b) = max_a [Σ_s R(s,a) b(s) + γ Σ_z max_α Σ_s Σ_s' T(s,a,s') O(s',a,z) α(s') b(s)]
PBVI: Point-Based Value Iteration
Always keep a bounded number of alpha-vectors.
Use value iteration starting from belief points on a grid to produce new sets of alpha-vectors.
Stop after n steps (finite horizon).
The solution is approximate but found in a reasonable amount of time and memory.
Good tradeoff between computation time and quality.
See [Pineau et al., 2003].
Learning a POMDP
What happens if we don't know the model of the POMDP for sure? We have to learn it. The two solutions in the literature are:
EM-based approaches (prone to local minima)
History-based approaches (require on the order of 1,000,000 samples for 2-state problems) [Singh et al. 2003]
Active Learning
In an Active Learning Problem the learner has the ability to influence its training data.
The learner asks for the data that is most useful given its current knowledge.
Methods to find the most useful query are given by Cohn et al. (1995).
Active Learning (Cohn et al. 95)
Their method, used for function approximation tasks, is based on finding the query that will minimize the estimated variance of the learner.
They showed how this could be done exactly:
For a mixture-of-Gaussians model.
For locally weighted regression.
Applying Active Learning to POMDPs
We will suppose in this work that we have an oracle to determine the hidden state of a system on request.
However, this action is costly and we want to use it as little as possible.
In this setting, the active learning query will be to ask for the hidden state.
Applying Active Learning to POMDPs

We propose two solutions:
Integrate the model uncertainty and the query possibility inside the POMDP framework to take advantage of existing algorithms.
The MEDUSA algorithm. It uses a Dirichlet distribution over possible models to determine which actions to take and which queries to ask.
Decision-Theoretic Model Learning
We want to integrate into the POMDP model the fact that:
We have only a rough estimate of its parameters.
The agent can query the hidden state.
These queries should not be used too often, and only to learn.
Decision-Theoretic Model Learning
So we modify our POMDP:
For each uncertain parameter we introduce an additional state feature, discretized into n levels.
At initialization we are uniformly distributed among these n groups of states, but we remain in the same group as transitions occur.
We introduce a query action that returns the hidden state.
This action is attached to a negative reward Rq.
Then we solve this new POMDP using the usual methods.
Decision-Theoretic Model Learning
D-T Planning: Results
DT-Planning: Conclusions
Theoretically sound, but:
The results are very sensitive to the value of the query penalty, which is therefore very difficult to set.
The number of states becomes exponential in the number of uncertain parameters! This greatly increases the complexity of the problem.
With MEDUSA, we leave the theoretical guarantees of optimality to get a tractable algorithm.
MEDUSA: The main ideas
Markovian Exploration with Decision based on the Use of Samples Algorithm
Use Dirichlet distributions to represent current knowledge about the parameters of the POMDP model.
Sample models from the distribution.
Use the models to take actions that could be good.
Use queries to improve current knowledge.
Dirichlet distributions
Let X ∈ {1, 2, ..., N}. X is drawn from a multinomial distribution with parameters (θ1, ..., θN) iff p(X = i) = θi.
The Dirichlet distribution is a distribution over multinomial distribution parameters (over tuples (θ1, ..., θN) such that θi > 0 and Σi θi = 1).
Dirichlet distributions
Dirichlet distributions have parameters <α1… αN> s.t. αi>0.
We can sample from Dirichlet distributions by using Gamma distributions.
The most likely parameters under a Dirichlet distribution are θi = (αi − 1) / (Σj αj − N), the mode of the distribution (valid when all αi > 1).
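Sampling from a Dirichlet via Gamma draws, as mentioned above, is a standard construction; this sketch is mine, not code from the talk:

```python
import random

def sample_dirichlet(alphas):
    """Draw theta ~ Dirichlet(alphas) by normalizing independent Gamma draws.

    If g_i ~ Gamma(alpha_i, 1), then (g_1, ..., g_N) / sum_j g_j
    follows a Dirichlet(alpha_1, ..., alpha_N) distribution.
    """
    gammas = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

theta = sample_dirichlet([5.0, 1.0, 1.0])
```

The result is always a valid multinomial parameter vector: positive entries summing to 1.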
Dirichlet distributions
We can also compute the probability density of multinomial distribution parameters under the Dirichlet: p(θ1, ..., θN) ∝ Πi θi^(αi − 1).
The MEDUSA algorithm
Step 1: initialize the Dirichlet distribution.
Step 2: sample k(=20) POMDPs from the Dirichlet distribution and compute their probabilities according to the Dirichlet. Normalize them to get the weights.
Step 3: solve the k POMDPs with an approximate method (PBVI, finite horizon)
The MEDUSA algorithm
Step 4: run the experiment…
At each time step:
Compute the optimal action for each POMDP.
Execute one of them.
Update the belief for each POMDP.
If some conditions are met, do a state query.
Update the Dirichlet parameters according to this query.
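The Dirichlet update after a query is a count increment on the parameters matching the revealed transition. A hedged sketch; the dict layout, the learning rate `lam`, and the exact bookkeeping are my illustrative choices, not MEDUSA's published details:

```python
def query_update(alpha_T, alpha_O, s, a, s_true, z, lam=1.0):
    """After a query reveals the transition s -> s_true under action a,
    with observation z, increment the matching Dirichlet counts by lam."""
    alpha_T[(a, s)][s_true] += lam
    alpha_O[(a, s_true)][z] += lam

# Tiger-sized counts: 3 actions, 2 states, 2 observations, uniform prior.
alpha_T = {(a, s): [1.0, 1.0] for a in range(3) for s in range(2)}
alpha_O = {(a, s): [1.0, 1.0] for a in range(3) for s in range(2)}
query_update(alpha_T, alpha_O, s=0, a=0, s_true=0, z=0)
```

Only the counts for the observed (action, state) pairs change; all other Dirichlet parameters are untouched.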
The MEDUSA algorithm
At each time step:
Recompute the POMDP weights.
At fixed intervals, erase the POMDP with the lowest weight and redraw another according to the current Dirichlet distribution.
Compute the belief of the new POMDP according to the action/observation history up to the current time.
Theoretical analysis
We can compute the policy to which MEDUSA converges with an infinite number of models using integrals over the whole space of models.
Under some assumptions over the POMDP, we can prove that MEDUSA converges to the true model.
MEDUSA on Tiger
Evolution of mean discounted reward with time steps (query at every step)
Diminishing the complexity
The algorithm is flexible: we can have a wide variety of priors.
Some parameters may be certain. Parameters can also be made dependent (by using the same alpha-parameters for different distributions).
If we have additional information about the POMDP's dynamics, we can therefore reduce the number of alpha-parameters.
Diminishing the complexity
On the Tiger problem, if we know that:
The Listen action does not change the state.
The problem is symmetric.
Opening a door brings an uninformative observation and puts the tiger back behind each door with probability 0.5.
Then we can reduce the number of alpha-parameters from 24 to 2.
MEDUSA on simplified Tiger
Evolution of mean discounted reward with time steps (query at every step)
Blue: normal problem Black: simplified problem
Learning without query
The alternate belief β keeps track of the knowledge brought by the last query.
The non-query update updates each parameter proportionally to the probability that a query would have of updating it, given the alternate belief and the last action/observation.
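A sketch of that proportional update for the transition counts, under the assumption (mine, not the slides') that the credit for each (s, s') pair is its posterior probability given the alternate belief, the action, and the observation:

```python
def nonquery_update(alpha_T, beta, a, z, T, O, lam=0.1):
    """Increment Dirichlet transition counts in expectation: each (s, s')
    pair is credited with its posterior probability given beta, a, z."""
    n = len(beta)
    # Unnormalized posterior over (s, s') given observation z.
    w = [[beta[s] * T[a][s][sp] * O[a][sp][z] for sp in range(n)]
         for s in range(n)]
    total = sum(sum(row) for row in w)
    for s in range(n):
        for sp in range(n):
            alpha_T[(a, s)][sp] += lam * w[s][sp] / total

# Tiger numbers: Listen (a=0) keeps the state; hearing is 85% accurate.
T = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [0.5, 0.5]],
     [[0.5, 0.5], [0.5, 0.5]]]
O = [[[0.85, 0.15], [0.15, 0.85]],
     [[0.5, 0.5], [0.5, 0.5]],
     [[0.5, 0.5], [0.5, 0.5]]]
alpha_T = {(a, s): [1.0, 1.0] for a in range(3) for s in range(2)}
nonquery_update(alpha_T, beta=[0.5, 0.5], a=0, z=0, T=T, O=O)
```

Because no state is revealed, every count moves a little (weighted by posterior probability) rather than one count moving by a full step; this is the source of the higher variance discussed next.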
Learning without query
Non-query learning:
Has high variance: the learning rate needs to be lower, therefore more time steps are needed.
Is prone to local minima: convergence to the correct values is not guaranteed.
Can converge to the right solution if the initial prior is "good enough".
MEDUSA should use non-query learning once it has done "enough" query learning.
Choosing when to query
There are different heuristics to choose when to do a query:
Always (up to a certain number of queries).
When models disagree.
When value functions for the models differ.
When the beliefs in the different models differ.
When information from the last query has been lost.
Not when a query would bring no information.
Non-Query Learning on Tiger
Mean discounted reward, and number of queries.
Blue: Query learning Black: NQ learning
![Page 47: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/47.jpg)
Picking actions during learning
Take one model and do its best action.
Consider every model and every action; do the action with the highest overall value.
Compute the mean value of every action, and probabilistically take one of them according to the Boltzmann method.
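The Boltzmann method above can be sketched as a softmax over the mean action values; the temperature parameter and function shape are illustrative assumptions, not the talk's exact formulation:

```python
import math
import random

def boltzmann_pick(mean_values, temperature=1.0):
    """Pick an action index with probability proportional to
    exp(mean_value / temperature): higher-valued actions are favored,
    but every action keeps some probability of being explored."""
    prefs = [math.exp(q / temperature) for q in mean_values]
    total = sum(prefs)
    r, acc = random.uniform(0.0, total), 0.0
    for a, p in enumerate(prefs):
        acc += p
        if r <= acc:
            return a
    return len(prefs) - 1
```

Lowering the temperature makes the choice nearly greedy; raising it makes it nearly uniform.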
![Page 49: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/49.jpg)
Different action-pickings on Tiger
Evolution of mean discounted reward with time steps (query at every step)
Blue: Highest overall value Black: Pick one model
![Page 50: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/50.jpg)
Learning with non-stationary models

If the parameters of the model change unpredictably with time: at every time step, decay the alpha-parameters by some factor, so that new experience weighs more than old experience.
If the parameters do not vary too much, non-query learning is sufficient to keep track of their evolution.
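The decay is a single multiplicative step over all Dirichlet counts; a minimal sketch, where the 0.99 factor is an illustrative choice:

```python
def decay_alphas(alphas, decay=0.99):
    """Geometrically discount all Dirichlet counts so that recent
    experience dominates old experience."""
    for key in alphas:
        alphas[key] = [a * decay for a in alphas[key]]

alpha_T = {(0, 0): [5.0, 3.0]}
decay_alphas(alpha_T)
```

Since a Dirichlet's mean depends only on the ratios of its counts, the decay leaves current estimates unchanged while shrinking their effective sample size, so newer updates move them faster.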
![Page 51: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/51.jpg)
Non-stationary Tiger problem

Change in p (the probability of a correct observation with the Hear action) at time 0.
Evolution of mean discounted reward with time steps.
![Page 52: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/52.jpg)
The Hallway problem
60 states, 5 actions, 17 observations.
The number of alpha-parameters corresponding to a reasonable prior is 84.
![Page 53: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/53.jpg)
Hallway: reward evolution
Evolution of mean discounted reward with time steps
![Page 54: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/54.jpg)
Hallway: query number
Evolution of number of queries with time steps
![Page 55: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/55.jpg)
Benchmark POMDPs
![Page 56: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/56.jpg)
MEDUSA and Robotics
We have interfaced MEDUSA with a robotic simulator (Carmen) which can be used to simulate POMDP learning problems.
We present experimental results on the HIDE problem.
![Page 57: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/57.jpg)
The HIDE problem

The robot is trying to capture a moving agent on the following map.
The movements of the robot are deterministic, but the behavior of the person is unknown and is learned during execution.
The problem is formulated as a POMDP with 362 states, 22 observations and 5 actions. To model the behavior of the moving agent we learn 52 alpha-parameters.
![Page 58: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/58.jpg)
Results on the HIDE problem
Evolution of mean discounted reward with time steps
![Page 59: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/59.jpg)
Results on the HIDE problem
Evolution of number of queries with time steps
![Page 60: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/60.jpg)
Conclusion
Advantages:
Learned models can be re-used.
If a parameter in the environment changes slightly, MEDUSA can detect it online and without queries.
Convergence is theoretically guaranteed.
The number of queries requested and the length of training are tractable, even in large problems.
It can be applied to robotics.
![Page 61: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/61.jpg)
Conclusion
But:
The assumption of an oracle is strong. However, we do not need to know the query result immediately.
The algorithm has many parameters which require tuning for it to work properly.
Convergence is guaranteed only under certain conditions (for a certain policy, for certain POMDPs, for an infinite number of models and a perfect policy).
![Page 62: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/62.jpg)
References
Jaulmes, R., Pineau, J., Precup, D. "Learning in Non-Stationary Partially Observable Markov Decision Processes", ECML Workshop on Non-Stationarity in RL, 2005.
Jaulmes, R., Pineau, J., Precup, D. "Active Learning in Partially Observable Markov Decision Processes", ECML, 2005.
Cohn, D.A., Ghahramani, Z., Jordan, M.I. "Active Learning with Statistical Models", NIPS, 1995.
Pineau, J., Gordon, G., Thrun, S. "Point-Based Value Iteration: An Anytime Algorithm for POMDPs", IJCAI, 2003.
Dearden, R., Friedman, N., Andre, D. "Model-Based Bayesian Exploration", UAI, 1999.
Singh, S., Littman, M., Jong, N.K., Pardoe, D., Stone, P. "Learning Predictive State Representations", ICML, 2003.
![Page 63: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/63.jpg)
Questions?
![Page 64: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/64.jpg)
Definitions
![Page 65: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/65.jpg)
Convergence of the policy
![Page 66: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/66.jpg)
Convergence of MEDUSA
![Page 67: Active Learning in POMDPs](https://reader033.fdocuments.net/reader033/viewer/2022052510/56814691550346895db3b039/html5/thumbnails/67.jpg)
Non-query update equations