Symbolic Dynamic Programming
Alan Fern *
* Based in part on slides by Craig Boutilier
Planning in Large State Space MDPs
- You have learned algorithms for computing optimal policies
  - Value Iteration
  - Policy Iteration
- These algorithms explicitly enumerate the state space
  - Often this is impractical
- Simulation-based planning and RL allowed for approximate planning in large MDPs
  - Did not utilize an explicit model of the MDP; only used a strong or weak simulator
- How can we get exact solutions to enormous MDPs?
Structured Representations
- Policy iteration and value iteration treat states as atomic entities with no internal structure
- In most cases, states actually do have internal structure
  - E.g., described by a set of state variables, or objects with properties and relationships
  - Humans exploit this structure to plan effectively
- What if we had a compact, structured representation for a large MDP and could efficiently plan with it?
  - Would allow for exact solutions to very large MDPs
A Planning Problem
Logical or Feature-based Problems
- For most AI problems, states are not viewed as atomic entities
  - They contain structure; for example, they are described by a set of boolean propositions/variables
  - |S| = 2^n is exponential in the number n of propositions
- Basic policy and value iteration do nothing to exploit the structure of the MDP when it is available
Solution?
- Require structured representations in terms of propositions
  - compactly represent transition function
  - compactly represent reward function
  - compactly represent value functions and policies
- Require structured computation
  - perform steps of PI or VI directly on structured representations
  - can avoid the need to enumerate the state space
- We start by representing the transition structure as dynamic Bayesian networks
Propositional Representations
- States decomposable into state variables (we will assume boolean variables)
- Structured representations are the norm in AI
  - Decision diagrams, Bayesian networks, etc.
  - Describe how actions affect/depend on features
  - Natural, concise, can be exploited computationally
- The same ideas can be used for MDPs
Robot Domain as Propositional MDP
- Propositional variables for single-user version
  - Loc (robot's location): Office, Entrance
  - T (lab is tidy): boolean
  - CR (coffee request outstanding): boolean
  - RHC (robot holding coffee): boolean
  - RHM (robot holding mail): boolean
  - M (mail waiting for pickup): boolean
- Actions/Events
  - move to an adjacent location, pickup mail, get coffee, deliver mail, deliver coffee, tidy lab
  - mail arrival, coffee request issued, lab gets messy
- Rewards
  - rewarded for tidy lab, satisfying a coffee request, delivering mail
  - (or penalized for their negation)
State Space
- State of MDP: an assignment to these six variables
  - 64 states
  - grows exponentially with the number of variables
- Transition matrices
  - 4032 parameters required per matrix
  - one matrix per action (6 or 7 or more actions)
- Reward function
  - 64 reward values needed
- Factored state and action descriptions will break this exponential dependence (generally)
Dynamic Bayesian Networks (DBNs)
- Bayesian networks (BNs) are a common representation for probability distributions
  - A graph (DAG) represents conditional independence
  - Conditional probability tables (CPTs) quantify local probability distributions
- Dynamic Bayes net action representation
  - one Bayes net for each action a, representing the set of conditional distributions Pr(S_{t+1} | A_t, S_t)
  - each state variable occurs at time t and t+1
  - dependence of t+1 variables on t variables depicted by directed arcs
DBN Representation: deliver coffee
[DBN diagram: arcs from T_t, L_t, CR_t, RHC_t, RHM_t, M_t to their time t+1 counterparts; CR_{t+1} depends on L_t, CR_t, and RHC_t]

Pr(CR_{t+1} | L_t, CR_t, RHC_t):

L  CR  RHC | CR(t+1)=T  CR(t+1)=F
O  T   T   |   0.2        0.8
E  T   T   |   1.0        0.0
O  F   T   |   0.1        0.9
E  F   T   |   0.1        0.9
O  T   F   |   1.0        0.0
E  T   F   |   1.0        0.0
O  F   F   |   0.1        0.9
E  F   F   |   0.1        0.9

Pr(T_{t+1} | T_t):

T | T(t+1)=T  T(t+1)=F
T |   0.91      0.09
F |   0.0       1.0

Pr(RHM_{t+1} | RHM_t):

RHM | RHM(t+1)=T  RHM(t+1)=F
T   |   1.0         0.0
F   |   0.0         1.0

Pr(S_{t+1} | S_t) is the product of each of the 6 tables.
Benefits of DBN Representation

Pr(S_{t+1} | S_t)
  = Pr(RHM_{t+1}, M_{t+1}, T_{t+1}, L_{t+1}, CR_{t+1}, RHC_{t+1} | RHM_t, M_t, T_t, L_t, CR_t, RHC_t)
  = Pr(RHM_{t+1} | RHM_t) * Pr(M_{t+1} | M_t) * Pr(T_{t+1} | T_t)
    * Pr(L_{t+1} | L_t) * Pr(CR_{t+1} | CR_t, RHC_t, L_t) * Pr(RHC_{t+1} | RHC_t, L_t)

- Only 20 parameters vs. 4032 for the full matrix
- Removes global exponential dependence

[Figure: the factored DBN contrasted with the full 64 x 64 transition matrix over states s1 … s64]
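As an illustrative sketch (not the slides' code) of this factorization, the transition probability can be computed as a product of per-variable CPT lookups. The T and RHM probabilities below come from the tables on the previous slide; for brevity only those two variables are modeled, and the dictionary/lambda encoding is an assumption of this sketch.

```python
# Per-variable CPTs: each maps the current state to Pr(X'=True | parents).
# T and RHM entries are taken from the slide's tables.
cpts = {
    "T":   lambda s: 0.91 if s["T"] else 0.0,   # Pr(T'=T | T)
    "RHM": lambda s: 1.0 if s["RHM"] else 0.0,  # Pr(RHM'=T | RHM)
}

def transition_prob(s, s_next):
    """Pr(s_next | s) = product over variables of the local CPT entries."""
    p = 1.0
    for var, cpt in cpts.items():
        p_true = cpt(s)
        p *= p_true if s_next[var] else (1.0 - p_true)
    return p

s = {"T": True, "RHM": False}
print(transition_prob(s, {"T": True, "RHM": False}))  # 0.91 * 1.0 = 0.91
```

With all six variables included, this product needs only the 20 local parameters rather than a 64 x 64 matrix.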
Structure in CPTs
- So far we have represented each CPT as a table of size exponential in the number of parents
- Notice that there is regularity in CPTs
  - e.g., Pr(CR_{t+1} | L_t, CR_t, RHC_t) has many similar entries
- Compact function representations for CPTs can be used to great effect
  - decision trees
  - algebraic decision diagrams (ADDs/BDDs)
- Here we show examples of decision trees (DTs)
Action Representation – DBN/DT
Decision tree (DT) for the deliver-coffee CPT, Pr(CR_{t+1}=true | L_t, CR_t, RHC_t):

CR(t)?
├─ f: 0.1
└─ t: RHC(t)?
   ├─ f: 1.0
   └─ t: L(t)?
      ├─ O: 0.2
      └─ E: 1.0

The leaves of the DT give Pr(CR_{t+1}=true | L_t, CR_t, RHC_t).

DTs can often represent conditional probabilities much more compactly than a full conditional probability table, e.g., if CR(t) = true and RHC(t) = false then CR(t+1) = true with probability 1.
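A minimal sketch of how such a DT can be encoded and evaluated. The nested-tuple encoding and the variable name `L_is_office` are assumptions of this sketch (the slide tests L directly); the leaf probabilities follow the deliver-coffee CPT.

```python
# Internal node = (variable, true_branch, false_branch); leaf = probability.
# Pr(CR'=True | L, CR, RHC), structured as on the slide: test CR, then RHC, then L.
cr_tree = ("CR",
           ("RHC",
            ("L_is_office", 0.2, 1.0),  # CR & RHC: outcome depends on location
            1.0),                        # CR & not RHC: request persists
           0.1)                          # no CR: new request w.p. 0.1

def evaluate(tree, state):
    """Walk the tree using the state's variable values; return the leaf value."""
    while isinstance(tree, tuple):
        var, t_branch, f_branch = tree
        tree = t_branch if state[var] else f_branch
    return tree

print(evaluate(cr_tree, {"CR": True, "RHC": False, "L_is_office": True}))  # 1.0
```

Only three internal tests are needed to cover all eight rows of the full CPT.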
Reward Representation
- Rewards represented with DTs in a similar fashion
  - Would require a vector of size 2^n for an explicit representation

Reward decision tree:

CR?
├─ t: -100   (high cost for unsatisfied coffee request)
└─ f: M?
   ├─ t: -10   (high, but lower, cost for undelivered mail)
   └─ f: T?
      ├─ f: -1   (cost for lab being untidy)
      └─ t: 1    (small reward for satisfying all of these conditions)
Structured Computation
- Given our compact decision tree (DBN) representation, can we solve the MDP without explicit state space enumeration?
- Can we avoid O(|S|) computations by exploiting regularities made explicit by the representation?
- We will study a general approach for doing this called structured dynamic programming
Structured Dynamic Programming
- We now consider how to perform dynamic programming techniques such as VI and PI using the problem structure
- VI and PI are based on a few basic operations
  - Here we will show how to perform these operations directly on tree representations of value functions, policies, and transition functions
- The approach is very general and can be applied to other representations (e.g., algebraic decision diagrams, situation calculus) and other problems once the main idea is understood
- We will focus on VI here, but the paper also describes a version of modified policy iteration
Recall Tree-Based Representations
[Figure: DBN for action A over variables X, Y, Z at times t and t+1, with CPTs drawn as decision trees (leaf values 0.0, 0.9, 1.0), alongside the reward function R as a tree that tests Z with leaves 10 and 0]

e.g., if X(t) = true then Y(t+1) = true w/ prob 0.9
e.g., if X(t) = false and Y(t) = true then Y(t+1) = true w/ prob 1

Note: we are leaving off time subscripts for readability and using X(t), Y(t), …, instead.
Recall that each action of the MDP has its own DBN.
Structured Dynamic Programming
- Value functions and policies can also have tree representations
  - Often much more compact than tables
- Our goal: compute the tree representations of the policy and value function given the tree representations of the transitions and rewards
Recall Value Iteration
Suppose that the initial value function V^0 is compactly represented as a tree (could initialize to 0).

Value Iteration (Bellman backup):

  V^{k+1}(s) = max_a [ R(s) + γ · Σ_{s'} Pr_a(s' | s) · V^k(s') ]

1. Show how to compute compact trees for the action-value functions
   Q_a^{k+1}(s) = R(s) + γ · Σ_{s'} Pr_a(s' | s) · V^k(s')
2. Use a max operation on the Q-trees (returns a single tree)
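For contrast with the symbolic version that follows, here is the flat, enumerated form of this loop as a minimal sketch (not the slides' code): each Bellman backup touches every state explicitly, which is exactly what SDP avoids.

```python
def value_iteration(states, actions, R, P, gamma=0.9, iters=100):
    """Flat VI: R maps state -> reward, P[a][s][s2] = Pr(s2 | s, a)."""
    V = {s: 0.0 for s in states}                      # V^0 := 0
    for _ in range(iters):
        # One Bellman backup over the whole (enumerated) state space.
        V = {s: max(R[s] + gamma * sum(P[a][s][s2] * V[s2] for s2 in states)
                    for a in actions)
             for s in states}
    return V

# Tiny made-up 2-state MDP: "go" flips the state, "stay" keeps it;
# only state 1 gives reward.
states, actions = [0, 1], ["stay", "go"]
R = {0: 0.0, 1: 1.0}
P = {"stay": {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}},
     "go":   {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}}}
print(value_iteration(states, actions, R, P))  # V[1] -> 10, V[0] -> 9 as k grows
```

The cost per backup is O(|S|^2 |A|), which is why an enumerated approach fails for the 2^n states of a propositional MDP.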
Symbolic Value Iteration
[Diagram: the current value tree V(X) is backed up through each action's DBN, Pr_{A=a}(S' | S), Pr_{A=b}(S' | S), …, Pr_{A=z}(S' | S), producing one Q-tree per action; a symbolic MAX over the Q-trees returns the new value tree]
The MAX Tree Operation

A tree partitions the state space, assigning a value to each region.

[Figure: two value trees over X and Y, with leaf values 0.0, 0.9, and 1.0, shown alongside the state-space partitions they induce]

The state-space max of the two trees assigns each region the larger of its two values. In general, how can we compute the tree representing the max?
The MAX Tree Operation
We can simply append one tree to the leaves of the other. The result makes all the distinctions that either tree makes, and each leaf holds a pair of values, one from each tree. The max operation is then taken at the leaves of the result.

[Figure: appending the second tree at each leaf of the first yields leaves labeled with value pairs such as (1.0, 0.0), (0.0, 1.0), (1.0, 0.9); taking the max at each leaf gives the combined tree]
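The append-then-combine idea can be sketched as a generic binary operation on trees (nested tuples `(var, true_branch, false_branch)`, numeric leaves). The two example trees are illustrative stand-ins for the ones in the figure, not an exact transcription.

```python
def tree_binary_op(t1, t2, op):
    """Append t2 below each leaf of t1, applying op where two leaves meet."""
    if not isinstance(t1, tuple):          # t1 is a leaf: descend into t2
        if not isinstance(t2, tuple):
            return op(t1, t2)              # both leaves: combine the values
        var, tb, fb = t2
        return (var, tree_binary_op(t1, tb, op), tree_binary_op(t1, fb, op))
    var, tb, fb = t1                       # copy t1's test, recurse on branches
    return (var, tree_binary_op(tb, t2, op), tree_binary_op(fb, t2, op))

t1 = ("X", 1.0, ("Y", 0.9, 0.0))
t2 = ("X", 1.0, 0.0)
print(tree_binary_op(t1, t2, max))
```

Note the result re-tests X below an X branch; those are the unreachable paths that the simplification step on the next slide removes. Other binary operations (sum, product) use the same recursion with a different `op`.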
The MAX Tree Operation
The resulting tree may have unreachable leaves, e.g., a path that tests the same variable twice with contradictory outcomes. We can simplify the tree by removing such paths.

[Figure: the maxed tree from the previous slide is simplified by deleting paths with contradictory tests, yielding a compact tree over X and Y]
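The simplification can be sketched as a walk that remembers which variables are already fixed on the current path. The encoding (nested tuples `(var, true_branch, false_branch)`) is an assumption of this sketch.

```python
def simplify(tree, context=None):
    """Remove re-tests of variables already decided on the path, and
    collapse nodes whose two branches turn out identical."""
    context = context or {}
    if not isinstance(tree, tuple):
        return tree
    var, tb, fb = tree
    if var in context:                    # already decided: only one branch
        return simplify(tb if context[var] else fb, context)  # is reachable
    tb = simplify(tb, {**context, var: True})
    fb = simplify(fb, {**context, var: False})
    return tb if tb == fb else (var, tb, fb)

# The max result from the previous slide, with redundant inner X tests:
messy = ("X", ("X", 1.0, 1.0), ("Y", ("X", 1.0, 0.9), ("X", 1.0, 0.0)))
print(simplify(messy))  # ("X", 1.0, ("Y", 0.9, 0.0))
```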
BINARY OPERATIONS
(other binary operations similar to max)
MARGINALIZATION
Compute the diagram representing Σ_A f(A, …), i.e., the function with variable A summed out. There are libraries for doing this.
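A minimal sketch of marginalization on trees: restrict the tree to A=true and to A=false, then add the two restrictions. The nested-tuple encoding and the example function are assumptions of this sketch.

```python
def restrict(tree, var, value):
    """Fix var to value throughout the tree, dropping its tests."""
    if not isinstance(tree, tuple):
        return tree
    v, tb, fb = tree
    if v == var:
        return restrict(tb if value else fb, var, value)
    return (v, restrict(tb, var, value), restrict(fb, var, value))

def tree_add(t1, t2):
    """Pointwise sum of two trees; aligned roots are merged directly."""
    if not isinstance(t1, tuple) and not isinstance(t2, tuple):
        return t1 + t2
    if isinstance(t1, tuple) and isinstance(t2, tuple) and t1[0] == t2[0]:
        return (t1[0], tree_add(t1[1], t2[1]), tree_add(t1[2], t2[2]))
    if isinstance(t1, tuple):
        return (t1[0], tree_add(t1[1], t2), tree_add(t1[2], t2))
    return (t2[0], tree_add(t1, t2[1]), tree_add(t1, t2[2]))

def sum_out(tree, var):
    """Marginalize: Σ_var tree = tree|var=T + tree|var=F."""
    return tree_add(restrict(tree, var, True), restrict(tree, var, False))

f = ("A", ("B", 3.0, 1.0), ("B", 2.0, 0.0))   # a function of A and B
print(sum_out(f, "A"))  # ("B", 5.0, 1.0)
```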
Symbolic Bellman Backup
for each action a, compute the Q-tree

  Q_a(S) = R(S) + γ · Σ_{S'} Pr_a(S' | S) · V(S'),   where S = (X_1, …, X_l), S' = (X'_1, …, X'_l)

with R, each Pr_a, and V all represented as trees; the sum over S' is performed by symbolically marginalizing out each X'_i.
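As a self-contained toy sketch of one such backup for a single binary state variable X (all trees, probabilities, and values below are made up for illustration; real implementations work over ADDs and many variables):

```python
def tree_map(t, f):
    """Apply f to every leaf of a (var, true_branch, false_branch) tree."""
    if not isinstance(t, tuple):
        return f(t)
    return (t[0], tree_map(t[1], f), tree_map(t[2], f))

def tree_add(t1, t2):
    """Pointwise sum; trees with aligned roots are merged directly."""
    if not isinstance(t1, tuple) and not isinstance(t2, tuple):
        return t1 + t2
    if isinstance(t1, tuple) and isinstance(t2, tuple) and t1[0] == t2[0]:
        return (t1[0], tree_add(t1[1], t2[1]), tree_add(t1[2], t2[2]))
    if isinstance(t1, tuple):
        return (t1[0], tree_add(t1[1], t2), tree_add(t1[2], t2))
    return (t2[0], tree_add(t1, t2[1]), tree_add(t1, t2[2]))

p_tree = ("X", 0.9, 0.0)        # Pr(X'=True | X), as a tree over X
v_true, v_false = 10.0, 0.0     # current value function V(X'), per leaf
r_tree = ("X", 10.0, 0.0)       # reward tree R(X)
gamma = 0.9

# E[V(X') | X] = Pr(X'=T|X) * V(T) + Pr(X'=F|X) * V(F), done on leaves:
ev = tree_add(tree_map(p_tree, lambda p: p * v_true),
              tree_map(p_tree, lambda p: (1 - p) * v_false))
q = tree_add(r_tree, tree_map(ev, lambda x: gamma * x))  # Q = R + gamma * EV
print(q)
```

Every step stays a tree over the current-state variables; no state is ever enumerated.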
SDP: Relative Merits
- Adaptive, nonuniform, exact abstraction method
  - provides exact solution to the MDP
  - much more efficient on certain problems (time/space)
  - 400 million state problems in a couple of hours
- Can formulate a similar procedure for modified policy iteration
- Some drawbacks
  - produces piecewise constant VF
  - some problems admit no compact solution representation
    - so the sizes of the trees blow up with enough iterations
    - approximation may be desirable or necessary
Approximate SDP
- Easy to approximate the solution using SDP
- Simple pruning of the value function
  - Simply "merge" leaves that have similar values
  - Can prune trees [BouDearden96] or ADDs [StaubinHoeyBou00]
- Gives regions of approximately the same value
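A minimal sketch of the leaf-merging idea. The slide's ADDs keep an interval at each merged leaf; for simplicity this sketch stores the interval's midpoint instead, and the example values echo a few leaves from the next slide's figure.

```python
def collect(tree):
    """Return all leaf values in the tree."""
    if not isinstance(tree, tuple):
        return [tree]
    return collect(tree[1]) + collect(tree[2])

def prune(tree, eps):
    """Collapse any subtree whose leaf values span at most eps."""
    if not isinstance(tree, tuple):
        return tree
    var, tb, fb = tree
    tb, fb = prune(tb, eps), prune(fb, eps)
    leaves = collect((var, tb, fb))
    if max(leaves) - min(leaves) <= eps:
        return (min(leaves) + max(leaves)) / 2.0   # midpoint of merged range
    return (var, tb, fb)

v = ("HCU", 10.0, ("HCR", 9.0, ("Loc", 8.45, 8.36)))
print(prune(v, 0.5))  # the Loc subtree collapses; the rest survives
```

Larger eps means fewer regions and faster backups, at the cost of a looser error bound.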
A Pruned Value ADD
[Figure: a value ADD over Loc, HCR, HCU, W, R, U with leaves ranging from 5.19 to 10.00; pruning merges nearby leaves into interval-valued leaves [9.00, 10.00], [7.45, 8.45], [6.64, 7.64], [5.19, 6.19]]
Approximate SDP: Relative Merits
- Fewer regions implies faster computation
  - 30-40 billion state problems in a couple of hours
  - allows fine-grained control of time vs. solution quality with dynamic error bounds
  - technical challenges: variable ordering, convergence, fixed vs. adaptive tolerance, etc.
- Some drawbacks
  - (still) produces piecewise constant VF
  - doesn't exploit additive structure of the VF at all
- Bottom line: when a problem matches the structural assumptions of SDP we can gain much, but many problems do not match the assumptions.
Ongoing Work
- Factored action spaces
  - Sometimes the action space is large, but has structure
  - For example, cooperative multi-agent systems
- Recent work (at OSU) has studied SDP for factored action spaces
  - Include action variables in the DBNs

[Diagram: a DBN whose nodes include both action variables and state variables]