
Pathologies of Temporal Difference Methods in Approximate Dynamic Programming

Dimitri P. Bertsekas

Abstract—Approximate policy iteration methods based on temporal differences are popular in practice, and have been tested extensively, dating to the early nineties, but the associated convergence behavior is complex, and not well understood at present. An important question is whether the policy iteration process is seriously hampered by oscillations between poor policies, roughly similar to the attraction of gradient methods to poor local minima. There has been little apparent concern in the approximate DP/reinforcement learning literature about this possibility, even though it has been documented with several simple examples. Recent computational experimentation with the game of tetris, a popular testbed for approximate DP algorithms over a 15-year period, has brought the issue to sharp focus. In particular, using a standard set of 22 features and temporal difference methods, an average score of a few thousand was achieved. Using the same features and a random search method, an overwhelmingly better average score was achieved (600,000-900,000). The paper explains the likely mechanism of this phenomenon, and derives conditions under which it will not occur.

Proc. of 2010 Conference on Decision and Control, Atlanta, Ga., Dec. 2010

I. INTRODUCTION

In this paper we discuss some phenomena that hamper the effectiveness of approximate policy iteration methods for finite-state stochastic dynamic programming (DP) problems. These are iterative methods that are patterned after the classical policy iteration method, with each iteration involving evaluation of the cost vector of a current policy, and a policy improvement process aiming to obtain an improved policy.

We focus on the classical discounted finite-state Markovian Decision Problem (MDP) as described in textbooks such as [Ber07] and [Put94]. Here, for a given policy µ, the policy evaluation phase involves the approximate solution of the Bellman equation

J = gµ + αPµJ, (1)

where gµ ∈ ℜn is the one-stage cost vector of the policy µ, Pµ is its n × n transition probability matrix, and α ∈ (0, 1) is a discount factor.1 The unique solution of Eq. (1) is denoted Jµ and is the cost vector of policy µ.
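As a point of reference, the sketch below (not from the paper; all numbers are illustrative assumptions) solves the Bellman equation (1) exactly for a small two-state policy by treating it as the linear system (I − αPµ)Jµ = gµ. Approximate methods become necessary precisely when n is too large for this direct solve.

```python
import numpy as np

# Illustrative two-state policy (made-up data, not from the paper).
alpha = 0.9                                   # discount factor in (0, 1)
P_mu = np.array([[0.7, 0.3],                  # transition probability matrix of mu
                 [0.4, 0.6]])
g_mu = np.array([1.0, 2.0])                   # one-stage cost vector of mu

# Exact policy evaluation: solve J = g_mu + alpha * P_mu @ J,
# i.e., the linear system (I - alpha * P_mu) J = g_mu.
J_mu = np.linalg.solve(np.eye(2) - alpha * P_mu, g_mu)
print(J_mu)                                   # cost vector J_mu of the policy
```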

Dimitri Bertsekas is with the Dept. of Electr. Engineering and Comp. Science, M.I.T., Cambridge, Mass., 02139. [email protected]

Dimitri Bertsekas was supported by NSF Grant ECCS-0801549, by the LANL Information Science and Technology Institute, and by the Air Force Grant FA9550-10-1-0412.

Many thanks are due to Huizhen (Janey) Yu for many helpful discussions and suggestions.

1 In our notation ℜn is the n-dimensional Euclidean space, all vectors in ℜn are viewed as column vectors, and a prime denotes transposition. The identity matrix is denoted by I.

In the most common form of approximate policy iteration, we introduce an n × s matrix Φ whose columns can be viewed as basis functions, and we approximate the cost vectors of policies by vectors in the range of Φ:

S = {Φr | r ∈ ℜs}.

Thus, given µ, an approximation of Jµ that has the form Φrµ is used to generate an (approximately) improved policy µ̄ via the equation

µ̄ = arg min_{µ′∈M} [ gµ′ + αPµ′Φrµ ],   (2)

where M is the set of admissible policies, and the minimization is done separately for each state/component. This approach is described in detail in the literature, has been extensively tested in practice, and is one of the major methodologies for approximate DP (see the books by Bertsekas and Tsitsiklis [BeT96], Sutton and Barto [SuB98], Gosavi [Gos03], Cao [Cao07], Chang, Fu, Hu, and Marcus [CFH07], Meyn [Mey07], Powell [Pow07], and Borkar [Bor08]; the textbook [Ber07] together with its on-line chapter [Ber10a] provide a recent treatment and up-to-date references). The main advantage of using approximations of the form Φr is that the algorithm can be implemented by Monte-Carlo simulation, using exclusively low-dimensional linear algebra calculations (of order s rather than n).
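As an illustration of the improvement step (2), the following sketch (illustrative data only, assumed for the example) computes the greedy policy with respect to an approximation Φrµ by minimizing separately at each state.

```python
import numpy as np

# Hypothetical MDP data (assumptions): n = 3 states, two controls u = 0, 1.
n, alpha = 3, 0.9
p = {0: np.array([[1.0, 0.0, 0.0],            # p[u][i, j]: transition probabilities
                  [0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5]]),
     1: np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])}
g = {0: np.array([2.0, 1.0, 0.5]),            # g[u][i]: expected one-stage cost
     1: np.array([1.0, 3.0, 0.0])}
Phi = np.array([[1.0, 0.0],                   # n x s matrix of basis functions
                [1.0, 1.0],
                [0.0, 1.0]])
r_mu = np.array([0.5, -0.2])                  # weights evaluating the current policy mu

# Policy improvement as in Eq. (2): minimize separately at each state,
# with Phi @ r_mu standing in for the cost vector of mu.
J_approx = Phi @ r_mu
mu_bar = [min((0, 1), key=lambda u: g[u][i] + alpha * p[u][i] @ J_approx)
          for i in range(n)]
print(mu_bar)                                 # improved policy: one control per state
```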

An important question is whether the policy iteration process will converge or whether it will cycle between several policies. Convergence is guaranteed in the case where the policy evaluation is exact (Φ is the identity matrix), or is done using an aggregation method (to be described later; see Section IV). However, the process may also end up cycling among several (possibly poor) policies. There has been little apparent concern in the approximate DP/reinforcement learning literature about this possibility, even though it was documented with several relatively simple examples in the book [BeT96]. Unfortunately, recent research developments with the game of tetris, a popular testbed for reinforcement learning algorithms for more than 15 years ([Van95], [TsV96], [BeI96], [Kak02], [FaV06], [SzL06], [DFM09], [ThS09]), have brought the issue to sharp focus and suggest that policy oscillations can prevent the attainment of good performance.

In particular, using a set of 22 features introduced by Bertsekas and Ioffe [BeI96], and temporal difference methods (the LSPE method), an average score of a few thousand was achieved (similar results were obtained with the LSTD method [LaP03]). Using the same features and a random search method in the space of weight vectors r, an average score of 600,000-900,000 was achieved [SzL06], [ThS09]. These case studies, together with the oscillatory computational results of [BeI96] (also reproduced in [BeT96], Section 8.3), suggest that in the tetris problem, policy iteration using the projected equation is seriously hampered by oscillations between poor policies, roughly similar to the attraction of gradient methods to poor local minima.

Moreover, related phenomena may be causing similar (and hard to detect) difficulties in other related approximate DP methodologies: approximate policy iteration with the Bellman error method, policy gradient methods, and approximate linear programming. In particular, the tetris problem, using the same 22 features, has been addressed by approximate linear programming [FaV06], [DFM09], and with a policy gradient method [Kak02], also with a grossly suboptimal achieved average score of a few thousand.

This paper has a dual purpose. The first purpose is to draw attention to the significance of policy oscillations, as an issue that has been underestimated, and has the potential to alter in fundamental ways our thinking about approximate DP. The second purpose is to propose approximate policy iteration methods with guaranteed convergence/termination. In Section II we provide some background on approximate policy iteration, while in Section III, we describe the mechanism for oscillations. In Section IV, we describe some methods to prevent oscillations and the restrictions that they require. This leads to specific methods, some new and some known, which not only do not exhibit oscillations but also achieve a more favorable order of error bound.

II. APPROXIMATE POLICY ITERATION

Let us first introduce some notation. We introduce the mapping T : ℜn → ℜn defined by

(TJ)(i) = min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + αJ(j) ),   i = 1, . . . , n,

where pij(u) is the transition probability from state i to state j, as a function of the control u, and g(i, u, j) is the one-stage cost. For a policy µ we introduce the mapping Tµ : ℜn → ℜn defined by

(TµJ)(i) = Σ_{j=1}^n pij(µ(i)) ( g(i, µ(i), j) + αJ(j) ),   i = 1, . . . , n.

The optimal cost vector J∗ is the unique fixed point of T, while Jµ, the cost vector of µ, is the unique fixed point of Tµ.
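The two mappings can be written compactly in code. The sketch below is a minimal rendering on made-up data (the arrays p and g_bar are assumptions), with g_bar(i, u) denoting the expected one-stage cost Σ_j pij(u) g(i, u, j).

```python
import numpy as np

# Illustrative data (assumptions): p[i, u, j] and expected one-stage costs.
alpha = 0.9
p = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
g_bar = np.array([[1.0, 0.5],
                  [2.0, 0.2]])

def T(J):
    """(TJ)(i) = min over u of sum_j pij(u) (g(i, u, j) + alpha J(j))."""
    return np.min(g_bar + alpha * np.einsum('iuj,j->iu', p, J), axis=1)

def T_mu(J, mu):
    """(T_mu J)(i): the same expression with the control fixed to u = mu(i)."""
    q = g_bar + alpha * np.einsum('iuj,j->iu', p, J)
    return q[np.arange(len(J)), mu]

J = np.zeros(2)
print(T(J), T_mu(J, np.array([0, 1])))        # one application of each mapping
```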

We consider a standard approximate policy iteration method with linear cost function approximation. In this method, given Φrµ, an approximation of the cost vector of the current policy µ, we generate an "improved" policy µ̄ using the formula Tµ̄(Φrµ) = T(Φrµ), i.e., for all i,

µ̄(i) = arg min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + αφ(j)′rµ ),   (3)

where φ(j)′ is the row of Φ that corresponds to state j. The method terminates with µ if Tµ(Φrµ) = T(Φrµ); otherwise we repeat the process with µ replaced by µ̄.

The theoretical basis for the preceding approximate policy iteration method is given in Prop. 6.2 of [BeT96] (also Prop. 1.3.6 of [Ber07]), where it was shown that if the policy evaluation is accurate to within δ (in the sup-norm sense ‖Φrµ − Jµ‖∞ ≤ δ), then the method will yield a sequence of policies {µk} such that

lim sup_{k→∞} ‖Jµk − J∗‖∞ ≤ 2αδ / (1 − α)².   (4)

Experimental evidence indicates that this bound, while tight in theory ([BeT96], Example 6.4), tends to be conservative in practice. Furthermore, often just a few policy evaluations are needed before the bound is attained. Still, however, given that α ≈ 1, and that the sup-norm error of the policy evaluation process (and hence δ) is potentially large, the bound (4) is far from reassuring.

When the policy sequence {µk} terminates with some µ, the much sharper bound

‖Jµ − J∗‖∞ ≤ 2αδ / (1 − α)   (5)

holds. To show this, let J be the cost vector obtained by policy evaluation of µ, and note that it satisfies ‖J − Jµ‖∞ ≤ δ (by assumption) and TJ = TµJ (since µk converges to µ). We write

TJµ ≥ T(J − δe) = TJ − αδe = TµJ − αδe ≥ TµJµ − 2αδe = Jµ − 2αδe,

where e is the unit vector, from which by repeatedly applying T to both sides, we obtain

J∗ = lim_{m→∞} T^m Jµ ≥ Jµ − ( 2αδ / (1 − α) ) e,

thereby showing the error bound (5). A comparison of the two bounds (4) and (5) suggests an important advantage for methods that guarantee policy convergence.
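To make the comparison concrete, the following tiny computation (with δ = 1, an arbitrary illustrative choice) shows how the two bounds scale as α approaches 1: the convergent-case bound (5) grows like 1/(1 − α), while the oscillating-case bound (4) grows like 1/(1 − α)².

```python
# Illustrative comparison of the error bounds (4) and (5); delta is arbitrary.
delta = 1.0
for alpha in (0.9, 0.99, 0.999):
    bound_osc = 2 * alpha * delta / (1 - alpha) ** 2   # bound (4), no convergence
    bound_conv = 2 * alpha * delta / (1 - alpha)       # bound (5), policies converge
    print(f"alpha = {alpha}: bound (4) = {bound_osc:10.1f}, bound (5) = {bound_conv:8.1f}")
```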

Let us now discuss algorithmic approaches for approximate policy evaluation. The most popular are:

(a) Projected equation approach: Here we solve the projected equation

Φr = ΠTµ(Φr),

where Π denotes projection onto the subspace S = {Φr | r ∈ ℜs}. The projection is with respect to a weighted Euclidean norm ‖ · ‖ξ, where ξ = (ξ1, . . . , ξn) is a probability distribution with positive components (i.e., ‖x‖²ξ = Σ_{i=1}^n ξi xi², where ξi > 0 for all i).

(b) Aggregation approach: Here we solve an equation of the form

Φr = ΦDTµ(Φr),

where Φ and D are matrices whose rows are restricted to be probability distributions. The vector r has an interpretation as a cost vector of an aggregate problem that has s states and is defined by Φ and D. Note that there is a probabilistic structure restriction in the form of Φ, while there is no such restriction in the projected equation approach (a small numerical sketch of both evaluations is given below).
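As a rough illustration (not the paper's algorithmic machinery, which is simulation-based), the sketch below solves both policy evaluation equations exactly for a small made-up policy: the projected equation in the matrix form Cr = d with C = Φ′Ξ(I − αPµ)Φ and d = Φ′Ξgµ, and the aggregation equation by exploiting the full column rank of Φ. All matrices are illustrative assumptions.

```python
import numpy as np

# Illustrative policy data (assumptions, not from the paper).
alpha = 0.9
P_mu = np.array([[0.6, 0.4, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.3, 0.0, 0.7]])
g_mu = np.array([1.0, 0.0, 2.0])
Phi = np.array([[1.0, 0.0],                  # n x s matrix; rows are distributions,
                [0.5, 0.5],                  # so it also qualifies for aggregation
                [0.0, 1.0]])
xi = np.array([0.4, 0.3, 0.3])               # weights of the projection norm
Xi = np.diag(xi)

# (a) Projected equation Phi r = Pi T_mu(Phi r), in matrix form C r = d.
C = Phi.T @ Xi @ (np.eye(3) - alpha * P_mu) @ Phi
d = Phi.T @ Xi @ g_mu
r_proj = np.linalg.solve(C, d)

# (b) Aggregation equation Phi r = Phi D T_mu(Phi r). Since Phi has full
#     column rank, this reduces to r = D (g_mu + alpha P_mu Phi r), i.e.,
#     (I - alpha D P_mu Phi) r = D g_mu, with D row-stochastic.
D = np.array([[1.0, 0.0, 0.0],               # disaggregation probabilities (s x n)
              [0.0, 0.0, 1.0]])
r_agg = np.linalg.solve(np.eye(2) - alpha * D @ P_mu @ Phi, D @ g_mu)

print(r_proj, r_agg)                         # the two approximations of J_mu
```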

Generally, the projected equation approach is associated with temporal difference (TD) methods, which originated in reinforcement learning with the works of Samuel [Sam59], [Sam67] on a checkers-playing program. The papers by Barto, Sutton, and Anderson [BSA83], and Sutton [Sut88] proposed the TD(λ) method, which motivated a lot of research in simulation-based DP, particularly following an early success with the backgammon playing program of Tesauro [Tes92]. The original papers did not make the connection of TD methods with the projected equation, and for quite a long time it was not clear which mathematical problem TD(λ) was aiming to solve! The convergence properties of TD(λ) and its connections with the projected equation were clarified in the mid 90s through the works of several authors, see e.g., Tsitsiklis and Van Roy [TsV97]. More recent works have focused on the use of least squares-based TD methods, such as the LSTD (Least Squares Temporal Differences) method (Bradtke and Barto [BrB96]), and the LSPE (Least Squares Policy Evaluation) method (Bertsekas and Ioffe [BeI96]).

The aggregation approach also has a long history in scientific computation and operations research. It was introduced in the simulation-based approximate DP context, mostly in the form of value iteration; see Singh, Jaakkola, and Jordan [SJJ94], [SJJ95], Gordon [Gor95], and Tsitsiklis and Van Roy [TsV96] (for textbook presentations of aggregation, see [Ber07], [Ber10a]). Currently the aggregation approach seems to be less popular, but as we will argue in this paper, it has some interesting advantages over the projected equation approach, even though there are restrictions in its applicability.

An important fact, to be shown later, is that when the aggregation approach is used for policy evaluation, the policy sequence {µk} terminates with some µ, while this is generally not true for the projected equation approach.

III. POLICY OSCILLATIONS

Fig. 1. Greedy partition and cycle of policies generated by nonoptimistic policy iteration. Thus µ yields µ̄ by policy improvement if and only if rµ ∈ Rµ̄. In this figure, the method cycles between four policies and the corresponding four parameters rµk, rµk+1, rµk+2, rµk+3.

Projected equation-based variants of policy iteration methods are popular in practice, and have been tested extensively, dating to the early nineties (see e.g., the books [BeT96], [SuB98], and the references quoted there; for a sample of more recent experimental studies, see Lagoudakis and Parr [LaP03], Jung and Polani [JuP07], and Busoniu et al. [BED09]), but the associated convergence behavior is complex, and involves potentially damaging oscillations that are not well understood at present. This behavior was first described, together with the attendant policy oscillation and chattering phenomena, by the author at an April 1996 workshop on reinforcement learning [Ber96], and was subsequently discussed in Section 6.4 of [BeT96].

To get a sense of this behavior, we introduce the so-called greedy partition. This is a partition of the space ℜs of parameter vectors r into subsets Rµ, each subset corresponding to a stationary policy µ, and defined by

Rµ = { r | µ(i) = arg min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + αφ(j)′r ), i = 1, . . . , n }.

Thus, Rµ is the set of parameter vectors r for which µ is greedy with respect to Φr.

For simplicity, let us assume that we use a policy evaluation method that for each given µ produces a unique parameter vector rµ. Nonoptimistic policy iteration starts with a parameter vector r0, which specifies µ0 as a greedy policy with respect to Φr0, and generates rµ0 by using the given policy evaluation method. It then finds a policy µ1 that is greedy with respect to Φrµ0, i.e., a µ1 such that rµ0 ∈ Rµ1. It then repeats the process with µ1 replacing µ0. If some policy µk satisfying rµk ∈ Rµk is encountered, the method keeps generating that policy. This is the necessary and sufficient condition for policy convergence in the approximate policy iteration method. Of course, the mere fact that a policy iteration method is guaranteed to converge is not in itself a guarantee of good performance, beyond the fact that a better error bound holds in this case [cf. Eqs. (5) and (4)].
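The following schematic sketch (all problem data are illustrative assumptions, and policy evaluation here is an exact projected-equation solve with uniform weights, a simplification of the simulation-based methods discussed above) runs this nonoptimistic iteration and reports either convergence (rµk ∈ Rµk) or a repeated policy, which signals a cycle of the type in Fig. 1.

```python
import numpy as np

# Schematic nonoptimistic approximate policy iteration with cycle detection.
# All problem data are illustrative assumptions.
alpha = 0.95
p = np.array([[[0.9, 0.1], [0.1, 0.9]],       # p[i, u, j]
              [[0.6, 0.4], [0.2, 0.8]]])
g_bar = np.array([[0.0, 1.0],                 # expected one-stage costs g_bar[i, u]
                  [2.0, 0.5]])
Phi = np.array([[1.0], [2.0]])                # a single basis function
n = g_bar.shape[0]
Xi = np.diag(np.full(n, 1.0 / n))             # uniform projection weights (simplification)

def evaluate(mu):
    """Weight vector r_mu solving the projected equation C r = d for policy mu."""
    P_mu = p[np.arange(n), mu]
    g_mu = g_bar[np.arange(n), mu]
    C = Phi.T @ Xi @ (np.eye(n) - alpha * P_mu) @ Phi
    d = Phi.T @ Xi @ g_mu
    return np.linalg.solve(C, d)

def greedy(r):
    """The policy that is greedy with respect to Phi r, i.e., the mu with r in R_mu."""
    q = g_bar + alpha * np.einsum('iuj,j->iu', p, Phi @ r)
    return tuple(int(u) for u in np.argmin(q, axis=1))

mu, seen = (0, 0), []                         # starting policy mu^0
while True:
    r_mu = evaluate(np.array(mu))
    mu_next = greedy(r_mu)
    if mu_next == mu:                         # r_mu in R_mu: policies have converged
        print("policies converged to", mu); break
    if mu_next in seen:                       # a previously generated policy reappears
        print("policy cycle detected at", mu_next); break
    seen.append(mu)
    mu = mu_next
```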


In the case of a lookup table representation where the parameter vectors rµ are equal to the cost-to-go vectors Jµ, the condition rµk ∈ Rµk is equivalent to rµk = Trµk, and is satisfied if and only if µk is optimal. When there is cost function approximation, however, this condition need not be satisfied for any policy. In this case, since there is a finite number of possible vectors rµ, one generated from another in a deterministic way, the algorithm ends up repeating some cycle of policies µk, µk+1, . . . , µk+m with

rµk ∈ Rµk+1, rµk+1 ∈ Rµk+2, . . . , rµk+m−1 ∈ Rµk+m, rµk+m ∈ Rµk;

(see Fig. 1). Furthermore, there may be several different cycles, and the method may end up converging to any one of them; the actual cycle obtained depends on the starting policy µ0. This is reminiscent of gradient methods applied to minimization of functions with multiple local minima, where the limit of convergence depends on the starting point. Furthermore, the policies obtained may be quite bad, subject only to the error bound (4). The following example illustrates policy oscillation.

Example 1: Consider a discounted problem with two states, 1 and 2, illustrated in Fig. 2(a). There is a choice of control only at state 1, and there are two policies, denoted µ∗ and µ. The optimal policy µ∗ when at state 1 stays at 1 with probability p > 0 and incurs a negative cost c. The other policy is µ and cycles between the two states with 0 cost. We consider cost approximation of the form i · r, so Φ = (1, 2)′.

Let us construct the greedy partition. We have

Rµ = { r | p( c + α(1 · r) ) + (1 − p)α(2 · r) ≥ α(2 · r) } = { r | c ≥ αr },

Rµ∗ = { r | c ≤ αr }.

We next calculate the points rµ and rµ∗ that solve the projected equations Cµrµ = dµ and Cµ∗rµ∗ = dµ∗, which correspond to µ and µ∗, respectively. Here Cµ = Φ′Ξµ(I − αPµ)Φ and dµ = Φ′Ξµgµ, where Ξµ is a diagonal matrix of state weights (the steady-state probabilities of µ), and similarly for µ∗, with Ξµ∗ = diag( 1/(2 − p), (1 − p)/(2 − p) ). Since gµ = (0, 0)′, we have

dµ = Φ′Ξµgµ = 0,

so rµ = 0. Similarly, with some calculation,

Cµ∗ = Φ′Ξµ∗(I − αPµ∗)Φ = ( 5 − 4p − α(4 − 3p) ) / (2 − p),

dµ∗ = Φ′Ξµ∗gµ∗ = (1  2) diag( 1/(2 − p), (1 − p)/(2 − p) ) (c, 0)′ = c / (2 − p),

so

rµ∗ = c / ( 5 − 4p − α(4 − 3p) ).

Fig. 2. The problem of Example 1. (a) Costs and transition probabilities for the policies µ and µ∗. (b) The greedy partition and the solutions of the projected equations corresponding to µ and µ∗. Nonoptimistic policy iteration oscillates between rµ and rµ∗.

We now note that since c < 0, rµ = 0 ∈ Rµ∗, while for p ≈ 1 and α > 1 − α, we have

rµ∗ ≈ c / (1 − α) ∈ Rµ;

cf. Fig. 2(b). In this case, approximate policy iteration cycles between µ and µ∗.

Notice that it is hard to predict when and what kind of oscillation will occur. For example, if c > 0, we have rµ = 0 ∈ Rµ, while for p ≈ 1 and α > 1 − α, we have

rµ∗ ≈ c / (1 − α) ∈ Rµ∗.

In this case approximate policy iteration will converge to µ (or µ∗) if started with r in Rµ (or Rµ∗, respectively).
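The oscillation in this example can be checked numerically. The sketch below is a reconstruction under the reading used above (under µ∗, staying at state 1 has probability p and cost c, all other transitions have cost 0; features φ(1) = 1, φ(2) = 2; the projected equation is solved with steady-state weights); the parameter values are illustrative. With c < 0, p close to 1, and α > 1/2, the printout alternates between µ and µ∗.

```python
import numpy as np

# Example 1, reconstructed numerically (parameter values are illustrative).
alpha, p, c = 0.9, 0.9, -1.0
Phi = np.array([[1.0], [2.0]])                # phi(1) = 1, phi(2) = 2

# Transition matrix and expected one-stage cost vector of each policy:
# mu moves 1 -> 2 -> 1 at zero cost; mu* stays at state 1 w.p. p at cost c.
policies = {
    "mu":  (np.array([[0.0, 1.0], [1.0, 0.0]]), np.array([0.0, 0.0])),
    "mu*": (np.array([[p, 1.0 - p], [1.0, 0.0]]), np.array([p * c, 0.0])),
}

def steady_state(P):
    """Stationary distribution of a 2-state irreducible chain."""
    vals, vecs = np.linalg.eig(P.T)
    w = vecs[:, np.argmax(np.isclose(vals, 1.0))].real
    return w / w.sum()

def evaluate(name):
    """Scalar weight r solving the projected equation C r = d for the policy."""
    P, g = policies[name]
    Xi = np.diag(steady_state(P))
    C = Phi.T @ Xi @ (np.eye(2) - alpha * P) @ Phi
    d = Phi.T @ Xi @ g
    return float(np.linalg.solve(C, d)[0])

def greedy(r):
    """Greedy policy at state 1 given the approximate costs (1*r, 2*r)."""
    q_move = alpha * 2 * r                                      # action of mu
    q_stay = p * (c + alpha * 1 * r) + (1 - p) * alpha * 2 * r  # action of mu*
    return "mu" if q_move <= q_stay else "mu*"

current = "mu"
for _ in range(6):
    r = evaluate(current)
    print(f"{current:3s}: r = {r:7.3f}  ->  next greedy policy: {greedy(r)}")
    current = greedy(r)
```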

IV. CONDITIONS FOR POLICY CONVERGENCE

The preceding analysis has indicated that it is desirable to avoid policy oscillation in approximate policy iteration. Moreover, as mentioned earlier, when policies converge there is a more favorable error bound associated with the method [cf. Eq. (5) versus Eq. (4)]. It is therefore interesting to investigate conditions under which the sequence of policies will converge.

To this end, consider a method that evaluates the cost vector Jµ of a policy µ by a vector Φrµ that is defined as a solution of the fixed point equation

(MTµ)(Φr) = Φr,   (6)

where M : ℜn → S is a mapping (independent of µ). Let us assume the following conditions:

(a) M is monotone, in the sense that My ≤ Mȳ for any two vectors y, ȳ ∈ ℜn such that y ≤ ȳ.

(b) For each µ, there is a unique solution of Eq. (6), denoted Φrµ.


(c) For each µ and starting point r0, the iteration

Φrk+1 = MTµ(Φrk)

converges to Φrµ.

Note that conditions (b) and (c) are satisfied if MTµ is a contraction on S.

Proposition 1. Let conditions (a)-(c) hold. Consider the approximate policy iteration method that uses Eq. (6) for policy evaluation, and is operated so that it terminates when a policy µ is obtained such that Tµ(Φrµ) = T(Φrµ). Then the method terminates in a finite number of iterations.

Proof: Similar to the standard proof of convergence of (exact) policy iteration, we use the policy improvement equation Tµ̄(Φrµ) = T(Φrµ), the monotonicity of M, and the policy evaluation Eq. (6) to write

(MTµ̄)(Φrµ) = (MT)(Φrµ) ≤ (MTµ)(Φrµ) = Φrµ.

Since MTµ̄ is monotone by condition (a), by iterating with MTµ̄ and by using condition (c), we obtain

Φrµ̄ = lim_{k→∞} (MTµ̄)^k(Φrµ) ≤ Φrµ.

There are finitely many policies, so we must have Φrµ̄ = Φrµ after a finite number of iterations, which implies that Tµ̄(Φrµ̄) = T(Φrµ̄), and that the algorithm terminates with µ̄.

When the projected equation approach is used (M = Π, where Π is a Euclidean projection matrix), the monotonicity assumption is satisfied if Π is independent of the policy µ and has nonnegative components. It turns out that this is true when Φ has nonnegative components, and has linearly independent columns that are orthogonal with respect to the inner product ⟨x1, x2⟩ = x′1Ξx2. This follows from the projection formula Π = Φ(Φ′ΞΦ)−1Φ′Ξ = ΦΦ′Ξ and the fact that Φ′ΞΦ is positive definite and diagonal.

An important special case where Prop. 1 applies and policies converge is when M = ΦD, and Φ and D are matrices whose rows are probability distributions. This is policy evaluation by aggregation, noted in Section II. Briefly, there is a finite set A of aggregate states, and D and Φ are the matrices that have as rows disaggregation and aggregation probability distributions, respectively, whose components relate the original system states with the aggregate states and are defined as follows:

(1) For each aggregate state x and original system state i, we specify the disaggregation probability dxi [we have Σ_{i=1}^n dxi = 1 for each x ∈ A]. Roughly, dxi may be interpreted as the "degree to which x is represented by i."

(2) For each aggregate state y and original system state j, we specify the aggregation probability φjy (we have Σ_{y∈A} φjy = 1 for each j = 1, . . . , n). Roughly, φjy may be interpreted as the "degree of membership of j in the aggregate state y."

It can be seen that for the aggregation case M = ΦD, M is monotone and MTµ is a sup-norm contraction (since M is nonexpansive with respect to the sup norm), so that conditions (a)-(c) are satisfied. The same is true in the more general case M = ΦD, where Φ and D are matrices with nonnegative components, and the row sums of M are less than or equal to 1, i.e.,

Σ_{m=1}^s Φim Dmj ≥ 0,   ∀ i, j = 1, . . . , n,

Σ_{m=1}^s Φim Σ_{j=1}^n Dmj ≤ 1,   ∀ i = 1, . . . , n.

Note that even in this more general case, the policy evaluation Eq. (6) can be solved by using simulation and low-order calculations (see [Ber10a] and the survey [Ber10b], which contains many references). An interesting special case where the aggregation equation Φr = ΦDTµ(Φr) is also a projected equation is hard aggregation (the state space is partitioned into subsets, and each column of Φ has a 1 for the states in the subset corresponding to the column, and a 0 for the remaining states), as discussed in [Ber10a], [Ber10b].
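For concreteness, here is a minimal hard-aggregation sketch (made-up data and partition, purely illustrative): Φ and D are built from a partition of the state space, and policy evaluation iterates Φr_{k+1} = ΦDTµ(Φr_k), which converges since M = ΦD satisfies conditions (a)-(c).

```python
import numpy as np

# Illustrative hard aggregation (assumptions, not from the paper): states
# {0, 1, 2, 3} are partitioned into aggregate states {0, 1} and {2, 3}.
alpha = 0.9
P_mu = np.array([[0.5, 0.5, 0.0, 0.0],
                 [0.2, 0.3, 0.5, 0.0],
                 [0.0, 0.0, 0.6, 0.4],
                 [0.4, 0.0, 0.0, 0.6]])
g_mu = np.array([1.0, 1.0, 3.0, 2.0])

partition = [[0, 1], [2, 3]]                  # members of each aggregate state
n, s = len(g_mu), len(partition)
Phi = np.zeros((n, s))                        # aggregation probabilities (0/1 rows)
D = np.zeros((s, n))                          # disaggregation probabilities
for x, members in enumerate(partition):
    Phi[members, x] = 1.0
    D[x, members] = 1.0 / len(members)        # uniform over the subset

# Fixed point iteration Phi r_{k+1} = Phi D T_mu(Phi r_k), written on r directly.
r = np.zeros(s)
for _ in range(200):
    r = D @ (g_mu + alpha * P_mu @ (Phi @ r))
print(r)                                      # aggregate cost vector r_mu
```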

V. CONCLUSIONS

We have considered some aspects of approximate policy iteration methods. From the analytical point of view, this is a subject with a rich theory and interesting algorithmic issues. From a practical point of view, this is a methodology that can address very large and difficult problems, and yet challenge the practitioner with unpredictable behaviors, such as policy oscillations, which are not fully understood at present.

We have derived conditions under which policy oscillations will not occur, and have shown that aggregation-based methods satisfy these conditions. It thus appears that these methods have more regular behavior, and offer better error bound guarantees than TD methods. On the other hand, aggregation methods are restricted in the choice of basis functions that they can use, and this can be a significant limitation for many problems.

REFERENCES

[BED09] Busoniu, L., Ernst, D., De Schutter, B., and Babuska, R., 2009. "Online Least-Squares Policy Iteration for Reinforcement Learning Control," unpublished report, Delft Univ. of Technology, Delft, NL.

[BSA83] Barto, A. G., Sutton, R. S., and Anderson, C. W., 1983. "Neuronlike Elements that Can Solve Difficult Learning Control Problems," IEEE Trans. on Systems, Man, and Cybernetics, Vol. 13, pp. 835-846.

[BeI96] Bertsekas, D. P., and Ioffe, S., 1996. "Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming," LIDS-P-2349, MIT, Cambridge, MA.

[BeT96] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.

[Ber96] Bertsekas, D. P., 1996. Lecture at NSF Workshop on Reinforcement Learning, Hilltop House, Harper's Ferry, NY.

[Ber07] Bertsekas, D. P., 2007. Dynamic Programming and Optimal Control, 3rd Edition, Vols. I and II, Athena Scientific, Belmont, MA.

[Ber10a] Bertsekas, D. P., 2010. Approximate Dynamic Programming, on-line at http://web.mit.edu/dimitrib/www/dpchapter.html.

[Ber10b] Bertsekas, D. P., 2010. "Approximate Policy Iteration: A Survey and Some New Methods," Report LIDS-P-2833, MIT; to appear in Journal of Control Theory and Applications.

[Bor08] Borkar, V. S., 2008. Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge Press.

[BrB96] Bradtke, S. J., and Barto, A. G., 1996. "Linear Least-Squares Algorithms for Temporal Difference Learning," Machine Learning, Vol. 22, pp. 33-57.

[CFH07] Chang, H. S., Fu, M. C., Hu, J., and Marcus, S. I., 2007. Simulation-Based Algorithms for Markov Decision Processes, Springer, NY.

[Cao07] Cao, X. R., 2007. Stochastic Learning and Optimization: A Sensitivity-Based Approach, Springer, NY.

[DFM09] Desai, V. V., Farias, V. F., and Moallemi, C. C., 2009. "Approximate Dynamic Programming via a Smoothed Approximate Linear Program," submitted.

[FaV06] Farias, V. F., and Van Roy, B., 2006. "Tetris: A Study of Randomized Constraint Sampling," in Probabilistic and Randomized Methods for Design Under Uncertainty, Springer-Verlag.

[Gor95] Gordon, G. J., 1995. "Stable Function Approximation in Dynamic Programming," in Machine Learning: Proceedings of the Twelfth International Conference, Morgan Kaufmann, San Francisco, CA.

[Gos03] Gosavi, A., 2003. Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, Springer-Verlag, NY.

[JuP07] Jung, T., and Polani, D., 2007. "Kernelizing LSPE(λ)," in Proc. 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, Honolulu, Hawaii, pp. 338-345.

[Kak02] Kakade, S., 2002. "A Natural Policy Gradient," in Advances in Neural Information Processing Systems, Vol. 14, MIT Press, Cambridge, MA.

[LaP03] Lagoudakis, M. G., and Parr, R., 2003. "Least-Squares Policy Iteration," J. of Machine Learning Research, Vol. 4, pp. 1107-1149.

[Mey07] Meyn, S., 2007. Control Techniques for Complex Networks, Cambridge University Press, NY.

[Pow07] Powell, W. B., 2007. Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley, NY.

[Put94] Puterman, M. L., 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming, J. Wiley, NY.

[SJJ94] Singh, S. P., Jaakkola, T., and Jordan, M. I., 1994. "Learning without State-Estimation in Partially Observable Markovian Decision Processes," Proc. of the Eleventh Machine Learning Conference, pp. 284-292.

[SJJ95] Singh, S. P., Jaakkola, T., and Jordan, M. I., 1995. "Reinforcement Learning with Soft State Aggregation," in Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, MA.

[Sam59] Samuel, A. L., 1959. "Some Studies in Machine Learning Using the Game of Checkers," IBM J. Res. and Development, pp. 210-229.

[Sam67] Samuel, A. L., 1967. "Some Studies in Machine Learning Using the Game of Checkers. II – Recent Progress," IBM J. Res. and Development, pp. 601-617.

[SuB98] Sutton, R. S., and Barto, A. G., 1998. Reinforcement Learning, MIT Press, Cambridge, MA.

[Sut88] Sutton, R. S., 1988. "Learning to Predict by the Methods of Temporal Differences," Machine Learning, Vol. 3, pp. 9-44.

[SzL06] Szita, I., and Lőrincz, A., 2006. "Learning Tetris Using the Noisy Cross-Entropy Method," Neural Computation, Vol. 18, pp. 2936-2941.

[Tes92] Tesauro, G., 1992. "Practical Issues in Temporal Difference Learning," Machine Learning, Vol. 8, pp. 257-277.

[ThS09] Thiery, C., and Scherrer, B., 2009. "Improvements on Learning Tetris with Cross-Entropy," Intern. Computer Games Assoc. Journal, Vol. 32, pp. 23-33.

[TsV96] Tsitsiklis, J. N., and Van Roy, B., 1996. "Feature-Based Methods for Large-Scale Dynamic Programming," Machine Learning, Vol. 22, pp. 59-94.

[Van06] Van Roy, B., 2006. "Performance Loss Bounds for Approximate Value Iteration with State Aggregation," Math. of Operations Research, Vol. 31, pp. 234-244.

[YuB10] Yu, H., and Bertsekas, D. P., 2010. "New Error Bounds for Approximations from Projected Linear Equations," Mathematics of Operations Research, Vol. 35, pp. 306-329.