
APPROXIMATE DYNAMIC PROGRAMMING—II: ALGORITHMS

WARREN B. POWELL

Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey

INTRODUCTION

Approximate dynamic programming (ADP) represents a powerful modeling and algorithmic strategy that can address a wide range of optimization problems that involve making decisions sequentially in the presence of different types of uncertainty. A short list of applications, which illustrate different problem classes, includes the following:

• Option Pricing. An American option allows us to buy or sell an asset at any time up to a specified time, where we make money when the price goes under or over (respectively) a set strike price. Valuing the option requires finding an optimal policy for determining when to exercise the option.

• Playing Games. Computer algorithms have been designed to play backgammon, bridge, chess, and recently, the Chinese game of Go.

• Controlling a Device. This might be a robot or unmanned aerial vehicle, but there is a need for autonomous devices to manage themselves for tasks ranging from vacuuming the floor to collecting information about terrorists.

• Storage of Continuous Resources. Managing the cash balance for a mutual fund or the amount of water in a reservoir used for a hydroelectric dam requires managing a continuous resource over time in the presence of stochastic information on parameters such as prices and rainfall.


• Asset Acquisition. We often have to acquire physical and financial assets over time under different sources of uncertainty about availability, demands, and prices.

• Resource Allocation. Whether we are managing blood inventories, financial portfolios, or fleets of vehicles, we often have to move, transform, clean, and repair resources to meet various needs under uncertainty about demands and prices.

• R&D Portfolio Optimization. The Department of Energy has to determine how to allocate government funds to advance the science of energy generation, transmission, and storage. These decisions have to be made over time in the presence of uncertain changes in technology and commodity prices.

These problems range from relatively low-dimensional applications to very high-dimensional industrial problems, but all share the property of making decisions over time under different types of uncertainty. In Ref. 1, a modeling framework is described, which breaks a problem into five components.

• State. St (or xt in the control theory community), capturing all the information we need at time t to make a decision and model the evolution of the system in the future.

• Action/Decision/Control. Depending on the community, these will be modeled as a, x, or u. Decisions are made using a decision function or policy. If we are using action a, we represent the decision function using A^π(S_t), where π ∈ Π and Π is a family of policies (or functions). If our action is x, we use X^π(S_t). We use a if our problem has a small number of discrete actions. We use x when the decision might be a vector (of discrete or continuous elements).

• Exogenous Information. Lacking standard notation, we let Wt be the family of random variables that represent new information that first becomes known by time t.

• Transition Function. Also known as the system model (or just model), this function is denoted by S^M(·), and is used to express the evolution of the state variable, as in S_{t+1} = S^M(S_t, x_t, W_{t+1}).

• Objective Function. Let C(St, xt) be the contribution (if we are maximizing) received when in state St if we take action xt. Our objective is to find a decision function (policy) that solves

\max_{\pi \in \Pi} \; \mathbb{E} \left\{ \sum_{t=0}^{T} \gamma^t C(S_t, X^\pi(S_t)) \right\}.   (1)

We encourage readers to review Ref. 1 (in this volume) or Chapter 5 in Ref. 2, available at http://www.castlelab.princeton.edu/adp.htm, before attempting to design an algorithmic strategy. For the remainder of this article, we assume that the reader is familiar with the modeling behind the objective function in Equation (1), and in particular the range of policies that can be used to provide solutions. In this article, we assume that we would like to find a good policy by using Bellman’s equation as a starting point, which we may write in either of two ways:

V_t(S_t) = \max_a \Big( C(S_t, a) + \gamma \sum_{s'} p(s' \mid S_t, a)\, V_{t+1}(s') \Big),   (2)

         = \max_a \Big( C(S_t, a) + \gamma\, \mathbb{E}\{ V_{t+1}(S_{t+1}) \mid S_t \} \Big),   (3)

where Vt(St) is the value of being in state St and following an optimal policy from t until the end of the planning horizon (which may be infinite). The control theory community replaces Vt(St) with Jt(St), which is referred to as the cost-to-go function. If we are solving a problem in steady state, Vt(St) would be replaced with V(S).

If we have a small number of discrete states and actions, we can find the value function Vt(St) using the classical techniques of value iteration and policy iteration [3]. Many practical problems, however, suffer from one or more of the three curses of dimensionality: (i) vector-valued (and possibly continuous) state variables, (ii) random variables Wt for which we may not be able to compute the expectation, and (iii) decisions (typically denoted by xt or ut), which may be discrete or continuous vectors, requiring some sort of solver (linear, nonlinear, or integer programming) or specialized algorithmic strategy.

The field of ADP has historically focused on the problem of multidimensional state variables, which prevent us from calculating Vt(s) for each discrete state s. In our presentation, we make an effort to cover this literature, but we also show how we can overcome the other two curses using a device known as the postdecision state variable. However, a short article such as this is simply unable to present the vast range of algorithmic strategies in any detail. For this, we recommend the following additional reading: Ref. 4, especially for students interested in obtaining thorough theoretical foundations; Ref. 5 for a presentation from the perspective of the reinforcement learning (RL) community; Ref. 2 for a presentation that puts more emphasis on modeling, and more from the perspective of the operations research community; and Chapter 6 of Ref. 6, which can be downloaded from http://web.mit.edu/dimitrib/www/dpchapter.html.

A GENERIC ADP ALGORITHM

Equation (2) (or Eq. 3) is typically solved by stepping backward in time, where it is necessary to loop over all the potential states to compute Vt(St) for each state St. For this reason, classical dynamic programming is often referred to as backward dynamic programming. The requirement of looping over all the states is the first computational step that cannot be performed when the state variable is a vector, or even a scalar continuous variable.
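For contrast with the ADP strategies that follow, here is a minimal sketch of classical backward dynamic programming (Equation (2)) on a small discrete problem; the problem data (numbers of states and actions, the contribution table, and the transition matrix) are hypothetical.

```python
import numpy as np

# Backward dynamic programming (Equation (2)) for a small, hypothetical finite-horizon
# problem with discrete states and actions. P[a, s, s'] is an assumed known transition
# matrix and C[s, a] a contribution table.
rng = np.random.default_rng(0)
n_states, n_actions, T, gamma = 5, 3, 10, 0.9
C = rng.uniform(0.0, 10.0, size=(n_states, n_actions))            # contributions C(s, a)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, :] = p(.|s, a)

V = np.zeros((T + 2, n_states))                                   # V[T+1, :] = 0 terminal values
for t in range(T, -1, -1):                                        # loop over all states at every t
    for s in range(n_states):
        # Equation (2): max over a of C(s,a) + gamma * sum_s' p(s'|s,a) * V_{t+1}(s')
        V[t, s] = max(C[s, a] + gamma * P[a, s] @ V[t + 1] for a in range(n_actions))

print("V_0:", np.round(V[0], 2))
```

Note that the nested loop over every state is exactly the step that becomes impossible when the state variable is a vector or continuous.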

ADP takes a very different approach. In most ADP algorithms, we step forward in time following a single sample path. Assume we are modeling a finite horizon problem. We are going to iteratively simulate this problem.


At iteration n, assume that at time t we are in state S_t^n. Now assume that we have a policy of some sort that produces a decision x_t^n. In ADP, we are typically solving a problem that can be written as

v_t^n = \max_{x_t} \Big( C(S_t^n, x_t) + \gamma\, \mathbb{E}\{ V_{t+1}^{n-1}(S^M(S_t^n, x_t, W_{t+1})) \mid S_t \} \Big).   (4)

Here, V_{t+1}^{n-1}(S_{t+1}) is an approximation of the value of being in state S_{t+1} = S^M(S_t^n, x_t, W_{t+1}) at time t + 1. If we are modeling a problem in steady state, we drop the subscript t everywhere, recognizing that W is a random variable. We note that xt here can be a vector, where the maximization problem might require the use of a linear, nonlinear, or integer programming package. Let x_t^n be the value of xt that solves this optimization problem. We can use v_t^n to update our approximation of the value function. For example, if we are using a lookup table representation, we might use

V_t^n(S_t^n) = (1 - \alpha_{n-1})\, V_t^{n-1}(S_t^n) + \alpha_{n-1}\, v_t^n,   (5)

where \alpha_{n-1} is a step size between 0 and 1 (more on this later).

Now we are going to use Monte Carlo methods to sample our vector of random variables, which we denote by W_{t+1}^n, representing the new information that would have first been learned between time t and t + 1. The next state would be given by

S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1}^n).

The overall algorithm is given in Fig. 1. This is a very basic version of an ADP algorithm, one which would generally not work in practice. But it illustrates some of the basic elements of an ADP algorithm. First, it steps forward in time, using a randomly sampled set of outcomes, visiting one state at a time. Secondly, it makes decisions using some sort of statistical approximation of a value function, although it is unlikely that we would use a lookup table representation. Thirdly, we use information gathered as we step forward in time to update our value function approximation, almost always using some sort of recursive statistics.

Figure 1. An approximate dynamic programming algorithm using expectations.
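Since Fig. 1 itself is not reproduced here, the following is a minimal stand-in sketch of a forward-pass ADP loop built around Equations (4) and (5) with a lookup-table value function. The inventory-style problem (Poisson demand, prices, horizon) is hypothetical, and the expectation in Equation (4) is computed over a truncated demand distribution.

```python
import numpy as np
from scipy.stats import poisson

# Forward-pass ADP (Equations (4) and (5)) on a hypothetical inventory problem:
# state = inventory on hand, decision = order quantity, exogenous information = demand.
rng = np.random.default_rng(1)
T, N, gamma = 20, 200, 0.95
S_MAX, X_MAX, D_MAX, MU = 20, 10, 25, 5
price, cost = 6.0, 4.0
d_vals = np.arange(D_MAX + 1)
d_probs = poisson.pmf(d_vals, MU)
d_probs /= d_probs.sum()                                # renormalize the truncated distribution

def contribution(s, x, d):
    return price * min(s + x, d) - cost * x             # revenue on satisfied demand minus order cost

def transition(s, x, d):
    return max(0, min(s + x, S_MAX) - d)                # next pre-decision state S^M(s, x, W)

V = np.zeros((T + 2, S_MAX + 1))                        # lookup-table approximation V_t^n(s)
for n in range(1, N + 1):
    alpha, s = 1.0 / n, S_MAX // 2
    for t in range(T + 1):
        # Equation (4): expected contribution plus discounted downstream value, per decision x.
        values = [np.dot(d_probs, [contribution(s, x, d) + gamma * V[t + 1, transition(s, x, d)]
                                   for d in d_vals]) for x in range(X_MAX + 1)]
        x_star, v_n = int(np.argmax(values)), float(max(values))
        V[t, s] = (1 - alpha) * V[t, s] + alpha * v_n   # Equation (5): smooth in the new observation
        s = transition(s, x_star, rng.poisson(MU))      # trajectory following: step along one sample path

print("approximate value of the starting state:", round(V[0, S_MAX // 2], 2))
```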


The algorithm in Fig. 1 goes under different names. It is a basic ‘‘forward pass’’ algorithm, where we step forward in time, updating value functions as we progress. Another variation involves simulating forward through the horizon without updating the value function. Then, after the simulation is done, we step backward through time, using information about the entire future trajectory.

We have written the algorithm in the context of a finite horizon problem. For infinite horizon problems, simply drop the subscript t everywhere, remembering that you have to keep track of the ‘‘current’’ and ‘‘future’’ state.

There are different ways of presenting this basic algorithm. To illustrate, we can rewrite Equation (5) as

V_t^n(S_t^n) = V_t^{n-1}(S_t^n) + \alpha_{n-1} \big( v_t^n - V_t^{n-1}(S_t^n) \big).   (6)

The quantity

v_t^n - V_t^{n-1}(S_t^n) = C_t(S_t^n, x_t) + \gamma\, V_{t+1}^{n-1}(S_{t+1}) - V_t^{n-1}(S_t^n),

where S_{t+1} = S^M(S_t, x_t, W_{t+1}), is known as the Bellman error, since it is a sampled observation of the difference between a new estimate of the value of being in state S_t^n and what we currently believe the value of being in state S_t^n to be. In the RL community, it has long been called the temporal difference (TD), since it is the difference between estimates at two different iterations, which can have the interpretation of time, especially for steady-state problems, where n indexes transitions forward in time.

The RL community would refer to this algorithm as TD learning (first introduced in Ref. 7). If we use a backward pass, we can let v_t^n be the value of the entire future trajectory, but it is often useful to introduce a discount factor (usually represented as λ) to discount rewards received in the future. This discount factor is introduced purely for algorithmic reasons, and should not be confused with the discount γ, which is intended to capture the time value of money. If we let λ = 1, then we are adding up the entire future trajectory. But if we use λ = 0, then we obtain the same updating as our forward pass algorithm in Fig. 1. For this reason, the RL community refers to this family of updating strategies as TD(λ).
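One common way to form the backward-pass values is as a λ-weighted sum of one-step temporal differences. A minimal sketch, assuming a fixed simulated trajectory and an existing lookup-table approximation V (the toy arrays below are hypothetical):

```python
import numpy as np

# Backward-pass values v_t formed from lambda-discounted temporal differences along
# one trajectory. With lam = 0 this reduces to the one-step target of Fig. 1; with
# lam = 1 it accumulates the entire future trajectory.
def td_lambda_targets(V, states, rewards, gamma, lam):
    T = len(rewards)
    deltas = np.array([rewards[t] + gamma * V[states[t + 1]] - V[states[t]] for t in range(T)])
    v = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):             # backward pass through the trajectory
        acc = deltas[t] + gamma * lam * acc  # accumulate (gamma*lambda)-discounted TD errors
        v[t] = V[states[t]] + acc            # target used to update V at states[t]
    return v

V = np.array([0.0, 1.0, 2.0, 0.5])           # toy lookup-table approximation
states = [0, 1, 2, 3]                        # visited states S_0 .. S_T
rewards = [1.0, 0.5, 2.0]                    # contributions received along the path
print(td_lambda_targets(V, states, rewards, gamma=0.95, lam=0.7))
```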

TD(0) (TD learning with λ = 0) can be viewed as an approximate version of classical value iteration. It is important to recognize that after each update, we are not just changing the value function approximation, we are also changing our behavior (i.e., our policy for making decisions), which in turn changes the distribution of v_t^n given a particular state S_t^n. TD(1), on the other hand, computes v_t^n by simulating a policy into the future. This policy depends on the value function approximation, but otherwise v_t^n does not directly use value function approximations.

An important dimension of most ADP algorithms is Step 2c, where we have to choose which state to visit next. For many complex problems, it is most natural to choose the state determined by the action x_t^n, and then simulate our way to the next state using S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1}(\omega^n)). Many authors refer to this as a form of ‘‘real-time dynamic programming’’ (RTDP) [8], although a better term is trajectory following [9], since RTDP involves other algorithmic assumptions. The alternative to trajectory following is to randomly generate a state to visit next. Trajectory following often seems more natural since it means you are simulating a system, and it visits the states that arise naturally, rather than from what might be an unrealistic probability model. But readers need to understand that there are few convergence results for trajectory following models, due to the complex interaction between random observations of the value of being in a state and the policy that results from a specific value function approximation, which impacts the probability of visiting a state.

The power and flexibility of ADP has to be tempered with the reality that simple algorithms generally do not work, even on simple problems. There are several issues we have to address if we want to develop an effective ADP strategy:


• We avoid the problem of looping over all the states, but replace it with the problem of developing a good statistical approximation of the value of being in each state that we might visit. The challenge of designing a good value function approximation is at the heart of most ADP algorithms.

• We have to determine how to update our approximation using new information.

• We assume that we can compute the expectation, which is often not the case, and which especially causes problems when our decision/action variable is a vector.

• We can easily get caught in a circle of choosing actions that take us to states that we have already visited, simply because we may have pessimistic estimates of states that we have not yet visited. We need some mechanism to force us to visit states just to learn the value of these states.

The remainder of this article is a brief tour through a rich set of algorithmic strategies for solving these problems.

Q-LEARNING AND THE POSTDECISION STATE VARIABLE

In our classical ADP strategy, we are solving optimization problems of the form

v_t^n = \max_{x_t \in \mathcal{X}_t^n} \Big( C(S_t^n, x_t) + \gamma\, \mathbb{E}\{ V_{t+1}(S^M(S_t^n, x_t, W_{t+1})) \mid S_t \} \Big).   (7)

There are many applications of ADP which introduce additional complications: (i) we may not be able to compute the expectation, and (ii) the decision x may be a vector, and possibly a very big vector (thousands of dimensions). There are two related strategies that have evolved to address these issues: Q-learning, a technique widely used in the RL community, and ADP using the postdecision state variable, a method that has evolved primarily in the operations research community. These methods are actually closely related, a topic we revisit at the end.

Q-Factors

To introduce the idea of Q-factors, we need to return to the notation where a is a discrete action (with a small action space). The Q-factor (so named because this was the original notation used in Ref. 10) is the value of being in a state and taking a specific action,

Q(s, a) = C(s, a) + \gamma\, \mathbb{E}\, V(S^M(s, a, W)).

Value functions and Q-factors are related using

V(s) = \max_a Q(s, a),

so, at first glance, it might seem as if we are not actually gaining anything. In fact, estimating a Q-factor actually seems harder than estimating the value of being in a state, since there are more state-action pairs than there are states.

The power of Q-factors arises when we have problems where we either do not have an explicit transition function S^M(·), or do not know the probability distribution of W. This means we cannot compute the expectation, but not because it is computationally difficult. When we do not know the transition function, and/or cannot compute the expectation (because we do not know the probability distribution of the random variable), we would like to draw on a subfield of ADP known as model-free dynamic programming. In this setting, we are in state S^n, we choose an action a^n, and then observe the next state, which we will call s'. We can then record a value of being in state S^n and taking action a^n using

q^n = C(S^n, a^n) + \gamma\, V^{n-1}(s'),

where V^{n-1}(s') = \max_{a'} Q^{n-1}(s', a'). We then use this value to update our estimate of the Q-factor using

Q^n(S^n, a^n) = (1 - \alpha)\, Q^{n-1}(S^n, a^n) + \alpha\, q^n.

Whenever we are in state S^n, we choose our action using

a^n = \arg\max_a Q^{n-1}(S^n, a).   (8)

Page 6: APPROXIMATE DYNAMIC PROGRAMMING—II: …adp.princeton.edu/Papers/Powell - Approximate Dynamic Programming... · different types of uncertainty. A short list of ... been learned between

6 APPROXIMATE DYNAMIC PROGRAMMING—II: ALGORITHMS

Note that we now have a method that does not require a transition function or the need to compute an expectation; we only need access to some exogenous process that tells us what state we transition to, given a starting state and an action.
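A minimal sketch of this model-free update, assuming we only have a simulator (here a hypothetical random-walk environment) that returns a contribution and the next state. Acting purely greedily on the Q-factors as in Equation (8) can leave states unexplored, so the sketch adds a simple epsilon-greedy choice.

```python
import numpy as np

# Q-learning sketch: no transition function or expectation is needed, only an exogenous
# process (simulate_step) returning a contribution and the next state. The environment
# and the epsilon-greedy exploration are hypothetical choices.
rng = np.random.default_rng(2)
n_states, n_actions, gamma, alpha, eps = 10, 2, 0.9, 0.1, 0.1

def simulate_step(s, a):
    """Hypothetical exogenous process: move left/right with noise, reward at the right edge."""
    s_next = int(np.clip(s + (1 if a == 1 else -1) + rng.integers(-1, 2), 0, n_states - 1))
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return reward, s_next

Q = np.zeros((n_states, n_actions))
s = 0
for n in range(5000):
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    c, s_next = simulate_step(s, a)
    q_hat = c + gamma * np.max(Q[s_next])            # q^n = C(S^n,a^n) + gamma * max_a' Q^{n-1}(s',a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * q_hat  # smoothing update of the Q-factor
    s = s_next

print(np.round(Q, 2))
```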

The Postdecision State Variable

A closely related strategy is to break the transition function into two steps: the pure effect of the decision xt, and the effect of the new, random information Wt+1. In this section, we use the notation of operations research, because what we are going to do is enable the solution of problems where xt may be a very large vector. We also return to time-dependent notation since it clarifies when we are measuring a particular variable.

We can illustrate this idea using a simple inventory example. Let St be the amount of inventory on hand at time t. Let xt be an order for a new product, which we assume arrives immediately. The typical inventory equation would be written as

S_{t+1} = \max\{0, S_t + x_t - D_{t+1}\},

where Dt+1 is the random demand that arises between t and t + 1. We denote the pure effect of this order using

S_t^x = S_t + x_t.

The state S_t^x is referred to as the postdecision state [11], which is the state at time t, immediately after a decision has been made, but before any new information has arrived. The transition from S_t^x to St+1 in this example would then be given by

S_{t+1} = \max\{0, S_t^x - D_{t+1}\}.

Another example of a postdecision state is a tic-tac-toe board after you have made your move, but before your opponent has moved. Reference 5 refers to this as the after-state variable, but we prefer the notation that emphasizes the information content (which is why S_t^x is indexed by t).

A different way of representing a postdecision state is to assume that we have a forecast W_{t,t+1} = \mathbb{E}\{W_{t+1} \mid S_t\}, which is a point estimate of what we think Wt+1 will be given what we know at time t. A postdecision state can be thought of as a forecast of the state St+1 given what we know at time t, which is to say

S_t^x = S^M(S_t, x_t, W_{t,t+1}).

This approach will seem more natural in certain applications, but it would not be useful when random variables are discrete. Thus, it makes sense to use the expected demand, but not the expected action of your tic-tac-toe opponent.

In a typical decision tree, St would represent the information available at a decision node, while S_t^x would be the information available at an outcome node. Using these two steps, Bellman’s equations become

V_t(S_t) = \max_{x_t} \Big( C(S_t, x_t) + \gamma\, V_t^x(S_t^x) \Big),   (9)

V_t^x(S_t^x) = \mathbb{E}\{ V_{t+1}(S_{t+1}) \mid S_t \}.   (10)

Note that Equation (9) is now a deterministic optimization problem, while Equation (10) involves only an expectation. Of course, we still do not know V_t^x(S_t^x), but we can approximate it just as we would approximate Vt(St). However, now we have to pay attention to the nature of the optimization problem. If we are just searching over a small number of discrete actions, we do not really have to worry about the structure of our approximate value function around S_t^x. But if x is a vector, which might have hundreds or thousands of discrete or continuous variables, then we have to recognize that we are going to need some sort of solver to handle the maximization problem.

To illustrate the basic idea, assume that we have discrete states and are using a lookup table representation for a value function. We would solve Equation (9) to obtain a decision x_t^n, and we would then simulate our way to S_{t+1}^n using our transition function. Let v_t^n be the objective function returned by solving Equation (9) at time t, when we are in state S_t^n. We would then use this value to update the value at the previous postdecision state. That is,

V_{t-1}^{x,n}(S_{t-1}^{x,n}) = (1 - \alpha_{n-1})\, V_{t-1}^{x,n-1}(S_{t-1}^{x,n}) + \alpha_{n-1}\, v_t^n.   (11)


Thus, the updating is virtually the same as we did before; it is just that we are now using our estimate of the value of being in a state to update the value of the previous postdecision state. This is a small change, but it has a huge impact by eliminating the expectation in the decision problem.
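A minimal sketch of this variant for an inventory example, assuming a lookup table indexed by the postdecision state; the demand distribution, prices, and timing conventions are hypothetical. Note how the maximization in Equation (9) is deterministic, and how v is used to update the value of the previous postdecision state as in Equation (11).

```python
import numpy as np

# ADP around the postdecision state. Pre-decision state: (inventory R_t, newly arrived
# demand D_t). Contribution: revenue from serving D_t out of R_t minus the order cost.
# Postdecision state: leftover stock plus the order, S^x_t = max(R_t - D_t, 0) + x_t.
rng = np.random.default_rng(3)
T, N, gamma = 20, 500, 0.95
S_MAX, X_MAX, price, cost = 30, 10, 6.0, 4.0

Vx = np.zeros((T + 1, S_MAX + 1))                       # approximation of V^x_t(S^x_t)
for n in range(1, N + 1):
    alpha = 1.0 / n
    R, D = S_MAX // 2, rng.poisson(5)
    prev_sx = None
    for t in range(T):
        leftover = max(R - D, 0)
        # Equation (9): a deterministic maximization over the order quantity x_t.
        values = [price * min(R, D) - cost * x + gamma * Vx[t, min(leftover + x, S_MAX)]
                  for x in range(X_MAX + 1)]
        v, x_star = max(values), int(np.argmax(values))
        if prev_sx is not None:
            # Equation (11): use v to update the value of the previous postdecision state.
            Vx[t - 1, prev_sx] = (1 - alpha) * Vx[t - 1, prev_sx] + alpha * v
        prev_sx = min(leftover + x_star, S_MAX)
        R, D = prev_sx, rng.poisson(5)                  # sample W_{t+1} and step forward

print("approximate postdecision values at t=0:", np.round(Vx[0, :8], 2))
```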

Now imagine that we are managing different types of products, where Sti is the inventory of product of type i, and S_{ti}^x is the inventory after placing a new order. We might propose a value function approximation that looks like

V_t^x(S_t^x) = \sum_i \big( \theta_{1i}\, S_{ti}^x - \theta_{2i}\, (S_{ti}^x)^2 \big).

This value function is separable and concave in the inventory (assuming θ2i > 0). Using this approximation, we can probably solve our maximization problem using a nonlinear programming algorithm, which means we can handle problems with hundreds or thousands of products. Critical to this step is that the objective function in Equation (9) is deterministic.
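To make the last point concrete, here is a minimal sketch of the deterministic maximization in Equation (9) under the separable concave approximation, for a hypothetical multiproduct instance with a budget constraint; scipy's SLSQP solver stands in for the nonlinear programming algorithm.

```python
import numpy as np
from scipy.optimize import minimize

# Deterministic maximization of Equation (9) with a separable concave approximation:
# max_x  -cost.x + gamma * sum_i (theta1_i*(S_i+x_i) - theta2_i*(S_i+x_i)^2),
# subject to x >= 0 and a hypothetical ordering budget. All data below are made up.
rng = np.random.default_rng(4)
n_products, gamma, budget = 50, 0.95, 100.0
S = rng.uniform(0, 5, n_products)           # current inventories S_ti
cost = rng.uniform(1, 3, n_products)        # unit ordering costs
theta1 = rng.uniform(5, 10, n_products)     # linear coefficients theta_1i
theta2 = rng.uniform(0.1, 0.5, n_products)  # quadratic coefficients theta_2i > 0 (concavity)

def neg_objective(x):
    sx = S + x                              # postdecision inventories S^x_ti
    return -(-cost @ x + gamma * np.sum(theta1 * sx - theta2 * sx ** 2))

res = minimize(neg_objective, x0=np.zeros(n_products), method="SLSQP",
               bounds=[(0, None)] * n_products,
               constraints=[{"type": "ineq", "fun": lambda x: budget - cost @ x}])
print("optimal order (first 5 products):", np.round(res.x[:5], 2))
```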

Comments

At first glance, Q-factors and ADP using the postdecision state variable seem to be very different methods to address very different problems. However, there is a mathematical commonality to these two algorithmic strategies. First, we note that if S is our current state, (S, a) is a kind of postdecision state, in that it is a deterministic function of the state and action. Second, if we take a close look at Equation (9), we can write

Q(S_t, x_t) = C(S_t, x_t) + \gamma\, V_t^x(S_t^x).

We assume that we have a known function S_t^x that depends only on St and xt, but we do not require the full transition function (which takes us to time t + 1), and we do not have to take an expectation. However, rather than develop an approximation of the value of a state and an action, we only require the value of a (postdecision) state. The task of finding the best action from a set of Q-factors (Eq. 8) is identical to the maximization problem in Equation (9). What has been overlooked in the ADP/RL community is that this step makes it possible to search over large, multidimensional (and potentially continuous) action spaces. Multidimensional action spaces are a relatively overlooked problem class in the ADP/RL communities.

APPROXIMATE POLICY ITERATION

An important variant of ADP is approximate policy iteration, which is illustrated in Fig. 2 using the postdecision state variables. In this strategy, we simulate a policy, say, M times over the planning horizon (or M steps into the future). During these inner iterations, we fix the policy (i.e., we fix the value function approximation used to make a decision) to obtain a better estimate of the value of being in a state. The general process of updating the value function approximation is handled using the update function U^V(·). As M → ∞, we obtain an exact estimate of the value of being in a particular state, while following a fixed policy (see Ref. 12 for the convergence theory of this technique).

If we randomize on the starting state and repeat this simulation an infinite number of times, we can obtain the best possible value function given a particular approximation. It is possible to prove convergence of classical stochastic approximation methods when the policy is fixed. Unfortunately, there are very few convergence results when we combine sampled observations of the value of being in a state with changes in the policy.

If we are using low-dimensional representations of the value function (as occurs with basis functions), it is important that our update be of sufficient accuracy. If we use M = 1, which means we are updating after a single forward traversal, the value function approximation (and therefore the policy) may change too rapidly from one iteration to another. The resulting algorithm may actually diverge [13,14].

Figure 2. Approximate policy iteration using value function-based policies.
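Since Fig. 2 is not reproduced here, the following is a minimal stand-in sketch of the approximate policy iteration loop, in a steady-state version of the hypothetical inventory model used earlier: the policy implied by a frozen value function approximation is held fixed while M trajectories are simulated to evaluate it, and only then is the approximation updated from the averaged observations.

```python
import numpy as np
from collections import defaultdict

# Approximate policy iteration sketch: freeze the policy (V_frozen), simulate M
# trajectories to evaluate it, then update V. Problem data are hypothetical.
rng = np.random.default_rng(5)
S_MAX, X_MAX, price, cost, gamma = 30, 10, 6.0, 4.0, 0.95
N_OUTER, M, T = 30, 25, 40

V = np.zeros(S_MAX + 1)                                   # value of the postdecision state
for n in range(1, N_OUTER + 1):
    V_frozen = V.copy()                                   # fix the policy for the inner loop
    totals, counts = defaultdict(float), defaultdict(int)
    for m in range(M):
        R, D = S_MAX // 2, rng.poisson(5)
        path, rewards = [], []
        for t in range(T):
            leftover = max(R - D, 0)
            vals = [price * min(R, D) - cost * x + gamma * V_frozen[min(leftover + x, S_MAX)]
                    for x in range(X_MAX + 1)]
            x = int(np.argmax(vals))                      # greedy decision under the frozen policy
            sx = min(leftover + x, S_MAX)
            path.append(sx)
            rewards.append(price * min(R, D) - cost * x)
            R, D = sx, rng.poisson(5)
        ret = 0.0
        for t in reversed(range(T)):                      # discounted contributions after each state
            totals[path[t]] += ret
            counts[path[t]] += 1
            ret = rewards[t] + gamma * ret
    alpha = 1.0 / n
    for sx in counts:                                     # update V only after the M inner runs
        V[sx] = (1 - alpha) * V[sx] + alpha * totals[sx] / counts[sx]

print(np.round(V[:8], 1))
```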

TYPES OF VALUE FUNCTION APPROXIMATIONS

At the heart of approximate dynamic programming is approximating the value of being in a state. It is important to emphasize that the problem of designing and estimating a value function approximation draws on the entire field of statistics and machine learning. Below, we discuss two very general approximation methods. You will generally find that ADP is opening a doorway into the field of machine learning, and that this is the place where you will spend most of your time. You should pick up a good reference book such as Ref. 15 on statistical learning methods, and keep an open mind regarding the best method for approximating a value function (or the policy directly).

Lookup Tables with Aggregation

The simplest form of statistical learning uses the lookup table representation that we first illustrated in Equation (5). The problem with this method is that it does not apply to continuous states, and requires an exponentially large number of parameters (one per state) when we encounter vector-valued states.

A popular way to overcome large state spaces is through aggregation. Let \mathcal{S} be the set of states. We can represent aggregation using a family of mappings

G^g : \mathcal{S} \rightarrow \mathcal{S}^{(g)},

where \mathcal{S}^{(g)} represents the gth level of aggregation of the state space \mathcal{S}, and we assume that \mathcal{S}^{(0)} = \mathcal{S}. Let

s^{(g)} = G^g(s), the gth level aggregation of state s,

\mathcal{G} = the set of indices corresponding to the levels of aggregation.

We then define V^{(g)}(s) to be the value function approximation for state s \in \mathcal{S}^{(g)}.

We assume that the family of aggregation functions G^g, g \in \mathcal{G}, is given. A common strategy is to pick a single level of aggregation that seems to work well, but this introduces several complications. First, the best level of aggregation changes as we acquire more data, creating the challenge of working out how to transition from approximating the value function in the early iterations versus later iterations. Second, the right level of aggregation depends on how often we visit a region of the state space.

A more natural way is to use a weighted average. At iteration n, we can approximate the value of being in state s using

v^n(s) = \sum_{g \in \mathcal{G}} w^{(g,n)}(s)\, v^{(g,n)}(s),

where w^{(g,n)}(s) is the weight given to the gth level of aggregation when measuring state s. This weight is set to zero if there are no observations of the aggregated state s^{(g)} \in \mathcal{S}^{(g)}. Thus, while there may be an extremely large number of states, the number of positive weights at the most disaggregate level, w^{(0,n)}, is never greater than the number of observations. Typically, the number of positive weights at more aggregate levels will be substantially smaller than this.

There are different ways to determine these weights, but an effective method is to use weights that are inversely proportional to the total squared variation of each estimator. Let

(\sigma^2(s))^{(g,n)} = the variance of the estimate of the value of state s at the gth level of aggregation after collecting n observations,

\mu^{(g,n)}(s) = an estimate of the bias due to aggregation.

We can then use weights that satisfy

w^{(g)}(s) \propto \Big( (\sigma^2(s))^{(g,n)} + \big( \mu^{(g,n)}(s) \big)^2 \Big)^{-1}.   (14)

The weights are then normalized so that they sum to one. The quantities (\sigma^2(s))^{(g,n)} and \mu^{(g,n)}(s) are computed using the following, where \hat{v}^n(s) denotes the nth sampled observation of the value of state s:

(\sigma^2(s))^{(g,n)} = \mathrm{Var}\big[ v^{(g,n)}(s) \big] = \lambda^{(g,n)}(s)\, (s^2(s))^{(g,n)},   (15)

(s^2(s))^{(g,n)} = \frac{ \nu^{(g,n)}(s) - \big( \beta^{(g,n)}(s) \big)^2 }{ 1 + \lambda^{(g,n-1)}(s) },   (16)

\lambda^{(g,n)}(s) = \begin{cases} \big( \alpha^{(g,n-1)}(s) \big)^2, & n = 1, \\ \big( 1 - \alpha^{(g,n-1)}(s) \big)^2 \lambda^{(g,n-1)}(s) + \big( \alpha^{(g,n-1)}(s) \big)^2, & n > 1, \end{cases}   (17)

\nu^{(g,n)}(s) = (1 - \alpha_{n-1})\, \nu^{(g,n-1)}(s) + \alpha_{n-1} \big( v^{(g,n-1)}(s) - \hat{v}^n(s) \big)^2,   (18)

\beta^{(g,n)}(s) = (1 - \alpha_{n-1})\, \beta^{(g,n-1)}(s) + \alpha_{n-1} \big( \hat{v}^n(s) - v^{(g,n-1)}(s) \big),   (19)

\mu^{(g,n)}(s) = v^{(g,n)}(s) - v^{(0,n)}(s).   (20)

The term \nu^{(g,n)}(s) is the total squared variation in the estimate, which includes variation due to pure noise, (s^2(s))^{(g,n)}, and two sources of bias. The term \beta^{(g,n)}(s) is the bias introduced when smoothing a nonstationary data series (if the series is steadily rising, our estimates are biased downward), and \mu^{(g,n)}(s) is the bias due to aggregation error.

This system for computing weights scales easily to large state spaces. As we collect more observations of states in a particular region, the weights gravitate toward putting more weight on disaggregate estimates, while keeping higher weights on more aggregate estimates for regions that are only lightly sampled. For more on this strategy, see Ref. 16 or Chapter 7 in Ref. 2. For other important references on aggregation, see Refs 4, 17, 18 and 19.
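A minimal sketch of the weighting scheme in Equations (14)-(20) for a single state observed repeatedly, with two hypothetical levels: the state itself (g = 0) and one aggregate level (g = 1) whose observation stream pools sibling states with a different true value, creating aggregation bias. The 1/n step size at every level is an assumption.

```python
import numpy as np

# Hierarchical-aggregation weights (Equations (14)-(20)) for one state s and G = 2 levels.
rng = np.random.default_rng(6)
true_value, sibling_value, noise = 10.0, 13.0, 4.0
G = 2
v = np.zeros(G)        # v^(g,n)(s): smoothed estimate at each level
nu = np.zeros(G)       # nu^(g,n)(s), Eq. (18)
beta = np.zeros(G)     # beta^(g,n)(s), Eq. (19)
lam = np.zeros(G)      # lambda^(g,n)(s), Eq. (17)
w = np.full(G, 1.0 / G)

for n in range(1, 201):
    alpha = 1.0 / n
    obs = np.array([true_value + noise * rng.standard_normal(),
                    (true_value if rng.random() < 0.3 else sibling_value)
                    + noise * rng.standard_normal()])      # pooled observation at the aggregate level
    nu = (1 - alpha) * nu + alpha * (v - obs) ** 2          # Eq. (18)
    beta = (1 - alpha) * beta + alpha * (obs - v)           # Eq. (19)
    v = (1 - alpha) * v + alpha * obs                       # smoothed estimates v^(g,n)(s)
    s2 = (nu - beta ** 2) / (1 + lam)                       # Eq. (16), using the previous lambda
    lam = np.full(G, alpha ** 2) if n == 1 else (1 - alpha) ** 2 * lam + alpha ** 2   # Eq. (17)
    var = lam * s2                                          # Eq. (15)
    mu = v - v[0]                                           # Eq. (20): bias due to aggregation
    w = 1.0 / (var + mu ** 2 + 1e-9)                        # Eq. (14); small constant avoids 0/0
    w = w / w.sum()                                         # normalize the weights to sum to one

print("weights (disaggregate, aggregate):", np.round(w, 3))
print("combined estimate:", round(float(w @ v), 2), " true value:", true_value)
```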

Basis Functions

Perhaps the most widely cited method for approximating value functions uses the concept of basis functions. Let

\mathcal{F} = a set of features drawn from a state vector,

\phi_f(S) = a scalar function that draws what is felt to be a useful piece of information from the state variable, for f \in \mathcal{F}.

The function \phi_f(S) is referred to as a basis function, since in an ideal setting we would have a family of functions that spans the state space, allowing us to write

V(S \mid \theta) = \sum_{f \in \mathcal{F}} \theta_f\, \phi_f(S)


for an appropriately chosen vector of parameters θ. In practice, we cannot guarantee that we can find such a set of basis functions, so we write

V(S \mid \theta) \approx \sum_{f \in \mathcal{F}} \theta_f\, \phi_f(S).

Basis functions are typically referred to as independent variables or covariates in the statistics community. If S is a scalar variable, we might write

V(S \mid \theta) = \theta_0 + \theta_1 S + \theta_2 S^2 + \theta_3 \sin(S) + \theta_4 \ln(S).

This type of approximation is referred to as linear, since it is linear in the parameters. However, it is clear that we are capturing nonlinear relationships between the value function and the state variable.

There are several ways to estimate the parameter vector. The easiest, which is often referred to as least squares temporal differencing (LSTD), uses v_t^n as it is calculated in Equation (4), which means that it depends on the value function approximation. An alternative is least squares policy estimation (LSPE), where we use repeated samples of rewards from simulating a fixed policy. The basic LSPE algorithm estimates the parameter vector θ by solving

\theta^n = \arg\min_\theta \sum_{m=1}^{n} \Big( v_t^m - \sum_{f \in \mathcal{F}} \theta_f\, \phi_f(S_t^m) \Big)^2.

This algorithm provides nice convergence guarantees, but the updating step becomes more and more expensive as the algorithm progresses. Also, it puts equal weight on all iterations, a property that is not desirable in practice. We refer the reader to Ref. 6 for a more thorough discussion of LSTD and LSPE.
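A minimal sketch of the batch regression above, assuming we have collected sampled values v^m and the corresponding states from simulating a fixed policy; the polynomial features and synthetic data are hypothetical stand-ins for the basis functions.

```python
import numpy as np

# Least-squares fit of theta from sampled (state, value) pairs, as in the LSPE regression.
rng = np.random.default_rng(7)
states = rng.uniform(0.0, 10.0, size=200)                   # visited states S^m
values = 3.0 + 2.0 * states - 0.15 * states ** 2 \
         + rng.normal(0.0, 1.0, size=200)                   # sampled values v^m (noisy)

def phi(s):
    return np.array([1.0, s, s ** 2])                       # basis functions phi_f(S)

Phi = np.vstack([phi(s) for s in states])                   # design matrix, one row per sample
theta, *_ = np.linalg.lstsq(Phi, values, rcond=None)        # arg min_theta ||Phi theta - v||^2
print("fitted theta:", np.round(theta, 3))
```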

A simple alternative uses a stochastic gradient algorithm. Assuming we have an initial estimate θ^0, we can update the estimate of θ at iteration n using

\theta^n = \theta^{n-1} - \alpha_{n-1} \big( V_t(S_t^n \mid \theta^{n-1}) - v_t^n \big)\, \nabla_\theta V_t(S_t^n \mid \theta^{n-1})

        = \theta^{n-1} - \alpha_{n-1} \big( V_t(S_t^n \mid \theta^{n-1}) - v_t^n \big) \begin{pmatrix} \phi_1(S_t^n) \\ \phi_2(S_t^n) \\ \vdots \\ \phi_F(S_t^n) \end{pmatrix}.   (21)

Here, \alpha_{n-1} is a step size, but it has to perform a scaling function, so it is not necessarily less than 1. Depending on the problem, we might have a step size that starts around 10^6 or 0.00001.
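A minimal sketch of the stochastic gradient update (21) on synthetic data; the basis functions, the data, and the constant step size are hypothetical choices, and as noted above the step size also acts as a scaling factor.

```python
import numpy as np

# Stochastic gradient update of theta (Equation (21)): after each observation, step
# against the gradient of the squared prediction error.
rng = np.random.default_rng(8)

def phi(s):
    return np.array([1.0, s, s ** 2])              # basis functions phi_f(S)

theta, alpha = np.zeros(3), 0.01
for n in range(50000):
    s = rng.uniform(0.0, 2.0)                      # state visited at iteration n
    v_n = 1.0 + 2.0 * s - 0.5 * s ** 2 + rng.normal(0.0, 0.5)   # sampled value v^n
    error = phi(s) @ theta - v_n                   # V(S|theta^{n-1}) - v^n
    theta -= alpha * error * phi(s)                # Equation (21)

print("theta after stochastic gradient updates:", np.round(theta, 2))
```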

A more powerful strategy uses recursive least squares. Let \phi^n be the vector created by computing \phi_f(S^n) for each feature f \in \mathcal{F}. The parameter vector θ can be updated using

\theta^n = \theta^{n-1} - H^n \phi^n \varepsilon^n,   (22)

where \phi^n is the vector of basis functions evaluated at S^n, and \varepsilon^n = V(S_t^n \mid \theta^{n-1}) - v_t^n is the error in our prediction of the value function. The matrix H^n is computed using

H^n = \frac{1}{\gamma^n} B^{n-1}.   (23)

B^{n-1} is an (F + 1) \times (F + 1) matrix (where F = |\mathcal{F}|), which is updated recursively using

B^n = B^{n-1} - \frac{1}{\gamma^n} \big( B^{n-1} \phi^n (\phi^n)^T B^{n-1} \big).   (24)

\gamma^n is a scalar computed using

\gamma^n = 1 + (\phi^n)^T B^{n-1} \phi^n.   (25)

Note that there is no step size in these equations. These equations avoid the scaling issues of the stochastic gradient update in Equation (21), but hide the fact that they are implicitly using a step size of 1/n. This can be very effective, but it can also work very poorly. We can overcome this by introducing a factor λ and revising the updating of \gamma^n and B^n using

\gamma^n = \lambda + (\phi^n)^T B^{n-1} \phi^n,   (26)

and the updating formula for B^n, which is now given by


B^n = \frac{1}{\lambda} \Big( B^{n-1} - \frac{1}{\gamma^n} \big( B^{n-1} \phi^n (\phi^n)^T B^{n-1} \big) \Big).

This method puts a weight of \lambda^{n-m} on an observation from iteration n − m. If λ = 1, then we are weighting all observations equally. Smaller values of λ put a lower weight on earlier observations. See Chapter 7 of Ref. 2 for a more thorough discussion of this strategy. For a thorough presentation of recursive estimation methods for basis functions in ADP, we encourage readers to see Ref. 19.
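A minimal sketch of the recursive least squares updates (22)-(26) with a forgetting factor, again on hypothetical synthetic data; the initialization B = 1000·I is a common choice that the text does not prescribe.

```python
import numpy as np

# Recursive least squares with a forgetting factor lam (Equations (22)-(26)).
# Each new observation (phi^n, v^n) updates theta without storing past data.
rng = np.random.default_rng(9)

def phi(s):
    return np.array([1.0, s, s ** 2])

F = 3
theta = np.zeros(F)
B = np.eye(F) * 1000.0                       # large initial B: little confidence in theta^0
lam = 0.99                                   # lam = 1 weights all observations equally

for n in range(2000):
    s = rng.uniform(0.0, 2.0)
    v = 1.0 + 2.0 * s - 0.5 * s ** 2 + rng.normal(0.0, 0.2)   # sampled value v^n
    p = phi(s)
    gamma_n = lam + p @ B @ p                # Equation (26)
    H = B / gamma_n                          # Equation (23)
    eps = p @ theta - v                      # prediction error epsilon^n
    theta = theta - H @ p * eps              # Equation (22)
    B = (B - np.outer(B @ p, p @ B) / gamma_n) / lam          # Equations (24) and (26)

print("theta from recursive least squares:", np.round(theta, 3))
```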

Values versus Marginal Values

There are many problems where it is more effective to use the derivative of the value function with respect to a state, rather than the value of being in the state. This is particularly powerful in the context of resource allocation problems, where we are solving a vector-valued decision problem using linear, nonlinear, or integer programming. In these problem classes, the derivative of the value function is much more important than the value function itself. This idea has long been recognized in the control theory community, which has used the term heuristic dynamic programming to refer to ADP using the value of being in a state, and ‘‘dual heuristic’’ dynamic programming when using the derivative (see Ref. 20 for a nice discussion of these topics from a control-theoretic perspective). Chapters 11 and 12 in Ref. 2 discuss the use of derivatives for applications that arise in operations research.

Discussion

Approximating value functions can be viewed as just an application of a vast array of statistical methods to estimate the value of being in a state. Perhaps the biggest challenge is model specification, which can include designing the basis functions. See Refs 14 and 21 for a thorough discussion of feature selection. There has also been considerable interest in the use of nonparametric methods [22]. An excellent reference for statistical learning techniques that can be used in this setting is Ref. 15.

The estimation of value function approximations, however, involves much more than just a simple application of statistical methods. There are several issues that have to be addressed that are unique to the ADP setting:

1. Value function approximations have to be estimated recursively. There are many statistical methods that depend on estimating a model from a fixed batch of observations. In ADP, we have to update value function approximations after each observation.

2. We have to get through the early iterations. In the first iteration, we have no observations. After 10 iterations, we have to work with value functions that have been approximated using 10 data points (even though we may have hundreds or thousands of parameters to estimate). We depend on the value function approximations in these early iterations to guide decisions. Poor decisions in the early iterations can lead to poor value function approximations.

3. The value v_t^n depends on the value function approximation V_{t+1}(S). Errors in this approximation produce biases in v_t^n, which then distort our ability to estimate V_t(S). As a result, we are recursively estimating value functions using nonstationary data.

4. We can choose which state to visit next. We do not have to use S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1}^n), the strategy known as trajectory following. This is known in the ADP community as the ‘‘exploration vs. exploitation’’ problem: do we explore a state just to learn about it, or do we visit a state that appears to be the best? Although considerable attention has been given to this problem, it is a largely unresolved issue, especially for high-dimensional problems.

5. Everyone wants to avoid the art of specifying value function approximations. Here, the ADP community shares the broader goal of the statistical learning community: a method that will approximate a value function with no human intervention.


THE LINEAR PROGRAMMING METHOD

The best-known methods for solving dynamic programs (in steady state) are value iteration and policy iteration. Less well known is that we can solve for the value of being in each (discrete) state by solving the linear program

\min_v \sum_{s \in \mathcal{S}} \beta_s\, v(s)   (27)

subject to

v(s) \ge C(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, v(s') \quad \text{for all } s \text{ and } a,   (28)

where β is simply a vector of positive coefficients. In this problem, the decision variable is the value v(s) of being in state s, which has to satisfy the constraints (Eq. 28). The problem with this method is that there is a decision variable for each state, and a constraint for each state-action pair. Not surprisingly, this can produce very large linear programs very quickly.

The linear programming method received a new lease on life with the work of de Farias and Van Roy [23], who introduced ADP concepts to this method. The first step to reduce complexity involves introducing basis functions to simplify the representation of the value function, giving us

\min_\theta \sum_{s \in \mathcal{S}} \beta_s \sum_{f \in \mathcal{F}} \theta_f\, \phi_f(s)

subject to

\sum_{f \in \mathcal{F}} \theta_f\, \phi_f(s) \ge C(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a) \sum_{f \in \mathcal{F}} \theta_f\, \phi_f(s') \quad \text{for all } s \text{ and } a.

Now we have reduced a problem with |\mathcal{S}| variables (one value per state) to one with a vector \theta_f, f \in \mathcal{F}. Much easier, but we still have the problem of too many constraints. de Farias and Van Roy [23] then introduced the novel idea of using a Monte Carlo sample of the constraints. As of this writing, this method is receiving considerable attention in the research community, but as with all ADP algorithms, considerable work is needed to produce robust algorithms that work in an industrial setting.
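A minimal sketch of the constraint-sampled linear program, on a small randomly generated MDP with hypothetical polynomial basis functions; scipy's linprog is used as the LP solver, the state-action pairs are sampled uniformly here, and a simple box on θ is added as a safeguard against unboundedness of the sampled LP.

```python
import numpy as np
from scipy.optimize import linprog

# Constraint-sampled LP: represent v(s) = sum_f theta_f phi_f(s) and keep only a random
# sample of the constraints (28). The MDP and basis functions are hypothetical.
rng = np.random.default_rng(10)
n_states, n_actions, gamma, n_samples = 200, 5, 0.9, 300
C = rng.uniform(0.0, 10.0, size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, s, :] = p(.|s, a)

s_grid = np.linspace(0.0, 1.0, n_states)
Phi = np.column_stack([np.ones(n_states), s_grid, s_grid ** 2])    # phi_f(s) for every state s
beta = np.full(n_states, 1.0 / n_states)                           # positive state weights beta_s

c = Phi.T @ beta                                                   # objective: sum_s beta_s v(s)
rows, rhs = [], []
for _ in range(n_samples):                                         # sampled constraints only
    s, a = rng.integers(n_states), rng.integers(n_actions)
    lhs = Phi[s] - gamma * (P[a, s] @ Phi)                         # coefficient of theta in (28)
    rows.append(-lhs)                                              # linprog uses A_ub x <= b_ub
    rhs.append(-C[s, a])

res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
              bounds=[(-1e3, 1e3)] * Phi.shape[1], method="highs")
print("theta:", np.round(res.x, 3), "| status:", res.message)
```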

STEP SIZES

While designing a good value function approximation is the most central challenge in the design of an ADP algorithm, a critical and often overlooked choice is the design of a step size rule. This is not to say that step sizes have been ignored in the research literature, but it is frequently the case that developers do not realize the dangers arising from an incorrect choice of step size rule.

We can illustrate the basic challenge in the design of a step size rule by considering a very simple dynamic program. In fact, our dynamic program has only one state and no actions. We can write it as computing the following expectation:

F = \mathbb{E} \sum_{t=0}^{\infty} \gamma^t C_t.

Assume that the contributions Ct are identically distributed as a random variable C and that EC = c. We know that the answer to this problem is c/(1 − γ), but assume that we do not know c, and we have to depend on random observations of C. We can solve this problem using the following two-step algorithm:

v^n = C^n + \gamma\, \bar{v}^{n-1},   (29)

\bar{v}^n = (1 - \alpha_{n-1})\, \bar{v}^{n-1} + \alpha_{n-1}\, v^n.   (30)

We are using an iterative algorithm, where \bar{v}^n is our estimate of F after n iterations. We assume that C^n is the nth sample realization of C.

We note that Equation (29) handles the problem of summing over multiple contributions. Since we depend on Monte Carlo observations of C, we then use Equation (30) to perform smoothing. If C were deterministic, we would use a step size \alpha_{n-1} = 1 to achieve the fastest convergence.


On the other hand, if C is random but γ = 0 (which means we do not really have to deal with the infinite sum), then the best possible step size is 1/n (we are just trying to estimate the mean of C). In fact, 1/n satisfies a common set of theoretical conditions for convergence, which are typically written

\sum_{n=1}^{\infty} \alpha_n = \infty, \qquad \sum_{n=1}^{\infty} (\alpha_n)^2 < \infty.

The first of these conditions prevents stalling (a step size of 0.8^n would fail this condition), while the second ensures that the variance of the resulting estimate goes to zero (statistical convergence).

Not surprisingly, many step size formulas have been proposed in the literature. A review is provided in Ref. 24 (see also Chapter 6 in Ref. 2). Two popular step size rules that satisfy the conditions above are

\alpha_n = \frac{a}{a + n - 1}, \qquad \text{and} \qquad \alpha_n = \frac{1}{n^\beta}.

In the first formula, a = 1 gives 1/n, whereas larger values of a produce a step size that declines more slowly. In the second rule, β has to satisfy 0.5 < β ≤ 1. Of course, these rules can be combined, but both a and β have to be tuned, which can be particularly annoying.
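A minimal sketch applying Equations (29)-(30) with the two step size rules above, for a hypothetical contribution distribution; the exact answer c/(1 − γ) is known here, so the effect of the rule can be inspected directly. The tuning values a = 10 and β = 0.7 are arbitrary choices.

```python
import numpy as np

# Equations (29)-(30) on the one-state, no-action problem, under two step size rules.
rng = np.random.default_rng(11)
gamma, c, N = 0.9, 2.0, 5000

def harmonic(n, a=10.0):
    return a / (a + n - 1)                    # generalized harmonic rule

def polynomial(n, beta=0.7):
    return 1.0 / n ** beta                    # polynomial rule, 0.5 < beta <= 1

def run(step_rule):
    v_bar = 0.0
    for n in range(1, N + 1):
        C_n = c + rng.normal(0.0, 1.0)        # nth sample realization of C
        v = C_n + gamma * v_bar               # Equation (29)
        v_bar = (1 - step_rule(n)) * v_bar + step_rule(n) * v   # Equation (30)
    return v_bar

print("exact:", c / (1 - gamma))
print("generalized harmonic (a=10):", round(run(harmonic), 2))
print("polynomial (beta=0.7):", round(run(polynomial), 2))
```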

Reference 24 introduces the following step size rule, called the bias adjusted Kalman filter (BAKF) step size rule:

\alpha_n = 1 - \frac{\sigma^2}{(1 + \lambda^n)\, \sigma^2 + (\beta^n)^2},   (31)

where

\lambda^n = \begin{cases} (\alpha_{n-1})^2, & n = 1, \\ (1 - \alpha_{n-1})^2\, \lambda^{n-1} + (\alpha_{n-1})^2, & n > 1. \end{cases}

Here, \sigma^2 is the variance of the observation noise, and \beta^n is the bias measuring the difference between the current estimate of the value function V^n(S^n) and the true value function V(S^n). Since these quantities are not generally known, they have to be estimated from data (see Ref. 24 and Chapter 6 of Ref. 2 for details).

This step size enjoys some useful properties. If there is no noise, then \alpha_n = 1. If \beta^n = 0, then \alpha_n = 1/n. It can also be shown that \alpha_n ≥ 1/n at all times. The primary difficulty is that the bias term has to be estimated from noisy data, which can cause problems. If the level of observation noise is very high, we recommend a deterministic step size formula. But if the observation noise is not too high, the BAKF rule can work quite well, because it adapts to the data.
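A minimal sketch of the BAKF rule (31) applied to the same one-state problem, under two simplifying assumptions: the observation-noise variance is taken as known, and the bias is estimated by simply smoothing the observed errors, a crude stand-in for the full data-driven estimation scheme developed in Refs 2 and 24.

```python
import numpy as np

# BAKF step size rule (Equation (31)) driving the smoothing in Equations (29)-(30).
rng = np.random.default_rng(12)
gamma, c, N, sigma2 = 0.9, 2.0, 5000, 1.0

v_bar, lam, beta_hat, alpha = 0.0, 0.0, 0.0, 1.0     # alpha plays the role of alpha_{n-1}
for n in range(1, N + 1):
    C_n = c + rng.normal(0.0, np.sqrt(sigma2))
    v = C_n + gamma * v_bar                          # Equation (29)
    beta_hat = 0.9 * beta_hat + 0.1 * (v - v_bar)    # smoothed error as a crude bias estimate
    lam = alpha ** 2 if n == 1 else (1 - alpha) ** 2 * lam + alpha ** 2   # lambda^n recursion
    alpha = 1 - sigma2 / ((1 + lam) * sigma2 + beta_hat ** 2)             # Equation (31)
    v_bar = (1 - alpha) * v_bar + alpha * v          # Equation (30)

print("BAKF estimate:", round(v_bar, 2), " exact:", c / (1 - gamma))
```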

PRACTICAL ISSUES

ADP is a flexible modeling and algorithmic framework that requires that a developer make a number of choices in the design of an effective algorithm. Once you have developed a model, designed a value function approximation, and chosen a learning strategy, you still face additional steps before you can conclude that you have an effective strategy.

Usually the first challenge you will face is debugging a model that does not seem to be working. For example, as the algorithm learns, the solution will tend to get better, although it is hardly guaranteed to improve from one iteration to the next. But what if the algorithm does not seem to produce any improvement at all?

Start by making sure that you are solving the decision problem correctly, given the value function approximation (even if it may be incorrect). This is really only an issue for more complex problems, such as when x is an integer-valued vector, requiring the use of more advanced algorithmic tools.

Then, most importantly, verify the accuracy of the information (such as v) that you are deriving to update the value function. Are you using an estimate of the value of being in a state, or perhaps the marginal value of an additional resource? Is this value correct? If you are using a forward pass algorithm (where v depends on a value function approximation of the future), assume the approximation is correct and validate v. This is more difficult if you are using a backward pass (as with TD(1)), since v has to track the value (or marginal value) of an entire trajectory.

When you have confidence that v is correct, are you updating the value function approximation correctly? For example, if you compare v_t^n with V^{n-1}(S_t^n), the observations of v^n (over iterations) should look like the scattered noise of observations around an estimated mean. Do you see a persistent bias?

Finally, think about whether your value function approximations are really contributing better decisions. What if you set the value function to zero? What intelligence do you feel your value function approximations are really adding? Keep in mind that there are some problems where a myopic policy can work reasonably well.

Now, assume that you are convinced that your value function approximations are improving your solution. One of the most difficult challenges is evaluating the quality of an ADP solution. Perhaps you feel you are getting a good solution, but how far from optimal are you?

There is no single answer to the question of how to evaluate an ADP algorithm. Typically, you want to ask the question: If I simplify my problem in some way (without making it trivial), do I obtain an interesting problem that I can solve optimally? If so, you should be able to apply your ADP algorithm to this simplified problem, producing a benchmark against which you can compare.

How a problem might be simplified is situation specific. One strategy is to eliminate uncertainty. If the deterministic problem is still interesting (for example, this might be the case in a transportation application, but not a financial application), then this can be a powerful benchmark. Alternatively, it might be possible to reduce the problem to one that can be solved exactly as a discrete Markov decision process.

If these approaches do not produce anything, the only remaining alternative is to compare an ADP-based strategy against a policy that is being used (or most likely to be used) in practice. If it is not the best policy, is it at least an improvement? Of course, there are many settings where ADP is being used to develop a model to perform policy studies. In such situations, it is typically more important that the model behave reasonably. For example, do the results change in response to changes in inputs as expected?

ADP is more like a toolbox than a recipe. A successful model requires some creativity and perseverance. There are many instances of people who try ADP, only to conclude that it ‘‘does not work.’’ It is very easy to miss one ingredient and produce a model that does not work. But a successful model not only solves a problem but also enhances your ability to understand and solve complex decision problems.

REFERENCES

1. Powell WB. Approximate dynamic programming — I: Modeling. Encyclopedia for operations research and management science. New York: John Wiley & Sons, Inc.; 2010.

2. Powell WB. Approximate dynamic programming: solving the curses of dimensionality. New York: John Wiley & Sons, Inc.; 2007.

3. Puterman ML. Markov decision processes. New York: John Wiley & Sons, Inc.; 1994.

4. Bertsekas D, Tsitsiklis J. Neuro-dynamic programming. Belmont (MA): Athena Scientific; 1996.

5. Sutton R, Barto A. Reinforcement learning. Cambridge (MA): The MIT Press; 1998.

6. Bertsekas D. Volume II, Dynamic programming and optimal control. 3rd ed. Belmont (MA): Athena Scientific; 2007.

7. Sutton R. Learning to predict by the methods of temporal differences. Mach Learn 1988;3(1):9–44.

8. Barto AG, Bradtke SJ, Singh SP. Learning to act using real-time dynamic programming. Artif Intell Spec Vol Comput Res Interact Agency 1995;72:81–138.

9. Sutton R. On the virtues of linear learning and trajectory distributions. In: Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference; 1995. pp. 95–206.

10. Watkins C. Learning from delayed rewards [PhD thesis]. Cambridge: Cambridge University; 1989.

11. Van Roy B, Bertsekas DP, Lee Y, et al. A neuro-dynamic programming approach to retailer inventory management. In: Proceedings of the IEEE Conference on Decision and Control, Volume 4, 1997. pp. 4052–4057.

12. Tsitsiklis J, Van Roy B. An analysis of temporal-difference learning with function approximation. IEEE Trans Autom Control 1997;42:674–690.

13. Bell DE. Risk, return, and utility. Manage Sci 1995;41:23–30.

14. Tsitsiklis JN, Van Roy B. Feature-based methods for large-scale dynamic programming. Mach Learn 1996;22:59–94.

15. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning, Springer Series in Statistics. New York: Springer; 2001.

16. George A, Powell WB, Kulkarni S. Value function approximation using multiple aggregation for multiattribute resource management. J Mach Learn Res 2008;2079–2111.

17. Bertsekas D, Castanon D. Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Trans Automat Control 1989;34(6):589–598.

18. Luus R. Iterative dynamic programming. New York: Chapman & Hall/CRC; 2000.

19. Choi DP, Van Roy B. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dyn Syst 2006;16:207–239.

20. Ferrari S, Stengel RF. Model-based adaptive critic designs. In: Si J, Barto AG, Powell WB, et al., editors. Handbook of learning and approximate dynamic programming. New York: IEEE Press; 2004. pp. 64–94.

21. Fan J, Li R. Statistical challenges with high dimensionality: feature selection in knowledge discovery. In: Sanz-Sole M, Soria J, Varona JL, et al., editors. Volume III, Proceedings of the International Congress of Mathematicians. Zurich: European Mathematical Society Publishing House; 2006. pp. 595–622.

22. Fan J, Gijbels I. Local polynomial modeling and its applications. London: Chapman and Hall; 1996.

23. de Farias D, Van Roy B. On constraint sampling in the linear programming approach to approximate dynamic programming. Math Oper Res 2004;29(3):462–478.

24. George A, Powell WB. Adaptive step sizes for recursive estimation with applications in approximate dynamic programming. Mach Learn 2006;65(1):167–198.