
Dynamic Resource Allocation for Optimizing Population Diffusion

Shan Xue
School of EECS, Oregon State University

Alan Fern
School of EECS, Oregon State University

Daniel Sheldon
University of Massachusetts Amherst and Mount Holyoke College

Abstract

This paper addresses adaptive conservation planning, where the objective is to maximize the population spread of a species by allocating limited resources over time to conserve land parcels. This problem is characterized by having highly stochastic exogenous events (population spread), a large action branching factor (number of allocation options) and state space, and the need to reason about numeric resources. Together these characteristics render most existing AI planning techniques ineffective. The main contribution of this paper is to design and evaluate an online planner for this problem based on Hindsight Optimization (HOP), a technique that has shown promise in other stochastic planning problems. Unfortunately, standard implementations of HOP scale linearly with the number of actions in a domain, which is not feasible for conservation problems such as ours. Thus, we develop a new approach for computing HOP policies based on mixed-integer programming and dual decomposition. Our experiments on synthetic and real-world scenarios show that this approach is effective and scalable compared to existing alternatives.

1 INTRODUCTION

This paper addresses adaptive conservation planning, where the goal is to maximize the population growth of a species by purchasing and reserving land parcels to best support population spread. This optimization problem is challenging as it requires reasoning about the stochastic population spread on a large network, uncertain future budgets, and a combinatorial action space (the set of possible investment combinations).

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

Generic off-the-shelf, model-based planners typically are unable to compactly encode such problems (e.g. due to numeric budgets and/or exogenous events) and scalability is also quite limited. Other simulation-based approaches such as Monte-Carlo tree search have difficulty due to the large branching factors and horizons. In this paper, we develop an online planning algorithm for the adaptive conservation problem, which requires deciding, at each decision epoch, the set of parcels to purchase in order to maximize the long-term population growth. While this work focuses on a particular population diffusion process, it is important to note that the principles are applicable to a wider class of diffusion-control problems that pose similar challenges to existing methods.

Prior work has considered simplified versions of the above problem. Sheldon et al. [2010] studies the non-adaptive, upfront planning problem, where it is assumed that the budget is known and purchases are made only at the first time step. This simplification ignores the reality that budgets often arrive over time and that delaying some allocation decisions until more information is available can be beneficial. In followup work, Xue et al. [2012] studies a middle ground between upfront and fully adaptive planning. Their algorithm schedules the purchases of an upfront solution over time in order to get a (non-adaptive) multi-stage conservation plan with maximum flexibility. An approach for two-stage adaptive conservation plans is given by Ahmadizadeh et al. [2010], but scalability to more stages poses many challenges. The only prior work on fully adaptive conservation planning is by Golovin et al. [2011]. They make the crucial assumption that the species does not spread across land parcels and then replan at each time step using a greedy, myopic approach. While approximation bounds are shown under this assumption, the myopic nature of the approach can make it short-sighted when populations can spread, such as in our motivating application involving birds. Indeed, our experiments confirm that this short-sightedness is apparent on realistic problems.

Above we have seen several non-adaptive approaches that reason about population dynamics and an adaptive approach that assumes populations do not spread. Here we fill the research gap in conservation planning with an adaptive approach for highly spreading species, taking the future population dynamics, future budgets, and future actions into account. Our approach is based on hindsight optimization (HOP), an online planning approach that has been successfully applied to a variety of difficult stochastic planning problems, e.g. Chang et al. [2011], Chong et al. [2000], Wu et al. [2002], Yoon et al. [2010], Hubbe et al. [2012]. HOP reasons about possible stochastic futures to compute an upper bound on action values and selects the action that has the maximum upper bound. Unfortunately, standard implementations of HOP scale linearly in the number of actions available at any time, which is prohibitive for our exponential action space. Thus, our main contribution is to develop an efficient algorithm for computing HOP policies for such exponentially large, factored action spaces. We accomplish this by representing the HOP policy via a large Mixed Integer Program (MIP) and then applying the Dual Decomposition schema to make its solution more practical. Our experiments show that HOP can significantly outperform more myopic alternatives while also showing scalability to large problems.

2 Problem Setup

Population Model. Typically, a conservation problem is associated with a land map that is divided into many habitat patches, each of which at any time can either be occupied by the species or not. For management convenience, multiple nearby patches are grouped into land parcels, and each parcel p is assigned a cost c(p) that can be seen as the expense of conserving all the patches in p. We assume that only patches in purchased parcels can be occupied, as purchased parcels are conserved to be accessible and suitable for the species.

The population dynamics are such that at any time step, any patch $v$ becomes occupied via population spread from patch $u$ with probability $p_{uv}$ (here $1 - p_{vv}$ is the extinction probability for patch $v$). We use a standard spread model in which the spread probability $p_{uv}$ decreases with the distance between $u$ and $v$. In addition, there is a stochastic budget process, where new funds arrive each year according to some distribution. Parcels can only be purchased when enough funds are available and hence must be purchased incrementally. Since the population can only spread to patches in conserved parcels, the population diffusion is strongly influenced by parcel purchases.

Figure 1: Visualization of an example future. Suppose there are three patches in total: u, v, and w; each patch is repeated at each time step. An edge indicates colonization from one patch to another at a particular time step.

Adaptive Planning Problem. We consider a finite-horizon setting, where the goal is to optimize the total population across patches at a specified horizon $H$. Decision epochs occur every $T_r$ years, and at each epoch a decision is made about the set of parcels to purchase, limited by the current budget. Thus, our planning problem is to produce a policy $\pi$ that is given the current problem state at a decision epoch and outputs a set of parcels to purchase with cost no more than the current budget $B(t)$. Here the problem state is composed of the time-to-horizon, the species occupancy at each patch, the current budget, and knowledge of previous purchases. We follow a model-based approach, where $\pi$ can be computed using provided stochastic models of the population dynamics and budget.

It will be useful to define the notion of a future $F$, which is a random variable encoding the random outcomes that dictate the future population spread and budget over the horizon $H$. In our setting, a realization $f$ of $F$ deterministically dictates whether a population can spread between two patches at any particular time, as well as the budget in each future year. Such a future can be visualized as a deterministic network as in Figure 1. In the network, all the patches are repeated at every time step, and an edge between patch $u$ at time $t$ and patch $v$ at time $t+1$ indicates that if the species is at $u$ at time $t$ then it will spread to $v$ at time $t+1$.
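
To make the notion of a sampled future concrete, the sketch below draws one realization of the spread edges and yearly budgets. It is only an illustrative sketch: the distance-decay form of $p_{uv}$, the exponential budget distribution, and all parameter names are assumptions, not the calibrated model used in the paper.

```python
import math
import random

def sample_future(patches, coords, horizon, base_prob=0.3, decay=1.0,
                  budget_mean=2.0, rng=random):
    """Sample one realization f of the random future F.

    Returns (spread_edges, budgets): spread_edges[t] lists the directed pairs
    (u, v) along which the population would spread from time t to t+1 if u is
    occupied and v is conserved; budgets[t] is the funding arriving in year t.
    The spread probability p_uv decays with inter-patch distance (an assumed
    functional form), and p_vv acts as the persistence probability of patch v.
    """
    spread_edges, budgets = [], []
    for t in range(horizon):
        edges = []
        for u in patches:
            for v in patches:
                p_uv = base_prob * math.exp(-decay * math.dist(coords[u], coords[v]))
                if rng.random() < p_uv:        # this colonization event happens in f
                    edges.append((u, v))
        spread_edges.append(edges)
        budgets.append(rng.expovariate(1.0 / budget_mean))  # assumed budget draw
    return spread_edges, budgets
```

A visualization like Figure 1 corresponds to drawing these sampled edges between the copies of each patch at consecutive time steps.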

For a fixed realization $f$, we effectively get a deterministic problem against which any policy can be evaluated. We say a policy $\pi$ is feasible for a future $f$ iff at every time step the total cost of parcel purchases made by $\pi$ is within the budget limit of $f$. For a future $f$, a feasible policy $\pi$, and initial state $s_0$, we denote the total population/reward achieved at the horizon $H$ by $R(s_0, f, \pi)$. The optimal solution to our problem can then be expressed as finding a policy $\pi^*$ that maximizes the expected reward, i.e.

$$\pi^* = \arg\max_\pi \mathbb{E}[R(s_0, F, \pi)].$$

An exact solution for $\pi^*$ is beyond the capabilities of existing planners. To address this, prior work by Golovin et al. [2011] used a simple policy based on myopic, greedy action selection. The main contribution was to give approximation bounds for the policy under the strong assumption of no diffusion across parcels. Here, in contrast, we develop a non-myopic action selection heuristic, hindsight optimization, which has shown success in various other areas of planning. While hindsight optimization also requires strong assumptions to provide performance guarantees, our experiments demonstrate that its non-myopic nature can lead to significant improvements over Golovin et al. [2011].

3 Hindsight Optimization for Conservation Planning

Hindsight Optimization. The main idea behind hindsight optimization (HOP) is to drive action selection by computing upper bounds on the values of states. Given a state $s$, the hindsight value of the state $V_{hs}(s)$ is an optimistic estimate of the true value obtained by interchanging expectation and maximization, i.e. $V_{hs}(s) = \mathbb{E}[\max_\pi R(s, F, \pi)]$. This clearly gives an upper bound on the value since it allows for inconsistent policies to optimize each future.

The key computational advantage of working with $V_{hs}$ rather than directly with the value function $V$ is that it can be approximated by solving a set of deterministic planning problems for individual futures. That is, given a set of sampled futures $\{f_1, f_2, \ldots, f_N\}$ we estimate the hindsight value as:

$$\hat{V}_{hs}(s) = \sum_k \max_\pi R(s, f_k, \pi)$$

where the internal maximization problem for each $f_k$ can often be solved with existing solvers. The hindsight Q-function $Q_{hs}(s, a)$ is accordingly defined as the expected hindsight value achieved for states reached by taking action $a$ in state $s$, and is also an upper bound on the true Q-value of a state-action pair. The HOP policy for a state $s$ is then defined as $\arg\max_a Q_{hs}(s, a)$. Under certain assumptions, performance guarantees can be made for the HOP policy (Mercier and Hentenryck [2007], Yoon et al. [2008]). However, in general, no guarantees can be made, and examples of arbitrarily poor performance compared to optimal can be constructed. Fortunately, such examples are often not reflective of real problems, and the HOP policy is often an effective way to select actions non-myopically.

The traditional way to compute the HOP policy is to estimate $Q_{hs}(s, a)$ by sampling a set of states resulting from taking $a$ in $s$, computing the hindsight value for each state, and averaging. Unfortunately, this traditional approach scales linearly with the size of the action space and hence is not feasible for the combinatorial action space in our conservation problem and many others with factored actions.
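
As a point of reference, here is a minimal sketch of that traditional per-action computation; `sample_future` and `solve_deterministic` are hypothetical helpers standing in for the simulator and a deterministic solver, and the explicit loop over `actions` is exactly the linear scan that becomes infeasible for factored action spaces.

```python
def hop_action_traditional(state, actions, sample_future, solve_deterministic,
                           n_futures=30):
    """Naive HOP action selection: estimate Q_hs(s, a) separately for each action.

    `sample_future(state)` draws one realization of the spread/budget randomness;
    `solve_deterministic(state, action, future)` returns the best reward achievable
    for that fixed future after taking `action` first.  The cost is
    O(|actions| * n_futures) solver calls.
    """
    futures = [sample_future(state) for _ in range(n_futures)]
    best_action, best_q = None, float("-inf")
    for a in actions:                               # linear in the number of actions
        q_hs = sum(solve_deterministic(state, a, f) for f in futures) / n_futures
        if q_hs > best_q:
            best_action, best_q = a, q_hs
    return best_action
```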

HOP for Large Action Spaces. The traditional computation of the HOP policy estimates Q-values for each action for the purpose of maximizing over them. However, this is not strictly necessary if we can directly compute the optimizing HOP action at a state without explicitly estimating each Q-value. Thus, the main idea behind our approach is to encode the problem of computing the optimizing HOP action as a mixed integer program (MIP) and then apply decomposition techniques to solve it, which avoids the explicit enumeration over actions.

Our MIP for HOP is defined relative to a set of sampled futures $\{f_1, f_2, \ldots, f_N\}$ and a current state $s$. Similar to the formulation of Sheldon et al. [2010], for each future $f_k$ we can define a MIP, denoted MIP$_k$, which encodes the problem of finding a policy that maximizes the reward of $f_k$. That is, MIP$_k$ solves the problem $\max_{\pi_k} R(s, f_k, \pi_k)$, where $\pi_k$ is viewed as a binary vector that specifies for each parcel and time whether to buy the parcel at that time in future $f_k$. We will denote by $\pi^1_k$ the parcel purchases specified for the first time step in MIP$_k$. We also let $\mathrm{OBJ}_k$ and $\mathrm{CON}_k$ denote the objective and constraints of MIP$_k$ respectively, and let $V_k$ be all variables in MIP$_k$ excluding $\pi_k$. Due to space constraints we do not provide the full details of MIP$_k$, which is similar to that in Sheldon et al. [2010] and not crucial to our main contribution. Since we have separate decision variables for each future, the maximization and summation in the Q-value function can be interchanged, which results in a MIP that encodes the HOP policy:

HOP Policy MIP:

$$\min_{\{\pi_k, V_k\}} \; -\frac{1}{N}\sum_k \mathrm{OBJ}_k \quad \text{s.t.} \quad \pi^1_1 = \pi^1_2 = \cdots = \pi^1_N \ \text{ and } \ \bigcup_k \mathrm{CON}_k$$

That is, we can compute the HOP policy by maximizing the sum (minimizing the negative sum) of objectives across the futures, subject to the constraint that the solutions for each future agree on the first action. This MIP can, in concept, be given to any MIP solver, and the returned value of $\pi^1_1$ can be returned as the HOP action.
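
The toy sketch below illustrates the shape of this coupled MIP using the open-source PuLP modeling library (not the paper's CPLEX formulation): each future gets its own purchase variables and constraints, the objective averages the per-future rewards, and an agreement constraint ties the first-step decisions together. The simple linear reward standing in for $\mathrm{OBJ}_k$ and all data are illustrative assumptions, since the real objective encodes population spread.

```python
import pulp

# Illustrative data: 3 parcels, 2 time steps, 2 sampled futures.
costs = {"A": 1.0, "B": 2.0, "C": 1.5}
budget_per_step = 2.0
# Hypothetical per-future value of conserving each parcel (stand-in for OBJ_k).
value = [{"A": 3.0, "B": 1.0, "C": 2.0}, {"A": 1.0, "B": 4.0, "C": 2.0}]
parcels, steps, N = list(costs), [0, 1], len(value)

prob = pulp.LpProblem("hop_policy_mip", pulp.LpMaximize)
buy = pulp.LpVariable.dicts("buy", (range(N), parcels, steps), cat="Binary")

# Objective: average over futures of each future's own reward; earlier purchases
# are weighted higher as a crude proxy for leaving more time for spread.
prob += pulp.lpSum(value[k][l] * (len(steps) - t) * buy[k][l][t]
                   for k in range(N) for l in parcels for t in steps) * (1.0 / N)

for k in range(N):
    for t in steps:  # simplified CON_k: per-step budget limit in every future
        prob += pulp.lpSum(costs[l] * buy[k][l][t] for l in parcels) <= budget_per_step
    for l in parcels:  # a parcel is bought at most once within a future
        prob += pulp.lpSum(buy[k][l][t] for t in steps) <= 1

# Coupling constraints: all futures must agree on the first-step purchases pi^1_k.
for k in range(1, N):
    for l in parcels:
        prob += buy[k][l][0] == buy[0][l][0]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("HOP first-step purchases:", [l for l in parcels if buy[0][l][0].value() > 0.5])
```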

While in our experience it is generally feasible to use existing MIP solvers to solve the individual MIP$_k$ problems for realistic scenarios, when the number of futures increases, solving the combined MIP can become prohibitive. This is an important limitation since the variance of the HOP policy reduces with more futures. Fortunately, this issue can be largely overcome via the use of dual decomposition techniques, as the HOP policy MIP reveals a separable structure.

4 Dual Decomposition for HOP

In the HOP policy MIP, the individual MIP$_k$ problems are only coupled via the policy constraints on the first action. If these constraints are removed, then the combined MIP can be solved by solving each MIP$_k$ independently. This structure motivates the application of Lagrangian dual decomposition, which we formulate below; it follows a similar structure to prior work on upfront conservation planning by Kumar et al. [2012].

We start by rewriting the coupling constraint $\pi^1_1 = \pi^1_2 = \cdots = \pi^1_N$ as the set of constraints $\{\pi^1_k = d : k = 1, \ldots, N\}$, where $d$ is a new vector of binary variables that represents the HOP policy action at the first time step. We let $\pi^1_{k,l}$ and $d_l$ denote component $l$ of $\pi^1_k$ and $d$, indicating whether parcel $l$ was purchased or not at time step 1. We can now relax these coupling constraints to get the Lagrangian of the HOP MIP by introducing Lagrange multipliers $\lambda_{k,l}$ for each constraint.

$$L(\{V_k, \pi_k\}, d, \lambda) = -\frac{1}{N}\sum_{k=1}^N \mathrm{OBJ}_k + \sum_{l,k} \lambda_{k,l}\,(\pi^1_{k,l} - d_l) \qquad \text{s.t.} \ \bigcup_k \mathrm{CON}_k$$

The dual is then given by

$$
\begin{aligned}
q(\lambda) &= \min_{\{V_k,\pi_k\},\,d}\; L(\{V_k, \pi_k\}, d, \lambda) \\
&= \min_{\{V_k,\pi_k\},\,d}\; -\frac{1}{N}\sum_{k=1}^N \mathrm{OBJ}_k + \sum_{l,k} \lambda_{k,l}\,(\pi^1_{k,l} - d_l) \\
&= \min_{\{V_k,\pi_k\},\,d}\; \sum_{k=1}^N \Big( -\frac{1}{N}\mathrm{OBJ}_k + \sum_{l} \lambda_{k,l}\,\pi^1_{k,l} \Big) \;-\; \sum_l d_l \sum_k \lambda_{k,l}, \qquad \text{s.t.} \ \bigcup_k \mathrm{CON}_k
\end{aligned}
$$

Intuitively, the relaxed constraints in the dual act as a penalty for violating the consistency requirement that all policies across futures agree on the first action. Since the dual minimizes over $d$, in order to ensure that $q(\lambda) > -\infty$ we require the constraint $\sum_{k=1}^N \lambda_{k,l} = 0$ for all $l$. To simplify notation, we denote the space of Lagrange multipliers that satisfy this constraint as:

$$\Lambda = \Big\{ \{\lambda_{k,l}\} \;\Big|\; \sum_{k=1}^N \lambda_{k,l} = 0, \ \forall l \Big\}$$

Under this constraint the last term in the dual vanishes, and we finally get a dual that consists of independent subproblems for any fixed $\lambda$:

$$q(\lambda) = \min_{\{V_k,\pi_k\}}\; -\frac{1}{N}\sum_{k=1}^N \mathrm{OBJ}_k + \sum_{l,k} \lambda_{k,l}\,\pi^1_{k,l} \qquad \text{s.t.} \ \bigcup_k \mathrm{CON}_k \ \text{ and } \ \{\lambda_{k,l}\} \in \Lambda$$

One important characteristic of the dual is that $q(\lambda)$ for any feasible $\lambda$ is a lower bound on the optimal primal MIP objective value, which motivates making the bound as tight as possible by maximizing $q(\lambda)$ over $\lambda$. Since $q(\lambda)$ is not differentiable everywhere and the dual includes constraints over $\lambda$, we use a projected subgradient method for this purpose, iterating as follows:

$$\lambda^{(i+1)}_k = \big[\lambda^{(i)}_k + \alpha_{i+1}\, g_k(\lambda^{(i)}_k)\big]_\Lambda \qquad (1)$$

where $i$ is the iteration number, $g_k(\cdot)$ is a subgradient of $q(\lambda)$ with respect to $\lambda_k$, $\alpha_i$ is the step size, and $[z]_\Lambda$ is the projection of $z$ onto the constraint space $\Lambda$.

For our objective, one subgradient of $q(\lambda)$ with respect to $\lambda_k$ is given by $\bar{\pi}^1_k$ such that

$$\bar{\pi}_k = \arg\min_{V_k, \pi_k}\; -\frac{1}{N}\mathrm{OBJ}_k + \sum_l \lambda_{k,l}\,\pi^1_{k,l}$$

which can be found by solving a minimization problem involving only a single future $f_k$ and hence is much more tractable than the full MIP. Note that this minimization problem is simply the original objective of MIP$_k$ with an added term involving the current Lagrange multiplier values, which can be viewed as assigning a penalty or reward for purchasing particular parcels at the first time step. Finally, given the subgradient, the projection onto $\Lambda$ (with the Euclidean norm) is well known, requiring only that we subtract from each component of the subgradient the average component value. Letting $\bar{\pi}^{1,(i)}_k$ denote the subgradient at iteration $i$, we get the following:

$$\lambda^{(i+1)}_k = \lambda^{(i)}_k + \alpha_{i+1}\left[\bar{\pi}^{1,(i)}_k - \frac{\sum_{k'=1}^N \bar{\pi}^{1,(i)}_{k'}}{N}\right] \qquad (2)$$

This shows that the gradient steps for dual optimization can be computed by optimizing independent subproblems for each future (i.e. solving for each $\bar{\pi}_k$), which avoids solving a single MIP involving all futures. Putting everything together, the complete dual optimization algorithm is given in Algorithm 1.

Algorithm 1: Dual Decomposition Algorithm

1: Given: initial vector $\lambda \in \Lambda$
2: while convergence is not reached do
3:     Optimize subproblems independently: solve each subproblem and get $\{\bar{\pi}^{1,(i)}_k\}$
4:     Compute the average value of $\bar{\pi}^{1,(i)}_k$ over the $N$ subproblems: $\hat{d}^{(i)} = \frac{1}{N}\sum_{k'=1}^N \bar{\pi}^{1,(i)}_{k'}$
5:     Update $\lambda$: $\lambda^{(i+1)}_k = \lambda^{(i)}_k + \alpha_{i+1}\big[\bar{\pi}^{1,(i)}_k - \hat{d}^{(i)}\big]$
6: end while
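
A minimal Python rendering of Algorithm 1 is sketched below. The per-future solver `solve_subproblem(k, lam_k)` (which would solve the modified MIP$_k$ under the current multipliers and return its first-step purchase vector) is an assumed callable supplied by the caller, and a fixed step size is used in place of the adaptive rule described later.

```python
import numpy as np

def dual_decomposition(solve_subproblem, n_futures, n_parcels,
                       step_size=0.5, max_iters=50):
    """Projected subgradient optimization of the dual of the HOP policy MIP.

    `solve_subproblem(k, lam_k)` is an assumed callable returning the 0/1
    first-step purchase vector pi_bar_k (length n_parcels) for future k under
    multipliers lam_k.  Returns the final multipliers and the averaged
    first-step decision d_hat used for feasible-solution extraction.
    """
    lam = np.zeros((n_futures, n_parcels))            # lambda starts inside Lambda
    d_hat = np.zeros(n_parcels)
    for _ in range(max_iters):
        # Step 3: solve the N subproblems independently (parallelizable).
        pi_bar = np.array([solve_subproblem(k, lam[k]) for k in range(n_futures)])
        # Step 4: average first-step decision across futures.
        d_hat = pi_bar.mean(axis=0)
        if np.all(pi_bar == d_hat):                   # all futures agree: optimal HOP action
            break
        # Step 5: subgradient step; subtracting d_hat keeps lambda inside Lambda.
        lam = lam + step_size * (pi_bar - d_hat)
    return lam, d_hat
```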

At a high level, the final algorithm optimizes the dual iteratively. Each iteration involves solving multiple modified MIP$_k$ problems, which differ from the originals in that the costs of certain purchases at the first time step are modified in order to encourage the subproblems to agree on the first actions. Specifically, costs are increased for parcels that are not currently purchased by most futures and decreased for parcels that are purchased by many futures. The iteration ends when either the subproblems all agree on the first action, in which case we get the optimal HOP action, or the maximum number of iterations is reached, in which case we extract a solution as described below.

Feasible Solution Extraction. The above algorithm optimizes the dual and does not explicitly provide a primal solution, which in our case is the action of the HOP policy. After optimizing the dual it is often the case that the solutions to the independent subproblems (i.e. the $\pi^1_k$) are consistent and hence represent a feasible primal solution. In these cases, any of the $\pi^1_k$ are optimal primal solutions and we output the resulting action as the HOP action. However, in general there can be a duality gap, and we are not guaranteed that optimizing the dual will produce feasible primal solutions. Thus, as is typical in Lagrangian relaxation techniques, we must define a strategy for heuristically selecting a feasible primal solution guided by the information obtained during the optimization.

Our approach is a heuristic based on the consistency requirement. As in the algorithm, we let $\hat{d}_l$ denote the average value of $\pi^1_{k,l}$ over the different futures, which is 1 if parcel $l$ is purchased in all futures, 0 if it is not purchased in any future, and otherwise $\hat{d}_l \in (0, 1)$, indicating the fraction of futures in which parcel $l$ was purchased. To extract a HOP action $d$, where $d_l$ indicates whether to purchase parcel $l$, we first set $d_l = 0$ whenever $\hat{d}_l = 0$, since purchasing $l$ was not preferred in any future. Next, we cycle through each parcel $l$ with $\hat{d}_l = 1$ and purchase it (set $d_l = 1$). If there is remaining budget after processing all parcels with $\hat{d}_l = 1$, we sort the remaining parcels with $\hat{d}_l \in (0, 1)$ in descending order of $\hat{d}_l$. Then, for each such parcel, if purchasing it does not violate the budget constraint we set $d_l = 1$, and otherwise we set $d_l = 0$.
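
The following is a small sketch of this extraction heuristic; the dictionary of averaged first-step decisions `d_hat`, the parcel `costs`, and the `budget` are generic inputs rather than the paper's data structures.

```python
def extract_hop_action(d_hat, costs, budget):
    """Round the averaged first-step decisions d_hat into one feasible purchase set.

    Parcels bought in every future (d_hat[l] == 1) are always taken; they fit the
    budget because each future's own first-step purchases were budget-feasible.
    Leftover budget is then spent on fractional parcels in decreasing order of
    d_hat, skipping any parcel that no longer fits.  Returns a dict l -> 0/1.
    """
    action = {l: 0 for l in d_hat}
    remaining = budget
    for l, frac in d_hat.items():
        if frac == 1:
            action[l] = 1
            remaining -= costs[l]
    for l in sorted((l for l, f in d_hat.items() if 0 < f < 1),
                    key=lambda l: -d_hat[l]):
        if costs[l] <= remaining:
            action[l] = 1
            remaining -= costs[l]
    return action
```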

Step Size Control. Correctly controlling the step size $\alpha_i$ can have a large impact on efficiency, since each iteration involves solving $N$ MIP problems (one per future). We follow the same, relatively standard, step size control as Kumar et al. [2012], where the step size is computed according to the gap between the feasible primal solution quality and the dual solution quality. In particular, after extracting a feasible solution for the primal, let $\mathrm{APX}_i$ be the sum of rewards over all futures and $\mathrm{DUAL}_i$ be the dual objective value; we then set

$$\alpha_i = \frac{\mathrm{APX}_i - \mathrm{DUAL}_i}{\sum_{l,t,k}\big(\bar{\pi}^{t,(i)}_{k,l}\big)^2}$$
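
In code form, and assuming the primal value, dual value, and the current subgradient entries are already available as plain numbers and an array, this rule is simply:

```python
import numpy as np

def step_size(apx_i, dual_i, pi_bar_i):
    """Gap-based step size: (APX_i - DUAL_i) divided by the squared norm of the
    current subgradient entries pi_bar_i (an array over futures, parcels, and
    time steps).  Since the entries are 0/1, the denominator equals the total
    number of purchases made across the subproblem solutions."""
    return (apx_i - dual_i) / float(np.sum(np.square(pi_bar_i)))
```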

5 Dual Decomposition for Baselines

There are multiple alternative heuristics that can be formulated within the same dual decomposition framework described above for HOP. These heuristics differ in terms of the horizon over which they consider future population spread and whether or not they consider the possibility of selecting actions in the future when selecting actions at the current moment. Below we describe two baselines within this framework that are included in our experiments.

GreedyZero Policy. The GreedyZero policy is our most near-sighted baseline, as it selects the action that looks best assuming the population growth and available budget after the next decision epoch are zero. In particular, the futures for GreedyZero are simulated for only $T_r$ years instead of until time $H$. Correspondingly, its MIP is exactly the same as the HOP MIP except that the horizon $H$ is always replaced by $T_r$ and no purchases are allowed after the first year.

HNoop Policy. The HNoop policy is unlike GreedyZero in that it considers the population spread until the real horizon $H$. However, unlike HOP and similar to GreedyZero, it does not consider the possibility of selecting actions after the first time step. Thus, it evaluates purchasing actions according to how much long-term population spread they will facilitate assuming that the noop action is taken thereafter. The computation of HNoop is similar to our approach for HOP, except that we simply remove all "action variables" from each MIP$_k$ after the first time step, which prevents the consideration of future actions (see the sketch below). This myopic policy will often work well when the consideration of future actions is unimportant. However, when this is not the case, we might expect HOP to have an advantage. For example, in some conservation situations it is important to consider building longer-term paths for population spread in order to encourage spread to a particularly good habitat. Such paths will often not result from purely myopic reasoning.
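
In the toy PuLP setting sketched in Section 3, this modification amounts to fixing every post-first-step purchase variable to zero, as in the assumed helper below (not the authors' implementation):

```python
def make_hnoop(prob, buy, n_futures, parcels, steps):
    """Turn the coupled HOP MIP sketch into its HNoop variant by forbidding any
    purchases after the first time step, i.e. removing future actions from
    consideration while keeping the full horizon in the objective."""
    for k in range(n_futures):
        for l in parcels:
            for t in steps[1:]:
                prob += buy[k][l][t] == 0
    return prob
```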

Interestingly, the conceptual definition of the HNoop policy corresponds exactly to the myopic policy proposed in the only prior work on adaptive conservation planning, Golovin et al. [2011]. However, their computation of HNoop was carried out via a greedy algorithm that considered greedily adding parcels to the purchased set one at a time until the budget was exhausted. Given certain submodularity assumptions, that greedy algorithm came with approximation guarantees. Our framework provides an alternative approach to computing HNoop that is not purely greedy, relying instead on decomposition for efficiency.

6 Experimental Results

Our evaluation uses a real dataset of the Red-cockaded Woodpecker (RCW) recovery project from Sheldon et al. [2010], along with some hand-designed synthetic maps. The various problems differ in terms of the spatial layout of available parcels, the initial population of birds, and the set of parcels that are already reserved (i.e. free parcels). The real RCW map is from a large land region of the southeastern United States that was of interest to The Conservation Fund. The region was divided into 443 non-overlapping parcels (each with area at least 125 acres) and 2500 patches serving as potential habitat sites. Parcel costs were based on estimated land prices.

Throughout the experiments, we use a reliable MIP solver, IBM CPLEX, to directly solve MIPs when needed. Our experiments are performed on a single-core machine with a memory limit of 6GB. We do not set a time limit for the computation.

Performance of Dual Decomposition. Here we compare the use of Dual Decomposition (DD) for computing the HOP policy with directly applying CPLEX to the full HOP policy MIP, which includes all futures in a single MIP. This provides an indication of the optimality of DD, when CPLEX can solve the problems, and also of the efficiency and scalability of the approach. We use the large RCW map and run DD until either the step size $\alpha \leq 0.001$ or a maximum of 50 iterations is reached.

Table 1 shows the HOP value computed by CPLEX and DD, as well as the timing results, for time horizon H = 20. When a method fails to return a solution, no value is shown in the table. From the table, we see that the objective value of DD is very close or equal to that of CPLEX, meaning that DD provides an extremely close approximation to the HOP policy. When the number of future scenarios is small, the MIP instances seen by CPLEX are relatively small and can be solved quickly by CPLEX. As the number of futures increases, CPLEX takes a much longer time to find solutions compared to DD, or even fails to solve the problem within a reasonable amount of time and memory. To make the problem more challenging, we increase the time horizon to H = 40, where the problem size is much larger. Table 2 shows the results. We see similar results, but they are more pronounced since the problems seen by CPLEX are now significantly larger. The expensive computation and failures of CPLEX indicate that the full MIP approach is not practical for solving real-world problems.

Table 1: Solution Quality and Run Time (H=20)

N    CPLEX-OBJ   DD-OBJ    CPLEX-Time(s)   DD-Time(s)
5    -393.0      -392.4      26.8           112.0
10   -389.6      -388.0      41.9           100.3
15   -354.5      -353.2      71.2            79.0
20   -382.6      -381.4     102.5           210.8
25   -383.2      -381.2     368.7           159.8
30   -378.6      -375.0     650.8           187.1
35   -380.7      -377.9     430.1           205.6
40   -394.2      -392.8    1201.3           247.6
45               -391.3                     280.3

Table 2: Solution Quality and Run Time (H=40)

N    CPLEX-OBJ   DD-OBJ    CPLEX-Time(s)   DD-Time(s)
5    -502.4      -502.4      61.8            94.4
10   -480.8      -480.4     196.9           181.4
15   -505.7      -501.7    1039.7           294.2
20   -504.6      -504.6    1417.2           358.2
25   -500.8      -499.7    4091.1           487.0
30               -502.3                     568.5

It is important to note that DD can be easily parallelized across futures to achieve significant speedup as the number of processors grows. In particular, if we use one processor per future, the wall-clock time per iteration would be equal to the maximum time required to solve an individual future. If those times are nearly uniform, this would result in nearly linear speedup.

Adaptive Planning Results on Synthetic Maps. We now evaluate the quality of the HOP policy and compare it with the GreedyZero and HNoop policies. The results reported for each algorithm are averaged over 10 runs to account for randomness in the environment and the sampled futures. We note that we have attempted to apply other planning formalisms to this problem, such as Monte-Carlo Tree Search, but without success due to the extreme action and stochastic branching factors of our problems.

We created two simple grid maps (Figure 2) to illustrate the advantage of the non-myopic HOP policy over the more myopic baselines. In the maps, each grid cell represents one parcel and the patches are marked by their indices. The cost of each parcel is 1 and the annual budget is also 1; in other words, only one parcel can be purchased each year. In the first grid map, most parcels contain only one patch, but there are many parcels with two patches to the south of the initial population. Generally speaking, there are two possible directions of purchasing: either to the Northwest or to the Southeast. Presumably, purchasing toward the Northwest would lead the population to the most promising free area as long as the time horizon is large enough for the population to spread there, while the Southeast provides more immediate benefit, as each year two patches would become available instead of only one. It is obvious that the optimal policy would follow the first strategy in order to maximize the long-term reward. The results shown in Table 3 illustrate that HOP recognizes the potential benefit of the free parcels and purchases parcels in the correct direction, while GreedyZero and HNoop are easily distracted toward the Southeast purchases due to their myopia.

Figure 2: (Left) Grid Map 1. (Right) Grid Map 2. For both maps, the numbers mark the patch indices and each grid cell represents one parcel. The initially occupied patches are numbered in red and the free parcels are shaded in green.

Grid map 2 is similar to grid map 1, but purchasing is not limited to only two directions, so a planner has more choices of purchase, which also means more distractions from the far-away free area. Again, HOP recognizes the potential benefit of the free parcels and purchases parcels toward that direction, while GreedyZero and HNoop expand the reserve around the initial population uniformly. This leads HOP to achieve a higher reward, as shown in Table 3.

Table 3: Rewards on Different Maps

Map           HOP      HNoop   GreedyZero
Grid map 1     85.2     35.0     70.2
Grid map 2     32       25.1     23.2
Real map      248.75   220.6    198.8

Adaptive Planning Results on Real Map. The real map (Figure 3) shows, via red + marks, where the initial bird population is; free parcels are shaded in dark gray. Parcels shaded in pink are expensive yet affordable ones. The right map gives the natural population spread that would result if all parcels were conserved. We see that the free area in the Northeast corner is promising for optimal reward; therefore the optimal solution prefers to build a path from the initial population to it, as long as there is enough budget and time for diffusion. However, many parcels on such a path are comparatively expensive, adding more distractions for myopic decision makers.

We present the reward data for the three policies in Table 3, showing that HOP gains more reward than the others. To further examine their strategies, we plot the purchased parcels and corresponding population spread of the HOP and HNoop policies in Figures 4 and 5. While not shown, the GreedyZero policy gradually purchases parcels around the initial population. Compared to GreedyZero, HNoop finds the Southwest part more beneficial for longer-term population spread, so more parcels are bought in that direction. However, HNoop fails to recognize the most promising free area in the Northeast, which requires consideration of future actions. HOP, in contrast, recognizes the Northeast area and purchases expensive parcels to build a path to the free parcels. In addition, HOP also buys parcels around the initial population when the reward gain is justified.

7 Summary and Future Work

We presented an action selection approach for adaptive conservation planning, where it is necessary to dynamically reason about an extremely large set of resource management decisions and how they will impact the stochastic population spread. Our algorithm, based on hindsight optimization (HOP), is the first non-myopic approach for this problem and was shown to be effective compared to natural baselines. The main technical contribution was to show how to compute HOP policies for huge factored action spaces, for which prior HOP algorithms were inapplicable.

In future work, we plan to consider more broadly defined problem classes that involve the adaptive control of stochastic diffusion processes and other types of stochastic exogenous events. Such problems expose serious limitations of most AI planning techniques, making them an interesting sub-class from an AI research perspective. It is also of interest to understand additional conditions under which HOP policies can provide theoretical guarantees, and also tractable modifications to HOP that might support such guarantees.

Figure 3: (Left) Initial state of the real map. (Right) The population distribution after 20 years on the real map.

Figure 4: Purchases and population spread of HNoop at t = 5, 10, 15, 20. Parcels shaded in green are purchased with a probability of ≥ 0.5. Yellow is used for parcels with purchase probability of < 0.5. Patches are colored by the probability of being occupied (lighter color indicates lower probability).

Figure 5: Purchases and population spread of the HOP policy at t = 5, 10, 15, 20. The color setting is the same as in Figure 4.

References

K. Ahmadizadeh, B. Dilkina, C. P. Gomes, and A. Sabharwal. An empirical study of optimization for maximizing diffusion in networks. In CP '10: 16th International Conference on Principles and Practice of Constraint Programming, 2010.

R. Bent and P. V. Hentenryck. Regret only! Online stochastic optimization under time constraints. In AAAI '04: Proceedings of the Nineteenth Conference on Artificial Intelligence, 2004.

D. P. Bertsekas. Nonlinear Programming, 2nd edition. Athena Scientific, 1999.

H. S. Chang, R. L. Givan, and E. K. P. Chong. On-line scheduling via sampling. In AIPS '00: Artificial Intelligence Planning and Scheduling, 2011.

E. K. P. Chong, R. L. Givan, and H. S. Chang. A framework for simulation-based network control via hindsight optimization. In IEEE CDC '00: IEEE Conference on Decision and Control, 2000.

D. Golovin, A. Krause, B. Gardner, S. J. Converse, and S. Morey. Dynamic resource allocation in conservation planning. In AAAI '11: Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, 2011.

A. Hubbe, W. Ruml, S. Yoon, J. Benton, and M. B. Do. On-line anticipatory planning. In ICAPS '12: International Conference on Automated Planning and Scheduling, 2012.

A. Kumar, X. Wu, and S. Zilberstein. Lagrangian relaxation techniques for scalable spatial conservation planning. In AAAI '12: Proceedings of the Twenty-Sixth Conference on Artificial Intelligence, 2012.

L. Mercier and P. V. Hentenryck. Performance analysis of online anticipatory algorithms for large multistage stochastic integer programs. In IJCAI '07: International Joint Conference on Artificial Intelligence, 2007.

D. Sheldon, B. Dilkina, A. Elmachtoub, R. Finseth, A. Sabharwal, J. Conrad, C. Gomes, D. Shmoys, W. Allen, O. Amundsen, and B. Vaughan. Maximizing the spread of cascades using network design. In UAI '10: Uncertainty in Artificial Intelligence, 2010.

G. Wu, E. Chong, and R. Givan. Burst-level congestion control using hindsight optimization. IEEE Transactions on Automatic Control, 47:979–991, 2002.

S. Xue, A. Fern, and D. Sheldon. Scheduling conservation designs via network cascade optimization. In AAAI '12: Proceedings of the Twenty-Sixth Conference on Artificial Intelligence, 2012.

S. Yoon, A. Fern, R. L. Givan, and S. Kambhampati. Probabilistic planning via determinization in hindsight. In AAAI '08: Proceedings of the Twenty-Third Conference on Artificial Intelligence, 2008.

S. Yoon, W. Ruml, J. Benton, and M. B. Do. Improving determinization in hindsight for online probabilistic planning. In ICAPS '10: International Conference on Automated Planning and Scheduling, 2010.