ORF 544
Stochastic Optimization and Learning, Spring 2019
Warren Powell, Princeton University
http://www.castlelab.princeton.edu
© 2018 W.B. Powell
Week 11
Forward approximate dynamic programming
© 2019 Warren B. Powell
Approximate dynamic programming
Exploiting monotonicity
Exploiting monotonicity
Monotonicity in dynamic programming
» There are many problems where the value function increases (or decreases) monotonically with all dimensions of the state variable:
» Operations research
• Optimal equipment replacement – replace when a measure of deterioration exceeds some point.
• Dispatching customers in a queue – costs increase monotonically with the number of customers in the queue; dispatch when the queue exceeds some number.
[Figure: threshold policy in queue length. Do nothing (0) below the threshold; repair/dispatch (1) above it.]
Exploiting monotonicity
Monotonicity in dynamic programming (cont'd)
» Energy
• Buy when the electricity price is below one number, sell when it is above another.
• The value of energy in storage increases with the amount being held (and perhaps price, wind speed, demand, …).

[Figure: threshold policy in price. Buy (-1) below one threshold, do nothing (0) in between, sell (+1) above the other.]
Exploiting monotonicity
Monotonicity in dynamic programming (cont'd)
» Health
• Dosage of a diabetes drug increases with blood sugar.
• Dosage of statins (for reducing cholesterol) increases as the cholesterol level increases (and also increases with age and weight).
» Finance
• The value of holding cash in a mutual fund increases with the amount of redemptions, stock indices, and interest rates.
• The value of holding an option increases with price and volatility.
Exploiting monotonicity
[Figure: electricity prices over the day (1pm, 2pm, 3pm). A bid is placed at 1pm, consisting of charge and discharge prices, b^{charge}_{1pm,2pm} and b^{discharge}_{1pm,2pm}, for the interval between 2pm and 3pm.]
Exploiting monotonicity
The exact value function

Exploiting monotonicity
Approximate value function without monotonicity
Exploiting monotonicity
Maintaining monotonicity
Exploiting monotonicity
Benchmarking
» M-ADP vs. AVI (number of iterations)
Exploiting monotonicity
Observations
» Initial experiments using various machine learning approximations produced poor results (60-80 percent of optimal).
» A vanilla lookup table works poorly (within computing budgets that spanned days).
» A lookup table with monotonicity worked quite well.
» We have run this algorithm on problems with up to 7 dimensions (but that is our limit).
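The lookup table with monotonicity can be sketched as a smoothing update followed by a projection that restores monotonicity. This is a minimal one-dimensional sketch, assuming a nondecreasing value function and a fixed stepsize, not the implementation used in the experiments above:

```python
import numpy as np

def monotone_update(v_bar, s, v_hat, alpha=0.1):
    """Smooth the observation v_hat into state s, then project the lookup
    table back onto the set of nondecreasing functions."""
    v_bar = v_bar.copy()
    v_bar[s] = (1 - alpha) * v_bar[s] + alpha * v_hat
    # States above s may not fall below the new value; states below s
    # may not exceed it.
    v_bar[s:] = np.maximum(v_bar[s:], v_bar[s])
    v_bar[:s] = np.minimum(v_bar[:s], v_bar[s])
    return v_bar

v = monotone_update(np.array([0.0, 1.0, 2.0, 3.0, 4.0]), s=2, v_hat=5.0, alpha=1.0)
assert all(v[i] <= v[i + 1] for i in range(len(v) - 1))   # monotone again
```

A single observation at one state thus updates an entire region of the table, which is what accelerates learning relative to a vanilla lookup table.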
Backward ADP
Backward ADP for a clinical trial problem
» The problem is to learn the value of a new drug within a budget of patients to be tested.
» Backward MDP required 268-485 hours.
» Forward ADP exploiting monotonicity (we will cover this later) required 18-30 hours.
» Backward ADP required 20 minutes, with a solution that was within 1.2 percent of optimal.
Approximate dynamic programming
Driverless EV problem
Driverless EV optimization
Current Uber logic:
» Show the nearest 8 drivers.
» Contact the closest driver to confirm the assignment.
» If the driver does not confirm, contact the second closest driver.
Limitations:
» Ignores potential future opportunities for each driver.
Driverless EV optimization
» Assigning the closest car moves that car away from the busy downtown area and strands the other car in a low-density area.
» Assigning the car in the less dense area instead allows the closer car to handle potential demands in the denser areas.
Driverless EV optimization
The central operator should think about:
» Should it accept this trip?
» What is the best car to assign to the trip, considering:
• The type of car
• The charge level of the battery
» If it doesn't assign a car to a trip, should the car:
• Sit where it is?
• Reposition to a better location?
• Recharge the battery?
• Move to a parking facility?
Driverless EV optimization
[Figure: riders and cars.]
Driverless EV optimization
[Figure: assignment network over times t, t+1, t+2. The assignment of cars to riders evolves over time, with new riders arriving along with updates of available cars.]
Driverless EV optimization
Policies based on value function approximations
» Exact value functions are rare:
• Discrete states and actions, with a computable one-step transition matrix.
» Approximate value functions are defined by:
• Approximation architecture – linear, nonlinear separable, nonlinear
• Learning strategy – pure forward pass, two-pass

X_t^{VFA}(S_t^n) = \arg\max_{x_t} \big( C(S_t^n, x_t) + \overline{V}_t^{n-1}(S^{M,x}(S_t^n, x_t)) \big)

where C(S_t^n, x_t) is the contribution for taking decision x_t from state S_t^n, and \overline{V}_t^{n-1} is the expected value of the post-decision state.
Driverless EV optimization

[Figure sequence: an assignment network in which each arc from a car with attribute a (time, location, battery, fleet) carries the assignment contribution plus the downstream value \bar v(a') of the car's future attribute.]

» Original solution: cars a_1, …, a_4 are assigned, producing downstream values \bar v(a''_1), \bar v(a''_2), ….
» Adding a car with attribute a_1 and re-solving produces new car assignments and new future values.
» The difference between the two solutions gives \hat v(a_1): the difference in assignment costs plus the changes in future values (\bar v(a''_1) - \bar v(a''_2), \bar v(a''_2) - \bar v(a''_3), …).
Driverless EV optimization
Assignment network
» Capture the value of the downstream driver.
» Add this value to the assignment arc.

[Figure: drivers a_1, …, a_5 assigned to loads; the arc for driver a_3 under decision d leads to the future attribute a^M(a_3, d).]

a^M(a, d) = attribute vector of the driver in the future, given decision d.
Driverless EV optimization
Finding the marginal value of a driver:
» Dual variables
• Can provide unreliable estimates.
• Need to get the marginal value of drivers who are not actually there.
» Numerical derivatives.

Estimating average values:
\bar v^n(a) = (1-\alpha_{n-1})\,\bar v^{n-1}(a) + \alpha_{n-1}\,\hat v^n(a)
where a is the attribute of the car, \bar v^{n-1}(a) is the old estimate of the value of a car with attribute a, and \hat v^n(a) is the new estimate.
Driverless EV optimization
We had to use two statistical techniques to accelerate learning:
» Monotonicity with respect to time and charge level.
» Hierarchical aggregation (just as we did with the trucking example).

Driverless fleets of EVs using ADP
The value of a vehicle in the future
» The value function approximation captures charge level, as well as time and location.
» Hierarchical aggregation accelerated the learning process.
[Figure axis: marginal value.]
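Hierarchical aggregation can be sketched as keeping a smoothed estimate of \bar v(a) at each level of aggregation and returning a weighted combination. The two-level hierarchy (full attribute vs. location only) and the fixed weights below are illustrative assumptions, not the weighting used in the project:

```python
def update_estimate(stats, key, v_hat, alpha=0.1):
    """Exponentially smoothed estimate per key: v_bar <- (1-a)*v_bar + a*v_hat."""
    v_bar = stats.get(key, v_hat)          # first observation initializes
    stats[key] = (1 - alpha) * v_bar + alpha * v_hat
    return stats[key]

def hierarchical_value(stats_by_level, a, aggregators, weights):
    """Weighted combination of the estimates across aggregation levels."""
    return sum(
        w * stats_by_level[g].get(agg(a), 0.0)
        for g, (agg, w) in enumerate(zip(aggregators, weights))
    )

# Level 0: full attribute (location, charge); level 1: location only.
aggregators = [lambda a: a, lambda a: a[0]]
stats = [{}, {}]
for v_hat in [10.0, 12.0, 11.0]:                    # observed marginal values
    for g, agg in enumerate(aggregators):
        update_estimate(stats[g], agg(("downtown", "high")), v_hat)
v = hierarchical_value(stats, ("downtown", "high"), aggregators, [0.7, 0.3])
```

The aggregated levels see many more observations per key, so early estimates lean on them while the disaggregate level refines the value as data accumulates.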
Driverless EV optimization
Approximate value iteration ("on-policy learning"):
Step 1: Start with a pre-decision state S_t^n.
Step 2: Solve the deterministic optimization using an approximate value function:
\hat v_t^n = \min_{x_t} \big( C(S_t^n, x_t) + \overline V_t^{n-1}(S^{M,x}(S_t^n, x_t)) \big)
to obtain x_t^n.
Step 3: Update the value function approximation:
\overline V_{t-1}^n(S_{t-1}^{x,n}) = (1-\alpha_{n-1})\,\overline V_{t-1}^{n-1}(S_{t-1}^{x,n}) + \alpha_{n-1}\,\hat v_t^n
Step 4: Obtain a Monte Carlo sample of W_t(\omega^n) and compute the next pre-decision state:
S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1}(\omega^n))
Step 5: Return to step 1.
(Simulation, deterministic optimization, recursive statistics.)
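The five steps can be sketched on a toy single-product inventory problem. The prices, the demand distribution, and the optimistic initialization of the lookup table (which encourages the policy to explore buying) are all assumptions for illustration, not part of the EV model:

```python
import random

random.seed(1)
CAP, PRICE, COST, ALPHA = 5, 2.0, 1.0, 0.1
# Lookup-table VFA over post-decision inventory 0..CAP, initialized
# optimistically (an assumption) so the policy is willing to buy early on.
V = [i * PRICE for i in range(CAP + 1)]

R = 0               # Step 1: initial pre-decision inventory
R_post_prev = 0     # previous post-decision state
for n in range(200):
    D = random.randint(0, 3)          # Step 4: sampled exogenous demand
    sales = min(R, D)
    # Step 2: deterministic optimization using the current VFA
    v_hat, x_best = max(
        (PRICE * sales - COST * x + V[R - sales + x], x)
        for x in range(CAP - (R - sales) + 1)
    )
    # Step 3: smooth v_hat into the VFA at the previous post-decision state
    V[R_post_prev] = (1 - ALPHA) * V[R_post_prev] + ALPHA * v_hat
    # Transition to the next pre-decision state (Step 5: loop)
    R_post_prev = R - sales + x_best
    R = R_post_prev
```

Because the updates use states visited by the current policy, this is "on-policy learning": the VFA is only refined where the policy actually goes.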
Driverless EV optimization
» 22,000 zones
» 2,000 cars
» Battery capacity: 50 kWh
» 31,560 trips
Driverless EV optimization
No aggregation – the solution actually gets worse!
Profit versus number of iterations
Driverless EV optimization
Three levels of aggregation - Better
Profit versus number of iterations
Driverless EV optimization
Five levels of aggregation – old VFAs
Profit versus number of iterations
Driverless EV optimization
Five levels of aggregation – new VFAs
Profit versus number of iterations
Driverless EV optimization
Trip requests over time
» The challenge is to recharge during off-peak periods.
Trip Requests vs Time
Driverless fleets of EVs using ADP
Heuristic dispatch vs. ADP-based policies
» Effect of value function approximations on recharging

[Figure: total battery charge, trip requests, and actual trips over time, with peak periods marked. The heuristic recharging logic recharges during the peak period; recharging controlled by approximate dynamic programming avoids the peak.]

The economics of driverless fleets
We can simulate different fleet sizes and battery capacities, properly modeling recharging behavior given the battery capacity. [Figure axis: profit ($).]
Approximate dynamic programming
Exploiting convexity in an inventory problem
Exploiting convexity
Single inventory problem
» With the pre-decision state:
x_t^n = \arg\max_{x_t \in \mathcal{X}_t} \Big( C(S_t^n, x_t) + \mathbb{E}\big[\, \overline V_{t+1}^{n-1}(S^M(S_t^n, x_t, W_{t+1})) \,\big|\, S_t^n \big] \Big)
» With the post-decision state:
x_t^n = \arg\max_{x_t \in \mathcal{X}_t} \Big( C(S_t^n, x_t) + \overline V_t^{x,n-1}(S^{M,x}(S_t^n, x_t)) \Big)
Exploiting convexity
Updating strategies
» Forward pass (approximate value iteration), updating as we go (also known as TD(0)):
• Compute the value of being in a state by "bootstrapping" the downstream value function approximation:
\hat V_t^n(S_t^n) = \max_{x_t \in \mathcal{X}_t} \Big( C(S_t^n, x_t) + \overline V_t^{x,n-1}(S^{M,x}(S_t^n, x_t)) \Big)
• Now compute the marginal value using
\hat v_t^n = \frac{\hat V_t^n(S_t^n + \delta) - \hat V_t^n(S_t^n)}{\delta}
• Use this to update the piecewise linear value functions.
Exploiting convexity
Updating strategies
» Double pass:
• Perform a forward pass simulating the policy for t = 0, …, T:
x_t^n = \arg\max_{x_t \in \mathcal{X}_t} \Big( C(S_t^n, x_t) + \overline V_t^{x,n-1}(S^{M,x}(S_t^n, x_t)) \Big)
Update S_t^x = S_t + x_t (remember x_t may be negative). Store S_t^n as you proceed.
• Now perturb the state by \delta and solve again:
x_t^{n,\delta} = \arg\max_{x_t \in \mathcal{X}_t} \Big( C(S_t^{n,\delta}, x_t) + \overline V_t^{x,n-1}(S^{M,x}(S_t^{n,\delta}, x_t)) \Big)
Also store the incremental cost
\delta C_t^n = C(S_t^{n,\delta}, x_t^{n,\delta}) - C(S_t^n, x_t^n)
• Perform a backward pass, computing marginal values (and performing updates):
\hat v_t^n = \delta C_t^n + \hat v_{t+1}^n
\overline V_{t-1}^n(S_{t-1}^{x,n}) \leftarrow U^V\big(\overline V_{t-1}^{n-1}(S_{t-1}^{x,n}),\, S_{t-1}^{x,n},\, \hat v_t^n\big)
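The backward pass of the double-pass (TD(1)) update can be sketched as accumulating the stored incremental costs \delta C_t from the end of the horizon backwards; the numbers below are hypothetical stand-ins for stored perturbation costs:

```python
def backward_pass(delta_C, gamma=1.0):
    """v_hat_t = delta_C_t + gamma * v_hat_{t+1}: the derivative of the
    remaining trajectory cost with respect to the time-t perturbation."""
    v_hat = [0.0] * (len(delta_C) + 1)
    for t in reversed(range(len(delta_C))):
        v_hat[t] = delta_C[t] + gamma * v_hat[t + 1]
    return v_hat[:-1]

print(backward_pass([1.0, 0.5, 0.0, 2.0]))   # [3.5, 2.5, 2.0, 2.0]
```

Each \hat v_t therefore reflects the full downstream effect of the perturbation, which is why the double pass matters when a decision influences many future periods.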
Exploiting convexity
We update the piecewise linear value functions by computing estimates of slopes using a backward pass:
» The cost along the marginal path is the derivative of the simulation with respect to the flow perturbation.
Exploiting convexity
From pre- to post-decision state
» Get the marginal value of inventory, \hat v_t^n, at time t for the inventory level sampled during iteration n.
» Use this to update the value function around the previous post-decision state S_{t-1}^{x,n}, to obtain \overline V_{t-1}^n.
Exploiting convexity
Updating strategies
» The choice between a pure forward pass (TD(0)) and a double pass (TD(1)) is highly problem dependent. Every problem has a natural "horizon."
» We have found that in our fleet management problems, the pure forward pass works fine and is much easier to implement.
» For energy storage problems, it is essential that we use a double pass, since a decision at time t can have an impact hundreds of time periods into the future.
© 2004 Warren B. Powell

Exploiting convexity
It is important to maintain concavity:

[Figure: a piecewise linear value function \hat V_{it}^n of the resource level R_{it}, with slopes v_0, v_1, v_2, v_3 over segments u_0, u_1, u_2, u_3.]

A concave function has monotonically decreasing slopes. But updating the function with a stochastic gradient may violate this property.
Exploiting convexity

[Figure sequence: a stochastic gradient observation \hat v_{it}^n is smoothed into the slope of one segment,
\bar v_{it}^n = (1-\alpha_{n-1})\,\bar v_{it}^{n-1} + \alpha_{n-1}\,\hat v_{it}^n,
which can leave the slopes non-monotone until concavity is restored.]
Exploiting convexity
Ways of maintaining concavity (monotonicity in the slopes):
» CAVE algorithm – use the updated slope at one point to update over a range which shrinks with the iterations (the shrinking factor is a tunable parameter). Expand the range if needed to ensure concavity is maintained.
• Works well! But we could never prove convergence.
» Leveling algorithm – update at a point x.
• Force monotonicity in the slopes by increasing/decreasing slopes farther from x as needed.
• Works fine, without tunable parameters.
» SPAR algorithm – perform a nearest-point projection onto the space of monotone functions.
• Nice theoretical convergence proof – but we could never get it to work.
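The leveling idea can be sketched as: smooth the observed slope into one segment, then push neighboring slopes up or down as needed so the slope sequence stays nonincreasing. This is a simplified reading of the algorithm, not the published implementation:

```python
def level_update(slopes, i, v_hat, alpha=0.5):
    """Smooth v_hat into slope i, then restore concavity (nonincreasing
    slopes) by clipping the slopes on either side of i."""
    slopes = list(slopes)
    slopes[i] = (1 - alpha) * slopes[i] + alpha * v_hat
    for j in range(i - 1, -1, -1):          # left of i: must be >= slopes[i]
        slopes[j] = max(slopes[j], slopes[j + 1])
    for j in range(i + 1, len(slopes)):     # right of i: must be <= slopes[i]
        slopes[j] = min(slopes[j], slopes[j - 1])
    return slopes

s = level_update([5.0, 4.0, 3.0, 2.0], i=2, v_hat=6.0, alpha=1.0)
assert all(s[j] >= s[j + 1] for j in range(len(s) - 1))   # still concave
```

Because the repair only touches slopes that violate monotonicity, most updates leave the rest of the function untouched, which is why no tunable range parameter is needed.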
Exploiting convexity
Derivatives are used to estimate a piecewise linear approximation \overline V_t(R_t).

[Figure: objective function (roughly 1.2 to 1.9 million) versus iterations (0 to 1000).]
Exploiting convexity
With luck, your objective function improves.

[Figure: objective function versus iterations (0 to 100).]
Exploiting convexity
Testing on a time-dependent, deterministic problem
» Blue = optimal solution (found by solving a single linear program over the entire horizon)
» Black = ADP solution
© 2014 W. B. Powell
Exploiting convexity
Stochastic, time-dependent problems
[Figure: percent of optimal (0 to 120) on 21 benchmark storage problems.]
Exploiting convexity
Stochastic, time-dependent problems
[Figure: percent of optimal (95 to 101) on the 21 benchmark storage problems.]
Approximate dynamic programming
Resource allocation (freight cars) for a two-stage problem
Forecasts of car demands
[Figure: forecast vs. actual demand.]
Two-stage problems
Two-stage resource allocation under uncertainty
Optimization frameworks
We obtain piecewise linear recourse functions for each region.
Optimization frameworks
The function is piecewise linear on the integers. We approximate the value of cars in the future using a separable approximation.
[Figure: profits versus the number of vehicles at a location (0 to 5).]
Optimization frameworks
Using a standard network transformation, each link captures the marginal reward of an additional car.
Two-stage problems
[Figure: first-stage allocations produce regional inventories R_1^n, …, R_5^n.]
Two-stage problems
We estimate the functions by sampling from our distributions.
[Figure: inventories R_1^n, …, R_5^n meet sampled demands \hat D_1^n(\omega), \hat D_2^n(\omega), \hat D_3^n(\omega), yielding marginal values \bar v_1^n(\omega), …, \bar v_5^n(\omega).]
Two-stage problems
The time-t subproblem:
[Figure: inventories R_{1t}, R_{2t}, R_{3t} flow into the value function \overline V_{ta}(R_{1t}, R_{2t}, R_{3t}) through downstream nodes such as (i-1,t+3), (i,t+1), (i+1,t+5); gradients (\hat v_{1t}^-, \hat v_{1t}^+), (\hat v_{2t}^-, \hat v_{2t}^+), (\hat v_{3t}^-, \hat v_{3t}^+).]
Two-stage problems
Left and right gradients are found by solving flow augmenting path problems.
» The right derivative \hat v_{3t}^+ (the value of one more unit of that resource) is found via a flow augmenting path from that node to the supersink.
Two-stage problems
Left and right derivatives (\bar v_t^{k-}, \bar v_t^{k+} at the point R_{1t}^k) are used to build up a nonlinear approximation \overline V_{it}(R_{1t}) of the subproblem. Each iteration adds new segments, as well as refining old ones.
Two-stage problems
[Figure: approximating f(s) = ln(1+s) for s = 0, …, 10 (number of resources): the exact function versus the approximation after 1, 2, 5, 10, 15, and 20 iterations.]
What we can prove
» SPAR algorithm: Powell, Ruszczyński and Topaloglu, Mathematics of Operations Research (2004).
» Leveling algorithm: Topaloglu and Powell, Operations Research Letters (2003).
» But so far, our fastest convergence is with the original version, the CAVE algorithm, for which convergence has not been proven: Godfrey and Powell, Management Science (2001).
Approximate dynamic programming
Blood management problem: managing multiple resources
Blood management
Managing blood inventories over time

[Figure: timeline over weeks 0-3 (t = 0, …, 3). Each week, new donations and demands (\hat R_t, \hat D_t) arrive, giving state S_t; decision x_t produces the post-decision state S_t^x.]
[Figure: blood inventory network at time t. Supplies R_{t,(AB+,a)} and R_{t,(O-,a)} for ages a = 0, 1, 2 either satisfy a demand \hat D_{t,type} (blood types AB+, AB-, A+, A-, B+, B-, O+, O-) or are held, producing post-decision inventories R_t^x with ages incremented to a = 0, …, 3.]
[Figure: after the decision, new donations \hat R_{t+1,AB+} and \hat R_{t+1,O-} arrive, giving the next pre-decision inventory R_{t+1}.]
[Figure: the single-period problem F(R_t): allocate the inventories R_{t,(AB+,a)}, R_{t,(O-,a)} to the demands \hat D_t or hold them as R_t^x. Solve this as a linear program.]
[Figure: the dual variables \hat\pi_{t,(AB+,a)} and \hat\pi_{t,(O-,a)} of the linear program give the value of an additional unit of blood of each type and age.]
Updating the value function approximation
» Estimate the gradient of F_t(R) at R_{t,(AB+,2)}^n using the dual \hat\pi_{t,(AB+,2)}^n.
» Use it to update the value function approximation \overline V_{t-1}^{n-1} around the previous post-decision state R_{t-1}^{x,n}, obtaining \overline V_{t-1}^n(R_{t-1}^x).
Updating the value function approximation
Notes:
» We get a marginal value for each supply node.
» These are provided automatically by linear programming solvers.
» Be careful when the supply at a node is 0. While we would like the marginal value of one more unit, the marginal value produced by the linear programming code might be the slope to the left (where the "supply" would be negative).
Exploiting concavity
Derivatives are used to estimate a piecewise linear approximation \overline V_t(R_t).
Approximate value iteration
Step 1: Start with a pre-decision state S_t^n.
Step 2: Solve the deterministic optimization using an approximate value function:
\max_x \big( C(S_t^n, x) + \overline V_t^{n-1}(S^{M,x}(S_t^n, x)) \big)
to obtain x_t^n and dual variables \hat v_{ti}^n.
Step 3: Update the value function approximation:
\overline V_{t-1}^n(S_{t-1}^{x,n}) = (1-\alpha_{n-1})\,\overline V_{t-1}^{n-1}(S_{t-1}^{x,n}) + \alpha_{n-1}\,\hat v_t^n
Step 4: Obtain a Monte Carlo sample of W_t(\omega^n) and compute the next pre-decision state:
S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1}(\omega^n))
Step 5: Return to step 1.
(Simulation, deterministic optimization, recursive statistics.)
Iterative learning
[Figure: the simulation steps forward in time t, iteration by iteration.]
Approximate dynamic programming
Grid level storage
Grid level storage
Imagine 25 large storage devices spread around the PJM grid.

Grid level storage
Optimizing battery storage
Value function approximations
Monday, time :05, :10, :15, :20
[Figure: piecewise linear value functions estimated for each time period.]
Exploiting concavity
Derivatives are used to estimate a piecewise linear approximation \overline V_t(R_t).
Approximate dynamic programming
… a typical performance graph.
[Figure: objective function (roughly 1.2 to 1.9 million) versus iterations (0 to 1000).]
Grid level storage control
Without storage
Grid level storage control
With storage
Approximate dynamic programming
Locomotive application
© 2008 Warren B. Powell
The locomotive assignment problem
Locomotives Trains Yards
Atlanta
Baltimore
Jacksonville
The locomotive assignment problem
Locomotives (horsepower): 4400, 4400, 6000, 4400, 5700, 4600, 6200
» Consist-breakup costs
» Shop routing bonuses/penalties
» Leader logic
The locomotive assignment problem
[Figure: locomotives of varying horsepower coming in on the same train.]
The locomotive assignment problem
[Figure: assignment arcs from locomotives to trains, evaluated through a train reward function.]
The locomotive assignment problem
The train reward function
[Figure: train reward as a function of assigned power, with minimum and goal power levels; repositioning.]
The locomotive assignment problem
Locomotives (horsepower): 4400, 4400, 6000, 4400, 5700, 4600, 6200
A train may need 12,130 horsepower. Candidate solutions:
• 4400 + 4400 + 6000 = 14,800
• 4400 + 4400 + 4400 = 13,200
• 6000 + 6200 = 12,200
The locomotive assignment problem
[Figure: locomotives grouped into buckets. The value of locomotives in the future is captured through nonlinear approximations.]
The locomotive assignment problem
The locomotive subproblem can be solved quickly using CPLEX.
Approximate dynamic programming
Iterative learning
Approximate dynamic programming
Step 1: Start with a pre-decision state S_t^n.
Step 2: Solve the deterministic optimization using an approximate value function:
\hat v_t^n = \max_{x_t} \big( C(S_t^n, x_t) + \overline V_t^{n-1}(S^{M,x}(S_t^n, x_t)) \big)
to obtain x_t^n.
Step 3: Update the value function approximation:
\overline V_{t-1}^n(S_{t-1}^{x,n}) = (1-\alpha_{n-1})\,\overline V_{t-1}^{n-1}(S_{t-1}^{x,n}) + \alpha_{n-1}\,\hat v_t^n
Step 4: Obtain a Monte Carlo sample of W_t(\omega^n) and compute the next pre-decision state:
S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1}(\omega^n))
Step 5: Return to step 1.
(Simulation, deterministic optimization, recursive statistics.)
Laboratory testing
ADP as a percent of IP optimal
» 3-day horizon
» Five locomotive types
Laboratory testing
Train delay with uncertain transit times and yard delays
[Figure: train delay using deterministically trained VFAs versus stochastically trained VFAs.]
Laboratory testing
How do we do it?
» The stochastic model keeps more power in inventory. The challenge is knowing when and where.
[Figure: stochastic versus deterministic training over the time periods of the simulation.]
Experiments with ADP for energy storage
"Does anything work?"
Energy storage model
[Figure: wind energy E_t, the grid G with locational marginal price P_t, the load L_t, and the energy in storage R_t, linked by the flow variables x_t.]
The contribution C(S_t, x_t) prices the net flows at P_t, with a conversion efficiency \eta applied to flows through the battery.
A storage problem
Energy storage with stochastic prices, supplies and demands:
E_{t+1}^{wind} = E_t^{wind} + \hat E_{t+1}^{wind}
P_{t+1}^{grid} = P_t^{grid} + \hat P_{t+1}^{grid}
D_{t+1}^{load} = D_t^{load} + \hat D_{t+1}^{load}
R_{t+1}^{battery} = R_t^{battery} + A x_t
with state variable S_t = (E_t^{wind}, P_t^{grid}, D_t^{load}, R_t^{battery}), controllable inputs x_t, and exogenous inputs W_{t+1} = (\hat E_{t+1}^{wind}, \hat P_{t+1}^{grid}, \hat D_{t+1}^{load}).
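The system model above can be sketched as a transition function S^M; the dictionary of flow variables, the efficiency handling, and the capacity bound below are illustrative assumptions, not the course's exact A-matrix:

```python
def transition(S, x, W, eta=0.9, R_max=50.0):
    """S^M(S_t, x_t, W_{t+1}) for the storage problem: exogenous changes
    update wind, price, and load; flows through the battery update R."""
    E, P, D, R = S
    R_next = R + eta * (x["wind_batt"] + x["grid_batt"]) - x["batt_load"]
    R_next = min(max(R_next, 0.0), R_max)     # physical capacity bounds
    return (E + W[0], P + W[1], D + W[2], R_next)

S = (10.0, 30.0, 5.0, 20.0)                    # (E_wind, P_grid, D_load, R)
x = {"wind_batt": 4.0, "grid_batt": 0.0, "batt_load": 2.0}
S_next = transition(S, x, W=(-1.0, 2.0, 0.5))  # (E_hat, P_hat, D_hat)
```

Only the battery dimension is controllable; the other three dimensions evolve purely from the exogenous information, which is what makes the state four-dimensional but the decision a vector of flows.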
A storage problem
Bellman's optimality equation:
V_t(S_t) = \min_{x_t \in \mathcal{X}_t} \Big( C(S_t, x_t) + \mathbb{E}\big[\, V_{t+1}(S^M(S_t, x_t, W_{t+1})) \,\big|\, S_t \big] \Big)
» State: S_t = (E_t^{wind}, P_t^{grid}, D_t^{load}, R_t^{battery})
» Decision: x_t = (x_t^{wind,battery}, x_t^{wind,load}, x_t^{grid,battery}, x_t^{grid,load}, x_t^{battery,load})
» Exogenous information: W_{t+1} = (\hat E_{t+1}^{wind}, \hat P_{t+1}^{grid}, \hat D_{t+1}^{load})
These are the "three curses of dimensionality."
Approximate policy iteration
Step 1: Start with a pre-decision state S_t^n.
Step 2: Inner loop: do for m = 1, …, M:
Step 2a: Solve the deterministic optimization using an approximate value function:
\hat v^m = \min_x \big( C(S^m, x) + \overline V^{n-1}(S^{M,x}(S^m, x)) \big)
to obtain x^m.
Step 2b: Update the value function approximation:
\overline V^{n-1,m}(S^{x,m}) = (1-\alpha_{m-1})\,\overline V^{n-1,m-1}(S^{x,m}) + \alpha_{m-1}\,\hat v^m
Step 2c: Obtain a Monte Carlo sample of W(\omega^m) and compute the next pre-decision state:
S^{m+1} = S^M(S^m, x^m, W(\omega^m))
Step 3: Update \overline V^n(S) using \overline V^{n-1,M}(S) and return to step 1.
Approximate policy iteration
Machine learning methods (coded in R):
» SVR – support vector regression with Gaussian radial basis kernel
» LBF – weighted linear combination of polynomial basis functions
» GPR – Gaussian process regression with Gaussian RBF
» LPR – kernel smoothing with second-order local polynomial fit
» DC-R – Dirichlet clouds: local parametric regression
» TRE – regression trees with constant local fit
Approximate policy iteration
Test problem sets
» Linear Gaussian control
• P1 = linear-quadratic regulation
• The remaining problems (P2-P7) are nonquadratic
» Finite-horizon energy storage problems (Salas benchmark problems)
• 100 time-period problems
• Value functions are fitted for each time period
Linear Gaussian control
[Figure: percent of optimal (0 to 100) for SVR (support vector regression), GPR (Gaussian process regression), DC-R (local parametric regression), LPR (kernel smoothing), LBF (linear basis functions), and TRE (regression trees).]
Energy storage applications
[Figure: percent of optimal (0 to 100) for SVR (support vector regression), GPR (Gaussian process regression), LPR (kernel smoothing), and DC-R (local parametric regression).]
Approximate policy iteration
A tale of two distributions:
» The sampling distribution, which governs the likelihood that we sample a state.
» The learning distribution, which is the distribution of states we would visit under the current policy.
[Figure: value function approximations V_1(s) and V_2(s) over the state S.]
Approximate policy iteration
Using the optimal value function
» Now we use the optimal policy to fit approximate value functions and watch the stability.
[Figure: the optimal value function with a quadratic fit, and the state distribution under the optimal policy.]
Approximate policy iteration
» Policy evaluation: 500 samples (the problem only has 31 states!)
» After 50 policy improvements with the optimal-policy state distribution: divergence in the sequence of VFAs, 40-70% of optimality.
» After 50 policy improvements with a uniform distribution: stable VFAs, 90% of optimality.
[Figure: state distribution under the optimal policy; VFAs estimated after 50 and 51 policy iterations.]
Policy iteration with parametric VFAs
Observations
» Experiments using a number of machine learning approximations produced poor results (60-80 percent of optimal).
» Policy search using a simple error-correction term works surprisingly well (for this problem).
» Using low-order polynomial approximations fit using state distributions simulated under the optimal policy works poorly!
"I think you give too rosy a picture of ADP…." – Andy Barto, in comments on a paper (2009)
"Is the RL glass half full, or half empty?" – Rich Sutton, NIPS workshop (2014)
Approximate value functions can work very well, but you need structure to guide the learning process. ADP needs benchmarks and careful tuning. – WBP
Least squares approximate policy iteration
…and comparison to policy search
Least squares API vs. policy search
Assume we approximate our value function using a linear model:

$$\bar{V}(s) = \sum_{f\in F} \theta_f \phi_f(s) = \phi(s)^T\theta$$

Using the post-decision state:

$$\bar{V}^x(S^x) = \sum_{f\in F} \theta_f \phi_f(S^x) = \phi(S^x)^T\theta$$

In steady state, we might write:

$$V^x(S^x_{t-1}) = \mathbb{E}\left\{\max_x \left(C(S_t,x) + \gamma \bar{V}^x(S^x_t)\right) \,\middle|\, S^x_{t-1}\right\}$$
Least squares API vs. policy search

Least squares policy iteration (Lagoudakis and Parr)
» Bellman’s equation (infinite horizon)
» … is equivalent to (for a fixed policy)
» Rearranging gives:
» … where “X” is our explanatory variable. But we cannot compute this exactly (due to the expectation), so we sample it.

The need to sample introduces an “errors-in-variables” formulation.
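The equations referenced in the bullets above were lost in extraction. Under the Lagoudakis–Parr setup with a linear architecture $V(s)\approx\phi(s)^T\theta$, the derivation presumably runs as follows (a hedged reconstruction, not the slide’s exact notation):

```latex
\begin{align*}
\text{Bellman (fixed policy } \pi\text{):}\quad
  \phi(S_t)^T\theta &= C\big(S_t, X^\pi(S_t)\big)
    + \gamma\,\mathbb{E}\big[\phi(S_{t+1})^T\theta \,\big|\, S_t\big] \\
\text{Rearranging:}\quad
  \big(\phi(S_t) - \gamma\,\mathbb{E}[\phi(S_{t+1})\,|\,S_t]\big)^T\theta
  &= C\big(S_t, X^\pi(S_t)\big)
\end{align*}
```

So $X_t = \phi(S_t) - \gamma\,\phi(S_{t+1})$ plays the role of the explanatory variable; replacing the expectation with a single sampled $S_{t+1}$ puts noise in $X_t$ itself, which is the errors-in-variables problem and motivates the use of instrumental variables.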
Least squares API vs. policy search
For benchmark datasets, see: http://www.castlelab.princeton.edu/datasets.htm
Myopic policy
Least squares API vs. policy search
For benchmark datasets, see: http://www.castlelab.princeton.edu/datasets.htm
Myopic policy
LSAPI
Least squares API vs. policy search

Variations of LSAPI:
» Theorem (W. Scott and W.B.P.): Bellman-error minimization using instrumental variables and projected Bellman-error minimization are the same!
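A minimal numpy sketch of the instrumental-variables estimator on a tabular example (the Markov chain, costs, and features are all made up; with one-hot features the LSTD form recovers the exact fixed-policy values):

```python
import numpy as np

rng = np.random.default_rng(1)

# A small random Markov chain under a fixed policy (all data made up).
n, gamma = 4, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)                   # row-stochastic transitions
c = rng.random(n)                                   # one-period costs
v_true = np.linalg.solve(np.eye(n) - gamma * P, c)  # exact fixed-policy values

Phi = np.eye(n)  # one-hot features, so phi(s)^T theta is a tabular VFA

# Bellman-error regression with phi(S_t) as the instrument (the LSTD form):
#   Phi^T (Phi - gamma * E[Phi']) theta = Phi^T c
A = Phi.T @ (Phi - gamma * P @ Phi)
theta = np.linalg.solve(A, Phi.T @ c)

# Sampled version: each row of X uses a single sampled next state, which
# puts noise in the explanatory variable ("errors in variables").
N = 100_000
s = rng.integers(0, n, N)
u = rng.random(N)
s_next = np.minimum((u[:, None] > P.cumsum(axis=1)[s]).sum(axis=1), n - 1)
Z = Phi[s]                         # instruments
X = Phi[s] - gamma * Phi[s_next]   # noisy explanatory variables
theta_iv = np.linalg.solve(Z.T @ X, Z.T @ c[s])
print(theta, theta_iv)
```

Because the instrument $\phi(S_t)$ is uncorrelated with the sampling noise in $X$, the sampled estimator stays consistent despite the errors in the variables.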
Least squares API vs. policy search
For benchmark datasets, see: http://www.castlelab.princeton.edu/datasets.htm
Myopic policy
LSAPI
Least squares API vs. policy search
For benchmark datasets, see: http://www.castlelab.princeton.edu/datasets.htm
Myopic policy
LSAPI
LSAPI-Proj. Bellman
Least squares API vs. policy search
Finding the best policy (“policy search”)
» Assume our policy is given by

$$X^\pi(S_t|\theta) = \arg\max_x \left(C(S_t,x) + \sum_{f\in F}\theta_f\,\phi_f\big(S^x_t(S_t,x)\big)\right)$$

» We wish to maximize the function

$$\max_\theta F(\theta) = \mathbb{E}\sum_{t=0}^{T} C\big(S_t, X^\pi(S_t|\theta)\big)$$
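The objective $F(\theta)$ can only be estimated by simulation, so policy search amounts to simulating the policy for each candidate $\theta$. A minimal sketch on a made-up buy-low/sell-high storage problem (prices, capacity, and the threshold policy are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy buy-low / sell-high storage problem (all data made up for illustration).
T, capacity = 1_000, 1.0
prices = 30 + 10 * np.sin(np.arange(T) / 5.0) + rng.normal(0, 2, T)

def F(theta, prices):
    """Simulate the threshold policy X(S_t | theta) and return total profit."""
    buy, sell = theta
    R, profit = 0.0, 0.0
    for p in prices:
        if p <= buy and R < capacity:        # charge when the price is low
            R, profit = capacity, profit - p * (capacity - R)
        elif p >= sell and R > 0:            # discharge when the price is high
            profit, R = profit + p * R, 0.0
    return profit

# Policy search: optimize the simulated F(theta) over a grid of thresholds.
grid = np.linspace(20, 40, 21)
best_profit, best_theta = max(
    (F((b, s), prices), (b, s)) for b in grid for s in grid if b < s
)
print(best_theta, best_profit)
```

Grid search only works here because $\theta$ is two-dimensional; this is exactly the low-dimensionality limitation discussed later in this section.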
Least squares API vs. policy search

Two solution methods
» Forward approximate dynamic programming
• Use least squares approximate policy iteration (LSAPI) to solve

$$V^x(S^x_{t-1}) = \mathbb{E}\left\{\min_x \left(C(S_t,x) + \bar{V}^x_t\big(S^x_t(S_t,x)\big)\right) \,\middle|\, S^x_{t-1}\right\}$$

$$\sum_{f\in F}\theta_f\,\phi_f(S^x_{t-1}) = \mathbb{E}\left\{\min_x \left(C(S_t,x) + \sum_{f\in F}\theta^n_f\,\phi_f\big(S^x_t(S_t,x)\big)\right) \,\middle|\, S^x_{t-1}\right\}$$

» Policy search
• Find $\theta$ to optimize

$$\min_\theta F(\theta) = \mathbb{E}\sum_{t=0}^{T} C\big(S_t, X^\pi(S_t|\theta)\big)$$

where

$$X^\pi(S_t|\theta) = \arg\min_x \left(C(S_t,x) + \sum_{f\in F}\theta_f\,\phi_f\big(S^x_t(S_t,x)\big)\right)$$
Least squares API vs. policy search
For benchmark datasets, see: http://www.castlelab.princeton.edu/datasets.htm
Myopic policy
LSAPI
LSAPI-Inst. Var.
Least squares API vs. policy search
For benchmark datasets, see: http://www.castlelab.princeton.edu/datasets.htm
Policy search.
Least squares API vs. policy search
Notes:
» Using various parametric approximations for the value function did not work well (but it did work well with the trucking application!).
» Using the same functions, but fitted using policy search, worked quite well.
» We do not know how general this conclusion is. For example, when we can exploit problem structure, value function approximations fitted using Bellman’s equation can work quite well.
» It appears that the value function approximations need to be quite accurate, and not just locally.
Least squares API vs. policy search
Notes
» So why don’t we always do policy search?
• It simply doesn’t work all the time.
• It is unable to handle time-dependent policies.
• The vector 𝜃 needs to be fairly low dimensional if we do not have access to derivatives (and derivatives where the CFA term is in the objective can be tricky).
» One strategy is to use both:
• Start by fitting 𝜃 to approximate the value function. Use this to get an initial value of 𝜃.
• Then use policy search to tune it to something even better.
Approximate dynamic programming
Stochastic dual dynamic programming
ADP with Benders cuts (SDDP)
ADP with Benders decomposition
» Mario Pereira (1991) was the first to notice that we could approximate stochastic linear programs (which are convex) using Benders cuts.
» SDDP requires two approximations:
• We are limited to solving a sampled model, that is, a small set of sample paths.
• We have to assume that new information arriving at time t+1 is independent of the information at time t.
» SDDP is approximate dynamic programming using a two-pass procedure:
• We simulate forward using the current value function approximation (represented using a set of cuts).
• We then perform a backward pass to create a new cut for each time period, averaging over the set of samples.
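The cut-building logic of the two-pass procedure is easiest to see in its two-stage form (the L-shaped method), which the multistage algorithm nests over time periods. The sketch below uses a made-up newsvendor-style recourse problem whose scenario LPs solve in closed form, so the duals behind each cut are explicit; all problem data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up two-stage problem: buy x units of storage at cost c per unit,
# then pay q per unit of shortage against a sampled demand d_w.
c, q, x_max = 1.0, 3.0, 10.0
demands = rng.uniform(2.0, 8.0, 50)  # the sampled scenarios

def recourse(x):
    """Average second-stage cost and a subgradient (from the scenario duals)."""
    shortage = np.maximum(0.0, demands - x)
    duals = np.where(demands > x, q, 0.0)  # dual of the shortage constraint
    return q * shortage.mean(), -duals.mean()

cuts = [(0.0, 0.0)]  # cuts (a, b) give the model  V(x) >= max_k a_k + b_k * x
grid = np.linspace(0.0, x_max, 401)
for _ in range(60):
    # Forward pass: solve the master problem over a grid using current cuts.
    model = np.max([a + b * grid for a, b in cuts], axis=0)
    x_hat = grid[np.argmin(c * grid + model)]
    # Backward pass: average the scenario duals into one new Benders cut.
    val, slope = recourse(x_hat)
    cuts.append((val - slope * x_hat, slope))

exact = np.min([c * x + recourse(x)[0] for x in grid])  # brute-force check
approx = c * x_hat + recourse(x_hat)[0]
print(x_hat, approx, exact)
```

Each cut averages one dual per scenario, which is the step that becomes expensive as the number of sample paths grows.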
ADP with Benders cuts (SDDP)
The forward pass
Post-decision state variable
ADP with Benders cuts (SDDP)
The set of sample paths (of wind)
ADP with Benders cuts (SDDP)
We need to solve a linear program for each sample path (“scenario”), which is why the set of sample paths cannot be very large.
We then need to average over all of these linear programs.
ADP with Benders cuts (SDDP)
Regularization
» In the 1990s, Ruszczyński showed that Benders works much better if you introduce a regularization term, which penalizes deviations from the current solution.
» This powerful idea (now widely used in statistics) had not been generalized in a practical way to multiperiod (“multistage”) problems.
» In recent work, Tsvetan Asamov introduced a practical way to do regularization for multiperiod problems (with hundreds of time periods). A key feature is that the regularization term does not depend on history.
» The method is very easy to implement.
ADP with Benders cuts (SDDP)
We penalize deviations from the previous resource state. This is not indexed by the sample path.
We simply retain the previous resource state.
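One plausible way to write the regularized forward-pass subproblem described above (the exact form used in Asamov’s work may differ; here $\rho$ is a regularization weight, and the notation is an assumption):

```latex
x_t^n = \arg\min_{x_t}\left(
  c_t^T x_t
  + \overline{V}_t^{\,n-1}\!\big(R_t^x\big)
  + \rho\,\big\| R_t^x - \bar{R}_t^{x,n-1} \big\|^2
\right)
```

where $\bar{R}_t^{x,n-1}$ is the post-decision resource state retained from the previous iteration. Because the same target is used for every sample path, the regularization term introduces no dependence on history.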
ADP with Benders cuts (SDDP)
UB w/o reg.
LB w/o reg.
UB w/ reg.
LB w/ reg.
With regularization
Without regularization
SDDP with regularization
» Regularization slightly improves the upper bound, but dramatically improves the lower bound.
» There is minimal additional computational cost to introduce regularization for multiperiod problems.
ADP with Benders cuts (SDDP)
Comparisons on a deterministic grid problem
» SPWL – separable, piecewise linear approximations
» SDDP – Benders cuts
» 25 batteries
SDDP lower bound
SDDP upper bound
SPWL
Slightly faster convergence using SPWL VFAs.
ADP with Benders cuts (SDDP)
Comparisons on a deterministic grid problem
» SPWL – separable, piecewise linear approximations
» SDDP – Benders cuts
» 50 batteries
SDDP lower bound
SDDP upper bound
SPWL
Much faster convergence using SPWL VFAs.
ADP with Benders cuts (SDDP)
Comparisons on a stochastic grid problem
» SPWL – separable, piecewise linear approximations
» SDDP – Benders cuts
» 25 batteries
SDDP Lower bound 100 scenarios
SPWL VFAs still exhibit slightly faster convergence.
SDDP Lower bound 20 scenarios
SDDP upper bounds 20 and 100 scenarios
ADP with Benders cuts (SDDP)
Comparisons on a stochastic grid problem
» SPWL – separable, piecewise linear approximations
» SDDP – Benders cuts
» 50 batteries
SDDP Lower bound 100 scenarios
SPWL VFAs still exhibit slightly faster convergence.
SDDP Lower bound 20 scenarios
SDDP upper bounds 20 and 100 scenarios
ADP with Benders cuts (SDDP)
Comparisons on a stochastic grid problem
» SPWL – separable, piecewise linear approximations
» SDDP – Benders cuts
» 100 batteries
Performance is very similar, despite high dimensionality (100 batteries)
SDDP upper bounds 20 and 100 scenarios
SPWL upper bounds 20 and 100 scenarios
Problem classes
Learning policies (“algorithms”):
» Approximate dynamic programming
» Q-learning
» SDDP (see next slide)
ADP with Benders cuts (SDDP)
Notes:
» SDDP is a form of learning policy: it is a method for learning the cuts of the value function.
» It is parameterized by the regularization term.
» So our learning policy is determined by:
• The SDDP structure.
• The parameters of the regularization term.
• The graphs above represent optimization of the learning policy.
» Then we have to test the cuts as an implementation policy, and compare to other VFA approximations.
ADP with Benders cuts (SDDP)
The forward pass
The value function approximation defines the implementation policy!
ADP with Benders cuts (SDDP)
More complex information processes
» We can easily approximate convex functions (using Benders cuts or separable, piecewise linear value functions).
» But what if we have an external information process that affects the system?
• Weather
• Economy
» In the next slide, we show how the algorithm changes when we no longer enjoy “intertemporal independence.”
» Assume that the information state is “weather,” which may be sunny, cloudy or stormy.
ADP with Benders cuts (SDDP)
Weather states
ADP with Benders cuts (SDDP)
More complex information processes (cont’d)
» In this algorithm, we are using a lookup table function for weather.
» We generate a new cut for each information state. This can work quite well if the information state is simple, but not if it has multiple dimensions.
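A minimal sketch of the bookkeeping this implies (the function names and cut values are made up): cuts are stored per (time, information-state) pair, and the VFA is evaluated using only the cuts belonging to the current weather state.

```python
# Cuts are stored per (time, information-state) pair, so each weather
# state gets its own piecewise-linear approximation of V_t(R).
cuts = {}

def add_cut(t, weather, intercept, slope):
    """Record a Benders cut (a, b) for this time and information state."""
    cuts.setdefault((t, weather), []).append((intercept, slope))

def vfa(t, weather, R):
    """Evaluate the cut approximation V_t(R) under this information state."""
    pieces = cuts.get((t, weather), [(0.0, 0.0)])
    return max(a + b * R for a, b in pieces)

for w in ("sunny", "cloudy", "stormy"):
    add_cut(5, w, 10.0, -1.0)        # illustrative cut shared across states
add_cut(5, "stormy", 25.0, -4.0)     # an extra cut only the stormy state sees

print(vfa(5, "sunny", 3.0), vfa(5, "stormy", 3.0))
```

The number of cut collections grows with the number of information states, which is why this lookup-table scheme breaks down when the information state has multiple dimensions.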