Applying reinforcement learning to single and multi-agent economic problems
Applying reinforcement learning to economics
Neal Hughes
Australian National University
November 17, 2014
Neal Hughes (ANU) Applying reinforcement learning to economics November 17, 2014 1 / 23
Machine learning

Machine learning
- algorithms that 'learn' from data, i.e., build models from data with minimal theory / human involvement
- goes hand in hand with 'Big Data'

Supervised learning
- estimating functions mapping 'input' variables X to 'target' variables Y
- aka non-parametric regression

Reinforcement learning
- learning to make optimal (reward-maximising) decisions in dynamic environments: learning optimal policy functions for Markov Decision Processes (MDPs)
- aka approximate dynamic programming
Reinforcement learning

Figure: The agent-environment loop. The agent observes state s_t and chooses action a_t; the environment returns reward r_t and the next state s_{t+1}.
A (single agent) water storage problem

Figure: River system schematic. Inflow I_{t+1} enters storage S_t; water passes the release point (flow F1_t, node 1) to the extraction point (flow F2_t, node 2), where the demand node extracts E_t and contributes return flow R_t, with flow F3_t at the end of the system (node 3).
A (single agent) water storage problem

\[
\max_{\{W_t\}_{t=0}^{\infty}} \; E\left\{ \sum_{t=0}^{\infty} \beta^t \Pi(Q_t, I_t) \right\}
\]

Subject to:

\[
S_{t+1} = \min\left\{ S_t - W_t - \delta_0 \alpha S_t^{2/3} + I_{t+1},\; K \right\}
\]
\[
0 \le W_t \le S_t
\]
\[
Q_t \le \max\left\{ (1 - \delta_{1b}) W_t - \delta_{1a},\; 0 \right\}
\]
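The storage constraint and the delivery-loss constraint can be sketched directly in Python. The parameter values below are illustrative placeholders only; the slides do not state them:

```python
# Hypothetical parameter values, for illustration only.
K = 1000.0                     # storage capacity (GL)
delta0, alpha = 0.1, 1.0       # evaporation-loss parameters
delta1a, delta1b = 5.0, 0.05   # fixed and proportional delivery losses

def storage_transition(S, W, I_next):
    """Next-period storage: release and evaporation deducted, inflow added, capped at K."""
    evaporation = delta0 * alpha * S ** (2.0 / 3.0)
    return min(S - W - evaporation + I_next, K)

def water_delivered(W):
    """Upper bound on water Q_t reaching users, after fixed and proportional losses."""
    return max((1 - delta1b) * W - delta1a, 0.0)
```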
Why reinforcement learning?

Figure: Scatter plot of inflow (GL) against storage (GL).
The Q function

The standard Bellman equation with state value function V(s):

\[
V^*(s) = \max_a \left\{ R(s, a) + \beta \int_S T(s, a, s') \, V^*(s') \, ds' \right\}
\]

The Bellman equation with action-value function Q(a, s):

\[
Q^*(a, s) = R(s, a) + \beta \int_S T(s, a, s') \max_{a'} Q^*(a', s') \, ds'
\]
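A standard reason for preferring the action-value form: the optimal policy can be read off directly from Q*, without evaluating the reward and transition functions at decision time:

```latex
\pi^*(s) = \arg\max_{a} Q^*(a, s)
```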
Fitted Q Iteration

Algorithm 1: Fitted Q Iteration
1. Initialise s_0
2. Run a simulation with exploration for T periods
3. Store the samples {a_t, s_t, s_{t+1}, r_t}, t = 0, ..., T
4. Initialise Q(a_t, s_t)
5. Repeat until a stopping rule is satisfied:
6.   For t = 0 to T: set Q̂_t = r_t + β max_a Q(a, s_{t+1})
7.   Estimate Q by regressing Q̂_t against (a_t, s_t)

With large dense data, computing max_a Q(a, .) for each point is wasteful. Alternative: max over a sample of points and fit a value function (fitted Q-V iteration).
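The algorithm above can be sketched in Python. The toy storage dynamics and the per-cell-average 'regression' below are illustrative stand-ins, not the model or the approximator from the slides:

```python
import random

random.seed(0)
beta = 0.9
states = range(5)     # discretised storage levels (toy example)
actions = range(3)    # discretised releases

def step(s, a):
    """Toy dynamics: releasing w units yields reward w; inflow is random."""
    w = min(a, s)                                   # cannot release more than stored
    s_next = min(s - w + random.choice([0, 1, 2]), 4)
    return s_next, float(w)

# Steps 1-3: simulate with (purely random) exploration and store the samples
samples, s = [], 4
for _ in range(2000):
    a = random.choice(list(actions))
    s_next, r = step(s, a)
    samples.append((a, s, s_next, r))
    s = s_next

# Steps 4-7: initialise Q, then iterate until (approximate) convergence
Q = {(a, s): 0.0 for a in actions for s in states}
for _ in range(100):
    targets = {}
    for a, s, s_next, r in samples:
        q_hat = r + beta * max(Q.get((b, s_next), 0.0) for b in actions)
        targets.setdefault((a, s), []).append(q_hat)
    # the 'regression' step: here just a per-cell average of the targets
    Q = {k: sum(v) / len(v) for k, v in targets.items()}
```

A greedy policy is then recovered as argmax_a Q(a, s) for each state.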
Single agent reinforcement learning

Figure: An approximately equidistant grid in two dimensions. (a) 10000 iid standard normal points; (b) 100 points at least 0.4 apart.
Tilecoding

Figure: Two offset tiling layers over the input space; an input point X_t activates one tile in each layer.
Single fine grid

Figure: Approximation of a function on [0, 1] with a single fine grid.
Single chunky grid

Figure: Approximation of the same function with a single coarse ('chunky') grid.
Tilecoding: many chunky grids

Figure: Approximation of the same function by combining many offset coarse grids.
Tilecoding

Fitting
- Averaging
- Averages or stochastic gradient descent

Setup
- Regular grids
- 'Optimal' displacement vectors
- Linear extrapolation

Implementation
- Cython with OpenMP
- Perfect 'hashing'
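A minimal one-dimensional tile coder in Python, supporting both prediction (averaging over layers) and stochastic gradient updates. The layer count and evenly spaced offsets below are illustrative defaults, not the 'optimal' displacement vectors from the slide:

```python
class TileCoder:
    """Several offset regular grids over [0, 1); prediction averages the
    weights of the activated tile in each layer."""

    def __init__(self, n_layers=4, n_tiles=10):
        self.n_layers = n_layers
        self.n_tiles = n_tiles
        # each layer is shifted by a fraction of one tile width
        self.offsets = [i / (n_layers * n_tiles) for i in range(n_layers)]
        self.weights = [[0.0] * (n_tiles + 1) for _ in range(n_layers)]

    def tiles(self, x):
        """Index of the activated tile in each layer for x in [0, 1)."""
        return [int((x + off) * self.n_tiles) for off in self.offsets]

    def predict(self, x):
        active = self.tiles(x)
        return sum(self.weights[l][t] for l, t in enumerate(active)) / self.n_layers

    def update(self, x, target, lr=0.1):
        """Stochastic gradient step toward target (squared-error loss)."""
        error = target - self.predict(x)
        for l, t in enumerate(self.tiles(x)):
            self.weights[l][t] += lr * error
```

In practice each tile index would be mapped to a flat weight array via a perfect hash, and updates parallelised (e.g. with OpenMP in Cython), as on the slide.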
A test case

Figure: Social welfare as a percentage of SDP, by number of samples, for SDP, TC-A and TC-ASGD.
A test case

Table: Computation time by number of samples

Samples:   5000   10000   20000   50000   80000
SDP         6.6     7.2     7.5     7.4     7.4
TC-A        0.4     0.4     0.5     0.6     0.8
TC-ASGD     0.4     0.6     0.9     1.3     1.9
Multi agent problems

Nash equilibrium concepts for stochastic games (Economics)
- Markov Perfect Equilibrium
- Oblivious Equilibrium

Learning in games (Economics)
- Fictitious play
- Partial best response dynamic

Multi-agent learning (Computer Science / Economics)
- each agent follows a single-agent RL method
- or we combine RL with game theory / equilibrium concepts
Multi-agent fitted Q-V iteration

Each agent follows a fitted Q-V iteration algorithm, except:
- only a sample of agents update their policies each stage (similar to partial best response)
- each new batch of samples is blended with the existing batch of samples (similar to fictitious play)
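The two modifications can be sketched as an outer simulation loop. The 'agents', their policy update and the simulation below are placeholders for the fitted Q-V machinery, purely to show the partial-update and batch-blending structure:

```python
import random

random.seed(0)

n_agents = 100
update_fraction = 0.2   # share of agents updating per stage (partial best response)
blend = 0.5             # probability an old sample is dropped when blending

policies = [0.0] * n_agents
batches = [[] for _ in range(n_agents)]

def simulate(policies):
    """Placeholder simulation: returns one new sample per agent."""
    return [random.random() for _ in policies]

for stage in range(10):
    new_samples = simulate(policies)
    for i in range(n_agents):
        # blend the new sample into the existing batch (fictitious-play flavour)
        batches[i] = [x for x in batches[i] if random.random() > blend]
        batches[i].append(new_samples[i])
    # only a random sample of agents update their policies this stage
    for i in random.sample(range(n_agents), int(update_fraction * n_agents)):
        # placeholder for the agent's fitted Q-V policy update
        policies[i] = sum(batches[i]) / len(batches[i])
```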
Conclusions

RL can be successfully applied to economic problems.

Batch methods (such as fitted Q-V iteration) are suited to our context.

Tilecoding is a great approximation method for low-dimensional problems.

Our multi-agent method provides a middle ground between macro-DP methods and agent-based / evolutionary methods.

It allows us to consider complex multi-agent problems with externalities, while still having near-optimal agents.
A (multi-agent) water storage problem

Figure: The same river system schematic as the single-agent problem: inflow I_{t+1}, storage S_t, release point F1_t, demand node extracting E_t at the extraction point F2_t, return flow R_t and end-of-system flow F3_t.
Example: capacity sharing

Total inflow: 20 ML; inflow credit: +10 ML to each user; internal spill: 10 ML.

                  Initial balance   Updated balance
User 1 volume          10 ML             30 ML
User 2 volume          50 ML             50 ML
User 1 airspace        40 ML             20 ML

User 2's account cannot absorb its full inflow credit, so 10 ML spills internally into User 1's account: User 1 ends with 10 + 10 + 10 = 30 ML, while User 2 remains at 50 ML.
A test case

Figure: Mean storage S_t (GL) by iteration, for CS, NS, OA and SWA.
A test case

Figure: Mean social welfare, sum over i of u_it ($M), by iteration, for CS, NS, OA and SWA.