
Model-based Reinforcement Learning with Non-linear Expectation Models and Stochastic Environments

Yi Wan * 1 Muhammad Zaheer * 1 Martha White 1 Richard S. Sutton 1

Abstract

In model-based reinforcement learning (MBRL), the model of a stochastic environment provides, for each state and action, either 1) the complete distribution of possible next states, 2) a sample next state, or 3) the expectation of the next state's feature vector. The third case, that of an expectation model, is particularly appealing because the expectation is compact and deterministic; this is the case most commonly used, but often in a way that is not sound for non-linear models such as those obtained with deep learning. In this paper, we introduce the first MBRL algorithm that is sound for non-linear expectation models and stochastic environments. Key to our algorithm, based on the Dyna architecture, is that the model is never iterated to produce a trajectory, but only used to generate single expected transitions to which a Bellman backup with a linear approximate value function is applied. In our results, we also consider the extension of the Dyna architecture to partial observability. We show the effectiveness of our algorithm by comparing it with model-free methods on partially-observable navigation tasks.

1. Introduction

Model-free reinforcement learning methods have achieved impressive performance in a range of complex tasks (Mnih et al., 2015). These methods, however, require millions of interactions with the environment to attain a reasonable level of performance. With computation getting exponentially faster, interactions with the environment are becoming the primary bottleneck in the successful deployment

*Equal contribution. 1Department of Computing Science, University of Alberta, Edmonton, Canada. Correspondence to: Yi Wan <[email protected]>, Muhammad Zaheer <[email protected]>.

Accepted at the FAIM workshop “Prediction and Generative Modeling in Reinforcement Learning”, Stockholm, Sweden, 2018. Copyright 2018 by the author(s).

[Figure 1 diagram: the agent acts in the world and receives real experience, which is used in the learning phase to update the state update function u(s, a, o, r), the value function v(s), the policy function π(s), and the model m(s, a); in the planning phase, the model generates simulated experience that further updates the value and policy functions.]

Figure 1. (a) Actor-critic with Dyna-style planning uses the real experience to not only learn the state update, value, and policy functions but also the model. Concurrently, it uses the model to simulate experience and update the value and policy functions.

of reinforcement learning algorithms to real-world applications. In contrast, model-based reinforcement learning methods promise sample-efficient learning by modeling how the world works (Betts, 1998; Deisenroth and Rasmussen, 2011; Browne et al., 2012; Oh et al., 2017).

In MBRL, it is common practice to learn non-linear expectation models of the environment (Oh et al., 2015; Finn et al., 2016; Leibfried et al., 2016). When the world is stochastic, iterating a non-linear expectation model leads to an invalid trajectory. To deal with this limitation, we argue that the model should be used only to simulate one-step transitions, and that the planning process should use these simulated one-step transitions to perform planning updates based on Bellman backups (Bertsekas and Tsitsiklis, 1995). We also note that, to obtain correct targets for the Bellman backups, the value function has to be linear. To retain the expressiveness of the value function and to tackle partial observability, we propose the use of a state update function, an arbitrarily complex non-linear function of the history. In subsequent sections, we expand on each of the aforementioned issues and their solutions in detail.

The main contributions of this paper are as follows: 1) we devise the first MBRL algorithm that is sound for non-linear expectation models and stochastic environments; instead of iterating the model or using a non-linear value function, we simulate one-step transitions and use a linear value function; 2) in order to learn a complex value function and deal with partial observability, we learn to generate a flexible agent state using the state update function, which can be an arbitrarily complex non-linear function of the observations.


The proposed algorithm learns a model of the transition dynamics in the agent state space, intermixes model-free and model-based methods, and performs Bellman backups using the simulated experience in the agent state space; 3) we fuse the above two strategies with the model-free Asynchronous Advantage Actor-Critic (A3C) method with Generalized Advantage Estimation (Mnih et al., 2016; Schulman et al., 2015). We evaluate the proposed method on fully and partially observable navigation tasks. Our experiments show that the proposed method considerably improves on the model-free baseline methods in terms of the number of environmental interactions.

2. Stochastic Environments and Non-Linear Models

To model a stochastic world, one can in principle learn a distribution model, which provides a distribution over the next state; a sample model, which produces a sample of the next state; or an expectation model, which provides the expected next agent state's feature vector (in the following, we refer to the agent state's feature vector simply as the agent state). When the state space is high-dimensional, both the distribution model and the sample model are prohibitively hard to learn. In contrast, the expectation model is relatively easy to learn and scales well with the size of the state space.

If a stochastic world is modeled using an expectation model, the expected next state might not correspond to any real state. If such an expected next state is fed back into the model as input, it is an input the model has never encountered during training, possibly leading to an arbitrary output. Thus, expectation models cannot be used to simulate valid trajectories by iterating themselves. A potential solution to sidestep this inherent limitation of the expectation model is to generate only single-step transitions and use them to perform Bellman backups. In this work, we focus on simulating multiple single-step transitions and learning from them as if they were real transitions. An added benefit of using single-step transitions for planning is that they are relatively robust to model errors. When an imperfect model is used to simulate a trajectory, model errors compound over the length of the rollout, leading to unlikely trajectories with hypothetical states (Talvitie, 2017).
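As a simple illustration, consider a state-action pair whose next state is one of two feature vectors with equal probability (a toy example, not from the experiments):
\[
\Pr(S' = s_1 \mid s, a) = \Pr(S' = s_2 \mid s, a) = \tfrac{1}{2}
\quad\Longrightarrow\quad
\mathbb{E}[S' \mid s, a] = \tfrac{1}{2}(s_1 + s_2),
\]
a point halfway between two real states that may correspond to no reachable state at all. Feeding this point back into the model queries it off its training distribution, and every further iteration compounds the problem.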

In order to perform Bellman backups with single-step transitions, the target requires the expected value of the next state. We note that if the value is a non-linear function of the state, this target cannot be obtained from the expected next state, which is what the expectation model provides (Paduraru, 2007). Concretely, when the parameterization vw is not a linear function, the value of the expected next state is not equal to the expected value of the next state: vw(E[St+1 | St = s, At = a]) ≠ E[vw(St+1) | St = s, At = a]. In order to use the expectation model to obtain the correct Bellman backup target, we propose to use a linear value function.
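For a linear value function, on the other hand, the two quantities coincide by linearity of expectation: with vw(s) = w⊤s,
\[
v_{\mathbf{w}}\big(\mathbb{E}[S_{t+1} \mid S_t = s, A_t = a]\big)
= \mathbf{w}^{\top}\,\mathbb{E}[S_{t+1} \mid S_t = s, A_t = a]
= \mathbb{E}\big[\mathbf{w}^{\top} S_{t+1} \mid S_t = s, A_t = a\big]
= \mathbb{E}\big[v_{\mathbf{w}}(S_{t+1}) \mid S_t = s, A_t = a\big],
\]
so the expectation model's output is exactly what the Bellman target needs. A one-dimensional counterexample for the non-linear case: if S' is +1 or −1 with equal probability and v(s) = s², then v(E[S']) = 0 while E[v(S')] = 1.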

Linearity seems to be a severe limitation on the expressiveness of the value function. To ensure that we are still able to express complex value functions, we introduce the notion of a state update function, which can be an arbitrarily complex non-linear function of the history and is optimized while solving the task.

3. State Update Function and Partial Observability

While the state update function allows us to use a linear value function, it also presents itself as a potential solution to partial observability. A common problem for RL agents operating in complex environments is that they do not have access to the states of the underlying MDP. Instead, the agent receives observations which convey only partial information about the state. Given these observations, it is desirable to construct a compact summary of the agent's experience which can be used to make decisions effectively. We refer to this compact summary as the agent state and treat it the same way we treat the state in traditional RL methods. Once the notion of the agent state has been defined, the traditional RL definitions and methods can be used. For instance, the agent state can be used as the input to the policy and value functions.

At a given time-step t, the agent state can be estimated using the state update function u, which recursively updates the agent state: the next agent state St+1 is obtained from the previous agent state St, the action At, the next observation Ot+1, and the reward Rt+1, i.e., St+1 = u(St, At, Ot+1, Rt+1). The state update function can either be given by the designer or be learned by the agent.
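Concretely, the recursion is a running fold over the stream of interactions. The sketch below is a minimal Python rendering under our own assumptions: u is some parameterized update function (e.g., an LSTM cell), and policy, env_step, and the initial values are placeholders rather than anything specified in the paper.

def agent_state_rollout(env_step, u, policy, s0, o0, T):
    """Maintain the agent state S_t recursively for T steps.

    env_step(a) is assumed to return the next observation and reward.
    """
    s, o, r = s0, o0, 0.0
    for t in range(T):
        a = policy(s)       # act from the agent state, not the raw observation
        o, r = env_step(a)  # O_{t+1}, R_{t+1}
        s = u(s, a, o, r)   # S_{t+1} = u(S_t, A_t, O_{t+1}, R_{t+1})
    return s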

4. A Concrete Algorithm Based on the Dyna Architecture

In this section, we describe a concrete algorithm for our proposed approach. We extend the Dyna architecture (Sutton, 1991; Sutton et al., 2012; Pan et al., 2018) to non-linear models and partial observability and integrate it with an actor-critic method. We begin by describing the core components of our formulation. Subsequently, we discuss the algorithmic details, which highlight how we interleave learning from the real experience and planning with the simulated experience.

4.1. Architectural Components

In this section, we outline the architectural components: the state update function u, the expectation model m, the policy function π, and the value function vπ, parameterized by α, β, θ, and w, respectively.


5 10 1501e5

100

200

300

400

500

5 10 1501e5(a) Deterministic

Fully Observable(b) Stochastic

Fully Observable

50

100

150

200

250

300

Model-Free A3C A3CD - True Model A3CD - Learned Model

1 2 501e6(c) Deterministic

Partially Observable

3 41 2 50 3 4 1 2 501e6(d) Stochastic

Partially Observable

3 4

25

50

75

100125

150175

50

100

150

200

250

Figure 2. Learning curves on the deterministic and stochastic variants of the fully observable and the partially observable navigation tasks.x-axis represents the number of interactions with the environment, y-axis corresponds to the number of episodes completed in the last10,000 environment interactions. Learning curves are averaged from 20 runs with different random seeds.

tion π and value function vπ parameterized by α,β,θ andw respectively.

Non-linear state update function $u_\alpha: (S_{t-1}, A_{t-1}, O_t, R_t) \mapsto S_t$, which estimates the current agent state $S_t$ given the last agent state $S_{t-1}$, the last action $A_{t-1}$, the current observation $O_t$, and the current reward $R_t$. The state update function can be modeled using an LSTM (Hochreiter and Schmidhuber, 1997).

Non-linear expectation model $m_\beta: (S, A) \mapsto (\hat{S}', \hat{R})$, which takes the agent state S and action A as input and outputs the expected next agent state and the expected next reward. The expectation model can be modeled using a feedforward neural network.

Non-linear policy function $\pi_\theta(A \mid S)$, which takes the agent state S and action A as input and outputs the probability of taking action A in agent state S. The policy can be modeled using a feedforward neural network followed by a softmax function.

Linear value function $v_w: S \mapsto v_w(S) \approx v_{\pi_\theta}(S)$, which takes the agent state S as input and outputs the value estimate $v_w(S)$. We use a linear parameterization, i.e., $v_w(S) = w \cdot S$.
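To make these components concrete, the following is a minimal PyTorch sketch (our illustration; the layer sizes, the use of the LSTM cell state as additional recurrent memory, and the one-hot action encoding are assumptions rather than details specified in the paper):

```python
import torch
import torch.nn as nn

class A3CDComponents(nn.Module):
    """Sketch of the four architectural components."""
    def __init__(self, obs_dim, action_dim, state_dim, hidden_dim=128):
        super().__init__()
        # u_alpha: recurrent state update; input is (O_t, one-hot A_{t-1}, R_t)
        self.state_update = nn.LSTMCell(obs_dim + action_dim + 1, state_dim)
        # m_beta: expectation model (S, A) -> (expected next state, expected reward)
        self.model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim + 1))
        # pi_theta: non-linear policy over the agent state (softmax applied in act_probs)
        self.policy = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim))
        # v_w: linear value function of the agent state, v_w(S) = w . S
        self.value = nn.Linear(state_dim, 1, bias=False)

    def update_state(self, obs, prev_action, reward, hx, cx):
        x = torch.cat([obs, prev_action, reward], dim=-1)
        return self.state_update(x, (hx, cx))    # (new agent state, new cell state)

    def predict(self, state, action):
        out = self.model(torch.cat([state, action], dim=-1))
        return out[..., :-1], out[..., -1]       # expected next state, expected reward

    def act_probs(self, state):
        return torch.softmax(self.policy(state), dim=-1)

    def v(self, state):
        return self.value(state).squeeze(-1)
```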

4.2. Algorithmic Description

The Dyna architecture involves three distinct phases: direct learning, model learning, and planning. The three phases can proceed serially or concurrently. Figure 1 illustrates the actor-critic architecture with Dyna-style planning and contrasts it with the model-free actor-critic architecture. In this section, we further describe each of the three phases in the context of the actor-critic algorithm.

Direct Learning: In the direct learning phase, we learn the state update function, the value function, and the policy from real experience using the actor-critic method. Concretely, we first learn the value function $v_w$ and the state update function $u_\alpha$ by updating their parameters with the semi-gradient of the n-step TD error. We learn the policy $\pi_\theta$ and the state update function $u_\alpha$ by gradient ascent on the policy gradient objective.
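A sketch of one direct-learning update is shown below (our illustration, not the authors' code; it uses a simple bootstrapped segment return in place of the exact n-step construction, and it assumes the agent states in the trajectory were produced by $u_\alpha$ with the computation graph retained, so that α, w, and θ all receive gradients through the shared optimizer):

```python
def direct_learning_update(components, optimizer, trajectory, gamma=0.99):
    """One actor-critic update from real experience.
    trajectory = (list of agent states S_t, list of int actions, list of rewards, bootstrap value)."""
    states, actions, rewards, bootstrap_value = trajectory
    values = components.v(torch.stack(states))                    # v_w(S_t) for the segment
    # Bootstrapped returns G_t = R_{t+1} + gamma * G_{t+1}, treated as fixed targets (semi-gradient)
    returns, G = [], float(bootstrap_value)
    for t in reversed(range(len(rewards))):
        G = float(rewards[t]) + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)), dtype=values.dtype)
    advantages = returns - values
    value_loss = 0.5 * advantages.pow(2).mean()                   # critic: semi-gradient TD error
    probs = components.act_probs(torch.stack(states))
    chosen = probs.gather(1, torch.tensor(actions).view(-1, 1)).squeeze(1)
    policy_loss = -(advantages.detach() * torch.log(chosen + 1e-8)).mean()   # actor: policy gradient
    optimizer.zero_grad()
    (value_loss + policy_loss).backward()                         # gradients also reach u_alpha
    optimizer.step()
```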

Model Learning: During direct learning, we collect the transitions $(S_t, A_t, R_{t+1}, S_{t+1})$ and store them in a recency buffer. During the model learning phase, we sample a mini-batch of random transitions $(S_i, A_i, R_i, S'_i)$ and learn the model $\hat{S}'_i, \hat{R}_i = m_\beta(S_i, A_i)$ by minimizing the mean-squared error between the predictions $(\hat{S}'_i, \hat{R}_i)$ and the targets $(S'_i, R_i)$. It should be noted that the learned model is task-specific, since the state representation is influenced by the reward function.
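A minimal model-learning step might look as follows (our sketch; the buffer API, one-hot action encoding, and batch size are assumptions). Minimizing squared error against sampled, possibly stochastic targets drives $m_\beta$ toward the conditional expectation, which is exactly the intended semantics of an expectation model:

```python
def model_learning_update(components, model_optimizer, buffer, batch_size=32):
    """One expectation-model update from the recency buffer of real transitions."""
    states, actions, rewards, next_states = buffer.sample(batch_size)   # hypothetical buffer API
    pred_next, pred_reward = components.predict(states, actions)        # actions given as one-hot
    loss = ((pred_next - next_states.detach()).pow(2).mean()
            + (pred_reward - rewards.detach()).pow(2).mean())
    model_optimizer.zero_grad()
    loss.backward()
    model_optimizer.step()
```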

Planning: In the planning phase, a mini-batch of agent states $S_i$ is randomly sampled from the set of recently visited agent states, so that $p(S_i) \approx \mu_{\pi_\theta}(S_i)$, where $\mu_\pi(S_i)$ is the agent-state distribution under policy π. Once the agent states have been sampled, we choose an action $A_i$ for each $S_i$ according to the current policy $\pi_\theta$. Next, given $S_i$ and $A_i$, we predict the expected next agent state $\hat{S}'_i$ and the expected next reward $\hat{R}_i$ using the model $m_\beta$. We fix the state update function $u_\alpha$ and update only $v_w$ and $\pi_\theta$.
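A sketch of one planning update under these conventions (ours, not the authors' code) is given below; the planning optimizer is assumed to hold only the value and policy parameters, so $u_\alpha$ stays fixed, and the model is queried for single expected transitions only, never iterated:

```python
def planning_update(components, plan_optimizer, state_buffer, batch_size=32, gamma=0.99):
    """One Dyna-style planning update from simulated single-step transitions."""
    states = state_buffer.sample(batch_size).detach()         # recently visited agent states
    probs = components.act_probs(states)
    actions = torch.multinomial(probs, num_samples=1)         # a_i ~ pi_theta(. | S_i)
    one_hot = torch.zeros_like(probs).scatter_(1, actions, 1.0)
    with torch.no_grad():                                     # the model is not updated while planning
        next_states, rewards = components.predict(states, one_hot)
    # One-step Bellman target; valid because v_w is linear, so v_w(E[S']) = E[v_w(S')]
    targets = rewards + gamma * components.v(next_states)
    values = components.v(states)
    advantages = targets.detach() - values
    value_loss = 0.5 * advantages.pow(2).mean()
    log_pi = torch.log(probs.gather(1, actions).squeeze(1) + 1e-8)
    policy_loss = -(advantages.detach() * log_pi).mean()
    plan_optimizer.zero_grad()
    (value_loss + policy_loss).backward()
    plan_optimizer.step()
```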

Figure 3. Planning with correct planning targets vs. planning with incorrect planning targets: learning curves on the stochastic variants of the (a) fully observable and (b) partially observable navigation tasks. The x-axis represents the number of interactions with the environment; the y-axis corresponds to the number of episodes completed in the last 10,000 environment interactions. Learning curves are averaged over 20 runs with different random seeds.


Figure 4. (a) Fully observable task; (b) partially observable task. In the fully observable task, the agent receives an observation which represents the x, y coordinates of its position in the maze. In the partially observable task, the agent receives an observation which includes the color information of the agent's current cell and its four neighbours. The black cells mark the obstacles, while the colored cells mark the accessible regions. We formulate both tasks as episodic with a discount factor γ = 0.99. At the start of each episode, the agent is randomly placed in the maze. The reward is 0 at each time step except when the agent moves to the terminal state, at which point the agent receives a reward of 1 and the episode terminates. We consider deterministic and stochastic transition dynamics for both tasks. In the stochastic case, the intended action is executed with probability 0.7, while a random action is executed with probability 0.3.

We fuse the aforementioned algorithm with A3C (Mnih et al., 2016) and Generalized Advantage Estimation (Schulman et al., 2015), and call the result Asynchronous Advantage Actor-Critic with Dyna-style planning (A3CD).
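For reference, the advantage estimator used in the model-free updates can be written compactly (our sketch of standard GAE following Schulman et al. (2015); the value of lambda and the data layout are assumptions):

```python
def gae_advantages(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation:
    delta_t = R_{t+1} + gamma * v(S_{t+1}) - v(S_t);  A_t = sum_l (gamma * lam)^l * delta_{t+l}."""
    advantages, gae, next_value = [], 0.0, bootstrap_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages.append(gae)
        next_value = values[t]
    return list(reversed(advantages))
```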

5. Experiments

We designed our experiments to answer the following questions: a) does our proposed method improve sample efficiency when compared with model-free methods? b) how robust is the learned model to the changing state update function? c) how does the proposed method compare with a method which uses a non-linear expectation model with a non-linear value function?

We evaluate our framework on fully and partially observable 2D navigation tasks, illustrated in Figure 4.

5.1. Baselines

• Model-Free A3C with Generalized Advantage Estimation. It is important to note that our architecture reduces to this baseline if it does not perform the model learning and planning steps.

• Dyna-style planning with a linear value function and the true model. This baseline demonstrates the improvement in sample efficiency if an accurate model is available.

• Dyna-style planning with a non-linear value function and the true model. This baseline is used to empirically demonstrate that a non-linear expectation model provides wrong planning targets if a non-linear value function is used.

5.2. Results

Model-free, true expectation model, and learned expectation model. Results are summarized in Figure 2. We used the number of episodes completed in the last 10,000 steps as the evaluation metric. We first notice that in the stochastic and deterministic variants of both tasks, A3CD with the true expectation model achieves optimal performance considerably faster than model-free A3C. This indicates that planning by only updating the value and policy functions, but not the state update function, can still obtain significant improvements over the model-free method.

We also notice that in both the fully and partially observable tasks, the learned model performs almost as well as the true model, which suggests that the model can be fairly robust to the problem of the changing state update function. One exception is the performance of the learned model in the stochastic variant of the partially observable task. While the learned model does not reach the performance level of the true model, it still significantly outperforms the model-free baseline. This is expected, as model learning is considerably more challenging if the task is both partially observable and stochastic.

Incorrect planning targets with a non-linear value function. To empirically show that in a stochastic world a non-linear value function combined with a non-linear expectation model leads to wrong planning targets, which can consequently hurt performance, we compare the performance of linear and non-linear value functions for planning with the true expectation model. While the former's planning process only updates the parameters of the policy and the linear value function, and not the parameterized state update function, the latter updates all the parameters of the architecture, but with a wrong target. We compare the two methods on the stochastic variants of the fully and partially observable tasks. As illustrated in Figure 3, when the expectation model is used, planning updates with the non-linear value function perform considerably worse than planning updates which involve the linear value function.

6. Conclusion

In this paper, we devised the first MBRL algorithm that is sound for non-linear expectation models and stochastic environments. The proposed approach learns the model online and uses it to simulate experience in the agent-state space to update the value function and the policy. We also extend Dyna to partially observable environments. Experiments on partially observable navigation tasks show the benefit of the proposed algorithm for sample-efficient learning.


References

Bertsekas, D. P. and Tsitsiklis, J. N. (1995). Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pages 560–564. IEEE.

Betts, J. T. (1998). Survey of numerical methods for trajectory optimization. Journal of Guidance, Control, and Dynamics, 21(2):193–207.

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. (2012). A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43.

Deisenroth, M. and Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472.

Finn, C., Goodfellow, I., and Levine, S. (2016). Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, pages 64–72.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Leibfried, F., Kushman, N., and Hofmann, K. (2016). A deep learning approach for joint video frame and reward prediction in Atari games. arXiv preprint arXiv:1611.07078.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.

Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. (2015). Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871.

Oh, J., Singh, S., and Lee, H. (2017). Value prediction network. In Advances in Neural Information Processing Systems, pages 6120–6130.

Paduraru, C. (2007). Planning with approximate and learned models of Markov decision processes. Master's thesis, University of Alberta.

Pan, Y., Zaheer, M., White, A., Patterson, A., and White, M. (2018). Organizing experience: a deeper look at replay mechanisms for sample-based planning in continuous state domains. In IJCAI.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

Sutton, R. (1991). Integrated modeling and control based on reinforcement learning and dynamic programming. In Advances in Neural Information Processing Systems.

Sutton, R. S., Szepesvári, C., Geramifard, A., and Bowling, M. P. (2012). Dyna-style planning with linear function approximation and prioritized sweeping. arXiv preprint arXiv:1206.3285.

Talvitie, E. (2017). Self-correcting models for model-based reinforcement learning. In AAAI, pages 2597–2603.