
The Reinforcement Learning Toolbox, Reinforcement Learning for Optimal Control Tasks

Diploma thesis (Diplomarbeit) at the Institut für Grundlagen der Informationsverarbeitung (IGI),
Technisch-Naturwissenschaftliche Fakultät, Graz University of Technology (Technische Universität Graz)

submitted by

Gerhard Neumann
Degree programme: Telematics (Telematik)

Supervisor: O. Univ.-Prof. Dr.rer.nat. DI Wolfgang Maass

Graz, May 2005


Contents

0.1 Abstract
0.2 Thesis Structure

1 Reinforcement Learning
  1.1 Introduction
  1.2 Problems of reinforcement learning
  1.3 Successes in Reinforcement Learning
    1.3.1 RL for Games
    1.3.2 RL for Control Tasks
    1.3.3 RL for Robotics
  1.4 Basic Definitions for RL
    1.4.1 Markov Decision Process (MDP)
    1.4.2 Partially Observable Markov Decision Processes (POMDP)

2 The Reinforcement Learning Toolbox
  2.1 Introduction
    2.1.1 Design Issues
    2.1.2 Programming Issues
    2.1.3 Libraries and Utility classes used
  2.2 Structure of the Learning system
    2.2.1 The Listeners
    2.2.2 The Agent
    2.2.3 The Environment Models
    2.2.4 The Action Model
    2.2.5 The Agent Controllers
    2.2.6 The State Model
    2.2.7 Logging the Training Trials
    2.2.8 Parameter representation
    2.2.9 A general interface for testing the learning performance

3 State Representations in RL
  3.1 Discrete State Representations
    3.1.1 Discretization of continuous Problems
    3.1.2 State Discretization in the RL Toolbox
    3.1.3 Discretizing continuous state variables
    3.1.4 Combining discrete state variables
    3.1.5 Combining discrete state objects
    3.1.6 State substitutions
  3.2 Linear Feature States
    3.2.1 Tile coding
    3.2.2 Linear interpolation
    3.2.3 RBF-Networks
    3.2.4 Linear features in the RL Toolbox
    3.2.5 Laying uniform grids over the state space
    3.2.6 Calculating features from a single continuous state variable
  3.3 States for Neural Networks

4 General Reinforcement Learning Algorithms
  4.1 Theory on Value based approaches
    4.1.1 Value Functions
    4.1.2 Q-Functions
    4.1.3 Optimal Value Functions
    4.1.4 Implementation in the RL Toolbox
  4.2 Dynamic Programming
    4.2.1 Evaluating the V-Function of a given policy
    4.2.2 Evaluating the Q-Function
    4.2.3 Policy Iteration
    4.2.4 Value iteration
    4.2.5 The Dynamic Programming Implementation in the Toolbox
  4.3 Learning the V-Function
    4.3.1 Temporal Difference Learning
    4.3.2 TD(λ) V-Learning
    4.3.3 Eligibility traces for Value Functions
    4.3.4 Implementation in the RL Toolbox
  4.4 Learning the Q-Function
    4.4.1 TD Learning
    4.4.2 TD(λ) Q-Learning
    4.4.3 Implementation in the RL Toolbox
  4.5 Action Selection
    4.5.1 Action Selection with V-Functions using Planning
    4.5.2 Implementation in the RL Toolbox
  4.6 Actor-Critic Learning
    4.6.1 Actors for two different actions
    4.6.2 Actors for a discrete action set
    4.6.3 Implementation in the RL Toolbox
  4.7 Exploration in Reinforcement Learning
    4.7.1 Undirected Exploration
    4.7.2 Directed Exploration
    4.7.3 Model Free and Model Based directed exploration
    4.7.4 Distal Exploration
    4.7.5 Selective Attention
    4.7.6 Implementation in the RL Toolbox
  4.8 Planning and model based learning
    4.8.1 Planning and Learning
    4.8.2 The Dyna-Q algorithm
    4.8.3 Prioritized Sweeping
    4.8.4 Implementation in the RL Toolbox

5 Hierarchical Reinforcement Learning
  5.1 Semi Markov Decision Processes
    5.1.1 Value and action value functions
    5.1.2 Implementation in the RL Toolbox
  5.2 Hierarchical Reinforcement Learning Approaches
    5.2.1 The Option framework
    5.2.2 Hierarchy of Abstract Machines
    5.2.3 MAX-Q Learning
    5.2.4 Hierarchical RL used in optimal control tasks
  5.3 The Implementation of the Hierarchical Structure in the Toolbox
    5.3.1 Extended actions
    5.3.2 The Hierarchical Controller
    5.3.3 Hierarchic SMDPs
    5.3.4 Intermediate Steps
    5.3.5 Implementation of the Hierarchic architectures

6 Function Approximators for Reinforcement Learning
  6.1 Gradient Descent
    6.1.1 Efficient Gradient Descent
  6.2 The Gradient Calculation Model in the Toolbox
    6.2.1 Representing the Gradient
    6.2.2 Updating the Weights
    6.2.3 Calculating the Gradient
    6.2.4 Calculating the Gradient of V-Functions and Q-Functions
    6.2.5 Calculating the gradient of stochastic Policies
    6.2.6 The supervised learning framework
  6.3 Function Approximation Schemes
    6.3.1 Tables
    6.3.2 Linear Approximators
    6.3.3 Gaussian Softmax Basis Function Networks (GSBFN)
    6.3.4 Feed Forward Neural Networks
    6.3.5 Gauss Sigmoid Neural Networks (GS-NNs)
    6.3.6 Other interesting or utilized architectures for RL
    6.3.7 Implementation in the RL Toolbox

7 Reinforcement learning for optimal control tasks
  7.1 Using continuous actions in the Toolbox
    7.1.1 Continuous action controllers
    7.1.2 Gradient calculation of Continuous Policies
    7.1.3 Continuous action Q-Functions
    7.1.4 Interpolation of Action Values
    7.1.5 Continuous State and Action Models
    7.1.6 Learning the transition function
  7.2 Value Function Approximation
    7.2.1 Direct Gradient Algorithm
    7.2.2 Residual Gradient Algorithm
    7.2.3 Residual Algorithm
    7.2.4 Generalizing the Results to TD-Learning
    7.2.5 TD(λ) with Function approximation
    7.2.6 Implementation in the RL Toolbox
  7.3 Continuous Time Reinforcement Learning
    7.3.1 Continuous Time RL formulation
    7.3.2 Learning the continuous time Value Function
    7.3.3 Continuous TD(λ)
    7.3.4 Finding the Greedy Control Variables
    7.3.5 Implementation in the RL Toolbox
  7.4 Advantage Learning
    7.4.1 Advantage Learning Update Rules
    7.4.2 Implementation in the RL Toolbox
  7.5 Policy Search Algorithms
    7.5.1 Policy Gradient Update Methods
    7.5.2 Calculating the learning rate
    7.5.3 The GPOMDP algorithm
    7.5.4 The PEGASUS algorithm
    7.5.5 Implementation in the RL Toolbox
  7.6 Continuous Actor-Critic Methods
    7.6.1 Stochastic Real Valued Unit (SRV) Algorithm
    7.6.2 Policy Gradient Actor Learning
    7.6.3 Implementation in the RL Toolbox

8 Experiments
  8.1 The Benchmark Tasks
    8.1.1 The Pendulum Swing Up Task
    8.1.2 The Cart-Pole Swing Up Task
    8.1.3 The Acrobot Swing Up Task
    8.1.4 Approaches from Optimal Control
  8.2 V-Function Learning Experiments
    8.2.1 Learning the Value Function
    8.2.2 Action selection
    8.2.3 Comparison of Different Time Scales
    8.2.4 The influence of the Eligibility Traces
    8.2.5 Directed Exploration
    8.2.6 N-step V-Planning
    8.2.7 Hierarchical Learning with Subgoals
    8.2.8 Conclusion
  8.3 Q-Function Learning Experiments
    8.3.1 Learning the Q-Function
    8.3.2 Comparison of different time scales
    8.3.3 Dyna-Q learning
    8.3.4 Conclusion
  8.4 Actor-Critic Algorithm
    8.4.1 Actor-Critic with Discrete Actions
    8.4.2 The SRV algorithm
    8.4.3 Policy Gradient Actor-Critic Learning
    8.4.4 Conclusion
  8.5 Comparison of the algorithms
  8.6 Policy Gradient Algorithm
    8.6.1 GPOMDP
    8.6.2 The PEGASUS algorithm
    8.6.3 Conclusions
  8.7 Conclusions

A List of Abbreviations
B List of Notations
C Bibliography


List of Figures

1.1 The Cart-pole Problem
1.2 The Acrobot
1.3 The Truck Backer Upper Problem
1.4 The robot stand up task

2.1 The structure of the learning system
2.2 Interaction of the agent with the environment
2.3 Using an individual transition function for the Environment
2.4 Action objects, Action sets and action data objects
2.5 The interaction of the agent with the controller
2.6 State Objects and State Properties
2.7 Calculating modified states and storing them in state collections
2.8 The adaptable parameter representation of the Toolbox

3.1 Discretizing a single continuous state variable
3.2 Combining several discrete state objects with the "and" operator
3.3 State Substitutions
3.4 Tilings
3.5 Grid Based RBF-Centers
3.6 Single state feature calculator

4.1 The value function
4.2 Q-Functions for a finite action set
4.3 Representation of the Transition Matrix
4.4 V-Function Planning
4.5 Stochastic Policies
4.6 The general Actor-Critic architecture
4.7 The class architecture of the Toolbox for the Actor-Critic framework
4.8 Dyna-Q Learning

5.1 Illustration of the execution of an MDP, an SMDP and an MDP with options
5.2 Temporal Difference Learning with options
5.3 State transition structure of a simple HAM
5.4 Illustration of the taxi task
5.5 MAX-Q task decomposition for the Taxi problem
5.6 Subgoals defined for the robot stand-up task
5.7 The hierarchic controller architecture of the Toolbox
5.8 The hierarchic Semi-MDP is used for learning in different hierarchy levels
5.9 Intermediate steps of an option
5.10 Realization of the option framework
5.11 Realization of the MAX-Q framework

6.1 Interface for updating the weights of a parameterized FA
6.2 Interface for parameterized FAs which provide the gradient calculation
6.3 Value Function class which uses a gradient function as function representation
6.4 Single Neuron
6.5 Gaussian-Sigmoidal Neural Networks

7.1 Continuous action controllers
7.2 Limited control policy
7.3 Direct and Residual Gradients

8.1 The Pendulum Task
8.2 The Cart-pole Task
8.3 The Acrobot Task
8.4 Pendulum V-RBF Performances
8.5 Pendulum V-RBF Performances
8.6 Pendulum FF-NN Performance
8.7 Pendulum FF-NN Learning Trials
8.8 Cart-Pole FF-NN Performance
8.9 Pendulum GS-NN Performance
8.10 Performances of different action selection schemes
8.11 Comparison with different time scales
8.12 Pendulum V-RBF Performance with different e-traces (Direct Gradient)
8.13 Pendulum V-RBF Performance with different e-traces (Residual)
8.14 Pendulum V-FF-NN Performance with different e-traces (Direct)
8.15 Pendulum V-FF-NN Performance with different e-traces (Residual β = 0.6)
8.16 Pendulum V-FF-NN Performance with different e-traces (variable β)
8.17 Pendulum V-GS-NN Performance with different e-traces (Direct)
8.18 Pendulum V-GS-NN Performance with different e-traces (Residual β = 0.6)
8.19 Performance of directed exploration schemes
8.20 Performance of directed exploration schemes
8.21 Performance of V-Planning
8.22 Performance of Hierarchic Learning with Subgoals
8.23 Pendulum Q-RBF Performance
8.24 Cart-Pole Q-RBF Performance
8.25 Pendulum Q-FF-NN Performance
8.26 Performance of Q-Learning with different time scales and Dyna-Q Learning
8.27 Performance of discrete Actor-Critic Learning
8.28 Pendulum performance of the SRV algorithm with RBF critic
8.29 Pendulum performance of the SRV algorithm with FF-NN critic
8.30 Pendulum performance of the SRV algorithm with FF-NN critic
8.31 CartPole performance of the SRV algorithm with an RBF critic
8.32 Pendulum performance of the PGAC algorithm
8.33 Cart-Pole performance of the PGAC algorithm
8.34 Cart-Pole learning curve of the PGAC algorithm with a FF-NN critic
8.35 Comparison of the algorithms for the pendulum task with an RBF network
8.36 Comparison of the algorithms for the cart-pole task with an RBF network
8.37 Comparison of the algorithms for the Pendulum-Task with an FF-NN network
8.38 Performance of the GPOMDP algorithm for the pendulum task
8.39 Performance of the PEGASUS algorithm for the pendulum task


0.1 Abstract

This thesis investigates the use of reinforcement learning for optimal control problems and tests the performance of the most common existing RL algorithms on different optimal control benchmark problems. The tests consist of an exhaustive comparison of the introduced RL algorithms with different parameter settings and different function approximators. The tests also demonstrate the influence of specific parameters on the algorithms. To our knowledge these tests are the most exhaustive benchmark tests done for RL.

We also developed a software framework for RL which makes our tests easily extendable. This framework is called the Reinforcement Learning Toolbox 2.0 (RLT 2.0), a general C++ library for all kinds of reinforcement learning problems. The Toolbox was designed to be of general use, to be extendable and to provide a satisfying computational speed performance. Nearly all common RL algorithms, such as TD(λ) learning for the V-Function and Q-Function, discrete Actor-Critic learning, dynamic programming approaches and prioritized sweeping (see [49] for an introduction to these algorithms), are included in the Toolbox, as well as special algorithms for continuous state and action spaces which are particularly suited for optimal control tasks. These algorithms are TD(λ) learning with value function approximation (a new, slightly extended version of the TD learning residual algorithm (see [6]) has been used), continuous time RL [17], Advantage Learning [6], the stochastic real valued algorithm (SRV, see [19], [17]) as Actor-Critic algorithm for continuous action spaces, and also two policy gradient algorithms, namely the GPOMDP algorithm [11] and modified versions of the PEGASUS algorithm presented in [33]. In addition to these mostly already existing algorithms, a new Actor-Critic algorithm is introduced. This new algorithm is referred to as policy-gradient Actor-Critic (PGAC) learning and will be presented in section 7.6.2. Most of these algorithms can be used with different kinds of function approximators; we implemented constant and adaptive normalized RBF networks (GSBFNs, see [30]), feed forward neural networks (FF-NNs) and Gaussian sigmoidal neural networks [42]. The Toolbox uses a modular design, so extending the Toolbox with new algorithms is very easy because much of the necessary functionality is likely to have already been implemented.

The second part of this thesis concerns the evaluation of the learning algorithms for continuous control tasks and how they cope with different function approximators. The benchmark tests are done for all the algorithms mentioned above. These algorithms were tested on the pendulum, cart-pole and acrobot swing up tasks with constant normalized RBF networks, feed forward neural networks and, as far as possible, Gaussian sigmoidal neural networks [42]. The influence of certain parameters of the algorithms, such as the λ value, is also evaluated for the different function approximators, as is the use of different time scales. Furthermore, we investigated the use of planning, directed exploration and hierarchical learning to boost the performance of the algorithms.

0.2 Thesis Structure

The thesis is divided into two parts: the first part covers the Reinforcement Learning Toolbox and RL algorithms in general. We will discuss several algorithms and other theoretical aspects of RL. At the end of every theoretical discussion the implementation issues of the Toolbox are explained, so this part of the thesis can also be used as a manual for the Toolbox. The second part of this thesis (chapter 8) covers the benchmark tests for optimal control tasks.

The thesis begins with a brief look at reinforcement learning itself, the successes and problems of reinforcement learning in general and specifically for continuous control tasks.

The next two sections of the thesis are more software related and deal with the general requirements and structure of the Toolbox. These sections will cover the agent, environment, actions and state models of the Toolbox.


We will then take a look at the general reinforcement learning approaches, which includes first of all a theoretical discussion of value based algorithms such as dynamic programming approaches and temporal difference learning. Actor-Critic learning and planning methods such as prioritized sweeping are also discussed, as is the problem of efficiently exploring the state space. The next section will cover hierarchical reinforcement learning and how this is done in the Toolbox. In chapter six we will take a look at function approximation using gradient descent in general, and its use in RL in particular. In this chapter we will also introduce the function approximation schemes used in RL and in our benchmark tests. In chapter seven we will discuss more specialized algorithms for dealing with continuous state and action spaces. First we will cover algorithms for value function approximation [6], then we will come to continuous time RL [17] and advantage learning [6]. After this, the two policy gradient algorithms GPOMDP [11] and PEGASUS [33] are introduced, and general issues about policy gradient algorithms are discussed. At the end of this chapter two different Actor-Critic algorithms are introduced: the stochastic real valued algorithm (SRV, [19]) and the newly proposed policy gradient Actor-Critic algorithm.

Chapter eight will cover the experiments with the pendulum, cart-pole and acrobot swing up benchmark tasks, which are explained at the beginning of that chapter. For each algorithm the best parameter settings are pointed out and the potentials, traps, advantages and disadvantages are discussed. The tests include exhaustive tests with V-Function learning, using different approximation algorithms and different function approximators. The influence of crucial parameters of the algorithms is also evaluated, as is the performance of the algorithms using different time scales. The results are compared to Q-Function learning algorithms and Actor-Critic approaches. The use of planning methods, directed exploration and hierarchical learning is also investigated for the case of V-Function learning. At the end, the experiments with policy gradient algorithms are presented.

In the conclusion, we summarize the results and talk about further possibilities to improve the performance of the algorithms.


Chapter 1

Reinforcement Learning

In this chapter we will explain the basics of reinforcement learning: what it is, its achievements and its problems.

1.1 Introduction

We define RL as learning from a reward signal to choose an optimal (or near optimal) action $a^*$ in the current state $s_t$ of the agent. Generally, the goal of all reinforcement learning algorithms is to find a good action-selection policy which optimizes the long-term reward. There are algorithms for optimizing the finite horizon, undiscounted reward $V(t_0) = \sum_{i=0}^{T} r(t_i)$, the (in)finite horizon discounted reward $V(t_0) = \sum_{i=0}^{\infty} \gamma^i r(t_i)$ ($\gamma$ is the discount factor) or also the average reward $V_A(t_0) = \lim_{T \to \infty} \frac{1}{T} \sum_{i=0}^{T} r(t_i)$, but the infinite horizon discounted reward is most commonly used. The agent learns from trial and error and attempts to adapt its action selection policy according to the received rewards.

Reinforcement learning is an unsupervised learning approach, which is one reason why reinforcement learning is so popular. In the best case, we only have to define our reward function and start our learning algorithm, and we get an action selection policy maximizing the long-term reward. Usually it is not that easy.

There is a huge variety of RL algorithms; the most common are value based algorithms (those which try to learn the expected discounted future reward for each state) and policy search algorithms, where the search is done directly in the space of the policy parameters. For policy search algorithms we can actually use any optimization algorithm we want, so there are approaches which use genetic algorithms or simulated annealing to search for a good policy (this is mentioned here to point out that RL does not necessarily mean that we need to learn a value function). In this thesis we will emphasize the value based algorithms; the use of policy search algorithms is not discussed and tested as exhaustively.
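For later reference, the infinite horizon discounted reward defined above can also be written recursively; this recursive structure is exactly what the value based algorithms discussed later exploit. A minimal LaTeX rendering of this standard identity (it is not stated explicitly in the text above):

    % Discounted return and its recursive (Bellman-style) form
    V(t_0) = \sum_{i=0}^{\infty} \gamma^i r(t_i)
           = r(t_0) + \gamma \sum_{i=0}^{\infty} \gamma^i r(t_{i+1})
           = r(t_0) + \gamma V(t_1)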

1.2 Problems of reinforcement learning

In practice, a learning problem faces many restrictions when trying to achieve an optimal (or at least good) policy. In general, reinforcement learning algorithms suffer from the following problems:

• The curse of dimensionality: Many algorithms need to discretize the state space, which is impossible for control problems with high dimensionality, because the number of discrete states would explode. The choice of the state space is crucial in reinforcement learning, so a lot of time has to be spent designing the state space. In chapter six, we will discuss function approximation methods which overcome this problem, but we will also see that new problems are introduced by this approach.


• Many learning trials: Most algorithms need a huge number of learning trials, especially if the state space is large, so it is very difficult to apply reinforcement learning to real world tasks like robot learning. However, RL can also be very time-consuming when learning a simulated control task.

• Finding good parameters for the algorithms: Many algorithms work well, but only with the right parameter setting. Searching for a good parameter setting is therefore crucial, in particular for time-consuming learning processes. Thus algorithms which work with fewer parameters or allow a wider range of parameter settings are preferable.

• Exploration-Exploitation Dilemma: Often, even with a good state representation and enough learning trials, the agent will become mired in suboptimal solutions because it has not searched through the state space thoroughly enough. On the other hand, if too many exploration steps are used, the agent will not find a good policy at all. So the amount of exploration is another parameter to be considered (in a few cases, we can set an individual parameter for exploration, for example the noise of a controller).

• A skilled ‘reinforcement learner’ is needed: Merely defining the reward function (which, in itself, is not always easy) is not enough; we must also define a good state space representation or a good function approximator, choose an appropriate algorithm and set the parameters of the algorithm. Consequently, much knowledge and experience is needed when dealing with reinforcement learning. The argument that anybody can ‘program’ a reinforcement learning agent (because one only needs to define a reward function intuitively) is not true in most cases.

It appears that there are many problems to solve, but nevertheless reinforcement learning has been applied successfully in many different domains; the restrictions were only mentioned in order to emphasize that RL is by no means a panacea.

1.3 Successes in Reinforcement Learning

Because of the generality of Reinforcement Learning, many researchers have tried to apply reinforcement learning to different fields, many of them successfully.

1.3.1 RL for Games

Games often have the problem of a huge state space which cannot be represented as a table. Thus a very good function approximator is needed for learning, making the learning time very high, so supervised learning from human experts, search and planning methods or rule based systems are often preferred to RL in games. But with a clever choice of the function approximator, and possibly by adding a hierarchic structure, RL is entirely applicable to the field of games.

• Backgammon: The most popular and impressive RL approach is TD-Gammon by Gerald Tesauro [50]. The algorithm uses a feed forward neural network to determine the probability of winning for a given state. For learning, TD(λ) learning of this value function is used. For training, the algorithm uses self play (over one million self-play games were used). The performance of this algorithm outperforms all other AI approaches and meets the performance of the world's best human players.

• Chess and Checkers: Other approaches applying RL to chess (the KnightCap program by Baxter [10]) or checkers (Schaeffer [41]) were successful as well, but could not compete with human experts.


• Settlers of Catan: M. Pfeiffer [38] used reinforcement learning to learn a good policy for the game ‘Settlers of Catan’. He employed hierarchical reinforcement learning and a model tree as function approximator for his Q-Function. Even though the game is quite complex, the algorithm manages to compete with skilled human players.

1.3.2 RL for Control Tasks

Reinforcement learning for control tasks is a challenging problem, because we typically have continuous state and action spaces. For learning with continuous state and action spaces, a function approximator must be used. Since in most cases RL with function approximators needs many learning steps to converge, results exist only for simulated tasks. RL has been used to solve the following problems:

• Cart-Pole Swing Up: The task is to swing up a pole hinged on a cart by applying a bounded force to the cart. The cart must not leave a specified area. This task has one control variable (the force applied to the cart) and four state variables. The task was solved successfully by Doya [17] with an RBF network and also by Coulom [15] with a feed-forward neural network (FF-NN). Other approaches using Actor-Critic learning have been investigated by Morimoto [31] and Si [43].

Figure 1.1: The Cart-pole Problem, taken from Coulom [15]

• Acrobot: The acrobot consists of two links, one attached at the end of the other. There is one motor at the joint between the two links which can apply a limited torque. The task is to swing both links into the upward position. Here again we have one control variable and four state variables, but this task is more complex than the cart-pole swing up task. The task was solved by Coulom [15] with an FF-NN, and by Yonemura [56] with a hierarchic switching approach between several controllers from optimal control theory.

• Double Pendulum: Here we have the same constellation as in the acrobot task, but there is an additional motor at the base of the pendulum. Thus we have two control variables and four state variables. Randlov [39] solved this problem with the help of an LQR (linear quadratic regulator) controller around the area of the target state.


Figure 1.2: The Acrobot. In the double pendulum problem, an additional torque can also be applied to the fixed joint. The figure is taken from Yoshimoto [57].

• Bicycle Problem: Randlov has written a bicycle simulator where the agent must balance a bike. Different tasks have been learned with this simulator, such as simply balancing the bicycle (the riding direction does not matter) or riding in a specified direction. The agent can use two control variables: the torque applied to the handle bars and the displacement of the center of mass from the bicycle's plane. Depending on the task, the problem has four state variables (balancing the bike) or seven state variables (riding to a specific place). The simulator has been used by several researchers. Randlov [39] solved both tasks with Q-Learning; she used tilings for the state representation. Ng and Jordan [33] solved this task much more efficiently with the PEGASUS policy search algorithm.

• Truck-Backer-Upper Problem: In this task the agent must navigate a trailer truck backwards to a docking point (see figure 1.3). The truck has to avoid an inner blocking between the cab and the trailer, a too large steering angle and hitting the wall. The goal is to navigate the trailer to a specified position perpendicular to the wall. The dynamics of the TBU task are highly nonlinear; the standard task configuration used by Vollbrecht [53] has six continuous state variables: the $x$ and $y$ position of the trailer, the orientation of the trailer $\theta_{trailer}$, the inner orientation of the cab $\theta_{cab}$, its derivative $\dot{\theta}_{cab}$ and the current steering angle. To control the truck we can change the steering angle; the truck moves with a constant velocity. Vollbrecht successfully learned the TBU task with a hierarchic Q-Learning approach using an adaptive kd-tree for the state discretization.

Figure 1.3: The Truck Backer Upper Problem. The figure is taken from Vollbrecht [53].

• Inverted Helicopter Flight: Ng [32] managed to learn inverted helicopter flight in simulation with very good results. Inverted helicopter flight at a constant position is a highly nonlinear and unstable process, making it a difficult task for human experts. Ng created a model of the inverted helicopter flight by standard system identification methods. Learning was done with parameterized regulators using the PEGASUS policy search algorithm in the simulator. The learned policy could also be transferred successfully to the real model helicopter.

• Swimmer Problem: The simulated swimmer consists of three or more connected links. The swimmer must move in a two dimensional pool. The goal is to swim as quickly as possible in a given direction by using the friction of the water. The state of an n-segment swimmer is defined by the n angles and the angular velocities of the segments, as well as the x and y velocity of the center of mass. This gives us a state space of $2n + 2$ dimensions. We can control every joint separately, which gives $n - 1$ control variables.

Coulom [15] managed to learn the swimmer task for a 3, 4 and 5-segment swimmer (which gives us a maximum of 12 state variables, which is quite respectable). He used continuous time RL and a feed forward neural network with 30 neurons for the simpler and 60 neurons for the more complex swimmers as function approximators. Training was done for more than 2 million learning trials to get good policies. Consequently the learning time was huge. The learning performance also showed many instabilities, but the learning system always managed to recover from these instabilities, except in the case of the five segment swimmer, where the learning performance collapsed after over two million learning trials.

• Racetrack Problem: For this problem the Robot Auto Racing Simulator (RARS) is used to learn to drive a car. The simulator uses a very simple two dimensional model of car driving, where a single car has four state variables (the two-dimensional position $p$ and the velocity $v$) and two control variables. Additional state information about the track can be added to the state space (e.g. if different tracks are used during learning). The aim of the task is to drive on the track as fast as possible, either on the empty track or in a race with opponents. There are annual championships where several different algorithms compete in a race. Current algorithms either calculate an optimal path off-line first, which results in very good lap times but is poor if passing an opponent is necessary, or they try to find a good policy by observing the current state and using clever heuristics. The latter policies are usually good at passing opponents, but the lap times are not as good as those of the off-line calculation. Coulom [15] tried to learn a policy which has good lap times and is also good at passing, but he had very limited success; the learned controller could not compete with either one of the existing approaches. Coulom tried two different approaches using continuous time RL, one with a 30-neuron feed forward neural network and one using specific useful features, which performed better. The best policy managed to complete the given training track in 38 seconds, which is 8 seconds slower than one of the fastest existing policies.

• Robot Stand-Up Problem: In this task, a three-linked planar robot has to stand up from the lying position. The robot has up to 10 state variables ($\theta_1, \dot{\theta}_1, \theta_2, \dot{\theta}_2, \theta_3, \dot{\theta}_3, x, \dot{x}, y, \dot{y}$), but for the stand up task only the first six state variables are used. The robot can be controlled by applying torques to the two joints of the robot. Morimoto and Doya [30] successfully used hierarchic RL to learn this task. Q-Learning was used for the upper hierarchy level and an Actor-Critic approach was used for the lower hierarchy. An adaptive normalized RBF network (GSBFN) was used as function approximator.

Figure 1.4: The robot stand up task. The figure is taken from Morimoto [30].

1.3.3 RL for Robotics

RL is rather difficult to apply to robotics because it needs many learning trials, and in many cases we can only measure a part of the state of the robot. Nevertheless, RL has been applied successfully to robots by several researchers; usually policy gradient approaches are used in this case due to the high dimensional state space of these problems.

• Quadruped Gait Control and Ball Acquisition: Stone, Kohl [24] and Fidelman [18] used a policy gradient algorithm to learn fast locomotion with the dog-like robot Aibo. The same approach was used to learn ball acquisition, where the task was to capture the ball under the chin of the robot without kicking it away. They used a parameterized open-loop controller with 12 or 4 parameters; the gradient of the policy was estimated using a numerical differentiation approach directly on the AIBO robot. The learnt locomotion policy was faster than all other existing hand coded and learned policies. The ball acquisition task could be learned successfully as well; this task had to be optimized manually for different gait controls and different walking surfaces.

• Robot Navigation Tasks: Smart [45], [47] uses Q-Learning in combination with locally weighted learning [4] for navigating a mobile robot, including obstacle avoidance tasks.


• Humanoid Robots: Peters [37] uses the natural Actor-Critic algorithm [23] for point to point movements with a humanoid robot arm, given the desired trajectory. A pre-defined regulator is used for the policy, and the parameters of the regulator are optimized. The successful approach in this paper indicates that RL can also be useful for very high dimensional tasks like humanoid robot learning.

As we can see, many interesting problems have been solved using RL in robotics and optimal control, but also in many other interesting learning domains. If used correctly, RL can solve very hard learning problems.

1.4 Basic Definitions for RL

1.4.1 Markov Decision Process (MDP)

In the formal definition (taken from Sutton's book [49]) an MDP consists of:

• The state space $S$: The space of all possible states. It can be discrete ($S \subseteq \mathbb{N}$), continuous ($S \subseteq \mathbb{R}^n$) or a mixture of both.

• The action space $A$: The space of all actions the agent can choose from; again, it can be discrete (a set of actions), continuous or a mixture of both.

• A numerical reward function $r : S \times A \to \mathbb{R}$.

• The state transition function $f : S \times A \to S$.

• An initial state distribution $d : S \to [0, 1]$ over the state space.

The agent is the acting object in the environment; at each step the agent can choose an action to execute, which affects the state of the agent (according to the state transition function). The typical task of the agent is to find a policy $\pi : S \to A$ that maximizes the future discounted reward at time $t$: $V_t = \sum_{k=0}^{\infty} \gamma^k r(t + k)$. $\gamma$ is the discount factor and is restricted to the interval $[0, 1]$. For $\gamma < 1$ the sum is bounded if the reward function is bounded; for $\gamma = 1$ the sum can diverge even for bounded rewards. The state transition function $f$, the reward function $r$ and the policy $\pi$ may be stochastic functions. For MDPs we have to make an additional assumption on the state transition function and the reward function, which is called the Markov property. Both functions may only depend on the current state and action, not on any state, action or reward that occurred in the past. Consequently, we can write $r(t) = r(s_t, a_t, s_{t+1})$ for the reward function and $s_{t+1} = f(s_t, a_t)$ for the transition function (or, when talking about probability distributions, $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, \ldots) = P(s_{t+1} \mid s_t, a_t)$). Most of the algorithms require the Markov property for their proven convergence to the optimal policy, but may still work if the Markov property is not violated too drastically.
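To make the interaction loop above concrete, the following minimal C++ sketch shows an agent acting in an environment and accumulating the discounted reward. It is purely illustrative; the types Environment, Policy, State and Action are hypothetical placeholders and not the Toolbox's actual classes (which are described in chapter 2).

    #include <vector>

    // Hypothetical placeholder types; the Toolbox uses its own, richer classes.
    typedef std::vector<double> State;
    typedef int Action;

    class Environment {
    public:
        virtual State transition(const State &s, Action a) = 0;                  // f : S x A -> S
        virtual double reward(const State &s, Action a, const State &next) = 0;  // r : S x A -> R
        virtual ~Environment() {}
    };

    class Policy {
    public:
        virtual Action getAction(const State &s) = 0;                            // pi : S -> A
        virtual ~Policy() {}
    };

    // One episode of agent-environment interaction; the discounted reward
    // V_t = sum_k gamma^k r(t + k) is accumulated for illustration.
    double runEpisode(Environment &env, Policy &pi, State s, int steps, double gamma)
    {
        double discountedReward = 0.0;
        double discount = 1.0;
        for (int t = 0; t < steps; t++) {
            Action a = pi.getAction(s);            // choose the action with the policy
            State next = env.transition(s, a);     // Markov transition: depends only on (s, a)
            discountedReward += discount * env.reward(s, a, next);
            discount *= gamma;
            s = next;                              // the new state becomes the current state
        }
        return discountedReward;
    }

Concrete learning algorithms only differ in what they do with each observed transition and in how the policy is represented.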

1.4.2 Partially Observable Markov Decision Processes (POMDP)

POMDPs lack the Markov property: the next state depends not only on the current state and action, it can depend on the whole history. There are two points of view for POMDPs. If the current state of the POMDP can be definitely determined by the history of the states, we can convert a POMDP into an MDP by adding the whole history to the current state. Then the decision process would have the Markov property again. But this approach vastly increases the state space size, so it is not applicable.

The second point of view is that we can see a POMDP as an MDP with belief states. This is applicable when parts of the state of the POMDP are not visible to the agent. Here, the agent maintains a probability distribution of what it believes about the current unobservable state; these distributions can then be updated according to Bayes' rule. The distribution itself can now be seen as the state of the decision process; as a consequence the process is an MDP again.
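The Bayes update mentioned above can be written explicitly as the standard POMDP belief update. This equation is a sketch added for clarity; in particular the observation model $O(o \mid s', a)$ is not introduced in the text above and is an assumed piece of notation:

    b'(s') = \frac{O(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b(s)}
                  {\sum_{s'' \in S} O(o \mid s'', a) \sum_{s \in S} P(s'' \mid s, a)\, b(s)}

Here $b$ is the current belief distribution over the hidden states, $a$ is the executed action and $o$ is the received observation; the denominator simply normalizes the new belief $b'$.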

After having fixed the basic definitions of RL, we can take a closer look at the RL Toolbox in the next chapter.


Chapter 2

The Reinforcement Learning Toolbox

2.1 Introduction

The Reinforcement Learning Toolbox (RLT) is a general C++ library for all kinds of reinforcement learning problems (not just continuous ones). The Toolbox was designed to be of general use, to be extendable and to provide a satisfactory computational speed performance. The library can be used with Windows and Linux.

The RLT is a general tool for researchers, and also for students, who want to use reinforcement learning; it spares them a lot of additional programming work and allows the researcher to concentrate on the learning problem instead. Since it requires more effort to write a toolbox of general use instead of coding programs just for a specific case, the Toolbox is a main part of this thesis.

The Toolbox contains a large selection of the most common RL algorithms. There is TD(λ) for Q- and V-Learning, all with two different residuals for discrete and continuous time learning, the Residual (Gradient) algorithm, Advantage Learning, several Actor-Critic methods and policy gradient algorithms (like GPOMDP [11]) and a version of PEGASUS [32]. Most of these algorithms can be used with different kinds of function approximators; we implemented constant and adaptive normalized RBF networks (GSBFNs, see [30]), feed forward neural networks (FF-NNs) and Gaussian sigmoidal neural networks [42].

One main goal of the Toolbox is to enable the end-user to use RL without having to do any programming. The Toolbox (the current and the old version) has been used by 20 to 30 researchers from all over the world, and hopefully further users will test the Toolbox in the future.

2.1.1 Design Issues

For the design of the Toolbox, we attached importance to the following points:

• Adaptable Learning System: The learning system should be very adaptable, so that new algorithms can be added easily. A general interface for the learning algorithms is needed. In order to be able to try many different algorithms for a learning problem, it should be possible to exchange the learning algorithm easily. The possibility of learning with more than one algorithm at a time (which can be used for off-policy learning) also has to be considered. Since reinforcement learning problems should have the Markov property, we decided to provide each algorithm with just the tuple $\langle s_t, a_t, s_{t+1} \rangle$ at each time step (a minimal sketch of such an interface is given after this list).

• Learning from other Controllers: One good way to induce prior knowledge is to show the learning algorithm how to solve the learning problem (usually not in an optimal way, otherwise you would not need a learning tool). You can do this with a self-written controller. It is often easy to write a simple policy which solves the learning problem, but for the learning algorithm these simple policies are difficult to find if the state space is very large. There has to be the possibility of using a controller independent of the learning algorithm, so a general interface for controllers is needed.

• The algorithms should be independent of the used state representation: Very few algorithms depend on a single kind of state representation (e.g. discrete states or linear features) or on a single kind of function approximator. So the algorithms should work with any kind of representation we want to use for the Q-Functions or learned policies. In general, there are three different representations we can learn with the different algorithms: we can learn V-Functions, Q-Functions or the policy directly, depending on the algorithm used. These three different kinds of representations need a general interface for getting, setting and updating the value of the function for a specific state (see the sketch after this list). Consequently, the algorithms will work no matter what function approximator (tables, linear approximators, feed forward NNs) is used for the learned representation.

• Easy methods for constructing your state space individually: An RL system should provide tools for constructing and adapting the state space very easily, because this is one of the most crucial aspects of RL. We have to provide tools for partitioning continuous state variables, combining discrete state variables and substituting a discrete state space (which is more accurate) for a specific discrete state number of another discrete state space (see chapter 3).

• Tools for logging, analyzing policies and error recognition: In order to provide the opportunity to analyze the learning process we have to construct tools for logging the episodes and for analyzing V-Functions, Q-Functions and controllers. There are also a few areas like robotics where episodes are very expensive (i.e. time consuming) to obtain. Consequently, there also has to be the possibility of learning from stored episodes instead of learning online. Learning from stored episodes will not work as well as online learning (since it is off-policy learning), but the stored episodes can be used as a kind of prior knowledge for the agent. The stored episode data is also used by a few planning algorithms.

We also added tools for analyzing and visualizing policies, V-Functions and Q-Functions, as well as a sophisticated debugging system.

• Representation of the actions: Since there is such a wide range of applications for RL, a single data structure that matches all possible actions does not exist. There are actions having only a specific index, actions having continuous action values and actions having different durations. For hierarchical learning, actions can consist of other actions. A class structure must be developed to match all these requirements.

• Hierarchical Reinforcement Learning: In hierarchical reinforcement learning we can construct different layers of hierarchy for the learning problem. In each hierarchy level we then again have a Markov decision process with the same requirements as for the original learning system. So, simply put, we have an agent in each hierarchy level which must decide what to do. The design of the agent has to consider that it can also act in a hierarchic MDP instead of directly in the environment.

• Speed: For most reinforcement learning problems we need a huge number of trials to learn from. Very often the parameters are not chosen correctly, so the learning process must be repeated often to find good parameter regimes. Thus speed is a crucial design issue of the Toolbox in order to be usable. In many situations, a more complex implementation has been chosen to get better performance. There is always a trade-off between good performance and generality of a piece of software; consequently, the Toolbox will never be as quick as specialized solutions which are optimized for just one algorithm and for a specific learning problem. But with a good implementation of the classes, quite impressive performance can be reached.

• Easy to Use: After all, the Toolbox should be user friendly, and not just for RL experts. Thus the class system has to be intuitive.

2.1.2 Programming Issues

The Toolbox conforms to the following programming standards (if not explicitly mentioned otherwise):

• All class names begin with a ‘C’ prefix.

• All method names begin with lower case; each subsequent word within a method name begins with upper case.

• References are intentionally avoided; pointers are used instead.

• For standard input and output operations, the ANSI C functions printf and scanf have always been used. The use of cout, cin and other C++ streams has been deliberately avoided.

• All objects that are created by the Toolbox are deleted when they are not used any more. If an object is instantiated by the user, the Toolbox will never delete that object.

• Nearly all objects or data fields that are needed more than once and must be dynamically allocated are created just once globally for the class and are not deleted until after they are used for the final time. This is done very rigorously, especially if the data field is needed at each step of the learning trial.

• If a class uses a data array given from outside the class, it always creates a new copy and never stores just the pointer of the given data array. This is done for usability, so that we can also pass statically allocated arrays, which lose their scope later on, to objects.

Please follow these standards when extending the Toolbox.

2.1.3 Libraries and Utility classes used

The Torch library

The Torch library (www.torch.ch) is a C++ class framework for the most common supervised learning algorithms. We use the Torch library because of its good neural network support. With the Torch library we can create arbitrary feed-forward NNs with an arbitrary number of different layers. The layers can be interconnected as needed (but usually a standard feed-forward NN is used). For these neural networks the gradient with respect to the weights can be calculated, given a specific input and output. For the gradient calculation the Torch library uses the back propagation algorithm; the gradient is needed by many different algorithms to update the policy or the V-Function. We integrated the whole Torch gradient calculation into the Toolbox, so that every ‘Torch Gradient Machine’ can be used. For further details consult the online reference of the Torch library (www.torch.ch).

The Math Utilities

For convenience we also developed a small mathematical utility library for vector/matrix calculations. There are two main classes:


• CMyVector: The vector class represents an n-dimensional row or column vector containing real numbers. The main mathematical operations, such as multiplication with a scalar, the dot product, vector addition and matrix multiplication, have been implemented.

• CMyMatrix: The matrix class represents an n × m matrix of real numbers. Again the basic mathematical operations have been implemented (matrix multiplication, multiplication with a vector), but there are no complex operations like the calculation of the inverse.

We intentionally did not use any existing library, because this approach fits best into our class system and, moreover, only basic mathematical operations are needed. Many objects that can be represented as a vector (e.g. a state or a continuous-action value vector) are derived directly from the vector class, so mathematical calculations with these objects are very easy.

Debugging Tools

Since the Toolbox and reinforcement learning processes in general are very complex systems, we need a good debugging tool to recognize errors more easily. If the learning process does not show the expected results, it is usually very hard to distinguish bugs from incorrect parameter settings or the wrong use of an algorithm. So the learning process must be exactly trackable. But tracking the learning process is very time consuming, and since we usually do not need the whole learning process, rather only parts of it, we decided on the following system (a short usage sketch follows the list):

• The debug output is written to specified files via the call of DebugPrint.

• Debug outputs are always assigned to a specific symbol (e.g. ‘q’ stands for Q-Functions). We can assign an individual debug file for each symbol or a general debug file for all symbols.

• The debug output of a certain symbol is only written to a file if a file has been specified with DebugInit(symbol, filename). Otherwise the DebugPrint call is ignored.

• The symbol ‘+’ when using DebugInit is an abbreviation for all debug symbols, so it enables all debug outputs.

• Nearly all objects needed for learning have their individual debugging messages. However, the output is usually not totally intuitive; thus it is often necessary to look in the source code to see where a debug message has occurred.
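The following self-contained sketch re-implements this symbol-to-file mechanism to illustrate how it is meant to be used. The function names DebugInit and DebugPrint are taken from the description above, but their exact signatures in the Toolbox are assumptions; this mock only shows the intended usage pattern.

```cpp
// Standalone sketch of the symbol-based debug mechanism described above.
#include <cstdarg>
#include <cstdio>
#include <map>
#include <string>

static std::map<std::string, FILE*> debugFiles;   // one output file per debug symbol

void DebugInit(const char* symbol, const char* filename)
{
    debugFiles[symbol] = fopen(filename, "w");
}

void DebugPrint(const char* symbol, const char* format, ...)
{
    FILE* target = NULL;
    if (debugFiles.count(symbol))   target = debugFiles[symbol];
    else if (debugFiles.count("+")) target = debugFiles["+"];   // '+' enables all symbols
    if (target == NULL) return;                                 // no file given -> call is ignored

    va_list args;
    va_start(args, format);
    vfprintf(target, format, args);
    va_end(args);
}

int main()
{
    DebugInit("q", "qfunction_debug.txt");   // enable only the Q-Function symbol
    DebugPrint("q", "Q(s=%d, a=%d) updated to %f\n", 3, 1, 0.25);
    DebugPrint("e", "ignored, no file registered for symbol 'e'\n");
    return 0;
}
```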

2.2 Structure of the Learning system

To learn an optimal or good policy for an MDP we need the tuple < s_t, a_t, r_t, s_{t+1} >, no matter what algorithm we use. Thus we need a good and robust system which provides these values for our learning algorithms. The learning system is structured into three main parts. The agent (class CAgent) interacts with its environment. It has an internal state and can execute actions in its environment which affect this internal state. The second part of the learning system is the listeners (class CSemiMDPListener). The agent maintains a list of listener objects. At each step the agent informs the listeners about the current state, the executed action and the next state (so the listeners obtain the tuple < s_t, a_t, s_{t+1} >). The agent also informs the listeners when a new episode is started.

What the listener class does with this information is not determined at this point. The listeners can be used for different learning algorithms, but also for logging or parameter adaptation. Through this principle, many listeners can trace the training trials, so we can do logging and learning simultaneously. It is also possible to use more than one learning algorithm at a time, but we typically have to do off-policy learning when using several learning algorithms at once, so this is only partially recommendable.

The final main part is the controller of the agent. A controller tells the agent what action to execute in the current state. This controller commonly calculates the action with a Q-Function, but some algorithms use another representation of the policy. It would also be convenient to be able to use self-coded controllers to improve the learning performance. For that reason we must design a general interface for controllers which is decoupled from any learning algorithm, so that we can use arbitrary controllers with any learning algorithm as listener.

Figure 2.1: The structure of the learning system, with the agent, the environment, the agent listeners as interface for the learning algorithms and the agent controllers

2.2.1 The Listeners

Listeners all inherit from the class CSemiMDPListener. This class defines the interface used by the agent to send the step information < s_t, a_t, s_{t+1} > and the beginning of a new episode to the listeners. The interface basically consists of two functions which must be overridden by the subclasses.

• nextStep(CStateCollection*, CAction*, CStateCollection*): called by the agent to send the < s_t, a_t, s_{t+1} > tuple to the listener.

• newEpisode(): when this function is called, the agent indicates that a new episode has begun.

So, as already discussed, each listener gets the < s_t, a_t, s_{t+1} > tuple, but for learning we also need to know the reward output of this step.
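To make the listener interface concrete, here is a minimal sketch of what a custom listener might look like. The struct definitions are stand-ins for the Toolbox classes (CStateCollection, CAction, CSemiMDPListener) so that the snippet is self-contained; only the two virtual methods named above are modelled, and their exact signatures in the Toolbox may differ slightly.

```cpp
// Self-contained sketch of a listener that only counts steps and episodes.
#include <cstdio>

struct CStateCollection { /* holds the model state and all modified states */ };
struct CAction          { /* identified by its pointer within an action set */ };

class CSemiMDPListener
{
public:
    virtual ~CSemiMDPListener() {}
    virtual void nextStep(CStateCollection*, CAction*, CStateCollection*) {}
    virtual void newEpisode() {}
};

// Because this listener only uses the <s_t, a_t, s_t+1> tuple, it can run
// alongside any learning algorithm without interfering with it.
class CStepCounterListener : public CSemiMDPListener
{
public:
    int steps = 0, episodes = 0;

    void nextStep(CStateCollection*, CAction*, CStateCollection*) override { steps++; }
    void newEpisode() override
    {
        episodes++;
        printf("episode %d started after %d total steps\n", episodes, steps);
    }
};
```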

Reward Functions and Reward Listeners

A reward function has to return the reward for each step, so it implements the function r_t = r(s_t, a_t, s_{t+1}). Since we want to be as flexible as possible, we do not want to use just one reward function for one learning problem. We therefore decouple the reward function from the rest of the environment model. Reward functions are implemented by the interface CRewardFunction, where the function getReward must be implemented. There is also a class representing the reward function r_t = r(s_t), which depends only on the current state. This class is called CStateReward.

The information coming from a reward function is provided to the listeners by the class CSemiMDPRewardListener. This kind of listener gets a reward function object as an argument in the constructor and transfers the reward as additional information to the nextStep method. Thus we can define different reward functions for different listeners. The nextStep method has the following signature for reward listeners: nextStep(CStateCollection*, CAction*, double reward, CStateCollection*). Nearly all learning algorithms implement the CSemiMDPRewardListener interface.
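A small sketch of a reward function for a pendulum-like swing-up task follows. The text only tells us that a method getReward must be implemented, so the signature used here is an assumption, and the state and collection types are minimal stand-ins rather than the real Toolbox classes.

```cpp
// Sketch of a reward function that depends only on the new state (the CStateReward case).
#include <cmath>

struct CState           { double continuousState[2]; /* e.g. angle, angular velocity */ };
struct CStateCollection { CState modelState; };
struct CAction          {};

class CRewardFunction
{
public:
    virtual ~CRewardFunction() {}
    // r_t = r(s_t, a_t, s_t+1)
    virtual double getReward(CStateCollection* oldState, CAction* action,
                             CStateCollection* newState) = 0;
};

// The closer the pole angle is to the upright position, the higher the reward.
class CUprightReward : public CRewardFunction
{
public:
    double getReward(CStateCollection*, CAction*, CStateCollection* newState) override
    {
        double angle = newState->modelState.continuousState[0];
        return cos(angle);   // +1 when upright, -1 when hanging down
    }
};
```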

2.2.2 The Agent

As already mentioned, the agent is the acting object. The agent has a current internal state and can execute actions to change this state. At each step it stores the current state, executes the action and stores the next state. This information is then sent to all listeners. Since this approach should work for different environments, we must decouple the internal state representation and the state transitions from the agent class. Therefore we introduce environment models. An environment model stores the current state and implements the state transitions. The agent itself is independent from the learning problem and implemented in the class CAgent. The agent has a set of actions from which it can choose. Usually it follows a policy coming from a controller object (CAgentController, set by setController(CAgentController*)).

The agent provides functions for executing a single step or a given number of episodes with a specified maximum number of steps. The agent class also maintains a function for starting a new episode. With this function, the model is reset and the new episode event is sent to the listeners. Of course this is only possible for simulated tasks where we have the possibility of resetting the environment. In a robotic task, where the robot would have to be placed at a given starting position, this is not possible.

2.2.3 The Environment Models

The environment maintains the current internal state and describes the agent’s internal state transitions when executing an action. It also determines whether an episode has ended. In the Toolbox we distinguish state transitions which can be described exactly (e.g. all simulated tasks) from those which cannot (e.g. robotic tasks). We provide an individual interface for each of these two types of environment models:

The CEnvironmentModel class

The class CEnvironmentModel represents the agent’s environment model. It provides functions for fetching the current state into a state object, executing an action and determining whether the episode has ended. For this functionality the user has to implement the following methods:

• The doNextState(CPrimitiveAction*) function has to calculate the internal state transitions (or execute an action and measure the new state). To indicate that the model has to be reset after the current step (because the episode has ended) we must set the reset flag; to indicate that the episode failed, we set the failed flag.

• The getState(CState *state) function allows the agent to fetch the current state. The internal state variables have to be written into the state object. We will discuss the state model later.


• doResetModel(): Here we have to reset the internal model variables. For example, in simulated tasks we can set the internal state to an initial position; in a robot learning task we would have to wait until the robot has been moved to its initial position.

Figure 2.2: Interaction of the agent with the environment

The CTransitionFunction Class

This interface represents the transition function which is used for the internal state transitions. This class should be implemented instead of the CEnvironmentModel class if the transition function is known, which is generally true for all (self-coded) simulated tasks. The class has to implement the function s' = f(s, a). Additionally there are functions for retrieving an initial state for a new episode and determining whether the model should be reset in a given state. For this functionality we provide the following interface methods to the user:

• transitionFunction(CState *oldState, CAction *action, CState *newState, CActionData *actionData): Here the state transition s_{t+1} = f(s_t, a_t) must be implemented. The calculated new state has to be written into the specified newState object.

• getResetState(CState *resetState): Here we can specify our initial states for the episodes. These can be random states or a set of specified states. The initial state has to be written into the specified resetState object. A few initial state sampling methods (such as random initialization or initialization with zero) are already implemented; the kind of initial states we want to use is specified by the function setResetType.

• isFailedState(CState *state): returns whether the episode has failed in the given state.

• isResetState(CState *state): returns whether the model should be reset after visiting this state (similar to isFailedState).

This transition function, which is specified by the user, can now be used to create an environment model which maintains the current state of the agent and uses the transition function for the state transitions. It also resets the model according to the specified functions. This functionality is provided by the class CTransitionFunctionEnvironment.
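As an illustration, here is a self-contained sketch of a transition function for a very simple simulated task (a point mass on a line with two discrete actions). The types are stand-ins for the Toolbox classes; the method names follow the interface listed above, but the exact signatures are assumptions.

```cpp
// Sketch of s' = f(s, a) for a point mass accelerated left or right.
#include <cmath>
#include <cstdlib>

struct CState  { double pos = 0.0, vel = 0.0; };
struct CAction { int index = 0; };   // 0 = accelerate left, 1 = accelerate right

class CMyTransitionFunction
{
public:
    // s_t+1 = f(s_t, a_t), written into newState
    void transitionFunction(const CState* oldState, const CAction* action, CState* newState)
    {
        const double dt = 0.05, force = (action->index == 0) ? -1.0 : 1.0;
        newState->vel = oldState->vel + dt * force;
        newState->pos = oldState->pos + dt * newState->vel;
    }

    // random initial position, zero velocity
    void getResetState(CState* resetState)
    {
        resetState->pos = 2.0 * (rand() / (double)RAND_MAX) - 1.0;
        resetState->vel = 0.0;
    }

    // the episode ends when the mass leaves the track
    bool isFailedState(const CState* state) { return fabs(state->pos) > 2.0; }
    bool isResetState(const CState* state)  { return isFailedState(state); }
};
```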


Figure 2.3: Using an individual transition function for the Environment

2.2.4 The Action Model

As already discussed in the design issues section, the action model has to match a wide variety of applications. When only using a finite set of actions, the index of the action in the action set is typically the only information the action contains. But there are actions which take more than one step to execute; for that reason, we should also be able to store the number of steps needed. This number does not have to be fixed, but can depend on the current state. For a continuous action space, the action object must store the action values which have been used. Again, these actions can last for more than one step. There should also be the possibility of intermixing continuous actions with discrete ones (e.g. robot soccer: navigating and shooting). Other actions contain primitive actions, as in hierarchical learning, but we will discuss this kind of action later.

We decided on the following action model: discrete actions coming from an action set do not contain any information; the information is just the index of the action in an action set. An action object is always created only once, and the action object pointer serves as search criterion in an action set. This approach works well for actions which do not contain any changeable data, but it will not work, for example, for continuous actions. For this changeable data we provide a general interface for setting, obtaining and creating data objects. We introduce action data objects for decoupling the action from its changeable information. Each type of action with changeable data has a specific kind of action data object; multi-step actions store the duration in the action data object, while continuous actions store the action value vector. The current action data is stored in the action’s own action data object.

But this approach leads to another problem. What happens if an algorithm wants to change the action data (for example, set other continuous action values and determine another Q-Value)? If the action data object is changed, all other listeners will get this falsified action data object, because all listeners receive the same action object. So we would need to rely on all listeners to change the action data back to its original state after using it. This would not be a good approach; therefore we introduce additional action data parameters for all methods which receive action objects. This action data parameter always has higher priority than the action data object of the action itself. As a result, the action data passed as the parameter is always used. If the action data parameter is omitted, the action’s native action data object is used. A listener must not change the action data object of an action; instead it has to use its own action data objects. The data object of the action is only changed by the agent itself and always represents the action executed in the current step.

All actions provide a general interface for obtaining the action data object of the action (a null pointer is returned if no action data object is used) and creating a new action data object of the correct type. Additionally, all action data objects provide functions for setting and copying the action data object from another action data object, so it is easy to create new action data objects and use them if needed.

Figure 2.4: Action objects, Action sets and action data objects

The following action types with different action data objects are implemented in the Toolbox.

Discrete Actions

As already mentioned, all the information of a discrete action is contained in the action index, which is already represented by the action pointer and an action set. What else do we need to represent discrete actions? In some states a specific action may not be available, so it must be possible to restrict the action set for certain states. As a result, the action object has to provide a function which determines whether the action is available in a given state. An action can also last for more than one step (we only allow a fixed duration here). These functionalities are already implemented in the action base class CAction. The CAction class already defines the interface for the action data objects, but its individual action data object is always empty, because there is no data to store.

Multi-step Actions

For actions which do not have a fixed duration, we need another implementation, because this variable duration must be stored in an action data object. The action data object contains the following information:

• The number of steps the action has already executed.

• Whether the action has finished in the current step (usually used for hierarchical learning).

Multi-step actions are represented by the class CMultiStepAction. Whether an action has finished has to be decided in the current step (so it depends on < s_t, s_{t+1} >). This approach gives us two possibilities for using multi-step actions.

• The duration can be set by the environment model. This is useful in robotics, for example, where we do not know the exact duration of a specific action before execution. After execution, the duration can be measured and then stored in the multi-step action data object. In this case, the finished flag will always be true to indicate that in the next step another action can be chosen.


• The duration and the finished flag can be set by a hierarchical controller, which increments the duration at each step and decides whether it should continue executing the action. We will discuss this approach later in the hierarchical reinforcement learning section.

Continuous Actions

Continuous actions store an action value vector with one action value per control variable. In order to facilitate calculations with continuous action data objects, the data objects are derived directly from the CMyVector class.

2.2.5 The Agent Controllers

For controllers, we need an interface which is decoupled from the learning algorithms used, so that we can use any controller we want, no matter what listeners we use. This is accomplished with relative ease by introducing an individual controller object for the agent, which can be set by the user. The controller can choose from a given action set and has to return an action for a given state. This approach works well when using discrete actions which contain no changeable information; in this case we can return the action pointer and we are finished. But this approach does not work for our action data model. The controller is not allowed to change the action data object of the action itself (only the agent is allowed to), so how can we return changeable action data from a controller?

At this point we introduce action data sets. An action data set is a companion piece to an action set; for each action in an action set we store a new corresponding action data object in that set (provided that type of action has an action data object). Note that this can be action data of any kind, so we are able to mix the different action types. When the agent (or any other object) wants to retrieve an action from a controller, it always passes an additional action data set to the controller. The controller then chooses a specific action, modifies the action data object assigned to the chosen action and returns the pointer of the chosen action. In order to access the action data object, the agent must get it from its individual action data set. After retrieving the action from the controller, the agent changes the content of the action data of the current action to the action data calculated by the controller. The native action data objects can only be changed by the agent.
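A hand-coded controller of the kind mentioned in the design issues can be very small. The following sketch picks an action from a given action set based on the current state; all types and the method name getNextAction are stand-ins and assumptions, and the action data set that the Toolbox additionally passes to the controller is omitted here for brevity.

```cpp
// Sketch of a simple bang-bang controller that chooses between two discrete actions.
#include <vector>

struct CState  { double angle = 0.0, velocity = 0.0; };
struct CAction { int index; };
typedef std::vector<CAction*> CActionSet;

class CMyBangBangController
{
public:
    // assumes the action set contains at least two actions: push left, push right
    CAction* getNextAction(const CState& state, CActionSet& actions)
    {
        // push against the current angular velocity
        return (state.velocity > 0.0) ? actions[0] : actions[1];
    }
};
```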

Figure 2.5: The interaction of the agent with the controller


2.2.6 The State Model

Since the choice of state representation used for a learning problem is one of the most essential steps, we must design a very powerful state model. In order to avoid misunderstandings due to the different formulations, we use the following notations for the state model:

• state: everything that is a state object.

• state variable: a single state variable from a state object (so, for example, continuous state variable number one, which could be the x location of the agent).

• model state: the state object obtained from the environment model, thus the agent’s internal state.

• modified state: a state object that can be calculated from the model state. For instance, this can be a discretization of the continuous model state.

For general reinforcement learning tasks, we have an arbitrary number of continuous and discrete state variables. Our state model collects these state variables in one state object. A state is represented by the class CState and consists of an arbitrary number of continuous and discrete state variables.

The state properties object CStateProperties stores the number of discrete and continuous state variables a state object maintains. It also stores the discrete state sizes for the discrete state variables and the valid ranges for the continuous state variables. For continuous state variables we can additionally specify whether the variable is periodic or not (e.g. for angles). The state properties are created either by the environment model (where the user has to specify the exact properties) or, for modified states, by the state modifier. All state objects describing the same state maintain a pointer to the specified state properties object; they do not create their own copy.

The model state should contain all information about the agent’s internal state: usually a few continuous and discrete state variables. There is no need for the model state to contain any discretization of the continuous state variables, because this type of information is stored elsewhere.

Figure 2.6: State Objects and State Properties

The State Modifiers

Up to this point, we have defined a general representation for the agent’s internal state, but we usually cannot use this state directly for learning. In general we need to discretize the model state or calculate the activation factors of different RBF centers. All these new state representations can also be represented by our state class (for example, a discrete state is a CState object containing just one discrete state variable). So we need to define an interface which takes the model state and calculates a modified state representation from it. This is done by the class CStateModifier. A state modifier gets the model state (or even other modified states) and returns a modified state, which can be, for example, a state containing only one discrete state variable for the discretization. Thus every component with access to the model state can calculate a discretization of that state if needed. To maintain flexibility, we do not restrict the components to using just one specific state representation.

All modified state representations have their individual state properties, therefore we derive the class from the state properties class so that the state modifiers can be used to create new state objects. The state modifiers have to implement the interface function getModifiedState(CStateCollection *originalStates, CState *modifiedState), where the modified state is calculated and stored in the corresponding modifiedState object.
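A small sketch of a state modifier follows: it replaces an angle by its sine and cosine, a common trick for periodic state variables. The method name getModifiedState comes from the text; the stand-in state and collection types and everything else are assumptions, kept minimal so the snippet is self-contained.

```cpp
// Sketch of a state modifier that maps an angle to (sin, cos).
#include <cmath>

struct CState
{
    double continuous[2];   // stand-in: a small fixed-size continuous state
};

struct CStateCollection
{
    CState modelState;      // in the Toolbox the collection holds many states
};

class CSinCosModifier
{
public:
    // reads the angle from the model state and writes the modified state
    void getModifiedState(CStateCollection* originalStates, CState* modifiedState)
    {
        double angle = originalStates->modelState.continuous[0];
        modifiedState->continuous[0] = sin(angle);
        modifiedState->continuous[1] = cos(angle);
    }
};
```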

The State Collections

It is likely that one and the same modified state is used by more than one component (for example, the different V-Functions of a Q-Function usually use the same state representation). In order to avoid redundant calculations of modified states we introduce state collections (class CStateCollection). State collections maintain a collection of different state objects: the model state and each modified state needed by the whole learning system. We then pass state collections instead of state objects to our learning components. Whenever a component needs to access a specific state, it retrieves the specified state from the collection. A modified state is only calculated the first time it is needed for the current model state and is subsequently stored in the state collection. The state modifier gets a state collection as input, so it can also use other modified states for its calculation. The modified state is also marked as valid, so the state collection knows that the modified state does not have to be calculated repeatedly.

Figure 2.7: Calculating modified states and storing them in state collections

Our state collections use the state properties pointer as the index for a state object. Thus, in order to retrieve a specific state, we only need to pass the state properties object of the desired state. If no state properties object is specified, the collection will always return the model state.


2.2.7 Logging the Training Trials

For the Toolbox we also need tools for logging the training trials in order to trace the learned policy or to reuse the stored trials for learning. We must store the states and the actions, as well as the reward values. For the states it would be convenient if we could log more than one state of the state collection, for example if we wanted to store the features from an RBF network too, because calculating these features can be quite time consuming.

In a few areas it is very difficult to gather the learning data (e.g. in robotics), so it is useful to have a tool that is able to log entire training trials and then use these trials to learn again with other parameters for the algorithm, or even with another learning algorithm. Of course, the stored episodes can only be used for off-policy learning, that is, a different policy is learned than the one that was followed. Off-policy learning often leads to worse performance, but it can be used as a kind of prior knowledge before starting the real learning. Due to our listener design, creating logging tools is very easy, because the loggers can be implemented as agent listeners. A listing of the most important classes for logging follows below.

State Lists

For storing a whole episode in memory we need a list of states. Creating a new state object for each step and placing it in a list would be possible, but it is rather slow, because a state object must be allocated dynamically each time. Thus we decided to design an individual state list class (CStateList) which maintains a vector for each state variable (double vectors for continuous state variables, integer vectors for discrete variables). Since we use STL (Standard Template Library) vectors, the vectors are dynamically enlarged as needed. The class provides functions for placing a state at the end of the list or retrieving the state at a given index (an already existing state object is passed as a buffer). The class also supports saving/loading a state list to/from disk. The output format is patterned on the structure of the state list class: the vector for each state variable is stored separately in a new line. So if we look at the output file, we see the state transitions for each state variable separately.

State Collection Lists

The class CStateCollectionList stores a list of state collections. Therefore, this class contains a set of state lists; we can choose which states from the state collection we want to store.

It is therefore possible to store not only the model states of a learning trial, but also other states such as the calculated RBF features (which are rather time consuming to compute for big RBF networks). Storing RBF features, on the other hand, requires quite a lot of memory, but that should not be a problem nowadays.

Action Lists

As already discussed, an action consists mainly of its action pointer and the action data object. Storing the action pointer does not make sense, so we store only the index of the action in a given action set. We also have to store the action data object of the action. For the action data we store copies of the current action data object, so these have to be dynamically allocated.

In the output format we can see the sequence of action indices. Each index is followed by the action data of the action (if there is any).


The Episodes

The episode objects (CEpisode) can store one episode in memory. They are already designed as listeners. Since only one episode can be stored, the episode object dismisses all stored data once a new episode begins. The class maintains a state collection list and an action list, so we can specify which states we want to store. In an episode there are obviously numSteps + 1 states and numSteps actions to store. The episode objects can already be used to store the current episode to a file. First the state collection list is stored, and then the action list is written to the file.

Logging the entire learning process

The agent logger (CAgentLogger) is able to store more than one episode. It also implements the CSemiMDPListener interface in order to get the data from the agent. The agent logger maintains a list of episode objects, so you can retrieve whole episodes from the logger. From the episodes you can again retrieve the single states. The number of episodes the logger should hold in memory can be set. When storing the whole learning trial to a file, the output function of the episode objects is used; the same applies when loading from a file.

The Episode Output class

The output format of the agent logger class is easy for the computer to understand when reloading it, but since it is not readable for humans, we additionally created an individual class for better readability. For us it is more practical to read the logged learning trials as a sequence of < s_t, a_t, r_t, s_{t+1} > tuples. The class CEpisodeOutput provides this functionality. It does not hold anything in memory, but writes the step tuple directly to the file. For this class we can only specify one single state from the state collection.

There is also a second class (CEpisodeOutputStateChanged) which does the same, but only when the specified state changes. This is useful if we want to trace discrete states, which do not change very often.

Using stored episodes for learning

We already mentioned that we would like a tool which enables us to learn from a stored learning trial. In order to provide this functionality we create the interface CEpisodeHistory. This interface represents a set of stored episodes.

For presenting the stored episodes to the listeners, we need an environment model which goes through the stored states of an agent logger and a controller class which goes through the stored actions. Both the controller class and the environment class are implemented by the class CStoredEpisodeModel. If the episodes of the agent logger contain more than the model state, the class copies the additional states into the state collection of the agent logger and marks them as valid. Therefore, stored modified states can be reused and we do not need to calculate them again.

Another method of using previous episodes is performing batch updates as mentioned in [49]. When performing batch updates, after each episode we show one or more previous episodes to the learning algorithm again. This can improve learning for certain kinds of algorithms (e.g. Q-Learning), but it also falsifies the state transition distributions, so we have to be careful with model-based algorithms. Batch updates are represented by the class CBatchEpisodeUpdate, which presents a specified number of stored episodes to a specified listener after each new episode. The batch update class also defines its own public functions for presenting a specific episode, N random episodes or all episodes to the listener. Only the episodes the agent logger is currently holding in memory can be used, which gives us the ability to use only the newest K episodes for the batch updates.

Using the stored steps for learning

It is also possible to use the stored single-step information for learning from past episodes. Here we send randomly chosen (and thereby temporally unrelated) steps to the listeners. Since the steps are temporally unrelated, we can see each step as an individual episode.

Again we introduce an individual interface, the CStepHistory interface, which represents a (time independent) set of steps. This interface also implements functions for presenting N randomly chosen steps or all steps to a given listener.

Similar to batch updates, we may use past step information during the learning process. We can perform the previous step updates as long as there is time (until the next step begins). For batch step updates we create the class CBatchStepUpdate. We can specify the number of steps from the history simulated for a listener after each real step and after each episode. The steps are randomly chosen.

It is not clear whether the episode or the step update is better; this depends on the problem. Both approaches can only be used for Q-Value based algorithms. The advantage of the episode update is that algorithms with e-traces can be used; the advantage of the step update is that the steps can be chosen randomly. Through intermixing the step information, the algorithm might discover a better action selection strategy.

The idea of doing the step updates is also strongly connected with combining planning and learning approaches like the Dyna-Q algorithm [49]. We will discuss this algorithm in the next chapter.

2.2.8 Parameter representation

In our design we also need a general interface for the algorithms’ parameters. We decided on the following concept. Each parameter is represented as a < string, double > pair, where the string represents the name and the double value naturally determines the parameter’s value. The parameters of an object are stored in a string map. The class CParameters provides functions for adding a new parameter and for getting and setting a parameter’s value. All objects that maintain some sort of parameters implement this interface, thus a parameter’s value is always retrieved or changed in the same way.
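The parameter map idea can be illustrated with a few lines. The class and method names below (addParameter, setParameter, getParameter) mirror the description above, but their exact Toolbox signatures are assumptions; this is a standalone sketch, not the library code.

```cpp
// Minimal standalone illustration of the <string, double> parameter map.
#include <map>
#include <string>

class CParameters
{
    std::map<std::string, double> parameterMap;   // parameter name -> value
public:
    void addParameter(const std::string& name, double value) { parameterMap[name] = value; }
    void setParameter(const std::string& name, double value) { parameterMap[name] = value; }
    double getParameter(const std::string& name) const { return parameterMap.at(name); }
};

// Usage: every learner, controller, etc. exposes its parameters the same way.
// CParameters params;
// params.addParameter("Learning Rate", 0.2);
// params.setParameter("Learning Rate", 0.05);
```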

Figure 2.8: The adaptable parameter representation of the Toolbox

Parameters of different object hierarchies

An object A is on a lower hierarchical level than another object B if A is referenced by B. We will call A a child object of B (note that this has a different meaning than child classes, where the class information is inherited; in this case child objects are only referenced). Often objects on a low hierarchical level contain additional parameters. In general, it is complicated to retrieve these objects, so we decided on a more complex parameter representation. Each object referencing other objects with parameters takes on the parameters of its children. Thus all parameters from objects on a lower hierarchical level are added to the parameter map of the parent object. All objects also maintain a list of their child objects, and every time a parameter is changed at the parent object, the corresponding parameter of the child object(s) is adapted too. Thus we can change a parameter directly at the learning algorithm, even if it is actually a parameter of a child object, for instance the e-traces of a learning algorithm.

All listeners, all agent controllers and all classes which represent learned data are subclasses of the CParameterObject class, so parameter handling is generalized for all these objects.

Adaptive parameter calculation

With our design it is also easy to add the ability to adapt parameter values dynamically. In the area of RL, this approach can be used for many different parameters, such as the learning rate or an exploration rate. In the normal case, the parameter’s value depends on one of the following quantities:

• the number of steps

• the number of episodes

• the average reward of the last N steps

• the estimated future discounted reward (coming from a value function)

For each of these quantities we provide a corresponding adaptive parameter calculator class (CAdaptiveParameterFromNStepsCalculator, CAdaptiveParameterFromNEpisodesCalculator, CAdaptiveParameterFromAverageRewardCalculator, CAdaptiveParameterFromValueCalculator). All of these are subclasses of CAdaptiveParameterCalculator and can therefore be assigned to a parameter of a parameter object. If an adaptive parameter calculator has been assigned to a certain parameter, the value coming from the calculator is used instead of the constant value from the map. These adaptive parameter classes also provide a large degree of freedom for calculating the parameter’s value. We can set different offsets and scales for both the target value (number of steps, episodes, average reward, ...) and the parameter value. For the target-value/parameter-value mapping we can choose from different functions such as a linear, square or logarithmic function.

2.2.9 A general interface for testing the learning performance

For this thesis, we need to test the performance of specific algorithms with specific parameter settings. We will call an algorithm with a specific parameter setting a ‘test suite’. Test suites are particularly important for the benchmark tests of this thesis. With an interface for evaluating a test suite, we could also write tools for finding good parameter settings of an algorithm automatically. In order to design such tools we need the following preliminaries.

An interface for learned data

Every RL algorithm needs to store the learned data in a specific kind of representation (e.g. a Q-Function or directly the policy). Due to the wide range of algorithms, these learned data can be stored in many different representations. Nevertheless, there are functionalities which are needed for all these representations, such as resetting, storing or loading the learned data. Therefore we create the interface CLearnedDataObject, which provides abstract functions for these functionalities. All classes which maintain some kind of learned data implement this interface.

Policy Evaluation

How do we decide whether a policy is good or bad? We can estimate the future discounted reward for a certain number of states or the average reward during a certain number of episodes from real or simulated experience. These methods are also called Monte Carlo methods. We provide the classes CValueCalculator and CAverageRewardCalculator. Both classes are subclasses of CPolicyEvaluator and can thus be used in the same way. For both classes we can set the number of episodes used for evaluation. The initial states of the episodes are sampled as usual (determined by the environment model). If the initial states are sampled randomly, we need a large number of episodes, in particular for large initial state spaces, in order to get a reliable result. We also make it possible to use the same set of initial states for each evaluation, with the classes CSameStateValueCalculator and CSameStateAverageRewardCalculator. Always using the same states does not make the result more reliable, but since there is less variance between the results of different policy evaluation trials, this method is better suited for tracing the learning process.
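The idea behind these Monte Carlo evaluators can be sketched in a few lines. The following standalone function is a conceptual illustration, not the interface of CAverageRewardCalculator: the policy step, the environment reset and the termination test are passed in as callables.

```cpp
// Conceptual sketch of Monte Carlo policy evaluation by the average reward.
#include <functional>

double evaluateAverageReward(std::function<void()>   resetEnvironment,
                             std::function<double()> doOneStep,     // executes one policy step, returns the reward
                             std::function<bool()>   episodeEnded,
                             int numEpisodes, int maxSteps)
{
    double totalReward = 0.0;
    int totalSteps = 0;
    for (int e = 0; e < numEpisodes; e++)
    {
        resetEnvironment();                         // sample a new initial state
        for (int step = 0; step < maxSteps && !episodeEnded(); step++)
        {
            totalReward += doOneStep();
            totalSteps++;
        }
    }
    return totalSteps > 0 ? totalReward / totalSteps : 0.0;
}
```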

Test Suites

As already mentioned, we refer to a test suite as a specific algorithm-parameter setting. We require that the test suites can be evaluated with a scalar value, i.e. we want a scalar value indicating how good the learning performance of this test suite is. We also want to be able to change some of the algorithm’s parameters and then evaluate the test suite once more.

In our approach a test suite consists of one or more listeners (representing the learning algorithm) as well as one or more learned data objects (representing the Q-Functions, V-Functions or policies). The test suite class (CListenerTestSuite) maintains a list of both object categories, the listeners and the learned data objects. The class also has access to the agent, the controller used during learning and the controller used for evaluation. The user has the opportunity to employ different controllers for evaluation and learning. For example, the learning controller can use exploration steps, which are not desirable for the evaluation process.

The test suite class already provides the functionality needed for learning a given number of steps and episodes (therefore the agent is needed).

Evaluating test suites

There are many ways to evaluate a test suite. For example, we can measure the average reward or some other quantity during the learning process. This gives us a good estimate of the algorithm’s performance. But it is also possible to count the number of episodes the algorithm needs to achieve the goal of the learning task several times. We created the interface CTestSuiteEvaluator as an interface for all these kinds of test suite evaluation. We only implemented the first approach.

In our test suite evaluation approach we begin by learning for a given number of steps and episodes. Then the learners are disabled (removed from the agent’s list) and the test suite’s evaluation policy is evaluated with a given policy evaluator. This value is stored and then learning is resumed. This is not a very fast approach, since quite a bit of simulation time is spent on policy evaluation, but it is more reliable because an individual evaluation policy can be used without the falsifying effect of exploration. The average of the values evaluated during the learning trial is used as the result. But evaluating the learning process for just one learning trial is not very reliable, so it can be repeated several times.

The policy evaluation values obtained during the learning process are also stored in a database-like file format. If a test suite with a specific parameter setting has already been evaluated, the stored values are reused. They are also used for the creation of the diagrams.

Searching for good parameter settings

The Toolbox provides tools for searching the parameter space of one specified parameter. We may simply evaluate specific, given parameter values and return the best one. Or we can specify a starting point, the number of search iterations and the search interval, and the Toolbox will try to find a good parameter setting. This is done by the class CParameterCalculator. When searching within the specified interval, the class begins with the starting point and then evaluates the policy at double and at half of the starting parameter’s value. It continues with this process until it finds a specific maximum (more than 25% better than the worst result), or until it leaves the given interval. After this procedure it tries to locate the maximum value more accurately if there are any iterations left.
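The coarse phase of this search can be sketched as follows. This is one possible interpretation of the doubling/halving strategy described above, not the implementation of CParameterCalculator; evaluate() stands for one full test suite evaluation, and the 25% criterion as written assumes positive evaluation results.

```cpp
// Conceptual sketch of the coarse doubling/halving parameter search.
#include <functional>
#include <map>

double coarseParameterSearch(std::function<double(double)> evaluate,
                             double start, double minValue, double maxValue,
                             int iterations)
{
    std::map<double, double> results;             // parameter value -> evaluation result
    results[start] = evaluate(start);
    double value = start;

    for (int i = 0; i < iterations && value >= minValue && value <= maxValue; ++i)
    {
        // evaluate the double and the half of the current value (cache repeated values)
        double candidates[2] = { value * 2.0, value * 0.5 };
        for (int c = 0; c < 2; ++c)
            if (results.count(candidates[c]) == 0)
                results[candidates[c]] = evaluate(candidates[c]);

        // track the best and the worst result seen so far
        double best = results.begin()->second, worst = best;
        for (std::map<double, double>::const_iterator it = results.begin(); it != results.end(); ++it)
        {
            if (it->second > best)  best  = it->second;
            if (it->second < worst) worst = it->second;
        }
        if (best > 1.25 * worst) break;           // a clear maximum has been found

        // continue in the direction of the better neighbour
        value = (results[value * 2.0] > results[value * 0.5]) ? value * 2.0 : value * 0.5;
    }

    // return the parameter value with the best evaluation result
    double bestValue = results.begin()->first;
    for (std::map<double, double>::const_iterator it = results.begin(); it != results.end(); ++it)
        if (it->second > results[bestValue]) bestValue = it->first;
    return bestValue;
}
```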


Chapter 3

State Representations in RL

In RL there are three common state representations which are used for learning:

• Discrete States: Discrete states identify the current state of the agent with just one discrete state number. This state number is then used for look-up tables.

• Feature States: These states are used for linear function approximators. A feature state consists of n features, each having an activation factor typically between [0, 1].

• Other function approximators (like feed-forward neural networks): States for other function approximators usually have no requirements; they can consist of any number of discrete and continuous state variables.

All of the discussed state representations can be used with most learning algorithms, which use these states as input for their Q (or V) Functions or directly for their learned policy. We will discuss each of these state models, as well as the function representations that can be used with these states.

3.1 Discrete State Representations

3.1.1 Discretization of continuous Problems

For continuous learning tasks, a discrete state representation can be problematic. The continuous MDP can lose its Markov property if the state discretization is too coarse. As a consequence, there are states which are not distinguishable by the agent, but which have quite different effects on the agent’s future. It also follows that the probabilities P(s'|s, a) change for different policies (since we lost the Markov property). Nevertheless, if this effect is not too dramatic, most of the algorithms can cope with it. There are some successful examples of using a discrete state representation for continuous state problems, such as Sutton’s actor-critic cart-pole balancing task ([49], p. 183). These approaches obviously cannot calculate the optimal policy, but worked sufficiently well, at least for such easy cases. In general, it is advisable to use discrete state representations only for discrete problems.

3.1.2 State Discretization in the RL Toolbox

For calculating a discrete state representation of the model state, we introduce the class CAbstractStateDiscretizer. The user can implement any state discretization he wants by deriving from this class and overriding the getDiscreteStateNumber method, which returns a discrete state number given the current model state. But because calculating this state discretization is normally tedious, we provide tools to simplify the process. When dealing with discrete state variables, it is necessary to consider the following scenarios:

• We have one or more continuous state variables and want to discretize them.

• We have two or more discrete state variables within the same state object and we want to combine them.

• We have two or more discrete state objects and want to combine them.

• We want to add a more precise state representation only for one or more discrete state numbers. For example, the discrete state object A contains useful information only if the discrete state object B is in state X; otherwise we can neglect the information in A. Combining A and B would give us a state size of |A| · |B| states, but by substituting the state object A for the state X of state object B (instead of being in state X, we can now be in one of the states of A) we get a state size of |A| + |B| − 1. We will call this approach a state substitution.

We designed classes which support all these scenarios, so that a discrete state representation can be built easily. All the classes mentioned below are subclasses of CAbstractStateDiscretizer, so they fit into our state modifier model.
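For completeness, here is a sketch of the hand-written route mentioned above: deriving a discretizer and overriding getDiscreteStateNumber. The state type is a stand-in and the 5 × 5 grid-world layout is just an example.

```cpp
// Sketch of a hand-written discretizer in the spirit of CAbstractStateDiscretizer.
struct CState { double x = 0.0, y = 0.0; };   // continuous position in [0, 1)^2

class CGridDiscretizer
{
    int cellsPerDim;
public:
    explicit CGridDiscretizer(int cells) : cellsPerDim(cells) {}

    // maps the model state to a single discrete state number
    int getDiscreteStateNumber(const CState& state) const
    {
        int col = (int)(state.x * cellsPerDim);   // which column the agent is in
        int row = (int)(state.y * cellsPerDim);   // which row
        return row * cellsPerDim + col;           // one number in [0, cells^2 - 1]
    }
};

// CGridDiscretizer grid(5);  ->  25 discrete states for a 5 x 5 grid world
```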

3.1.3 Discretizing continuous state variables

In our approach we can only discretize a single continuous state variable at a time, so we get a discrete state object for each continuous state variable. These objects can then be combined later on. The class CSingleStateDiscretizer implements this approach. We can specify an arbitrary partition array and the continuous state variable to be discretized.

Figure 3.1: Discretizing a single continuous state variable

3.1.4 Combining discrete state variables

In a few model states we have more than one discrete state variable, e.g. the x and y coordinates in a grid world. The class CModelStateDiscretizer combines these state variables into one discrete state object. We can also specify which discrete state variables we want to use. The new discrete state size is obviously the product of all initial discrete state sizes.


3.1.5 Combining discrete state objects

If we have more than one discrete state object (e.g. if we discretized several continuous variables), we can combine them into one discrete state object with the class CDiscreteStateOperatorAnd. We can use an arbitrary number of discrete state objects for the ‘and’ operator, and the new discrete state size is again the product of all discrete state sizes.

Figure 3.2: Combining several discrete state objects with the and operator

3.1.6 State substitutions

The circumstances in which state substitutions are needed have already been explained. It would be advantageous to be able to use such state substitutions with each discretizer. We therefore add this functionality to the abstract discretizer class. With the function addStateSubstitution we can substitute a discrete state representation coming from another discretizer object for a given discrete state number.

Figure 3.3: Substituting state object B for state a5 of state object A. Two state scenarios are sketched in green and yellow. In the green case, state object A is in state a5, so the state b3 from state object B is used. In the yellow case, state object A is in state a2, so the information from B is neglected.


With these functionalities we have covered the most common use cases, so a discrete state representation can be defined easily with these classes.

3.2 Linear Feature States

These states are used for linear function approximators. Linear function approximators are very popular because they generalize better than discrete states and are also easy to learn, at least when using local features. A feature state consists of N features, each having an activation factor between [0, 1]. Linear approximators calculate their function value with

f(x) = \sum_{i=1}^{N} \phi_i(x) \cdot w_i \qquad (3.1)

where φ_i(x) is the activation function and w_i is the weight of feature i. Note that the discrete state representation is a special case of a linear function approximator, where exactly one feature has the activation factor 1.0 and all others the activation factor 0.0. It is therefore possible to treat linear feature states and discrete states in the same way, because we can divide the feature states into several discrete states with different weightings. If a feature has only local influence, such as in RBF networks, adapting the weights of the approximator changes the function value only within a neighborhood. That is one reason why learning with these linear function approximators yields a much better performance than learning with feed-forward NNs. Another reason is, of course, that it is a linear function, which is typically easy to learn. The drawback of local features is that, as for discrete states, the number of features grows exponentially with the number of dimensions of the model state space. So we do not get rid of the curse of dimensionality by using local linear features; rather, only the generalization ability improves in comparison to discrete states. There are many ways to calculate the feature factors. We will discuss three common approaches: tile coding, RBF networks and linear interpolation.
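Because only a few features are active for local feature representations, equation (3.1) is typically evaluated as a sparse sum. The following sketch shows this directly; the data structure is an illustration, not the Toolbox's feature state class.

```cpp
// Sketch of evaluating f(x) = sum_i phi_i(x) * w_i over the active features only.
#include <vector>

struct FeatureState
{
    std::vector<int>    activeIndices;   // indices of the active features
    std::vector<double> activeFactors;   // their activation factors phi_i(x)
};

double evaluateLinearApproximator(const FeatureState& features,
                                  const std::vector<double>& weights)
{
    double value = 0.0;
    for (size_t k = 0; k < features.activeIndices.size(); ++k)
        value += features.activeFactors[k] * weights[features.activeIndices[k]];
    return value;
}
```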

3.2.1 Tile coding

In tile coding, the features are grouped into exhaustive partitions of the input state space. Each partition is called a tiling, and each element of a partition is called a tile. There is always just one feature active per tiling, but we can use several tilings simultaneously, so that the number of active features equals the number of tilings used. The shape of the partitions is not specified, but usually grid-based partitions of the state space are used. We can combine partitions with different sizes, offsets or even partitions over different input space variables.

3.2.2 Linear interpolation

For linear interpolation, each feature has a center position. In each dimension just two features are active, namely those that are nearest to the current state. All other features have an activation factor of zero. The feature factors are scaled linearly with the distance to the feature centers. Thus we get 2^d active features, where d is the number of input dimensions, and the factors are calculated by

\phi_i(x) = \prod_{j=1}^{d} \left( \mathrm{dist}_j + |x_j - \mathrm{Pos}(\phi_i)_j| \right) \qquad (3.2)

where dist_j is the distance between the two adjacent features in dimension j, and Pos(φ_i)_j is the j-th dimension of the position vector of the i-th feature.


3.2.3 RBF-Networks

Here we use RBF functions with fixed centers and sigmas, so we just have to learn the linear scale factors of the RBF functions. The RBF function is given by

\phi_i(x) = \exp\left( -\frac{1}{2} (x - \mu_i) \Sigma_i^{-1} (x - \mu_i)^T \right) \qquad (3.3)

A fixed uniform grid of centers is typically used for RBF functions and linear interpolators. A more sophisticated distribution of the centers is often useful and also necessary, but such a distribution is hard to find, as it is for a good discrete state representation.
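For a grid of fixed centers, the covariance matrix in equation (3.3) is usually diagonal (one sigma per state dimension), in which case the activation of a single RBF feature reduces to the following sketch.

```cpp
// Sketch of equation (3.3) for one RBF feature with a diagonal covariance matrix.
#include <cmath>
#include <vector>

double rbfActivation(const std::vector<double>& x,
                     const std::vector<double>& center,
                     const std::vector<double>& sigma)
{
    double exponent = 0.0;
    for (size_t j = 0; j < x.size(); ++j)
    {
        double d = (x[j] - center[j]) / sigma[j];
        exponent += d * d;            // (x - mu) Sigma^-1 (x - mu)^T for the diagonal case
    }
    return exp(-0.5 * exponent);
}
```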

3.2.4 Linear features in the RL Toolbox

The linear feature factors do not depend on the weights of the approximator, so they can easily be represented in our state model. The linear feature factors are always calculated by a state modifier and stored in a state object. So, if the feature state is needed more than once for a state, no redundant recalculation is needed. All feature states are created by subclasses of the interface CFeatureCalculator, which in turn is a subclass of the CStateModifier class. This class receives the number of features and the maximum number of active features as input; with this information the state properties are initialized correctly. Additionally, we want to combine different feature states easily; therefore we provide two feature operator classes.
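To illustrate the role of such a feature calculator, the following simplified C++ sketch shows the pattern of an abstract calculator that maps a continuous model state to a sparse feature state. The class and method names are stand-ins for this example and do not reproduce the actual CFeatureCalculator declaration.

#include <vector>

// Simplified stand-ins for the Toolbox interfaces (names and signatures are assumptions).
struct FeatureState {
    std::vector<int>    indices;  // active feature indices
    std::vector<double> factors;  // corresponding activation factors
};

class FeatureCalculatorSketch {
public:
    FeatureCalculatorSketch(int numFeatures, int numActiveFeatures)
        : numFeatures_(numFeatures), numActiveFeatures_(numActiveFeatures) {}
    virtual ~FeatureCalculatorSketch() {}

    // Maps a (normalized) continuous model state to a sparse feature state.
    virtual void getFeatures(const std::vector<double>& modelState,
                             FeatureState& featureState) const = 0;

    int getNumFeatures() const       { return numFeatures_; }
    int getNumActiveFeatures() const { return numActiveFeatures_; }

protected:
    int numFeatures_;
    int numActiveFeatures_;
};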

The Or Operator

The or operator provides the possibility of using different, independent feature states simultaneously. We can use an arbitrary number of feature calculators for the or operator; all the active features from the different modifiers are then simultaneously active in the same feature state. The feature state size is the sum of all sub-feature state sizes, and the number of active features is the sum of all numbers of active features. All feature factors are normalized after this calculation so that the sum of all factors is 1.0. The feature operator or is used to combine two or more (for the most part) independent feature states describing the same continuous state space. Examples of this include tilings or RBF networks with different offsets/resolutions, which increase the accuracy or the generalization properties of the linear state representation.

The And Operator

The ‘and’ operator allows us to use different feature states that describe different, dependent continuous state variables simultaneously. This class works much like the discrete ‘and’ operator class, in that it calculates a new unique feature index for the tuple < f_i, f_j, f_k, ..., f_n > of active features, where each feature comes from another feature calculator. The new activation factor of the feature is the product of all feature factors \phi_i(x) \cdot \phi_j(x) \cdot \phi_k(x) \cdot ... \cdot \phi_n(x). The feature state size of the operator is the product of all feature state sizes, and the number of active states is the product of all numbers of active sub-features. The and feature operator is used to combine two or more dependent states, for example if we use features coming from single continuous state variables.
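The index and factor computation of the 'and' operator can be illustrated with a small C++ sketch; the row-major index layout used here is an assumption of the example, not necessarily the layout used by the Toolbox.

#include <vector>
#include <cstdio>

struct Feature { int index; double factor; };

// Combine the active features of two calculators the way an 'and' operator does:
// each pair (i, j) is mapped to a unique index and its factor is the product.
std::vector<Feature> andCombine(const std::vector<Feature>& a, int /*size1*/,
                                const std::vector<Feature>& b, int size2)
{
    std::vector<Feature> combined;
    for (const Feature& fa : a)
        for (const Feature& fb : b)
            combined.push_back({ fa.index * size2 + fb.index, fa.factor * fb.factor });
    return combined;
}

int main()
{
    // Calculator 1: 7 features, two active; Calculator 2: 6 features, two active.
    std::vector<Feature> a = { {2, 0.6}, {3, 0.4} };
    std::vector<Feature> b = { {1, 0.7}, {2, 0.3} };
    for (const Feature& f : andCombine(a, 7, b, 6))
        std::printf("feature %d, factor %.2f\n", f.index, f.factor);
    // 7 * 6 = 42 combined features in total, 2 * 2 = 4 of them active.
    return 0;
}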


3.2.5 Laying uniform grids over the state space

In many cases we want to lay a grid over the state space, because we do not have enough knowledge or time to specify the distribution of the feature centers in a more sophisticated way. The grid can represent tilings, RBF centers, or linear interpolation centers (or any other functions). The base class for all these different grids is CGridFeatureCalculator. This class contains functions to specify the grid, to calculate the position of a feature (the exact position of a feature is always the middle of a tile) and to determine the active tiling of the grid (the tiling which contains the current state). We can specify which dimensions we want to use, how many partitions to use per dimension and an offset for each dimension. Additionally, a scaling factor (1.0 is the default value) can be defined for each dimension; in combination with the offset we can lay the grid just over specified intervals of the state variable. The grid-based classes always use the normalized interval [0,1] for the continuous state variables. We have to consider this when we specify the offsets.
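The following C++ sketch shows the kind of tile-index computation such a grid class has to perform for the normalized state variables; the clamping at the grid borders and the row-major index layout are assumptions of this example, not the Toolbox's actual implementation.

#include <vector>
#include <cmath>
#include <cstdio>

// Computes the tile index of a state in a uniform grid laid over the normalized
// state space [0,1]^d, taking per-dimension partitions, offsets and scaling into account.
int getActiveTile(const std::vector<double>& normalizedState,
                  const std::vector<int>&    partitions,   // tiles per dimension
                  const std::vector<double>& offsets,
                  const std::vector<double>& scales)
{
    int tileIndex = 0;
    for (size_t d = 0; d < normalizedState.size(); ++d) {
        double x = (normalizedState[d] - offsets[d]) * scales[d];
        int cell = static_cast<int>(std::floor(x * partitions[d]));
        if (cell < 0) cell = 0;                           // clamp to the grid borders
        if (cell >= partitions[d]) cell = partitions[d] - 1;
        tileIndex = tileIndex * partitions[d] + cell;     // row-major layout
    }
    return tileIndex;
}

int main()
{
    // 2-dimensional state, 8 x 4 grid, no offset, default scaling 1.0.
    int tile = getActiveTile({0.55, 0.20}, {8, 4}, {0.0, 0.0}, {1.0, 1.0});
    std::printf("active tile: %d\n", tile);   // cell (4, 0) -> index 16
    return 0;
}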

Tilings

Tilings are represented by the class CTilingFeatureCalculator. This subclass of CGridFeatureCalculator always returns the active tiling number with activation factor 1.0, so there is always just one active feature. We can use the or operator to combine several tilings. So, actually, one individual tiling can be seen as a discrete state representation.

Figure 3.4: The use of more than one tiling with the or operator

Grids with more than one active feature

These grids are represented by the class CLinearMultiFeatureCalculator, which is a subclass of our grid base class. We can also specify the number of active features for each dimension. The n_i features nearest to the current state are always considered active in dimension i. For each feature that is in the 'active' area, the feature factor is calculated by the interface function getFeatureFactor, which receives the position of the feature and the current state vector as input. This function is implemented by the subclasses. After calculating the feature factors, the active feature factors are normalized again.


RBF-Networks

For the RBF network class we additionally have to specify the σ values for each dimension (always referring to the normalized interval [0,1]), so it is not possible in our approach to specify any cross-correlation between the state variables. The following simplified formula is used to calculate the feature factors:

\phi_i(x) = \exp\left( -\sum_{j=1}^{n} \frac{(x_j - \mu_{ij})^2}{2\sigma_{ij}^2} \right)    (3.4)

All features within the range of 2·σ_i are considered active, but at least two features per dimension must be active. The number of active features per dimension is crucial for the speed of the Toolbox, so we have to tune the sigma values carefully. At the end there is, as usual, the normalization step; so we use a normalized RBF network as it is used by Doya [17] and Morimoto [30]. Often, such RBF networks are also referred to as Gaussian Soft-Max Basis Function Networks (see chapter 6).

Figure 3.5: Using a grid of RBF centers for the feature state. For the x-dimension we use eight RBF centers, for the y-dimension four centers. The sigma values are chosen in such a way that there are two active features per dimension.

Linear Interpolation

For the linear interpolation approximator only two features per dimension are active. The feature factor is calculated by the product of the distances to the current state over all dimensions.

\phi_i(x) = \prod_{j=1}^{d} \left( dist_j + |x_j - Pos(\phi_i)_j| \right)    (3.5)

As usual these factors are normalized in the end.

3.2.6 Calculating features from a single continuous state variable

We also provide functionalities for specifying the centers of the features more accurately. For this purpose, we provide classes which enable us to choose the centers of the features explicitly for a single input dimension. These feature states can be combined by the and operator.


The super class for creating features from a single continuous state variable is CSingleStateFeatureCalculator. For this abstract class we can specify the location of the one-dimensional centers of the features and the number of active features. The method for calculating the feature factors is again abstract and implemented by the subclasses.

Figure 3.6: Calculation of the RBF features from single continuous state variables and combining them with the 'and' operator. In this example we use seven RBF features for dimension i and six RBF features for dimension j. Both feature states use two active features simultaneously, resulting in a feature state with four active features after the and operation.

RBF Features

Additionally, we have to specify the sigma values for each RBF center. The feature factor is calculated using the standard RBF equation for one dimension.

Linear Interpolation Features

There are always two features active; the feature factors are scaled linearly between the two neighboring features. All the discussed feature calculators take periodic continuous states into consideration, so it is in fact always the nearest features which are chosen to be active.

3.3 States for Neural Networks

Feed-forward neural networks do not have any requirements for the state representation, but some pre-processing can be useful. We use two pre-processing steps for continuous state variables.

• Periodic states get scaled to the interval [−π,+π]. In order to represent the periodicity for a neural network more accurately, we replace the scaled periodic state with two new state variables, one representing sin(x) and the other one cos(x).


• Non-periodic state variables are scaled to the interval [−1,1], so that they have the same scale as the periodic state variables.

These pre-processing steps are done by the class CNeuralNetworkStateModifier. We always have to consider that the resulting input state for the neural network contains additional continuous state variables, one more for each periodic state variable. We can also use discrete state variables as input states. For discrete state variables typically no generalization between the values is intended, so a separate input variable for each discrete state number is usually used. The value of the input variable d_i is 1.0 if i is the current state number and 0.0 otherwise.
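A minimal sketch of these pre-processing rules is given below; it is not the actual CNeuralNetworkStateModifier code, and the function signature is chosen only for this example.

#include <vector>
#include <cmath>

// Pre-processing of continuous state variables for a feed-forward network,
// following the two rules above.
std::vector<double> preprocessForNetwork(const std::vector<double>& state,
                                         const std::vector<bool>&   isPeriodic,
                                         const std::vector<double>& minValues,
                                         const std::vector<double>& maxValues)
{
    std::vector<double> input;
    for (size_t i = 0; i < state.size(); ++i) {
        if (isPeriodic[i]) {
            // Periodic variable (already scaled to [-pi, +pi]):
            // replace it by the pair (sin x, cos x).
            input.push_back(std::sin(state[i]));
            input.push_back(std::cos(state[i]));
        } else {
            // Non-periodic variable: scale linearly to [-1, 1].
            double scaled = 2.0 * (state[i] - minValues[i])
                                / (maxValues[i] - minValues[i]) - 1.0;
            input.push_back(scaled);
        }
    }
    return input;  // has (number of periodic variables) more entries than 'state'
}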


Chapter 4

General Reinforcement Learning Algorithms

In this chapter we will discuss common RL approaches which are, in general, designed for a discrete state and action space. We will discuss the algorithms in their theoretical form and also their implementation in the Toolbox. We begin with value based approaches to learning the V-Function and the Q-Function. After that, we will cover discrete Actor-Critic architectures and finally we will discuss model based approaches. At the conclusion of each theoretical discussion, an additional section discusses the implementation issues in the RL Toolbox.

4.1 Theory on Value based approaches

Value based methods estimate how desirable it is to be in a given state or to execute a certain action in a given state. Therefore, the algorithms use so-called value functions (V-Functions) or action value functions (Q-Functions).

4.1.1 Value Functions

Value functions estimate how desirable it is to be in state s. The value of state s is defined to be the expected future discounted reward the agent gets if starting in state s and following the policy π. Formally, this can be written as

V^\pi(s) = E\left[ \sum_{k=0}^{\infty} \gamma^k r(t+k) \right]    (4.1)

where the successor states s_{t+1} are sampled from the distribution f(s_t, π(s_t)) for all t. The expectation is always calculated over all stochastic variables (π and f). We can write 4.1 in the recursive form

V^\pi(s_t) = E[r(t) + \gamma V^\pi(s_{t+1})]    (4.2)

When referring to the value V(π) of a policy π, we always mean the expected discounted reward when following π, beginning at a typical initial state s_0. The value of policy π is given by:

V(\pi) = E_{s_0 \sim D}[V^\pi(s_0)]    (4.3)

D is the initial state distribution of the given MDP.


4.1.2 Q-Functions

A value function can be used to estimate the goodness of a certain state, but we can only use it for action selection if we know the transition function. Action value functions (referred to as Q-Functions) estimate the future discounted reward (i.e. the value) if the agent chooses the action a in state s and then follows policy π again. Hence Q-Functions estimate the goodness of executing action a in state s.

Q^\pi(s,a) = E[r(s,a,s') + \gamma V^\pi(s')]    (4.4)

Note that E_\pi[Q^\pi(s, \pi(s))] = V^\pi(s), so we can also write

Q^\pi(s,a) = E[r(s,a,s') + \gamma Q^\pi(s',a')]    (4.5)

where the action a' is chosen according to the policy π.

4.1.3 Optimal Value Functions

Given the definition of the value functions, we can also compare two policies. A policy π_1 is better than or as good as policy π_2 if V^{π_1}(s) ≥ V^{π_2}(s) for all s ∈ S. From the definition we see that the agent gathers at least as much reward following π_1 as it does following π_2. So the optimal policy π^* satisfies the condition

V^{\pi^*}(s) \geq V^\pi(s)    (4.6)

for all states and possible policies π. We define the optimal value function as V^*(s) = V^{π^*}(s). Optimal policies also have optimal action values, i.e. Q^*(s,a) = max_π Q^π(s,a) (where Q^* is already defined as Q^{π^*}). The optimal policy always chooses the action with the best action value, so it is also clear that

\max_{a \in A_s} Q^*(s,a) = V^*(s)    (4.7)

for all states s, where A_s is the set of available actions in state s. Inserting the optimal policy into 4.4 we can also write for Q^*:

Q^*(s,a) = E[r(s,a,s') + \gamma \cdot \max_{a' \in A_{s'}} Q^*(s',a')]    (4.8)

This equation is called the Bellman optimality equation. This equation can also be stated for value functions:

V^*(s) = \max_a E[r(s,a,s') + \gamma V^*(s')]    (4.9)

4.1.4 Implementation in the RL Toolbox

V-Functions

Given the above theory, V-Functions need to provide the following functionalities:

• Return an estimated value for a given state object.

• Update the value for a given state object.

In the discrete case V-Functions are usually represented as tables, but for more complex problems any representation of a function (polynomials, neural networks, etc.) can be used. We design a general interface for V-Functions, which then has to be implemented by the different V-Function implementations. This interface is called CAbstractVFunction. It contains interface functions for retrieving, updating (adding a value) and setting a value for a given state. In this class we can also set which state representation the V-Function will use; the specified state object is then automatically retrieved from the state collection and passed to the interface functions.
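A simplified sketch of such an interface is given below; the method names and the StateCollection stand-in are assumptions of this example, not the actual CAbstractVFunction declaration.

// Simplified sketch of a V-Function interface in the spirit of CAbstractVFunction.
class StateCollection;   // would hold the model state and all modified states

class AbstractVFunctionSketch {
public:
    virtual ~AbstractVFunctionSketch() {}

    // Return the estimated value of the state (taken from the collection).
    virtual double getValue(StateCollection* state) = 0;

    // Add 'delta' to the stored value of the state (used by the TD updates).
    virtual void updateValue(StateCollection* state, double delta) = 0;

    // Overwrite the stored value of the state.
    virtual void setValue(StateCollection* state, double value) = 0;
};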


Figure 4.1: Representation of the Value Function. The value function can choose any state representation from the state collection.

Value Functions for discrete States

Value functions for discrete states commonly store the value information in tabular form, i.e. we have a value entry for each discrete state number. In the Toolbox tabular V-Functions are represented by the class CFeatureVFunction, which can be used for discrete states and linear features. We can specify a discretizer object for the value function, which determines the state properties object (which has to be a discretizer or feature calculator) used to retrieve the state from the state collection. The size of the table is also taken from the discretizer. The feature V-Function also supports value manipulation directly with the discrete state number, so we do not have to use the state objects. This possibility is used, for example, by the implemented dynamic programming algorithms. When using feature states, the feature state is decomposed into its single features before calling the functions for the discrete state indices. The feature factors are used as weighting for the value calculation and the value updates, respectively.
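The following simplified C++ sketch illustrates how a tabular feature V-Function weights value reads and updates by the feature factors; the class layout is an assumption of this example, not the actual CFeatureVFunction code.

#include <vector>

struct Feature { int index; double factor; };

// Tabular value function over discrete/feature states. For a feature state, the
// value is the factor-weighted sum of the table entries, and an update is
// distributed over the active features with the same weighting.
class FeatureVFunctionSketch {
public:
    explicit FeatureVFunctionSketch(int numFeatures) : table_(numFeatures, 0.0) {}

    double getValue(const std::vector<Feature>& features) const {
        double v = 0.0;
        for (const Feature& f : features) v += f.factor * table_[f.index];
        return v;
    }
    void updateValue(const std::vector<Feature>& features, double delta) {
        for (const Feature& f : features) table_[f.index] += f.factor * delta;
    }
    // Direct access by discrete state number (used e.g. by dynamic programming).
    double getValue(int state) const     { return table_[state]; }
    void   setValue(int state, double v) { table_[state] = v; }

private:
    std::vector<double> table_;
};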

Q-Functions

Q-Functions return the action value of a given state-action pair. Again, we provide a general interface for all implementations of Q-Functions. The interface contains functions for getting, setting and updating a Q-Value, so it has the same functionality as for V-Functions, but with the action as additional input parameter. Each of these methods now contains additional arguments for the action, consisting of the action pointer itself and an action data object (see 2.2.4). Additionally, the interface provides functions to accomplish the following:

• getActionValues: calculate the values of all actions in a given action set and write them into a double array.

• getMaxValue: calculate the maximum action value for a given state.

• getMax: return the action with the maximum action value for a given state.

The Q-Function interface is called CAbstractQFunction.

Q-Functions for a set of actions

If we have a discrete set of actions, the action values for each action can be seen as separate, independent functions. It would be optimal if we could use different representations for each function, e.g. if we could use other discrete state representations for the different actions or even use a neural network for one action and a linear function approximator for another.

If we look at each action separately, the corresponding action value function only has to store state values, so it has the same functionality as a V-Function. Consequently, it is rather obvious to use V-Functions for the single action value functions of the Q-Function. For each action, we can specify an individual V-Function, so that different function representations can be used for different actions. This approach of representing a Q-Function is managed by the class CQFunction. For this class the user has to set a V-Function for each specified action.

Figure 4.2: Q-Functions for a finite action set: For each action the Q-Function maintains an individual V-Function object.

Q-Functions for Discrete or Feature States

For discrete Q-Functions we only need to use discrete V-Functions (CFeatureVFunction) for our single action value functions. If we need a Q-Function which uses the same discrete or feature state representation for each action, creating and setting the V-Functions each time can be very arduous. Thus we provide the class CFeatureQFunction, which creates the feature V-Function objects by itself, all with the same state representation.

4.2 Dynamic Programming

Dynamic Programming (DP) approaches are iterative methods to estimate the value or the action value functions with the use of a perfect model of the MDP. Due to the need for the perfect model and a huge computational expense, these algorithms are used only rarely in practice, but they are still very important theoretically. DP methods do not use any real experience of the agent either (the agent never executes an action); only the model is used. Therefore DP methods are not really learning methods, but rather a planning approach. DP methods use the perfect model to calculate the (optimal) value or action value for a given state s, assuming the values of the successor states of s are correct. In general this assumption is false, because we do not know the values of the successor states, but by repeating this step for all states infinitely often, the algorithm converges to the required solution.


4.2.1 Evaluating the V-Function of a given policy

The recursive equation of the V-Function for the stochastic policy π and stochastic transition probabilities P(s'|s,a) is given by:

V^\pi(s) = E[r(t) + \gamma V(s')] = \sum_{a} \pi(s,a) \sum_{s'} P(s'|s,a) \left( r(s,a,s') + \gamma V(s') \right)    (4.10)

This equation is applied iteratively for all states, resulting in a sequence of value functions V_0, V_1, V_2, ..., V_k, where V_k is

V_k(s) = \sum_{a} \pi(s,a) \sum_{s'} P(s'|s,a) \left( r(s,a,s') + \gamma V_{k-1}(s') \right), \quad \text{for all } s    (4.11)

We can think of the backups as being done in a sweep through the state space. These updates are also commonly referred to as full backups, since all possible transitions from one state s to all successor states s' are used. This sequence of value functions is proved to converge to V^π if the updates are done infinitely often. This process is often called iterative policy evaluation. Note that we can use the V-Function in combination with the model of the MDP to create a Q-Function for action selection.

Q^\pi(s,a) = E[r(s,a,s') + \gamma V^\pi(s')] = \sum_{s'} P(s'|s,a) \left( r(s,a,s') + \gamma V^\pi(s') \right)    (4.12)
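To make the full backups concrete, the following C++ sketch performs one sweep of iterative policy evaluation (equation 4.11) on a small discrete MDP. Dense probability arrays are used only for clarity; the Toolbox itself stores sparse transition lists, as described in section 4.2.5.

#include <vector>

// One sweep of iterative policy evaluation on a discrete MDP.
// P[s][a][s'] are the transition probabilities, R[s][a][s'] the rewards,
// pi[s][a] the stochastic policy, V the value function of the previous iteration.
std::vector<double> policyEvaluationSweep(
    const std::vector<std::vector<std::vector<double>>>& P,
    const std::vector<std::vector<std::vector<double>>>& R,
    const std::vector<std::vector<double>>& pi,
    const std::vector<double>& V, double gamma)
{
    std::vector<double> Vnew(V.size(), 0.0);
    for (size_t s = 0; s < V.size(); ++s)
        for (size_t a = 0; a < pi[s].size(); ++a)
            for (size_t s2 = 0; s2 < V.size(); ++s2)
                Vnew[s] += pi[s][a] * P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]);
    return Vnew;
}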

4.2.2 Evaluating the Q-Function

Of course we can also evaluate the action values with dynamic programming straightaway. In this case we use the following iterative equation:

Q_k(s,a) = \sum_{s'} P(s'|s,a) \left( r(s,a,s') + \gamma \cdot \sum_{a'} \pi(s',a') Q_{k-1}(s',a') \right), \quad \text{for all } s \text{ and } a    (4.13)

In fact, evaluating the V-Function and evaluating the Q-Function are theoretically equivalent. In the equation for the V-Function update, we have to calculate the Q-Values of the current state, and in the equation for the Q-Function update we have to calculate the V-Values of the successor states. In practice there is only a difference in how we represent our data.

4.2.3 Policy Iteration

Up to this point, we can calculate the values (or action values) of a given policy. But we want to evaluate the values of the optimal policy. There are two common ways to do this. The first is called policy iteration; in this case we evaluate the V-Function of a (fixed) policy (policy evaluation step), then we create a policy greedy on that V-Function (policy improvement). This policy is proved to be better than or at least as good as the old policy. Then we repeat the entire process. Policy iteration is guaranteed to converge to the optimal policy (or value function), but it is very time consuming because we have to do an entire policy evaluation for each improvement step.

4.2.4 Value iteration

Value iteration combines the two steps of policy evaluation and policy improvement. We directly evaluate the values of the greedy policy (greedy on the current values, not on old values as in policy iteration). Thus we get the following equation for the value function:


V_k(s) = \max_a \sum_{s'} P(s'|s,a) \left( r(s,a,s') + \gamma V_{k-1}(s') \right), \quad \text{for all } s    (4.14)

And for the action value function:

Q_k(s,a) = \sum_{s'} P(s'|s,a) \left( r(s,a,s') + \gamma \cdot \max_{a'} Q_{k-1}(s',a') \right), \quad \text{for all } s \text{ and } a    (4.15)

These are the iterative equations for the optimal value and action value function. Value iteration is also guaranteed to converge to the optimal policy. All these approaches only work for a discrete state space representation; otherwise we get problems with representing the transition probabilities. Adapted versions for a continuous state space are called Neuro-Dynamic Programming (see [12] or [15] for an exact description of these algorithms), but these approaches work only with limited success due to the huge computational expense. The performance of DP methods suffers mostly from the sweeps through the state space. Theoretically the value updates must be done for every state, but for the major part of the state space the values remain unchanged. We will discuss an algorithm called Prioritized Sweeping, which has a more sophisticated update schema.

4.2.5 The Dynamic Programming Implementation in the Toolbox

The Toolbox supports evaluating the value function or the action value function of any stochastic policy, so value iteration is also possible if we use a greedy policy. Why is it useful to provide dynamic programming methods for both value and action value learning if these two approaches are equivalent? There are several approaches to combining dynamic programming with other learning methods (actually, in the Toolbox we can use any other value based algorithm in combination with DP; we just have to use the same V- or Q-Function), and depending on what other type of learning algorithm we want to use, we need either the V-Function or the Q-Function. For dynamic programming we need additional data structures to represent the following:

• Transition Probability Matrix P(s'|s,a): Typically, most of the entries of this matrix are zero, because for common problems only a few states can be reached from a specific state within one step. Some other algorithms also need the backward transition probability P(s|s',a), which is actually the same quantity, but we need a data structure which provides quick access to the probabilities greater than zero in both directions. We want to access these probabilities by supplying either the successor state or the predecessor state.

• Stochastic policies: The policies have to return a probability distribution over the actions for each state. This will be discussed in section 4.5. It is enough to know that such stochastic policies exist and are used.

Representing the Transition probabilities

The requirements of our transition probability matrix have already been explained. It is understood that we can not store the probability values in a matrix, because we would waste a lot of memory and computation time. Instead we implement an interface class CAbstractFeatureStochasticModel and an implementation class CFeatureStochasticModel. This gives room for other implementations. In our approach we store lists of transitions. A transition object contains the index of the initial state, the index of the final state and the probability of the transition. For each state-action pair we maintain two lists, one for the forward transitions and one for the backward transitions. If a transition is added to the probability matrix, the transition is added to the initial state's forward transition list and to the end state's backward transition list for the specified action. This allows quick access to all successor (or predecessor) states of a given state. The action itself is not stored by the transition object; the action is already implicitly specified by the location of the transition object.

Figure 4.3: Representation of the Transition Matrix. For each state-action pair we have a list of forward and a list of backward transitions.

The CFeatureStochasticModel class offers functions that retrieve the probability of a specified transition directly, or retrieve all forward (or backward) transitions of a given state-action pair as well. For the stochastic model, the state (or alternatively the feature index) is represented as an integer variable, because these updates have to be done several hundred thousand times for larger MDPs; therefore we do not want to waste any computational resources by using state objects.
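The following C++ sketch illustrates the forward/backward transition lists described above; it is a simplified stand-in for CFeatureStochasticModel, and the method names are chosen for this example.

#include <vector>

// Sparse transition model: for every state-action pair we keep a forward and a
// backward list of transitions, so successors and predecessors can be found quickly.
struct Transition { int startState; int endState; double probability; };

class StochasticModelSketch {
public:
    StochasticModelSketch(int numStates, int numActions)
        : forward_(numStates, std::vector<std::vector<Transition>>(numActions)),
          backward_(numStates, std::vector<std::vector<Transition>>(numActions)) {}

    void addTransition(int s, int a, int s2, double prob) {
        Transition t{ s, s2, prob };
        forward_[s][a].push_back(t);    // successors of s under action a
        backward_[s2][a].push_back(t);  // predecessors of s2 under action a
    }
    const std::vector<Transition>& getForwardTransitions(int s, int a) const
        { return forward_[s][a]; }
    const std::vector<Transition>& getBackwardTransitions(int s, int a) const
        { return backward_[s][a]; }

private:
    std::vector<std::vector<std::vector<Transition>>> forward_;
    std::vector<std::vector<std::vector<Transition>>> backward_;
};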

Discrete Reward Functions

For the DP update we need reward functions which take discrete states as arguments, instead of our usual state objects. For performance reasons we represent the discrete states with integer variables in this context. Therefore we create the class CFeatureRewardFunction, which returns a reward value for states represented by integers.

The Value Iteration Algorithm

As already mentioned, the Toolbox supports calculating the values or the action values of a given stochastic policy. For this purpose we introduce some tool classes:

• Converting V-Functions to Q-Functions: We build an extra read-only Q-Function class (CQFunctionFromStochasticModel) which calculates the action values given a value function, the transition probabilities and the reward function. For the action value calculation the standard equation

Q(s,a) = \sum_{s'} P(s'|s,a) \left( r(s,a,s') + \gamma V(s') \right)

is used. Of course the sum over all states s' is only computed over the successor states in the forward list of state s. Note that this is already a main part of policy evaluation.


• Converting Q-Functions to V-Functions: The class CVFunctionFromQFunction is used to calculate the V value from a Q-Function and the corresponding policy. For this calculation, the standard equation

V(s) = \sum_{a} \pi(s,a) Q(s,a)

is used. Note that combining these two conversions already defines the update step used for value iteration.

Depending on what we want to estimate (values or action values), the algorithm takes a stochastic policy and either a Q- or a V-Function as input. If no policy has been specified, a greedy policy is used.

• Value estimation: The given V-Function is converted to a Q-Function. Then we convert that Q-Function back to a V-Function. For each update step we set the value of state s of the original V-Function to the value of the virtual V-Function.

• Action-Value estimation: Here it works vice versa; we create a V-Function from the given Q-Function and then convert the V-Function back to a Q-Function. For each subsequent update step, the values calculated by the virtual Q-Function are used.

We also added some useful extensions to the standard value iteration algorithm for choosing which states to update. The states which get updated are usually chosen at random or sequentially by a sweep. With this update scheme it is unlikely to pick states whose value will change significantly. In our approach, a priority list for the state updates is used. The state with the highest priority is chosen for the update of the value function. If a state update has been made for state s and the value of that state changed significantly, it is likely that the values of the predecessor states will also change. If state s has been updated and the Bellman error b = |V_{k+1}(s) − V_k(s)| is the difference between the new and the old value, the priorities of all predecessor states s' are increased by the expected change of the value of state s' (which is P(s|s',a)·b). The algorithm is listed in Algorithm 1. Only priorities above a given threshold are added.

Algorithm 1 Priority Lists
  b = V_k(s) − V_{k−1}(s)
  for all actions a do
    for all predecessor states s' do
      priority(s') += P(s|s',a) · b
    end for
  end for

We provide functions for the following:

• doUpdateSteps: Updating the first N states from the priority list.

• doUpdateStepsUntilEmptyList: Do the updates until the priority list is empty.

• doUpdateBackwardStates: Update all predecessor states of a given state. This can be used to give the algorithm some hints on where to start the updates; for example, it is useful to begin in known target or failed states.

As already mentioned, this is no longer the pure value iteration algorithm. It is a sort of intermediate step towards the prioritized sweeping algorithm discussed in section 4.8.1.


4.3 Learning the V-Function

4.3.1 Temporal Difference Learning

TD learning approaches calculate the error of the Bellman equation for a one-step sample < s_t, a_t, r_t, s_{t+1} >. So, in contrast to Dynamic Programming, TD methods use one-sample backups instead of full backups. Since we use only a single sample for the update, the calculated values have to be averaged. For this, the learning rate η is used. For the value function, the equation is

V^\pi(s_t) = E[r(t) + \gamma V(s_{t+1})]

We obtain the following solution for a single step sample:

V_{k+1}(s_t) = (1 - \eta) \cdot V_k(s_t) + \eta \cdot (r_t + \gamma \cdot V_k(s_{t+1}))    (4.16)

The one-step error of the Bellman equation for a step tuple < s_t, a_t, r_t, s_{t+1} > is also called the temporal difference (TD) and is calculated by

td = r_t + \gamma V(s_{t+1}) - V(s_t)    (4.17)

We can also use the TD value to express the update equation of the V value of state s_t.

\Delta V(s_t) = \eta \cdot td = \eta (r_t + \gamma V(s_{t+1}) - V(s_t))    (4.18)

By following a given policy π, we can estimate V^π using this approach.

4.3.2 TD (λ) V-Learning

The normal TD learning algorithm simply 'rewards' the current state with the temporal difference. But in general, the states from the past are also 'responsible' for the achieved temporal difference. Eligibility traces (e-traces) are a common approach used to speed up the convergence of the TD learning algorithm (see [49]). The e-trace e(s) of a state represents the influence of the state s on the current TD update. Now each state is updated with the help of its e-trace.

\Delta V(s) = \eta \cdot td \cdot e(s), \quad \text{for all } s    (4.19)

4.3.3 Eligibility traces for Value Functions

There are different ways of calculating the eligibility of a state. The eligibility of each state has to be decreased with a given attenuation factor λ at each step, except for the current state, for which the eligibility is increased. The two most common eligibility trace update methods are:

• Replacing e-traces:

e_{t+1}(s) = \begin{cases} \lambda \cdot \gamma \cdot e_t(s), & \text{if } s \neq s_t \\ 1, & \text{else} \end{cases}    (4.20)

• Accumulating (Non-Replacing) e-traces:

e_{t+1}(s) = \begin{cases} \lambda \cdot \gamma \cdot e_t(s), & \text{if } s \neq s_t \\ \lambda \cdot \gamma \cdot e_t(s) + 1, & \text{else} \end{cases}    (4.21)

At the beginning of an episode the e-traces must obviously be reset to zero. In general it is not clear which approach works better. Non-replacing e-traces are more common, but they can falsify the V-Function update considerably if the learning task allows the agent to stay in the same state for a long time. Replacing e-traces have been introduced by Singh [44]. In his experiments, replacing e-traces had a considerably better learning performance.


4.3.4 Implementation in the RL Toolbox

E-Traces for V-Functions

There are different types of e-traces for the different types of value functions, so we again have to provide a general interface for the e-traces. This interface is called CAbstractVETraces. An e-trace object is always bound to a V-Function object, which is passed to the constructor of the e-traces object. The interface contains functions for the following:

• Adding the current state to the e-traces (addETraces(CStateCollection*)).

• Updating the current e-traces by multiplying the e-trace values with the attenuation factor (updateETraces()).

• Updating the value function with a given td (updateVFunction(double td)).

• Resetting the e-traces for all states (resetETraces()).

In the following we will only discuss the implementation for the discrete and linear feature state representation (which is implemented by the class CFeatureVETraces). In the TD(λ) update rule we have to update all states in each step. But in general (and particularly for large state spaces) this is not necessary; the e-traces of most states will be zero anyway. This conclusion arises from the general assumption that the agent only uses a local part of the state space for a long time period. As a result, we do not store the e-traces in an array, but store the index and the eligibility factor in a list. All states which are not in the list have an e-trace factor of zero. The list is also sorted by the eligibility factors. In order to find the < s, e(s) > tuple faster, we also maintain an integer map for the tuples. In this context the state number serves as map index, so we can search for the eligibility factor of state s (as well as decide whether the state is in the list) very quickly. This is needed in order to add the eligibility of a given state. This kind of list is called a feature list (class CFeatureList in the Toolbox) and is also needed for other classes and functionalities. There are sorted and unsorted feature lists. When updating the value function, the class calls the update method of the feature V-Function (the one using integers as the state parameter) for every state in the eligibility list. We also implement replacing and accumulating e-traces; this behavior can be set by the parameter 'ReplacingETraces'. We provide two additional parameters which control the speed and accuracy of the e-traces. The first parameter ('ETraceMaxListSize') controls the maximum number of states in the e-trace list. If there are more states in the list, the states with the smallest factors are deleted. The second parameter controls the minimum factor of an e-trace ('ETraceTreshold'). Once again, all states with an eligibility factor lower than the given threshold are deleted. Additionally, each V-Function provides a method for creating a new e-traces object of the correct type, already initialized with the V-Function. Feature V-Functions always create feature e-trace objects as standard, but it is also possible to use other kinds of e-traces with feature V-Functions.
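A simplified sketch of such a sparse e-trace container is given below. It uses a std::map instead of the sorted feature list of the Toolbox, and the method and parameter names only follow the description above; it is not the actual CFeatureVETraces code.

#include <map>
#include <algorithm>

class FeatureETracesSketch {
public:
    FeatureETracesSketch(double lambda, double gamma, bool replacing, double threshold)
        : lambda_(lambda), gamma_(gamma), replacing_(replacing), threshold_(threshold) {}

    void resetETraces() { traces_.clear(); }

    // e(s) = lambda * gamma * e(s); traces below the threshold are discarded.
    void updateETraces() {
        for (auto it = traces_.begin(); it != traces_.end(); )
            if ((it->second *= lambda_ * gamma_) < threshold_) it = traces_.erase(it);
            else ++it;
    }
    // Replacing or accumulating e-trace update for the current state.
    void addETrace(int state, double factor = 1.0) {
        double& e = traces_[state];
        e = replacing_ ? std::max(e, factor) : e + factor;
    }
    // V(s) += eta * td * e(s) for every state in the list; VFunction must offer
    // getValue(int)/setValue(int, double), as in the tabular sketch above.
    template <class VFunction>
    void updateVFunction(VFunction& v, double etaTd) {
        for (const auto& entry : traces_)
            v.setValue(entry.first, v.getValue(entry.first) + etaTd * entry.second);
    }

private:
    double lambda_;
    double gamma_;
    bool   replacing_;
    double threshold_;
    std::map<int, double> traces_;
};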

TD(λ) V-Learning

Since we have already designed the V-Function and the e-trace objects, implementing the TD(λ) algorithm itself is straightforward. The algorithm implements the CSemiMDPRewardListener interface and gets a reward function and a value function passed to the constructor. At each new step and each new episode it performs the updates listed in algorithm 2. A value function alone can not be used for action selection, but in combination with a transition function it can be used to calculate a policy (see section 4.5).


Algorithm 2 TD-V Learning
  for each new episode do
    etraces->resetETraces()
  end for
  for each new step < s_t, a_t, r_t, s_{t+1} > do
    td = r_t + γ · V(s_{t+1}) − V(s_t)
    etraces->updateETraces()          // e(s) = λ · γ · e(s)
    etraces->addETrace(s_t)           // e(s) += 1.0 or e(s) = 1.0
    etraces->updateVFunction(η · td)  // V(s) += η · td · e(s)
  end for

4.4 Learning the Q-Function

4.4.1 TD Learning

As we have seen, the Q-Function is defined according to 4.5. Similar to learning the V-Function, we take a one-step sample < s, a, r, s' > of this equation and calculate the equation's error when using the estimated Q-Values.

td = r(s,a,s') + \gamma Q^\pi(s',a') - Q^\pi(s,a)    (4.22)

a' in 4.22 is chosen from the policy π; because we estimate the action values of policy π, this policy is called the estimation policy. In the case of Q-Function learning the estimation policy does not have to be the policy that is used by the agent. The TD value is then used to update the Q value of state s and action a.

\Delta Q(s,a) = \eta \cdot td    (4.23)

If every state-action pair is visited infinitely often and the learning rate η is decreased over time, the TD algorithm is guaranteed to converge to Q^π. We can use any policy as the estimation policy π, but in general we want to estimate the values of the optimal policy. There are two main algorithms, SARSA and Q-Learning, which differ purely in the choice of the estimation policy. In Q-Function learning we can use stored experience from the past (like batch updates) for learning, because we can specify an individual estimation policy, which is learned. When learning the V-Function, the step tuple always has to be generated by the estimated policy, therefore past information can not be used.

SARSA Learning

State-Action-Reward-State-Action learning ([49], p. 145) always uses the policy the agent follows as estimation policy. That is, it uses the < s_t, a_t, r_t, s_{t+1}, a_{t+1} > tuple for the updates, which is why it is called SARSA. This approach is called on-policy learning, because we learn the policy we follow. Usually the agent follows a greedy policy with some kind of exploration. For this reason, SARSA does not estimate the optimal policy, but also takes the exploration of the policy into account. If the policy gradually changes towards a greedy policy, SARSA converges to Q^*.

Q Learning

Q-Learning ([49], p. 148) uses a greedy policy as estimation policy. Thus it estimates the action values of the optimal policy. Since we estimate a policy other than the one we follow (due to some random exploration term), this type of update is called off-policy learning.


In general it is not clear which method is better. Q-Learning is likely to converge faster. SARSA learning has an advantage if there are areas of high negative reward, because in that case SARSA tries to stay away from these areas more than Q-Learning. The reason for this is that SARSA also considers the exploring actions, which could lead the agent into these areas by chance.

4.4.2 TD(λ) Q-Learning

Similar to V-Function learning, we can also use eligibility traces for Q-Function learning. Now the e-traces are not kept for states alone, but for state-action pairs. The update rules are the same as in the V-Learning case (here again, there are replacing and non-replacing e-traces). But there is a difference when resetting the e-traces. In this case we have to distinguish between actions which were chosen by the estimation policy and actions where the estimation policy differed from the step information. Once we recognize that we have not followed the estimation policy, we need to reset the e-traces. This is done because we want to estimate the action values of the estimation policy, and if the estimation policy has not been followed for one step, the state-action pairs from the past are not responsible for any further TD updates. Nevertheless, there are also approaches where the e-traces are never reset if the estimated action contradicts the executed action. This may falsify the Q-Function updates, but it can also improve performance because the e-traces reach further into the past. In the initial learning phase in particular, where we have many exploratory steps, this can be a significant advantage. Of course there are extensions of Q and SARSA learning using e-traces. They are called Q(λ) and SARSA(λ). In this case, the SARSA(λ) algorithm has a small advantage over the Q(λ) algorithm, because the agent always follows the estimation policy; thus we never have to reset the e-traces during an episode.

4.4.3 Implementation in the RL Toolbox

E-Traces for Q-Functions

We have already defined e-traces for V-Functions, which store the eligibility of a state. Now we have to store the eligibility of a state-action pair. For Q-ETraces we provide a similar interface class, CAbstractQETraces. The interface contains functions for the following:

• Add the current state-action pair to the e-traces (addETrace(s,a)).

• Update the current e-traces by multiplying them by the attenuation factor (updateETrace()).

• Update the Q-Function, given η · td (updateQFunction(td)).

• Reset the e-traces for all state-action pairs (resetETraces()).

The design of the Q-ETraces class is patterned on the design of the Q-Functions. Again, we maintain a V-ETraces object for each action. Each V-ETraces object is assigned to the corresponding V-Function of the CQFunction object. Thus, if the Q-Function uses feature V-Functions for the single action value functions, the Q-ETrace object will use feature V-ETraces. This functionality is covered by the class CQETraces, which implements the general CAbstractQETraces interface. Since we use e-traces for V-Functions, we can use the entire functionality discussed previously, including replacing or non-replacing e-traces and setting the maximum list size and the minimum e-trace value before a state is discarded from the list. Each Q-Function contains a method for retrieving a new standard Q-ETraces object for that type of Q-Function, similar to the V-Function approach.


TD(λ) Q-Learning

We have defined all necessary parts for the algorithm. Now we can easily combine these parts and build the learner class CTDLearner. The algorithm implements the CSemiMDPRewardListener interface. It gets a reward function, a Q-Function object and an estimation policy object as parameters. Additionally, we can specify an individual Q-ETraces object; otherwise the standard Q-ETraces object for the given Q-Function is used. The algorithm is listed in algorithm 3.

Algorithm 3 TD-Q Learning
  a_e ... estimated action
  π_e ... estimation policy
  for each new episode do
    etraces->resetETraces()
  end for
  for each new step < s_t, a_t, r_t, s_{t+1} > do
    if a_e ≠ a_t then
      etraces->resetETraces()       // e(s) = 0, for all s
    else
      etraces->updateETraces()      // e(s) = λ · γ · e(s), for all s
    end if
    a_e ← π_e(s_{t+1})
    td ← r_t + γ · Q(s_{t+1}, a_e) − Q(s_t, a_t)
    etraces->addETrace(s_t, a_t)    // e(s_t, a_t) += 1.0 or e(s_t, a_t) = 1.0
    etraces->updateQFunction(η · td)
  end for

We can disable resetting the e-traces when the estimated action differs from the executed action with the parameter 'ResetETracesOnWrongEstimate'. There is an individual class for SARSA learning, CSARSALearner, which uses the agent as its estimation policy (remember that the agent is also a deterministic controller, which always returns the action executed in the current state). There is also an individual class for Q-Learning, CQLearning, which automatically uses a greedy policy as its estimation policy.

4.5 Action Selection

If we use a discrete set of actions, the action can be selected with a distribution based on the action values. But always taking the greedy action is not advisable, because we also need to incorporate some exploring actions. There are three commonly used ways of selecting an action from the Q-Values:

• The greedy policy: Always take the best action

• The Epsilon-Greedy policy: Take a random action with probability ε; take the greedy action with probability 1 − ε. This gives us the following action distribution:

P(s,a_i) = \begin{cases} 1 - \varepsilon + \frac{\varepsilon}{|A_s|}, & \text{if } a_i = \arg\max_{a' \in A_s} Q(s,a') \\ \frac{\varepsilon}{|A_s|}, & \text{else} \end{cases}    (4.24)


The advantage of the epsilon-greedy policy is that the exploration factor ε can be set very intuitively.

• The soft-max policy: The soft-max policy uses the Boltzmann distribution for action selection:

P(s,a_i) = \frac{\exp(\beta \cdot Q(s,a_i))}{\sum_{a_j \in A_s} \exp(\beta \cdot Q(s,a_j))}    (4.25)

The parameter β controls the exploration rate. The higher the β value, the sharper the distribution becomes. For β → ∞ it converges to the greedy policy. When using the soft-max distribution, actions with high Q-Values are more likely to be chosen than actions with a lower value, so it generally has a better performance than epsilon-greedy policies, because the exploration is more guided. The disadvantage is that the exploration rate also depends on the magnitude of the Q-Values, so it is harder to find a parameter setting for β (a small sampling sketch is given after this list).
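The following C++ sketch samples an action from the Boltzmann distribution of equation (4.25). It is independent of the Toolbox's CSoftMaxDistribution class; the numerical-stability shift by the maximum Q-Value and the use of rand() are choices of this example.

#include <vector>
#include <cmath>
#include <cstdlib>

// Soft-max (Boltzmann) action selection. Subtracting the maximum Q-Value does
// not change the distribution but avoids overflow of exp() for large beta.
int softMaxAction(const std::vector<double>& qValues, double beta)
{
    double maxQ = qValues[0];
    for (double q : qValues) if (q > maxQ) maxQ = q;

    std::vector<double> weights(qValues.size());
    double sum = 0.0;
    for (size_t i = 0; i < qValues.size(); ++i)
        sum += (weights[i] = std::exp(beta * (qValues[i] - maxQ)));

    // Roulette-wheel sampling from the (unnormalized) weights.
    double r = sum * (std::rand() / (RAND_MAX + 1.0));
    for (size_t i = 0; i < weights.size(); ++i) {
        r -= weights[i];
        if (r <= 0.0) return static_cast<int>(i);
    }
    return static_cast<int>(weights.size()) - 1;   // numerical safety
}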

4.5.1 Action Selection with V-Functions using Planning

We already covered how to learn the value function of a specified policy using TD(λ). Of course we usually want to learn the optimal policy, and not just estimate the value function of a given policy. With the V-Function alone, we can not make any decision about whether an action is good or bad, so we can not learn a policy either. But we can use the transition function f of the model (or learn the transition function if it is not known) to make a one-step forward prediction of the current state for every action. We can then use the equation

Q^\pi(s,a) = E_{s'}[r(s,a,s') + \gamma V^\pi(s')]

where s' is sampled from the distribution defined by f(s,a), to calculate action values from these state predictions. For strongly stochastic processes we would have to repeat the prediction several times, which is rather time consuming. But for deterministic (or hardly stochastic) processes, for example processes with a small amount of noise, we can omit the expectation and just calculate the Q-Value with a one-step forward prediction.

Q^\pi(s,a) = r(s,a,s') + \gamma V^\pi(f(s,a))

These action values have the same meaning as the Q-Values of an ordinary Q-Function. From this it follows that by following a greedy policy (or a policy gradually converging to a greedy policy), we can also learn the optimal policy π^*. The advantage of this approach is that we use the transition function (and the reward function) as a kind of prior knowledge for our policy. This can boost our learning performance considerably, in particular for continuous control tasks where the transition function usually defines some complex dynamics. Even if we do not know the transition function, we can use any kind of supervised learning algorithm to learn the model. Learning the model is typically easier than learning the Q-Function (because it is a supervised learning task), so we can divide the entire task of learning the Q-Function into learning the V-Function and learning the transition function. The disadvantage of this approach is that it requires comparatively much computation time. The policy has to calculate the state prediction s' and the value V(s') for each action. Calculating this value can be quite time consuming, especially if we use large RBF networks.
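The following C++ sketch computes such planning action values for a deterministic model with a one-step forward prediction; the std::function arguments merely stand in for the Toolbox's transition function, reward function and V-Function objects.

#include <vector>
#include <functional>

typedef std::vector<double> State;

// Q(s,a) = r(s,a,s') + gamma * V(f(s,a)) for every action of a discrete action set.
std::vector<double> planningQValues(
    const State& s, int numActions, double gamma,
    const std::function<State(const State&, int)>&                 transition,
    const std::function<double(const State&, int, const State&)>&  reward,
    const std::function<double(const State&)>&                     vFunction)
{
    std::vector<double> q(numActions);
    for (int a = 0; a < numActions; ++a) {
        State s2 = transition(s, a);                 // one-step forward prediction
        q[a] = reward(s, a, s2) + gamma * vFunction(s2);
    }
    return q;  // can be fed into any of the action selection schemes above
}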

Planning for more than one step

Another nice advantage of this approach is that we can combine heuristic search approaches with V-Learning. We are not restricted to using just a one-step forward prediction. We can also use an N-step forward prediction and span a search tree over the state space to calculate the action values. Hence we use a heuristic search with a learned value function. The point of searching deeper than one step is to obtain a better action selection, in particular if we have a perfect model but an imperfect value function. The action value of action a is then

Q(s_t,a_t) = r(s_t,a_t,s_{t+1}) + \max_{a_{t+1},a_{t+2},...,a_{t+N-1}} \left[ \sum_{i=1}^{N-1} \gamma^i \cdot r(s_{t+i},a_{t+i},s_{t+i+1}) + \gamma^N \cdot V(s_{t+N}) \right]    (4.26)

The number of states to predict (\sum_{i=1}^{N} |A|^i = \frac{|A|^{N+1} - |A|}{|A| - 1}) and the number of V-Functions to evaluate (|A|^N) increase exponentially; consequently, planning can obviously only be done for small N.

Figure 4.4: In this example, a two-step forward prediction is used to select the best action a_1.

4.5.2 Implementation in the RL Toolbox

Stochastic Policies

All the policies mentioned above can be seen as stochastic policies. In our design, stochastic policies are represented by the abstract class CStochasticPolicy. The class calculates the action values of all available actions in the current state and passes them to an action distribution object, which calculates the probability distribution. Then an action is sampled from this distribution and returned. Since we use only a discrete action set, we do not have to cope with action data objects in this case. The probability distribution itself is also needed by a few algorithms; for that reason, we calculate the distribution in an individual public method getActionProbabilities. There are three action distribution classes for the three discussed policies (CGreedyDistribution, CEpsilonGreedyDistribution, CSoftMaxDistribution). How the action values for calculating the distribution are obtained is not specified at this point (therefore the class is abstract; this functionality is determined by the subclasses). The class CQStochasticPolicy uses a Q-Function object for calculating the action values.


Figure 4.5: The stochastic policy: The distribution is determined by the action distribution object, which can be a greedy, epsilon-greedy or soft-max policy.

V-Planning

We want to use the same policies as for Q-Functions. For that reason, we implement an extra Q-Function class which calculates the action values from the value function, the transition function and the reward function. Of course this Q-Function is read-only and can not be modified. The class is called CQFunctionFromTransitionFunction. It takes a value function, the transition function and the reward function as arguments. For the state prediction, a list of state modifiers is needed as additional argument. These modifiers are used by the V-Function (typically we can specify the state modifier list of the agent here). Because we can only calculate model states with the transition function, we have to maintain a separate state collection object which can also calculate and store the modified states. For the n-step prediction the algorithm is more complicated. The search tree is built by calling the search function recursively with a decremented search depth argument. The search process is stopped when we reach a search depth of zero (that is, we are in a leaf). We maintain a stack of already calculated state collections, and the search function always uses the first state collection from the stack as the current predicted state. Before the recursive function calls (for each action we create a new branch), the new predicted state is pushed onto the stack. When the recursive calls are finished, the predicted state collection is removed again. The search depth of the search tree can be specified by the parameter 'SearchDepth'. This Q-Function can be used for the stochastic policy class. The class CVMStochasticPolicy is inherited from CQStochasticPolicy and creates such a Q-Function itself, making it more comfortable to use.

4.6 Actor-Critic Learning

Actor-Critic algorithms are methods which represent the value function and the policy in distinct data structures. The policy is called the actor; the value function is known as the critic. The value function is learned in the usual way (with any V-Learning approach we want). The critique coming from the V-Function is usually the temporal difference (although there are a few algorithms which use other quantities coming from the V-Function), which indicates whether the executed action from the actor was better than expected (positive critique) or worse (negative critique). The actor can then adapt its policy according to this critique. Actor-Critic learning has two main advantages:

• We can learn an explicitly stochastic policy, which can be useful in competitive or non-Markovian processes.


• For a continuous action set, we can represent the policy directly and calculate the continuous action vector. When learning action values, we would have to search through an infinite set of actions to pick the best one. Although it is possible to discretize the action space, we still get a huge number of actions for a high dimensional continuous action space.

Figure 4.6: The general Actor-Critic architecture.

4.6.1 Actors for two different actions

To illustrate the Actor-Critic design we discuss a very simple actor. The algorithm was proposed by Barto [8] in 1983. The actor can only choose between two different actions. It stores an action value p(s) just for the first action. The action value is then updated according to the rule

\Delta p_t(s) = \begin{cases} 0.5 \cdot \beta \cdot critique, & \text{if } a_t = a_1 \\ -0.5 \cdot \beta \cdot critique, & \text{else} \end{cases}    (4.27)

where β is the learning rate of the actor. If the first action has been used, the critique is added to the action value. Otherwise, the critique is subtracted. Thus the action value increases if the first action has yielded a good result or the second action yielded a bad one. The first action is then selected with the probability

P(a_t = a_1) = \frac{1}{1 + \exp(-p(s_t))}    (4.28)

Thus action number one is taken with high probability if the action value is positive. We can also use this algorithm with e-traces. The current state is added to the e-traces by the following equation

\Delta e_t(s) = \begin{cases} +0.5, & \text{if } a_t = a_1 \\ -0.5, & \text{else} \end{cases}    (4.29)

The remaining e-trace updates (attenuation, replacing or accumulating traces) and the updates of the action values are carried out as usual. Consequently, we arrive at the following action value update rule:

\Delta p_t(s) = \beta \cdot critique \cdot e(s), \quad \text{for all } s    (4.30)
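A minimal tabular sketch of this two-action actor (equations 4.27 and 4.28, without e-traces) is given below; it does not reproduce the Toolbox class CActorFromValueFunction, and rand() is used only for brevity.

#include <vector>
#include <cmath>
#include <cstdlib>

// Tabular actor for two actions: p(s) is stored per state, 'critique' is the
// temporal difference delivered by the critic.
class TwoActionActorSketch {
public:
    TwoActionActorSketch(int numStates, double beta) : p_(numStates, 0.0), beta_(beta) {}

    // Returns 0 for the first action, 1 for the second (equation 4.28).
    int selectAction(int state) const {
        double probFirst = 1.0 / (1.0 + std::exp(-p_[state]));
        return (std::rand() / (RAND_MAX + 1.0)) < probFirst ? 0 : 1;
    }
    // Update rule of equation (4.27).
    void receiveCritique(int state, int action, double critique) {
        p_[state] += (action == 0 ? 0.5 : -0.5) * beta_ * critique;
    }

private:
    std::vector<double> p_;
    double beta_;
};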


4.6.2 Actors for a discrete action set

Actors for a discrete action set calculate a probability distribution for taking an action in state s. Action values for each state are used to indicate the preference for selecting an action (similar to Q-Functions). The action can then be selected again by a probability distribution (as discussed earlier in the subsection on action selection). There are different approaches for updating this action value function.

• The main approach, discussed in [49], p. 151, is related to the TD update of a Q-Function.

\Delta p(s_t,a_t) = \beta \cdot critique    (4.31)

The difference to TD learning is that in this case we use a separate value function to estimate the temporal difference.

• Another approach is to include the inverse probability of selecting action a_t. As a result, the action values of rarely chosen actions receive a higher emphasis:

\Delta p(s_t,a_t) = \beta \cdot critique \cdot (1 - \pi_t(s_t,a_t))    (4.32)

where π_t is the stochastic policy at time t.

The problem with this approach is that actions with a high probability do not get updated at all any more. In the Toolbox we intermix the two approaches, using a minimum learning rate for all actions and higher learning rates for actions with low probabilities. Both approaches are also implemented with e-traces. Again, we can use the e-traces already discussed for Q-Functions.

4.6.3 Implementation in the RL Toolbox

The critic is already implemented; we can use any V-Function learner class. To receive the critique for a given state-action pair, we create the interface CErrorListener. Error listeners receive an error value of an unspecified function (to maintain generality, the type of function is not specified here) for a given state-action pair. In our case this function is the value function; therefore we extend our V-Learner and Q-Learner classes. The TD learner classes can also maintain a list of error listeners, and after calculating the TD error, this quantity is sent to the error listeners. Therefore all actors needing the TD value must implement the error listener interface. Then we only have to add the actor to the error listener list of the TD learner. Note that through this approach we can use either a V-Function or a Q-Function as the critic. Actors can also implement the agent controller interface directly, but this is not mandatory for every actor, as we will see.
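The listener pattern can be sketched as follows; the sketch uses integer states and actions instead of the state collection and action objects of the actual CErrorListener interface, and the method names are chosen for this example.

#include <vector>

// Error listener: receives the TD error of a step from the critic (the TD learner).
class ErrorListenerSketch {
public:
    virtual ~ErrorListenerSketch() {}
    virtual void receiveError(double error, int state, int action) = 0;
};

// Mix-in for a TD learner that notifies all registered listeners (e.g. actors).
class TDErrorSenderSketch {
public:
    void addErrorListener(ErrorListenerSketch* listener) { listeners_.push_back(listener); }

protected:
    // Called by the learner after the TD error of a step has been computed.
    void sendErrorToListeners(double td, int state, int action) {
        for (ErrorListenerSketch* l : listeners_) l->receiveError(td, state, action);
    }

private:
    std::vector<ErrorListenerSketch*> listeners_;
};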

Actors for two different actions

The class representing this type of actor is called CActorFromValueFunction. This class maintains a V-Function object representing the p values and a designated e-traces object. The actor class also implements the discussed policy using the agent controller interface. The rest of the implementation is straightforward using the already implemented classes.


Figure 4.7: The class architecture of the Toolbox for the Actor-Critic framework

Actors for a discrete action set

In the RL Toolbox, these two approaches are implemented by the classes CActorFromQFunction and CActorFromQFunctionAndPolicy. Both algorithms use Q-Functions to represent the action values. Because the actors can not be used as policies themselves, we can use the Q-Function for the CQStochasticPolicy class. This policy must also be passed to the latter class because the stochastic policy is needed for the action value update. For the e-traces, a designated Q-ETraces object is used, so we have the entire discussed functionality for the e-traces.

4.7 Exploration in Reinforcement Learning

In RL problems, we usually face a trade-off between exploration and exploitation. In order to find an optimal policy and increase the reward during learning, we have to follow the action considered best at the time (short-term optimization). But without exploration (long-term optimization), we can never be certain that the supposedly best action is really the optimal action, because certain values or action values may still be wrong. Thus we have to make sure that we visit all areas of the state space thoroughly enough while still following a good policy. In this section we will discuss the results from Thrun [51] and Wyatt [55], who deal with the problem of directed exploration.
There are basically two methods for integrating exploration in RL: undirected exploration and directed exploration. Undirected exploration schemes induce exploration only randomly, while directed exploration approaches use further knowledge about the learning process. Thrun [51] proved that for finite deterministic MDPs, the worst-case complexity of learning the given task is exponential in the size of the state space if we use an undirected exploration method, while the worst-case bound is polynomial if we use a directed exploration approach. The results were not generalized to infinite stochastic MDPs, but intuitively, we can say that directed exploration also reduces the complexity of learning in general MDPs.

4.7.1 Undirected Exploration

Undirected exploration schemes rely only on the knowledge used for optimal control; they make better actions more likely, and exploration is ensured only at random. We already discussed two undirected exploration schemes: the soft-max policy and the ε-greedy policy. If we initialize our action value function with the upper bound Q_max, we get a special case of undirected exploration, where each action is guaranteed to be tried at least once by the agent. Also, actions which have been selected more often in a certain state are likely to have lower Q-Values than rarely chosen actions. Consequently, the probability of taking a frequently chosen action again is reduced. In this case, we get a sort of mixture of the undirected and the directed exploration schemes. This method of incorporating exploration is often called optimistic value initialization. Note that this is not possible with all function approximation schemes that we will discuss in chapter 6. For example, neural networks are global function approximators, i.e. changing the value of state s_t changes the value of many other states s′ even if s′ has never been visited. Consequently, the value of a state cannot be used to estimate how often the state has been visited. Taking the Q-Value as the exploration measure is often referred to as a utility-based measure in the literature.

4.7.2 Directed Exploration

Different exploration measures can be used, among them counter-based, recency-based and error-based measures. In this thesis, only the use of counter-based measures is investigated in more detail, but the other exploration measures can be implemented in the Toolbox easily. In the following discussion, the exploration measure of executing action a in state s will be called χ(s,a). This exploration measure (exploration term) is usually linearly combined with a Q-Value (exploitation term) to calculate new action values.

Eval(s,a) = Q(s,a) + αχ(s,a) (4.33)

The action value Eval(s,a) can be used again for action selection in the usual way - for example, using any stochastic policy. The exploration factor α is typically decreased over time in order to converge to a greedy policy. Thrun [51] also proposed a method for dynamically adjusting the α value, which is called Selective Attention. The different exploration measures can be combined to get a more effective, but also more complex exploration policy.

Counter-based measures

Counter-based measures count the number of visits C(s) of each state. The exploration measure is typically the number of visits of the next state s_{t+1}:

χ(s_t, a) = E_{s_{t+1}}[C(s_{t+1})|s_t, a]   (4.34)

An exploration policy would try to minimize this term, so using a linear combination with an exploitation term (which has to be maximized) does not work. For this reason, and in order to ensure that counter-based exploration measures converge with time, Thrun proposed the following counter-based exploration measure:

χ(s_t, a) = C(s_t) / E_{s_{t+1}}[C(s_{t+1})|s_t, a]   (4.35)


which is the ratio between the number of visits of the current state and the expected number of visits of the successor state. In order to get an exploration policy, this measure has to be maximized, which makes it suitable for the linear combination again. The expectancy E_{s_{t+1}}[C(s_{t+1})|s_t, a] can either be learned or estimated using a model of the MDP. For the counter, on the other hand, a decay term can be used in order to incorporate the recency information (see next section) of the state visits. In this case the counter is updated by

C(s) = λ · C(s) + 1 , if s = s_t
C(s) = λ · C(s) , else        for all s   (4.36)

where λ is the decay factor. Counter-based measures with decay can be seen as a combination of counter-based and recency-based measures.
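A decaying tabular state counter according to equation 4.36 can be sketched as follows; the class is purely illustrative and does not correspond to the Toolbox's counter classes.

#include <vector>
#include <cstddef>

// Tabular visit counter with decay (eq. 4.36): all counters are multiplied by
// the decay factor lambda, the counter of the visited state is incremented.
class DecayingStateCounter
{
public:
    DecayingStateCounter(int numStates, double lambda)
        : counts(numStates, 0.0), lambda(lambda) {}

    void visit(int currentState)
    {
        for (std::size_t s = 0; s < counts.size(); ++s)
            counts[s] *= lambda;          // C(s) = lambda * C(s) for all s
        counts[currentState] += 1.0;      // plus 1 for the visited state
    }

    double count(int s) const { return counts[s]; }

private:
    std::vector<double> counts;
    double lambda;
};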

Recency-based measures

This exploration measure estimates the time that has elapsed since a state was last visited; therefore, it is well suited for changing environments. Sutton [48] suggested using the square root of the time ρ(s,a) elapsed since the last selection of a in state s:

χ(s,a) = √ρ(s,a)   (4.37)

Error-based measures

Another way to construct an exploration measure is to use the expected error of the value function in state s. If the error of the value function is large in state s, it is understood that visiting state s and updating V(s) is preferable. We can use the average over the last temporal difference values as an estimate of the expected error ∆V(s).

4.7.3 Model Free and Model Based directed exploration

In this section we will take a more detailed look at calculating the counter-based exploration measure. For counter-based measures, we need an estimate of E_{s_{t+1}}[C(s_{t+1})|s_t, a]. The expectancy can be estimated either by:

• Model Based Method: In this case we have a model of the (stochastic) MDP. The expectancy is given by

E_{s_{t+1}}[C(s_{t+1})|s_t, a] = Σ_{s′} P(s′|s_t, a) · C(s′)   (4.38)

If the model is not given, we can learn the stochastic model (see section 4.8.3).

• Model Free Method: In this case, we can estimate the expectancy by

E_{t+1}[C(s_{t+1})|s_t, a] = (1 − η) · E_t[C(s_{t+1})|s_t, a] + η · C(s_{t+1})   (4.39)

Wyatt [55] empirically confirmed the intuitive statement that model-based exploration methods work better than model-free ones.
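The model-free estimate of equation 4.39 and the resulting exploration measure of equation 4.35 can be sketched as follows for the tabular case. The class and the handling of unvisited state-action pairs are assumptions of this sketch.

#include <vector>
#include <cstddef>

// Model-free counter-based exploration: E[C(s_{t+1}) | s, a] is tracked by an
// exponentially weighted average (eq. 4.39) and combined with the counter of
// the current state to form the exploration measure (eq. 4.35).
class CounterExplorationSketch
{
public:
    CounterExplorationSketch(int numStates, int numActions, double eta)
        : stateCounts(numStates, 0.0),
          expectedNextCounts(numStates, std::vector<double>(numActions, 0.0)),
          eta(eta) {}

    // called for every observed step <s, a, sNext>
    void nextStep(int s, int a, int sNext)
    {
        stateCounts[sNext] += 1.0;
        double &estimate = expectedNextCounts[s][a];
        estimate = (1.0 - eta) * estimate + eta * stateCounts[sNext];   // eq. 4.39
    }

    // chi(s, a) = C(s) / E[C(s_{t+1}) | s, a]  (eq. 4.35)
    double explorationMeasure(int s, int a) const
    {
        double denominator = expectedNextCounts[s][a];
        if (denominator <= 0.0)
            return stateCounts[s] + 1.0;   // unvisited pair: strongly encourage exploration (assumption)
        return stateCounts[s] / denominator;
    }

private:
    std::vector<double> stateCounts;
    std::vector<std::vector<double> > expectedNextCounts;
    double eta;
};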


4.7.4 Distal Exploration

All the discussed exploration measures take only the next state s_{t+1} into account in their calculations. But it is advantageous to base the action decision on future exploration measures as well. In this case, we can either use planning methods in the same way as discussed for V-Planning, or we can learn the future exploration measure as done by Wyatt. We can state the exploration problem in terms of a reinforcement learning problem by taking the exploration measure as the reward signal. For this reward signal, we can define a separate exploration value function (here for the counter-based case)

ψ^π(s) = E_{s′}[C(s′) + γ · ψ^π(s′)|s, π(s)]   (4.40)

and also an exploration action value function

ψ^π(s,a) = E_{s′}[C(s′) + γ · ψ^π(s′)|s, a]   (4.41)

Having formulated the equations for the value function and action value function, we can use the same algorithms to learn the exploration value function ψ(s) as we use for the standard value function V(s). For action selection, the exploration value function is used instead of the immediate exploration measure χ(s,a). For example, the distal counter-based approach uses the following equation to evaluate the merit of an action:

Eval(s,a) = C(s) / ψ(s,a)   (4.42)

Wyatt [55] proposed two different methods for learning the exploration function: a model-free approach and a model-based approach. These approaches completely correspond to the value function learning algorithms TD(λ)-Learning and Prioritized Sweeping, so they are not covered in this section in more detail. The principal difference between an exploration value function and a value function is that the exploration value function is non-stationary, because the reward signal χ(s,a) changes over time. Consequently, calculating the exploration value function is a more complex task than calculating the value function. Nevertheless, Wyatt showed empirically that distal exploration methods can outperform their local counterparts in a gridworld example.

4.7.5 Selective Attention

All the discussed methods use a fixed linear combination of the exploitation and exploration terms. This can be ineffective, particularly if the optimal actions for exploitation and exploration point in exactly opposite directions. In that case the exploration rule might yield an action which neither explores nor exploits. A basic idea for overcoming this problem is to use selective attention to determine the current behavior (exploration or exploitation); a dynamically calculated attention parameter Γ ∈ [0,1] is then used for action selection.

Eval(s,a) = ΓQ(s,a) + (1− Γ) · α · χ(s,a) (4.43)

Consequently, a Γ value of 1.0 totally exploits and a Γ value of 0.0 totally explores. Thrun [51] proposed the following rules to calculate the attention parameter Γ.

κ = Γ_{t−1} · Q_t(s,a)/V_t(s) − (1 − Γ_{t−1}) · χ(s,a)   (4.44)

Γ_t = Γ_min + σ(κ) · (Γ_max − Γ_min)   (4.45)

where σ is a sigmoidal function. Typical values for Γ_min and Γ_max are 0.1 and 0.9. The κ value estimates the exploration-exploitation tradeoff under the current attention setting. The new Γ value is calculated by squashing κ with a sigmoidal function. If the exploitation measure Q_t(s,a)/V_t(s) is large compared to the exploration measure, κ will be positive, resulting in a preference for the exploration action, and vice versa for large exploration measures. Due to the incorporation of the previous Γ value when calculating κ, the Γ values cannot change abruptly, so either exploring or exploiting actions are executed over several time steps.

4.7.6 Implementation in the RL Toolbox

In the Toolbox, only the counter-based exploration measures are implemented directly, but the other exploration schemes can be implemented easily. For example, to estimate the expected error of the value function, we can exploit the already existing error listener interface. For the counter-based methods, we use a value function object as counter, because it already provides all the necessary functionality. We implement two classes to count the state visits and state-action visits (CVisitStateCounter and CVisitStateActionCounter). These classes take a feature V-Function (or feature Q-Function) and increase the value of the current state by one at each step. Through this approach, we can use tables and feature functions (e.g. RBF networks) as counters.

The exploration Q-Function and exploration Policy

The exploration Q-Function CExplorationQFunction calculates the exploration measure

χ(s_t, a) = C(s_t) / E_{s_{t+1}}[C(s_{t+1})|s_t, a]

Thus it takes a value function which represents C(s_t) and an action value function which represents E_{s_{t+1}}[C(s_{t+1})|s_t, a] in the local case, or ψ(s,a) in the distal case. The class is a read-only Q-Function class, so only the getValue function is implemented in the described way. The exploration policy CQStochasticExplorationPolicy implements the sum of an exploitation Q-Function and an exploration Q-Function to calculate the action values. Both are given as abstract Q-Function objects. The sum is calculated by

Eval(s,a) = ΓQ(s,a) + (1− Γ) · α · χ(s,a)

where α and Γ are both parameters which can be set via the parameters interface. Γ is initialized with 0.5. The class is a subclass of the stochastic policy class; because the remaining action selection part is not changed, we can use any action distribution for action selection.
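A sketch of the combination performed by the exploration policy is given below. The real CQStochasticExplorationPolicy works on abstract Q-Function objects and hands the values to an arbitrary action distribution; the interface shown here is an assumption.

#include <vector>

// Combines an exploitation Q-Function and an exploration Q-Function into
// action values according to Eval(s,a) = Gamma*Q(s,a) + (1-Gamma)*alpha*chi(s,a).
struct QFunctionSketch
{
    virtual ~QFunctionSketch() {}
    virtual double getValue(int s, int a) const = 0;
};

class ExplorationPolicySketch
{
public:
    ExplorationPolicySketch(const QFunctionSketch *exploitQ, const QFunctionSketch *exploreQ,
                            double alpha, double attention = 0.5)   // Gamma initialized with 0.5
        : exploitQ(exploitQ), exploreQ(exploreQ), alpha(alpha), attention(attention) {}

    // used, for example, by a selective attention calculator
    void setAttention(double gammaValue) { attention = gammaValue; }

    // action values that are handed to any action distribution (soft-max, eps-greedy, ...)
    std::vector<double> actionValues(int s, int numActions) const
    {
        std::vector<double> values(numActions);
        for (int a = 0; a < numActions; ++a)
            values[a] = attention * exploitQ->getValue(s, a)
                      + (1.0 - attention) * alpha * exploreQ->getValue(s, a);
        return values;
    }

private:
    const QFunctionSketch *exploitQ;
    const QFunctionSketch *exploreQ;
    double alpha;
    double attention;   // Gamma in equation 4.43
};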

Local exploration

For the local exploration case we have two different ways to calculate E_{s_{t+1}}[C(s_{t+1})|s_t, a]:

• Model Based: Here we can use standard planning techniques which are already implemented in the Toolbox. If we have a stochastic model, we can use the class CQFunctionFromStochasticModel; for a deterministic model we can use CQFunctionFromTransitionFunction.

• Model Free: The model-free approach requires an additional learner class CVisitStateActionEstimator, which is derived from the agent listener class and estimates E_{s_{t+1}}[C(s_{t+1})|s_t, a] according to equation 4.39.


Distal exploration

Distal exploration approaches use the same learning algorithms as value (likewise action value) function learning. We just need to define an appropriate reward function. A new reward function class CRewardFunctionFromValueFunction is implemented, which takes a value function as input and always returns the value of the current state as the reward. For this value function the counter is used. Another value function object is needed for learning the exploration value function ψ(s). We can use any of the already existing architectures to do this; we can learn the exploration value function ψ(s) and calculate the action values via planning, or we can learn the exploration action values ψ(s,a) directly. Alternatively, prioritized sweeping (see section 4.8.3) can be used as the model-based approach.

Selective Attention

For selective attention, we implement an additional agent listener class CSelectiveExplorationCalculator. The class takes an exploration policy object and calculates the Γ value according to equations 4.44 and 4.45. It adapts the Γ value of the exploration policy at each step.

4.8 Planning and model based learning

In this section, we will discuss a few planning methods which can be used in combination with learning. At first we will look at the different definitions of planning and learning; then we will discuss the Dyna-Q algorithm and Prioritized Sweeping as an extension of dynamic programming.

4.8.1 Planning and Learning

Planning and learning are two very popular keywords in the area of Artificial Intelligence. But what exactly is the difference between planning and learning? We use the same definition as Sutton [49], p. 227, who distinguished learning from planning in the following way:

• Learning: Improving the policy from real experience. Learning always uses the current step tuple < s_t, a_t, r_t, s_{t+1} > to improve its performance.

• Planning: Produces or improves a policy with simulated experience. Planning methods use models to generate new experience, or experience from the past. This experience is called simulated because it has not been experienced by the agent at time step t. So we see that planning can also improve the policy without executing any action.

We will now take a closer look at state space planning, which is viewed primarily as a search through the state space for a good policy. We already discussed the V-Planning approach and dynamic programming as representatives of this approach. Two basic ideas are very similar to the value-based learning approaches:

• State space planning methods compute value functions to improve their policy.

• The value function is computed by backup operations applied to simulated experience.

Thus the main purpose of using planning methods in combination with learning is to exploit the training data more efficiently.


4.8.2 The Dyna-Q algorithm

The Dyna-Q algorithm, as proposed in [49], integrates planning and learning methods. The original Dyna-Q algorithm uses a standard Q-Learning algorithm without e-traces for estimating the action values. After the common update (the learning part), the algorithm updates the Q-Values for N randomly chosen state-action pairs from the past (the planning part). So the step information serves as a kind of replacement for the e-traces in order to propagate the current temporal difference to past states. The number of planning update steps can vary, depending on how much time is left before the next action decision takes place (for example, in robotic tasks spare computational resources can be spent on these update steps).

The problem is that for large state spaces, we get many different state-action pairs which we have already experienced in the past, so it is unlikely that good state-action pairs with a high temporal difference occur. The advantage over e-traces is that we break the e-traces' temporal order and thereby may discover better action selection strategies. How to sample state-action pairs from the past is still an open research problem. A few approaches [49] attempt to sample the on-policy distribution (the distribution of the state-action pairs visited when following the policy). However, estimating this distribution accurately needs many evaluation steps itself.

Another positive aspect of this approach is that we do not make any assumptions about the used state space. Other planning algorithms like DP only work for discrete state spaces, but here we only need to store the steps taken during the learning trial.

The realization of the Dyna Q-Learning architecture in the Toolbox is quite simple, because all parts of this architecture have already been designed. The Q-Learner already exists and the batch step update class provides a uniform sampling of the past steps. We will investigate the appropriateness of this architecture for other learning algorithms like Q(λ) or SARSA(λ) as well and find out whether or not this can improve the performance of the standard algorithm.
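The following self-contained sketch illustrates the Dyna-Q idea with a tabular Q-Function and uniform sampling of the stored steps; it only demonstrates the principle and does not use the Toolbox classes.

#include <vector>
#include <cstdlib>
#include <algorithm>

struct Step { int s; int a; double r; int sNext; };

// Tabular Dyna-Q: one real Q-Learning update per step, followed by N planning
// updates on randomly chosen steps from the past.
class DynaQSketch
{
public:
    DynaQSketch(int numStates, int numActions, double alpha, double gamma, int planningSteps)
        : Q(numStates, std::vector<double>(numActions, 0.0)),
          alpha(alpha), gamma(gamma), planningSteps(planningSteps) {}

    void nextStep(int s, int a, double r, int sNext)
    {
        qUpdate(s, a, r, sNext);                          // learning part
        Step step = { s, a, r, sNext };
        history.push_back(step);
        for (int i = 0; i < planningSteps; ++i)           // planning part
        {
            const Step &past = history[std::rand() % history.size()];
            qUpdate(past.s, past.a, past.r, past.sNext);
        }
    }

private:
    void qUpdate(int s, int a, double r, int sNext)
    {
        double maxNext = *std::max_element(Q[sNext].begin(), Q[sNext].end());
        Q[s][a] += alpha * (r + gamma * maxNext - Q[s][a]);
    }

    std::vector<std::vector<double> > Q;
    std::vector<Step> history;
    double alpha, gamma;
    int planningSteps;
};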

Figure 4.8: Using the Toolbox for Dyna-Q learning with the agent logger and batch step update classes.


4.8.3 Prioritized Sweeping

Prioritized Sweeping (PS) is a model-based RL method which combines learning with planning (DP). PS performs the same full backups as Dynamic Programming, but attempts to update only the most important states. Therefore, PS maintains a priority list for the states; at each update of a state s, the priorities of all states s′ that can reach s with a single action a are increased by the expected size of the change in their value. The expected change in the value of a predecessor state s′ can be calculated by P(s|s′,a) · b(s), where b(s) is the Bellman error occurring in state s, which is given by the difference of the value before and after the update. Thus the Bellman error can be expressed by the equation

b(s) = Σ_a π(s,a) Σ_{s′} P(s′|s,a) · (r(s,a,s′) + γ · V(s′) − V(s)) = V_new(s) − V_old(s)   (4.46)

In PS the agent actually performs actions, so in contrast to DP, it is not just a planning algorithm. The value of the current state is constantly updated (using the same procedure for the priorities of the predecessor states); this is an additional help in choosing states for the updates. After updating the current state, the states from the priority list can be updated as long as there is time remaining. In addition, PS does not require a completely known model of the MDP; instead the model can be learned online with the information from the performed steps of the agent. The PS algorithm is listed in algorithm 4:

Algorithm 4 Prioritized Sweeping

function addPriorities(state s, b)
    for all actions a do
        for all states s′ that can reach s with a do
            priority(s′) += P(s|s′,a) · b
        end for
    end for

function updateValue(state s)
    V_old ← V(s)
    V_new = Σ_a π(s,a) Σ_{s′} P(s′|s,a) · (r(s,a,s′) + γ · V(s′))
    V(s) ← V_new
    b = V_new − V_old
    addPriorities(s, b)

for each new step < s_t, a_t, r_t, s_{t+1} > do
    update the model parameters
    updateValue(s_t)
    while there is time do
        s′ ← argmax_s priority(s)
        updateValue(s′)
    end while
end for

DP methods use a discrete state representation for planning, but can be easily extended to work with linear features. The planning updates themselves still work in the discrete state representation, but it makes no difference whether these state indices for the DP updates represent discrete state indices or linear feature state indices. Hence, if the model learning part can cope with linear features, we can use linear function approximation for PS as well.


There is another approach called Generalized Prioritized Sweeping [2], which can cope with general parametric representations of the model and the value function. This algorithm was tested with dynamic Bayesian networks on a grid-world example. There was no time to implement this particular algorithm, but these algorithms are not very promising for continuous problems anyway.

4.8.4 Implementation in the RL Toolbox

The priority list algorithm has already been implemented for the value iteration class, so we derive our PS algorithm from that class. Remember that we are able to learn the V-Function or the Q-Function with that approach. The rest of the algorithm is implemented in a straightforward manner within the agent listener interface. We can specify either a number of updates K that are done after each step or the maximum time the updates may take.

Learning the transition Model

In the Toolbox, we provide techniques for learning the distribution model for discrete state representations as well as for linear feature states. In this section, we always refer to discrete integer states instead of state objects; when using feature states, by state indices we always mean single feature indices. For both state representations the same super class CAbstractFeatureStochasticEstimatedModel is used; this class is a subclass of CFeatureStochasticModel and can consequently be used by the DP classes. Our restriction in representing the learned data is that access to the transition probabilities must work in exactly the same way as it works with the fixed model (see the DP section). Therefore the estimated model class also maintains a visit counter for each state-action pair (CStateActionVisitCounter). The transition objects themselves still store only the probabilities.
The number of state-action visits can then be used in combination with the probability to calculate the frequency of occurrence of a specified transition < s, a, s′ >. In order to update the probability of the current transition < s_t, a_t, s_{t+1} >, we calculate the frequency of occurrence for each transition beginning with < s_t, a_t >, increment the number of visits of < s_t, a_t > and increment the calculated number of occurrences of the transition < s_t, a_t, s_{t+1} >. Then we calculate the new probabilities again by dividing all transition counts N(s_t, a_t, s′) by the new visit counter N(s_t, a_t).
This approach can cope with two different state representations: discrete states and linear feature states. For discrete states we just add the occurred transition to the probability matrix as explained. This is performed by the class CDiscreteStochasticEstimatedModel. For feature states, these updates have to be carried out for all combinations of feature indices from the current and the next state. In this case, we do not increment the visit counters by 1.0; rather we increment the visit counter of the transition < f_i, a_k, f_j > by the product of the two feature activation factors.
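For the discrete case, the counting scheme described above can be sketched as follows; the map-based representation is an assumption of this sketch and only mirrors the described update, not the actual Toolbox data structures.

#include <map>
#include <utility>

// Estimated transition model for discrete states: the transition probabilities
// are maintained by converting them back to occurrence counts, incrementing the
// counts of the observed transition, and renormalizing with the visit counter.
class EstimatedModelSketch
{
public:
    void nextStep(int s, int a, int sNext)
    {
        double &visits = stateActionVisits[std::make_pair(s, a)];
        std::map<int, double> &probs = transitions[std::make_pair(s, a)];

        // probabilities * N(s,a) -> occurrence counts N(s,a,s')
        for (std::map<int, double>::iterator it = probs.begin(); it != probs.end(); ++it)
            it->second *= visits;

        visits += 1.0;            // increment N(s,a)
        probs[sNext] += 1.0;      // increment N(s,a,sNext)

        // divide by the new visit counter to obtain probabilities again
        for (std::map<int, double>::iterator it = probs.begin(); it != probs.end(); ++it)
            it->second /= visits;
    }

    double probability(int s, int a, int sNext) const
    {
        std::map<std::pair<int, int>, std::map<int, double> >::const_iterator it =
            transitions.find(std::make_pair(s, a));
        if (it == transitions.end()) return 0.0;
        std::map<int, double>::const_iterator jt = it->second.find(sNext);
        return jt == it->second.end() ? 0.0 : jt->second;
    }

private:
    std::map<std::pair<int, int>, double> stateActionVisits;              // N(s,a)
    std::map<std::pair<int, int>, std::map<int, double> > transitions;    // P(s'|s,a)
};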

Learning the reward Function

Since PS is a DP method, we still need a reward function for our discrete state (or feature) indices. For the reward function, we need to store the reward value for each transition < s_t, a_t, s_{t+1} >. Obviously the reward value for many of the transitions will be zero, since only a few transitions are possible. In our approach, we store a map for each state-action pair < s, a > with all possible successor states s′ as index and the reward as function value. This representation allows quick and efficient access to the reward values. At each step, the reward r_t is added to the current value of the map. To calculate the average, the estimated stochastic models are used to fetch the number of occurrences of the transitions.
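The reward map can be sketched in the same style; the number of occurrences is taken from the estimated stochastic model and is passed in directly here, which is a simplification of this sketch.

#include <map>
#include <utility>

// Stores the accumulated reward for each observed transition <s, a, s'>; the
// average reward is obtained by dividing by the number of occurrences of the
// transition, which the estimated stochastic model provides.
class EstimatedRewardSketch
{
public:
    void nextStep(int s, int a, int sNext, double reward)
    {
        rewardSums[std::make_pair(s, a)][sNext] += reward;
    }

    double averageReward(int s, int a, int sNext, double numOccurrences) const
    {
        if (numOccurrences <= 0.0) return 0.0;
        std::map<std::pair<int, int>, std::map<int, double> >::const_iterator it =
            rewardSums.find(std::make_pair(s, a));
        if (it == rewardSums.end()) return 0.0;
        std::map<int, double>::const_iterator jt = it->second.find(sNext);
        return jt == it->second.end() ? 0.0 : jt->second / numOccurrences;
    }

private:
    // one map of successor states (with accumulated rewards) per state-action pair
    std::map<std::pair<int, int>, std::map<int, double> > rewardSums;
};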


Chapter 5

Hierarchical Reinforcement Learning

One of RL’s principal problems is the curse of dimensionality: the number of parameters increases expo-nentially with the size of a compact state representation. One approach to combatting this curse is temporalabstraction of the learning problem. That is decisions do not have to be made at each step, rather temporallyextended behaviors are invoked which then follow their individual policies. Adding a hierarchical controlstructure to a control task is a popular way for achieving this temporal abstraction. In this chapter we beginwith a theoretical section on hierarchical RL, in particular SemiMDP’s and options. Then we briefly discussthree approaches to hierarchical RL, learning with Options [40], Hierarchy of Abstract Machines (HAM,[36]) and MAX-Q learning [16]. After this theoretical part, we will take a look at hierarchical approachesused for continuous control tasks. We conclude with the RL Toolbox implementation of hierarchical RL.

5.1 Semi Markov Decision Processes

Semi Markov Decision Processes (SMDPs), as defined in [40], have the same properties as MDPs, with one exception: now the actions are temporally extended. Each temporally extended action defines its own policy and thus returns primitive actions. We will call these temporally extended actions 'options'. An option o_i ∈ O has the following parts:

• The initiation set I ⊆ S is the set of all states where the option can be initiated.

• Either a stochastic policy µ_i : S × A → [0,1] or a deterministic policy π_i : S → A.

• A termination condition β_i

For Markov options, the termination condition β and the policy π depend only on the current state s_t. In this case, we can assume that all states where the option does not have to end are also part of the initiation set of the option. Consequently, it is sufficient to define Markov options only on their initiation set. However, in many cases we also want to terminate an option in another way; for example, terminating the option after a given period of time contradicts the Markov option property (the termination condition then depends not on the current state, but on the sequence of past states). Therefore we soften the requirements for β and the policy. For semi Markov options, the termination condition and the policy may depend on the entire sequence of states and primitive actions that occurred since the option was initiated. We will call this sequence < s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, ..., r_τ, s_τ > the history of option o, which began at time step t. The set of all possible histories is called Ω. Thus we have β : Ω → [0,1] and π : Ω → A. Note that the composition of two Markov options is in general not Markov, because the actions are selected differently before and after the first option has ended; that is, a composed option also depends on the history.



Policies over options select an option o ∈ O according to a given probability distribution µ(s,o) and execute it until the termination condition is met; then a new option is selected. The options themselves return primitive actions, which are executed by the agent. A policy which returns the primitive actions arising from the current option is called a flat policy. In general, the flat policy is semi Markov, even if the policy µ itself and the options are Markov (since a flat policy is a composition of options). This implies that for hierarchic architectures, where options can in turn choose from options (and so define a policy over options), we always have to deal with semi Markov options. For any MDP and any option set O defined on this MDP, the decision process which selects from those options is a Semi-MDP. Note that primitive actions can also be seen as options, always terminating after one step.

Figure 5.1: Illustration of the execution of an MDP, an SMDP and an MDP with options.

5.1.1 Value and action value functions

Of course we must also consider the options in our V and Q-Functions. The value function of an SMDP is defined as

V^π(s_t) = E[r_{t+1} + γ · r_{t+2} + · · · + γ^{k−1} · r_{t+k} + γ^k · V^π(s_{t+k})]   (5.1)

For the recursive form of the value function, we need some additional definitions. Let r(s,o) = E[r_t + · · · + γ^{k−1} · r_{t+k−1} | s_t = s, o_t = o, s_{t+k} = s′] be the expected reward when executing option o in state s. The value function can then be expressed by:

V^π(s) = Σ_{o∈O} µ(s,o) · [ r(s,o, s′) + Σ_{s′} P(s′|s,o) · Σ_k P(k|s,o, s′) · γ^k · V^π(s′) ]   (5.2)

where P(s′|s,o) is the probability that the executed option terminates in s′, and P(k|s,o,s′) is the probability that option o ends after k steps if o is already known to end in state s′. The corresponding equation for the Q-Function is:

Q^π(s_t,o) = E[r_{t+1} + γ · r_{t+2} + ... + γ^{k−1} · r_{t+k} + γ^k · V^π(s_{t+k}) | o_t = o]
          = r(s,o) + Σ_{s′} P(s′|s,o) · Σ_k P(k|s,o, s′) · γ^k · Σ_{o′∈O} π(s′,o′) · Q^π(s′,o′)   (5.3)


Similar to the MDP approach, we arrive at the optimal value function V∗ and action value function Q∗ by using a greedy policy for π.

5.1.2 Implementation in the RL Toolbox

Some differences arise naturally when implementing the learner classes for semi-MDP learning. In this case, we need to represent the duration of an action and incorporate the duration information into the Dynamic Programming and Temporal Difference algorithms.

Temporally Extended Actions in the RL Toolbox

We have already discussed multi-step actions in the action model; now we know the precise requirements of these actions. The multi-step action (CMultiStepAction) has two additional data fields: the number of steps the action has been executed up to a certain point, and whether the action has been completed in the current step. The duration and the finished flag are updated in every step. Whenever an action has finished, the step information containing the duration field of the action is sent to all listeners and a new action is selected. The isFinished method initially depends only on the current state, so it is primarily designed for Markov options. If we want to use semi Markov options, the action object additionally needs access to an episode object, or access to its duration field if the duration of the option is all that is needed.
Having discussed how the duration information is passed to the listeners, we can now look at the differences to the algorithms which we have already discussed for the MDP case. All algorithms support SMDP learning automatically if they recognize multi-step actions (unless told otherwise explicitly), so we do not need new algorithm classes. Naturally, there are differences in estimating the value or action value function, which will be discussed in the following sections.

Dynamic Programming

The DP approach uses the SMDP equations in order to calculate the value or action value function for its iterations. Therefore, we also need to represent the probabilities of the durations of a transition. For Semi-MDPs, since we have to store the probabilities P(s′,k|s,o) = P(s′|s,o) · P(k|s,o,s′), we also have to store the duration probabilities. This is accomplished in the same way as for single-step actions (see class CAbstractFeatureStochasticModel); in fact it is done by the same class. We simply use another type of transition object, which is used for every multi-step action automatically. In addition to the transition probabilities, these objects store the relative probabilities of the durations of that transition. Therefore, an individual map for the durations is maintained in the transition object, which can be retrieved by an extra function call of the transition object (getDuration). The probability of the transition is retrieved in the same way as for normal transitions (in order to allow the use of algorithms which cannot deal with Semi-MDPs).
We also have to extend the stochastic model learner classes CFeatureEstimatedModel and CDiscreteEstimatedModel. For SMDP transitions, the probabilities of the transitions are updated in the same way as for single-step actions. For MDP transitions, the number of visits of the state-action pair < s, a > is used to calculate the frequency with which the transition < s, a, s′ > has occurred in the past. Carrying out the update for the relative probabilities P(k|s_t, a_t, s_{t+1}) of SMDP transitions is similar to updating the probability of a transition. Before incrementing the visit counter N(s_t, a_t, s_{t+1}) of the occurred transition, we multiply the relative probabilities of the durations by this visit counter. Then the counter of the occurred duration k_t is incremented; finally we divide again by the new visit counter of < s_t, a_t, s_{t+1} > to arrive at the probabilities.


Since for multi-step actions the SMDP transition objects and the SMDP update rule are used automatically, we can use the DP and the Prioritized Sweeping classes for SMDP learning. The reward function used for SMDP learning obviously also has to consider the options and return r(s,o). The Toolbox does not contain a method to calculate this reward function or the transition probabilities P(s′,k|s,o) automatically from the flat model of the reward or transition function and the model of the option; all possible trajectories from s to s′ in k steps would have to be considered to get the mean reward and the transition probabilities. But the transition probabilities and the reward of the options can be learned with the Toolbox.

Temporal Difference Methods

As already discussed in the previous section, TD methods update their value functions with a one-step sample backup. For SMDP learning we have the following equations for the temporal difference:

td = r^o_s + γ^k · V(s′) − V(s)   (5.4)

for the V-Function update and

td = r^o_s + γ^k · Q(s′,o′) − Q(s,o)   (5.5)

for the Q-Function update. r^o_s is a sample of the expected option reward r(s,o); thus it is given by r^o_s = Σ_{i=0}^{k−1} γ^i · r_{t+i}. Thereby the differences to the original algorithm are the need to calculate the discounted reward received during the execution of the option and the need to exponentiate the discount factor by the duration of the action. For calculating the reward of an option, we use an individual reward function class. Then the only difference for the TD-learning classes is that we exponentiate the discount factor by the duration if a multi-step action is used. The SMDP reward function class stores all rewards coming from the flat reward function during an episode. To calculate the reward of the step < s_t, o_t, s_{t+d} >, the reward function retrieves the duration d of option o_t and then sums up (with discounting) the last d flat reward values.
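The SMDP variant of the Q update (equation 5.5) can be sketched as follows: the option reward is accumulated with discounting while the option is running, and the discount factor is exponentiated by the duration. The function and class names are illustrative.

#include <cmath>

// One SMDP Q-update (eq. 5.5): optionReward is r_s^o = sum_{i=0}^{k-1} gamma^i * r_{t+i},
// collected while the option ran for 'duration' steps; the updated Q(s,o) is returned.
double smdpQUpdate(double qSO, double qSNextONext, double optionReward,
                   int duration, double gamma, double eta)
{
    double td = optionReward + std::pow(gamma, duration) * qSNextONext - qSO;
    return qSO + eta * td;
}

// Accumulates the discounted option reward from the flat rewards of each step.
class OptionRewardAccumulator
{
public:
    explicit OptionRewardAccumulator(double gamma) : gamma(gamma), reward(0.0), discount(1.0) {}

    void addFlatReward(double r)
    {
        reward += discount * r;   // gamma^i * r_{t+i}
        discount *= gamma;
    }

    double optionReward() const { return reward; }
    void reset() { reward = 0.0; discount = 1.0; }

private:
    double gamma;
    double reward;
    double discount;
};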

Figure 5.2: Temporal Difference Learning with options.

5.2 Hierarchical Reinforcement Learning Approaches

We have discussed SMDPs and options as the theoretical framework of all hierarchical methods. Now we take a look at three different hierarchical structures that have been proposed in the literature. We will discuss the option approach in more detail [40], then cover the Hierarchy of Abstract Machines (HAM, [36]) and the MAX-Q value decomposition approach [16]. Each of these approaches is discussed only theoretically, because they are not directly part of the Toolbox. However, they inspired the design of the hierarchic structure implemented in the Toolbox. While the Toolbox only supports the general options approach, it allows a deeper hierarchy: options can in turn choose from other options. The other frameworks can be described using this general options framework, so the Toolbox can be extended to support the discussed approaches easily.

5.2.1 The Option framework

We have already defined the options framework, which is explained in more detail in [40]. The learning rules of the options framework can be extended for Markov options. Markov options can be initiated in every state where they are active (so it suffices to define them by their initiation set I). TD learning methods apply an update for the tuple < s, o, s′ > only after the execution of one option. But for Markov options we can also update the states which were visited during the execution of the option. And we can further update the option values of the other options which are not currently active, if the current state is in their initiation set and the policy of that option could have selected the action a_t in this state. This is the motivation for the intra-option learning method. The intra-option Q-Learning algorithm works as follows: for each step < s_t, a_t, s_{t+1} >, the algorithm updates the option values of all options that could have executed the action a_t (so µ_i(s_t, a_t) > 0):

Q∗(s_t, o) ←_η r_{t+1} + γ · U(s_{t+1}, o),   for all o   (5.6)

U∗(s,o) = (1 − β(s)) · Q(s,o) + β(s) · max_{o′∈O_s} Q(s,o′)   (5.7)

U(s,o) is the value of state s if we have executed option o before arriving in state s. The value is either the value of the option in state s if the option does not terminate (with probability 1 − β(s)), or, if a new option is chosen (with probability β(s)), the maximum option value (or any other option value, depending on the estimation policy) of all options available in state s. If all the options in O are deterministic and Markov, the algorithm will converge to Q∗. This approach can only be applied to Markov options, so we cannot use it for a more sophisticated hierarchic architecture. The primary motivation for the option framework is to allow the addition of temporally extended behaviors to the action set without precluding the choice of primitive actions. The resulting task might then be easier to learn because the goal is attainable in fewer decision steps.
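A tabular sketch of the intra-option update of equations 5.6 and 5.7 is shown below; the option set is supplied through simple function pointers, and for simplicity the maximum in equation 5.7 is taken over all options (i.e. all options are assumed to be available in the successor state). This is an illustration, not Toolbox code.

#include <vector>
#include <cstddef>
#include <algorithm>

// A Markov option described by its stochastic policy mu(s,a) and its
// termination probability beta(s).
struct MarkovOptionSketch
{
    double (*mu)(int s, int a);
    double (*beta)(int s);
};

class IntraOptionLearnerSketch
{
public:
    IntraOptionLearnerSketch(int numStates, const std::vector<MarkovOptionSketch> &options,
                             double eta, double gamma)
        : Q(numStates, std::vector<double>(options.size(), 0.0)),
          options(options), eta(eta), gamma(gamma) {}

    // after each primitive step <s, a, r, sNext>, update every option whose
    // policy could have selected action a in state s
    void nextStep(int s, int a, double r, int sNext)
    {
        for (std::size_t o = 0; o < options.size(); ++o)
        {
            if (options[o].mu(s, a) <= 0.0)
                continue;
            double b = options[o].beta(sNext);
            double maxQ = *std::max_element(Q[sNext].begin(), Q[sNext].end());
            double U = (1.0 - b) * Q[sNext][o] + b * maxQ;      // eq. 5.7
            Q[s][o] += eta * (r + gamma * U - Q[s][o]);          // eq. 5.6
        }
    }

private:
    std::vector<std::vector<double> > Q;
    std::vector<MarkovOptionSketch> options;
    double eta, gamma;
};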

5.2.2 Hierarchy of Abstract Machines

Parr [36] developed a hierarchic structure approach called Hierarchy of Abstract Machines (HAM). This approach attempts to simplify a complex MDP by restricting the number of realizable policies rather than by adding more action choices. The higher hierarchy level supervises the behaviors and intervenes when the current state enters a boundary state; then a new low-level behavior is selected. This is very similar to hybrid control systems, except that the low-level behavior can be formulated as an MDP and the high-level process as an SMDP.
The idea of HAM is that the low-level policies of the MDP work as programs, based on their individual state and the state of the MDP. So every behavior H_i is a finite state machine. Each H_i has four types of states: action, call, choice and stop. The action state determines an action to be executed by the MDP; the action selection is based on the current state of the finite state machine H_i and of the MDP. Thus the behavior H_i defines a policy π(m^i_t, s_t). The call state interrupts the execution of H_i and executes another finite state machine H_j until H_j has finished; then the execution of H_i is continued. The choice state selects a new internal state of H_i, and a stop state obviously stops the execution of H_i. Parr defines a HAM as the initial states of all machines together with all states reachable from these initial states.
In figure 5.3 we can see the structure of a simple HAM for a navigation task in a gridworld. Each time an obstacle is encountered, a choice state is entered. The agent can choose whether to back away from the obstacle or to try to avoid the obstacle by following the wall.

Figure 5.3: State transition structure of a simple HAM, taken from Parr [36]

The composition H ◦ M of a HAM H and an MDP M defines an SMDP. The actions are the choices allowed in the choice states; these actions can only change the internal state of the HAM. This framework defines an SMDP because after the choice state, the HAM runs independently until the next choice state occurs. So for action selection, only those states of the MDP and internal states of the HAM which are possible in the choice states must be considered, which reduces the complexity of the problem considerably. HAM drastically reduces the possible set of policies, depending on the designer's prior knowledge of efficient ways to control the MDP M. We can use the standard SMDP learning rules for the choice states of the HAM. The state of the SMDP is defined as [s_c, m_c] (where s_c is the state of the MDP and m_c is the state of the HAM). So for two successive choice states we get the following update for Q-Learning:

Q([s_c, m_c], a_c) ←_η r_t + γ · r_{t+1} + · · · + γ^{τ−1} · r_{t+τ−1} + γ^τ · max_{a′_c} Q([s′_c, m′_c], a′_c)   (5.8)

Parr and Russell [2] illustrate the advantages of HAMs for simulated robot navigation tasks, but no larger-scale problem is known to have been solved with HAM.

5.2.3 MAX-Q Learning

MAX-Q learning was proposed by Dietterich [16]. MAX-Q tries to decompose the value function hierarchically. MAX-Q builds a hierarchy of SMDPs, where each SMDP is learned simultaneously. MAX-Q begins with a decomposition of the MDP M into n subtasks < M_0, M_1, . . . , M_n >. In addition, a hierarchic tree structure is defined; thus we have a root subtask M_0, which means that solving M_0 solves M. Each subtask can have the policy of another subtask in its action set (so the actions are branches of the tree).


The hierarchic structure is visualized in a task graph, as shown in figure 5.5(b) for the taxi task, which is used as a benchmark by Dietterich. The problem is to pick up a passenger at a specific location and get him to a particular destination in a grid-world. In the initial setting there are four different destinations and locations (R, G, Y, B), as illustrated in figure 5.4(a). This is a shortest-time problem, so the agent gets a negative reward for

Figure 5.4: Illustration of the taxi task, taken from Dietterich [16]

each step. We have six primitive actions: one for each direction, one for picking up the passenger and one for dropping him off again. This problem can be divided into two primary subtasks: getting the passenger from his initial location and putting him at his final destination. As we can see, these subtasks in turn consist of subtasks. The get subtask can either navigate to one of the possible locations or pick up the passenger at the current location. The put subtask can drop the passenger off or use the same navigation subtasks. A navigation subtask is parameterized with the location; an individual subtask exists for each parameter value. Clearly, the subtasks can be used by several other subtasks. Because the choice of actions (subtasks) is always made by the policy, the order of a subtask's child nodes is unimportant.

Figure 5.5: MAX-Q task decomposition for the Taxi task, taken from Dietterich [16]

A subtask M_i consists of a policy π_i which can select other subtasks (including primitive actions), a set of active states where the subtask can be executed, a set of termination states and a pseudo reward function. The pseudo reward function is only used to learn a specific subtask and does not affect the hierarchic solution. Note that a subtask has the same definition as an option with an additional pseudo reward function.


Each subtask defines its own SMDP by its state set S_i and its action set A_i, which consists of the subtask's children. The transition probabilities P(s′,k|s,a) are also well defined by the policies of the subtasks. The main feature of MAX-Q is that we can now decompose the value function of the MDP with the help of the hierarchic structure. In each subtask M_i we can learn an individual value function V^π(i,s), defined through our SMDP learning rules. The reward gained during the execution of a subtask is estimated by its value function, so we can express the reward values of an option with the learned value function.

V^π(i, s) = V(π(s), s) + Σ_{s′,k} P(s′,k|s,π(s)) · γ^k · V^π(i, s′)   (5.9)

The definition of the Q-Function can also be extended to the subtask approach by

Q^π(i, s, a) = V(a, s) + Σ_{s′,k} P(s′,k|s,a) · γ^k · Q^π(i, s′, π(s′))   (5.10)

The second term is called the completion function C, which represents the expected reward the agent will get after he has executed subtask a. Thus C is defined as

C^π(i, s, a) = Σ_{s′,k} P(s′,k|s,a) · γ^k · Q^π(i, s′, π(s′))   (5.11)

So we can write for Q:

Q^π(i, s, a) = V(a, s) + C^π(i, s, a)   (5.12)

This decomposition can also be done for subtask a, yielding for all active subtasks from the root task M_0 to the primitive action M_k the following form of the value function for subtask M_0 (and thus for the MDP):

V^π(0, s) = V^π(a_k, s) + C(a_{k−1}, s, a_k) + C(a_{k−2}, s, a_{k−1}) + ... + C(0, s, a_1)   (5.13)

The value of a primitive action’s subtask (remember that primitive actions are themselves considered assubtasks) is defined as the expectancy of the reward in states executing primitive actionak.

V^π(a_k, s) = Σ_{s′} P(s′|s,a_k) · r(s, a_k, s′)   (5.14)

The decomposition is the basis of the learning algorithm; the C-Function can be learned with temporal difference methods. If a pseudo reward function is used to guide the subtask to specified subgoals, we have to learn two C-Functions: an internal C-Function for the policy of the subtask and an external C-Function for the value decomposition.
In the described algorithm, the policy of each subtask converges to an optimal policy individually. Therefore, the hierarchical policy consisting of < π_0, π_1, . . . , π_n > can only converge to the recursively optimal policy. Recursively optimal policies do not take the context of the subtask into consideration (i.e. which subtasks are active in the higher hierarchy levels). For example, the optimal solution for navigating to a destination could also depend on what we intend to do after arrival. To learn the hierarchically optimal policy, we would have to include the context of the subtask in the state space. But by doing this, we would lose the ability to reuse subtasks as descendants of several other subtasks.
The MAX-Q algorithm has shown good results for the taxi problem, as Dietterich illustrates using different settings. Yet the problem seems to be too simple to estimate the benefits of MAX-Q well. The MAX-Q approach was also used successfully in the multi-agent control of an automated guided vehicles scheduling task [27]. It outperformed the commonly used heuristics and thereby became one of very few successful multi-agent learning examples. The taxi domain is also part of the Toolbox; even if MAX-Q and HAMs are not implemented, this example task can be used to experiment with the hierarchic structure of the Toolbox.
The preceding paragraphs were only a brief overview of the hierarchic algorithms to demonstrate how concepts from programming languages can fit into the RL framework. These approaches also influenced the design of the hierarchic structure in the Toolbox.

5.2.4 Hierarchical RL used in optimal control tasks

Only a few researchers have tried hierarchical RL for optimal control tasks. Most of the algorithms and frameworks were only tested on discrete domains like the taxi domain from Dietterich [16], or the simulated robot navigation task in a grid world used for the HAM algorithms. To our knowledge, none of the introduced architectures have been used for continuous optimal control tasks, but there are a few other approaches which use other, usually simplified hierarchic architectures.

Using Predefined Subgoals

Morimoto and Doya [30] used a subgoal approach for the robot stand-up problem (see 1.3). The basic idea is to divide a highly non-linear task into several local, high-dimensional tasks and a global task with lower dimensionality. The subgoals are predefined by target areas in the (continuous) state space. Each subgoal has its own reward function, which depends on the distance to the target area of the subgoal. Each subgoal is learned independently with an independent value function (in the case of Morimoto, an Actor-Critic algorithm was used). The individual reward function of the subgoal simplifies the global problem drastically. If a subgoal reaches its target area, the subgoal has finished and a new subgoal is selected by the upper-level controller. At the upper level, the sequence of the subgoals can be fixed or, alternatively, be learned by a Q-Learning algorithm in a reduced state space.

Figure 5.6: Subgoals defined for the robot stand-up task, taken from Morimoto [29]

Morimoto specified three different subgoals, as illustrated in figure 5.6, each of which defines a different posture of the robot (so the velocities of the joint angles do not matter for reaching a subgoal). With the help of these subgoals, the learning time and the needed number of RBF centers drop drastically.

RL with the via-point representation

The goal of this approach, also used by Morimoto [31], is to learn 'via-points' with an Actor-Critic architecture. A 'via-point' defines a certain point in the state space the trajectory should reach at a certain time. The actor tries to learn good via-points for the trajectory and at which time t_n these via-points should be reached. This information is then used by a local trajectory planning controller to create the control vector. As a result, the algorithm can choose its own time scale for executing an action. When the algorithm was used for the cart-pole swing-up problem, it outperformed the flat architecture.

Hierarchic Task Composition

The only approach that employs a complex hierarchical framework is a method used by Vollbrecht [53] for the TBU (Truck-Backer-Upper) example. There are three different categories of subtasks: avoidance tasks, goal-seeking tasks and state-maintaining tasks. These groups of tasks may interact via the veto principle, the subtask principle and the perturbation principle. The tasks are learned individually in a bottom-up manner, so that all subtasks which are needed by a higher subtask T_H are learned in individual learning trials before the subtask T_H can be used.
In the veto principle, an avoidance task T_1 may veto an action selected by task T_2. The first task learns in isolation which actions lead, for example, to a collision. When the second task is learned, the agent may take only those actions which are not predicted by the first subtask to lead to a collision.
In the subtask principle, a task T can choose from several subtasks T_i. If T_i has been chosen by T, one action is executed by T_i, then a new subtask is selected by T no matter whether T_i has finished. It is assumed that the task T learns the composition of all goals of its subtasks T_i, which is only possible if certain conditions are met for the goals of the subtasks T_i.
For the perturbation principle, we have two hierarchically related subtasks T_H and T_L, where T_H is at the higher hierarchical level. If T_H perturbs the goal state of the lower-level task T_L by executing action a, the lower task T_L is activated until it has reached its goal state once more. Then the control returns to T_H again. The advantage of this approach is that the high-level subtask's state space can be reduced to the goal area of the low-level subtask, since if this area is left, T_L immediately interrupts the execution of T_H. It is also clear that this only works for certain kinds of subtasks; for example, it is possible that the subtask T_L always does the inverse action of the high-level task T_H to restore the goal state again, which would not be the desired effect.
The approach worked well for the TBU task, which is probably due to the specific hierarchic nature of the task. Intuitively it is hard to scale this approach to a general framework, and no other usage of this approach is known. While this approach is totally different from the option approach, it shows the possibilities of interaction between the different hierarchic levels very well.

5.3 The Implementation of the Hierarchical Structure in the Toolbox

The hierarchical structure in the Toolbox is a mixture of the option framework and the MAX-Q framework, but the HAM approach can also be implemented easily. As has already been mentioned, the standard SMDP learning rules are implemented, but there are no algorithms for intra-option learning or the MAX-Q value decomposition.

5.3.1 Extended actions

In the Toolbox, we use extended actions (CExtendedAction) as options. An extended action is a multi-step action, so it is a temporally extended action; it stores the duration and finished flag and has its individual termination condition. Additionally, the extended action defines its own policy interface, so it has to return another action for a given state with the function getNextHierarchyLevel. This action can in turn be either an extended action or a primitive action; consequently there can be more than one extended action active at a time. The design is modeled similarly to the hierarchy levels of the task graph explained for MAX-Q learning. Extended actions cannot be directly executed by the agent, as the agent can only execute primitive actions. We therefore need a hierarchic controller which manages the appropriate execution of the active extended actions.
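The essential interface of an extended action can be sketched as follows; the member names follow the text, but the exact Toolbox declarations differ, so this is only an illustration.

// Sketch of the extended-action interface: an option stores its duration and
// finished flag, defines a termination condition, and returns the action of
// the next (lower) hierarchy level for the current state.
class ActionSketch
{
public:
    virtual ~ActionSketch() {}
};

class ExtendedActionSketch : public ActionSketch
{
public:
    ExtendedActionSketch() : duration(0), finished(false) {}

    // policy of the option: returns either another extended action or a
    // primitive action for the given state
    virtual ActionSketch *getNextHierarchyLevel(int state) = 0;

    // termination condition; for Markov options it depends only on the state
    virtual bool isFinished(int state) const = 0;

    int duration;      // number of primitive steps executed so far
    bool finished;     // set when the termination condition is met
};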

5.3.2 The Hierarchical Controller

The hierarchical controller (CHierarchicalController) has three functionalities: it manages the execution of the active extended actions, builds a hierarchic stack, and sends this hierarchic stack to the specified hierarchic stack listeners. The hierarchic stack is a list of all actions that are currently active, so it begins with the root action a_0 and ends with a primitive action a_k. The hierarchic controller also has to be used as the controller for the agent, because it returns the primitive action which was returned by the last extended action.
The hierarchic controller contains a reference to the root action of the hierarchy. From this root action, it builds the hierarchic stack by recursively calling the getNextHierarchyLevel function of the extended actions until a primitive action is reached. All these actions are stored in the hierarchic stack, beginning with the root action and ending with the primitive action. This primitive action is then returned to the agent through the standard controller interface. The agent executes this action and sends the step information < s_t, a_t, s_{t+1} > to the listeners as usual. Note that through this approach, the agent's listeners are always informed about the flat policy, so they do not get any information about the extended actions which have been used.

Figure 5.7: The hierarchic controller architecture of the Toolbox as the main part of the hierarchic framework.

The hierarchic controller also implements the agent listener interface, so it gets the step information too. With the step information it can update the hierarchic stack. The duration of the primitive action is added to each extended action in the action stack (note that even primitive actions can have different durations, which are generally fixed or set by the environmental model). It also calls the isFinished method of each extended action and updates the isFinished flag of the action with this information. If an extended action ends in the current state, the finished flags of all other extended actions following it on the stack (i.e. actions with a lower hierarchy level) are set too. After updating the hierarchic stack h, the hierarchic step information < s_t, h, s_{t+1} > is sent to all hierarchic stack listeners. Hierarchic stack listeners define almost the same interface as common listeners, but they get the hierarchic stack instead of an action object as the argument of the nextStep interface.
After sending the hierarchical step information, the primitive action and all terminated extended actions are deleted from the stack. At the next call of the getNextAction interface, the stack is filled again with the action objects from the getNextHierarchyLevel calls. We also provide a parameter for setting the maximum number of hierarchic execution steps. Each extended action (except the root action) is terminated if it has been executed for longer than this maximum number of steps. This parameter can, for example, be used to slowly converge to a flat policy again when the hierarchy levels have already been learned (this was also tried by Dietterich for his MAX-Q approach).
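The stack handling of the hierarchic controller can be sketched with a simplified loop; the real CHierarchicalController additionally forwards the stack to the hierarchic stack listeners and handles the maximum number of execution steps. The minimal action types are re-declared here so the sketch is self-contained.

#include <vector>
#include <cstddef>

struct Action { virtual ~Action() {} };

struct ExtendedAction : Action
{
    int duration;
    bool finished;
    ExtendedAction() : duration(0), finished(false) {}
    virtual Action *getNextHierarchyLevel(int state) = 0;
    virtual bool isFinished(int state) const = 0;
};

// Simplified hierarchic controller: builds the stack from the root action down
// to a primitive action, and after each agent step updates durations and
// finished flags and pops the primitive action and all terminated actions.
class HierarchicControllerSketch
{
public:
    explicit HierarchicControllerSketch(ExtendedAction *root) : root(root) {}

    // returns the primitive action for the current state and (re)fills the stack
    Action *getNextAction(int state)
    {
        if (stack.empty())
            stack.push_back(root);
        Action *current = stack.back();
        ExtendedAction *extended;
        while ((extended = dynamic_cast<ExtendedAction *>(current)) != 0)
        {
            current = extended->getNextHierarchyLevel(state);
            stack.push_back(current);
        }
        return current;   // primitive action, executed by the agent
    }

    // called with the step information after the agent has executed the action
    void nextStep(int newState, int primitiveDuration)
    {
        bool terminate = false;
        for (std::size_t i = 0; i < stack.size(); ++i)
        {
            ExtendedAction *extended = dynamic_cast<ExtendedAction *>(stack[i]);
            if (!extended) continue;
            extended->duration += primitiveDuration;
            // once one extended action has finished, all lower levels finish too
            terminate = terminate || extended->isFinished(newState);
            extended->finished = terminate;
        }
        // remove the primitive action and all terminated extended actions
        while (!stack.empty())
        {
            ExtendedAction *extended = dynamic_cast<ExtendedAction *>(stack.back());
            if (extended != 0 && !extended->finished) break;
            stack.pop_back();
        }
    }

private:
    ExtendedAction *root;
    std::vector<Action *> stack;
};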

5.3.3 Hierarchic SMDP’s

Now we know how the hierarchic controller builds the hierarchic stack and therefore how the hierarchic policy is executed, but we also want to learn in the different hierarchy levels. For learning in different hierarchy levels we have to send the step information < s_t, o_t, s_{t+k} >; we also want to offer almost the same functionality as for the agent, like using more than one listener or setting an arbitrary controller as the policy of the SMDP.
A hierarchical SMDP (CHierarchicalSemiMarkovDecisionProcess) has a set of options O, an individual policy π which can choose from these options, and a termination condition. It must also obviously be a member of the hierarchy structure, so it is a subclass of the extended action class. Additionally, the hierarchic SMDP class is a subclass of the standard semi-MDP class (CSemiMarkovDecisionProcess). This class is the super class of the agent and supports the agent's primary functionality for managing the listeners list, like sending the step information to all specified listeners. The semi-MDP class already maintains a controller object and is itself a deterministic controller, storing the current action to be executed; hence we do not have to calculate the action in the current state twice. The specified controller of a hierarchic SMDP is used for the getNextHierarchyLevel function of the extended action interface, meaning that it implements the policy of the extended action.
In addition, the hierarchical SMDP implements the hierarchical step listener interface, so it retrieves the hierarchic stack from the hierarchic controller at each step. Once the hierarchic SMDP has been executed, it becomes a member of the stack. If that is the case, it searches for the next action on the stack (i.e. the action which the policy of the SMDP has selected). If this action has been completed in the current step, the step information < s_t, o_t, s_{t+k} > can be sent to all listeners. We get the hierarchical step information from the hierarchic controller, but this information contains only the states < s_{t+k−1}, h, s_{t+k} >. In order to find the correct initial state s_t of the executed option, the hierarchic SMDP class contains an extra state collection object which always stores the state s_t at the beginning of an option.
For this approach, all hierarchical SMDPs have to be added to the hierarchical step listener list of the hierarchic controller. See also figure 5.8 for an illustration of the class system.

5.3.4 Intermediate Steps

For Markov options we can also use the intermediate states (as done in intra-option learning) for the value updates. A Markov option could have been started in any of the intermediate states that occurred during the execution of the option. Hence, we can create additional step information <s', o_t, s_{t+τ}> for every s' in s_{t+1}, ..., s_{t+τ-1} and send it to the listeners. To retrieve this intermediate step information, the hierarchic SMDP needs access to an episode object, which has to be specified in the constructor if intermediate steps are used.


Figure 5.8: The hierarchic Semi-MDP is used for learning in different hierarchy levels.

This step information must be treated differently because these steps do not correspond to an ascending temporal sequence. The steps sent by the standard agent listener interface are supposed to be temporally sequenced as long as no new episode has started. For this reason, we use a separate interface function called intermediateStep, which is added to the agent listener interface but does not have to be implemented. The intermediate steps are sent after the standard step information; for every intermediate step, the duration of the option o_t is also set correctly. For extended actions, we can choose whether we want to send the intermediate step information to the listeners or not. We implement an individual intermediate step treatment for the TD learning algorithms; all other algorithms ignore it.

Figure 5.9: Intermediate steps of an option can also be used for learning updates when using Markov options.


TD-learning with intermediate steps

TD learning with intermediate steps has the same effect as intra-option learning. The problem with intermediate steps is their missing temporally ascending order, so we cannot add them to the e-traces in the usual way. But we can carry out the standard TD-learning update without e-traces (these updates work for any step information and do not require a temporal sequence), and we can also add the intermediate states to the e-trace list, because these states are also predecessor states of s_{t+τ}. This has to be done after the normal TD(λ) update of the standard step information <s_t, o_t, s_{t+τ}>. The intermediate step update is shown in algorithm 5.

Algorithm 5 TD learning with intermediate steps

for each new step <s_t, o_t, s_{t+τ}> do
    etraces → updateETraces()
    etraces → addETrace(s_t, o_t)
    td = ∑_{i=0}^{τ-1} γ^i · r_{t+i} + γ^τ · Q(s_{t+τ}, o') − Q(s_t, o_t)
    Q(s, o) ←_η td · e(s, o)
    for k = 1 to τ−1 do
        td = ∑_{i=k}^{τ-1} γ^{i-k} · r_{t+i} + γ^{τ-k} · Q(s_{t+τ}, o') − Q(s_{t+k}, o_t)
        Q(s_{t+k}, o_t) ←_η td
        etraces → addETrace(s_{t+k}, o_t)
    end for
end for

This is done for all TD Q-Function and V-Function learning algorithms.
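To make the update concrete, the following stand-alone C++ sketch implements algorithm 5 for a tabular Q-Function. The class and its data structures are illustrative only and not the Toolbox API; in particular, the e-trace decay over a τ-step option is assumed here to be (γλ)^τ, and traces are simply appended without merging duplicates.

// Minimal sketch of Algorithm 5 (TD learning with intermediate steps) for a
// tabular Q-Function. Not the Toolbox API; names and signatures are illustrative.
#include <map>
#include <utility>
#include <vector>
#include <cmath>

struct ETrace { int state; int option; double value; };

class IntermediateStepTDLearner {
public:
    IntermediateStepTDLearner(double gamma, double lambda, double eta)
        : gamma_(gamma), lambda_(lambda), eta_(eta) {}

    // states[0..tau] are s_t .. s_{t+tau}, rewards[0..tau-1] are r_t .. r_{t+tau-1},
    // option is o_t, nextOptionValue is Q(s_{t+tau}, o').
    void learnStep(const std::vector<int>& states, const std::vector<double>& rewards,
                   int option, double nextOptionValue) {
        const int tau = static_cast<int>(rewards.size());
        // Standard SMDP update for <s_t, o_t, s_{t+tau}> with e-traces.
        decayTraces(std::pow(gamma_ * lambda_, tau));   // assumed decay for a tau-step option
        addTrace(states[0], option);
        double td = nStepReturn(rewards, 0) + std::pow(gamma_, tau) * nextOptionValue
                    - Q(states[0], option);
        for (const ETrace& e : traces_)
            q_[{e.state, e.option}] += eta_ * td * e.value;
        // Intermediate step updates (no e-traces used, states added afterwards).
        for (int k = 1; k < tau; ++k) {
            double tdk = nStepReturn(rewards, k) + std::pow(gamma_, tau - k) * nextOptionValue
                         - Q(states[k], option);
            q_[{states[k], option}] += eta_ * tdk;
            addTrace(states[k], option);
        }
    }

private:
    double Q(int s, int o) const {
        auto it = q_.find({s, o});
        return it == q_.end() ? 0.0 : it->second;
    }
    double nStepReturn(const std::vector<double>& r, int k) const {
        double ret = 0.0;
        for (int i = k; i < static_cast<int>(r.size()); ++i)
            ret += std::pow(gamma_, i - k) * r[i];
        return ret;
    }
    void decayTraces(double factor) { for (ETrace& e : traces_) e.value *= factor; }
    void addTrace(int s, int o) { traces_.push_back({s, o, 1.0}); }

    double gamma_, lambda_, eta_;
    std::map<std::pair<int, int>, double> q_;
    std::vector<ETrace> traces_;
};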

5.3.5 Implementation of the Hierarchic Architectures

Now we want to take a closer look at the option and MAX-Q hierarchic structure frameworks and how they can be implemented in the Toolbox.

The option framework

The option framework consists of one hierarchy level, namely an SMDP which can choose from different options. Nevertheless, we need the hierarchic controller in order to use options. The hierarchic controller executes the options and returns the primitive actions to the agent. The options have to be implemented by the user, and all options must be subclasses of the extended action class. For learning, we have to add a hierarchic SMDP as hierarchic stack listener to the controller. This hierarchic SMDP is also the root element of the hierarchic structure. To this hierarchic SMDP we can add our learning algorithms and specify agent controllers in the usual way. If we add listeners to the hierarchic SMDP, they will be informed about the option steps and, if required, about the intermediate steps. If we add listeners to the agent, the listeners will be informed about the flat step information. So we can learn the option values and the values of primitive actions simultaneously. For example, we can add a TD learner to the hierarchic SMDP and a TD learner to the agent; both TD learners can use the same Q-Function, which contains action values for the options and the primitive actions. The realization of this approach is also illustrated in figure 5.10.

The MAX-Q framework

In the MAX-Q framework we have a hierarchic task graph, and for each subtask a policy can be learned. We only discuss the creation of the hierarchical structure of MAX-Q learning in the Toolbox; the MAX-Q value decomposition algorithm is not implemented.


Figure 5.10: Realization of the option framework with the hierarchic architecture of the Toolbox.

A subtask has almost the same structure as an option. Again, we represent the subtask as a hierarchic SMDP with an individual controller as policy and an arbitrary number of listeners for learning. The difference to the standard option approach is that the actions selected by a subtask's policy can in turn be hierarchic SMDPs. So the policies can choose from hierarchic SMDPs or primitive actions (intermixing with self-coded options is also possible). The only functionality that is missing is the termination condition of the subtasks; this has to be implemented by the user. The standard hierarchic SMDP termination condition is always false. Thus an individual class is needed for each subtask to implement the termination condition; this class can simultaneously be used for the individual reward functions.

Figure 5.11: Realization of the MAX-Q framework with the hierarchic architecture of the Toolbox.


Chapter 6

Function Approximators for Reinforcement Learning

In this chapter, we discuss several function approximation schemes that are successfully used with RL. In the context of RL, function approximation is needed to approximate the value function or the policy for continuous or large discrete state spaces. We begin the chapter by briefly covering gradient descent algorithms for function approximation. Then we take a look at the available function representations in the Toolbox, including tables, linear feature representations, adaptive Gaussian soft-max basis function networks, feed forward neural networks and Gaussian sigmoidal networks. Then we explain how these FAs can be used to approximate either the V-Function, the Q-Function or the policy directly. All the function approximation schemes are implemented independently of the RL algorithms, so we can use almost any approximator for a given algorithm.

6.1 Gradient Descent

Gradient descent is a general mathematical optimization approach. Given a smooth scalar function f(w), we want to find the weight vector w* which minimizes f. One approach to do this is gradient descent: in general, the vector w is initialized randomly; then the weights are updated according to the negative gradient.

Δw_t = −η_t · df(w)/dw    (6.1)

This algorithm provably converges to a local minimum if the learning rate η satisfies the following properties:

∑_{t=0}^{∞} η_t = ∞    (6.2)

∑_{t=0}^{∞} η_t² < ∞    (6.3)

For regression and other supervised learning problems we usually want to approximate a function g of which only n input-output values are known. So, for every input vector x_i we know the output value g(x_i). Our goal is to find a parameterized function ĝ(x, w) which approximates g as well as possible (at least at the given input points x_i).


In this case the function which we want to minimize is the quadratic error function

f(w) = E(w) = (1/2) · ∑_{i=1}^{n} (ĝ(x_i; w) − g(x_i))²    (6.4)

There are two different gradient descent update approaches:

• Simple Gradient Descent (Batch Learning): This approach uses the gradients of all input points for one weight update, resulting in the following update rule:

Δw = −η_t · ∑_{i=1}^{n} (ĝ(x_i; w) − g(x_i)) · dĝ(x_i; w)/dw    (6.5)

The conditions of convergence are well understood for batch learning. Another advantage is that several acceleration techniques, like the Conjugate Gradient, Levenberg-Marquardt or QuickProp algorithms, can only be applied to batch learning updates.

• Incremental Gradient Descent: In this case the weight update is performed immediately after the gradient calculation for a single input point.

Δw = −η_t · (ĝ(x_i; w) − g(x_i)) · dĝ(x_i; w)/dw,   for a random input x_i    (6.6)

Simple gradient descent corresponds to epoch-wise learning, incremental gradient descent to incremental learning. Incremental learning is usually faster than batch learning. This is a consequence of the fact that there usually exist several similar input-output patterns; in this case batch learning wastes time computing and adding several similar gradients before performing the weight update. Because of the randomness of incremental learning, we are also more likely to avoid local minima, so incremental learning often leads to better results.

For all these update schemes the convergence results have been studied extensively [12]. General RL algorithms like TD(λ) all use an incremental learning update, since the updates are done immediately after each step.
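As a concrete illustration of equation 6.6, the following C++ sketch performs incremental gradient descent for a linear model ĝ(x; w) = wᵀx. The class and method names are illustrative and not part of the Toolbox.

// Minimal sketch of incremental (stochastic) gradient descent for a linear
// model g^(x; w) = w^T x, following equation 6.6. Names are illustrative only.
#include <vector>
#include <cstddef>

class IncrementalLinearRegression {
public:
    IncrementalLinearRegression(std::size_t dim, double eta)
        : w_(dim, 0.0), eta_(eta) {}

    double predict(const std::vector<double>& x) const {
        double y = 0.0;
        for (std::size_t i = 0; i < w_.size(); ++i) y += w_[i] * x[i];
        return y;
    }

    // One incremental update for a single input-output pair (x, target).
    void learnExample(const std::vector<double>& x, double target) {
        double error = predict(x) - target;            // (g^(x;w) - g(x))
        for (std::size_t i = 0; i < w_.size(); ++i)
            w_[i] -= eta_ * error * x[i];              // dg^(x;w)/dw_i = x_i
    }

private:
    std::vector<double> w_;
    double eta_;
};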

6.1.1 Efficient Gradient Descent

One major aspect of gradient descent is how to choose the learning rate η. If η is too small, learning will be very slow; if η is too large, learning can diverge. In general, optimal η values differ for different weights, so applying different learning rates can be useful. If we assume a quadratic shape of the error function, the optimal learning rate depends on the second order derivative (curvature) with respect to the weights. Since we have different curvatures for different weights, most algorithms try to transform the weight space in order to have uniform curvatures in all directions. Most of the algorithms that deal with this problem are batch algorithms, so they cannot easily be used for reinforcement learning tasks. But there exists a method called the Vario-η algorithm, used by [15], which can be used for incremental learning.


The Vario-η Algorithm

The Vario-η algorithm measures the variance of each weight update during learning. This variance is then used to scale the individual learning rates.

v_{k+1}(i) = (1 − β) · v_k(i) + β · (Δw_k(i) / η_k(i))²    (6.7)

η_{k+1}(i) = η / (√(v_{k+1}(i)) + ε)    (6.8)

v_k(i) measures the variance of the weight updates Δw_i (it is assumed that the variance of the updates is high in comparison to their mean, which has been empirically verified), β is the variance decay factor (usually small values like 0.01 are used), and ε is a small constant which prevents a division by zero. The initial learning rate η gets divided by the standard deviation of the weight updates; as a result the updates for all weights have the same variance η². The Vario-η algorithm is not used for online learning in the Toolbox; instead it is used to observe the variances and to calculate the learning rates for a given function approximation architecture in advance. In this thesis we use the same results as Coulom [15] for FF-NNs.
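The following sketch shows the core of the Vario-η update (equations 6.7 and 6.8) as a stand-alone class; it is an illustration, not the Toolbox's η calculator implementation.

// Sketch of the Vario-eta learning rate adaptation (equations 6.7 and 6.8).
// Illustrative stand-alone code, not the Toolbox class.
#include <vector>
#include <cstddef>
#include <cmath>

class VarioEta {
public:
    VarioEta(std::size_t numWeights, double eta, double beta, double epsilon)
        : v_(numWeights, 1.0), eta_(eta), beta_(beta), epsilon_(epsilon),
          etas_(numWeights, eta) {}

    // Update the running variance estimates from the last weight updates and
    // return the per-weight learning rates for the next step.
    const std::vector<double>& update(const std::vector<double>& deltaW) {
        for (std::size_t i = 0; i < v_.size(); ++i) {
            double normalized = deltaW[i] / etas_[i];                 // dw_k(i) / eta_k(i)
            v_[i] = (1.0 - beta_) * v_[i] + beta_ * normalized * normalized;   // eq. 6.7
            etas_[i] = eta_ / (std::sqrt(v_[i]) + epsilon_);          // eq. 6.8
        }
        return etas_;
    }

private:
    std::vector<double> v_;      // variance estimates v_k(i)
    double eta_, beta_, epsilon_;
    std::vector<double> etas_;   // per-weight learning rates eta_k(i)
};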

6.2 The Gradient Calculation Model in the Toolbox

For the Toolbox it is desirable to have a general interface for calculating the gradient of a specific function representation. Function representations which allow calculating a gradient are obviously always parameterized functions; in our case we assume that the function has a fixed set of weights. We need interface functions for updating the weights, given the gradient, and for calculating the gradient, given an input vector. But first we have to take a look at how we represent a gradient vector.

6.2.1 Representing the Gradient

The properties of gradients can be very different: some are sparse (as for RBF networks), for other function representations the gradient can be non-zero everywhere. We already defined an appropriate data structure to handle these demands efficiently in section 4.3.3 when discussing e-traces (see class CFeatureList). For our gradient representation we use the same, but unsorted, feature list. In this case the feature index represents the weight index and the feature factor the value of the gradient vector. For all weight indices which are not in the list, we again assume that the derivative with respect to that weight is zero.

6.2.2 Updating the Weights

We introduce an own interface for updating the weights, given the gradient vector as a feature list. This interface is called CGradientUpdateFunction. Gradient update functions additionally provide the functionality to get and set the weight vector directly by specifying the weight vector as a double array. A method to retrieve the number of weights is also provided. The weight update is done by two different functions. The actual gradient update is done by the interface function updateWeights(CFeatureList *gradient), which has to be implemented by the subclasses. As discussed in section 6.1.1 about efficient gradient descent, it can be advantageous to apply different learning rates for different weights. This is done by extra objects called adaptive η calculators. Whenever the gradient function has to be updated, the updateGradient function is called. First the gradient vector is passed to the η calculator (if one has been defined), which applies the individual learning rates for each weight. Then the updateWeights function is called. The gradient update function interface does not specify any input or output types of the function; it just provides the interface for updating an arbitrary parameterized function approximator with a gradient vector.

Applying different learning rates

In order to apply different learning rates for different weights we introduce the abstract class CAdaptiveEtaCalculator. We can assign an adaptive η calculator to each gradient update function. If an η calculator has been specified for an update function, the η calculator's interface function getWeightUpdates is called before performing the actual weight update. This function can be used to apply different learning rates to the gradient vector. Two general implementations of η calculators are already provided in the Toolbox:

• Individual η-Calculator: This class maintains its own array of learning rates (initialized with 1.0); we can set the learning rate for each weight individually.

• Vario η-Calculator: This is the algorithm discussed in section 6.1.1 to calculate an optimal learning rate for the weights, considering the variances of the weight updates. In practice this algorithm is not used online for performance reasons, but its results are used to calculate static learning rates for specific function approximators.

The structure of a gradient update function and its interaction with the adaptive η calculator class is also illustrated in figure 6.1.

Figure 6.1: Interface for updating the weights of a parameterized FA

Delayed Weight Updates

In a few cases it is worthwhile to postpone the weight updates to a later time (e.g. it can be advantageous to apply the updates only when a new episode has started). Therefore we implemented the class CGradientDelayedUpdateFunction, which encapsulates another gradient update function. The updates to the encapsulated gradient function are stored until the method updateOriginalGradientFunction is called; then the stored updates are transferred to the original gradient function.


With the class CDelayedFunctionUpdater we can choose when we want to update the specified function approximator. We can choose how many episodes and/or steps have to elapse until the next update is performed.

6.2.3 Calculating the Gradient

For calculating the gradient we provide the interface class CGradientFunction, which is a subclass of CGradientUpdateFunction. In difference to the gradient update function, we already define interfaces for calculating an m-dimensional output vector given an n-dimensional input and for calculating the gradient at a given input vector. Thus the input and output behavior is already fixed for this class; CMyVector objects are used for the input and output vectors. The gradient calculation interface additionally gets an error vector e as input; the returned gradient is calculated in the following way

grad = ∑_{i=1}^{m} (df_i(x)/dw) · e_i    (6.9)

where i denotes the i-th output dimension. Hence we can specify which output dimensions we want to use for the gradient calculation by specifying an appropriate error vector e.

Figure 6.2: Interface for parameterized FAs which provide the gradient calculation with respect to the weights.
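As a small example of equation 6.9, the following function computes the error-weighted gradient for a multi-output linear model in which every output has its own weight block; it is illustrative only and not the CGradientFunction interface.

// Sketch of the error-weighted gradient in equation 6.9 for a multi-output
// linear model f_i(x) = w_i^T x (each output i has its own weight block).
#include <vector>
#include <cstddef>

// grad = sum_i (d f_i(x) / d w) * e_i. For this linear model the derivative of
// f_i with respect to its own weight block is x and zero for all other blocks,
// so block i of the combined gradient is simply e_i * x.
std::vector<double> errorWeightedGradient(const std::vector<double>& x,
                                          const std::vector<double>& e) {
    std::vector<double> grad(e.size() * x.size(), 0.0);
    for (std::size_t i = 0; i < e.size(); ++i)       // output dimension i
        for (std::size_t k = 0; k < x.size(); ++k)   // weight k of output i
            grad[i * x.size() + k] = e[i] * x[k];
    return grad;
}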

6.2.4 Calculating the Gradient of V-Functions and Q-Functions

We still need to implement value and action value functions which support our gradient interfaces. The main difference to our general gradient calculation design is that we now need state and action data objects as input. Therefore we create two additional classes, CGradientVFunction and CGradientQFunction, which are both subclasses of CGradientUpdateFunction. Both classes have an additional interface function for calculating the gradient given either the current state or the state and the action object as input (in difference to CGradientFunction, where the input is a CMyVector object). Gradient V- and Q-Functions are updated through the gradient update function interface. As a result we can already implement the updateValue methods, which can use the gradient calculation and weight update functions. Now we can implement gradient V-Functions and gradient Q-Functions independently, but we do not want to implement the same type of function approximator twice. For this reason we introduced the CGradientFunction interface. With the help of this interface we design classes which encapsulate a gradient function object and implement the V-Function respectively Q-Function functionality. Therefore we create an own class for V-Functions (class CVFunctionFromGradientFunction) and for Q-Functions (class CQFunctionFromGradientFunction), which implement the gradient V-Function respectively the gradient Q-Function interface and use the given gradient function object for each calculation. It is assumed that the given gradient function has the correct number of input and output dimensions, otherwise an error message is thrown. The number of outputs is obviously always one; the number of inputs depends on the number of discrete and continuous state variables of the input state. With this approach we can create a V- or a Q-Function just by specifying a gradient function object.

Figure 6.3: Value function class which uses a gradient function as function representation. The function calls are just passed to the gradient function object; for the getValue and getGradient functions of the V-Function object we first have to convert the state objects to vector objects in order to be able to use the gradient function's interface methods.

6.2.5 Calculating the Gradient of Stochastic Policies

Many policy search algorithms need to calculate the gradient of the likelihood dπ(s,a)/dθ or of the log-likelihood d log(π(s,a))/dθ = (1/π(s,a)) · dπ(s,a)/dθ of a stochastic policy π, where θ is the parametrization of the policy. In the case where the policy depends on the action values (CQStochasticPolicy) of a Q-Function (or on values reconstructed from a V-Function), the policy parametrization θ is equal to the weights w of the (action) value function. The gradient can in this case be expressed by

dπ(s,a_i)/dθ = dπ(s,a_i)/dw = (dπ(s,a_i)/dQ(s,·)) · (dQ(s,·)/dw) = ∑_{a_j ∈ A_s} (dπ(s,a_i)/dQ(s,a_j)) · (dQ(s,a_j)/dw)    (6.10)

dQ(s,·)/dw is the derivative of the Q-Function for all action values, hence it is an m × p matrix [dQ(s,a_1)/dw; dQ(s,a_2)/dw; ...], m being the number of actions and p the number of weights. dπ(s,a_i)/dQ(s,·) is an m-dimensional row vector, representing the derivatives of the action distribution with respect to the Q-Values; thus this distribution has to be differentiable. The only action distribution we discussed which fulfills this requirement is the soft-max distribution

π(s,a_i) = exp(β · Q(s,a_i)) / ∑_{a_j ∈ A_s} exp(β · Q(s,a_j))

The gradient of the soft-max distribution with respect to Q(s,a_j) is given by

dπ(s,a_i)/dQ(s,a_j) = β · π(s,a_i) · (1 − π(s,a_j))   if a_i = a_j
dπ(s,a_i)/dQ(s,a_j) = −β · π(s,a_i) · π(s,a_j)        otherwise    (6.11)

We extend our design of the action distribution objects by an additional interface, which calculates the gradient vector dπ(s,a_i)/dQ(s,·) = [dπ(s,a_i)/dQ(s,a_1), dπ(s,a_i)/dQ(s,a_2), ...]. This interface function is not obligatory, since it can only be implemented by the soft-max distribution class. It gets the action values [Q(s,a_1), Q(s,a_2), ...] as input and returns the gradient vector. Since not all action distributions support the gradient calculation, an additional boolean function indicates whether the gradient calculation is supported. The stochastic policy class is also extended by three functions: one for calculating dπ(s,a)/dw, one for d log(π(s,a))/dw, and one for calculating the gradient dQ(s,a)/dw, which is used by the two former functions in combination with the action distribution's gradient function to calculate the demanded gradients of the likelihood resp. log-likelihood. The gradient dπ(s,a)/dw is simply calculated as the weighted sum of the single gradients dQ(s,a_j)/dw (represented as feature lists), as given by equation 6.10.
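The following sketch computes the soft-max policy gradient of equations 6.10 and 6.11 for a Q-Function that is linear in a feature vector φ(s), with one weight block per action. It is a stand-alone illustration, not the Toolbox's CQStochasticPolicy implementation.

// Sketch of the soft-max policy gradient (equations 6.10 and 6.11) for a
// Q-Function that is linear in phi(s): Q(s,a_j) = w_j^T phi(s).
#include <vector>
#include <cstddef>
#include <cmath>

// Soft-max action probabilities pi(s,a_j) from the action values.
std::vector<double> softmax(const std::vector<double>& q, double beta) {
    std::vector<double> pi(q.size());
    double sum = 0.0;
    for (std::size_t j = 0; j < q.size(); ++j) { pi[j] = std::exp(beta * q[j]); sum += pi[j]; }
    for (double& p : pi) p /= sum;
    return pi;
}

// d pi(s,a_i) / d w, laid out as one block of size phi.size() per action j.
// Equation 6.10: sum_j (d pi(s,a_i)/d Q(s,a_j)) * (d Q(s,a_j)/d w), with
// d Q(s,a_j)/d w = phi(s) in block j, and the soft-max derivative of eq. 6.11.
std::vector<double> policyGradient(const std::vector<double>& q,
                                   const std::vector<double>& phi,
                                   std::size_t i, double beta) {
    std::vector<double> pi = softmax(q, beta);
    std::vector<double> grad(q.size() * phi.size(), 0.0);
    for (std::size_t j = 0; j < q.size(); ++j) {
        double dPi_dQj = (i == j) ? beta * pi[i] * (1.0 - pi[j])
                                  : -beta * pi[i] * pi[j];
        for (std::size_t k = 0; k < phi.size(); ++k)
            grad[j * phi.size() + k] = dPi_dQj * phi[k];
    }
    return grad;
}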

6.2.6 The supervised learning framework

In RL we may also need supervised learning algorithms, for example for learning the state dynamics of a dynamic system. In this section we briefly present the supervised learning interfaces of the Toolbox; an example of how we can learn the dynamics of a model is given in section 7.1.6. The class CSupervisedLearner is the super class of all supervised learning algorithms. Supervised learning is only supported for regression problems like continuous state prediction, so we have an n-dimensional continuous input and an m-dimensional continuous output space. The inputs and outputs are all represented as vector objects. The supervised learning base class consists of two methods: one for testing an input vector and returning the output of the function approximator, and one for using a given input-output vector pair for learning. Both functions are only interfaces and have to be implemented by the subclasses. The Toolbox provides just one supervised learning algorithm implementation, using incremental gradient descent as discussed in the theoretical part of this section. The class CSupervisedGradientFunctionLearner gets a gradient function as input and serves as the connection between the supervised learning interface and the gradient function interface. When learning a new example, it calculates the error between the output of the gradient function and the given target output; the gradient and the error are then used to update the weights of the gradient function according to the stochastic gradient equations. We can additionally specify a momentum factor α for the supervised gradient learner, which calculates a weighted average over the gradient updates:

Δw_{k+1} = α · Δw_k + η · ∇E(x, g(x))    (6.12)

This is a common approach to boost the performance of supervised gradient descent algorithms. Through this approach we can also use any implemented gradient-based function approximation scheme for supervised learning.
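A minimal sketch of such a supervised gradient learner with momentum, using a linear model as the encapsulated gradient function, is given below; the class is illustrative and not CSupervisedGradientFunctionLearner itself.

// Sketch of a supervised gradient learner with a momentum term in the spirit
// of equation 6.12, for a simple linear model. Illustrative only.
#include <vector>
#include <cstddef>

class MomentumGradientLearner {
public:
    MomentumGradientLearner(std::size_t numWeights, double eta, double alpha)
        : weights_(numWeights, 0.0), delta_(numWeights, 0.0),
          eta_(eta), alpha_(alpha) {}

    // Linear model used as the "gradient function" for this sketch.
    double predict(const std::vector<double>& x) const {
        double y = 0.0;
        for (std::size_t i = 0; i < weights_.size(); ++i) y += weights_[i] * x[i];
        return y;
    }

    // One learning example: momentum-weighted average of the gradient updates.
    void learnExample(const std::vector<double>& x, double target) {
        double error = target - predict(x);
        for (std::size_t i = 0; i < weights_.size(); ++i) {
            // delta_{k+1} = alpha * delta_k + eta * (error-weighted gradient)
            delta_[i] = alpha_ * delta_[i] + eta_ * error * x[i];
            weights_[i] += delta_[i];
        }
    }

private:
    std::vector<double> weights_, delta_;
    double eta_, alpha_;
};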

6.3 Function Approximation Schemes

In this section we will discuss the function approximation schemes which are implemented in the Toolbox. All these function approximators are updated via gradient descent. The following section will cover the implementation of the discussed function representations.

6.3.1 Tables

Tables are the simplest function approximators. As we already discussed in section 3.1, a single state index is used to represent the state; the value of the state is then determined through table look-up.

g(s_i; w) = w_i    (6.13)

Continuous problems have to be discretized in order to get a single state index, which is a crucial task, so tables are only recommendable for discrete problems. Nevertheless we can treat tables like normal function approximators, and therefore we can also calculate the gradient with respect to the weights:

dg(s_i; w)/dw = e_i    (6.14)

where e_i is the i-th unit vector.

6.3.2 Linear Approximators

We already discussed linear feature state representations in section 3.2. For linear approximators we have, similar to tables, one entry for each state; the difference is that the function value is interpolated between several table entries.

g(s; w) = ∑_{i=1}^{n} φ_i(s) · w_i = Φ(s) · W    (6.15)

Here n is the total number of features and φ_i calculates the activation factor of feature i. The gradient of linear approximators is calculated very easily:

dg(s; w)/dw = Φ(s)    (6.16)

There are different ways to create the linear feature state representation; the two most common are:

• Tile Coding

• Normalized Gaussian networks


For a detailed discussion of these two approaches see section 3.2. Linear feature approximators have, depending on the choice of the feature space, very good learning properties. An important advantage is that we can choose our features to have only local influence on the global function (RBF features, tile coding), so updating a state only changes the function value in a local neighborhood of s. Another important advantage is that the gradient does not depend on the current weight values, so it is purely specified by the state vector. The disadvantage of approaches with purely local features is that they suffer from the curse of dimensionality, i.e. the number of features increases exponentially with the number of state variables.

6.3.3 Gaussian Softmax Basis Function Networks (GSBFN)

GSBFNs are a special case of RBF networks, where the sum of all feature factors is normalized to 1.0. As a result, the difference to the standard RBF network approach is that GSBFNs have an additional extrapolation property for areas where no RBF center is located nearby. Doya and Morimoto successfully used adaptive GSBFNs to teach a planar, two-link robot to stand up [30]. For a given n-dimensional input vector x the activation function of center i is calculated by the standard RBF formula

a_i(x) = exp(−(1/2) · ((x − c_i)/s_i)²)    (6.17)

where c_i is the location of the center and the vector s_i determines the shape of the activation function. For simplicity we choose to specify the shape of the function just by a vector instead of a matrix. So we can specify the shape (the size of the bell-shaped curve) for each dimension separately, but we cannot specify any correlated expanse. The soft-max basis activation function is then given by

φ_i(x) = a_i(x) / ∑_{j=1}^{n} a_j(x)    (6.18)

The function value is then calculated straightforwardly, as in the linear feature case.

g(x; w, C, S) = ∑_{i=1}^{n} φ_i(x) · w_i = Φ(x) · W    (6.19)

The gradient with respect to w is then given by

dg(x; w, C, S)/dw = Φ(x)    (6.20)

which is the same as for linear feature approximators. Consequently, the non-adaptive case of a GSBFN can be treated as a usual linear function approximator.
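The following stand-alone sketch computes the normalized Gaussian soft-max features and the resulting function value (equations 6.17 to 6.19); it is for illustration only and not the Toolbox's GSBFN class.

// Sketch of normalized Gaussian soft-max basis functions (equations 6.17-6.19):
// RBF activations are normalized to sum to one and combined linearly with the weights.
#include <vector>
#include <cstddef>
#include <cmath>

struct RbfCenter { std::vector<double> c; std::vector<double> s; };

// Equation 6.17: a_i(x) = exp(-1/2 * sum_d ((x_d - c_d)/s_d)^2)
double activation(const RbfCenter& center, const std::vector<double>& x) {
    double sum = 0.0;
    for (std::size_t d = 0; d < x.size(); ++d) {
        double z = (x[d] - center.c[d]) / center.s[d];
        sum += z * z;
    }
    return std::exp(-0.5 * sum);
}

// Equations 6.18 and 6.19: normalized features and the linear function value.
double gsbfnValue(const std::vector<RbfCenter>& centers,
                  const std::vector<double>& w, const std::vector<double>& x) {
    std::vector<double> a(centers.size());
    double total = 0.0;
    for (std::size_t i = 0; i < centers.size(); ++i) {
        a[i] = activation(centers[i], x);
        total += a[i];
    }
    if (total <= 0.0) return 0.0;   // guard against numerical underflow far from all centers
    double value = 0.0;
    for (std::size_t i = 0; i < centers.size(); ++i)
        value += (a[i] / total) * w[i];   // phi_i(x) * w_i
    return value;
}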

Adaptive GSBFN

The framework of GSBFNs can be extended to have adaptable activation functions: we adapt the locations and the shapes of the centers. There are three additional update schemes for adaptive GSBFNs, as proposed in [30].

• Add a center: A new center is allocated if the approximation error is larger than a specified criterion e_max and the activation factor a_i of all existing centers is smaller than a given threshold a_min:

|ĝ(x; w) − g(x)| > e_max  and  max_k a_k(x) < a_min

The new center is added at position x with the given initial shape parameters s_0, and the weight w_i is initialized with the current function value.

If a new basis function is allocated, the shapes of neighboring basis functions also change due to the normalization step.

• Update the center positions: We can also calculate the gradient with respect to the position of a center c_i to adjust the locations of the centers.

dg(x; w, C, S)/dc_i = w_i · (dφ_i(x)/da_i) · (da_i/dc_i) = (1 − φ_i(x)) · φ_i(x) · ((x − c_i)/s_i²) · w_i    (6.21)

• Update the center shapes: The shape of the centers is updated by:

dg(x; w, C, S)/ds_i = w_i · (dφ_i(x)/da_i) · (da_i/ds_i) = (1 − φ_i(x)) · φ_i(x) · ((x − c_i)²/s_i³) · w_i    (6.22)

Actually both gradient calculations are just approximations of the real gradient, because they neglect the fact that φ_i(x) is a function of c_j and s_j even if i ≠ j.

In practice the adaptation of the center positions and shapes has to be done very slowly, so usually individual learning rates η_c and η_s are used for these parameters. Adaptive GSBFNs do not trouble the user to choose the positions and shapes of the centers that accurately, but they still suffer from the curse of dimensionality.
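The add-a-center rule can be sketched as follows; the helper types are restated here for self-containment, and the code is an illustration, not the Toolbox's addCenterOnError implementation.

// Sketch of the "add a center on error" rule of an adaptive GSBFN: a new RBF
// center is allocated at x if the approximation error exceeds e_max and no
// existing center is activated above a_min. Illustrative only.
#include <vector>
#include <cstddef>
#include <cmath>
#include <algorithm>

struct RbfCenter { std::vector<double> c; std::vector<double> s; };

static double rbfActivation(const RbfCenter& k, const std::vector<double>& x) {
    double sum = 0.0;
    for (std::size_t d = 0; d < x.size(); ++d) {
        double z = (x[d] - k.c[d]) / k.s[d];
        sum += z * z;
    }
    return std::exp(-0.5 * sum);
}

bool maybeAddCenter(std::vector<RbfCenter>& centers, std::vector<double>& w,
                    const std::vector<double>& x, double approxValue,
                    double targetValue, double eMax, double aMin,
                    const std::vector<double>& initialShape) {
    double maxActivation = 0.0;
    for (const RbfCenter& k : centers)
        maxActivation = std::max(maxActivation, rbfActivation(k, x));
    if (std::fabs(approxValue - targetValue) > eMax && maxActivation < aMin) {
        centers.push_back({x, initialShape});  // allocate the new center at x
        w.push_back(approxValue);              // initialize its weight with the current value
        return true;                           // the feature representation changed
    }
    return false;
}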

6.3.4 Feed Forward Neural Networks

Feed forward neural networks (FF-NNs) consist of a graph of nodes, called neurons, connected by weighted links. These nodes and links form a directed, acyclic graph. FF-NNs contain one input layer, one or more hidden layers and one output layer. Usually only neurons of two neighboring layers are connected by links. For each node n_i the input variables of the neuron are multiplied by the weights w_ij and summed up (figure 6.4); each node has its own activation function, which can be a sigmoid, a tansig or a linear function (the latter is usually used for the output layer). The sigmoidal transfer functions of the hidden neurons divide the input space with a hyperplane into two regions. Therefore these functions are global functions, in contrast to RBF centers, which only use a small region close to the center. The weights are usually updated with the back-propagation algorithm (backprop), which exists in several modifications. The backprop algorithm calculates the gradient of the error function with respect to the weights of the neural network by propagating the initial error back through the network. This algorithm is not covered in this thesis; see [15] for details. FF-NNs are not used as often as linear approximators because they are quite tricky to use for RL. They have poor locality, learning can be trapped in local minima, and we have very few convergence guarantees. The major strength of FF-NNs is that they can deal with high-dimensional input; in other words, FF-NNs do not suffer from the curse of dimensionality [7]. Another advantage is that the hidden layer of a FF-NN has a global generalization ability, which can be reused for similar problems once we have learned a task. FF-NNs have been used extensively by Coulom [15] for several optimal control tasks like the cart-pole swing up, the acrobot swing up and a high-dimensional swimmer task. These results show that FF-NNs can solve problems which are too complex for linear function approximators.


Figure 6.4: A single neuron, y_i = σ_i(w_{i0} + ∑_{j=1}^{n} w_{ij} · x_{ij})

Coulom showed empirically with the Vario-η algorithm that the variance of the weights to the linear output units is typically n times larger than that of the weights of the internal connections (n being the total number of neurons in the hidden layer). So good learning rates are obtained simply by dividing the learning rates of the output units by √n. The weights also have to be initialized carefully, because with bad initial weights we are likely to get stuck in a local minimum. Le Cun [26] proposed to initialize all weights of a node randomly according to a normal distribution with a variance of 1/m, m being the number of inputs of the node.
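Both heuristics are easy to state in code; the following sketch is illustrative only (the Toolbox itself delegates the network representation to the Torch library).

// Sketch of the two heuristics mentioned above: LeCun-style weight
// initialization (normal distribution with variance 1/m, m = fan-in of the
// node) and scaling the output-layer learning rate by 1/sqrt(n), n being the
// number of hidden neurons. Illustrative only.
#include <vector>
#include <cstddef>
#include <random>
#include <cmath>

std::vector<double> initNodeWeights(std::size_t fanIn, std::mt19937& rng) {
    // variance 1/m  ->  standard deviation 1/sqrt(m)
    std::normal_distribution<double> dist(0.0, 1.0 / std::sqrt(static_cast<double>(fanIn)));
    std::vector<double> w(fanIn);
    for (double& wi : w) wi = dist(rng);
    return w;
}

double outputLayerLearningRate(double hiddenEta, std::size_t numHiddenNeurons) {
    return hiddenEta / std::sqrt(static_cast<double>(numHiddenNeurons));
}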

6.3.5 Gauss Sigmoid Neural Networks (GS-NNs)

This function approximation scheme tries to combine the benefits of local RBF functions and global sigmoidal NNs. The approach has been proposed and used by Shibata [42] for learning hand reaching movements with visual sensors and also for a biologically inspired arm motion task by Izawa [20]. GS-NNs consist of two layers. The first layer is the Gaussian localization layer. This layer uses RBF networks (or rather GSBFNs) to localize the n-dimensional input space. Actually, we can use any kind of feature calculator for localization; even adaptive GSBFNs where the centers and the shapes of the activation functions are adapted can be used. In this case the learning rate of the adaptive GSBFN has to be very small to keep learning stable. The second layer is a sigmoid layer as in a FF-NN. This layer contains the global generalization ability. The input to the second layer is the feature vector calculated by the first layer. Through the second, global layer we gain the advantage that we can use less accurate feature representations (and consequently fewer features). Another approach, used by Izawa [20], is to localize each state variable separately and rely on the global layer to combine the localized state variables correctly. This gives us a huge advantage because we escape the curse of dimensionality: in this case the number of features only increases linearly with the number of state variables. Whether this approach also scales up to more complex problems has to be investigated.

6.3.6 Other interesting or utilized architectures for RL

In this section we briefly provide an overview of other kinds of function approximators which are used in the area of RL, in particular for optimal control tasks. These architectures are only mentioned and their interesting properties pointed out; they are not implemented in the Toolbox.

(a) RBF network (b) GS-NN

Figure 6.5: (a) RBF network: there are no hidden units which can represent global information. (b) GS-NN with an added sigmoidal hidden layer to provide better generalization properties. Taken from Shibata [42].

Normalized Gaussian networks with linear regression (NG-net)

NG-nets approximate an m-dimensional function over an n-dimensional input space. The output of a NG-net is given by

y = ∑_{i=1}^{M} ( a_i(x) / ∑_{j=1}^{M} a_j(x) ) · (W_i x + b_i)    (6.23)

The a_i again represent Gaussian RBF functions, so the inner term is the same as for GSBFNs. W_i is a linear regression matrix and b_i is the offset of the i-th Gaussian kernel. Instead of just summing the weighted activation factors of the radial basis functions as in GSBFNs, each radial basis function defines its own linear regression, and the NG-net is the sum of these linear regression kernels. In addition to the parameters of the Gaussian kernel functions μ_i, Σ_i we have the m × n linear regression matrices W_i and the m-dimensional offset vectors b_i as parameters. Usually this approach needs fewer RBF centers for a good approximation, because linear regression is more powerful than linear function approximation. Again, we can use gradient descent methods to find a good parametrization of the NG-net. Yoshimoto and Ishii [57] used another, also very interesting approach: they used an EM algorithm to calculate the parameter setting. This can be done by defining a stochastic model for the NG-net and then using the standard Expectation-Maximization algorithm, where the probability P(x, y | w) is maximized at each maximization step. In their approach an Actor-Critic algorithm was used. They use two different training phases: one with a fixed actor for estimating the value function, the other to improve the actor given the fixed estimated value function of the policy defined by the actor. For further details please refer to the given literature references.

Adaptive state space discretization using a kd-tree

Vollbrecht uses an adaptive kd-tree to discretize the state space in the Truck-Backer-Upper example. This approach is quite interesting because no prior knowledge is needed to construct the discrete state space. The basic partitioning structure is a kd-tree which divides the state space into n-dimensional cuboid-like cells. The whole state space gets successively split with (n−1)-dimensional hyper-planes, which cut a cell in two halves along a selected dimension; consequently a kd-tree can be represented as a binary tree. Two neighboring cells cannot differ in their box length in any dimension by more than a factor of two. Within a cell the Q-Value is constant, similar to using tables. An action is executed as long as the Q-Value does not change, which means until the state leaves the current cell of the kd-tree. Every time a certain condition is met, the current cell of the kd-tree is split in half and two new cells replace the existing one. Vollbrecht uses different kinds of tasks for his hierarchic structure: 'avoidance', 'goal seeking' and 'state maintaining' tasks. For each of these tasks different rules are used to split the cells of the kd-tree. For a more detailed discussion of the hierarchical system used see section 5.2.4.

Echo state networks

Echo state networks (ESNs) have been proposed by Herbert Jaeger [22], [21] for non-linear time series prediction. The idea of echo state networks is to use a fixed recurrent neural network with sparse internal connections and to learn only the linear output mapping. The internal connections are chosen randomly at the beginning and are not learned at all; only the linear output mapping is learned, which can easily be done by the LMS rule or alternatively online via gradient descent. Under certain conditions the internal nodes of the network represent echo state functions, which are functions of the input history. The network uses these echo state functions as the basis functions for the linear mapping; if we have a large pool of uncorrelated basis functions we are likely to find a good linear output mapping. The recurrent neural network has echo states (i.e. usable uncorrelated basis functions) if the input function is compact and certain conditions on the internal connection matrix are met. ESNs have never been used with RL, but this approximation scheme has many interesting properties. We do not need to incorporate any knowledge into the function approximator (as in linear function approximation), but we can still use linear learning rules, which usually converge faster. Another interesting aspect is the incorporation of state information from the past with the help of the recurrent network, which can be advantageous for learning POMDPs. Unfortunately there was no time for an additional investigation of these ideas.

Locally weighted regression

Locally weighted learning (LWL, [4], [3]) is a popular supervised learning approach for learning the forward model or the inverse model of the system dynamics. In difference to all the other discussed approaches this method is memory based, that is to say it maintains all the experience (input-output vectors) in memory (similar to nearest neighbor algorithms). Hence locally weighted learning is a non-parametric function representation. Using a set of the k nearest input points to the query point x, a local model of the learned function is created and used to calculate the function value at x. The local model can be any kind of parameterized function; usually a linear or quadratic model is used, but a small neural network is also possible. The parameters of the local model have to be recalculated for each query point, which, in combination with the look-up of the k nearest neighbors, is computationally more expensive than using a global parameterized model. The advantage of this approach is that hardly any time is needed for learning (we just have to add the new input point to the memory), and that the same input point only has to be learned once (in difference to gradient descent methods, where we have to use a learning example more than once to train the desired function value more exactly). Atkeson gives a very good overview of locally weighted learning and how to use it for control tasks [3], but in that paper LWL is only used in the context of RL for learning the forward model of the system to improve the performance of a reinforcement learning agent. Smart [46], [47], [45] uses LWL directly to represent the Q-Function of the learning problem, with a LWL system using a kd-tree for a faster look-up of the neighboring inputs. Since the Q-Values change with the policy, special algorithms have to be used to update the already existing examples in memory. Smart concentrates on robot learning; the approach was tested on corridor following and obstacle avoidance tasks. The LWL approach is particularly interesting since LWL does not suffer from the curse of dimensionality as much as linear feature state representations do.

6.3.7 Implementation in the RL Toolbox

Linear Approximators and Tables

Tables and linear approximators are represented by the class CFeatureFunction. Because of our state representation model, this class is not a subclass of CGradientFunction (it needs a state collection object as input). Instead we directly derive a value function class CFeatureVFunction from the base class CGradientVFunction. We already introduced this class in chapter 4. For the gradient calculation we just store the given feature state object in a gradient feature list. So the main functionality of a linear approximator is still implemented in a feature calculator object.

Adaptive GSBFNs

In this case we decided not to use the gradient function interface directly for the base class (CAdaptiveSoftMaxNetwork), for the sake of extensibility, because there exist approaches which do not just use the features for a linear approximation, like the NG-net (see 6.3.6). Our base class implements just the localization layer, without the weights w for the linear approximation. Thus the base class only contains the center position and shape information, and represents only a part of a function approximator. Therefore the class is a subclass of CGradientUpdateFunction instead of CGradientFunction, because the gradient and the input/output behavior are not known at this point. The weights w for the linear approximation are maintained by a common feature function object. In the constructor we have to fix the maximum number of centers that can be allocated and that can be active simultaneously. The class also already implements the feature calculator interface and calculates the activation factors of the maxFeatures most active features, which are stored in a feature state object. We can treat adaptive GSBFNs as feature calculators if we assume only small changes of the center positions and shapes during one learning episode. In general this assumption holds at least for small learning rates η_c and η_s. The class maintains a list of RBF centers; an RBF center is stored in its own data structure which stores vectors for the location and the shape of the center.

The calculation of the derivative with respect to the location and shape of a single center is implemented in the function getGradient(CStateCollection *state, int featureIndex, CFeatureList *gradientFeatures). It returns the gradient [dφ_i(x)/dc_i, dφ_i(x)/ds_i], given the input state collection; the gradient is calculated as discussed above. Each weight of the location and shape of a center is assigned a unique index. This index is used to identify the center associated with a certain weight w_i and also to update the data structures of the centers (implemented in the interface function updateWeights). Since we use a gradient update function as base class we can use an individual η calculator for our adaptive GSBFN. This η calculator identifies each weight as a location or shape weight and then applies the specified learning rate η_c respectively η_s to the weight update. If η_c or η_s is set to zero, the corresponding derivatives are not calculated, resulting in the constant case again, but where new centers can still be added.

The calculation of the feature factors φ_k(x) is done in the getModifiedState function of the feature calculator interface. We can specify an ε_i rectangular neighborhood for each dimension i for the local search of nearby RBF centers. At the beginning of the search, all centers are stored in the search list. Then the first dimension of the centers' locations is checked to be within the specified neighborhood of the current state. If that is not the case, the center is deleted from the list. This process is repeated for all dimensions. At the end the search list contains only centers which are located in the ε neighborhood. The factors of all these centers are evaluated and the maxFeatures most active are written to the feature state object. Then the feature state is normalized as usual. The feature state is also needed by the gradient calculation itself, so it has to be added to the state collection objects of the agent.

For adding an RBF center automatically, the method addCenterOnError is provided. The function adds a center at the current position if the activation factors of all centers are less than a_min and the error is larger than e_max. But here a problem arises with our state representation using the feature calculator interface. Feature calculators can only be used if the feature factors do not vary for the same state over time. As already mentioned, we can neglect the drift of the center locations and shapes over one training episode, but if we add a new center, the feature activation factors change very abruptly. We decided on the following workaround: we can use the standard state model of the Toolbox as long as no new center has been allocated, which spares us a lot of computation time in a convenient way (because we calculate the feature factors just once). The state collection - state modifier interface is adapted slightly: each state modifier maintains a list of all state collections which store a state object of that modifier. As a consequence, the modifier can always inform the state collections whenever a state object has become invalid and has to be calculated again. This is done each time a new RBF center is added. In addition, we have the possibility to specify an initial set of RBF centers. This can be done either by specifying the centers individually or by specifying a grid feature calculator object; then an RBF center is added at each tile of the grid.

Feed Forward Neural Networks

The RL Toolbox uses the Torch library to represent all FF-NNs. With the Torch library we can create arbitrary FF-NNs with an arbitrary number of different layers. The layers can be interconnected as needed (but usually a straightforward feed forward network is used). For these neural networks the gradient with respect to the weights can be calculated, given a specific input and output data structure. All objects which support gradient calculation in the Torch library are subclasses of the class GradientMachine. The class CTorchGradientFunction encapsulates a gradient machine object from Torch; it is also a subclass of the CGradientFunction interface, so it can be used to create a V-Function or a Q-Function. All the communication with the Torch library is done by this class, which involves converting the input and output vectors from the Torch-internal structure to the data structures used by the Toolbox and also getting and updating the weight vectors directly. A standard FF-NN is created very easily with the Torch library; for further details consult the Torch documentation of the class MLP. FF-NNs additionally use their own η calculator, which scales the learning rates of the output weights by a factor of 1/√n, n being the number of neurons in the hidden layers. At the creation of the FF-NN the weights are initialized in the resetData method by the method proposed by Le Cun. We can additionally scale the variance of the initial weights by the parameter 'InitWeightVarianceFactor'.

Gaussian Sigmoidal Neural Networks

The two main parts of GS-NNs already exist: the localization part and the FF-NN part. What is still missing is the interface between both. We have to provide a conversion from the sparse feature state representation to the full feature state vector (including the features with activation factor 0.0). This full feature state vector can then be used directly as input for a FF-NN. The conversion is done by the class CFeatureStateNNInput.


Through this approach any constant feature representation can be used for the GS-NN, so we can, for example, choose whether we want to localize the global state space or each state variable separately (using and respectively or feature operators). Unfortunately this approach does not work for adaptive GSBFNs, because we would have to calculate the gradient of the composition of the GSBFN and the FF-NN, which is quite tricky and time consuming (the question is whether that is really necessary for small learning rates of the adaptive GSBFN). In any case, the use of a GS-NN with an adaptable localization layer is not supported in the Toolbox.


Chapter 7

Reinforcement learning for optimal control tasks

Reinforcement learning for control tasks is a challenging problem, because we usually have a high-dimensional continuous state and action space. For learning with continuous state and action spaces, function approximation techniques are usually used. Many interesting problems have been solved in the area of optimal control tasks with different RL algorithms; for a detailed description of the successes of RL in this area see chapter 1. Using RL with function approximation needs many learning steps to converge; consequently almost all results are for simulated tasks. In this chapter we discuss a few commonly used algorithms for optimal control tasks. At first we take a closer look at the use of continuous actions and extend the framework of the Toolbox for continuous action learning. Then we come to value approximation algorithms, which allow us to use TD(λ) learning even with function approximators [6]. We also cover two newer value-based approaches, namely continuous time RL [17] and advantage learning [6]. The next section covers two policy search algorithms (GPOMDP [11] and PEGASUS [33]). Then we come to continuous Actor-Critic learning, where two different approaches are discussed: the stochastic real valued algorithm (SRV, [19]) and a newly proposed algorithm which is called policy gradient Actor-Critic learning (PGAC).

7.1 Using continuous actions in the Toolbox

For continuous control tasks we need to use continuous actions. For a low-dimensional action space we could alternatively discretize the action space, but this can impair the quality of the policy, and it is not possible for high-dimensional action spaces anyway. Several limitations arise when using continuous actions. First, action value functions cannot be used straightforwardly: since we want to search for the best action value of a state s, we somehow need to discretize the action space or use a more sophisticated search method. The same is true for V-Function planning, where an action discretization is also necessary for finding the optimal action. Concerning the Toolbox, we already discussed the concept of continuous actions, using the action pointer to identify the action and the action data object for the continuous action values (see 2.2.4). We additionally design a continuous controller interface and also an interface for Q-Functions with continuous action vectors as input. Even if we need to discretize the action space for the Q-Function in order to search for the best action value, it can be advantageous to use continuous inputs for learning in order to induce some generalization effects between the actions.


As a discretized version of a continuous action we introduce static continuous actions. Static continuous actions have the same properties as continuous actions, so they maintain an action data object representing the continuous action value vector, but this action value is now fixed; the action represents a fixed point in the action space. Static continuous actions are, for example, used for the discretization needed by Q-Functions to search for the best action value. For each static continuous action we can additionally calculate the distance to any other continuous action object in the action space, which is needed later on for interpolating Q-Values.

7.1.1 Continuous action controllers

We need to design a continuous action controller interface which still fits into our agent controller architecture. An agent controller always returns an action pointer (to identify the action) and stores, if needed, the action data object associated with the returned action in a given action data set. A controller specifically built for continuous actions always returns the same action pointer (the pointer of the continuous action object) and stores the calculated action vector in the corresponding action data object. For that reason we create the interface class CContinuousActionController, which is a subclass of CAgentController and has an additional interface function getNextContinuousAction which gets the state as input and has to store the action vector in a given continuous action data object. This function is called by the getNextAction method of the controller; the continuous action data object is automatically passed to the getNextContinuousAction interface. Consequently this class simplifies the design of continuous action controllers because we do not have to worry about the action model any more. See also figure 7.1 for an illustration.

Figure 7.1: Continuous action controllers and their interaction with the agent controller interface.

Every continuous action controller already has its own noise controller; this noise is added to the action value each time getNextAction is called. If u(s) is the action vector coming from the getNextContinuousAction function, the policy π is defined to be

π(s_t) = u(s_t) + n_t    (7.1)

Random Controllers

For the noise we provide a general noise controller called CContinuousActionRandomPolicy. The noise is normally distributed with zero mean and a specified σ value. In order to accomplish a smoother noise signal we can also choose to low-pass filter it. Hence the noise vector is calculated in the following way:

n_{t+1} = α · n_t + N(0, σ)


α being the smoothing factor and N(0, σ) a normally distributed random variable. In order to switch the noise off for a certain continuous controller we have to set the σ value to zero. The noise used for a given action is needed by the SRV algorithm. Since the noise signal is not stored with the action object, we have to recalculate the noise signal if we know the action vector a_t. The continuous agent controller interface also supports calculating the noise vector given a control vector a_t and a state s_t. The noise (or the deviation from the original control vector) is then obviously calculated by

n_t = a_t − π(s_t)    (7.2)
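A minimal sketch of such a smoothed Gaussian noise controller, including the inverse calculation of equation 7.2, is given below; it is illustrative and not the Toolbox's CContinuousActionRandomPolicy.

// Sketch of a low-pass filtered Gaussian exploration noise:
// n_{t+1} = alpha * n_t + N(0, sigma) per action dimension, added to the
// deterministic controller output u(s). Illustrative only; sigma > 0 assumed.
#include <vector>
#include <cstddef>
#include <random>

class SmoothedGaussianNoise {
public:
    SmoothedGaussianNoise(std::size_t dim, double sigma, double alpha, unsigned seed = 0)
        : noise_(dim, 0.0), dist_(0.0, sigma), alpha_(alpha), rng_(seed) {}

    // Advance the noise process and add it to the controller output u(s).
    std::vector<double> apply(const std::vector<double>& u) {
        std::vector<double> a(u.size());
        for (std::size_t i = 0; i < u.size(); ++i) {
            noise_[i] = alpha_ * noise_[i] + dist_(rng_);   // low-pass filtered noise
            a[i] = u[i] + noise_[i];                        // pi(s) = u(s) + n_t
        }
        return a;
    }

    // Recover the noise that produced action a given the controller output u
    // (equation 7.2), as needed e.g. by the SRV algorithm.
    static std::vector<double> recoverNoise(const std::vector<double>& a,
                                            const std::vector<double>& u) {
        std::vector<double> n(a.size());
        for (std::size_t i = 0; i < a.size(); ++i) n[i] = a[i] - u[i];
        return n;
    }

private:
    std::vector<double> noise_;
    std::normal_distribution<double> dist_;
    double alpha_;
    std::mt19937 rng_;
};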

7.1.2 Gradient calculation of Continuous Policies

For continuous policies the gradient dπ(s)/dθ is needed by a few algorithms which represent the continuous policy directly. dπ(s)/dθ is an m × p matrix, m being the dimensionality of the action space U and p the number of weights used to represent the policy. We can use the already discussed function approximation schemes, which are implemented as gradient functions, for the continuous policies; hence we encapsulate gradient function objects. For the gradient calculation we can then easily use the encapsulated gradient function object.

Implementations of the continuous gradient policies

There are two implementations of this interface:

• Gradient function encapsulation (CContinuousActionPolicyFromGradientFunction): A gradient function object with n inputs and m outputs is used to represent the policy. The user has to specify a gradient function with the correct number of input and output values. The class itself is the interface between the gradient function and the continuous gradient policy classes, so all function calls are passed to the corresponding functions of the gradient function object. In this case only one gradient function is used for all control variables. This class can, for example, be used to encapsulate a FF-NN from the Torch library and use it as policy.

• Single state gradient function encapsulation (CContinuousActionPolicyFromSingleGradientFunction): In this case we can use a list of gradient functions; for each control variable an individual gradient function is used (so the gradient functions have n inputs and one output dimension). We can use this class to represent our policy with independent functions for the different control variables, for example with individual feature functions for each control variable.

Control Limits

Usually we have limits for our control variables which, in the simplest case, are given by an interval [u_min, u_max] (e.g. see [15]). Just clipping the control vector to the interval limits whenever it lies outside would be a possibility, but this definitely falsifies the gradient calculation. Our approach is to use sigmoidal functions for the control variables which saturate at the limit values. We use the following function to get a quasi-linear behavior between the limits, where u_j is the j-th output of the policy without any limits (this function is also illustrated in figure 7.2):

u′_j = σ(u_j) = u_{j,min} + (u_{j,max} − u_{j,min}) · logsig( −2 + 4 · (u_j − u_{j,min}) / (u_{j,max} − u_{j,min}) )    (7.3)


Figure 7.2: Limited control policy; the action u is limited to [u_min, u_max] by a sigmoidal function. In the middle of the interval the limited control policy is quasi-identical to the unlimited policy.

Obviously the gradient also changes:

dπ′(s)/dθ = (u_{j,max} − u_{j,min}) · logsig′( −2 + 4 · (π(s) − u_{j,min}) / (u_{j,max} − u_{j,min}) ) · 4 / (u_{j,max} − u_{j,min}) · dπ(s)/dθ    (7.4)

Due to the scaling of the sigmoidal function argument, we get a quasi-identical behavior of the original and the sigmoidal function for 90% of the allowed control space. For introducing the limits of the control variables, we create a separate class CContinuousActionSigmoidPolicy, which encapsulates another gradient policy class and uses the introduced sigmoid function to limit the control variables and also to calculate the new gradient. For sigmoidal policies we can use a different kind of noise, which we will refer to as internal noise. The internal noise is added before the sigmoidal function is applied:

π′(s) = σ (π(s) + n) (7.5)

This internal noise has a lower effect if the control value of the original policy is outside the given control interval, because of the saturation effect. Usually a value outside the limits means that the algorithm is quite sure about taking the maximum or minimum control value, so it makes sense to reduce the effect of noise in these areas.
For the inverse calculation of the noise n_t, given the executed action vector a_t and the state s_t, we have to use the inverse sigmoidal function if an internal noise controller was used.
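A small self-contained sketch of equations 7.3 and 7.4 for a single control variable follows; the function and struct names are illustrative. It returns the limited control value together with the scalar chain-rule factor that has to be multiplied onto dπ(s)/dθ.

#include <cmath>
#include <cstdio>

static double logsig(double x)      { return 1.0 / (1.0 + std::exp(-x)); }
static double logsigDeriv(double x) { double y = logsig(x); return y * (1.0 - y); }

struct LimitedControl {
    double u;       // limited control value u'_j  (equation 7.3)
    double dScale;  // factor to multiply onto dpi(s)/dtheta (equation 7.4)
};

LimitedControl limitControl(double u, double uMin, double uMax) {
    double range = uMax - uMin;
    double arg   = -2.0 + 4.0 * (u - uMin) / range;
    LimitedControl out;
    out.u      = uMin + range * logsig(arg);
    out.dScale = range * logsigDeriv(arg) * 4.0 / range;  // = 4 * logsig'(arg)
    return out;
}

int main() {
    // Near the middle of [-1, 1] the limited policy is almost identical to the
    // unlimited one; outside the interval it saturates and the gradient vanishes.
    for (double u : {-2.0, -0.5, 0.0, 0.5, 2.0}) {
        LimitedControl lc = limitControl(u, -1.0, 1.0);
        std::printf("u=% .1f  u'=% .3f  gradient scale=%.3f\n", u, lc.u, lc.dScale);
    }
}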

7.1.3 Continuous action Q-Functions

Similar to the continuous action controllers, we also create such an interface for Q-Functions, which now does not have to take an action pointer plus action data object as input anymore; instead the interface function immediately gets a continuous action data object as input. This class is already a subclass of the gradient Q-Function interface, so its subclasses also have to provide full gradient support.


7.1.4 Interpolation of Action Values

One negative effect of discretizing the action space is that we get a non-smooth policy which is usually sub-optimal. An approach to overcome this problem is linear interpolation of the Q-Values. For example, if we have three different discretized action vectors [a_min, a_0, a_max] and the Q-Values of two neighboring actions are almost the same, it can be useful to take the average of both action vectors.
On the other hand, if we have executed the continuous action a_t and its continuous action vector does not match any discretized action vector, we can update the Q-Values of nearby discretized action vectors.
The action selection part is done by the class CContinuousActionPolicy. The class has almost the same functionality as the stochastic policies: it takes a set of actions (this time all have to be static continuous actions) and an action distribution. In contrast to the stochastic policies, the class first samples one action according to the given action distribution and then calculates the weighted sum of all static action vectors in the neighborhood of the sampled action.

π(s) = Σ_{||a∗ − a_i|| < ε} P(a_i) · a_i    (7.6)

where a∗ is the sampled action. The size of the neighborhood ε that is searched for nearby action vectors can be specified.
An interpolated Q-Function is represented by the class CCALinearFAQFunction. This class encapsulates another Q-Function object and uses it for calculating the interpolated values. All actions used for that encapsulated Q-Function have to be subclasses of CLinearFAContinuousAction. These action objects are derived from the static continuous action class, so they represent a fixed point in the action space. Additionally they provide a function to calculate an activation factor of the static action, given the currently used action vector. This approach is related to linear feature states, which do the same in the state space. The class CContinuousRBFAction implements the RBF-activation function for static actions, but classes for linear interpolation can also be implemented easily. At the end the activation factors of the static actions are normalized (Σ_i a_i = 1). The interpolation Q-Function class calculates the action activation factors of all static actions for a given action vector, normalizes these activation factors and then calls the corresponding functions of the encapsulated Q-Function. For the update functions, the update value for each static action is scaled by the corresponding activation factor; for the getValue functions the value is calculated by the weighted sum of the action values.
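A minimal sketch of the interpolation idea, assuming RBF activations over simplified static-action structures (the names do not correspond to the Toolbox classes):

#include <cmath>
#include <cstddef>
#include <vector>

// Each static action is a fixed point in the action space with an RBF width.
struct StaticAction { std::vector<double> u; double sigma; };

// RBF activation of every static action for the executed action vector u,
// normalised so that the activations sum to 1.
std::vector<double> actionActivations(const std::vector<StaticAction>& actions,
                                      const std::vector<double>& u) {
    std::vector<double> w(actions.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < actions.size(); ++i) {
        double d2 = 0.0;
        for (std::size_t k = 0; k < u.size(); ++k) {
            double d = u[k] - actions[i].u[k];
            d2 += d * d;
        }
        w[i] = std::exp(-d2 / (2.0 * actions[i].sigma * actions[i].sigma));
        sum += w[i];
    }
    for (double& wi : w) wi /= sum;
    return w;
}

// Interpolated Q-value: weighted sum of the Q-values of the static actions.
double interpolatedQ(const std::vector<double>& qValues,
                     const std::vector<double>& activations) {
    double q = 0.0;
    for (std::size_t i = 0; i < qValues.size(); ++i) q += activations[i] * qValues[i];
    return q;
}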

7.1.5 Continuous State and Action Models

In optimal control tasks our model consists of n continuous state variables and m continuous control variables. Often we know the model of the state dynamics, or at least we can learn it. The model can be represented as a transition function

s_{t+1} = F(s_t, a_t)

for discrete time processes or it can be directly represented as the state dynamics of the system

ṡ_t = f(s_t, a_t).

The state transition from s_t to s_{t+∆t} can then be calculated by using a numerical integration method like the Runge-Kutta method. For the discrete time transition function we have already discussed the CTransitionFunction interface, which already fits our requirements (i.e. it can cope with continuous states and actions).


Continuous Time Models

Continuous time models are represented by the state dynamics ṡ_t = f(s_t, a_t). For calculating state transitions with a continuous time model the class CContinuousTimeTransitionFunction is implemented. This class is a subclass of the transition function class and maintains an additional interface getDerivationX which represents the state dynamics ṡ_t. With this interface the class can calculate the state transitions; we just have to specify the simulation time step ∆t_s. This integration is done by the method doSimulationStep. By default a first order integration is used for calculating the state:

s_{t+∆t} = s_t + ∆t · ṡ_t

This method has to be overridden if a second or higher order integration method is needed for certain state variables. For example, it is recommendable to use second order integration for positions p and first order integration for velocities v = ṗ to get a more accurate simulation result. In order to stay accurate, the simulation time step has to be much smaller than time steps which are useful for learning. Consequently the numerical integration of the learning time step ∆t is divided into several simulation steps with the simulation time step ∆t_s = ∆t / N; the small step integration is done by the doSimulationStep method. The whole integration step is done by the interface function of the transition function class, transitionFunction; the number of simulation steps per learning time step can be set separately.
The class CContinuousTimeAndActionTransitionFunction represents a continuous time model where the action has to consist of continuous control variables (in contrast to the normal continuous time transition function class, where the action can be any action object). The interface for calculating the state derivative is adapted to take a continuous action data object as input instead of an action object.
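The following sketch shows how a learning time step ∆t can be split into N first-order simulation steps, as described above; the function name and the use of std::function for the dynamics are assumptions for illustration only.

#include <cstddef>
#include <functional>
#include <vector>

using StateVec  = std::vector<double>;
using ActionVec = std::vector<double>;
// State dynamics sdot = f(s, u).
using Dynamics  = std::function<StateVec(const StateVec&, const ActionVec&)>;

StateVec integrateStep(const Dynamics& f, StateVec s, const ActionVec& u,
                       double dt, int numSimSteps) {
    double dts = dt / numSimSteps;               // simulation time step
    for (int k = 0; k < numSimSteps; ++k) {
        StateVec sdot = f(s, u);
        for (std::size_t i = 0; i < s.size(); ++i)
            s[i] += dts * sdot[i];               // first-order (Euler) integration
    }
    return s;
}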

Models Linear with respect to the control variables

For most motor control problems, where a mechanical system is driven by torques and forces, the state dynamics are linear with respect to the control variables, i.e. they can be represented by

f (s,u) = B(s) · u + a(s)

The state dynamics can be completely described by the matrix B(s) and the vector a(s). This is modeled by the class CLinearActionContinuousTimeTransitionFunction, a subclass of CContinuousTimeAndActionTransitionFunction. This class has two additional interface functions for retrieving a(s) and B(s), which have to be implemented by the subclasses. The state derivative ṡ_t is then calculated in the described way.
All the models used for the benchmark tests can be described in this form, so this class is the super class of all our model classes.
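As an illustration of the form f(s, u) = B(s) · u + a(s), here is a sketch for a simple torque-driven pendulum with state s = [phi, omega] and one control variable (the physical constants are made up and the struct is not a Toolbox class):

#include <cmath>
#include <cstddef>
#include <vector>

struct PendulumModel {
    double m = 1.0, l = 1.0, g = 9.81, friction = 0.1;

    // drift term a(s)
    std::vector<double> a(const std::vector<double>& s) const {
        double phi = s[0], omega = s[1];
        return { omega,
                 (-friction * omega - m * g * l * std::sin(phi)) / (m * l * l) };
    }
    // control matrix B(s), here a 2x1 matrix stored row-wise
    std::vector<std::vector<double>> B(const std::vector<double>&) const {
        return { {0.0}, {1.0 / (m * l * l)} };
    }
    // state derivative sdot = B(s)*u + a(s)
    std::vector<double> f(const std::vector<double>& s,
                          const std::vector<double>& u) const {
        std::vector<double> sdot = a(s);
        auto Bs = B(s);
        for (std::size_t i = 0; i < sdot.size(); ++i)
            for (std::size_t j = 0; j < u.size(); ++j)
                sdot[i] += Bs[i][j] * u[j];
        return sdot;
    }
};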

7.1.6 Learning the transition function

It is often useful to use state predictions as in V-Planning, even if the transition function is not known, as is the case for robotic tasks or external simulators. In this case we can learn the transition function s_{t+1} = f(s_t, a_t). This is a supervised learning problem, which is usually much easier than learning a Q- or a V-Function. Although Q-Learning methods do not need any model of the MDP, it can be advantageous to learn the transition function and then use V-Planning methods, because we have divided the complexity of the original learning problem.
We can use our already existing supervised learning interface for learning the transition function. The class CLearnedTransitionFunction represents a learned transition function. It inherits from the agent listener


and also from the transition function class, thus it can be used as a standard transition function. The class gets a supervised learner object as input. In the agent listener interface, the old state vector s_t and the action vector a_t are combined into the input vector and the new state vector s_{t+1} is stored in the output vector; both are passed to the supervised learner object. Thus at each step a new training example is created. For the transition function interface the testExample method of the supervised learner is used to calculate the learned output vector.
Any supervised learning method can be used with this design to learn the transition function, although only simple gradient descent methods are currently implemented in the Toolbox.
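A minimal sketch of this data flow, assuming a hypothetical SupervisedLearner interface (the names addExample and predict are placeholders, not the Toolbox's methods):

#include <vector>

struct SupervisedLearner {
    virtual ~SupervisedLearner() = default;
    virtual void addExample(const std::vector<double>& input,
                            const std::vector<double>& target) = 0;
    virtual std::vector<double> predict(const std::vector<double>& input) = 0;
};

// Input vector = concatenation of state and action.
std::vector<double> makeInput(const std::vector<double>& s,
                              const std::vector<double>& a) {
    std::vector<double> input(s);
    input.insert(input.end(), a.begin(), a.end());
    return input;
}

// Called once per agent step: create a training example from (s_t, a_t, s_{t+1}).
void onStep(SupervisedLearner& learner, const std::vector<double>& s,
            const std::vector<double>& a, const std::vector<double>& sNext) {
    learner.addExample(makeInput(s, a), sNext);
}

// Used as transition function: predict s_{t+1} from (s_t, a_t).
std::vector<double> predictNextState(SupervisedLearner& learner,
                                     const std::vector<double>& s,
                                     const std::vector<double>& a) {
    return learner.predict(makeInput(s, a));
}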

7.2 Value Function Approximation

There is a well developed theory for learning the value function with lookup tables, and also for guaranteeing the convergence of supervised learning algorithms on function approximation, but when the two concepts are combined the problem becomes more complex. In this chapter we introduce three concepts for gradient-based value function approximation: the direct gradient, the residual gradient and the residual algorithm. These algorithms are discussed in more detail in [6].
At first let us fix some notation: the real value function is still called V^π(s), the approximated value function is written as V_w. If we want to refer to the gradient of the value function or some other function with respect to the weights (i.e. dV_w(s)/dw) we will use the ∇_w operator.

For approximating the value function we need to minimize the Mean Squared Error (MSE):

E = (1/n) · Σ_s E[V^π(s) − V_w(s)]²    (7.7)

where V^π(s) is the real value of the state s and V_w(s) is the approximated value coming from our function approximator. Usually we do not know the real value of state s, consequently we estimate it by

V^π(s) = E[r(s, a) + γ · V^π(s′)] ≈ E[r(s, a) + γ · V_w(s′)]    (7.8)

which is supposed to be a more accurate estimate of V^π(s) than V_w(s). Consequently the estimate of V^π(s) is again only an approximation, because we have to use the function approximator for the value V_w(s′). The resulting error function

E = (1/2n) · Σ_s E[r(s, a) + γ · V_w(s′) − V_w(s)]² = (1/2n) · Σ_s E[residual(s, s′)]²    (7.9)

is called the mean squared Bellman residual. We define the residual to be the inner term of the error function.

residual(s, s′) = r(s,a) + γ · Vw(s′) − Vw(s) (7.10)

Note that this residual is basically the same as the temporal difference value used for TD-learning. For a finite state space this error function is only zero if we have an exact approximation of the value function. Now we can do stochastic gradient descent on this error function; the gradient of the error function in a state s is then given by

∇wE = E[residual(s, s′) · ∇wresidual(s, s′)] (7.11)

For deterministic processes we can omit the expectation and directly use the successor state s′. For stochastic processes we need an unbiased estimate of the error function in state s. The unbiased estimate of


the gradient of the square of an expectation, ∇(E[Y])², is given by y_1 · ∇y_2, where y_1 and y_2 are independent samples of Y (y_1 · ∇y_1 would instead estimate the gradient of E[Y²]). Consequently, for stochastic processes we would have to calculate the gradient from two independent samples of s′.

∇_w E = residual(s, s′_1) · ∇_w residual(s, s′_2)    (7.12)

But if the nondeterministic part of the process is small (for example, only a small noise term is added) then we can still use 7.11 as a good approximation of the gradient. Since we use only deterministic processes (or slightly stochastic processes) in this thesis, only one sample of the next state is used in the Toolbox for the gradient calculation.
An underlying problem of value function approximation is the accuracy of the approximated value function. Even a good approximation of the value function does not guarantee a good performance of the resulting policy. For example, there exist infinite horizon MDPs which can be proved to have the following properties [12]: If the maximum approximation error is given by

ε = max_{s∈S} |V_w(s) − V^π(s)|    (7.13)

then the worst case of the expected discounted reward (V(π) = Σ_{s∈D} d(s) · V^π(s)) of the greedy policy π̂ following the approximated value function (D is the set of all initial states, d(s) is the distribution over these states) is only bounded by:

V(π̂) = V(π) − 2 · γ · ε / (1 − γ)    (7.14)

where V(π) is the real expected discounted reward of the policy. So even for a good approximation, the value function can generate bad policies for γ values close to one. Certainly, this holds only for a specific MDP (which was built on purpose for the proof) and it is the worst case, but we have to keep in mind that value function approximation can be problematic.

7.2.1 Direct Gradient Algorithm

The direct gradient method is the most obvious algorithm implementing value function approximation, so it was also the first algorithm that was investigated [54]. Again we try to adjust the weights of the function approximation system to make the current output V_w(s) closer to the desired output r(s) + γ · V_w(s′). So we get the following update for the weights:

∆w_D = η · (r(s_t, a_t, s_{t+1}) + γ · V_w(s_{t+1}) − V_w(s_t)) · ∇_w V_w(s_t)    (7.15)

If we look at our error function E, the direct gradient method neglects the fact that the desired output also depends on the weights of the function approximator, because we use V_w(s′) to estimate it. Although this is the most obvious way to do value function approximation, this approach is not guaranteed to converge. Tsitsiklis and Van Roy [52] gave very simple examples of a two state MDP where this algorithm diverges. For a more detailed discussion of these examples also refer to [6]. A reason why this method does not work is that, if we change the value of one state with function approximation, we will usually change the values of other states too, including the value of the successor state s′. As a result this also changes the target value r(s) + γ · V_w(s′), which may actually move away from V_w(s) and hence cause divergence.
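A minimal sketch of the direct gradient update (equation 7.15) for a linear value function V_w(s) = w · φ(s), where the feature vector φ(s) equals ∇_w V_w(s); the names are illustrative.

#include <cstddef>
#include <vector>

double value(const std::vector<double>& w, const std::vector<double>& phi) {
    double v = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) v += w[i] * phi[i];
    return v;
}

void directGradientUpdate(std::vector<double>& w,
                          const std::vector<double>& phiS,      // features of s_t
                          const std::vector<double>& phiSNext,  // features of s_{t+1}
                          double reward, double gamma, double eta) {
    double tdError = reward + gamma * value(w, phiSNext) - value(w, phiS);
    // Only grad_w V_w(s_t) is used; the dependence of the target on w is ignored.
    for (std::size_t i = 0; i < w.size(); ++i)
        w[i] += eta * tdError * phiS[i];
}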


7.2.2 Residual Gradient Algorithm

The residual gradient algorithm calculates the real gradient of the residual given by:

residual(s, s′) = r(s,a) + γ · Vw(s′) − Vw(s)

So we have the following gradient of the error function:

∇_w E = (r(s_t, a_t, s_{t+1}) + γ · V_w(s_{t+1}) − V_w(s_t)) · (γ · ∇_w V_w(s_{t+1}) − ∇_w V_w(s_t))    (7.16)

and thus the weight update rule

∆w_RG = −η · ∇_w E = −η · (r(s_t, a_t, s_{t+1}) + γ · V_w(s_{t+1}) − V_w(s_t)) · (γ · ∇_w V_w(s_{t+1}) − ∇_w V_w(s_t))    (7.17)

Since we do stochastic gradient descent on the error function with the real gradient, this method is guaranteed to converge to a local minimum of the error function. The residual gradient algorithm updates the values of both states, s and s′, to achieve convergence on the error function. Unfortunately these convergence results do not necessarily mean that this algorithm learns as quickly as the direct gradient algorithm, or that the solution is the solution of the dynamic programming problem. In practice it turned out that the residual gradient algorithm is in fact significantly slower and does not find as good solutions as the direct algorithm. The advantage of this algorithm is that it is proven to be stable.

7.2.3 Residual Algorithm

The residual algorithm tries to combine the two algorithms to get the advantages of both of them: fast and stable learning. In figure 7.3 we see an illustration of the two gradients, the direct and the residual gradient. The direct gradient is known to learn fast, the residual gradient always decreases the error function E. The dotted line represents a plane perpendicular to the residual gradient. Each vector that lies on the same side of this hyperplane as the residual gradient also decreases the error function. So, if the angle between the two gradient vectors is acute, we can use the direct gradient. If the angle is obtuse, the direct gradient would lead to divergence, but we can use a vector that is as close as possible to the direct gradient while still being located on the same side of the hyperplane as the residual gradient (see figure 7.3). This can be achieved by using a weighted average of the two gradient vectors [6]. So for a β ∈ [0, 1] we can calculate our new weight update by:

∆w_R = (1 − β) · ∆w_D + β · ∆w_RG
     = −η · (r(s_t, a_t, s_{t+1}) + γ · V_w(s_{t+1}) − V_w(s_t)) · [−(1 − β) · ∇_w V_w(s_t) + β · (γ · ∇_w V_w(s_{t+1}) − ∇_w V_w(s_t))]
     = η · (r(s_t, a_t, s_{t+1}) + γ · V_w(s_{t+1}) − V_w(s_t)) · (∇_w V_w(s_t) − β · γ · ∇_w V_w(s_{t+1}))    (7.18)

So in the residual algorithm we additionally attenuate the influence of the successor state with the factor β. By this definition the residual gradient and the direct gradient algorithm are both special cases of the residual algorithm. Depending on the β value, this method is guaranteed to converge to a local minimum of the error function. There are two methods proposed by Baird for the choice of the β value:

• Constant weighting: The two gradients are summed up with a constant weight factor β. The β value can then be found by trial and error; the smallest value of β should be chosen that does not blow up the value function.


Figure 7.3: (a) Acute Angle, (b) Obtuse Angle, (c) Optimal Update Vector. In the case of an acute angle we can directly take the direct gradient weight update; for obtuse angles we have to calculate the new update vector w_R.

• Calculate β: We can alternatively calculate the lowest possible β value in the range [0, 1] which still ensures that the angle between the epoch-wise residual gradient ∆W_RG and the residual weight update ∆W_R is acute (∆W_RG · ∆W_R > 0). This β value can be found by taking a value slightly bigger than the one that fulfills the equation ∆W_RG · ∆W_R = 0 (a code sketch of this calculation follows after this list):

((1 − β′) · ∆W_D + β′ · ∆W_RG) · ∆W_RG = 0

β′ = −(∆W_D · ∆W_RG) / ((∆W_RG − ∆W_D) · ∆W_RG)

β = β′ + ε    (7.19)

If this equation yields a β value outside the interval [0, 1], the angle between the residual gradient and the direct gradient is already acute, so a β value of 0.0 can be used to provide maximum learning speed. The disadvantage of this adaptive β calculation is that we need estimates of the real epoch-wise direct gradient and residual gradient. Using the stochastic gradient of only one step (which is used by TD-methods) does not work because these gradient estimates are very noisy.

But we can estimate the epoch-wise gradient incrementally. The epoch-wise calculated weight updates ∆W_D and ∆W_RG can be estimated by traces for the direct and the residual gradient. The traces used for the direct and the residual gradient weight updates are updated the following way:

∆w_d = (1 − µ) · ∆w_d + µ · residual(s_t, s_{t+1}) · ∇_w V_w(s_t)    (7.20)

∆w_rg = (1 − µ) · ∆w_rg + µ · residual(s_t, s_{t+1}) · (∇_w V_w(s_t) − γ · ∇_w V_w(s_{t+1}))    (7.21)

Now we can use the traces ∆w_d and ∆w_rg as estimates for ∆W_D and ∆W_RG in the adaptive β calculation.
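A minimal sketch of the trace updates (equations 7.20/7.21) and the adaptive β calculation (equation 7.19); the function names and the default ε are illustrative assumptions.

#include <cstddef>
#include <vector>

static double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Trace update, called once per step with the corresponding gradient
// (grad_w V(s_t) for the direct trace, grad_w V(s_t) - gamma*grad_w V(s_{t+1})
// for the residual trace).
void updateTrace(std::vector<double>& trace, const std::vector<double>& grad,
                 double residual, double mu) {
    for (std::size_t i = 0; i < trace.size(); ++i)
        trace[i] = (1.0 - mu) * trace[i] + mu * residual * grad[i];
}

// Adaptive beta from the epoch-wise gradient estimates (equation 7.19).
double calculateBeta(const std::vector<double>& dwD,
                     const std::vector<double>& dwRG,
                     double epsilon = 0.01) {
    double denom = dot(dwRG, dwRG) - dot(dwD, dwRG);  // (dwRG - dwD) . dwRG
    if (denom == 0.0) return 0.0;
    double beta = -dot(dwD, dwRG) / denom + epsilon;  // slightly above the root
    // Outside [0,1] the angle is already acute -> use the direct gradient.
    if (beta < 0.0 || beta > 1.0) return 0.0;
    return beta;
}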

In general we cannot say which algorithm works best. This again depends on the problem and in particular on the function approximator used.

7.2.4 Generalizing the Results to TD-Learning

These three weight update schemes can be generalized to all the discussed value-based learning algorithms; we just need to adapt the choice of our residual function. Each of the proposed algorithms can be implemented


with one of the three gradient calculation methods. Here we show the equations for learning either the V-Function or the Q-Function with TD(0)-Learning and the residual algorithm.

• TD V-Function Learning:

∆w = η · (r(s_t, a_t, s_{t+1}) + γ · V_w(s_{t+1}) − V_w(s_t)) · (∇_w V_w(s_t) − β · γ · ∇_w V_w(s_{t+1}))    (7.22)

• TD Q-Function Learning:

∆w = η · (r(s_t, a_t, s_{t+1}) + γ · Q_w(s_{t+1}, a_{t+1}) − Q_w(s_t, a_t)) · (∇_w Q_w(s_t, a_t) − β · γ · ∇_w Q_w(s_{t+1}, a_{t+1}))    (7.23)

7.2.5 TD(λ) with Function approximation

We derived the equations for TD(0) learning, but as we have already seen, the use of e-traces can considerably improve the learning performance. Can we use eligibility traces with function approximation? With function approximation we do not have any discrete state representation, so we cannot calculate the 'responsibility' of a state for a TD update. But we can use the current gradient as a sort of state representation, and calculate eligibility traces for the weights of the approximator instead of eligibility traces for states. For the direct gradient method we can use the same justification for using eligibility traces as for the discrete state model: we change the value of states which are likely to be responsible for an occurred TD-error. The direct gradient method has been extensively studied; the strongest theoretical results from Tsitsiklis and Van Roy [52] prove that the algorithm converges with linear function approximation when learning is performed along the trajectories given by a fixed policy (policy evaluation). The policy can then be improved by discrete policy improvement and policy evaluation steps.
But for the residual and residual gradient algorithms, we do not update the value of states anymore; we minimize the error function E at each step. So can we still use e-traces? Although this problem has, according to our state of knowledge, not been addressed in the literature, we answer this question with yes. If we minimize the error function at step t, which is given by E_t = ½ · [r_t + γ · V(s_{t+1}) − V(s_t)]², and the residual is positive, the value of the current state s_t will increase (and the value of the next state s_{t+1} decrease); as a result the residual from step t − 1 will also increase (since V(s_t) increased). Consequently it makes sense to update E_{t−1} also with a positive residual error. This suggests that using eligibility traces makes sense even if we calculate the gradient of an error function and not the gradient of a value function.
For the eligibility traces we can again use replacing or non-replacing e-traces:

• Non-replacing e-traces: Here the updates of the eligibility traces are simply summed up:

e_{t+1} = λ · γ · e_t − ∇_w residual(s, s′)    (7.24)

• Replacing e-traces: In this case it becomes more complicated, because we can now have different signs in the eligibility traces. We decided on the following approach:

e_{t+1}(w_i) = absmax(λ · γ · e_t(w_i), −∇_{w_i} residual(s, s′)),   if sign(−∇_{w_i} residual(s, s′)) = sign(e_t(w_i))
e_{t+1}(w_i) = −∇_{w_i} residual(s, s′),   otherwise    (7.25)

If the eligibility trace e_i for weight w_i and the current negative derivative of the residual, −∂residual(s, s′)/∂w_i, have the same sign, the value with the larger magnitude is chosen. So the largest weight update from the past is always kept as long as the sign does not change. Otherwise, if the e-trace and the derivative point in different directions, just the value of the derivative is used and the old e-trace value is discarded.


This is done because the updates from the past are likely to contradict the current weight update if they point in a different direction. This approach was empirically evaluated to work well, in most cases better than the accumulating e-traces algorithm (a code sketch of this update follows after this list).
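A minimal sketch of the replacing e-trace update of equation 7.25 for weight-based traces; negGrad holds −∇_w residual(s, s′) for each weight, and the names are illustrative.

#include <cmath>
#include <cstddef>
#include <vector>

void updateReplacingETraces(std::vector<double>& e,
                            const std::vector<double>& negGrad,
                            double lambda, double gamma) {
    for (std::size_t i = 0; i < e.size(); ++i) {
        double decayed = lambda * gamma * e[i];
        bool sameSign = (negGrad[i] >= 0.0) == (e[i] >= 0.0);
        if (sameSign)
            // Keep whichever update has the larger magnitude ("absmax").
            e[i] = std::fabs(decayed) > std::fabs(negGrad[i]) ? decayed : negGrad[i];
        else
            // Contradicting directions: discard the old trace.
            e[i] = negGrad[i];
    }
}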

Both approaches rely on the assumption that ∇_w residual(s, s′) is constant over time, which is in general not true, because changing the weights in one state usually also changes the gradient of the value function in all other states. But if we assume only small weight changes during one episode we can use this approximation [15]. Only for linear function approximators is this assumption always true.
Another approach for eligibility traces is to store the state vectors from the past (or the last n steps) and recalculate the gradients of all these states at each step. This approach gets rid of the assumption that the gradient is constant over time, but it is obviously computationally considerably more expensive.

7.2.6 Implementation in the RL Toolbox

Since the different gradient calculation schemes can be used for any value based algorithm, a general imple-mentation of the gradient calculation is needed.

The Residual Functions

In our approach we designed individual interfaces for defining the residual (CResidualFunction) and for defining the gradient of the residual (CResidualGradientFunction). The residual interface gets the old value V_t, the new value V_{t+1}, the reward and the duration of the step as input. The residual gradient interface gets the gradient of the V-Function in the old state, ∇_w V(s_t), the gradient in the new state, ∇_w V(s_{t+1}), and the duration of the step as input. It has to return the gradient of the residual. The duration is used to calculate the corresponding SMDP updates (exponentiate the γ value). For both interfaces it makes no difference whether these values come from a Q-Function or a V-Function.
For the residual function we define at the moment only one class, which calculates the standard residual function

residual(V_t, r_t, V_{t+1}) = r_t + γ · V_{t+1} − V_t

We will define additional residuals later.
For the gradient calculation interface we define one class for calculating the direct gradient (it just returns the gradient of state s_t) and one for the residual gradient.
For the residual algorithm, we provide the class CResidualBetaFunction which superimposes the direct and the residual gradient vector with the variable β in the described way. The β value is calculated by a separate interface, CAbstractBetaCalculator, which also gets the direct and the residual gradient as inputs. There are two implementations of this interface.

• CConstantBetaCalculator: Always returns a constant β value, which can be set by the parameter interface.

• CVariableBetaCalculator: Calculates the best β value for the given direct and residual gradients (see 7.2.3).

Using Eligibility Traces

The e-trace classes for tracing the gradient are basically the same as for the discrete state representation, because from the point of view of the software functionality there is no difference between calculating e-traces for state indices or for weight indices. We add functions for directly adding a gradient to the e-traces list instead of adding


a state collection object; the target value function is also updated directly through the gradient update function interface. The update method for replacing e-traces (see 7.25) also has to be changed slightly. These extensions were made in the e-trace classes for value and action value functions (CGradientVETraces and CGradientQETraces).

The TD-Gradient Learner classes

New learner classes for learning the value (CVFunctionGradientLearner) and the action value function(CTDGradientLearner) with function approximation are created. Both classes are subclasses of the alreadyexisting corresponding TD-Learner class. These classes additionally get a residual and a residual gradientfunction as input. At each step the values of the new and old states are calculated, and then transferred to theresidual function. This residual error value is then used as temporal difference. The gradient at the currentstate and at the next state is also calculated and passed to the residual gradient function object. The result isthen used to update the gradient e-traces object of the learner. The rest of the functionality is inherited fromthe super classes.

The TD-Residual Learner classes

The variable β calculation is a special case of the gradient learner model because in this case we need an estimate of the epoch-wise gradient, not only the gradient of a single step. If we use the previously defined gradient learner classes (CVFunctionGradientLearner and CTDGradientLearner) for the residual algorithm, only the single step gradients are used to calculate the β value. This obviously works for a constant β value, but for the variable β calculation we have to pass an epoch-wise estimate of the gradient to the β calculator interface.
We use the traces ∆w_d and ∆w_rg, as discussed, for estimating the direct and the residual gradient vectors epoch-wise; here the usual eligibility traces classes are used and updated in the described way. The residual learner class also gets a beta calculator as input, so it can calculate the best β value for the estimated epoch-wise gradients with the adaptive β calculation class CVariableBetaCalculator. The residual learner classes (CVFunctionResidualLearner and CTDResidualLearner) also maintain two eligibility traces objects instead of one: one for the direct gradient update, e_d, and one for the residual gradient update, e_rg. After calculating the β value, the direct gradient e-trace is weighted by (1 − β) and the residual e-trace by β for the weight update.

∆w = η · ((1 − β) · e_d + β · e_rg) · residual(s, s′)

Through this approach the best estimate of β is always used, even for error functions from the past (with the use of e-traces). The residual learner classes are again provided for value function learning (CVFunctionResidualLearner) and action value function learning (CTDResidualLearner).

7.3 Continuous Time Reinforcement Learning

Doya [17] proposed a value based RL learning framework for continuous time dynamical systems withouta priori discretization of time, state and action space. This framework was also extensively used by Coulom[15] in his PhD thesis and Morimoto [30], [29] in his experiments.


7.3.1 Continuous Time RL formulation

The system is now described by the continuous time deterministic system

ṡ(t) = f(s(t), u(t))    (7.26)

where s ∈ S ⊂ R^n and u ∈ U ⊂ R^m; S is the set of all states and U is the set of all possible actions.

Continuous Time Value-Functions

Again we want to find a (real valued) policy π(s) which maximizes the cumulative future reward, but in this case the equations are formulated in continuous time. Consequently we have to integrate the future reward signal over time to calculate the value of state s.

V^π(s(t_0)) = ∫_{t_0}^{∞} exp(−(t − t_0) · s_γ) · r(s(t), u(t)) dt    (7.27)

where s(t) and u(t) follow the system dynamics and the given policy, respectively. s_γ is the continuous discount factor and corresponds to the inverse decay time of the reward signal r. Again we define the optimal value function V∗(s) = max_π V^π(s) for all s ∈ S. If we consider a time step of length ∆t we can write the value function in its recursive form.

V^π(s(t_0)) = ∫_{t_0}^{t_0+∆t} exp(−(t − t_0) · s_γ) · r(s(t), u(t)) dt + exp(−∆t · s_γ) · V^π(s(t_0 + ∆t))    (7.28)

For small ∆t values this equation can be approximated by

V^π(s(t_0)) ≈ ∆t · r(s(t_0), π(s(t_0))) + (1 − s_γ · ∆t) · V^π(s(t_0) + ∆s)    (7.29)

with ∆s = f(s_0, π(s_0)) · ∆t    (7.30)

The Hamilton-Jacobi-Bellman Equation

This is still similar to the discrete time equations, but now we can subtract V^π(s(t)) on each side and divide by ∆t:

0 = r(s, π(s)) − s_γ · V^π(s + ∆s) + (V^π(s + ∆s) − V^π(s)) / ∆t    (7.31)

Taking the limit ∆t → 0 we get the following equation:

0 = r(s, π(s)) − s_γ · V^π(s) + dV^π(s)/dt = r(s, π(s)) − s_γ · V^π(s) + dV^π(s)/ds · f(s, π(s))    (7.32)

A similar equation can be found for the optimal value function by always performing the greedy action; this equation is called the Hamilton-Jacobi-Bellman equation and is given by:

0 = max_{u∈U} [ r(s, u) − s_γ · V^π(s) + dV^π(s)/ds · f(s, u) ]    (7.33)

This is the continuous time counterpart of the Bellman Optimality Equation. We also define the Hamiltonian H for any value function V^π to be:

H(t) = r(s(t), u(t)) − s_γ · V^π(s(t)) + dV^π(s(t))/ds · f(s(t), u(t))    (7.34)

This definition is analogous to the definition of the discrete time Bellman residual or the temporal difference error. The Hamiltonian of an estimated value function V equals 0 for all states s only if the estimate V equals the real value function V^π.


7.3.2 Learning the continuous time Value Function

Now that we have derived the continuous time residual, we can use the same techniques as for discrete time. Basically we want to minimize the error function

E = ½ · Σ_t H(t)²    (7.35)

per step with gradient descent techniques.

Updating the Value and the Slope

The most obvious way is to take the Hamiltonian as it is and calculate either the direct gradient’s, residualgradient’s or the residual algorithm’s weight update. For example the weight update of the residual gradientalgorithm is given by

ẇ = −η · ∇_w E(t) = η · H(t) · [ s_γ · ∇_w V_w(s_t) − ∇_w (dV_w(s_t)/ds) · f(s_t, u_t) ]    (7.36)

There are two problems that arise with this method. The algorithm can only be used if the process has continuous and differentiable state dynamics. But many interesting problems have discrete deterministic discontinuities, like a mechanical shock that causes a discontinuity in the velocity. In this case we cannot calculate the derivative ṡ(t) = f(s_t, u_t). The second potential problem is the symmetry in time of the Hamiltonian: the value function update is only calculated with the current state s_t. This symmetry in time is reported by Doya and Coulom to be a severe problem that causes the algorithm to blow up.

Approximating the Hamiltonian

The Hamiltonian H can be approximated by replacing the derivative V̇(t) by an approximation. This approximation usually contains the asymmetric time information (i.e. the value of the next state). In the literature we can find two different approximation schemes:

• Euler Differentiation: The time derivative of V is approximated by the difference quotient dV(t)/dt = (V(t + ∆t) − V(t)) / ∆t. As a result our residual looks the following way:

residual(t) = r(t) + (1/∆t) · ( (1 − s_γ · ∆t) · V(t + ∆t) − V(t) )    (7.37)

This method was proposed and used by Doya. By setting a fixed step size ∆t, scaling the value function through V_d = (1/∆t) · V and setting γ = 1 − s_γ · ∆t, the Euler TD-error coincides with the conventional TD-error td_d(t) = r(t) + γ · V_d(t + 1) − V_d(t).

• Complete Interval Approximation: Here we additionally approximate the value of V in the interval [t, t + ∆t] by the average of the values at the interval limits:

residual(t) = r(t) − s_γ · (V(t + ∆t) + V(t)) / 2 + (V(t + ∆t) − V(t)) / ∆t    (7.38)
            = r(t) + (1/∆t) · ( (1 − s_γ · ∆t / 2) · V(t + ∆t) − (1 + s_γ · ∆t / 2) · V(t) )    (7.39)

This method was proposed and used by Coulom.


Both methods are just slightly different approximation schemes, so they are supposed to give the same results. Again we can use the direct gradient, the residual gradient or the residual algorithm for these residuals.
Comparing the residuals to the discrete case we can see that, because of the approximation of the Hamiltonian H, which eliminates the time derivative of the value function V, the main difference is a different weighting of the current reward relative to the value function. The magnitude of the continuous time value function is ∆t times smaller than in the discrete case, but the influence of the value function in the residual calculation is therefore 1/∆t times higher. As a result of this scaling of the value function in the residual calculation we would have to use smaller learning rates in order to avoid divergence of the algorithm. Usually the residual is multiplied by ∆t to compensate for this (to calculate ∆w from ẇ). Consequently we get the following residual function (e.g. for the Euler residual):

residual(t) = r(t) · ∆t + (1 − s_γ · ∆t) · V(t + ∆t) − V(t)

If we set γ = (1 − s_γ · ∆t) we can see an additional interpretation of continuous time RL, namely that the reward is scaled by the time step ∆t.

7.3.3 Continuous TD(λ)

Naturally we can also use eligibility traces for the continuous time formulation. Coulom derived the e-trace equations in continuous time for the direct gradient algorithm. In this thesis we will not look at this derivation, but we will use his results and extend them to the residual gradient and residual algorithms. The continuous time eligibility traces for the direct gradient algorithm are given by

ė = −(s_γ + s_λ) · e + ∇_w V_w(s(t))    (7.40)

This equation can be easily extended for the residual or residual gradient algorithms with the same justifica-tions we used for the discrete time case.

ė = −(s_γ + s_λ) · e − ∇_w H(t)    (7.41)

The weight update using e-traces is given by

∆w = η · H(t0) · ∆t0 · e(t0). (7.42)

These equations again assume that the gradient of H(t) is independent of the weight vector, which is only true for linear approximators. By using a fixed discretization step ∆t_i = ∆t we get the following discretized equation:

e(t) = e(t − 1) + ∆e = e(t − 1) + ∆t · ė = (1 − (s_γ + s_λ) · ∆t) · e(t − 1) + ∇_w H(s_t, s_{t+1}) · ∆t    (7.43)

which is the same as in the discrete time TD(λ) algorithm if we set λ · γ = (1 − (s_γ + s_λ) · ∆t) and use a 1/∆t times higher learning rate (which is already included in the residual calculation in the Toolbox, so the same range of learning rates can be used for discrete and continuous time learning).
The continuous time TD(λ) equations are almost the same as in the discrete case; the differences are:

• Another set of parameters is used: s_γ and s_λ instead of γ and λ.

• The value function is scaled by the factor 1/∆t in the residual calculation, resulting in a higher emphasis of the V-Function in comparison to the reward function.


• The residual (the TD error) is calculated slightly differently, depending on the approximation scheme used.

So for value function learning there are only small differences, and there is nothing really new compared to the discrete algorithm. The big difference to discrete time learning would be the incorporation of the gradient knowledge dV(s)/ds, but this is reported not to be stable in the proposed way; using the gradient information differently could nevertheless be an approach worth investigating. However, there are quite a few differences and advantages in the action selection part, which we will discuss in the following section.

7.3.4 Finding the Greedy Control Variables

Although there is hardly any difference in learning the value function for continuous time learning, we can use the state dynamics of the system in continuous time RL for action selection, which gives us the advantage of using the system dynamics as a sort of prior knowledge (comparable to V-Function planning). For the continuous case we define the optimal policy to maximize the Hamiltonian H.

π(s) = argmax_{u∈U} ( r(s, u) − s_γ · V^π(s) + dV^π(s)/ds · f(s, u) ) = argmax_{u∈U} ( r(s, u) + dV^π(s)/ds · f(s, u) )    (7.44)

This is the continuous counterpart of V-Function planning. The advantage is that we do not need to predict all next states; the gradient of the value function (which has to be calculated only once) and the state derivative f(s, u) are used instead. So this approach is computationally cheaper (if the gradients can be calculated easily), but the action is only optimal for an infinitely small time interval ∆t, so it will in general not find as good solutions as the standard planning technique. We also still have to use either a discretized action set or complex optimization techniques to find the optimal control vector.

Value Gradient Based Policies

If we are using a model that is linear with respect to the control variables (f(s, u) = B(s) · u + a(s), see 7.1.5) and the reward signal r(s, u) has certain properties, the optimization problem of 7.44 has a unique solution and we can find a closed form expression of the greedy policy. The greedy action is now defined through the equation

π∗(s) = argmax_{u∈U} ( r(s, u) + dV^π(s)/ds · B(s) · u )    (7.45)

If the reward signal is independent of the action u (so r(s, u) = r(s)) and the control vector is limited to the interval [u_min, u_max], the greedy action can easily be found by looking at the signs of dV^π(s)/ds · B(s). If a component of this vector is positive, the maximum control value should be taken, otherwise the minimum control value. This kind of policy is called the optimal 'bang-bang' control policy.

π(s) = u_min + ( sign( dV^π(s)/ds · B(s) ) + 1 ) / 2 · (u_max − u_min)    (7.46)

The bang-bang control law always chooses the limit values of the control variables. Although the bang-bang policy is optimal (for infinitesimally small time steps), the control is not smooth and the chattering can damage physical systems. Therefore we introduce a smoothed version of the bang-bang policy by smoothing


out the sign(x) function. This is done with the logsig function, a sigmoid function that saturates at logsig(−∞) = 0 and logsig(∞) = 1.

π(s) = u_min + logsig( c · dV^π(s)/ds · B(s) ) · (u_max − u_min)    (7.47)

The vector c specifies the smoothness of the control; for c → ∞ the policy becomes a bang-bang policy again. This control law smooths out the chattering, but it is also less effective.
If the reward depends linearly on the control and/or the control variables are not limited by a simple interval but by a convex region, the greedy action can be found by a linear program, or by a quadratic program for a reward with quadratic action costs (typical, e.g., for energy-optimal control).
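A minimal sketch of equations 7.46 and 7.47 for a model that is linear in the controls; dVds is the gradient of the value function with respect to the state, B is the control matrix B(s) stored row-wise, and all names and the layout are assumptions for illustration.

#include <cmath>
#include <cstddef>
#include <vector>

static double logsig(double x) { return 1.0 / (1.0 + std::exp(-x)); }

std::vector<double> valueGradientPolicy(const std::vector<double>& dVds,
                                        const std::vector<std::vector<double>>& B,
                                        const std::vector<double>& uMin,
                                        const std::vector<double>& uMax,
                                        const std::vector<double>& c,
                                        bool bangBang) {
    std::size_t m = uMin.size();
    std::vector<double> u(m);
    for (std::size_t j = 0; j < m; ++j) {
        // j-th component of dV/ds * B(s)
        double g = 0.0;
        for (std::size_t i = 0; i < dVds.size(); ++i) g += dVds[i] * B[i][j];
        double w = bangBang ? (g >= 0.0 ? 1.0 : 0.0)   // (sign(g)+1)/2, equation 7.46
                            : logsig(c[j] * g);        // smoothed version, equation 7.47
        u[j] = uMin[j] + w * (uMax[j] - uMin[j]);
    }
    return u;
}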

7.3.5 Implementation in the RL Toolbox

The design of the Toolbox has the same separation of value function learning and optimal action selection as discussed in the theory section. So we can interchange the different approaches and use, for example, continuous time RL for value function learning, but another approach like V-Planning for action selection. In the Toolbox a fixed discretization time step ∆t is used, as was done by Doya [17] and Coulom [15] in their experiments.

Learning the continuous time value function

As we have seen, there are only a few differences between continuous time RL and the standard TD techniques; actually the only difference is how the residual is defined. Our design of the TD-Learner classes already allows the definition of custom residual functions, thus it is convenient to use the same existing classes and just add new residual classes. In continuous time TD(λ) the update rules for the weights and the e-traces are given by

∆w = η · H(t) · ∆t · e(t).

e(t) = (1 − (s_γ + s_λ) · ∆t) · e(t − 1) + ∇_w H(t) · ∆t

In order to use the same TD learner classes, and to be able to use the same magnitude of learning rates, we multiply the continuous time residual H by the time step ∆t. As a result we get rid of the multiplication with ∆t in both equations. The difference in the factors used for attenuating the e-traces ((1 − (s_γ + s_λ) · ∆t) instead of γ · λ) is neglected, because it is just another choice of parameters, and the choice of the parameter λ is more intuitive anyway. The Toolbox contains two additional continuous time residuals, both sketched in code after the following list.

• ‘Euler’ Residual: Uses the standard Euler numerical differentiation to approximate the time derivative of V.

residual(r(t), V(t), V(t + 1)) = ( r(t) + (1/∆t) · ( (1 − s_γ · ∆t) · V(t + 1) − V(t) ) ) · ∆t
                              = r(t) · ∆t + (1 − s_γ · ∆t) · V(t + 1) − V(t)    (7.48)

∇_w residual(∇_w V(t), ∇_w V(t + 1)) = (1 − s_γ · ∆t) · ∇_w V(t + 1) − ∇_w V(t)    (7.49)

• ‘Coulom’ Residual: Approximates the Hamiltonian over the entire interval:

residual(r(t), V(t), V(t + 1)) = r(t) · ∆t + (1 − s_γ · ∆t / 2) · V(t + 1) − (1 + s_γ · ∆t / 2) · V(t)    (7.50)

∇_w residual(∇_w V(t), ∇_w V(t + 1)) = (1 − s_γ · ∆t / 2) · ∇_w V(t + 1) − (1 + s_γ · ∆t / 2) · ∇_w V(t)    (7.51)
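A minimal sketch of the two residuals (equations 7.48 and 7.50), already multiplied by the time step as done in the Toolbox; sGamma stands for s_γ and dt for ∆t, and the function names are illustrative.

// Euler residual (equation 7.48).
double eulerResidual(double reward, double vOld, double vNew,
                     double sGamma, double dt) {
    return reward * dt + (1.0 - sGamma * dt) * vNew - vOld;
}

// Coulom residual (equation 7.50).
double coulomResidual(double reward, double vOld, double vNew,
                      double sGamma, double dt) {
    return reward * dt + (1.0 - 0.5 * sGamma * dt) * vNew
                       - (1.0 + 0.5 * sGamma * dt) * vOld;
}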


Action Selection

For continuous time V-Planning with a finite action set (so the actions have been discretized) we use the same approach as for the discrete time V-Planning method. We build an extra read-only Q-Function class, CContinuousTimeQFunctionFromTransitionFunction, which implements the continuous time version of the action values:

Q(s, a) = r(s, a) + dV^π(s)/ds · f(s, a)    (7.52)

This Q-Function can again be used with any stochastic policy in the Toolbox.
The derivative of the value function with respect to the input state is calculated numerically (by the class CVFunctionNumericInputDerivationCalculator) with the three point rule

dV(s)/ds_i = ( V(s + α_i · e_i) − V(s − α_i · e_i) ) / (2 · α_i)    (7.53)

where α_i is a step size parameter for dimension i. The step size can be chosen for each state variable separately. The class is a subclass of CVFunctionInputDerivationCalculator, so other approaches for calculating this derivative can be added easily. An analytic approach was not used because it would be very complex to design in general for linear feature functions, due to the adaptive feature state model of the Toolbox. But implementing the analytical approach for a single function approximation scheme would be worth trying because of the expected increase in speed and perhaps also in the performance of the policy.
The implementation of the smooth value gradient policy is done by the class CContinuousTimeAndActionSigmoidVMPolicy. The class is a subclass of CContinuousActionController, so it already calculates continuous action vectors and does not choose a particular action object. The same approach for calculating the derivative dV(s)/ds is used as above. Both approaches need a model which is linear with respect to the control variables (class CLinearActionContinuousTimeTransitionFunction).

7.4 Advantage Learning

Advantage Learning was proposed by Baird [6] as an improvement of the Advantage Updating algorithm [5], which is not covered in this thesis. Instead of learning the action value Q(s, a), the algorithm tries to estimate for each state-action pair <s, a> the advantage of performing action a instead of the currently considered best action a∗. Thus this algorithm can usually only be used for a discrete set of actions, comparable to Q or SARSA learning.
The optimal advantage function A∗(s, a) is defined to be

A∗(s, a) = V∗(s) + ( E[r(s, a) + γ^{∆t} · V∗(s′)] − V∗(s) ) / (∆t · K)    (7.54)

Here γ^{∆t} is the discount factor per time step (consequently γ is the discount factor for one second) and K is the time unit scaling factor. The value of state s is defined to be the maximum advantage of state s (similar to the definition of the Q-Function).

V∗(s) = max_a A∗(s, a)    (7.55)

The Advantage can also be expressed in terms of action values

A(s, a) = V(s) − ( max_{a′} Q(s, a′) − Q(s, a) ) / (∆t · K)    (7.56)


Under this definition, to provide a better understanding, we can see the advantage as the value of the current state plus a (scaled) measure of how much performing action a affects the total discounted reward. The second term is obviously zero only for an optimal action and negative for all suboptimal actions.
Another important aspect of Advantage Learning is that the advantage gets scaled by the time step. For ∆t · K = 1 the algorithm completely coincides with the Q-Learning algorithm. In an optimal control task, as we choose a smaller ∆t value, the Q-Values of a state will all approach the value V(s), because performing different actions has less consequence for small time steps. But for advantage learning the differences between the advantages of the actions stay the same, because they get scaled by the time step ∆t.

7.4.1 Advantage Learning Update Rules

Advantage Learning is also a value based algorithm, so we just need to define a residual and we can use the theory about value function approximation discussed in the previous sections. The residual of advantage learning can easily be found by subtracting A∗(s, a) from both sides of equation 7.54 and inserting equation 7.55.

residual(t) = ( r_t + γ^{∆t} · max_{a′} A(s_{t+1}, a′) ) · 1/(∆t · K) + (1 − 1/(∆t · K)) · max_{a′} A(s_t, a′) − A(s_t, a_t)    (7.57)

This gives us the following residual gradient

∇_w residual(t) = 1/(∆t · K) · γ^{∆t} · ∇_w max_{a′} A(s_{t+1}, a′) + (1 − 1/(∆t · K)) · ∇_w max_{a′} A(s_t, a′) − ∇_w A(s_t, a_t)    (7.58)
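A minimal sketch of the advantage-learning residual (equation 7.57) for a discrete action set; advCur and advNext hold the advantages A(s_t, ·) and A(s_{t+1}, ·), aIdx is the index of the executed action, gammaDt stands for γ^{∆t}, and all names are illustrative.

#include <algorithm>
#include <cstddef>
#include <vector>

double advantageResidual(double reward, const std::vector<double>& advCur,
                         const std::vector<double>& advNext, std::size_t aIdx,
                         double gammaDt, double dt, double K) {
    double maxNext = *std::max_element(advNext.begin(), advNext.end());
    double maxCur  = *std::max_element(advCur.begin(), advCur.end());
    double scale   = 1.0 / (dt * K);
    return (reward + gammaDt * maxNext) * scale
           + (1.0 - scale) * maxCur - advCur[aIdx];
}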

7.4.2 Implementation in the RL Toolbox

The structure of the residual of advantage learning differs from our standard residual design because it depends on three values; additionally the optimal value of the current state is needed. As a result we cannot use our residual gradient framework directly. For the advantage learning algorithm, we derive an individual class from the Q-Function residual learner class. Just the calculation of the residual and the residual gradient is changed; the rest of the functionality remains the same. As a result we can use the residual algorithm for advantage learning with either a constant β factor or the optimal β factor. Again the direct gradient algorithm is obtained by choosing β = 0.0 and the residual gradient algorithm by choosing β = 1.0.

7.5 Policy Search Algorithms

Policy search approaches try to find a good policy directly in the policy parameter space; no value function is learned. Often the gradient of the value of a policy V(π) (or some other performance measure) with respect to the policy parameters is estimated and then used to improve the policy. These methods are usually referred to as policy gradient approaches. But any other kind of optimization method can also be used for improving the policy, like genetic algorithms, simulated annealing or swarm optimization techniques. Policy search algorithms avoid learning the value function, which can be advantageous because learning the (optimal) value function is usually very difficult. Moreover, when using value function approximation, even a good approximation of the value function can produce a bad greedy policy, which is another disadvantage of value function learning (see 7.2).
The disadvantage of policy search methods is that the increase of performance is harder to estimate when no value function is learned. When searching in the policy parameter space with gradient ascent, we are also likely to end up in a local maximum of the performance measure.


In this thesis we will only discuss policy gradient algorithms. We will consider either a stochastic policy µ(s, a, θ) or a deterministic policy π(s, θ), which is parameterized by θ. As usual we want to maximize the expected total discounted reward V, which is given by

V = Σ_{s∈D} d(s) · V(s)    (7.59)

for a finite set of states. D is the set of initial states and d(s) is the probability distribution over these states. Consequently all policy gradient algorithms want to do gradient ascent on V, and thus we get the weight update rule (we will again refer to the gradient dV/dθ as ∇_θ V or, for short, ∇V):

∆θ = η∇θV (7.60)

In the next sections we will first discuss a method for updating the weights given the gradient direction and the learning rate, then we will cover two different approaches for learning rate adaptation, and finally we will come to two approaches for estimating the gradient direction: GPOMDP [11] for stochastic policies using a discrete action set, and PEGASUS [33], which also works for real valued policies.

7.5.1 Policy Gradient Update Methods

At first we will discuss methods for updating the weights if we already have a given estimate of the gradient. Baxter [11] proposed the CONJPOMDP algorithm, which uses a variant of the Polak-Ribiere conjugate gradient algorithm for the weight updates. Although Baxter uses the CONJPOMDP algorithm for his GPOMDP algorithm, it can be used with any other gradient estimation approach. The algorithm is listed in algorithm 6.

Algorithm 6 CONJPOMDP
g = h = getGradient(θ)
while ||g|| ≥ ε do
  η = getLearningRate(h, θ)
  θ = θ + η · h
  ∆ = getGradient(θ)
  γ = ((∆ − g) · ∆) / ||g||²
  h = ∆ + γ · h
  if h · ∆ < 0 then
    h = ∆
  end if
  g = ∆
end while

The algorithm terminates when the norm of the gradient is smaller than a given constant ε. getGradient returns an estimate of the gradient direction and getLearningRate provides a good choice of the learning rate.
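A minimal sketch of this conjugate gradient loop; getGradient and getLearningRate are callbacks that would be supplied by a gradient estimator and a line search routine, and all names are illustrative.

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

void conjugateGradientAscent(Vec& theta,
                             const std::function<Vec(const Vec&)>& getGradient,
                             const std::function<double(const Vec&, const Vec&)>& getLearningRate,
                             double epsilon) {
    Vec g = getGradient(theta);
    Vec h = g;
    while (std::sqrt(dot(g, g)) >= epsilon) {
        double eta = getLearningRate(h, theta);
        for (std::size_t i = 0; i < theta.size(); ++i) theta[i] += eta * h[i];
        Vec delta = getGradient(theta);
        // Polak-Ribiere factor gamma = ((delta - g) . delta) / ||g||^2
        double gammaPR = (dot(delta, delta) - dot(g, delta)) / dot(g, g);
        for (std::size_t i = 0; i < h.size(); ++i) h[i] = delta[i] + gammaPR * h[i];
        if (dot(h, delta) < 0.0) h = delta;  // reset if no longer an ascent direction
        g = delta;
    }
}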

7.5.2 Calculating the learning rate

In this section we will discuss two different algorithms for calculating the learning rate; both are based on line search approaches. The estimation of the gradient usually needs a lot of training examples, so we have to exploit this information optimally. Gradient estimation schemes usually tell us the direction of the optimal update, but they do not tell us the step size of that update.


Value Based Line Search

This is a straightforward line search approach. Given a list of possible step sizes, it applies all step sizes and estimates the expected discounted reward V by simulation. At the end we can either search further between the given step sizes or immediately return the best step size. We estimate the V value by n different simulation trials with different initial states.
The disadvantage of this approach is that the list of step sizes has to be given, and the expected discounted reward estimates can be very noisy. Since we have to determine the best learning rate, we have to compare the expected discounted rewards V(θ_i), thus we have to calculate sign[V(θ_i) − V(θ_j)]. If the estimates are noisy, the variance of sign[V(θ_i) − V(θ_j)] approaches 1.0 (the maximum) as θ_i approaches θ_j. But if we use the same set of initial states (which removes a lot of noise from the estimates) for all estimations of V(θ_i), this effect can be kept at a minimum, at least for the deterministic benchmark problems.

Gradient Based Line Search

Gradient Based Line Search (GSearch) was proposed by Baxter [11] and used with the GPOMDP algorithm. GSearch tries to find two points θ_1 and θ_2 in the direction of the currently estimated gradient ∆ = ∇_θ V(θ_0), such that

∇_θ V(θ_1) · ∆ > 0,   ∇_θ V(θ_2) · ∆ < 0    (7.61)

The maximum must lie between these two points. The advantage of this approach is that even for noisy estimates, the variance of sign(∇_θ V(θ_i) · ∆) and sign(∇_θ V(θ_j) · ∆) is independent of the distance between the two parameter vectors. Calculating many new gradients just to update the parameter vector with a single gradient vector does not seem very effective, but we can use much noisier gradients (i.e. calculated from fewer training examples) for estimating the step size than the gradient used for updating the weights.
The GSearch algorithm starts with an initial step size η_0 and calculates the new gradient at this position. If the product with the given update gradient ∆ is positive, the algorithm searches with double the step size at each trial until the product of the newly calculated gradient and the update gradient is negative (actually this condition is relaxed to being below a certain ε in order to add robustness against errors of the gradient estimates). If the product of the first gradient calculation with the update gradient is negative, the same procedure is done with half the step size at each trial. The search interval is restricted to a certain search range. The last two search points η_{k−1} and η_k either bracket the maximum (sign(∇_θ V(θ_1) · ∆) ≠ sign(∇_θ V(θ_2) · ∆)), or no maximum was found in the search interval (then the last two search points are close to one of the search limits). In either case the last two step sizes are used to calculate the maximum.
If a maximum was bracketed (i.e. we have a positive and a negative gradient product), the two step sizes are used to estimate the maximum between them by calculating the maximum of a quadratic defined by these two points and the slope information coming from the gradients. If no maximum was found, the algorithm applies the midpoint of the last two search step sizes.

7.5.3 The GPOMDP algorithm

This algorithm was proposed by Baxter [9] [11] to estimate the gradient of a stochastic, parameterized policy for a partially observable Markov decision process (POMDP).
At first we consider an MDP with finite state space S. Given the parameterized stochastic policy µ(s, a, θ), the stochastic matrix P(θ) = [p_ij(θ)] defines the transition probabilities from state s_i to s_j. For the GPOMDP algorithm, we have to make some additional assumptions on the Markov chains which are created by the transition matrix P(θ). Each Markov chain M(θ) following the transition matrix P(θ) has to have a unique stationary distribution d(θ) = [d(s_0, θ), d(s_1, θ), ..., d(s_n, θ)] which satisfies the balance equation

d(\theta)^T \cdot P(\theta) = d(\theta)^T \qquad (7.62)

The stationary distribution gives us the probability of being in state s after having done infinitely many transitions according to P(θ). The stationary distribution is independent of the initial state. The spectral resolution theorem [25] states that the distribution of the states converges to the stationary distribution at an exponential rate; the time constant of this rate is called the mixing time. The stationary distribution (if it exists) is the first left eigenvector (the eigenvector with the highest eigenvalue) of the transition probability matrix and has the eigenvalue λ_1 = 1. The mixing time is determined by the second eigenvalue λ_2 of P(θ).
The algorithm tries to optimize the average reward criterion

V_A(\mu) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} r_t = d(\theta)^T \cdot r \qquad (7.63)

where r is the reward vector [r(s_0), r(s_1), r(s_2), ...]. We can also extend the algorithm to action-dependent rewards, but this is not done in this thesis. Note that optimizing the average reward and optimizing the total discounted reward are theoretically equivalent; furthermore, it can be shown that the discounted reward criterion V can be expressed by V_A [9]:

V = \frac{V_A}{1 - \gamma} \qquad (7.64)

so we do not lose any generality by optimizing the average reward instead of the discounted reward. Baxter [9] proved that the gradient of the average reward, ∇V_A(µ) = ∇d(θ) · r, equals lim_{β→1} ∇_β V_A(µ), where ∇_β V_A(µ) is given by:

\nabla_\beta V_A(\mu) = d(\theta)^T \cdot \nabla P(\theta) \cdot V_\beta
 = \sum_{i,j} d(i,\theta)\, \nabla p_{ij}(\theta) \cdot V_\beta(j)
 = \sum_{i,j} d(i,\theta)\, p_{ij}(\theta)\, \frac{\nabla p_{ij}(\theta)}{p_{ij}(\theta)} \cdot V_\beta(j) \qquad (7.65)

V_β measures the merit of the states and is equivalent to the definition of the value function:

V_\beta(s) = E\left[ \sum_{i=0}^{\infty} \beta^i \cdot r(s_i) \,\Big|\, s_0 = s \right]

The variable β sets the bias-variance trade-off: for β = 1 we have an unbiased estimator of the gradient, but with high variance; small values of β give lower variance, but the estimate ∇_β V_A(µ) might not even be close to the needed gradient ∇V_A(µ).
The term ∇p_ij(θ)/p_ij(θ) = ∇ ln p_ij can be seen as making the transition p_ij more probable. So this equation can intuitively be seen as increasing the probability of state transitions to states with a high performance measure (V_β) more than to states with a smaller performance measure. This gradient can be rewritten for a stochastic policy µ(s, a, θ) and action-dependent transition probabilities p_ij(a). For the transition probability we can write p_ij(θ) = Σ_a p_ij(a) µ(s_i, a, θ) and thus ∇p_ij(θ) = Σ_a p_ij(a) ∇µ(s_i, a, θ). Inserting this in equation 7.65, we get the following gradient calculation rule

\nabla_\beta V_A(\mu) = \sum_{i,j,a} d(i,\theta)\, p_{ij}(a)\, \mu(i,a,\theta)\, \frac{\nabla \mu(i,a,\theta)}{\mu(i,a,\theta)} \cdot V_\beta(j) \qquad (7.66)


GPOMDP uses one-step samples of this equation to estimate ∇_β V_A(θ); the algorithm is listed in algorithm 7.

Algorithm 7 GPOMDP gradient estimation
z = 0, ∆ = 0
for each new step < s_t, a_t, r_t, s_{t+1} >, t ∈ 1 ... T do
    z_{t+1} = β · z_t + ∇ log µ(s_t, a_t, θ)
    ∆_{t+1} = ∆_t + r_t · z_{t+1}
end for
∆_T = ∆_T / T
return ∆_T

It uses two traces for each weight, z_t and ∆_t. Baxter [9] proved that the series ∆_t gives an unbiased estimate of equation 7.66, i.e. that

\lim_{t \to \infty} \Delta_t = \nabla_\beta V(\theta) \qquad (7.67)

In combination with the result that lim_{β→1} ∇_β V(θ) = ∇V(θ), this shows that GPOMDP produces unbiased estimates of ∇V(θ) if we set β to 1.0.
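To make the per-step bookkeeping of algorithm 7 concrete, the following minimal C++ sketch maintains the two traces z_t and ∆_t for each weight (a hypothetical class, not the Toolbox implementation described in section 7.5.5); logPolicyGradient is assumed to be ∇_θ log µ(s_t, a_t, θ) for the executed action.

#include <cstddef>
#include <vector>

class GPOMDPEstimator {
public:
    GPOMDPEstimator(std::size_t numWeights, double beta)
        : beta_(beta), z_(numWeights, 0.0), delta_(numWeights, 0.0), steps_(0) {}

    // called once per step <s_t, a_t, r_t, s_{t+1}>
    void addStep(const std::vector<double> &logPolicyGradient, double reward) {
        for (std::size_t i = 0; i < z_.size(); ++i) {
            z_[i] = beta_ * z_[i] + logPolicyGradient[i];  // eligibility trace z_{t+1}
            delta_[i] += reward * z_[i];                   // accumulated gradient Delta_{t+1}
        }
        ++steps_;
    }

    // returns Delta_T / T, the estimate of the gradient of the average reward
    std::vector<double> gradientEstimate() const {
        std::vector<double> g(delta_);
        if (steps_ > 0)
            for (double &gi : g) gi /= static_cast<double>(steps_);
        return g;
    }

private:
    double beta_;
    std::vector<double> z_, delta_;
    std::size_t steps_;
};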

7.5.4 The PEGASUS algorithm

PEGASUS stands for Policy Evaluation-of-Goodness And Search Using Scenarios and was proposed by Ng and Jordan [33]. They successfully used the PEGASUS algorithm to control an inverted helicopter flight in simulation and also had good results using the learned policies for a real model helicopter. The PEGASUS algorithm can be used to learn any stochastic or deterministic policy, but it makes additional assumptions on the model in such a way that only simulated tasks can be learned.
A big problem for policy search algorithms is the noise in the performance estimate; consequently it is usually hard to decide which of two policies is better. The noise can be introduced by different initial state samples, by different (lucky or unfortunate) noise in the model, and also in the controller if we use some sort of exploration policy. As performance measure we again use the expected value of a policy, V(π). If we want to refer to the value of the policy in the (PO)MDP M we write V_M(π).

Converting stochastic (PO)MDPs to deterministic (PO)MDPs

PEGASUS solves this problem by adding additional assumptions to the model. In optimal control the MDP (or POMDP if parts of the model state are not observable) is defined via a generative model s_{t+1} ∼ f(s_t, a_t), which is usually a stochastic function. For PEGASUS we assume to have a stronger model: we use the deterministic function g : S × A × [0,1]^p → S for our state transition. The function g additionally depends on p random variables, so that g(s, a, p) with a uniformly distributed p vector is distributed in the same way as the stochastic transition function f(s_t, a_t). Consequently the function g has an additional input vector to specify the internal random process of f. The model g is called a deterministic simulative model. From probability theory it is known that we can sample any distribution by transforming one or more samples from the uniform distribution, so we can construct a deterministic simulative model for all generative models f. The deterministic model is obviously a stronger model than a generative model, but simulated tasks, which typically use uniform random samples from a random generator to simulate noise, already have this interface to the random generator indirectly. Thus the assumption of a deterministic simulative model does not restrict us severely if we are using a simulated model.
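As a small illustration of this construction (an example added here, not taken from the thesis), a generative model with additive Gaussian noise, f(s, a) = h(s, a) + N(0, σ²), can be written as a deterministic simulative model g(s, a, p_1, p_2) by transforming two uniform samples with the Box-Muller transform; h below is an assumed deterministic part of the dynamics.

#include <cmath>
#include <functional>

// Deterministic simulative model for additive Gaussian noise: two uniform
// samples p1, p2 in [0,1) are mapped to one standard normal sample.
double deterministicSimulativeStep(
    const std::function<double(double, double)> &h,   // deterministic dynamics h(s,a)
    double s, double a, double p1, double p2, double sigma)
{
    const double pi = 3.14159265358979323846;
    const double gauss = std::sqrt(-2.0 * std::log(1.0 - p1)) * std::cos(2.0 * pi * p2);
    return h(s, a) + sigma * gauss;
}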


Having access to the deterministic model g, it is easy to transform an arbitrary (PO)MDP M into an equivalent (PO)MDP M′ with deterministic transitions. The transformation is accomplished by adding an infinite number of uniformly sampled random state variables to the initial state s′_0 = < s_0, p_0, p_1, ... >. For simplicity we assume scalar p values; the extension to vectors is trivial. The transition s_{t+1} ∼ f(s_t, a_t) is now replaced by the transition

s'_{t+1} = \langle s_{t+1}, p_{t+1}, p_{t+2}, \dots \rangle = \langle g(s_t, a_t, p_t), p_{t+1}, p_{t+2}, \dots \rangle

Consequently the randomness of the MDP is fixed at the beginning of each episode. The rest of the MDP M remains the same in the MDP M′; the policy π and the reward function R still depend only on the original state space, not on the additional random variables. The initial state distribution D′ now consists of the initial state distribution D of the original MDP and infinitely many uniform distributions for the random state variables.
The question is whether it makes sense to use the same random numbers for two different policies, because the random number p_t might affect the performance measure positively when following the policy π_1, but it might have a negative effect when following policy π_2 because of the different context. Thus the performance estimates can still be noisy, because the random variables p do not give exactly the same conditions for all policies.

PEGASUS Policy Search Methods

If only the original state space is observed during an episode, one obtains a sequence that is drawn from the same distribution as would have been generated by the original MDP M. Thus it is also clear that a policy π has the same expected value in M as in M′ (V_M′(π) = V_M(π)); as a result we can optimize V_M′(π) instead of V_M(π).
The value of the policy π is given by

V_{M'}(\pi) = E_{s_0 \sim D'}\left[ V^\pi_{M'}(s_0) \right]

The expected value can be estimated by choosing n samples from the initial state distribution D′:

V_{M'}(\pi) \approx \frac{1}{n} \sum_{i=1}^{n} V^\pi_{M'}(s_0^i) \qquad (7.68)

The value of an initial state, V^π_{M′}(s_0^i), can be calculated by simulation, summing up the discounted rewards. The exact value would require an infinite sum; a standard approximation is to truncate the sum and use only H reward values for the calculation.

V^\pi_{M'}(s_0) = \sum_{t=0}^{H} \gamma^t r(s_t, a_t, s_{t+1}) \qquad (7.69)

Since we use a finite horizon H, we can restrict our initial state to H random variables; the rest is not needed anyway. For a given approximation error ε, H can be calculated by H_ε = log_γ(ε(1−γ)/(2 r_max)), where r_max is the maximum absolute reward value. Due to the fixed randomization of the (PO)MDP, the value of the policy V_{M′}(π) is a deterministic function. Consequently, we can use any standard optimization technique for finding a good policy. If, as in our case, the state, action and policy parameter space is continuous and all the relevant quantities are differentiable, we can use gradient ascent methods for optimization. A common problem for gradient ascent is that the reward signal must be continuous and differentiable, which is not the case if, for example, we give rewards only for specific target states or regions. One approach to deal with that barrier, which is often used in RL for optimal control anyway, is to smooth out the reward signal and use, for example, a distance measure to the target state as reward. This method is often referred to as shaping the reward function. (For a more detailed discussion about using shaping for RL see [39].)
The Toolbox contains two ways of calculating the gradient, which are introduced in the next sections.

Calculating the gradient numerically

One obvious way to calculate the gradient is to use numerical methods. We use the three-point rule to calculate the derivative of the value of policy π with respect to the policy parameter θ_i:

\nabla_{\theta_i} V(\theta) = \frac{V(\theta + \alpha e_i) - V(\theta - \alpha e_i)}{2 \alpha} \qquad (7.70)

Thus, the value of the policy has to be estimated twice for each weight of the policy, which is computationally very expensive. This approach is very cost intensive, but gives us quite accurate estimates of the gradient.
Ng and Jordan [33] used the numerical gradient for most of their experiments, but no detailed explanations were given as to how they calculated the gradient numerically.
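A minimal C++ sketch of this numerical gradient is given below (hypothetical signature, not the Toolbox class CPEGASUSNumericPolicyGradientCalculator); evaluate(θ) is assumed to return the scenario-based, deterministic policy value V_M′(θ), and α is the finite-difference step.

#include <cstddef>
#include <functional>
#include <vector>

// Three-point (central difference) gradient of the policy value, equation (7.70).
std::vector<double> numericalPolicyGradient(
    const std::vector<double> &theta,
    double alpha,
    const std::function<double(const std::vector<double>&)> &evaluate)
{
    std::vector<double> gradient(theta.size(), 0.0);
    for (std::size_t i = 0; i < theta.size(); ++i) {
        std::vector<double> plus(theta), minus(theta);
        plus[i]  += alpha;
        minus[i] -= alpha;
        // two policy evaluations per parameter; accurate as long as both use
        // the same fixed scenarios (random numbers)
        gradient[i] = (evaluate(plus) - evaluate(minus)) / (2.0 * alpha);
    }
    return gradient;
}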

Calculating the gradient analytically

The gradient of the value of the policy π(θ) can also be calculated analytically. For simplicity we assume that the reward signal depends only on the current state and is differentiable with respect to the input state variables.
The value of state s_0 is given by

V^\pi_{M'}(s_0) = r(s_0) + \gamma r(s_1) + \gamma^2 r(s_2) + \dots \qquad (7.71)

Given the derivatives of the reward function, the model and the policy, we can also calculate the gradient of V^π_{M′}(s_0) analytically, with considerable savings in computation time.

\nabla_\theta V^\pi_{M'}(s_0) = \gamma \frac{dr(s_1)}{ds} \frac{ds_1}{d\theta} + \gamma^2 \frac{dr(s_2)}{ds} \frac{ds_2}{d\theta} + \dots \qquad (7.72)

The derivative of the successor state s_{t+1} with respect to the policy parameters, ds_{t+1}/dθ, can be calculated incrementally given the derivative of state s_t:

\frac{ds_{t+1}}{d\theta} = \frac{d\, g(s_t, \pi(s_t, \theta), p)}{d\theta}
 = \left[ \frac{dg(s_t, \pi(s_t,\theta), p)}{ds} \;\; \frac{dg(s_t, \pi(s_t,\theta), p)}{da} \right] \cdot \left[ \begin{array}{c} \frac{ds_t}{d\theta} \\[4pt] \frac{d\pi(s_t,\theta)}{d\theta} \end{array} \right] \qquad (7.73)

The derivative dπ(s_t, θ)/dθ can be further resolved:

\frac{d\pi(s_t,\theta)}{d\theta} = \left[ \frac{d\pi(s_t,\theta)}{ds} \;\; \frac{\partial \pi(s_t,\theta)}{\partial \theta} \right] \cdot \left[ \begin{array}{c} \frac{ds_t}{d\theta} \\[4pt] I \end{array} \right] \qquad (7.74)

where I is a p × p identity matrix, p being the number of parameters used for the policy. Thus we need to know the input derivatives dr(s)/ds, dg(s,a,p)/ds, dg(s,a,p)/da and dπ(s_t,θ)/ds. If an analytical solution of these gradients is not available, these quantities can also be calculated numerically with only a small loss of performance.
Note that, to our knowledge, this is the first time the gradient has been calculated analytically in this way. We will show in the experiments section that this approach significantly outperforms the numerical approach in speed at a comparable learning performance.
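To illustrate how these derivatives propagate through an episode, here is a dense C++ sketch of the recursion of equations 7.72 to 7.74; all types and names are hypothetical stand-ins (the Toolbox uses sparse feature lists instead, see section 7.5.5), and the Jacobians are assumed to be supplied by the model, the policy and the reward function.

#include <cstddef>
#include <vector>

struct Jacobians {
    std::vector<std::vector<double>> dg_ds;   // n x n: dg/ds at (s_t, a_t)
    std::vector<std::vector<double>> dg_da;   // n x m: dg/da at (s_t, a_t)
    std::vector<std::vector<double>> dpi_ds;  // m x n: dpi/ds at s_t
    std::vector<std::vector<double>> dpi_dth; // m x p: partial pi / partial theta at s_t
    std::vector<double> dr_ds;                // n:     dr/ds at s_{t+1}
};

// dS is the n x p matrix ds_t/dtheta; returns ds_{t+1}/dtheta and adds
// discount * (dr/ds)^T * ds_{t+1}/dtheta to grad (length p), cf. (7.72).
std::vector<std::vector<double>> propagateStep(
    const std::vector<std::vector<double>> &dS, const Jacobians &J,
    double discount, std::vector<double> &grad)
{
    const std::size_t n = J.dg_ds.size(), m = J.dg_da[0].size(), p = grad.size();

    // dpi/dtheta = dpi/ds * ds/dtheta + partial pi / partial theta      (7.74)
    std::vector<std::vector<double>> dPi(m, std::vector<double>(p, 0.0));
    for (std::size_t k = 0; k < m; ++k)
        for (std::size_t q = 0; q < p; ++q) {
            double v = J.dpi_dth[k][q];
            for (std::size_t j = 0; j < n; ++j) v += J.dpi_ds[k][j] * dS[j][q];
            dPi[k][q] = v;
        }

    // ds_{t+1}/dtheta = dg/ds * ds/dtheta + dg/da * dpi/dtheta          (7.73)
    std::vector<std::vector<double>> dSnext(n, std::vector<double>(p, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t q = 0; q < p; ++q) {
            double v = 0.0;
            for (std::size_t j = 0; j < n; ++j) v += J.dg_ds[i][j] * dS[j][q];
            for (std::size_t k = 0; k < m; ++k) v += J.dg_da[i][k] * dPi[k][q];
            dSnext[i][q] = v;
            grad[q] += discount * J.dr_ds[i] * v;                      // (7.72)
        }
    return dSnext;
}

The caller starts with ds_0/dθ = 0, calls propagateStep once per simulated step with discount = γ^{t+1}, and obtains the analytical gradient of the truncated value (7.69) in grad.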

7.5.5 Implementation in the RL Toolbox

The design of the policy gradient learner classes matches our description of the structure of policy gradient methods. Since the policy gradient update scheme does not match the standard per-step update in RL, policy gradient learner classes cannot be used as pure agent listeners. The policy gradient estimator classes usually have a listener part, to be informed about the sequence of steps, but they also have methods for controlling the agent class (e.g. to tell the agent to simulate n episodes).

Policy Updater Classes

Policy updater classes (subclasses of CGradientPolicyUpdater) receive the estimated gradient as input and have to update the policy. The class is supposed to calculate a good learning rate and then directly update the policy via the gradient update function interface of the policy. There are three implementations of the updater class.

• Constant Step Size Update (CConstantPolicyGradientUpdater): Almost self-explanatory, uses a constant learning rate for each update.

• Value Based Line Search (CLineSearchPolicyGradientUpdater): Implements the discussed value-based algorithm. For the value estimation, a policy evaluator object is used; hence we can use the value or the average reward of a policy as performance measure. We can also set the number of episodes and steps per episode used for the performance estimation. The algorithm searches at the given step sizes and stores the performance. If there are any search steps left (searchSteps < maxSteps) after having estimated all learning rates, the algorithm continues the search in the neighborhood of the maximum by searching in the middle of two adjacent points. Eventually the learning rate with the maximum value is applied to update the policy's parameters.

• Gradient Based Line Search (CGSearchPolicyGradientUpdater): The GSEARCH class has a policy gradient estimator class object as input, thus it can calculate the gradient for a specified learning rate. This gradient estimator is usually a less accurate version of the gradient used for updating. The search process is then done in the discussed way and at the end the best learning rate is applied. We can set the search interval [η_min, η_max] and the initial learning rate η_0.

Policy Gradient Learner Classes

The task of the policy learner classes is to put the functionalities of the gradient estimator and the gradient update classes together. In the Toolbox there is just one implementation, the CONJPOMDP algorithm, but averaging over the old gradient estimates can be turned off. The class has access to a policy updater class and a policy gradient estimator class. The gradient for the update is calculated according to algorithm 6 as the weighted average over former gradient estimates.


Policy Gradient Estimator Classes

The gradient estimates themselves are calculated by subclasses of CPolicyGradientCalculator; these classes usually have direct access to the agent. The gradient is again represented by a feature list such as the one we already used for the gradient calculation of the value function. We will come to the different implementations of the estimator classes after having discussed the GPOMDP and the PEGASUS algorithm.

GPOMDP Gradient Estimation

The GPOMDP gradient estimator class (CGPOMDPGradientCalculator) implements the policy gradient estimator and also the agent listener interface. The class has access to the agent; in the policy gradient estimator interface, the class adds itself to the agent listener list and executes the specified number of episodes and steps. In the agent listener interface, the class maintains the two traces z_t and ∆_t for the local gradient of one episode. The gradient ∇ log(π(s_t, a_t, θ)) is calculated by the stochastic policy interface (see 6.2.5). After each episode the local gradient of one episode is added to a global gradient object, which is returned by the policy gradient estimator interface in the end.

PEGASUS Gradient Estimation

The transformation of the stochastic (PO)MDP M to the deterministic MDP M′ is not done explicitly; rather, we just use an individual random generator function rlt_rand. For our random generator function we can set a list of random variables uniformly distributed in [0,1]. If this list has been specified, the values from this list are taken, in order, instead of 'real' random numbers. This has the same effect as the deterministic simulative model if we restrict all simulated models to using this random number generator instead of the standard random generator. The problem is that we do not know exactly how many random samples are needed for one episode. Therefore, we used the following approach: in the first trial of the PEGASUS gradient estimation, no list is used; instead, the list is created simultaneously. For the following PEGASUS calls this list is always used; if more random samples happen to be needed later on, the list is enlarged again. As a result we get our deterministic (PO)MDP M′.
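The following minimal C++ sketch shows the idea of such a replayable random source (a hypothetical class, not the actual rlt_rand implementation): on the first pass the list of uniform samples is created on the fly, later passes replay the same list, and the list is enlarged on demand if more samples are needed.

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

class ReplayableUniform {
public:
    explicit ReplayableUniform(std::uint32_t seed) : rng_(seed), next_(0) {}

    // returns the next p value; draws a fresh uniform sample only if the list
    // is too short (first pass, or an episode needs more samples than before)
    double operator()() {
        if (next_ >= samples_.size())
            samples_.push_back(std::uniform_real_distribution<double>(0.0, 1.0)(rng_));
        return samples_[next_++];
    }

    // rewind before evaluating the next policy, so that every policy sees the
    // same sequence p_0, p_1, ... and the value estimate becomes deterministic
    void rewind() { next_ = 0; }

private:
    std::mt19937 rng_;
    std::vector<double> samples_;
    std::size_t next_;
};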

Estimation of the numerical gradient requires the class CPEGASUSNumericPolicyGradientCalculator, which uses the three-point method.
The analytical algorithm is more complex. We again use the policy gradient estimator and the agent listener interface simultaneously (similar to the GPOMDP algorithm). The policy gradient estimator part adds itself to the agent listener list and then starts the agent for a specified number of episodes and steps per episode.
In the agent listener part we need to calculate the derivative of the reward function, the policy and the transition function g. For calculating the derivative of the reward function a new method is added to the state dependent reward interface (CStateReward). In this method the user has to implement the derivative of the reward function if he wants to use the PEGASUS algorithm.
For calculating the derivative of the transition function and of the policy (with respect to the state), own differentiation classes are used (CTransitionFunctionInputDerivationCalculator and CCAGradientPolicyInputDerivationCalculator). There is one implementation for each of these differentiation classes, which does the differentiation numerically using the three-point method, but an analytical solution can easily be included with this approach. The described equations could all have been implemented by matrix multiplications, but because we are dealing with gradients, which are likely to be sparse, we decided on another approach. For ds_t/dθ, ds_{t+1}/dθ and also for dπ(s_t,θ)/dθ we maintain an individual data structure, which consists of a list of gradient objects (feature lists), one list for each continuous state variable (derivative of the successor states) and one for each control variable (derivative of the policy). We can write equation 7.73 in the following vector form:

\frac{ds_i(t+1)}{d\theta} = \sum_{j=1}^{n} \frac{dg_i(s(t),a,p)}{ds_j} \cdot \frac{ds_j(t)}{d\theta} + \sum_{k=1}^{m} \frac{dg_i(s(t),a,p)}{da_k} \cdot \frac{\partial \pi_k(s(t))}{\partial \theta} + \sum_{l=1}^{m} \frac{dg_i(s(t),a,p)}{da_l} \sum_{p=1}^{n} \frac{d\pi_l(s(t))}{ds_p} \cdot \frac{ds_p(t)}{d\theta} \qquad (7.75)

where the subscripts i, j, k, l, p are the indices of the state or control variables. Hence, all the mathematical operations consist of multiplying a list of gradients with a matrix (the derivatives of the policy and transition function) to get another list of gradients, and adding the result to an existing list of gradients. We implement our own function for doing that. The function takes an input gradient list of size n, an output gradient list of size m and a multiplication matrix of size m × n as input; it then calculates the product of the i-th input vector with the i-th column of the multiplication matrix and adds the result to the output gradient list:

o_j = o_j + M(j,i) \cdot u_i, \quad \text{for all } 1 \le i \le n \text{ and } 1 \le j \le m \qquad (7.76)

where o is the output list, u is the input list and M the multiplication matrix. With this operation, we can calculate ds_{t+1}/dθ from ds_t/dθ with equation 7.75. We maintain an individual list of gradients for the derivative of the policy, dπ(s_t)/dθ, separately, which is calculated by equation 7.74. The gradient feature list of ds_t/dθ is always stored for the calculations required in the next step; this gradient list is only cleared at the beginning of a new episode. Finally, having calculated ds(t)/dθ, the gradient list is multiplied with the reward gradient vector dr(s(t))/ds and the result is added to a global gradient feature list grad.

grad = grad + \gamma^{t} \sum_{i=1}^{n} \frac{dr(s(t))}{ds_i}\, \frac{ds_i(t)}{d\theta} \qquad (7.77)

After having executed all the gradient estimation episodes, this global gradient is returned from the gradient estimator interface.
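As an illustration of the multiply-and-add operation of equation 7.76 on sparse gradient lists, here is a minimal C++ sketch with hypothetical types (the Toolbox uses its own feature-list classes):

#include <cstddef>
#include <map>
#include <vector>

// Each gradient is stored sparsely as a map from weight index to value.
using SparseGradient = std::map<int, double>;

// output[j] += M[j][i] * input[i]  for all i (input size n) and j (output size m)
void multiplyAddGradientList(const std::vector<SparseGradient> &input,   // size n
                             const std::vector<std::vector<double>> &M,  // m x n matrix
                             std::vector<SparseGradient> &output)        // size m
{
    for (std::size_t j = 0; j < output.size(); ++j)
        for (std::size_t i = 0; i < input.size(); ++i) {
            const double factor = M[j][i];
            if (factor == 0.0) continue;            // skip zero entries of the Jacobian
            for (const auto &entry : input[i])      // accumulate the scaled sparse gradient
                output[j][entry.first] += factor * entry.second;
        }
}

Using this operation twice per step, with the Jacobians of π and g as multiplication matrices, implements the vector form (7.75) without ever building dense matrices for the gradients.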

7.6 Continuous Actor-Critic Methods

Actor-Critic methods can be viewed as a mixture of value-based and policy search methods. They learn the value function while representing the policy in a separate data structure. We have already discussed Actor-Critic methods for a discrete state and action set in chapter 4. By using a function approximator for the value function and using the gradient of the policy for the updates instead of the state indices, these approaches are easily extended to a continuous state space. But if we already use an individual parametrization for the policy, it would be more efficient to use continuous control values for our policy. In this chapter we will present two different methods for Actor-Critic learning with continuous control policies. At first we will discuss the stochastic real valued algorithm (SRV) [19], and then we will come to a new approach which is proposed in this thesis, which we will call the policy gradient Actor-Critic algorithm (PGAC).

7.6.1 Stochastic Real Valued Unit (SRV) Algorithm

The SRV algorithm was proposed by Gullapalli [19] for continuous optimal control problems. The initial definition was just for associative reinforcement learning, i.e. the algorithm was only used to optimize the immediate performance return. But by taking a learned value function as performance measure, this algorithm is easily extended to discounted infinite horizon control problems [17]. In this case, as for the Actor-Critic algorithms, the algorithm is independent of the critic part, so we can use any V-Learning algorithm to learn the V-Function. For more details about the Actor-Critic architecture please refer to section 4.6.

SRV Units

In the SRV algorithm, we have an SRV unit for each continuous control variable. An SRV unit returns a sample from the normal distribution N(µ(s_t, θ), σ(s_t, w)).
The mean value is defined by the actor's parameters θ. The actor can be represented by any kind of function approximation scheme. σ is a monotonically decreasing, non-negative function depending on the performance estimate. The variance σ may depend on the current performance estimate, which in our case is the value of the current state, V_w(s_t). The better the performance estimate, the lower the σ values that are used. For example, we can use a linear scaling of the value function for calculating the σ value:

\sigma(t) = K \cdot \left( 1 - \frac{V(s_t) - V_{min}}{V_{max} - V_{min}} \right) \qquad (7.78)

where K is a scaling constant. A multi-valued policy is defined by several SRV units:

\pi(s_t) = \mu(s_t, \theta) + n(t) \qquad (7.79)

where n(t) is the noise vector which is sampled from the distribution N(0, σ(s_t)). Alternatively we can also use a filtered noise signal to obtain certain continuity properties of the policy. In order to impose the control limits of the control variables we can use a saturating function, as discussed in section 7.1.1.

SRV Update Rules

The key idea of the SRV algorithm is to perturb the current policy with a known noise signal (defined by σ). If the performance of the perturbed control signal is better than the estimated performance of the original policy, the policy's output value is adapted to move in the direction of the noise signal. If the performance is worse, the output value is adapted to move in the opposite direction.
The old performance estimate V_old(s_t) = V_w(s_t) uses the value function in state s_t. The new performance estimate is calculated in the standard value-based way, V_new(s_t) = r_t + γ · V_w(s_{t+1}). The difference of the two coincides with the temporal difference from TD-learning.
Consequently, the parameters of the actor are updated in the following way:

\Delta\theta_t = \eta \cdot td(t) \cdot \frac{n(t)}{\sigma(t)} \cdot \frac{d\pi(s_t, \theta)}{d\theta} \qquad (7.80)

7.6.2 Policy Gradient Actor Learning

Policy Gradient Actor Learning is a new Actor-Critic approach which is proposed in this thesis. It is a mixture of the analytical PEGASUS algorithm (see 7.5.4) and V-Learning (we need an exact model of the process). Again we want to calculate the gradient of the value of the policy with respect to the policy's parameters, dV/dθ. But now we can learn the value function explicitly with any V-Learning algorithm. Furthermore we assume once more that the reward function depends only on the current state s_t and is differentiable. We can estimate the value of the policy in state s_t with the standard value-based approach V(s_t) = r(s_t) + γ · V(s_{t+1}). The successor state s_{t+1} was created by following the policy π(s_t, θ), so we have a dependency on θ in this equation. Thus, an obvious approach is to calculate the derivative of this equation with respect to θ:

\frac{dV(s_t)}{d\theta} = \gamma \frac{dV(s_{t+1})}{ds} \frac{ds_{t+1}}{d\theta} \qquad (7.81)

ds_{t+1}/dθ can be calculated in a similar way as in the analytical PEGASUS algorithm:

\frac{ds_{t+1}}{d\theta} = \frac{dg(s_t, \pi(s_t,\theta))}{da} \cdot \frac{d\pi(s_t,\theta)}{d\theta} \qquad (7.82)

We try to move the state s_{t+1} in the direction of the gradient of the value function, dV(s_{t+1})/ds, by updating the weights of the actor. Thus, the policy is improved if the value function is correct and an appropriate learning rate is used.
We can further extend this approach. At time step t we can look k steps into the past and l steps into the future to estimate the value of state s_{t−k} more accurately:

V(s_{t-k}) = r(s_{t-k}) + \gamma r(s_{t-k+1}) + \gamma^2 r(s_{t-k+2}) + \dots + \gamma^{k} r(s_t) + \dots + \gamma^{k+l} \cdot V(s_{t+l}) \qquad (7.83)

Again we can calculate the gradient of this equation with respect to θ, which is now a more accurate version of equation 7.81. By calculating the gradient of equation 7.83 with respect to θ we get

\Delta\theta = \eta \cdot \frac{dV(s_{t-k})}{d\theta} = \eta \cdot \left[ \gamma \frac{dr(s_{t-k+1})}{ds} \frac{ds_{t-k+1}}{d\theta} + \gamma^2 \frac{dr(s_{t-k+2})}{ds} \frac{ds_{t-k+2}}{d\theta} + \dots + \gamma^{k} \frac{dr(s_t)}{ds} \frac{ds_t}{d\theta} + \dots + \gamma^{k+l} \cdot \frac{dV(s_{t+l})}{ds} \frac{ds_{t+l}}{d\theta} \right] \qquad (7.84)

s_τ can be calculated from s_{τ−1} by equations 7.73 and 7.74. Hence, we can choose an l-step prediction and a k-step horizon for the past. The PGAC algorithm is listed in algorithm 8.
There is only a small difference between using prediction horizons or past horizons, which will be confirmed by our experiments. For the forward horizon, the policy gets updated before the action for state s_t is actually executed. Another advantage is that the predicted states get recalculated at each step, while for the backward horizon the states were stored k steps before. Thus, using large k values for the backwards horizon can be risky, because the policy parameters change at each step. As a result, the stored state sequence might not be representative for the current parameter setting any more. Using an l-step prediction horizon is related to the presented V-Planning method using a search tree over the value function. This helps to reduce the effect of imprecise value functions.
The advantage of the new approach is that the computational costs increase only linearly (instead of the exponential costs of the search tree) with the prediction horizon, so we can easily predict more steps into the future. In the experiment section, we will show the immense advantage in computation speed of this approach in comparison to V-Planning. Of course another advantage is that we can produce a continuous action vector instead of using a discrete action set.
An approach for improvement would be to use an adaptable learning rate for the actor. Since we know ∇_θ s_{t+k}, the new state s′_{t+k}, which is reached by the agent if we update the weights and simulate k steps, can easily be estimated by

s'_{t+k} = s_{t+k} + \nabla_\theta s_{t+k} \cdot \Delta\theta

Thus, a line search on V(s′_{t+k}) can be implemented to find a good learning rate for ∆θ.


Algorithm 8 The Policy Gradient Actor-Critic algorithm
states = list of the last k states
for each new step < s_t, a_t, r_t, s_{t+1} > do
    put s_{t+1} at the end of states
    predict l − 1 states from s_{t+1} and put them at the end of states
    ∇s = 0
    grad = []
    for i = t − k to t + l − 1 do
        s ← states(i)
        grad = grad + dr(s)/ds · ∇s
        ∇π = [ dπ(s,θ)/ds   ∂π(s,θ)/∂θ ] · [ ∇s ; I ]
        ∇s = [ dg(s,π(s,θ))/ds   dg(s,π(s,θ))/da ] · [ ∇s ; ∇π ]
    end for
    s ← states(t + l)
    grad = grad + dV(s)/ds · ∇s
    ∆θ = η · grad
    dismiss the predicted states and s_{t−k} from states
end for

Another idea for improvement is to use different time intervals for the updates during learning. For example, at the beginning of a learning trial large prediction/backwards horizons can be used, because the value function estimate is very noisy in this stage of learning. The time intervals can then be reduced at a later learning phase, when the value function estimate is already more reliable.

7.6.3 Implementation in the RL Toolbox

SRV Algorithm

The SRV algorithm fits perfectly into our Actor-Critic architecture, so we can implement the actor as an error listener of the TD error. The actor maintains a continuous action gradient policy object, which is also supposed to be used as the agent controller. Continuous action controllers already contain their own noise controller, thus SRV units are already implicitly implemented. The dependency of the random controller's variance σ on the value function can be modeled by our adaptive parameter approach; here we can use the CAdaptiveParameterFromValueCalculator class.

From the policy object, the SRV algorithm (CActorFromContinuousActionGradientPolicy) can retrieve both the noise vector n(t) and the used σ value (see section 7.1.1). In this approach, the noise vector is always recalculated by taking the difference between the executed control signal and the control signal without noise. This has to be done because the noise vector is not stored with the action object. Through this approach, it is also theoretically possible to use another controller as the agent controller (e.g. imitation learning), because the difference of the policy's output to the executed control signal is always used as the noise signal n(t). With this information, the update of the actor is straightforward and given by equation 7.80.


PGAC Algorithm

Policy gradient Actor-Critic learning is implemented by the class CVPolicyLearner. This class implements the discussed policy update rules given a differentiable value function as critic and a differentiable policy as actor. The updates are done in the agent listener interface; hence the algorithm consists of two agent listeners, one for the critic updates and one for the policy updates. The critic updates have to be done before the update of the policy; consequently the critic learner has to be added to the listener list before the policy learner. We maintain a list of states (state collection objects) for < s_{t−k}, s_{t−k+1}, ..., s_{t+l} >. The backwards horizon k and the prediction horizon l can both be set with the parameter interface of the Toolbox.
The k latest past states are always stored in the list. At each new step, the predicted l future states are added to the state list; these future states are deleted from the list again at the end of the update. Corresponding to that list of states, a list of vectors containing the derivatives dr(s_t)/ds is maintained. Since these derivatives are time independent for a given state, we do not have to calculate the reward derivatives for the whole state list again, just the derivatives for the new, predicted future states. Additionally we implement a function for calculating ds_{t+1}/dθ from ds_t/dθ, which is done in a similar manner to the analytical PEGASUS algorithm (see 7.5.4). Now we can use this function, the state list < s_{t−k}, s_{t−k+1}, ..., s_{t+l} > and the list of reward derivatives < dr(s_{t−k})/ds, dr(s_{t−k+1})/ds, ..., dr(s_{t+l})/ds > to calculate the gradient of equation 7.83. This gradient calculation is done at each time step, so it is quite time consuming for larger update intervals [t − k, t + l].


Chapter 8

Experiments

In this chapter we will test the RL Toolbox for continuous control tasks on three benchmark problems: the pendulum swing up task, the cart-pole swing up task and the acrobot swing up task. These three tasks are standard benchmark problems for optimal control, with a relatively small state space (two resp. four state variables) and only one continuous control variable. The benchmark tests were done quite exhaustively, which meant we had to choose tasks with a rather small state and action space to reduce the computation time required. The simulation time step was set to 3 1/3 milliseconds for our experiments, which was a good tradeoff between accuracy and computation speed. The time step used for learning was set to 0.05 seconds if not stated otherwise. For all tests, the average height of the end point was taken as the performance measure. Learning was stopped every k episodes, and the average height was measured for l episodes, following the fixed policy of the learner. Then learning was continued; this was repeated until a fixed number of episodes had been reached. Hence, the plotted learning curves show the average height measured every k episodes. For one learning curve the whole process was repeated n times and averaged to get a more reliable estimate of the learning curve. If we talk about the performance of a specific test-suite (a specific algorithm with a fixed parameter setting), the average height during learning is always meant. This is obtained by averaging all the average reward measure points of all the learning trials using the same test-suite.
We will begin by defining the system dynamics of the tasks and discuss their properties in the context of learning. Then we will come to the comparison of our algorithms. We will compare different value function learning algorithms in combination with different action selection policies. The influence of the used time step ∆t on the performance is also evaluated. Additionally, different types of eligibility traces have been used. This will all be done for grid-based constant GSBFN networks, FF-NNs and also Gaussian Sigmoidal networks. We will also investigate the improvement in performance of these methods if we use a prediction horizon greater than one with V-Planning or if we use directed exploration strategies.
After this we will investigate Q-Function based algorithms which do not require knowledge of the model. These algorithms are Q-Learning and Advantage Learning. Basically we ran the same tests for the used time steps as for V-Learning. Additionally, we tried the Dyna-Q approach and discuss the performance improvement.
After this, we will come to the Actor-Critic methods. Firstly we will test the standard Actor-Critic approaches for a discrete action set, then we will come to the continuous action algorithms. For the SRV algorithm, the performance was tested with different kinds of noise; the policy gradient Actor-Critic algorithm was tested with different backward and forward prediction horizons. This was done for the constant GSBFN networks and also for the FF-NNs. We also investigated an intermixing of the function approximators, for example using a GSBFN as policy and an FF-NN to represent the value function. This test can illustrate whether it is helpful to use FF-NNs for the value representation even if we use good representations for the policy, which are easier to learn.
Then we will look at the policy gradient methods (GPOMDP [11] and PEGASUS [33]). Both were tested for FF-NNs and the constant GSBFN network. At the end of each test there will be a discussion about the results and how these results can be further improved. The last section will be a general conclusion about the Toolbox and the algorithms.
In all experiments with a discretized action space, a soft-max policy was used to incorporate random exploration into the action selection. If a real valued policy was used, a noise controller was used for incorporating exploration. A filtered Gaussian noise was used if not stated otherwise:

n(t) = \alpha \cdot n(t-1) + N(0, \sigma_t)

σ_t was scaled by the value of the current state:

\sigma_t = \sigma \cdot \frac{V_{max} - V(t)}{V_{max} - V_{min}}

When using FF-NNs, the standard preprocessing steps were applied to the input state (scale all state variables, use cos(θ) and sin(θ) as input for all angles). The learning rates for the output weights were scaled according to the Vario-η algorithm by the factor 1/m, m being the number of hidden neurons. All weights of the FF-NNs were initialized with a standard deviation of 1/k, k being the number of inputs of the neuron. This was also done for the sigmoidal part of the GS-NNs.
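A minimal C++ sketch of this filtered, value-scaled exploration noise is given below (a hypothetical helper following the equations above, not a Toolbox class):

#include <cstdint>
#include <random>

class FilteredValueScaledNoise {
public:
    FilteredValueScaledNoise(double alpha, double sigma0,
                             double vMin, double vMax, std::uint32_t seed)
        : alpha_(alpha), sigma0_(sigma0), vMin_(vMin), vMax_(vMax),
          n_(0.0), rng_(seed) {}

    // one noise sample per step; value is the current state value V(t)
    double sample(double value) {
        // sigma_t = sigma * (Vmax - V(t)) / (Vmax - Vmin), clipped to [0, sigma]
        double scale = (vMax_ - value) / (vMax_ - vMin_);
        if (scale < 0.0) scale = 0.0;
        if (scale > 1.0) scale = 1.0;
        const double sigmaT = sigma0_ * scale;
        double gauss = 0.0;
        if (sigmaT > 0.0)
            gauss = std::normal_distribution<double>(0.0, sigmaT)(rng_);
        n_ = alpha_ * n_ + gauss;   // n(t) = alpha * n(t-1) + N(0, sigma_t)
        return n_;
    }

private:
    double alpha_, sigma0_, vMin_, vMax_, n_;
    std::mt19937 rng_;
};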

8.1 The Benchmark Tasks

All benchmark tasks are mechanical models which are linear with respect to the control variable u. Hence, all models are implemented by deriving from the class CLinearActionContinuousTimeTransitionFunction and specifying the matrix B(s) and the vector a(s) for the model ṡ = B(s)u + a(s). For all benchmark problems, own reward functions were implemented; the reward always depends only on the current state s_t. The derivative of the reward function dr(s_t)/ds (needed for the analytical policy gradient calculation and the policy gradient Actor-Critic algorithm) was also implemented, using the interface of the class CStateRewardFunction. For algorithms which need a discrete action set, the action space was discretized into three different actions for all benchmark problems: the minimum torque a_min, the maximum torque a_max and a zero torque action a_0.

8.1.1 The Pendulum Swing Up Task

In this task we have to swing up an inverted pendulum from the stable down position s_down to the up position s_up (see figure 8.1). We have two state variables, the angle θ and its derivative, the angular velocity θ̇, which is limited to |θ̇| < 10 in our implementation (higher absolute values are not relevant). We can apply a limited torque |u| < u_max at the fixed joint; since the torque is not sufficient to directly reach the goal state s_up, the agent has to swing up the system and decelerate again when approaching the goal state.
A few experiments with the pendulum swing up task can be found in the articles by Coulom [15] with FF-NNs and Doya [17], who ran different experiments with continuous time RL and the SRV algorithm. Generally we can say that this task is not trivial because of the swing up, but it is still relatively easy to learn. The advantage of this task is that we can learn it very quickly, so we can do many experiments, even with many trials for averaging. Even though the results cannot be directly transferred to more complex, high dimensional tasks, they can indicate what works well and what works poorly.


In the experiments, one trial was simulated for 10 seconds, and a discretization time step of ∆t = 0.05 was used, resulting in 200 steps per episode. If we have an average reward per episode of 0.5 (empirically evaluated) for randomly chosen start states, the swing up has been successfully learned.

Figure 8.1: Pendulum, taken from Coulom [15]

Parameters of the system

Name                      Symbol   Value
Maximal torque            u_max    10
Gravity acceleration      g        9.81
Mass of the pendulum      m        1.0
Length of the pendulum    l        1.0
Coefficient of friction   µ        1.0

Unlike Coulom [15] and Doya [17], we used a higher friction coefficient µ of 1.0. This value was intuitively found to be more realistic. It also makes the swing up task a little more complicated to learn due to the reduced velocity of the system.

System Dynamics

The pendulum has the following dynamics:

\ddot{\theta} = \frac{1}{m l^2} \left( -\mu \dot{\theta} + m g l \sin\theta + u \right) \qquad (8.1)

From this equation the matrix B(s) and the vector a(s) can be found easily.
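For illustration, a minimal C++ sketch of this model in the linear-in-control form ṡ = B(s)u + a(s) is given below (a hypothetical free function; the Toolbox instead derives a subclass of CLinearActionContinuousTimeTransitionFunction). The parameters follow the table above.

#include <cmath>
#include <vector>

// Pendulum dynamics, equation (8.1), written as sdot = a(s) + B(s) * u.
// State s = (theta, thetaDot); names are hypothetical.
struct PendulumParams {
    double m = 1.0, l = 1.0, g = 9.81, mu = 1.0, uMax = 10.0;
};

std::vector<double> pendulumDerivative(const std::vector<double> &s, double u,
                                       const PendulumParams &p)
{
    const double theta = s[0], thetaDot = s[1];
    const double inertia = p.m * p.l * p.l;

    // clip the torque to the allowed range |u| <= uMax
    if (u >  p.uMax) u =  p.uMax;
    if (u < -p.uMax) u = -p.uMax;

    // a(s): drift part (friction and gravity); B(s): control part (0, 1/(m l^2))
    const double a0 = thetaDot;
    const double a1 = (-p.mu * thetaDot + p.m * p.g * p.l * std::sin(theta)) / inertia;
    const double b1 = 1.0 / inertia;

    return { a0, a1 + b1 * u };
}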

Reward Function

The reward is simply given by a measure of the height of the pole. In order to obtain exploration from the optimistic value initialization, we chose to give the negative distance of the pole tip to the horizontal plane located at the top position:

r(\theta, \dot\theta) = \cos(\theta) - 1 \qquad (8.2)


Used Function Approximators

For the constant GSBFN network, the RBF centers were uniformly distributed within a 15 × 20 grid over the state space. For the sigma values the rule σ_i = 1/(2 · p_i) is generally used, where p_i is the number of centers used for dimension i.
For the FF-NN we used 12 hidden neurons, which results in a network of 4 · 12 + 13 = 61 weights (remember that we have three input states for the neural network because of the angle θ, plus one weight for the offset per node).
As localization layer for the Gaussian-Sigmoidal NN, 10 RBF centers were distributed uniformly over each state variable. Thus, the input state for the FF-NN of the sigmoidal part of the GS-NN has 20 input variables. We used 10 nodes in the hidden layer, which gave us an NN with 21 · 10 + 11 = 221 weights.

8.1.2 The Cart-Pole Swing Up Task

Again, the task is to swing a pole upwards, this time hinged on a cart (as illustrated in figure 8.2). We have four state variables: the position of the cart x, the pole angle with respect to the vertical axis θ, and their derivatives with respect to time (ẋ, θ̇). We can apply a limited force u to the cart in the x-direction. This task is very popular in the general optimal control and reinforcement learning literature, because it is already complex enough to be quite challenging, but it is still manageable by many algorithms from optimal control (fuzzy logic, energy based control; see [1] for a description of an optimal control approach using energy-based constraints). In the area of RL, this task has been solved by a few researchers, for example by Doya [17], Coulom [15] and Miyamoto [28]. There is also another, simpler task called 'the cart-pole balancing task'. In this case, the pole just has to be balanced, beginning at an initial position near the goal state. This task can be seen as a subtask of the swing up task.
In our implementation, a fifth state variable θ′ was introduced. This represents the angle rotated up to now (thus it contains the same information as θ, but it is not periodic). This variable is only used to prevent the pole from over-rotating. If the pole was rotated more than five times (|θ′| > 10π), or if the cart left the track, the learning trial was aborted. In the experiments each episode lasted for 20 seconds; a time step of 0.05 s was used if not stated otherwise. The long episode length was chosen in order to investigate whether the pole could be balanced for long enough. As the measure of performance, once again the average height minus 1.0 was used; if an episode was aborted because of over-rotating the pole or leaving the track, the minimal height measure of −2.0 was used for the remaining time steps. An average performance measure better than −0.3 during 20 seconds can be considered as successful swing up and balancing behavior.

Parameters of the system

Name                                          Symbol   Value
Maximal force                                 u_max    5
Gravity acceleration                          g        9.81
Mass of the pole                              m_p      0.5
Mass of the cart                              m_c      1.0
Length of the pole                            l        0.5
Coefficient of friction of the cart on track  µ_c      1.0
Coefficient of friction of pivot              µ_p      0.1
Half length of the track                      L        2.4

Again, higher friction coefficients were used than in [15] or [17] for increased realism.


Figure 8.2: The Cart-pole Task, taken from Coulom [15]

System Dynamics

The system can be described by a set of differential equations to be solved (taken from the appendix of [15]):

\underbrace{\begin{bmatrix} l m_p \cos\theta & -(m_c + m_p) \\ \frac{4}{3} l & -\cos\theta \end{bmatrix}}_{C}
\begin{bmatrix} \ddot{x} \\ \ddot{\theta} \end{bmatrix}
=
\underbrace{\begin{bmatrix} l m_p \dot{\theta}^2 \sin\theta + \mu_c\, \mathrm{sign}(\dot{x}) \\ g \sin\theta - \dfrac{\mu_p \dot{\theta}}{l m_p} \end{bmatrix}}_{d}
+
\begin{bmatrix} u \\ 0 \end{bmatrix} \qquad (8.3)

By multiplying this equation with C^{-1}, we can again split this equation into a B(s) · u and an a(s) part.

\begin{bmatrix} \ddot{x} \\ \ddot{\theta} \end{bmatrix} = \underbrace{C^{-1}(s) \cdot d(s)}_{a(s)} + \underbrace{C^{-1} \cdot \begin{bmatrix} 1 \\ 0 \end{bmatrix}}_{B(s)} \cdot u \qquad (8.4)

Reward Function

The negative distance to the horizontal plane through the top position was again used as the reward signal. Since the reward function is very flat in the region of the goal state, we added a reward term depending on the distance to the upwards position (θ = 0). This peak in the reward function resulted in a better performance for all algorithms, so it was preferred to the 'flat' reward function. Additionally, we punished over-rotating and leaving the track by incorporating the distance of the relevant state variables (x for leaving the track, θ′ for over-rotating) into the reward function.

r(x, \dot{x}, \theta, \dot{\theta}) = (\cos(\theta) - 1) + \underbrace{\exp(-25 \cdot \theta^2)}_{\text{target peak}} - \underbrace{100 \cdot \exp((|x| - L) \cdot 25)}_{\text{punishment for leaving the track}} - \underbrace{20 \cdot \exp(|\theta'| - 10\pi)}_{\text{punishment for over-rotating}} \qquad (8.5)

Because of the exponential functions, the original reward function is only disturbed by the 'punishment' terms in the relevant areas.


Used Function Approximators

For the constant GSBFN network we used a 7 × 7 × 15 × 15 grid, resulting in 11025 weights. For the adaptive GSBFN, a grid of 5 × 5 × 7 × 7 was used for the initial center distribution. The FF-NN has 20 hidden neurons, thus 6 · 20 + 21 = 141 weights. For the GS-NNs we used the same number of partitions as for the constant GSBFN (but partitioning each state variable separately), so we have 44 input states to the FF-NN. The FF-NN itself used 20 hidden neurons, resulting in 945 weights.

8.1.3 The Acrobot Swing Up Task

The acrobot has two links, one attached at the end of the other (see figure 8.3). There is one motor at the joint between the two links which can apply a limited torque (|u| < u_max). The task is to swing up both links to the top position. Again we have one control variable and four state variables (θ_1, θ_2 and their derivatives θ̇_1, θ̇_2).

Figure 8.3: The Acrobot Task, taken from Yoshimoto [57]

If the task is simply to balance the acrobot from an initial position in the neighborhood of the goal state, we talk about the acrobot balancing task. The acrobot task is also a standard benchmark problem in the field of optimal control [14], fuzzy controllers and energy based control. Genetic algorithms have also been used to solve this task. By using a pure planning approach (similar to predictive control), Boone [13] was able to get probably the best results of anyone in this field.
The task is also very popular for testing RL algorithms and has been solved for different physical parameters with different levels of difficulty for the learning task. Sutton gave an example of the acrobot learning task using a Tile Coding architecture and Q-Learning in [49], but in this case the task was just to swing the first leg up to 90 degrees. Coulom [15] used FF-NNs to learn the task with continuous time RL; the acrobot managed to swing up and reach the goal position at a very low angular velocity, but it could not keep its balance. Coulom used a maximum torque of u_max = 2.0 Nm, which is, to our knowledge, the most difficult configuration of the acrobot task that has been used until now. Solving the task with FF-NNs is also the only approach to learning the task which has a flat, non-hierarchic architecture. There was no further information given about the learning time, but as was shown for the other experiments Coulom did, the learning time was high. Nishimura et al. [34] used RL to learn a switching between several predefined local controllers. The task was successfully learned, but a maximum torque of 20 Nm was used. Thus, even though other physical parameters were used, the task is likely to be simpler than the configuration that Coulom used. The balancing task was also investigated intensively, since when more distant initial states are chosen, the task is already quite complex. Here, we should mention the work of Yoshimoto [57] using an NG-net and an Actor-Critic algorithm.
Even though we still have just four continuous state variables, depending on the physical parameters the standard acrobot task is already very difficult to learn, as can be seen from the literature. Only a few experiments could be made with this task due to the lack of time available for optimizing the parameters. Usually, many steps are needed to reach the goal; moreover, the differences caused by executing different actions can have very small immediate effects. Therefore, an accurate value function is needed, which makes the use of RBF networks almost impossible. Another severe problem of this task is a very tempting local maximum in the value function, which is to balance the second link upwards while swinging the first link slightly. This solution is easy to find, and has a considerably better value than trying to learn how to swing up the acrobot. Almost all approaches tried in our experiments only found this solution, and did not recover from this local maximum.
None of the standard approaches which worked well for the previous tasks led to any success; they only found the sub-optimal solution as a local maximum. Experiments have been done for different time scales and maximum torques. The task was only solved for u_max > 10, which simplifies the task drastically. These experiments are not shown in this thesis.

Parameters of the system

Name                                          Symbol   Value
Maximal torque                                u_max    2.0
Gravity acceleration                          g        9.81
Mass of first link                            m_1      1.0
Mass of second link                           m_2      1.0
Length of first link                          l_1      0.5
Length of second link                         l_2      0.5
Coefficient of friction for the first joint   µ_1      0.05
Coefficient of friction for the second joint  µ_2      0.05

System Dynamics

The system can again be described by a set of differential equations to be solved (taken from the appendix of [15]):

C(s) \cdot \begin{bmatrix} \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix} = d(s) + \begin{bmatrix} -u \\ u \end{bmatrix} \qquad (8.6)

With

C = \begin{bmatrix} \left(\frac{4}{3} m_1 + 4 m_2\right) l_1^2 & 2 m_2 l_1 l_2 \cos(\theta_2) \\ 2 m_2 l_1 l_2 \cos(\theta_2) & \frac{4}{3} m_2 l_2^2 \end{bmatrix} \qquad (8.7)

d = \begin{bmatrix} 2 m_2 l_1 l_2 \dot{\theta}_2^2 \sin(\theta_2) + (m_1 + 2 m_2) l_1 g \sin\theta_1 - \mu_1 \dot{\theta}_1 \\ 2 m_2 l_1 l_2 \dot{\theta}_1^2 \sin(-\theta_2) + m_2 l_2 g \sin\theta_2 - \mu_2 \dot{\theta}_2 \end{bmatrix} \qquad (8.8)


Again we can multiply this equation with C^{-1} and split it into a B(s) · u and an a(s) part.

\begin{bmatrix} \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix} = \underbrace{C^{-1}(s) \cdot d(s)}_{a(s)} + \underbrace{C^{-1} \cdot \begin{bmatrix} -1 \\ 1 \end{bmatrix}}_{B(s)} \cdot u \qquad (8.9)

Reward Function

Again, the distance of the end point to the horizontal plane at the top position was used as the reward signal. Additionally, a peak is added to the reward function at the goal state.

r(\theta_1, \dot\theta_1, \theta_2, \dot\theta_2) = l_1 \cdot \cos(\theta_1) + l_2 \cdot \cos(\theta_1 + \theta_2) + \underbrace{0.5 \cdot \exp((-\theta_1^2 - \theta_2^2) \cdot 25)}_{\text{target peak}} - l_1 - l_2 \qquad (8.10)

Used Function Approximators

Several grid-based and also more sophisticated RBF positioning schemes have been used for the acrobot task, with very limited success. For the FF-NN a 30 neuron (241 weight) network was used. No tests were done for the GS-NN due to the lack of time and the already poor results for the cart-pole task.

8.1.4 Approaches from Optimal Control

An energy based control scheme for the cart-pole can be found in [1]. Another interesting approach is taken by Olfati-Saber [35], who uses a fixed point controller. A good, but unfortunately old, overview of existing approaches from optimal control can be found in [14], where several control strategies are discussed for the acrobot. Usually two different controllers are used in all approaches, one for balancing and one for swinging up the acrobot. For the balancing task, a linear quadratic regulator (LQR) or fuzzy controller is used. Both approaches require fine tuning of the parameters, which is done either by hand or by a genetic search algorithm. The swing up task is controlled by a PD controller working on the linearized system, using feedback linearization. With feedback linearization, a controller can be designed that on average pumps energy into joint one during each swing, resulting in a swing up. Again, parameter tuning is needed for this approach.
These approaches worked fine for an acrobot with a limited torque of |u| < 4.5 Nm, and with physical parameters other than those used in this thesis. Hence, optimal control already has good working solutions for these problems. The advantage of RL is obviously that no parameter tuning is needed (at least not as much as would be needed in optimal control) and that it is a general framework.

8.2 V-Function Learning Experiments

In this section, we will investigate the discrete time and continuous time V-Learning algorithms. We will look at the performance of the different gradient calculation schemes for the three different function approximators. These are constant grid-based GSBFNs, FF-NNs and GS-NNs. The influence of the eligibility traces with different λ settings is investigated, and we will also test the algorithms for different time scales ∆t.
Finally, we will try to improve the performance of the V-Learning algorithms by using a higher prediction horizon for V-Planning, by incorporating directed exploration information into the policy and by using hierarchic learning architectures.


For all experiments, γ = 0.95, s_γ = 1.0, λ = 0.9 and β = 20 (for the soft-max distribution) were used unless stated otherwise. The time discretization used was ∆t = 0.05 s. As default, replacing e-traces were used.

8.2.1 Learning the Value Function

The continuous time RL and the standard, discrete time RL formulations are compared in this section. We want to estimate which algorithm works best for learning the value function; hence we will use all the tested algorithms with the same policy, which is a one-step lookahead V-Planning policy using a soft-max distribution for action selection. For the continuous time algorithms, we had to scale the V-Function by 1/∆t for the V-Planning part, because the continuous time value function is 1/∆t times smaller than the discrete value function.

Constant Grid-Based RBF network

This FA has the best performance, and it is simple and easy to learn. V-Learning methods managed to learn the pendulum task after five to ten episodes. In figure 8.4(a), we can see three learning curves: for the discrete time residual, the Euler residual and the residual used by Coulom. There is no significant difference in the performance of these three algorithms. This is also due to the setting of the specific parameters s_γ = 1.0 and ∆t = 0.05, resulting in an equivalent discrete time discount factor γ_d = 1 − s_γ · ∆t = 0.95 for the continuous time algorithm. Figure 8.4(b) shows the comparison between the direct gradient, the residual gradient and the residual algorithm (with variable and constant β). Only the performance of the residual gradient algorithm (β = 1.0) falls off, as expected from the theory. All other gradient calculation algorithms do not differ significantly for the RBF network. Each learning trial lasted for 50 episodes, and the plots are averaged over 10 trials.

Figure 8.4: Pendulum task. (a) Learning curves for the discrete time and continuous time algorithms with the RBF network; η = 2.0 was used for all three algorithms. (b) Average reward during learning for different gradient calculation algorithms, plotted over varying η. The discrete time V-Learning algorithms were used for this illustration.

In figure 8.5(a) and (b), the same results are shown for the cart-pole task. The RBF network manages to learn the task in approximately 400 to 500 episodes. The results and conclusions are almost the same as for the pendulum swing up task, but this task is already much more complex, since one learning trial takes about 570 s in real time (for 2000 episodes), compared with 30 s for the pendulum task (50 episodes). The performance of the direct gradient algorithm already stands out slightly, as the direct gradient algorithm seems to be best suited for the use of linear feature states.
These experiments also illustrate the insensitivity of the RBF network with respect to the choice of the learning rate. The algorithm manages to learn the pendulum and cart-pole tasks for a large range of η values.

Figure 8.5: Average height during learning of the cart-pole task for different gradient calculation algorithms, plotted over varying η. The discrete time V-Learning algorithms were used for this illustration, and the results were averaged over 5 trials, each trial lasting for 2000 episodes.

For the acrobot task, different resolutions (15 × 15 × 15 × 15 and 20 × 20 × 20 × 20) were used for the grid-based RBF positioning scheme. More sophisticated positioning schemes were also used, with a finer resolution for small velocities and around the neighborhood of θ_1 = π and θ_2 = 0, but only with limited success. Only a suboptimal solution was found for the acrobot task; if the resolution around the downwards position is not tuned accurately enough, even this solution could not be found when starting in this position. In this case, the agent did not manage to leave the downwards position due to the small differences in the state space when executing different actions.

Feed Forward Neural Networks

FF-NNs are already difficult to learn for the pendulum task. On average, more than 400 episodes are needed for a successful swing up (with an optimized parameter setting), and finding a good parameter setting (η, λ, β for the residual algorithm) is more difficult because the algorithms work only in a very small parameter regime. We used a trial length of 3000 episodes for the pendulum task in order to investigate the long-term convergence behavior of the algorithms; the results are averaged over 10 trials. Figure 8.6 illustrates the performance of the discrete time and the continuous time algorithm using the Euler residual for different gradient calculation schemes. Surprisingly, the performance of the continuous time algorithm is significantly worse than that of its discrete time counterpart. Even using the best empirically ascertained parameter setting (β = 0.3, η = 16, see figure 8.6(b)), the algorithm managed to learn the swing up in only four out of 10 trials. How can this behavior be explained? For the continuous time residual, the influence of the value function relative to the received reward is much higher than in the discrete time algorithm (in our case 1/∆t = 20 times higher). In the case of the linear approximator, this does not make any difference, because the V-Function is linear in the weights and thus just 1/∆t times smaller. But for FF-NNs, where we have a random initial weight vector, the influence of this initial weight vector is much higher for the continuous time algorithm. The different learning rate intervals can be explained with the same argument: we have to use higher learning rates in order to increase the influence of the received reward on the weight update. We also ran one experiment to check these ideas, using the continuous time algorithm with a reward function which is 20 times higher than the original one (so we have the same weighting of the reward function and the value function). Now the algorithm should do exactly the same as the discrete time algorithm, and in fact, the results were almost the same as in figure 8.6(a) for the discrete time algorithm. Consequently, the performance of FF-NNs additionally depends on the relationship between the weighting of the value function and the reward function in the residual calculation. A question which still has to be examined is whether we can optimize this relationship for a given learning task and how much this optimization can improve the performance of FF-NNs.
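The argument can be made explicit by writing both residuals in a simplified form (a sketch; the exact residuals are those defined earlier in this thesis):

δdisc = rt + γd · V(st+1) − V(st)
δEuler = rt + (1/∆t) · [(1 − sγ · ∆t) · V(st+1) − V(st)] = rt + (1/∆t) · [γd · V(st+1) − V(st)]

In the Euler residual the value terms are weighted by 1/∆t = 20 relative to the reward, so for a randomly initialized FF-NN the meaningless initial value estimates dominate the residual, whereas for a linear FA this factor is simply absorbed into the 1/∆t times smaller weights.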

Figure 8.6: Performance with different gradient calculation schemes for (a) the discrete time and (b) the continuous time algorithm (Euler residual) when learning the value function of the pendulum task with an FF-NN.

For the discrete time algorithm, the plots of the different gradient calculation schemes are more expressive. Here, we can see that using the residual algorithm with a constant β factor clearly outperforms the direct gradient algorithm. The best performance of the residual algorithm is significantly better (an average height of 0.55 instead of 0.8) and the range of well working learning rates is also larger. The residual gradient algorithm (β = 1.0) has a worse performance than the residual algorithm, but still outperforms the direct gradient algorithm. Surprisingly, the residual algorithm with the adaptive β calculation also falls off with respect to the constant β configuration. Apparently, the approximations of the real epoch-wise gradient with the traces wD and wRG were not good enough. This experiment was done for the averaging parameters µ = [1.0, 0.95, 0.9, 0.7, 0.0], and the plot uses the most efficient parameter setting µ = 0.9. In figure 8.7 we see typical learning curves for the direct gradient and the residual algorithm with a constant β value of 0.3. The direct gradient algorithm also learns quickly, but then it unlearns the swing up again very quickly in almost every trial (if it is learned at all). This is a consequence of the poor convergence behavior of the direct gradient algorithm and coincides with the theory. The residual algorithm with constant β also unlearns the swing up behavior again, but not as often and not as quickly.

Figure 8.7: Learning curves of five different trials of (a) the direct gradient and (b) the residual algorithm (with constant β = 0.3), with an FF-NN as function approximator for the pendulum task. The thick red line represents the average of the five trials.

Cart-pole task learning with an FF-NN is already a very hard task. Over 50000 episodes are needed to learn the task, and the algorithm only works within a very small range of parameter settings. One learning trial with 100000 episodes lasted for 24000 seconds, which made an exhaustive search for good parameter settings almost impossible. Nevertheless, we tested the use of FF-NNs with the discrete and continuous time algorithms and different gradient calculation schemes. Plots of the results can be seen in figure 8.8. The results are averaged over 3 trials, consequently we have to consider that they are quite noisy. In general, the residual algorithm tends to work best again, with a high β setting, but surprisingly not with the discrete time algorithm. For this task, only the continuous time algorithm clearly outperformed the discrete time algorithm. Even if the discrete time algorithm also manages to learn the task, its learning performance is more unstable and it only works for an even smaller parameter range. Again, if a smaller value for the learning rate is used for the discrete time algorithm, with the same parameter setting as for the continuous time algorithm, learning is not possible at all. Seemingly, the relationship between the influence of the value function and the reward function of the continuous time algorithm is the correct (or at least the better) one for the cart-pole task. These results suggest that the chosen relationship should depend on the complexity of the learning task, i.e. an algorithm with an adaptable relationship between the value function and the reward function could be preferable. This presumption obviously needs more investigation.

Gaussian Sigmoidal Neural Networks

Figure 8.8: Performance of the (a) continuous time and (b) discrete time RL algorithms for the cart-pole task with an FF-NN as function approximator.

GS-NNs are an intermixing of the localizing RBF networks and the sigmoidal FF-NNs. They are supposed to be easier to learn than FF-NNs and to require fewer weights than RBF networks. For the pendulum task, this saving in the number of weights is unfortunately not obtained; even worse, the GS-NN requires more weights for this low dimensional task (200 instead of 150 for the RBF network). But for the cart-pole task we need only 945 weights instead of the 11025 weights needed with the RBF network. Unfortunately, computing the GS-NN is quite slow, so the tests could not be done exhaustively. For the pendulum task, the results (figure 8.9(a) and (b) for the discrete time algorithm) show a performance comparable to FF-NNs. Again, the residual algorithm with constant β significantly outperforms the direct and residual gradient algorithms, and the variable β calculation also falls off in its performance. The continuous time algorithm again suffers from the same difficulties as with the FF-NN; these plots are not shown here. The disadvantage in comparison to FF-NNs is the increased computational complexity, which makes learning very slow. For the cart-pole task, a successful parameter regime has not yet been found. One reason for this is the long learning time needed for the cart-pole task. Another reason is that the GS-NN approach does not scale up to more complex tasks easily; at the least, it becomes even more sensitive to accurate parameter settings. We even tried to learn the value function while the agent followed an already learned policy, but learning was done for 30000 episodes (70000 seconds learning time) without any success.

Figure 8.9: Performance of the GS-NN network for different gradient calculation algorithms for the pendulum task.

8.2.2 Action selection

Basically, we have three different kinds of policies: stochastic policies using a one-step forward prediction (discrete time V-Planning), stochastic policies using the continuous time system dynamics (continuous time V-Planning) and the real valued, value-gradient based sigmoidal policies (see 7.3.4). For the stochastic policies, a soft-max action distribution is used. In figure 8.10(a) we see the performance of the three policies with roughly optimized parameters (β = 20 for the stochastic policies and C = 100 for the value-gradient based policy). The performance was only tested for the constant RBF network.

Figure 8.10: Average reward during learning for different policies, plotted over varying η. The Euler residual was used for this illustration. (a) Pendulum task (b) Cart-pole task

As expected, the discrete time planning algorithm slightly outperforms the other two approaches for the pendulum task, because it uses the most accurate estimates of the value of the next state. The two continuous time approaches perform equally well for the pendulum task. For the cart-pole task, the difference is more drastic: both continuous time approaches manage to learn the task, but need significantly more learning steps to do so. While all three policy learning schemes manage to learn the task, the discrete time V-Planning approach is significantly more efficient due to its more accurate estimates of the values of the next states. An interesting question is how these algorithms would behave if we were to use a learned, inaccurate model of the system dynamics. In this case, the difference is likely to be not that great, but this has not been investigated. We can also see from the results in 8.10(b) that the value gradient-based policy outperforms the continuous time V-Planning policy.
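For reference, a minimal sketch of the one-step V-Planning policy with soft-max action selection used throughout these experiments; the model, value function, reward function and the soft-max parameter β are placeholders here, not the actual Toolbox interfaces:

#include <cmath>
#include <cstdlib>
#include <functional>
#include <vector>

using State = std::vector<double>;

// One-step V-Planning with soft-max action selection (sketch).
// Every discrete action is rated by the one-step lookahead r(s,a) + gamma * V(s'),
// and an action is sampled from the soft-max distribution over these ratings.
int selectAction(const State& s,
                 const std::function<State(const State&, int)>& predict,   // model: (s,a) -> s'
                 const std::function<double(const State&, int)>& reward,   // r(s,a)
                 const std::function<double(const State&)>& value,         // V(s)
                 int numActions, double gamma, double beta)
{
    std::vector<double> rating(numActions);
    double maxRating = -1e300;
    for (int a = 0; a < numActions; ++a) {
        rating[a] = reward(s, a) + gamma * value(predict(s, a));
        if (rating[a] > maxRating) maxRating = rating[a];
    }
    // soft-max probabilities (shifted by the maximum for numerical stability)
    std::vector<double> p(numActions);
    double norm = 0.0;
    for (int a = 0; a < numActions; ++a) {
        p[a] = std::exp(beta * (rating[a] - maxRating));
        norm += p[a];
    }
    // sample an action from the resulting distribution
    double u = norm * (std::rand() / (RAND_MAX + 1.0));
    for (int a = 0; a < numActions; ++a) {
        u -= p[a];
        if (u <= 0.0) return a;
    }
    return numActions - 1;
}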

8.2.3 Comparison of Different Time Scales

We tested the discrete time algorithm with discrete time V-Planning, and the continuous time algorithm with discrete time V-Planning and also with the value-gradient based policy, for different time scales. According to theory, continuous time RL should work best for small time steps. For larger time steps, the approximations of the Hamiltonian and in particular of the value gradient based policy become inaccurate, hence the performance gets worse. Since continuous time RL uses an equivalent discrete discount factor of γd = 1 − sγ · ∆t, we also tested the discrete time algorithm with this adaptation law for the discount factor, just to be sure that the better performance is not simply due to 'cheating' with a preferable discount factor setting.

The pendulum task results are shown in figure 8.11(a). The discrete time algorithm with the constant γ setting of 0.95 does not manage the swing up for small time steps, but by adapting the γ value, this algorithm has almost the same performance as the continuous time algorithm. Therefore, the only advantage of using continuous time RL for the value function is a better, time scale dependent choice of the discount factor.
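With sγ = 1.0, the setting used here, the adaptation law γd = 1 − sγ · ∆t gives, for example, γd = 0.95 for ∆t = 0.05, γd = 0.99 for ∆t = 0.01 and γd = 0.995 for ∆t = 0.005, so the effective discount horizon of roughly 1/(1 − γd) steps always corresponds to about 1/sγ = 1s of simulated time. A fixed γd = 0.95, in contrast, shrinks this horizon to about 20 · ∆t = 0.1s at ∆t = 0.005, which explains why the constant setting fails for small time steps.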


The value gradient based policy has the best performance for small time steps, but, as expected, could not learn the task for larger time steps.

Figure 8.11: Performance plots for different ∆t. discGamma refers to the discrete time algorithm with γ set to 1 − sγ · ∆t (usually a γ value of 0.95 is used). contEuler represents the learning curve of the value gradient based policy. (a) Pendulum task: one trial had 100 episodes, one episode lasted for 10s. (b) Cart-pole task: 2000 episodes, 20s per episode.

8.2.4 The influence of the Eligibility Traces

The λ parameter has different effects for different function approximators. The succeeding plots (figures 8.12 and 8.13) show the average height during the whole learning trial, plotted over varying learning rates for λ = [0.0, 0.5, 0.7, 0.9, 1.0]. There is always one plot for replacing e-traces and one for accumulating e-traces. In the case of linear approximators, where the gradient ∇wV(s) does not depend on w, a high λ value results in a better learning performance. For global non-linear function approximators like FF-NNs, the results show that using high λ values is rather dangerous, because we rely on the assumption that ∇wV(s) does not change during an episode. Due to the high number of different test cases, this experiment was only done for the pendulum task.
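For reference, a sketch of the two e-trace variants for the weights of a function approximator as they are compared below. The replacing rule shown is one plausible per-weight formulation (sensible for the nonnegative activations of an RBF network); the exact rule implemented in the Toolbox may differ.

#include <algorithm>
#include <cstddef>
#include <vector>

// TD(lambda) e-trace update for the weights of a function approximator (sketch).
// grad[i] = dV(s)/dw[i] for the current state, gamma and lambda as usual.
void updateETraces(std::vector<double>& e, const std::vector<double>& grad,
                   double gamma, double lambda, bool replacing)
{
    for (std::size_t i = 0; i < e.size(); ++i) {
        double decayed = gamma * lambda * e[i];
        e[i] = replacing ? std::max(decayed, grad[i])  // replacing: reset the trace to the current gradient
                         : decayed + grad[i];          // accumulating: sum up the gradients
    }
}

// Weight update with the temporal difference error td and learning rate eta.
void applyTDUpdate(std::vector<double>& w, const std::vector<double>& e,
                   double eta, double td)
{
    for (std::size_t i = 0; i < w.size(); ++i)
        w[i] += eta * td * e[i];
}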

Constant Grid-Based GSBFNs

Figures 8.12 and 8.13 show the performance of the direct gradient and residual algorithms with different λ settings. For the linear approximator, the results are nearly the same for the different gradient calculation schemes, as expected, but they already show that using e-traces for the residual and residual gradient algorithms can be advantageous. There is also hardly any difference between replacing and accumulating e-traces when using a linear function approximator; non-replacing e-traces need a lower learning rate for obvious reasons.


Figure 8.12: Performance plots for different λ settings for the pendulum task with RBF networks, using the direct gradient. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 50 episodes each.

Figure 8.13: Performance plots for different λ settings for the pendulum task with RBF networks, using the residual algorithm with the variable β calculation scheme. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 50 episodes each.


Feed Forward Neural Networks

In figures 8.14, 8.15 and 8.16, we can see the performance plots of the direct gradient, the residual algorithm with constant β = 0.3, and the residual algorithm with variable β. The performance of the direct gradient algorithm could be improved slightly, since accumulating e-traces with a low λ value seem to have a better performance than the standard parameters used so far (λ = 0.9, replacing e-traces). The performance of the residual algorithm with the variable β calculation could be improved in the same way, using low λ values or no e-traces at all. Interestingly, this is not the case for an empirically optimized constant β value; in this case, using high λ values with replacing e-traces can significantly improve the performance. Surprisingly, accumulating e-traces have a significantly worse performance for the residual algorithm with a constant β setting. This result also suggests that eligibility traces are useful when used with the residual algorithm, and in particular that our implementation of replacing e-traces for the weights of a function approximator is justified.

Figure 8.14: Performance plots for different λ settings for the pendulum task with FF-NNs, using the direct gradient. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 3000 episodes each.

Gaussian Sigmoidal Neural Networks

GS-NNs behave a bit differently from FF-NNs when changing the λ parameter. As a consequence of the localization layer, high λ values give a good performance for all gradient algorithms, even if they narrow the area of well working parameter settings for η (particularly for the accumulating e-traces). The results are shown in figure 8.17 for the direct gradient, and in figure 8.18 for the residual algorithm with a constant β setting of 0.6. Our replacing e-traces approach clearly outperforms the standard accumulating e-traces approach, which, in combination with the results for the FF-NN, suggests that replacing e-traces are generally preferable for TD(λ) learning with function approximation.

8.2.5 Directed Exploration

Figure 8.15: Performance plots for different λ settings for the pendulum task with FF-NNs, using the residual algorithm with the constant β = 0.6. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 3000 episodes each.

Figure 8.16: Performance plots for different λ settings for the pendulum task with FF-NNs, using the residual algorithm with the variable β calculation scheme. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 3000 episodes each.

Figure 8.17: Performance plots for different λ settings for the pendulum task with GS-NNs, using the direct gradient algorithm. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 3000 episodes each.

Figure 8.18: Performance plots for different λ settings for the pendulum task with GS-NNs, using the residual algorithm with β = 0.6. (a) Replacing e-traces (b) Non-replacing e-traces. The average reward is determined over 10 trials with 3000 episodes each.

In our experiments with different exploration strategies, we used a counter based local and distal exploration measure. For the counter and the exploration value function, again a function approximator was used, in this case the standard constant RBF network. Both approaches are model based, using the generative model for the prediction of the exploration measure of the next state. For distal exploration, a standard TD V-Learner was used with a learning rate of 0.5. For all other parameters, the standard values were used. We tested the benefits of directed exploration for the three function approximators (RBF, FF-NN and GS-NN); the results are illustrated in figure 8.19 for the pendulum task. The plots show the performance of the algorithms over an ascending exploration factor α. Due to the higher exploration measures when using distal exploration (the measure is the expected future local exploration measure), we used smaller α values than for local exploration. But the results of the local and the distal exploration measure do not differ significantly anyway. For learning the real value function, the discrete time algorithm was used with V-Planning as the policy. The best learning rates from the previous experiments were always used.
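As an illustration of how the exploration measure enters action selection, the one-step rating of the V-Planning policy can be extended by a bonus for the predicted state, weighted with the exploration factor α. The additive combination shown here is only a sketch of the general idea and an assumption; the concrete local and distal exploration measures are those defined in the exploration sections of this thesis.

// Action rating with directed exploration (sketch).
// valueNext       : V(s') of the predicted successor state
// explorationNext : exploration measure of s' (counter based for local
//                   exploration, learned exploration value for distal exploration)
// alpha           : exploration factor varied in figures 8.19 and 8.20
double exploratoryRating(double reward, double gamma, double valueNext,
                         double explorationNext, double alpha)
{
    return reward + gamma * valueNext + alpha * explorationNext;
}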

Figure 8.19: Plots with varying exploration factor α, showing the performance of the different FAs on the pendulum task. (a) Local exploration (b) Distal exploration. For the RBF network, one learning trial has 50 episodes; for the non-linear FAs, one trial has 1000 episodes. The results are averaged over 10 trials.

The RBF network already uses a sort of directed exploration in the form of the optimistic value initialization, so using directed exploration could not improve its performance. But for FF-NNs and GS-NNs (note that, unlike in the previous experiments, only 1000 episodes were used for learning), the performance is significantly improved by directed exploration. This experiment was also done for the cart-pole task with the FF-NN and the RBF network. The results show again that using a directed exploration scheme with the RBF network does not result in any benefits, but the performance of the FF-NN was again considerably improved. In this experiment, we can also see the benefit of distal directed exploration over local directed exploration. Learning with the FF-NN was still very unstable, but good policies were found after 20000 episodes, which is a considerable improvement in performance.

Figure 8.20: Plots with varying exploration factor α, showing the performance of the different FAs on the cart-pole task. (a) Local exploration (b) Distal exploration. For the RBF network, one learning trial has 2000 episodes; for the FF-NN, one trial has 50000 episodes. The results are averaged over 10 and 5 trials, respectively.

These experiments show, on the one hand, that the poor performance of the non-linear approximation methods partially comes from their poor exploration ability. The drawback of this experiment is that we used a linear approximator for the counter, so we again suffer from the curse of dimensionality, which should be avoided by the use of a non-linear FA. But intuitively, the function approximator of the counter does not need to be as exact as the one for the value function, so fewer features can be used. Another possibility is a memory based counter representation, storing n states from the past and counting the experienced states in the region of the current state. But these issues were not investigated any further. Using directed exploration has a low complexity and hardly affects the computation speed of the simulation, so its use should be considered, especially for non-linear FAs.

8.2.6 N-step V-Planning

Having knowledge of the model, planning methods can be used with a higher prediction horizon (in our experiments we used two, three or five time steps). The experiments were again done for our three common FAs. The results for the pendulum task are plotted in figure 8.21(a). We can see a significant improvement in performance, especially for the FF-NN and the GS-NN. With the RBF network, no significant improvement was observed, due to the simplicity of the learning task. We also tested the V-Planning approach for the cart-pole task, but only with RBF networks. The result is illustrated in figure 8.21(b): in this case, performance increased dramatically, resulting in the best performance seen for the cart-pole task.

The disadvantage of this approach is clearly the computational cost, which is exponential in the prediction horizon. For example, for the pendulum task it took 30s for one evaluation with search depth one, 90s with search depth two, 300s with three prediction steps and over 2400s with five prediction steps. For the cart-pole task, one trial lasted 570 seconds with a search depth of 1, and over 40000 seconds with a search depth of 5. Thus, large prediction horizons cannot be used, at least not for real time control. The speed can certainly be improved further by a faster implementation, or by adding some kind of heuristic to prune the search tree, but we will not get rid of the exponential time dependency.
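A sketch of the full-width planning recursion that causes this exponential cost; the interfaces are placeholders, and details such as the soft-max action selection at the root are omitted:

#include <functional>
#include <vector>

using State = std::vector<double>;

// Full-width n-step V-Planning (sketch): every action sequence of length
// 'depth' is expanded, so the cost grows with |A|^depth.
double planValue(const State& s, int depth, int numActions, double gamma,
                 const std::function<State(const State&, int)>& predict,
                 const std::function<double(const State&, int)>& reward,
                 const std::function<double(const State&)>& value)
{
    if (depth == 0)
        return value(s);                      // leaf: fall back to the learned V-Function
    double best = -1e300;
    for (int a = 0; a < numActions; ++a) {
        State sNext = predict(s, a);
        double q = reward(s, a) + gamma * planValue(sNext, depth - 1, numActions,
                                                    gamma, predict, reward, value);
        if (q > best) best = q;
    }
    return best;                              // greedy backup over the discretized actions
}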

Figure 8.21: Average reward for different search depths using V-Planning. (a) Pendulum task (b) Cart-pole task


8.2.7 Hierarchical Learning with Subgoals

In this experiment, we investigate the use of subgoals (similar to the approach used by Morimoto [29], see section 5.2.4) for continuous control tasks. Hierarchical learning was applied to the cart-pole task, since the pendulum task was considered too simple for adding a hierarchy. In our subgoal model, a target region and a failed region can be defined for each subgoal. Additionally, we defined a sequential order of the subgoals, so if subgoal g1 has reached its target area, subgoal g2 is activated. If subgoal g2 fails (reaches the failed region), subgoal g1 is activated again. Each subgoal has its own reward function, and for every subgoal gi an individual value function Vi is learned. The reward function of a subgoal is an exponential function of the distance to the target area and the failed area:

rt = RC + r1 · exp(−distt/τ1) − r2 · exp(−distf/τ2)    (8.11)

where distt is the minimum distance to the target area and distf the minimum distance to the failed area. RC is a constant, negative reward offset, r1 and r2 specify the influence of the target and the failed area in the reward function, and τ1 and τ2 specify the attenuation of the reward with the distance to the specified area.

In addition to the subgoal V-Functions, a global V-Function Vg is learned with the standard reward function. The global value function is needed because we define the target areas of the subgoals very coarsely, so this V-Function is needed to determine good intersection points between two subgoals for achieving the global goal of swinging up and balancing the system. For the policy, we used the standard 1-step V-Planning algorithm; the value is calculated as the weighted sum of the current subgoal's value function and the global value function:

V = α · Vg + (1 − α) · Vi    (8.12)
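A compact sketch of how the subgoal structure enters the learning system, following equations (8.11) and (8.12); the distance functions and the region tests are placeholders for the task-specific definitions given below:

#include <cmath>
#include <functional>
#include <vector>

using State = std::vector<double>;

// Reward of the active subgoal, equation (8.11).
double subgoalReward(double distTarget, double distFailed,
                     double RC, double r1, double r2, double tau1, double tau2)
{
    return RC + r1 * std::exp(-distTarget / tau1) - r2 * std::exp(-distFailed / tau2);
}

// Value used by the 1-step V-Planning policy, equation (8.12): weighted sum of
// the global value function Vg and the value function Vi of the active subgoal.
double combinedValue(double Vg, double Vi, double alpha)
{
    return alpha * Vg + (1.0 - alpha) * Vi;
}

// Sequential subgoal switching: advance when the target area of the active
// subgoal is reached, fall back to the previous subgoal when it fails.
int updateActiveSubgoal(int active, const State& s,
                        const std::function<bool(int, const State&)>& inTargetArea,
                        const std::function<bool(int, const State&)>& inFailedArea)
{
    if (inTargetArea(active, s)) return active + 1;
    if (inFailedArea(active, s) && active > 0) return active - 1;
    return active;
}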

Our experiments test the hierarchical structure with different α values (α = 0 corresponds to pure hierarchic learning, α = 1 to the standard flat learning approach).

For the cart-pole task, we divided the whole task into three different subgoals. The first subgoal is to swing the pole up to an angle of π/2; the velocities and the x position were not further specified. The second subgoal needs to swing the pole up to an angle of π/9. Here we also restricted the velocities in the x direction and of the angle to the interval [−2, 2]. The second subgoal has failed if the pole is in a downwards position (θ > 0.5π or θ < −0.5π) with a low velocity (if the absolute angular velocity of the pole is lower than 0.1). The third subgoal is the balancing task and has no target area. This subgoal has failed if the absolute angle of the pole is higher than π/6. For the reward functions, we used the values r1 = r2 = 40 and τ1 = τ2 = 1/300 for every subtask. All three subgoal value functions and the global value function use the standard RBF network as function representation, except for the third subgoal. In this case, another grid for θ and θ̇ was used: θ was divided into five partitions within the interval [−π/9, π/9] and θ̇ into 10 partitions within the interval [−5, 5]. This was considered to be the relevant area for the balancing task.

In figure 8.22(a), the results are illustrated for different α values. The hierarchical architecture significantly outperforms the flat approach (α = 1.0) for a broad range of α values. The pure hierarchical approach without the global value function, or using small α values, also has a bad performance due to the lack of good interconnection points between the subgoals. The learning curves of the flat, the pure hierarchic and the intermixed approach with α = 0.6 are illustrated in figure 8.22(b). With the setting α = 0.6, the agent successfully learned the swing up within 150 episodes, which is more than twice as fast as the flat approach. The performance of the learned policy is also remarkably good, comparable with policies obtained after 2000 learning episodes with the flat approach.

Figure 8.22: (a) Average reward for different α factors for hierarchic learning on the cart-pole task. (b) Learning curves of the flat architecture (α = 1.0), the pure hierarchic architecture and the intermixed approach with α = 0.6.

This approach was also tried for the acrobot task, using individually optimized RBF networks for each subtask; four subtasks were used in this case. The agent managed to escape the locally optimal solution mentioned earlier, and was also able to reach the upwards position, but with a non-zero velocity, so it could not hold the balance. Further optimization of the subgoals and the RBF networks would have been needed, which is already a lot of work for the acrobot task. This is the biggest disadvantage of the hierarchical approach.

8.2.8 Conclusion

Learning the value function is quite stable; it works with all the function approximation schemes used, at least for the pendulum task. RBF networks have a reasonable performance, but they cannot be used for high dimensional control tasks. Even for 4-dimensional control tasks, if a highly accurate value function is needed, as for the acrobot task, it was not possible to learn the swing up. Many different accuracies (number of centers) and positioning strategies for the centers were tried without success. The advantages of RBF networks, or in general of linear approximators, are that they are much easier and faster to learn. They also work for a broad range of parameter regimes and for all the algorithms we used, which is crucial if a long learning time is needed and an appropriate search for good parameter settings is therefore very hard. Another advantage is the ability to use optimistic value initialization, which spares us a more sophisticated exploration strategy.

Non-linear function representations have a very poor performance in comparison to linear function representations, so they are not useful for real applications. Our experiments show that, even for the low dimensional simulated tasks used in this thesis, the use of non-linear FAs is fraught with difficulty. FF-NNs are able to scale up to high dimensional problems (as shown by Coulom [15]), and are also more accurate than RBF networks. But due to their long learning time and their sensitivity to the parameter settings of the learning algorithm, they become impractical to use. After a long parameter search, we managed to learn the cart-pole task with a neural network, but given the learning time needed we think that using FF-NNs is not efficiently applicable. Another problem is the instability of learning. The algorithm may unlearn a good policy, and performance is also highly dependent on the initial weight vector of the FF-NN. The residual algorithm, with an appropriate constant β setting, can partially solve this problem and leads to considerably better results, but with the drawback of an additional parameter that must be optimized. E-traces are also not as easily applied as for linear approximators, which is another reason for the loss of performance. An approach for automatic learning rate detection is, in our opinion, very promising for making non-linear function approximators more usable for RL.

GS-NNs, as an intermixing of the two approaches, also did not quite meet our expectations. Although it was possible to learn the pendulum task in fewer episodes than with the FF-NN, the real learning time was increased due to the additional complexity of the GS-NN. Moreover, the performance impact was not such that we could say GS-NNs behave better than FF-NNs. The non-linear weight updates still seem to be problematic with GS-NNs. We did not manage to find a good parameter setting for the cart-pole task, which is a consequence of the very long learning time of a GS-NN. We even tried to learn the value function alone, given a policy that had already been optimized; learning was done in this configuration for over 30000 episodes without any success. In retrospect, we think that memory based representations like locally weighted regression, or adaptive GSBFNs (normalized RBF networks), are the most promising approaches for RL in continuous state spaces. Although adaptive GSBFNs are built into the Toolbox, there was not enough time to test them thoroughly.

The use of a directed exploration scheme, a higher planning horizon or using an actor to stabilize learning are approaches which address the difficulties with FF-NNs and GS-NNs. Unfortunately, there was only time to experiment with these approaches on the pendulum task. Although the results are quite encouraging, further experiments with more complex tasks are needed to verify their benefits. A principal benefit of incorporating the knowledge of the system dynamics is the ability to plan, which drastically improved the performance of value function learning. On the other hand, the system dynamics can also be used for the continuous time value-gradient based policy, which provides a real valued policy. But the experiments show that its performance already falls away for the cart-pole task due to a less accurate value prediction, so planning approaches seem to be more effective, albeit at the expense of discretized actions and a higher computation time.

The comparison between the continuous time algorithm and the discrete time algorithm is a little ambivalent. While there is no significant preference shown in the experiments for the RBF network, the results differ quite notably for the FF-NN and GS-NN function representation schemes. The discrete time algorithm works well for the pendulum task, but did not manage to learn the cart-pole task. The opposite could be observed for the continuous time algorithm. These results suggest that the optimal solution is an adjustable relationship between the weighting of the value function and the reward function in the residual calculation. This optimal relationship is likely to differ for different tasks and function approximators.

Hierarchical RL helps us to improve the speed of learning. It is sometimes also the only way to prevent the algorithm from getting stuck in a local minimum. Our hierarchical approach was very simple, with predefined, successive subgoals, but even this approach is already very difficult to apply for the acrobot task. More complex approaches with an automatic detection, or at least adaption, of the hierarchic structure are very promising in this context, and are undoubtedly necessary for a more sophisticated learning system for more complex tasks. But a lot of further research has to be done in this area.

8.3 Q-Function Learning Experiments

In this section, we compare the performance of the Q-Function learning algorithms, namely Q-Learning and Advantage Learning, to each other and also to the V-Learning algorithms, in order to determine the benefits of using the system dynamics as prior knowledge. The basic disadvantage of Q-Learning is the need for a discretization of the action set, but this causes no problems for the low dimensional control spaces of the benchmark tasks. In our Q-Learning experiments, only the RBF network and two different FF-NN architectures were used.


8.3.1 Learning the Q-Function

In these experiments, we compared Q-Learning to Advantage Learning. Again, different gradient algorithms were used and plots are shown for different learning rates (figure 8.23). For action selection, a soft-max distribution with β = 20 was used.

Constant Grid-Based GSBFNs

For the pendulum task, one trial lasted for 200 episodes. All plots are averaged over 10 trials. The best configuration of the Q-Learning algorithm managed to learn the task in approximately 40 episodes. The results show that Q-Learning slightly outperforms Advantage Learning for the time scale factor K = 1, which might be a result of better optimized parameters for Q-Learning, because the difference is not significant. For advantage learning, a discount factor of γA = 0.95^(1/∆t) = 0.3585 (∆t = 0.05s) was used, which is equivalent to the discount factor used by Q-Learning. Experiments with a higher discount factor (e.g. γA = 0.95, i.e. γ = 0.95^∆t = 0.9974) resulted in a significantly worse performance.
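The equivalence is easy to verify: with ∆t = 0.05, one second of simulated time corresponds to 20 steps, and 0.95^(1/∆t) = 0.95^20 ≈ 0.3585; conversely, 0.3585^0.05 ≈ 0.95, so both algorithms discount the return identically per simulated second.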

Figure 8.23: Performance plots of (a) the Q-Learning algorithm and (b) the Advantage Learning algorithm (K = 1.0) for different gradient calculation schemes on the pendulum task.

For the cart-pole task, only the direct gradient and the residual gradient were tested; this can be seen in figure 8.24(b). For the advantage learning algorithm, again a discount factor of γA = 0.3585 was used. One learning trial took 4000 episodes. The results are averaged over five learning trials. In this task, the advantage learning algorithm already had a significantly worse performance than standard Q-Learning. The time scale factor (K = 1.0 was used) was not further optimized, but these results indicate that, at least without an optimization of K, advantage learning has no advantage over Q-Learning.

Figure 8.24: Performance plots of (a) the Q-Learning algorithm and (b) the Advantage Learning algorithm (K = 1.0) for different gradient calculation schemes on the cart-pole task.

FF-NNs

The use of FF-NNs for Q-Learning was only investigated for the pendulum task due to the long learning time. We tested two possibilities for using FF-NNs to represent the Q-Function. The first representation uses an FF-NN which takes the action value as an additional input. This FF-NN was created with 12 neurons in the hidden layer (resulting in 5 · 12 + 13 = 73 weights). The second approach uses an individual FF-NN (again with 12 hidden neurons) for each discretized action. Since there are three different discretized actions, we have 61 · 3 = 183 weights. The first approach might benefit from the generalization capabilities of the FF-NN, but is intuitively also harder to learn. In these experiments, one trial lasted for 5000 episodes, and the plots are averaged over 10 trials.
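These weight counts follow from the network topology under one consistent reading (four inputs for the state representation, one additional input for the action, and a bias only at the output neuron): the single network has (4 + 1) · 12 = 60 input-to-hidden weights plus 12 hidden-to-output weights plus one output bias, i.e. 73 weights; each per-action network drops the action input, giving 4 · 12 + 12 + 1 = 61 weights, so three such networks give 183 weights.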

Figure 8.25: Performance plots for learning the Q-Function of the pendulum task with an FF-NN: (a) one single FF-NN, (b) an individual FF-NN for each action.

The results show that FF-NNs are even more difficult to use for Q-Functions than for V-Functions. For certain initial (random) configurations of the neural network, and good configurations of the residual algorithm, learning was actually successful and sometimes even quite fast, but this was not reproducible for all initial configurations of the neural network. A directed exploration strategy is likely to attenuate the influence of poor initial configurations of the FF-NN. Again, learning with the direct gradient algorithm was unsuccessful for both types of FF-NNs. The residual algorithm with a high β setting, or even the residual gradient algorithm (β = 1.0), is more promising. Learning was successful in 7 out of 10 cases for the best configuration of the residual algorithm with the single FF-NN approach. The first FF-NN approach also performs slightly better than the second approach with three individual FF-NNs. This indicates that using the generalization capability of the FF-NN between actions is a better approach than using separate FF-NNs for the actions.

8.3.2 Comparison of different time scales

This experiment is similar to the time scale experiments in the V-Learning section. We compare Q-Learning with a constant γ setting of 0.95, Q-Learning with an adaptive γ of 0.3585^∆t, which is equivalent to the γ value used by advantage learning, and advantage learning itself (the time scale factor K was set to 1.0). This experiment was also done only for the pendulum task. The results are illustrated in figure 8.26.

The Q-Learning algorithm outperforms the advantage learning algorithm for larger time scales, but advantage learning is better for small time steps, because the advantage values are scaled by the inverse time step 1/∆t. However, using time steps as small as ∆t = 0.005 is typically not useful for learning, so this benefit of advantage learning is not very relevant. Another surprising result is that the Q-Learning algorithm with the adapted γ parameter performs worse for all time scales. High discount factors seem to be good in general (for V-Learning, the adaptive discount factor setting with higher discount factors outperforms the constant discount factor setting of 0.95), but they do not work in this case.

Figure 8.26: Experiments with Q-Function learning using (a) different time scales (pendulum task) and (b) the Dyna-Q algorithm with different numbers of planning updates.

8.3.3 Dyna-Q learning

In this section, we investigate the use of simulated experience from the past to update the Q-Function, as it is done by the Dyna-Q algorithm (see 4.8.1). At each step, we update the Q-Function with 0, 1, 2, 3 or 5 randomly chosen experiences from the last 40 episodes. We tested the Dyna-Q algorithm for both the pendulum and the cart-pole task using our standard RBF network. As the learning algorithm, the standard Q-Learning algorithm was used; for the Dyna-Q updates, a Q-Learner was also used, but without eligibility traces. The learning rate used for both learning algorithms was 0.75. The results are plotted in figure 8.26(b). The plot is averaged over 20 trials for the pendulum and over 10 trials for the cart-pole task. For the pendulum task, using the Dyna-Q planning updates did not have any effect on the performance; this task seems to be too simple. The results for the cart-pole task do show a slight improvement in performance. Surprisingly, the performance gets worse again if we use more planning updates. The reason for this might be that too many off-policy planning updates disturb the approximation of the Q-Function. Our experiments with Dyna-Q Learning unfortunately do not show a clear advantage for this planning approach. Additional experiments are needed to illustrate the benefits of Dyna-Q Learning.
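A sketch of the planning-update loop used here; the transition store and the Q-update are placeholders for the corresponding Toolbox components:

#include <cstdlib>
#include <functional>
#include <vector>

using State = std::vector<double>;

// One stored transition of real experience.
struct Transition { State s; int a; double r; State sNext; };

// Dyna-Q style planning step (sketch): after the real Q-Learning update,
// k randomly chosen transitions from the recent episodes are replayed
// with a Q-Learning update without eligibility traces.
void dynaQPlanningStep(const std::vector<Transition>& memory, int k,
                       const std::function<void(const Transition&)>& qUpdate)
{
    if (memory.empty()) return;
    for (int i = 0; i < k; ++i) {
        const Transition& t = memory[std::rand() % memory.size()];
        qUpdate(t);   // replayed (simulated) experience
    }
}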

8.3.4 Conclusion

Learning the Q-Function succeeded for all algorithms using the RBF network for the pendulum and the cart-pole task, thus Q-Learning can solve problems as complex as those solved by the V-Learning approach. The advantage of V-Learning is the superior learning speed. When using FF-NNs, however, the comparison looks different: learning the Q-Function is in this case even more difficult than learning the V-Function, resulting in a very unstable learning performance. Advantage Learning does not seem to have a significant advantage over Q-Learning, at least with the use of RBF networks; further tests with FF-NNs or GS-NNs have not been done. The benefit of Q-Learning is that it can be used even if the system dynamics are not known. We also have the possibility of using the Dyna-Q algorithm, or other approaches, to incorporate experience from the past, which is not possible for V-Learning because V-Learning is an on-policy algorithm. An interesting approach would be to incorporate the planning part used by V-Learning into Q-Learning (e.g. by calculating the value of a state by V(s) = max_a Q(s, a)). We ran some tests of a first naive approach (using planning for action selection instead of the Q-Values), which only resulted in a divergent behavior of the Q-Function. The reason for this is that the Q-Values which are considered best do not have to be taken at all, and so these values never get updated. A planning approach which simultaneously updates the Q-Values in the planning phase would be more appropriate.

8.4 Actor-Critic Algorithm

This section covers the two Actor-Critic algorithms for a discrete action set introduced in section 4.6, the stochastic real valued algorithm (SRV), and the new policy gradient Actor-Critic algorithm (PGAC).

8.4.1 Actor-Critic with Discrete Actions

In our experiments with the discrete Actor-Critic algorithm, an RBF network was used for the actor and the critic. We tested the two different algorithms introduced in section 4.6. The first algorithm uses the temporal difference for the actor update, whereas the second algorithm additionally weights the update of the actor by the probability of taking the current action: at least half of the learning rate is used for the update if the probability is very high. The tests were done both with and without eligibility traces for the actor. For creating a policy from the actor's action values, a standard soft-max policy was used with β = 20. Since these methods also use a discrete action set, they are comparable to the Q-Function learning algorithms.

In figure 8.27(a), we can see the results for the pendulum task, and in 8.27(b) for the cart-pole task. In both benchmark tasks, the algorithms perform well. For the pendulum task, the algorithms performed significantly better than the Q-Learning approach, although the performance for the cart-pole task was almost identical. Interestingly, the results of the tests with e-traces are drastically different for the two benchmark problems. While using e-traces resulted in a significantly worse performance for the pendulum task, it performed well in the cart-pole task. This indicates that using e-traces for the actor considerably improves the performance for more complex tasks. The comparison of the two Actor-Critic algorithms tends slightly towards the second approach; consequently, an emphasis on updates of actions with low probability is a good strategy.
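A sketch of the two actor updates compared here. The concrete weighting in the second variant is written in one form that matches the description above (full learning rate for unlikely actions, never less than half of it for very probable ones); it is an assumption, not necessarily the exact rule used in the Toolbox:

// Discrete Actor-Critic updates (sketch). td is the critic's temporal
// difference error, eta the actor's learning rate, piSA = pi(a|s) the
// soft-max probability of the executed action.
double actorUpdateVariant1(double eta, double td)
{
    return eta * td;                        // plain TD update of the actor value
}

double actorUpdateVariant2(double eta, double td, double piSA)
{
    // assumed weighting: full learning rate for pi(a|s) -> 0,
    // half of the learning rate for pi(a|s) -> 1
    return eta * (1.0 - 0.5 * piSA) * td;
}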

Figure 8.27: Experiments with the discrete Actor-Critic algorithms for (a) the pendulum task (200 episodes, averaged over 10 trials) and (b) the cart-pole task (4000 episodes, averaged over 5 trials).

We also investigated the use of FF-NNs for the actor and the critic simultaneously, which did not work at all, even when the best known parameter configuration for the critic from the previous experiments was used.

8.4.2 The SRV algorithm

The SRV uses a continuous policy as actor, which is a clear advantage over Q-Learning and the preceding Actor-Critic algorithms. For the SRV algorithm, noise plays a significant role in learning, so we tested the SRV for different amounts of noise and also for different smoothness levels of the noise. A disadvantage of this algorithm is that the optimal learning rate depends on the noise of the controller, so the learning rate has to be optimized for each noise setting. We used a sigmoidal policy (see 7.1.2) as the actor, which is implemented either as an RBF network or as an FF-NN. For the critic, again either an RBF network or an FF-NN is used, both with the most efficient algorithm and parameter configuration determined in the previous experiments. Our experiments with different function representations of the actor and the critic illustrate whether it is useful to use a non-linear FA for the value function (due to the high dimensional state space) if we know an easy to learn representation of the policy (for example the RBF network or parameterized controllers from optimal control).

For the pendulum task, one learning trial lasted for 200 episodes if an RBF network was used for the actor and the critic, and 3000 episodes if one or more FF-NNs were used. All results are averaged over 10 learning trials. The tests were done for α values of [0.0, 0.7, 0.9] and σ values of [15.0, 10.0, 7.5, 5.0, 1.0]. Note that the limits of the control variable are [−10, 10], so for high noise values, often just the limits of the control variable were taken. Thus, for filtered noise, often the same limit value is chosen for several time steps, which ensures a certain smoothness in time.
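A sketch of the exploration noise used with the SRV controller: Gaussian noise of standard deviation σ that can be low-pass filtered with the smoothness factor α (α = 0 gives white noise); the exact filter in the Toolbox may be normalized differently:

#include <random>

// Filtered Gaussian exploration noise (sketch). The filtered signal is added
// to the output of the sigmoidal actor and the sum is clipped to the control
// limits [-10, 10] of the pendulum task.
class FilteredNoise {
public:
    FilteredNoise(double sigma, double alpha, unsigned seed = 0)
        : dist_(0.0, sigma), alpha_(alpha), state_(0.0), rng_(seed) {}

    double next() {
        state_ = alpha_ * state_ + (1.0 - alpha_) * dist_(rng_);
        return state_;
    }

private:
    std::normal_distribution<double> dist_;
    double alpha_;
    double state_;
    std::mt19937 rng_;
};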


For each noise setting, the learning rate of the actor was roughly optimized in the interval [1, 4] for RBF networks, and in the interval [0.001, 0.025] for FF-NNs. For the pendulum task, we can see the performance of the RBF actor with an RBF critic in figure 8.28(a), and with an FF-NN critic in figure 8.28(b). In the case of the RBF critic, high exploration factors seem to be advantageous, and the results for different noise smoothness factors do not differ significantly. But for the FF-NN critic, which cannot track the evolution of the actor as quickly, we see a huge difference in performance for different smoothness factors. Obviously, the smoothness in time of the filtered noise signal resembles the smoothness in time of the optimal policy, resulting in a good performance for this noise distribution. The result even competes with the V-Planning policy using an FF-NN, which is quite remarkable, because the V-Planning policy uses the system dynamics as prior knowledge. It appears that the RBF actor can stabilize the learning process of the FF-NN because its parameter changes only affect the actor locally.

Figure 8.28: Performance plots of the SRV learning algorithm on the pendulum task, using an RBF actor with (a) an RBF critic and (b) an FF-NN critic, for different noise signals.

The use of an FF-NN as actor is very problematic, resulting in a very unstable learning performance. Surprisingly, we get better results for the FF-NN critic than for the RBF critic, but perhaps the reason for this is that the parameters (the learning rate of the actor) were not optimized accurately enough. The performance curves for the FF-NN critic are plotted in figure 8.30(a); in figure 8.30(b) we can see the learning curves of 10 trials using the most efficient noise signal and learning rate. The algorithm managed to learn the swing up in two out of 10 cases, and even in these two cases the performance was still very unstable. Seemingly, when using an FF-NN to represent the policy, the learning of even simple tasks becomes difficult, in fact even more difficult than using an FF-NN for the value function. In particular, we are likely to end up in a local minimum for all the gradient descent/ascent algorithms for the policy representation, which might be the reason for the bad performance. Adding some restrictions to the FF-NN, like certain smoothness criteria or limiting the norm of the weights, may help to overcome this problem, but this was not investigated further.

Figure 8.29: (a) Performance of the SRV learning algorithm using an FF-NN actor and an FF-NN critic. (b) Learning curves of the same algorithm, using the best determined configuration. The algorithm managed to learn the task in two out of 10 trials. The thick line represents the average learning curve.

Figure 8.30: (a) Performance of the SRV learning algorithm using an FF-NN actor and an FF-NN critic. (b) Learning curves of the same algorithm, using the best determined configuration. The algorithm managed to learn the task in two out of 10 trials. The thick line represents the average learning curve.

For the cart-pole task, the SRV algorithm experiments were only done for the RBF actor, with either an RBF or an FF-NN critic. In the tests with the RBF critic, 4000 episodes were used, as in the Q-Function learning experiments. The SRV approach could not confirm the promising results from the pendulum task: while the algorithm could outperform the Q-Learning methods on the pendulum task, its performance on the cart-pole task is significantly worse than that of the Q-Learning algorithm. This means that the SRV algorithm is more difficult to scale up to more complex tasks than the other algorithms; a potential problem is the choice of the noise signal for more complex tasks. The experiments with the FF-NN critic did not lead to any good results for any noise setting, although one learning trial had 10000 episodes in order to compare the results to the PGAC algorithm. Thus, a fair comparison with the V-Planning approaches is not possible, because the V-Planning method needed over 50000 episodes to learn the task.

Figure 8.31: Performance of the SRV learning algorithm on the cart-pole task, using an RBF actor and an RBF critic. One trial lasted for 4000 episodes and the plots are averaged over 5 trials.

8.4.3 Policy Gradient Actor-Critic Learning

In this section, we test the newly proposed Policy Gradient Actor-Critic (PGAC) algorithm. This algorithm also uses a continuous valued policy, but unlike the SRV algorithm, it requires knowledge of the system dynamics for the policy updates. We compare this algorithm to the SRV algorithm, because it is also an Actor-Critic algorithm, and also to the V-Planning policy, because in that case the system dynamics are used as well. Again, we tested the algorithm for different combinations of representations (FF-NNs or RBF networks) for the actor and the critic. The actor is, once more, a sigmoidal policy; the random noise controller was disabled in these experiments. Each actor-critic configuration was tested for different prediction and backwards horizons. For the critic part, the most efficient algorithms and parameters were always used.

For the pendulum task, one trial took 50 episodes if only RBF networks were used, and 3000 episodes if one or more FF-NNs were used (the same trial lengths as for V-Planning). All plots are averaged over 10 trials for RBF networks and over 5 trials when an FF-NN was used. The results with a forward prediction horizon are shown in figure 8.32(a) and the results with a backwards horizon in figure 8.32(b). Learning with an RBF actor was successful, but using a bigger prediction or backwards horizon only slightly improved the performance, at least with the FF-NN as critic. With the RBF network as critic, using a bigger time interval for the updates did not seem to result in any improvement for the pendulum task; the task seems to be too simple to benefit from bigger time windows for computing the gradient of the policy. Compared to the SRV algorithm, the performance is almost the same. For an FF-NN critic, the SRV even outperforms this algorithm slightly; we think this is due to the optimized noise level in the SRV experiments. The PGAC algorithm can also compete with the V-Planning approach for both kinds of critics. The use of an FF-NN as actor is also critical with the PGAC algorithm, and it does not work properly even for the pendulum task, probably for the same reasons as for the SRV algorithm.

Figure 8.32: Performance of the PGAC algorithm for the pendulum task using (a) a forward prediction horizon or (b) a backwards view horizon.

We ran the same tests for the cart-pole task; the results are illustrated in figure 8.33. For the RBF critic, one trial lasted for 4000 episodes, for the FF-NN critic 10000 episodes. For this task, we can already see a considerable difference between the different time intervals used for the update: with a time interval of 1.0, neither the RBF critic nor the FF-NN critic can learn the task, but the performance can be drastically improved by using a larger time interval for the updates. The RBF critic approach learns the cart-pole task quite well and outperforms the SRV for a time interval length of two. The algorithm has a performance comparable to the Q-Learning approaches, with the benefit of continuous valued control. The PGAC algorithm could not compete with the discrete time V-Planning approach, which gives particularly good solutions with a higher prediction horizon. In particular, the learned policy of the V-Planning approach is very efficient, which could not be achieved with the PGAC algorithm, because such sharp decision boundaries cannot be expressed with the RBF actor.

For the FF-NN critic, we did experiments with a large backwards horizon (7 was used) to illustrate the strength of the PGAC approach; the learning curve of one parameter setting is illustrated in figure 8.34. The algorithm managed to learn the task after 10000 episodes with a stable performance, which is a considerable improvement over the standard V-Learning approach. Again, the RBF actor managed to stabilize the FF-NN critic. In comparison with the SRV algorithm, this result also shows the performance advantage of the PGAC algorithm over the SRV algorithm. A disadvantage of the PGAC approach is the long learning time when using a large backwards or forwards horizon (one trial with 20000 episodes and a backwards horizon of 7 took 120000 seconds) due to the complex gradient calculation, and further optimization of the implementation is needed.

The prediction horizon was supposed to outperform the backwards horizon approach, because the actor is updated with future information. But both approaches lead to almost the same results for the pendulum and the cart-pole task, so the backwards horizon approach should be preferred due to the lower computational costs (no state prediction is needed).

Page 172: DiplomArbeit

172 Chapter 8. Experiments

Figure 8.33: Performance of the PGAC algorithm for the cart-pole task using (a) a forward prediction horizon or (b) a backwards view horizon.

Figure 8.34: Learning curve of the PGAC algorithm for the cart-pole task using an FF-NN critic. A backwards horizon of 7 was used. One trial lasted for 20000 episodes. The plots are averaged over 2 trials.


The PGAC algorithm outperforms the SRV algorithm, and almost reaches the performance of the V-Planning algorithm if we choose larger time intervals for the actor updates. Compared with the V-Planning algorithm, the PGAC algorithm has the significant advantage that no planning is needed for action selection, which means action selection is very fast. Using a higher prediction (or backwards) horizon for the weight updates is also possible, which can drastically improve the performance, as it does for V-Planning. But unlike V-Planning, whose computation time depends exponentially on the prediction horizon (O(|A|^n)), this dependency is linear (O(n)) for the PGAC algorithm. In our experiments for the pendulum task, a forward prediction horizon of five time steps took about twice as long as using no prediction horizon at all. In comparison with V-Planning with a search depth of five, which needed approximately 120 times longer, this is a considerable saving of computation time. For the cart-pole task, a similar benefit in computation time could be observed, but in this case the policy gradient calculation is more complex, resulting in four times as much computation time; for a prediction horizon of five, the V-Planning approach took approximately 80 times as long as the standard approach with a search depth of one. For the PGAC algorithm, even higher prediction/backwards horizons can be used, particularly at the beginning of learning, due to the linear computation time.

8.4.4 Conclusion

Actor-Critic algorithms are very promising and have, in our opinion, great potential. The discrete action algorithms compete with Q-Learning and almost reach the performance of the V-Planning approaches without requiring knowledge of the system dynamics. The continuous action algorithms both perform quite well; the SRV algorithm performs well for the simple pendulum task, but for the cart-pole task an exact tuning of the noise signal is necessary for an acceptable performance, so scaling it to more complex tasks is likely to be quite difficult. In this context, the PGAC algorithm is more promising, because its learning ability is easily scalable by extending the size of the time interval used for the updates. The PGAC algorithm could even outperform the 1-step V-Planning approach considerably for difficult configurations, using an FF-NN as critic for the cart-pole task. This is a very promising result, indicating the power of this approach. Another interesting aspect is that, because both algorithms use different information to update the policy but may use the same representation of the policy, it is possible to combine the SRV and the PGAC algorithm for the policy updates. Whether this leads to superior performance has yet to be investigated; a possible combination scheme is sketched below.
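A minimal sketch of such a combination, assuming both algorithms deliver a weight update for the same actor parameters; the update functions below are simple placeholders and not Toolbox classes.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Placeholder gradient estimates standing in for the SRV update rule and for the
    // PGAC value-gradient calculation; in a real setup these would come from the
    // respective algorithms operating on the shared actor representation.
    std::vector<double> srvUpdate(const std::vector<double>&)  { return {0.2, -0.1}; }
    std::vector<double> pgacUpdate(const std::vector<double>&) { return {0.05, 0.3}; }

    // Both estimates refer to the same parameters, so they can simply be mixed
    // with separate learning rates.
    void combinedUpdate(std::vector<double>& theta, double etaSrv, double etaPgac) {
        const std::vector<double> dSrv  = srvUpdate(theta);
        const std::vector<double> dPgac = pgacUpdate(theta);
        for (std::size_t i = 0; i < theta.size(); ++i)
            theta[i] += etaSrv * dSrv[i] + etaPgac * dPgac[i];
    }

    int main() {
        std::vector<double> theta = {0.0, 0.0};
        combinedUpdate(theta, 0.5, 0.5);
        std::printf("theta: %f %f\n", theta[0], theta[1]);
        return 0;
    }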

8.5 Comparison of the algorithms

In this section, we compare the results of the individual algorithms with each other. We always use the best configuration of each algorithm, and the plots show the averaged learning curves. First, we do the comparison for the RBF network. In figure 8.35(a), we can see the comparison of the methods using the system dynamics (V-Planning, the value-gradient based policy, and PGAC). The results show a slightly slower learning process for the value-gradient policy; the PGAC algorithm (a forward horizon of five steps is used) performs equally as well as the V-Planning approach. For the cart-pole task (figure 8.36(a)), these results can be seen even more clearly. The performance of the value-gradient based policy falls off considerably; the PGAC approach (a forward horizon of five was used) has the same learning speed as the V-Planning approach, but cannot reach the quality of the learned policy. This can be explained by the use of the real-valued RBF actor, which cannot represent hard decision boundaries as the V-Planning approach can. In this plot, we can also see the power of the V-Planning approach if we plan for more than one step. Using five steps for the prediction, the algorithm finds an optimal solution almost four times as fast as with the standard one-step prediction. The quality of the learned solution is also considerably better.


In figure 8.35(b), we can see the results for the model-free approaches (Q-Learning, Advantage Learning, Actor-Critic Learning, SRV). All three Actor-Critic approaches outperform the Q-Learning approach. Advantage Learning learns as fast as Q-Learning, but the quality of the learned policy does not seem to be as good as for the other algorithms. The results for the cart-pole task can be seen in figure 8.36(b). Q-Learning and the discrete Actor-Critic learning algorithms have the best performance in this case. Surprisingly, the quality of the learned policy for Q-Learning is even better than for Actor-Critic learning. The SRV algorithm struggles with the complexity of this task and, on average, does not manage to learn a good policy within 4000 episodes.


Figure 8.35: (a) Learning curves of V-Planning (one-step prediction), the value-gradient policy, and the PGAC algorithm for the pendulum task. (b) Comparison of Q-Learning, Advantage Learning, the two Actor-Critic learning approaches and the SRV algorithm. All algorithms use the RBF network. Plots are averaged over 10 trials.


Figure 8.36: (a) Learning curves of V-Planning (one- and five-step prediction), the value-gradient policy, and the PGAC algorithm for the cart-pole task. (b) Comparison of Q-Learning, Advantage Learning, Actor-Critic Learning and the SRV algorithm. All algorithms use the RBF network; plots are averaged over 5 trials.


Figure 8.37: Learning curves of V-Learning (one-step prediction), V-Planning (five-step prediction), the PGAC and the SRV algorithm using FF-NNs for the pendulum task; plots are averaged over 10 trials.

8.6 Policy Gradient Algorithm

This section covers the policy gradient algorithms included in this thesis: CONJPOMDP and two variants of the PEGASUS algorithm using gradient ascent, namely the numerical and the analytical policy gradient calculation. The tests in this section are unfortunately not very extensive due to the lack of time.

8.6.1 GPOMDP

The GPOMDP algorithm was tested with the RBF network and also with the FF-NN for the pendulum task. Due to the huge learning time needed for this task, learning was not tried on more complex tasks. A stochastic policy with a soft-max distribution (β = 20) is used to represent the policy. We only tested the CONJPOMDP algorithm with the original setting, so GSEARCH was used for determining the optimum learning rate. Different numbers of episodes (5, 20, 50 and 100) were tried for the gradient estimation, which is plotted in figure 8.38. The GSEARCH algorithm always uses 1/5 of these episodes for its gradient estimation, which does not need to be that accurate. We limited the learning rate calculated by the GSEARCH algorithm to [0.1, 160] for the RBF policy and to [0.005, 5.0] for the FF-NN policy. The start learning rates used were 10.0 and 0.5.

In figure 8.38(b), we can see the learning curves for the RBF network using different numbers of episodes for the gradient estimation. After each weight update, the average height of 20 episodes was recorded. As we can see, the CONJPOMDP algorithm needs a huge number of gradient updates to converge. Learning was only successful using 100 episodes per gradient estimation, and the algorithm needed approximately 3000 weight update steps, resulting in about 80 million learning steps, or over 40000 seconds, for learning a task which can also be learned in 10 episodes with 200 steps, or 5 seconds, using the same RBF network as function approximator for the value function. The huge variance of the performance estimate also shows that learning is quite unstable. This is also a consequence of the β value of 0.95 used for the GPOMDP algorithm. This β value specifies the bias-variance trade-off of the gradient estimation: high β values have a large variance, but a small bias with respect to the real gradient. Learning with the FF-NN did not lead to any success within 5000 weight updates, so continuing the learning was not considered to be useful.
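For reference, the structure of the GPOMDP gradient estimator (eligibility trace plus running average, cf. Baxter and Bartlett [9]) is summarized by the following sketch on a deliberately tiny, hypothetical two-state problem; it only illustrates the estimator itself and does not reproduce our pendulum setup or the Toolbox classes.

    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Minimal GPOMDP-style gradient estimate for a soft-max policy on a hypothetical
    // 2-state, 2-action toy problem: z is the eligibility trace of grad log pi, and
    // grad is the running average of r_{t+1} * z_{t+1}.
    int main() {
        const int numActions = 2;
        const double betaSoftmax = 20.0;  // softness of the soft-max policy
        const double betaTrace   = 0.95;  // bias-variance trade-off of the estimate
        const int T = 100000;             // simulation steps for one gradient estimate

        std::vector<double> theta(2 * numActions, 0.0);  // policy parameters (2 states)
        std::vector<double> z(theta.size(), 0.0);        // eligibility trace
        std::vector<double> grad(theta.size(), 0.0);     // gradient estimate

        std::mt19937 rng(0);
        std::uniform_real_distribution<double> uni(0.0, 1.0);

        int s = 0;
        for (int t = 0; t < T; ++t) {
            // Soft-max probabilities for the two actions in the current state.
            double e0 = std::exp(betaSoftmax * theta[s * numActions + 0]);
            double e1 = std::exp(betaSoftmax * theta[s * numActions + 1]);
            double prob0 = e0 / (e0 + e1);
            int a = (uni(rng) < prob0) ? 0 : 1;

            // Toy dynamics and reward: action 1 moves to state 1, which pays reward 1.
            int sNext = (a == 1) ? 1 : 0;
            double r = (sNext == 1) ? 1.0 : 0.0;

            // z <- betaTrace * z + grad log pi(a|s); only the current state's entries change.
            for (double& zi : z) zi *= betaTrace;
            z[s * numActions + 0] += betaSoftmax * ((a == 0 ? 1.0 : 0.0) - prob0);
            z[s * numActions + 1] += betaSoftmax * ((a == 1 ? 1.0 : 0.0) - (1.0 - prob0));

            // Running average of r_{t+1} * z_{t+1} over all steps.
            for (std::size_t i = 0; i < grad.size(); ++i)
                grad[i] += (r * z[i] - grad[i]) / (t + 1);

            s = sNext;
        }
        std::printf("gradient estimate: %f %f %f %f\n", grad[0], grad[1], grad[2], grad[3]);
        return 0;
    }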


Figure 8.38: (a) Performance of the GPOMDP algorithm for the pendulum task using an RBF network or an FF-NN. (b) Learning curve for the RBF network with different numbers of gradient estimation episodes.

These results resemble the results shown by Baxter [11], where a vast number of trials was needed to learn to navigate a puck on a plateau. Perhaps the choice of a different β value for the GPOMDP algorithm (0.95 was used) might have improved the performance of the algorithm, but due to the poor performance and the long learning time, this algorithm was not investigated any further.

8.6.2 The PEGASUS algorithm

The PEGASUS algorithm was tested with an FF-NN or the standard RBF network representing the continuous policy. Again, a sigmoidal policy was used to limit the control values, and the noise controller was disabled. We tested two approaches for calculating the optimal learning rate in this case, the GSEARCH algorithm and the standard line search algorithm discussed in section 6.2.5. Again, we tested the algorithms with different numbers of episodes used for the gradient estimation.

The analytical algorithm must calculate the gradient of the policy and the transfer function, which is done numerically; the step size of the three-point method used was 0.005. The numerical algorithm, on the other hand, which needs to calculate the gradient of the value of a policy directly, used a differentiation step size for the weights of 0.05. Both differentiation step sizes (for the numerical and the analytical algorithm) were empirically chosen and only roughly optimized.

We tested the algorithm for the pendulum task with the following parameters: for the RBF network, the initial learning rate η0 of the GSEARCH algorithm was set to 10, and the learning rates were limited to [0.1, 160]. For the value-based line search algorithm, we used the learning rates [0.1, 1.0, 5.0, 10.0, 30.0, 60.0, 120.0, 240.0] as search points. Learning was done for 50 weight updates. Using the FF-NN, the start learning rate η0 was set to 0.1 and limited to [1/160, 160], and the values [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0] were used as search points for the value-based line search. For each value evaluation of the line search algorithm, 50 episodes were simulated, using the same initial states at each evaluation. Learning was done for 1000 weight updates for the FF-NN.

We can see the results in figure 8.39(a) for the RBF network, using different numbers of gradient estimation episodes. All algorithms managed to learn the RBF policy, even with only five gradient estimation episodes. Usually, learning was already successful after 10 weight updates, which is an immense difference from the GPOMDP algorithm.
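The numerical variant can be summarized as follows: because the PEGASUS scenarios fix the start states and the random numbers, the value of a policy becomes a deterministic function of its weights and can be differentiated with central differences (the three-point method). The sketch below uses a hypothetical evaluation function and is not the Toolbox implementation; the dummy quadratic in main only demonstrates the call.

    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <vector>

    // Central-difference ("three-point") policy gradient: evaluateOnScenarios(theta) is
    // assumed to run the policy on a FIXED set of scenarios and return its average value,
    // so it is a deterministic function of the weights.
    std::vector<double> numericalPolicyGradient(
        std::vector<double> theta,
        const std::function<double(const std::vector<double>&)>& evaluateOnScenarios,
        double h)  // differentiation step size, e.g. 0.05
    {
        std::vector<double> grad(theta.size());
        for (std::size_t i = 0; i < theta.size(); ++i) {
            const double saved = theta[i];
            theta[i] = saved + h;
            const double vPlus = evaluateOnScenarios(theta);
            theta[i] = saved - h;
            const double vMinus = evaluateOnScenarios(theta);
            theta[i] = saved;
            grad[i] = (vPlus - vMinus) / (2.0 * h);  // one gradient costs 2*|theta| evaluations
        }
        return grad;
    }

    int main() {
        // Dummy "policy value": a quadratic with its maximum at theta = (1, -2).
        auto value = [](const std::vector<double>& th) {
            return -(th[0] - 1.0) * (th[0] - 1.0) - (th[1] + 2.0) * (th[1] + 2.0);
        };
        std::vector<double> g = numericalPolicyGradient({0.0, 0.0}, value, 0.05);
        std::printf("gradient at the origin: %f %f\n", g[0], g[1]);  // roughly (2, -4)
        return 0;
    }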


The line search algorithm for estimating the optimal learning rate seems to be more robust than the GSEARCH algorithm in this case: because the PEGASUS framework eliminates most of the noise in the value estimates, the GSEARCH algorithm's advantage of being insensitive to noise no longer matters. The numerical solution falls off in performance and only managed to learn the task in 7 out of 10 cases. If it learns the solution correctly, the results are as good as for the other two algorithms. Optimizing the numerical step size would probably have solved this problem, but this was not done due to the high learning time required by the numerical solution. For the pendulum task, the numerical solution took about 800 seconds of computation time, whereas the analytical solution needed just 30 seconds, which is a considerable improvement. Especially for policies with many weights (like a policy based on an RBF network), the numerical solution has a huge speed disadvantage.

The results for the FF-NN are, unfortunately, not encouraging at all. Learning was not successful in a single learning trial after 1000 weight updates for any of the three tested algorithms. Many different magnitudes for the interval of the optimal learning rate were tried without success. The exact reasons for this have yet to be investigated, as unfortunately there was no time left for doing exhaustive tests with the policy gradient algorithms; intuitively, our gradient approach got stuck in a local optimum very quickly. In general, we can say that the analytical version of the gradient estimation works very efficiently and also more accurately than the numerical solution. Complementing this gradient method with other optimization methods like genetic algorithms or simulated annealing, which avoid local optima, would be a promising approach, e.g. for learning the policy with an FF-NN. The value-based line search used here is sketched below.
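A minimal sketch of this value-based line search, assuming a deterministic evaluation on the fixed PEGASUS scenarios (the evaluation function and all names are hypothetical):

    #include <cstddef>
    #include <cstdio>
    #include <functional>
    #include <vector>

    // Try a fixed list of candidate learning rates along the gradient direction and keep
    // the step with the best evaluated policy value; returning 0 keeps the old weights.
    double pickLearningRate(
        const std::vector<double>& theta,
        const std::vector<double>& grad,
        const std::vector<double>& candidates,  // e.g. {0.1, 1.0, 5.0, 10.0, 30.0, 60.0, 120.0, 240.0}
        const std::function<double(const std::vector<double>&)>& evaluateOnScenarios)
    {
        double bestEta = 0.0;
        double bestValue = evaluateOnScenarios(theta);
        for (double eta : candidates) {
            std::vector<double> trial(theta);
            for (std::size_t i = 0; i < trial.size(); ++i)
                trial[i] += eta * grad[i];
            const double v = evaluateOnScenarios(trial);
            if (v > bestValue) { bestValue = v; bestEta = eta; }
        }
        return bestEta;
    }

    int main() {
        // Dummy value function -x^2 with gradient -2x; at x = 4 the gradient is -8.
        auto value = [](const std::vector<double>& th) { return -th[0] * th[0]; };
        std::vector<double> theta = {4.0}, grad = {-8.0};
        std::vector<double> rates = {0.1, 1.0, 5.0, 10.0};
        std::printf("chosen learning rate: %f\n", pickLearningRate(theta, grad, rates, value));
        return 0;
    }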


Figure 8.39: (a) Performance of the PEGASUS algorithm for the pendulum task using an RBF network. (b) Learning curve for the RBF network with the different PEGASUS approaches.

8.6.3 Conclusions

Our tests with the policy gradient algorithms were unfortunately not as exhaustive as for the other learning algorithms. The GPOMDP algorithm is mainly of theoretical importance; its poor performance makes it impractical. However, several extensions of and algorithms related to GPOMDP exist which are supposed to have good performance, but these were not tested in this thesis. The PEGASUS algorithm with the analytical gradient estimation is more promising, as it is able to calculate the gradient quite accurately while remaining fast.


Policy gradient algorithms are also often used with open-loop control, for example for controlling the gait of a four-legged robot [24]. It would be interesting to see how the analytical gradient estimation approach works for such policies with a low-dimensional parameter space. The analytical gradient algorithm does basically the same calculations as the PGAC algorithm, but without the use of the value function (it can be seen as a special case of the PGAC algorithm with a very long time interval for the weight updates). In order to find good learning rates and exact gradients, the policy gradient approach usually needs more simulation steps to learn the task than value-based methods; thus, the PGAC algorithm should be preferred if the complexity of the learning task allows the learning of a value function.

8.7 Conclusions

RL is still quite tricky to use for continuous optimal control tasks. While it is very promising for small toy examples like the pendulum swing-up task, it does not scale up well to tasks with more dimensions and greater complexity, like the acrobot task. The effort required to use RL for these fairly small benchmark tasks was drastically underestimated, resulting in a considerable delay in finishing this thesis. Some experiments with the used benchmark problems, or even new benchmark problems, could not be done due to the lack of time. Nevertheless, this thesis hopefully gives a good overview of the use of RL algorithms for optimal control, their strengths and weaknesses, and possible applications. This thesis is also, to our knowledge, the most extensive collection of comparative benchmark data for the different RL algorithms. In many areas, more extensive tests would be needed in order to draw more detailed conclusions, and the implementation and comparison of other function approximation schemes (locally weighted learning, NG-nets, echo-state networks) and of a few more algorithms, like additional policy gradient or Actor-Critic algorithms, would be interesting. The RL Toolbox already provides a very good framework for additional experiments with RL; it is in use by approximately 20 researchers all over the world and will hopefully gain more users in the future.

Generally, we can say that the pure RL algorithms used to learn the value function or to estimate the policy gradient are already quite sophisticated, but nearly all of them lack the ability to scale up to more complex tasks. This is a consequence of the function approximators used. RBF networks scale badly due to the curse of dimensionality, and they are also difficult to apply if a highly accurate value function is needed. FF-NNs and GS-NNs have comparatively poor learning performance even if good parameter settings have been found; therefore it is very difficult to use them for more complex tasks. A sophisticated learning system which uses RL for complex, high dimensional tasks would need to combine all the benefits of the different algorithms presented in this thesis, and obviously also solve some of the other problems, like autonomous sub-goal detection or finding good representations for the value function. Many approaches introduced in this thesis, like directed exploration, using planning, or adding a hierarchic structure to the task, showed very promising results. Further development of these ideas, and combining them appropriately, will hopefully help us to cope with at least a few of the problems occurring in RL.


Appendix A

List of Abbreviations

RBF       Radial Basis Function
DP        Dynamic Programming
EM        Expectation Maximization
ESN       Echo State Network
E-Traces  Eligibility Traces
FA        Function Approximator
FF-NN     Feed Forward Neural Network
GSBFN     Gaussian Soft-Max Basis Function Network
GS-NN     Gauss Sigmoid Neural Network
HAM       Hierarchy of Abstract Machines
LMS       Least Mean Square
LQR       Linear Quadratic Regulator
LWL       Locally Weighted Learning
MDP       Markov Decision Process
MSE       Mean Squared Error
NG-net    Normalized Gaussian networks with linear regression
PEGASUS   Policy Evaluation of Goodness and Search Using Scenarios
PGAC      Policy Gradient Actor Critic Algorithm
POMDP     Partially Observable Markov Decision Process
PS        Prioritized Sweeping
Q         State Action Value
RARS      Robot Auto Racing Simulator
RL        Reinforcement Learning
RLT       Reinforcement Learning Toolbox
SARSA     State Action Reward State Action learning
SMDP      Semi-Markov Decision Process
SRV       Stochastic Real Valued Algorithm
STL       Standard Template Library
TBU       Truck-Backer-Upper Task
TD        Temporal Difference
V         State Value


Appendix B

List of Notations

< ... >              Tuple
[...]                Vector
A                    Set of all actions
A_s                  Set of actions available in state s
a, a_t               Action
β                    Controls the softness of the soft-max policy
β                    Weighting factor for the residual algorithm
β                    Bias-variance trade-off factor in the GPOMDP algorithm
β                    Termination condition of an option
C(s)                 Number of visits of state s
χ                    Exploration measure
critique             TD coming from the critic
d(s)                 Probability of initial state s
D                    Set of all initial states
∆                    Change in a certain value
e(s_i)               E-Trace for state s_i
e(w_i)               E-Trace for weight w_i
E                    Error function
E[]                  Expectation operator
η                    Learning rate
f(s, a)              State transition function
g(s, a, p)           Deterministic simulative model
γ                    Discount factor
H                    Hamiltonian
Γ                    Selective attention factor
λ                    E-Trace attenuation factor
∇                    Gradient with respect to the weights (w or θ)
∇_β V                GPOMDP estimation of the policy gradient
p(s)                 Action value in actor-critic learning
π                    Policy, state to action mapping
π(s)                 Deterministic policy
π(s, a), µ(s, a)     Stochastic policy
φ_i                  Activation function of feature i
Φ                    Activation vector of all features
o                    Option
O                    Set of all options
P(x = X)             Probability that the random variable X has the value x
P(x = X | y = Y)     Conditional probability that x = X if y = Y is already known
ψ^π(s), ψ^π(s, a)    Exploration (action) value function
Q_w                  Approximated action value function
Q^π(s, a)            Action value when taking action a in state s and then following π
Q*                   Optimal Q-function
r, r_t               Reward
r(s, a, s')          Reward function
r^o_s                Option reward
residual(s, s')      Error of the Bellman equation
σ(s)                 sigmoidal squashing function (logsig)
σ                    Variance, variance of the noise
S                    Set of all states
s, s_t               State
s_0                  Initial state
s'                   Successor state
s_γ                  Continuous time discount factor
s_λ                  Continuous time λ
td                   Temporal Difference
θ                    Parameter vector of the policy
∆t                   Used time step
u                    Continuous control vector
U                    Set of all control values
V_w                  Approximated value function
V^π_A(t), V^π_A(s_t) Value (future average reward) in time step t when following π
V_A(π)               Expected future average reward beginning in a typical initial state
V^π(t), V^π(s_t)     Value (future discounted reward) in time step t when following π
V(π)                 Expected future discounted reward beginning in a typical initial state
V*                   Optimal value function
V_k, Q_k             (Action) value function at the k-th iteration of DP
w                    Weight vector of the value function
∆w_D                 Direct Gradient weight update
∆w_R                 Residual weight update
∆w_RG                Residual Gradient weight update
∆W_D                 Epoch-wise Direct Gradient weight update
∆W_R                 Epoch-wise Residual weight update
∆W_RG                Epoch-wise Residual Gradient weight update


Appendix C

Bibliography


[1] P. Absil and R. Sepulchre. A hybrid control scheme for swing-up acrobatics. European Conference on Control ECC, 2001.

[2] D. Andre, N. Friedman, and R. Parr. Generalized prioritized sweeping. NIPS 97, 1997.

[3] C. Atkeson, W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1-5):75–113, 1997.

[4] C. Atkeson, W. Moore, and S. Schaal. Locally weighted learning for control. Artificial Intelligence Review, 11(1-5):11–73, 1997.

[5] L. Baird. Reinforcement learning in continuous time: Advantage updating. In International Conference on Neural Networks, June 1994.

[6] L. Baird. Reinforcement Learning Through Gradient Descent. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1999.

[7] A. Barron. Universal approximation bounds for superpositions of sigmoidal functions. IEEE Transactions on Information Theory, 1993.

[8] A. Barto and R. Sutton. Neuron-like adaptive elements that can solve difficult learning control problems. In IEEE Transactions on Systems, Man, and Cybernetics, 1983.

[9] J. Baxter and P. Bartlett. Direct gradient-based reinforcement learning: 1. Gradient estimation algorithms. Technical report, CSL, Australian National University, 1999.

[10] J. Baxter, A. Tridgell, and L. Weaver. KnightCap: a chess program that learns by combining TD(λ) with game-tree search. In Proc. 15th International Conf. on Machine Learning, pages 28–36. Morgan Kaufmann, San Francisco, CA, 1998.

[11] J. Baxter and L. Weaver. Direct gradient-based reinforcement learning: 2. Gradient ascent algorithms and experiments. Technical report, CSL, Australian National University, 1999.

[12] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1998.


[13] G. Boone. Minimum-time control of the acrobot. International Conference on Robotics and Automation, 1997.

[14] S. Brown and K. Passino. Intelligent control for an acrobot. Intelligent and Robotic Systems, 1996.

[15] R. Coulom. Reinforcement Learning using Neural Networks. PhD thesis, Institut National Polytechnique de Grenoble, 2002.

[16] T. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. 1998 International Conference on Machine Learning, 1998.

[17] K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12, 1999.

[18] P. Fidelman and P. Stone. Learning ball acquisition on a physical robot. In 2004 International Symposium on Robotics and Automation (ISRA), August 2003.

[19] V. Gullapalli. Reinforcement Learning and its Application to Control. PhD thesis, Graduate School of the University of Massachusetts, 1992.

[20] J. Izawa, T. Kondo, and K. Ito. Biological arm motion through reinforcement learning. In 2002 IEEE International Conference on Robotics and Automation (ICRA'02), 2004.

[21] H. Jaeger. The echo state approach to analysing and training recurrent neural networks. GMD Report 148, 2001.

[22] H. Jaeger. A tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the "echo state network" approach. 2002.

[23] S. Kakade. A natural policy gradient. In NIPS: Advances in Neural Information Processing Systems, 2000.

[24] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In The Nineteenth National Conference on Artificial Intelligence, pages 611–616, July 2003.

[25] P. Lancaster and M. Tismenetsky. The Theory of Matrices, with Applications. Academic Press, San Diego, 1984.

[26] Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science 1524. Springer Verlag, 1998.

[27] R. Makar, S. Mahadevan, and M. Ghavamzadeh. Hierarchical multi-agent reinforcement learning. In AGENTS '01: Proceedings of the Fifth International Conference on Autonomous Agents, pages 246–253, New York, NY, USA, 2001. ACM Press.

[28] H. Miyamoto, J. Morimoto, K. Doya, and M. Kawato. Reinforcement learning with via-point representation. Neural Networks, 17, 2004.

[29] J. Morimoto and K. Doya. Hierarchical reinforcement learning of low-dimensional subgoals and high-dimensional trajectories. In The 5th International Conference on Neural Information Processing, volume 2, pages 850–853, 1998.


[30] J. Morimoto and K. Doya. Reinforcement learning of dynamic motor sequence: Learning to stand up. In IEEE/RSJ International Conference on Intelligent Robots and Systems, volume 3, pages 1721–1726, 1998.

[31] J. Morimoto and K. Doya. Robust reinforcement learning. Advances in Neural Information Processing Systems 13, pages 1061–1067, 2004.

[32] A. Ng and A. Coates. Autonomous inverted helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics, 1998.

[33] A. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence, Proceedings of the Sixteenth Conference, 2000.

[34] M. Nishimura, J. Yoshimoto, and S. Ishii. Acrobot control by learning the switching of multiple controllers. In Ninth International Symposium on Artificial Life and Robotics, volume 2, 2004.

[35] R. Olfati-Saber, editor. Fixed Point Controllers and Stabilization of the Cart-Pole System and the Rotating Pendulum, 1999.

[36] R. Parr and S. Russell. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1997.

[37] J. Peters, S. Vijayakumar, and S. Schaal. Reinforcement learning for humanoid robotics. Third IEEE-RAS International Conference on Humanoid Robots, 2003.

[38] M. Pfeiffer. Machine learning applications in computer games. Master's thesis, Institute of Computer Science, TU-Graz, 2003.

[39] J. Randlov. Solving Complex Problems with Reinforcement Learning. PhD thesis, Niels Bohr Institute, University of Copenhagen, 2001.

[40] R. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112, pages 181–211, 1999.

[41] J. Schaeffer, M. Hlynka, and V. Jussila. Temporal difference learning applied to a high-performance game-playing program. International Joint Conference on Artificial Intelligence (IJCAI), pages 529–534, 2001.

[42] K. Shibata, M. Sugisaka, and K. Ito. Hand reaching movement acquired through reinforcement learning. Proc. of 2000 KACC (Korea Automatic Control Conference), 2000.

[43] J. Si and Y. Wang. On-line learning control by association and reinforcement. In IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00), volume 3, 2000.

[44] S. Singh and R. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22, 1996.

[45] W. Smart. Making Reinforcement Learning Work on Real Robots. PhD thesis, Department of Computer Science, Brown University, 2002.

[46] W. Smart and L. Kaelbling. Practical reinforcement learning in continuous spaces. In Proc. 17th International Conf. on Machine Learning, pages 903–910. Morgan Kaufmann, San Francisco, CA, 2000.


[47] W. Smart and L. Kaelbling. Reinforcement learning for robot control. Mobile Robots XVI, 2001.

[48] R. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224, 1990.

[49] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 2004.

[50] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), 1995.

[51] S. Thrun. Selective exploration. Technical Report CMU-CS-92-102, Carnegie Mellon University, Computer Science Department, Pittsburgh, 1992.

[52] J. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. Technical Report LIDS-P-2322, 1996.

[53] H. Vollbrecht. Hierarchic task composition in reinforcement learning for continuous control problems. In ICANN98. Neural Information Processing Department, University of Ulm, 1999.

[54] R. Williams. A class of gradient-estimating algorithms for reinforcement learning in neural networks. Proceedings of the IEEE First Annual International Conference on Neural Networks, 1987.

[55] J. Wyatt. Exploration and Inference in Learning from Reinforcement. PhD thesis, Department of Artificial Intelligence, University of Edinburgh, 1997.

[56] T. Yonemura and M. Yamakita. Swing up control of acrobot. SICE Annual Conference in Sapporo, 2004.

[57] J. Yoshimoto and S. Ishii. Application of reinforcement learning to balancing of acrobot. In 1999 IEEE International Conference on Systems, Man and Cybernetics, pages 516–521, 1999.