
Reinforcement Learning Control Based on Multi-Goal Representation Using Hierarchical Heuristic Dynamic Programming

Zhen Ni and Haibo He
Department of Electrical, Computer and Biomedical Engineering
University of Rhode Island
Kingston, RI 02881, USA.
Email: {ni, he}@ele.uri.edu

Dongbin Zhao
State Key Laboratory of Management and Control for Complex Systems
Institute of Automation
Chinese Academy of Sciences
Beijing 100190, China.
Email: [email protected]

Danil V. Prokhorov
Toyota Research Institute NA, TTC,
Ann Arbor, MI 48105, USA.
Email: [email protected]

Abstract—We are interested in developing a multi-goal generator to provide detailed goal representations that help to improve the performance of the adaptive critic design (ACD). In this paper we propose a hierarchical structure of goal generator networks to cascade external reinforcement into more informative internal goal representations in the ACD. This is in contrast with previous designs, in which the external reward signal is assigned to the critic network directly. The ACD control system performance is evaluated on the ball-and-beam balancing benchmark under noise-free and various noisy conditions. Simulation results in the form of a comparative study demonstrate the effectiveness of our approach.

I. INTRODUCTION

Adaptive dynamic programming (ADP) has been employed as one of the key approaches for seeking optimal solutions for general systems [1], [2], [3], [4], [5], [6], [7], [8], [9], and has therefore become a critical research topic in the communities of reinforcement learning, computational intelligence, and adaptive control over the past decades. Recent literature surveys on ADP [10], [11] have presented a comprehensive overview of its history, from the Bellman equation to dynamic programming, and of its applications, from benchmark tests to real industry applications.

While ADP is a powerful technique in the domain of machine intelligence, there is still an open question of how to generally assign an instant reinforcement signal/reward based on the system's behavior. Many scientists and researchers have contributed work on this topic. For instance, Prokhorov et al. proposed to use the squared difference between the actual output and the desired output as the reinforcement signal for a tracking system [12], [13]; Si et al. used discrete values "0" or "-1" ("0", "-0.4", or "-1") as the reward for the controller [14], [15], [16]; Zhang et al. proposed to employ the linear/generalized quadratic form as the performance index that evaluates the system's behaviors [17], [18], [19]; and Venayagamoorthy et al. proposed to use different weights for different terms to define the utility function [20], [21], among others. Normally, reinforcement signals are defined according to problem specifics, and such signals are crafted manually according to prior knowledge or domain expertise, e.g., [22], [23]. There seems to be an interest in designing the reinforcement signal with as little human intervention as possible, to achieve adaptivity and efficiency in spite of possible changes in the operating environment.

In this paper, we propose a hierarchical structure of goal generator networks to assign the reinforcement signal adaptively according to the system's behavior, with no prior knowledge or past experience of the control system. This is the first time we implement the hierarchical goal generator networks proposed in [24], illustrating their promising performance. Specifically, we build the goal generator with multiple neural networks to generate the internal goals that cascade the reinforcement signal.

In our design, we keep the advantage of the model-free action dependent heuristic dynamic programming (HDP) of [14]. Our motivation is to employ the goal generator networks to learn from the external reinforcement signal and provide the critic network with a detailed internal goal representation (reinforcement signal). Our approach is to develop a mechanism to represent the usually discrete reinforcement signal (e.g., "0" and "-1") by continuous values. In this way, our goal generator can provide a more effective goal representation. This internal goal representation is expected to carry more information, and it can be adjusted adaptively to help the system's decision making process.

The goal generator networks observe the state vectors and the control value as their inputs, together with the internal goal from the network at the next higher hierarchical level (or the external environment at the top level), while providing the network below with its internal goal. In order to show the improved performance of this goal generator structure, we test the ball-and-beam balancing benchmark with three algorithms: the typical HDP approach (without any goal generator network), our proposed approach with one goal generator network, and our proposed approach with multiple (three) goal generator networks, under the same simulation conditions (with and without noise).


Fig. 1. The schematic of the hierarchical HDP structure

The rest of this paper is organized as follows. In Section II, we provide a detailed description of the hierarchical goal generator network structure together with its associated weight tuning rules. In Section III, the three algorithms are tested on the ball-and-beam balancing problem with the same environment settings. The comparison of the simulation results, typical trajectories of the state vectors, and a discussion are also presented in that section. Finally, the conclusion and future work are given in Section IV.

II. HIERARCHICAL HDP STRUCTURE DESIGN

The schematic of the hierarchical HDP design is presented in Fig. 1, where one can see that we maintain the typical model-free action dependent HDP as in [14]. Our main contribution is to introduce the goal generator with hierarchical neural networks to cascade the external reinforcement signal and provide the critic network with a hopefully improved internal goal representation of the external reinforcement signal. The goal generator can critique the system's behaviors and the control action, and then generate the adjustable internal goal signal automatically. In this section, we describe the learning and adaptation of the goal generator networks, the critic network, and the action network.

A. Learning and Adaptation in Goal Generator

From Fig. 2, one can see that the parameters in the goal generator networks are adjusted independently. The top goal generator network l first learns to approximate the discounted total future reward-to-go based on r, and then provides goal generator network l - 1 with the updated internal reinforcement signal s_l. In this way, r provides top-down guidance for s_l. The internal goal s_l should follow the "guidance" that the external reinforcement signal r provides. When m = 1, the internal goal s_1 follows the guidance of s_2 through goal generator network 1, and passes the goal information to the critic network.


Fig. 2. Cascading weights tuning path inside the goal generator

The goal generator networks form a cascade to represent the external reinforcement signal r internally. Here we would like to note that the input of goal generator network l contains only the state vectors and the control value, while the other goal generator networks m (1 ≤ m < l) take the state vectors, the control action, and the internal goal s_{m+1} as their inputs.
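As an illustration of this input wiring, the following sketch composes the input vector of a goal generator network at level m; the function and variable names are ours and are not taken from the paper's implementation.

```python
def goal_generator_inputs(state, action, internal_goals, m, l):
    """Input vector of goal generator network m (1 <= m <= l).

    The top network l sees only the state vector and the control value;
    every lower network m < l additionally receives the internal goal
    s_{m+1} produced by the network one level above.
    """
    x = list(state) + [action]
    if m < l:
        x.append(internal_goals[m + 1])  # s_{m+1} from the level above
    return x
```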

We first consider the top neural network l in the goal generator. The output of this network approximates the discounted total future reward. Specifically, it approximates R_l(t) at time instance t:

R_l(t) = r(t+1) + \alpha r(t+2) + \alpha^2 r(t+3) + \cdots   (1)

where R_l(t) is the future accumulative reward/cost-to-go value at time instance t, \alpha is the discount factor (0 < \alpha < 1) for the infinite-horizon Markov decision process (MDP), and r(t+1) is the external reinforcement signal value at time t+1. The goal generator networks m (1 ≤ m < l) approximate the total future internal reward/goal signal s_{m+1} from goal generator network m+1 above.

Since the signal s_l is to approximate the R_l expressed in (1), the error of this goal generator network can be defined as

e_{r_l}(t) = \alpha s_l(t) - [s_l(t-1) - r(t)]   (2)

and the objective function is defined as

E_{r_l}(t) = \frac{1}{2} e_{r_l}^2(t)   (3)

The weight tuning rule for goal generator network l is chosen as the gradient descent rule

\omega_{r_l}(t+1) = \omega_{r_l}(t) + \Delta\omega_{r_l}(t)   (4)

where

\Delta\omega_{r_l}(t) = l_{r_l} \left[ -\frac{\partial E_{r_l}(t)}{\partial \omega_{r_l}(t)} \right] = l_{r_l} \left[ -\frac{\partial E_{r_l}(t)}{\partial s_l(t)} \frac{\partial s_l(t)}{\partial \omega_{r_l}(t)} \right]   (5)
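For intuition, the error in (2) can be read as a temporal-difference residual of the recursion implied by (1); this reading is ours, but it follows directly from the definitions above. Shifting (1) back by one step gives

R_l(t-1) = r(t) + \alpha R_l(t),

so a signal s_l that tracks R_l should satisfy s_l(t-1) \approx r(t) + \alpha s_l(t); the error e_{r_l}(t) = \alpha s_l(t) - [s_l(t-1) - r(t)] in (2) measures exactly how far this condition is from holding.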

The goal generator network m (1 ≤ m < l) is designed to approximate R_m, defined in (6) in terms of s_{m+1}:

R_m(t) = s_{m+1}(t+1) + \alpha s_{m+1}(t+2) + \alpha^2 s_{m+1}(t+3) + \cdots   (6)

Therefore, the error function of goal generator network m can be defined as

e_{r_m}(t) = \alpha s_m(t) - [s_m(t-1) - s_{m+1}(t)],   (7)

and the objective function is defined as

E_{r_m}(t) = \frac{1}{2} e_{r_m}^2(t)   (8)

The weight tuning rule for goal generator network m is also chosen as the gradient descent rule

\omega_{r_m}(t+1) = \omega_{r_m}(t) + \Delta\omega_{r_m}(t)   (9)

where

\Delta\omega_{r_m}(t) = l_{r_m} \left[ -\frac{\partial E_{r_m}(t)}{\partial \omega_{r_m}(t)} \right] = l_{r_m} \left[ -\frac{\partial E_{r_m}(t)}{\partial s_m(t)} \frac{\partial s_m(t)}{\partial \omega_{r_m}(t)} \right]   (10)
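To make (2)-(5) and (7)-(10) concrete, the sketch below performs one gradient-descent step for a single goal generator network. It assumes a one-hidden-layer MLP with tanh units (the paper uses MLPs but does not state the activation functions in this section), and all class and variable names are ours.

```python
import numpy as np

class SimpleMLP:
    """One-hidden-layer perceptron; tanh activations are an assumption."""
    def __init__(self, n_in, n_hidden, rng):
        self.W1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in))
        self.W2 = rng.uniform(-0.5, 0.5, (1, n_hidden))

    def forward(self, x):
        self.x = np.asarray(x, dtype=float)
        self.h = np.tanh(self.W1 @ self.x)
        self.y = float(np.tanh(self.W2 @ self.h))
        return self.y

    def output_weight_grads(self):
        """d(output)/dW1 and d(output)/dW2 for the last forward pass."""
        dy = 1.0 - self.y ** 2                      # derivative of the output tanh
        dW2 = dy * self.h[None, :]
        dh = dy * self.W2.ravel() * (1.0 - self.h ** 2)
        dW1 = np.outer(dh, self.x)
        return dW1, dW2

def goal_generator_step(net, x_t, s_prev, target, lr, alpha=0.95):
    """One update following Eqs. (2)-(5) (top network) or (7)-(10) (lower networks).

    x_t    : network input at time t (state, action, and s_{m+1}(t) if m < l)
    s_prev : this network's output s(t-1) from the previous time step
    target : r(t) for the top network l, or s_{m+1}(t) for a lower network m
    """
    s_t = net.forward(x_t)
    e = alpha * s_t - (s_prev - target)             # Eq. (2) / (7)
    dW1, dW2 = net.output_weight_grads()
    # E = e^2/2, so dE/dw = e * alpha * ds(t)/dw; step along the negative gradient
    net.W1 -= lr * e * alpha * dW1
    net.W2 -= lr * e * alpha * dW2
    return s_t
```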

B. Learning and Adaptation in the Critic Network

Unlike the regular critic network in the typical HDP design of [14], the inputs of the critic network here include not only the system state vectors and the control action, but also the internal goal signal s_1. In this way, the total cost-to-go J is more closely associated with this informative goal/reinforcement signal than before. The error function of the critic network here is defined as follows:

e_c(t) = \alpha J(t) - [J(t-1) - s_1(t)],   (11)

and the objective function is defined as

E_c(t) = \frac{1}{2} e_c^2(t)   (12)

The weight updating rule for the critic network is chosen as the gradient descent rule

\omega_c(t+1) = \omega_c(t) + \Delta\omega_c(t)   (13)

where

\Delta\omega_c(t) = l_c \left[ -\frac{\partial E_c(t)}{\partial \omega_c(t)} \right] = l_c \left[ -\frac{\partial E_c(t)}{\partial J(t)} \frac{\partial J(t)}{\partial \omega_c(t)} \right]   (14)
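The critic update mirrors the goal generator update above, with the internal goal s_1(t) taking the place of the external reward as the target. Below is a minimal sketch of (11)-(14); the helper names and the numeric example are ours.

```python
def critic_error(J_t, J_prev, s1_t, alpha=0.95):
    """Temporal-difference-style critic error of Eq. (11)."""
    return alpha * J_t - (J_prev - s1_t)

def critic_weight_delta(e_c, dJ_dw, lr, alpha=0.95):
    """Eqs. (12)-(14): dEc/dw = e_c * alpha * dJ/dw; step along the negative gradient."""
    return -lr * e_c * alpha * dJ_dw

# Example with made-up values: alpha = 0.95, J(t) = 0.30, J(t-1) = 0.32, s1(t) = 0
# gives e_c = 0.95 * 0.30 - (0.32 - 0) = -0.035.
```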

C. Learning and Adaptation in the Action Network

The weight tuning in the action network is similar to that in [14]. The error between the J function and the desired ultimate objective U_c is defined in (15):

e_a(t) = J(t) - U_c(t)   (15)

The objective function for the action network is defined in (16):

E_a(t) = \frac{1}{2} e_a^2(t)   (16)


The weight tuning rule for the action network is chosen as the gradient descent rule

\omega_a(t+1) = \omega_a(t) + \Delta\omega_a(t)   (17)

where

\Delta\omega_a(t) = l_a \left[ -\frac{\partial E_a(t)}{\partial \omega_a(t)} \right] = l_a \left[ -\frac{\partial E_a(t)}{\partial J(t)} \frac{\partial J(t)}{\partial u(t)} \frac{\partial u(t)}{\partial \omega_a(t)} \right]   (18)
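Equation (18) needs the sensitivity of J with respect to the control u, which is obtained by backpropagating the critic output to its action input. The sketch below assumes the same one-hidden-layer tanh critic as in the earlier sketch, with the action located at input position u_index; this layout is our assumption, not the paper's stated implementation.

```python
import numpy as np

def dJ_du(critic_W1, critic_W2, x_c, u_index):
    """Backpropagate J through a one-hidden-layer tanh critic to its action input."""
    h = np.tanh(critic_W1 @ x_c)
    J = float(np.tanh(critic_W2 @ h))
    # chain rule: dJ/dx_c[k] = (1 - J^2) * sum_j W2[0, j] * (1 - h[j]^2) * W1[j, k]
    dJ_dh_pre = (1.0 - J ** 2) * critic_W2.ravel() * (1.0 - h ** 2)
    return float(dJ_dh_pre @ critic_W1[:, u_index])

def action_weight_delta(e_a, dJdu, du_dwa, lr):
    """Eqs. (15)-(18): Delta w_a = -l_a * e_a * (dJ/du) * (du/dw_a)."""
    return -lr * e_a * dJdu * du_dwa
```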

III. SIMULATION

In this paper, we evaluate our proposed algorithm in two configurations: one with only one goal generator network (i.e., the three-network architecture discussed in [9]), which we define as Algorithm1; the other with multiple (three) goal generator networks, which we define as Algorithm2. The motivation is to test these two algorithms against the ACD in [14] (without any goal generator network), which we define as Algorithm0. These three algorithms are tested and compared on the ball-and-beam balancing problem in the same simulation environment.

The ball-and-beam system is a popular laboratory model [25], [13] for testing various control approaches. The schematic of the system is presented in Fig. 3. The system consists of a long beam, which can be tilted by a servo or electric motor, together with a ball rolling back and forth on top of the beam. The drive is located at the center of the beam. The angle of the beam to the horizontal axis is measured by an incremental encoder, and the position of the ball can be obtained with cameras mounted on top of the system.

Fig. 3. The schematic of the ball-and-beam system simulated in this paper.

A. Problem Formulation

From [25], we can obtain the equations of motion from the Lagrange equations as follows:

\left(m + \frac{I_b}{r^2}\right)\ddot{x} + (mr^2 + I_b)\frac{1}{r}\ddot{\alpha} - m x \dot{\alpha}^2 = m g \sin\alpha   (19)

[m x^2 + I_b + I_\omega]\ddot{\alpha} + (2 m x \dot{x} + b l^2)\dot{\alpha} + K l^2 \alpha + (mr^2 + I_b)\frac{1}{r}\ddot{x} - m g x \cos\alpha = u l \cos\alpha   (20)

where
m : 0.0162 kg, the mass of the ball;
r : 0.02 m, the roll radius of the ball;
I_b : 4.32 × 10^{-5} kg·m^2, the moment of inertia of the ball;
b : 1 Ns/m, the friction coefficient of the drive mechanics;
l : 0.48 m, the radius of force application;
l_\omega : 0.5 m, the radius of the beam;
K : 0.001 N/m, the stiffness of the drive mechanics;
g : 9.8 N/kg, the gravitational acceleration;
I_\omega : 0.14025 kg·m^2, the moment of inertia of the beam;
u : the force of the drive mechanics.

In order to simplify the system model, we define x_1 = x as the position of the ball, x_2 = \dot{x} as the velocity of the ball, x_3 = \alpha as the angle of the beam with respect to the horizontal axis, and x_4 = \dot{\alpha} as the angular velocity of the beam. In this way, the system equations (19) and (20) can be transformed into the following form:

\left(m + \frac{I_b}{r^2}\right)\dot{x}_2 + (mr^2 + I_b)\frac{1}{r}\dot{x}_4 = m x_1 x_4^2 + m g \sin x_3   (21)

(mr^2 + I_b)\frac{1}{r}\dot{x}_2 + [m x_1^2 + I_b + I_\omega]\dot{x}_4 = (ul + m g x_1)\cos x_3 - (2 m x_2 x_1 + b l^2) x_4 - K l^2 x_3   (22)

We rewrite (21) and (22) in matrix notation:

\begin{bmatrix} A & B \\ C & D \end{bmatrix} \cdot \begin{bmatrix} \dot{x}_2 \\ \dot{x}_4 \end{bmatrix} = \begin{bmatrix} P \\ Q \end{bmatrix}   (23)

where

P = m x_1 x_4^2 + m g \sin x_3   (24)

Q = (ul + m g x_1)\cos x_3 - (2 m x_2 x_1 + b l^2) x_4 - K l^2 x_3   (25)

A = m + \frac{I_b}{r^2}   (26)

B = (mr^2 + I_b)\frac{1}{r}   (27)

C = (mr^2 + I_b)\frac{1}{r}   (28)

D = m x_1^2 + I_b + I_\omega   (29)

Therefore, we can obtain the general form of this problem as described in (30):

\begin{bmatrix} \dot{x}_2 \\ \dot{x}_4 \end{bmatrix} = \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} \begin{bmatrix} P \\ Q \end{bmatrix},   (30)

and the other two terms of the state vector can be expressed as \dot{x}_1 = x_2 and \dot{x}_3 = x_4, with the state vector X = [x_1, x_2, x_3, x_4].
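For reference, the model above translates directly into a state-derivative routine; the sketch below is our transcription of (21)-(30) with the physical constants listed earlier, plus a fixed-step Euler helper matching the 0.02 s step used in the simulations (Section III-B). Variable and function names are ours.

```python
import numpy as np

# Physical constants from the parameter list above
m, r, I_b = 0.0162, 0.02, 4.32e-5   # ball mass (kg), roll radius (m), ball inertia (kg m^2)
b, l, K = 1.0, 0.48, 0.001          # friction (Ns/m), force-application radius (m), stiffness (N/m)
g, I_w = 9.8, 0.14025               # gravity (N/kg), beam inertia (kg m^2)

def ball_beam_derivatives(X, u):
    """State derivatives per Eqs. (21)-(30); X = [x1, x2, x3, x4]."""
    x1, x2, x3, x4 = X
    A = m + I_b / r**2                                          # Eq. (26)
    B = C = (m * r**2 + I_b) / r                                # Eqs. (27)-(28)
    D = m * x1**2 + I_b + I_w                                   # Eq. (29)
    P = m * x1 * x4**2 + m * g * np.sin(x3)                     # Eq. (24)
    Q = ((u * l + m * g * x1) * np.cos(x3)
         - (2 * m * x2 * x1 + b * l**2) * x4 - K * l**2 * x3)   # Eq. (25)
    x2_dot, x4_dot = np.linalg.solve([[A, B], [C, D]], [P, Q])  # Eq. (30)
    return np.array([x2, x2_dot, x4, x4_dot])

def euler_step(X, u, dt=0.02):
    """One fixed-step Euler update (step size used in Section III-B)."""
    return np.asarray(X, dtype=float) + dt * ball_beam_derivatives(X, u)
```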


B. Experiment Configuration and Parameters

In the implementation, we employ the multi-layer perceptron (MLP) structure for all the neural networks in our design. As the control system has 4 state variables, we adopt an action network with a 4-6-1 structure (i.e., 4 input neurons, 6 hidden-layer neurons, and 1 output neuron) and a critic network with a 5-6-1 structure. The top goal generator network l has a 5-6-1 structure, while the other goal generator networks have a 6-6-1 structure. The parameters used in the experiment are summarized in Table I, with the notation defined as follows:

TABLE I
SUMMARY OF THE PARAMETERS USED IN THE SIMULATION

Para.   lc(0)   la(0)   lr(0)   lc(f)   la(f)   lr(f)
value   0.3     0.3     0.3     0.005   0.005   0.005

Para.   Nc      Na      Nr      Tc      Ta      Tr      α
value   80      100     50      0.05    0.005   0.05    0.95

α : discount factor;
lc(0) : initial learning rate of the critic network;
la(0) : initial learning rate of the action network;
lr(0) : initial learning rate of the goal generator network;
lc(t) : learning rate of the critic network at time t, which is decreased by 0.05 every 5 time steps until it reaches lc(f) and stays at that value thereafter;
la(t) : learning rate of the action network at time t, which is decreased by 0.05 every 5 time steps until it reaches la(f) and stays at that value thereafter;
lr(t) : learning rate of the goal generator network at time t, which is decreased by 0.05 every 5 time steps until it reaches lr(f) and stays at that value thereafter;
Nc : internal cycle of the critic network;
Na : internal cycle of the action network;
Nr : internal cycle of the goal generator network;
Tc : internal training error threshold for the critic network;
Ta : internal training error threshold for the action network;
Tr : internal training error threshold for the goal generator network.
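A small sketch of the learning-rate schedule just described: start from the initial value and subtract 0.05 every 5 time steps until the final value is reached. The function name and the clamping at the final value are our reading of the description above.

```python
def scheduled_learning_rate(t, lr_initial=0.3, lr_final=0.005, decay=0.05, period=5):
    """Learning rate at time step t for the schedule described above."""
    return max(lr_final, lr_initial - decay * (t // period))
```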

In the simulation, we keep the same learning rate, error threshold, and maximum internal iteration cycle number for all the goal generator networks. For the ACD approach in [14], we set the parameters to the same values as in this table, except for the terms belonging to the goal generator networks.

All simulation results presented in this experiment are based on 100 runs with random initial neural network weights. The initial conditions of the ball-and-beam system are set up as follows: the ball position (x1) and the angle of the beam with respect to the horizontal axis (x3) are uniformly distributed in the ranges [-0.2 m, 0.2 m] and [-0.15 rad, 0.15 rad], respectively, and the ball velocity (x2) and the angular velocity (x4) are set to zero. For a fair comparison, in each run we also set the neural network initial weights and the initial conditions of the ball-and-beam system to be the same for all three methods discussed here. The objective of the task is to keep balancing the ball on the beam for a certain period of time. Specifically, each run consists of a maximum of 1000 trials, and a trial is considered successful if it maintains the balance of the ball for 10,000 time steps (the ball remains on the beam and the angle of the beam with respect to the horizontal axis stays under the maximum value). In the simulations, the Euler integration method is used with a fixed step size of 0.02 s. The range of the beam is [-0.48 m, 0.48 m] and the range of the angle of the beam to the horizontal axis is [-0.24 rad, 0.24 rad]. The external reinforcement signal is set to "0" if the ball is on the beam and the angle of the beam to the horizontal axis is within this range; otherwise it is set to "-1", which means "failure," and a new trial is started.
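A sketch of the external reinforcement signal and the failure test just described; the function and argument names are ours.

```python
def external_reinforcement(x1, x3, beam_range=0.48, angle_range=0.24):
    """Return (r, failed): r = 0 while the ball stays on the beam and the beam
    angle stays within range, and r = -1 on failure (which ends the trial)."""
    failed = abs(x1) > beam_range or abs(x3) > angle_range
    return (-1.0 if failed else 0.0), failed
```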

C. Simulation Results and Analysis

Fig. 4 shows a typical trajectory of the position of the ball (x1) and the velocity of the ball (x2) in a successful run under the noise-free condition for Algorithm2. From this figure one can see that the ball starts at a random position and rolls back and forth at the early stage. As our system continues to learn to control the ball, the trajectory of x1 resembles a typical damped sinusoid, which converges as time goes by.


Fig. 4. The typical trajectory of x1 and x2 with Algorithm 2

The variation of x1 is also shown in Fig. 5, which indicates that the ball stays around the center point under proper control. Fig. 6 shows the angle of the beam with respect to the horizontal axis (x3) and the beam angular velocity (x4), which also clearly shows that the control system balances the ball quickly.

Fig. 7 shows the typical trajectories of the control action and the total cost-to-go, both of which indicate how the system learns to appropriately adjust the force to accomplish the balancing task with minimum cost. The internal goals s_3, s_2, and s_1, together with the external reward r in a typical successful run, are shown in Fig. 8, which shows that the internal goal signals s_1 - s_3 are damped sinusoidal signals rather than the constant zero value that r takes throughout this trial.



Fig. 5. The histogram of x1 in a typical successful run with Algorithm 2

Fig. 6. The typical trajectory of x3 and x4 with Algorithm 2

Once again, the zero value of r means the ball is on the beam and the angle of the beam to the horizontal axis is within the allowed range. Further observations indicate that the internal goal signals have different phases, which may suggest that the internal goals are trying to fit the total future cost and provide the networks below with a more refined goal representation. The variation of s_1 - s_3 is also presented in Fig. 9, indicating that there is some variance in the goal signals.

Table II reports the success rate, the average number of trials required to learn the balancing task, and the associated standard deviation for the three approaches tested in 100 random runs. Specifically, for the required average number of trials, we count the first successful trial number (i.e., 10,000 steps of balancing) in each run, and then take the average over the 100 random runs. In this table, the 1st column indicates the noise type under which the algorithms are tested; the 2nd column presents the statistical results of the successful runs with Algorithm0; the 3rd and 4th columns present the statistical results of the successful runs with our proposed Algorithm1 and Algorithm2, respectively.


Fig. 7. The typical trajectory of the control action and the total cost-to-go signal with Algorithm 2


Fig. 8. The typical trajectory of the internal goal signals with Algorithm 2

Under the noise-free condition, one can clearly see that both of our proposed approaches achieve a higher success rate with a lower average trial number and a lower standard deviation than the baseline Algorithm0, and Algorithm2 obtains better results than Algorithm1 in terms of the average number of trials and the standard deviation. We also present the boxplot of the required number of trials over 100 random runs with the three algorithms under the noise-free condition in Fig. 10. Here we perform an ANOVA analysis of the statistical results with Algorithm0, Algorithm1, and Algorithm2. We find that the average number of trials required to learn the balancing task with Algorithm2 is significantly different from that of Algorithm0/Algorithm1, with confidence levels of 99.99%/98.21% (i.e., p = 7.25e-5 / p = 0.0179), respectively.


TABLE II
SIMULATION RESULTS ON THE BALL-AND-BEAM BALANCING TASK. THE 1st COLUMN GIVES THE NOISE TYPE; THE 2nd COLUMN GROUP IS FOR Algorithm0, THE 3rd FOR Algorithm1, AND THE 4th FOR Algorithm2. THE NUMBER OF TRIALS AND THE STANDARD DEVIATION ARE CALCULATED OVER THE SUCCESSFUL RUNS

                    Algorithm0                        Algorithm1                        Algorithm2
Noise type          Success rate  # of trials  σ      Success rate  # of trials  σ      Success rate  # of trials  σ
Noise free          98%           43.4         71.9   100%          21.9         29.3   100%          13.5         20.0
Uniform 5% a.*      98%           46.5         59.2   98%           21.3         30.4   99%           17.6         44.8
Uniform 5% x.†      96%           65.3         113.7  100%          23.8         77.2   100%          16.2         18.4

σ : standard deviation
* a. : actuators are subject to the noise
† x. : position sensors are subject to the noise


Fig. 9. The histogram of the internal goals with Algorithm 2


Fig. 10. The boxplot of the required number of trials in 100 random runs with Algorithm 0, Algorithm 1 and Algorithm 2.

We add 5% uniform noise to the actuator (u) and to the sensor of the ball position (x1), respectively. When the actuator is subject to 5% uniform noise, one can see that our proposed Algorithm1 and Algorithm2 both obtain a lower average number of trials and a lower standard deviation than Algorithm0. Fig. 11 clearly shows that the control value (with Algorithm2) is no longer as smooth as that in Fig. 7.


Fig. 11. The typical trajectory of the control action with Algorithm 2 under 5% uniform noise on the actuator

Here we also perform an ANOVA analysis of the statistical results with Algorithm0 and Algorithm2. The results show that our proposed Algorithm2 obtains a significantly different average number of trials compared with Algorithm0, with 99.8% confidence (i.e., p = 0.002). When the sensor of the ball position is subject to 5% uniform noise, one can also see that our proposed Algorithm1 and Algorithm2 both obtain a lower average number of trials and a lower standard deviation than Algorithm0. Fig. 12 shows the typical trajectories of x1 and x2. The control task becomes more complicated since the observed state variable x1 is not as smooth as that in Fig. 4. Similarly, we perform an ANOVA analysis of the statistical results with Algorithm0 and Algorithm2 here. We obtain 99.99% confidence (i.e., p = 1.76e-5) that our proposed Algorithm2 achieves a statistically significant improvement compared with Algorithm0.

IV. CONCLUSION

We proposed a hierarchical goal generator network structure to cascade the external discrete reinforcement signal into continuous internal goal signals, which appears to be useful for improved performance. We also successfully tested our approach on the ball-and-beam balancing benchmark and compared it with alternatives.

Future work may include theoretical analysis of the results obtained in this paper, as well as experiments with other control problems and larger numbers of goal generator networks.


Fig. 12. The typical trajectory of the state vectors x1 and x2 with Algorithm 2 under 5% uniform noise on the sensor of the position of the ball

ACKNOWLEDGMENT

This work was supported by the National Science Foundation (NSF) under Grant ECCS 1053717, and the National Natural Science Foundation of China (NSFC) under Grants 60874043, 60921061, and 61034002. The authors are also grateful for the support from Toyota Research Institute NA.

REFERENCES

[1] P. J. Werbos, "Using ADP to understand and replicate brain intelligence: the next level design," in IEEE Int. Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 209–216, 2007.

[2] P. J. Werbos, "Intelligence in the brain: A theory of how it works and how to build it," Neural Networks, pp. 200–212, 2009.

[3] J. Si, A. G. Barto, W. B. Powell, and D. C. Wunsch, eds., Handbook of Learning and Approximate Dynamic Programming. IEEE Press, 2004.

[4] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley-Interscience, 2007.

[5] F. Y. Wang, N. Jin, D. Liu, and Q. Wei, "Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound," IEEE Transactions on Neural Networks, vol. 22, no. 1, pp. 24–36, 2011.

[6] J. Fu, H. He, and X. Zhou, "Adaptive learning and control for MIMO system based on adaptive dynamic programming," IEEE Transactions on Neural Networks, vol. 22, no. 7, pp. 1133–1148, 2011.

[7] W. Qiao, G. Venayagamoorthy, and R. Harley, "DHP-based wide-area coordinating control of a power system with a large wind farm and multiple FACTS devices," in Proc. IEEE Int. Conf. Neural Netw., pp. 2093–2098, 2007.

[8] S. Ray, G. K. Venayagamoorthy, B. Chaudhuri, and R. Majumder, "Comparison of adaptive critics and classical approaches based wide area controllers for a power system," IEEE Trans. on Syst., Man, Cybern., Part B, vol. 38, no. 4, pp. 1002–1007, 2008.

[9] H. He, Z. Ni, and J. Fu, "A three-network architecture for on-line learning and optimization based on adaptive dynamic programming," Neurocomputing, vol. 78, no. 1, pp. 3–13, 2012.

[10] F. Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: An introduction," IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 39–47, 2009.

[11] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits Syst. Mag., vol. 9, no. 3, pp. 32–50, 2009.

[12] D. V. Prokhorov and D. C. Wunsch, "Adaptive critic designs," IEEE Trans. on Neural Netw., vol. 8, no. 5, pp. 997–1007, 1997.

[13] P. H. Eaton, D. V. Prokhorov, and D. C. Wunsch II, "Neurocontroller alternatives for fuzzy ball-and-beam systems with nonuniform nonlinear friction," IEEE Trans. Neural Netw., vol. 11, no. 2, pp. 423–435, 2000.

[14] J. Si and Y. T. Wang, "On-line learning control by association and reinforcement," IEEE Trans. on Neural Netw., vol. 12, no. 2, pp. 264–276, 2001.

[15] J. Si, L. Yang, and D. Liu, Handbook of Learning and Approximate Dynamic Programming, ch. Direct Neural Dynamic Programming, pp. 125–151. IEEE Press, 2004.

[16] L. Yang, J. Si, K. S. Tsakalis, and A. A. Rodriguez, "Direct heuristic dynamic programming for nonlinear tracking control with filtered tracking error," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 6, pp. 1617–1622, 2009.

[17] H. G. Zhang, Q. L. Wei, and Y. H. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 38, no. 4, pp. 937–942, 2008.

[18] H. G. Zhang, Y. H. Luo, and D. Liu, "Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints," IEEE Transactions on Neural Networks, vol. 20, no. 9, pp. 1490–1503, 2009.

[19] H. G. Zhang, Q. L. Wei, and D. Liu, "An iterative approximate dynamic programming method to solve for a class of nonlinear zero-sum differential games," Automatica, vol. 47, no. 1, pp. 207–214, 2011.

[20] G. K. Venayagamoorthy, R. G. Harley, and D. C. Wunsch, "Dual heuristic programming excitation neurocontrol for generators in a multimachine power system," IEEE Trans. on Industry Applications, vol. 39, no. 2, pp. 382–394, 2003.

[21] G. K. Venayagamoorthy and R. G. Harley, Handbook of Learning and Approximate Dynamic Programming, ch. Application of Approximate Dynamic Programming in Power System Control, pp. 479–515. IEEE Press, 2004.

[22] D. V. Prokhorov, R. A. Santiago, and D. C. Wunsch, "Adaptive critic designs: A case study for neurocontrol," Neural Networks Letter, vol. 8, no. 9, pp. 1367–1372, 1995.

[23] D. V. Prokhorov, "Adaptive critic designs and their applications," Ph.D. dissertation, Texas Tech Univ., Lubbock, 1997.

[24] H. He and B. Liu, "A hierarchical learning architecture with multiple-goal representations based on adaptive dynamic programming," in IEEE International Conference on Networking, Sensing and Control (ICNSC 2010), 2010.

[25] T. L. Chien, C. C. Chen, Y. C. Huang, and W.-J. Lin, "Stability and almost disturbance decoupling analysis of nonlinear system subject to feedback linearization and feedforward neural network controller," IEEE Trans. Neural Netw., vol. 19, no. 7, pp. 1220–1230, 2008.