Flying High: Deep Imitation Learning of Optimal Control for Unmanned Aerial Vehicles

LUDVIG ERICSON

KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science
Degree Project in Computer Science and Engineering, Second Cycle, 30 HP
Stockholm, Sweden 2018

School of Electrical Engineering and Computer Science
KTH Royal Institute of Technology
114 40 Stockholm, Sweden

Degree Project in Computer Science and Communication

Flying High: Deep Imitation Learning of Optimal Control for Unmanned Aerial Vehicles

Ludvig Ericson

August 15, 2018

Supervisor: Patric Jensfelt
Examiner: Joakim Gustafsson


Abstract

Optimal control for multicopters is difficult in part due to the low processing power available, and the instability inherent to multicopters. Deep imitation learning is a method for approximating an expert control policy with a neural network, and has the potential of improving control for multicopters. We investigate the performance and reliability of deep imitation learning with trajectory optimization as the expert policy by first defining a dynamics model for multicopters and applying a trajectory optimization algorithm to it. Our investigation shows that network architecture plays an important role in the characteristics of both the learning process and the resulting control policy, and that in particular trajectory optimization can be leveraged to improve convergence times for imitation learning. Finally, we identify some limitations and future areas of study and development for the technology.

Sammanfattning

Optimal control for multicopters is a difficult problem, partly because of the typically low processing power of the onboard flight computer, and because multicopters are highly unstable systems. Deep imitation learning is a method in which a computationally heavy expert is approximated by a neural network, thereby making it possible to run such heavy experts as real-time controllers for multicopters. In this work, the performance and reliability of deep imitation learning with trajectory optimization as the expert are investigated by first defining a dynamics model for multicopters, then applying a well-known trajectory optimization method to this model, and finally approximating this expert with imitation learning. Our investigation shows that the network architecture plays a decisive role in the characteristics of both the learning process's convergence time and the resulting control policy, and that trajectory optimization in particular can be leveraged to improve the convergence time of the imitation learning. Finally, we point out some limitations of the method and identify particularly interesting areas for future study.


Contents

1. Introduction
   1.1. Problem Statement
   1.2. Outline
2. Background
   2.1. Multicopter Dynamics
   2.2. PID Controller for Multicopters
   2.3. Trajectory Optimization
        2.3.1. Differential Dynamic Programming
        2.3.2. Damping the Backward Pass
   2.4. Deep Imitation Learning
   2.5. Related Work
        2.5.1. Alternative Optimal Control Techniques
        2.5.2. Reinforcement Learning
        2.5.3. Behavioral Cloning
        2.5.4. Guided Policy Search and Related Methods
        2.5.5. Deep Learning Control via Risk-Aware Learning
3. Deep Imitation Learning of Optimal Trajectories
   3.1. Finding Optimal Trajectories
        3.1.1. Differentiation of Dynamics
        3.1.2. Differentiation of Loss
        3.1.3. Damping Schedule
        3.1.4. Cost Matrix Design
        3.1.5. Integration of Differential System of Equations
   3.2. Imitation Learning
        3.2.1. Network Architecture
        3.2.2. Learning Ahead and Strided Optimization
        3.2.3. Trajectory Continuation
4. Experimental Results
   4.1. Training Loss
   4.2. Network Architecture versus Strided Optimization
        4.2.1. Performance of Wide versus Narrow Networks
   4.3. Improving Performance
   4.4. Striding and Stacking
   4.5. Rollouts and Lengths
   4.6. Reproducibility
5. Discussion
   5.1. Deep Imitation Learning
        5.1.1. Network Architecture
        5.1.2. Training Time and Strided Optimization
   5.2. Trajectory Optimization
        5.2.1. The Expert Requires An Expert
        5.2.2. Reference Trajectory Considerations
        5.2.3. Alternative Optimization Methods
        5.2.4. Other Optimizer Improvements
   5.3. Societal and Ethical Aspects
        5.3.1. Sustainability
6. Conclusion
   6.1. Future Work
A. Thrust Force as Function of PWM Signal
B. Distributions of Trajectories
C. Analytical Jacobian of Loss Function


1. Introduction

Drones, and in particular multicopters, have lately become popular in a wide range of arenas, from recreational sports racing, to search-and-rescue missions in difficult terrain, to their increasing use as platforms for research. This rapid growth in popularity is fueled in part by the rise of inexpensive multicopters in consumer markets around the world, making them attractive for both industrial and research applications. Autonomous flight for multicopters is of interest to both these groups, and is necessary for some of the use cases of multicopters.

One of the challenges of multicopter flight is ensuring safe and stable flight behavior for these particularly unstable systems, and especially recovering from unforeseen events, like sudden gusts of wind or shifting cargo. Multicopter control has been the focus of recent research, in particular due to the advent of deep learning, leading to the application of neural networks as regulators for these dynamical systems, as in Andersson et al. (2017) and Levine and Koltun (2013).

A neural network-based approach enables additional improvements compared to the classical approaches. Online adaptation is one example: by teaching control conditioned on some parameterization of the dynamics, the controller should in theory be able to control optimally given the current circumstances. For example, it would be desirable for a package delivery drone to be able to adapt its control policy to its current load. By contrast, a linear feedback-based controller would need to be specifically engineered to handle such a scenario ahead of time, whereas a neural network could in principle learn to account for arbitrary changes to the vehicle dynamics.

Trajectory optimization methods are a family of methods for finding the optimal course of actions given some start and goal state. They typically perform substantially better than a real-time controller could, but have the drawback of requiring offline computation due to their computational cost. It is therefore attractive to approximate trajectory optimization methods in real time. One such approach to combining trajectory optimization with real-time control is imitation learning, where an agent, e.g. an artificial neural network in the case of deep imitation learning, is taught to mimic the actions of an expert, in this case the trajectory optimizer. Imitation learning is in a general sense applicable to a wide array of domains, and knowledge of its requisites and limitations is therefore useful in a broader sense. However, previous work on imitation learning does not consider its properties as a deep learning method; this degree project therefore investigates the properties of the deep imitation learning process under various conditions.


1.1. Problem Statement

In this work we aim to investigate the strengths and weaknesses of deep imitation learning as a means of controlling multicopter aircraft, and quadrotors in particular. We will study the relationship between deep learning and imitation learning of a trajectory optimizer in the multicopter control setting. To this end, we seek to answer the following questions:

• How do network architecture and parameters impact the ability to imitate the expert and the controller's behavior?

• Is it possible to improve learning convergence by taking advantage of having optimized trajectories as opposed to single states?

• What can be said of the performance and reliability of the learned control policies in general, and for particular choices of architecture and parameters?

1.2. Outline

Chapter 2 defines the dynamics of multicopter flight as a rigid-body physics model and presents a baseline control method, a trajectory optimization scheme, and the imitation learning technique used in this work. Our method and its particularities are presented in chapter 3, and its experimental results in chapter 4. Finally, we discuss our method and its results from a broader perspective in chapter 5, followed by a conclusion and directions for future work in chapter 6.

The relationship of this work's contribution to the broader computer science research community is seen primarily in the use of machine learning as a tool for digital regulation, and in the use of numerically heavy methods for the training of our neural networks as well as for the trajectory optimization.


2. Background

This chapter defines the motion model for multicopters used in this work; a linear feedback regulator, as a baseline both for optimization and comparison; the trajectory optimization technique DDP, which is what we aim to imitate; and the deep imitation learning technique used in our method. Finally, we present the relevant related work in the field.

2.1. Multicopter Dynamics

Multicopters are a class of rotorcraft in which lift is generated by pushing air down through two or more propellers. These flying machines are inherently unstable, and require constant regulation to stay in the air.

Multicopters come in many configurations, with a number of different design parameters. The most suitable choice of these depends on the application, intended load, flight time, maneuverability, etc. Perhaps the most important characteristic of a multicopter is its motors, and in particular their number. For this reason, multicopters are subcategorized by the number of motors, e.g. four-motor multicopters are referred to as quadrotors or quadcopters (illustrated in fig. 2.1). Another important characteristic is the frame size, given as the diameter encircling the motor centers, illustrated as the outer light gray circle of fig. 2.1.

The discrete-time dynamics function f is of central concern to this work, here defined as a time-invariant function of the state x_t and control signal u_t,

x_{t+1} = f(x_t, u_t).

We model the multicopter as a rigid body with thrust forces emanating from the motor centers along their axes, as indicated by the four red arrows in fig. 2.1. The rigid body's state x is modelled as a second-order integrator, with a position r = (r_x, r_y, r_z)^T and orientation q = (q_i, q_j, q_k, q_r)^T as a quaternion, and their derivatives, linear velocity v = (v_x, v_y, v_z)^T and angular velocity ω = (ω_x, ω_y, ω_z)^T. The full state is

x = (r, q, v, ω)^T.

Note that though the orientation q is a quaternion, the angular velocity ω is not. The control signal u is the angular speed of each motor, and only affects the derivatives v and ω.

The dynamics f are here described as a system of differential equations, i.e. in continuous time, and later discretized as appropriate. Position r is the integral of velocity v,

\dot{r} = v.


Figure 2.1. A four-armed multicopter, or quadcopter, here depicted in an X configuration, with its three body axes (shown in red, green, and blue) offset from the arms by 45°. The motors are numbered as indicated by w_i, odd numbers rotating clockwise and even numbers rotating counter-clockwise, as indicated by the arrows along the motor perimeters.

Orientation q is an integrator of the angular velocity ω and, as shown in Boyle (2017),

2\dot{q} = q\omega.

Here qω is a quaternion-vector multiplication, defined as the complex part of the Hamilton product qω = (q_i, q_j, q_k, q_r)(ω_x, ω_y, ω_z, 0), i.e. the first three components of the resulting quaternion.

As discussed in appendix A, we fit the thrust F_i of the i-th motor with a polynomial F_i = a w_i^2 + b w_i + c = a (k_{w\max} u_i)^2 + b k_{w\max} u_i + c. The net force acting on the rigid body is the sum of thrust minus gravity and drag, in the inertial frame, so

m\dot{v} = m g - k_{Fd}\, v + \sum_i q F_i,

where m is the mass, g = (0, 0, -g)^T, k_{Fd} is the drag coefficient, and F_i = (0, 0, F_i)^T is

the force along the motor axis. Torque and angular velocity are in the body frame, with each motor torque τ_i given by the cross product of the thrust force and its displacement k_{r_i} from the center of the frame, plus a part about the motor axis in the motor direction k_{d_i} = (-1)^i with proportionality k_{\tau z}. Summing the terms, we have τ_i = k_{r_i} × F_i + k_{d_i} k_{\tau z} F_i. Finally, the total torque is the sum of the motor torques minus angular drag,

I\dot{\omega} = -k_{\tau d}\, \omega + \sum_i \tau_i,

where I is the moment of inertia as a scalar.
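To make the model concrete, the following Python sketch evaluates the continuous-time state derivative described above. The numeric parameter values, the flat 13-element state layout, and the helper names are illustrative assumptions only; the thesis's actual implementation (Python with Numba, see section 3.1) is not reproduced here.

import numpy as np

# Illustrative parameter values; the thesis does not list them, so these are
# assumptions for the sketch only.
m, g = 1.0, 9.81            # mass [kg], gravitational acceleration [m/s^2]
k_fd, k_td = 0.1, 0.01      # linear and angular drag coefficients
I = 0.01                    # scalar moment of inertia
k_tz = 0.02                 # yaw torque proportionality
k_wmax = 1000.0             # maximum motor speed [rad/s]
a, b, c = 1e-5, 1e-4, 0.0   # thrust polynomial coefficients (appendix A)
# Motor positions (X configuration) and spin directions k_di = (-1)^i.
k_r = 0.1 * np.array([[1, 1, 0], [-1, 1, 0], [-1, -1, 0], [1, -1, 0]])
k_d = np.array([(-1) ** i for i in range(4)])

def rotation_matrix(q):
    """Rotation matrix of the unit quaternion q = (qi, qj, qk, qr)."""
    qi, qj, qk, qr = q
    return np.array([
        [1 - 2*(qj*qj + qk*qk), 2*(qi*qj - qk*qr),     2*(qi*qk + qj*qr)],
        [2*(qi*qj + qk*qr),     1 - 2*(qi*qi + qk*qk), 2*(qj*qk - qi*qr)],
        [2*(qi*qk - qj*qr),     2*(qj*qk + qi*qr),     1 - 2*(qi*qi + qj*qj)]])

def dynamics(x, u):
    """Continuous-time derivative of the 13-dimensional state x = (r, q, v, w)."""
    r, q, v, w = x[0:3], x[3:7], x[7:10], x[10:13]
    wi = k_wmax * u                      # motor speeds from control signals in [0, 1]
    F = a * wi**2 + b * wi + c           # per-motor thrust
    # Position integrates velocity; orientation integrates angular velocity.
    r_dot = v
    qi, qj, qk, qr = q
    q_dot = 0.5 * np.array([ qr*w[0] + qj*w[2] - qk*w[1],
                             qr*w[1] + qk*w[0] - qi*w[2],
                             qr*w[2] + qi*w[1] - qj*w[0],
                            -qi*w[0] - qj*w[1] - qk*w[2]])
    # Net force: total thrust along the body Z axis rotated to the inertial
    # frame, minus gravity and linear drag.
    R = rotation_matrix(q)
    v_dot = np.array([0.0, 0.0, -g]) + (R @ np.array([0.0, 0.0, F.sum()]) - k_fd * v) / m
    # Net torque: motor torques plus yaw terms, minus angular drag (body frame).
    tau = sum(np.cross(k_r[i], [0.0, 0.0, F[i]]) + k_d[i] * k_tz * np.array([0.0, 0.0, F[i]])
              for i in range(4))
    w_dot = (tau - k_td * w) / I
    return np.concatenate([r_dot, q_dot, v_dot, w_dot])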


Figure 2.2. Overview of the PX4 regulator cascade. Hats indicate reference variables, and are typically the output of a parent regulator. To reduce clutter, we have not drawn edges from the UAV dynamics to the process variables r, v, q, and ω.

2.2. PID Controller for Multicopters

One simple approach to multicopter control is to build a hierarchical set of PID controllers. Studying such a controller is useful in a broader sense for this degree project, not only for the understanding it imparts of the difficulties of multicopter control, but also because it is necessary as an initialization for the methods used later in this degree project. In particular, we will construct a replica of the regulator cascade in Meier et al. (2015).

The regulator cascade goes from the user-supplied reference position r̂, from which it computes a reference velocity v̂, then a reference orientation q̂ and thrust F̂, then a reference angular velocity ω̂, and finally a reference torque T̂, as illustrated in fig. 2.2. After the regulator cascade, a mixer decides how the resulting thrust and torque translate to PWM control signals for each motor. This is accomplished by inversion of linearized forward dynamics through column-wise construction of the two matrices

A_T = C_T \begin{pmatrix} k_{r_i} \times k_a & \cdots \end{pmatrix}
\quad\text{and}\quad
A_F = C_F \begin{pmatrix} k_a & \cdots \end{pmatrix},

where k_a = (0, 0, 1)^T is the motor axis, and C_T and C_F scale the torque and force from each motor; in other words, they are linear transformations from control signal to torque and force. Stacking the two matrices row-wise into A = (A_T, A_F) and computing its pseudo-inverse B = A^† gives us a linear transformation from torque and force to control signal, i.e. u = B(T̂, F̂ k_a)^T. T̂ and F̂ are as shown in fig. 2.2; the matrix B is represented as the mixer.
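The following sketch shows the column-wise construction of A_T and A_F and the pseudo-inverse mixer B in NumPy. The arm vectors k_r, the scalar scale factors C_T and C_F, and the example reference values are assumptions for illustration; the thesis defines only the structure of the matrices.

import numpy as np

k_a = np.array([0.0, 0.0, 1.0])                                   # motor axis
k_r = 0.1 * np.array([[1, 1, 0], [-1, 1, 0], [-1, -1, 0], [1, -1, 0]])
C_T, C_F = 1.0, 1.0

# Column-wise construction: one column per motor. Note that with these columns
# the yaw row of A_T is zero; the thesis's C_T presumably accounts for the yaw
# torque term, which is omitted in this sketch.
A_T = C_T * np.stack([np.cross(k_r[i], k_a) for i in range(4)], axis=1)   # 3x4
A_F = C_F * np.stack([k_a] * 4, axis=1)                                   # 3x4

# Stack row-wise and take the pseudo-inverse: B maps (torque, force) -> u.
A = np.vstack([A_T, A_F])          # 6x4
B = np.linalg.pinv(A)              # 4x6

torque_ref = np.array([0.01, -0.02, 0.0])   # example T-hat from the rate controller
thrust_ref = 0.5 * k_a                      # example F-hat along the motor axis
u = B @ np.concatenate([torque_ref, thrust_ref])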


2.3. Trajectory Optimization

Optimal control is the process of computing controls, a policy, that satisfies some optimality criterion, typically minimizing a cost function. In a digital control setting such as in this work, we deal with discretized time as noted in the previous section, with x_{t+1} = f(x_t, u_t). A subcategory of optimal control concerns linear time-invariant systems, where the dynamics are linear and independent of time, i.e.

x_{t+1} = f(x_t, u_t) = A x_t + B u_t,

and the cost function is the sum of a quadratic loss function of state and action along the trajectory,

J \triangleq \sum_{t=1}^{\infty} l(x_t, u_t), \qquad l(x_t, u_t) \triangleq x_t^T Q x_t + u_t^T R u_t. \quad (2.1)

This is known as the linear-quadratic control problem, and its solution is a linear-quadratic regulator. In the interest of brevity, we will not restate the solution here, but refer to Laub (1979) and the myriad books and reports written on the topic. It is often necessary to linearize the system dynamics f by setting A to the Jacobian of f with respect to the state x_t and B to the Jacobian of f with respect to the action u_t.
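As a small worked example of the linear-quadratic problem above, the following sketch solves the infinite-horizon discrete LQR for a double integrator using SciPy. The system and cost matrices are illustrative and unrelated to the multicopter model.

import numpy as np
from scipy.linalg import solve_discrete_are

dt = 1 / 400
A = np.array([[1.0, dt], [0.0, 1.0]])    # double-integrator dynamics
B = np.array([[0.0], [dt]])
Q = np.diag([1.0, 0.1])                  # state cost
R = np.array([[0.01]])                   # action cost

# Solve the discrete algebraic Riccati equation and form the feedback gain.
P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

x = np.array([1.0, 0.0])                 # initial state
u = -K @ x                               # linear-quadratic regulator action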

2.3.1. Differential Dynamic Programming

The trajectory optimization method used in this work is differential dynamic programming, or DDP, which can be viewed as a form of Gauss-Newton optimization cast in the context of trajectory optimization with a finite time horizon. In particular, we use a variant of DDP known as iLQG, the iterative linear-quadratic Gaussian regulator. The key difference to the original DDP formulation is that iLQG assumes linear dynamics instead of quadratic dynamics, and therefore does not have the same convergence guarantees as DDP (Tassa et al., 2012).

DDP is a form of dynamic programming, solving a problem by recursively breaking it into subproblems until some trivial base case is reached. From there, by induction, the subproblems are recombined into larger solved problems until the original problem is solved. For DDP, the original problem is to find the controls that minimize the cost from time t to T, with the subproblem of minimizing the cost from t + 1 to T, and the base case of minimizing the cost at T.

Let J_t be the cost-to-go from t to T. The optimal value function V at time t is

V_t(x) \triangleq \min_{u_{i:T}} J_t(x, u_{i:T}),

where u_{i:T} is the set of actions from time i to T. The base case is then the cost of the final state,

V_T(x) = l_f(x_T),


where l_f is the cost of the final state, typically a scaled-up l (see eq. (2.1)). The induction is accomplished by finding the optimal action-value function Q_t(x, u),

V_t(x) = \min_u Q_t(x, u), \qquad Q_t(x, u) \triangleq l(x, u) + V_{t+1}(f(x, u)). \quad (2.2)

Let us for brevity's sake drop the time subscript and use subscript derivative notation, i.e. Q_{xx} is the Hessian of Q with respect to x. DDP uses a second-order Taylor approximation of Q(x, u) around a reference trajectory,

\delta Q(\delta x, \delta u) \triangleq
\begin{pmatrix} 1 \\ \delta x \\ \delta u \end{pmatrix}^T
\begin{pmatrix} 0 & Q_x & Q_u \\ Q_x & Q_{xx} & Q_{xu} \\ Q_u & Q_{ux} & Q_{uu} \end{pmatrix}
\begin{pmatrix} 1 \\ \delta x \\ \delta u \end{pmatrix},

and let V' be the value function at the following time step; then the pertinent submatrices are

Q_x = l_x + V'_x f_x, \qquad Q_u = l_u + V'_x f_u,
Q_{xx} = l_{xx} + f_x^T V'_{xx} f_x, \qquad Q_{uu} = l_{uu} + f_u^T V'_{xx} f_u,
Q_{xu} = l_{xu} + f_x^T V'_{xx} f_u, \qquad Q_{ux} = l_{ux} + f_u^T V'_{xx} f_x,

found by differentiation of Q in eq. (2.2). The minimum δu* of the quadratic approximation is obtained by taking the derivative of δQ with respect to δu and setting it to zero,

\delta u^* = \arg\min_{\delta u} \delta Q(\delta x, \delta u) = -Q_{uu}^{-1}(Q_u + Q_{ux}\,\delta x).

Writing δu* as a linear system

\delta u^* \triangleq K\,\delta x + k, \qquad\text{then}\qquad k = -Q_{uu}^{-1} Q_u \quad\text{and}\quad K = -Q_{uu}^{-1} Q_{ux}.

The value function gradient is then propagated backwards in time with

V_x = Q_x + Q_{xu}\, k,
V_{xx} = Q_{xx} + Q_{xu}\, K.

The algorithm proceeds in this manner up to the initial state, obtaining T instances of k and K. k is the open-loop term, independent of the state, and K the gain matrix, correcting for deviations in state and thus forming a local closed-loop controller. This constitutes the backward pass of the algorithm, after which the algorithm applies the


controls in a forward pass, yielding a new trajectory \{\hat{x}_t, \hat{u}_t\}_{t=0}^T from the current reference trajectory \{x_t, u_t\}_{t=0}^T by

\hat{x}_0 = x_0,
\hat{u}_t = u_t + \delta u_t^* = u_t + k_t + K_t(\hat{x}_t - x_t),
\hat{x}_{t+1} = f(\hat{x}_t, \hat{u}_t).

The process then begins anew on the updated trajectory until convergence. Care must be taken to ascertain that there actually is an improvement, since the approximations of dynamics and cost function only hold near the reference trajectory.
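The backward and forward passes described above can be summarized in the following Python sketch. The per-timestep derivative arrays, the function signatures, and the simple additive damping of Q_uu are assumptions made for this sketch; the implementation in this work instead uses the eigenvalue-based damping of section 2.3.2 and the schedule of section 3.1.3.

import numpy as np

def ilqg_backward_pass(fx, fu, lx, lu, lxx, luu, lux, Vx, Vxx, eps=1e-6):
    """One backward pass of the iLQG recursion described above.

    fx, fu, lx, ... are sequences of per-timestep derivatives of the dynamics
    and loss along the reference trajectory; Vx, Vxx are the derivatives of the
    final-state cost. Returns the open-loop terms k_t and gain matrices K_t.
    """
    T = len(fx)
    ks, Ks = [None] * T, [None] * T
    for t in reversed(range(T)):
        Qx = lx[t] + fx[t].T @ Vx
        Qu = lu[t] + fu[t].T @ Vx
        Qxx = lxx[t] + fx[t].T @ Vxx @ fx[t]
        Quu = luu[t] + fu[t].T @ Vxx @ fu[t]
        Qux = lux[t] + fu[t].T @ Vxx @ fx[t]
        Quu_inv = np.linalg.inv(Quu + eps * np.eye(Quu.shape[0]))
        k = -Quu_inv @ Qu
        K = -Quu_inv @ Qux
        ks[t], Ks[t] = k, K
        # Propagate the value function backwards in time.
        Vx = Qx + Qux.T @ k
        Vxx = Qxx + Qux.T @ K
    return ks, Ks

def ilqg_forward_pass(f, x_ref, u_ref, ks, Ks):
    """Roll out the corrected controls from the reference trajectory."""
    x = x_ref[0]
    xs, us = [x], []
    for t in range(len(u_ref)):
        u = u_ref[t] + ks[t] + Ks[t] @ (x - x_ref[t])
        x = f(x, u)
        us.append(u)
        xs.append(x)
    return xs, us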

2.3.2. Damping the Backward Pass

A common problem is that of the vanishing gradient, where Q_{uu} sometimes shrinks to near zero, making the inverse difficult to compute because of numerical instability. To remedy this, an approach similar to Levenberg (1944) is taken, where the inversion of Q_{uu} is regularized by the addition of a scalar ε to its eigenvalues. Let the real, symmetric, diagonalizable matrix A have the eigendecomposition A = QΛQ^T, where Q holds the eigenvectors as columns and Λ the eigenvalues as a diagonal matrix; then

A^{-1} = Q\,\Lambda^{-1} Q^T,

and with regularization,

A^{-1} = Q\,(\Lambda + \epsilon I)^{-1} Q^T.
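In code, this eigendecomposition-based regularization amounts to the following NumPy sketch; the damping value in the usage line is only an example.

import numpy as np

def damped_inverse(A, eps):
    """Regularized inverse of a real symmetric matrix: shift its eigenvalues
    by eps before inverting, as described above."""
    lam, Q = np.linalg.eigh(A)                      # A = Q diag(lam) Q^T
    return Q @ np.diag(1.0 / (lam + eps)) @ Q.T

# Example usage inside the backward pass:
# Quu_inv = damped_inverse(Quu, eps=2.0 ** -5)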

2.4. Deep Imitation Learning

In their review of imitation learning techniques, Attia and Dayan (2018) define imitation learning as the task of learning some policy u = π(x) from imitation of an expert policy û = π̂(x), in our case DDP as described in section 2.3.1.

Attia and Dayan (2018) claim that one of the most successful and straightforward current approaches to imitation learning is DAgger due to Ross et al. (2011), short for dataset aggregation. While Ross et al. (2011) chose linear regression and support vector machines as the function approximator, we will instead use a deep neural network for policy approximation, hence the name deep imitation learning. The algorithm proceeds as follows:

1. Let the dataset D_0 = {x_t, û_t}_{t=0}^T be the states and actions of a trajectory from the expert, started from some random initial state.

2. Fit the policy π_i to the dataset D_i, where i is the current iteration.

3. Sample a trajectory from the agent, {x_t, u_t}_{t=0}^T, using π_i.


4. Compute the expert action û_t at each visited state x_t.

5. Aggregate the agent states and expert actions into D_{i+1} = D_i ∪ {x_t, û_t}_{t=0}^T.

6. Begin the next iteration from step 2.

The process of rolling out and finding expert actions is illustrated in fig. 2.3. Only the very first action of the optimized trajectory is used for training.
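The numbered steps above can be written out as the following Python sketch. All of the callables are placeholders for components described elsewhere in this work (the DDP expert, the network training, and the rollout); the variant used in this work additionally applies striding, stacking, and learning ahead (section 3.2).

def dagger(expert_rollout, expert_action, fit_policy, agent_rollout,
           initial_state, n_iters):
    """Sketch of the DAgger loop described above.

    expert_rollout(x0) returns an expert trajectory as (states, actions)
    (step 1), fit_policy(X, U) trains a policy on the aggregated data (step 2),
    agent_rollout(policy, x0) returns the states the agent visits (step 3), and
    expert_action(x) is the DDP expert queried at a single state (step 4).
    """
    X, U = expert_rollout(initial_state())              # step 1: dataset D0
    X, U = list(X), list(U)
    policy = None
    for _ in range(n_iters):
        policy = fit_policy(X, U)                        # step 2
        states = agent_rollout(policy, initial_state())  # step 3
        X.extend(states)                                 # step 5: aggregate states
        U.extend(expert_action(x) for x in states)       # ...and expert actions (step 4)
    return policy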

The advantage of DAgger over other methods lies in the tight coupling between the current policy π_i and the expert's actions: the agent is taught what action to take in the states that it itself visits through its policy. By contrast, a shortcoming of other methods is that they speculatively teach the agent how to act, often based on properties of the optimization, as will be shown in section 2.5.

DAgger is a form of online learning, where the sequence of policies π_0, π_1, ..., π_N is asymptotically no-regret, as shown by Ross et al. (2011). Regret is a measure of how much greater a policy's loss is than that of the optimal policy in its class, i.e. the optimal neural network for our purposes. Thus no regret means that the policy asymptotically becomes the best approximation of the expert (Shalev-Shwartz et al., 2012).

Figure 2.3. The agent's rolled-out trajectory {x_t, u_t}_{t=0}^T from the policy π_i, and the expert action û_t for every visited state. The visited states and expert actions {x_t, û_t}_{t=0}^T are added to the aggregated dataset that is used to train the next agent π_{i+1}. The expert's next state x̂_t is discarded.

2.5. Related Work

Imitation learning and control of multicopters have both been the subject of prior academic research. In this section, we will restate pertinent work and relate this degree project to these works.

2.5.1. Alternative Optimal Control Techniques

Though not directly related to this work, it is worth noting that a majority of the effort in the field of UAV control is spent on direct methods, often aiming at making trajectory optimization sufficiently fast and robust to run in real time, as in the popular model predictive


control paradigm.¹ Model predictive control is characterized by iteratively solving a finite-time-horizon trajectory optimization problem and using the result as a method of control.

¹In fact, DDP, the optimization technique used in this work, is sometimes used as a model predictive controller.

Mellinger and Kumar (2011) present a method of generating trajectories by minimizing the trajectory's snap, i.e. the second time derivative of acceleration, with interpolation of piecewise polynomial functions. Using their method, the authors are able to perform impressive aerial acrobatics.

Hehn and D'Andrea (2015) decouple the state dimensions and consider each dimension as a separate optimization problem, allowing the computation of near-optimal trajectories in real time. The authors demonstrate their method by cooperatively coordinating multiple drones to build a tower out of small blocks of foam, showing rapid replanning and coordination abilities.

2.5.2. Reinforcement Learning

Reinforcement learning is an area of research that is presently seeing tremendous interest from both the academic community and hobbyists, perhaps due to its alluring promise of something-from-nothing artificial intelligence. Reinforcement learning is similar in its formulation to DDP above, but assumes V and Q are unknown. Instead, the agent is given some reward signal after taking an action, and the goal is to learn what actions produce the best long-term reward.

While reinforcement learning methods are interesting, they suffer from two drawbacks that make them less desirable in the context of this work: first, there is in general no guarantee that the learned policy in reinforcement learning is optimal, and second, reinforcement learning algorithms for continuous action spaces are notoriously unstable, made worse by the highly unstable multicopter.

2.5.3. Behavioral Cloning

One of the earlier works in modern imitation learning is Anderson et al. (2000), with behavioral cloning. Their work is based on modular neural networks, where a set of agent networks is modulated by a larger gating network. Each agent network is designed as was typical for the era: very few units (two hidden units per expert) and sigmoidal activations. The key insight in their work that stayed with the field is that it is better to train an agent in states where it makes mistakes, and especially grave mistakes.

This is leveraged by Laskey et al. (2017), where a more modern neural network is trained to mimic the expert directly. The expert is then subjected to noise, and this noise is modelled on the errors from the agent. In this way, some degree of robustness is achieved by sampling vicinities of the expert trajectories. Laskey et al. (2016) refer to this as "human centric" learning rather than "robot centric"; in this work, we will concern ourselves with robot centric learning according to this parlance. The authors also go on to argue that human centric learning, in the setting of deep neural networks,



is guaranteed to converge on an optimal policy. A critique of this claim is that a human-taught policy will only ever be as optimal as its human teacher, and humans are in general not optimal controllers.

2.5.4. Guided Policy Search and Related Methods (Levine and Koltun, 2013; Zhang et al., 2016)

There are multiple ways to remedy the aforementioned gripes with reinforcement learning. Levine and Koltun (2013) propose guided policy search, GPS, which is a fairly similar approach to what is presented in this degree project. The major difference is in the interface between trajectory optimization and deep learning; rather than optimizing single trajectories as in our method, Levine and Koltun compute the open-loop controls k and gain matrices K as in section 2.3.1, then sample the action from N(u, -Q_{uu}^{-1}). Multiple trajectories are then sampled from the same DDP solution, allowing more training samples to be produced for the learning process.

However, the agent is not able to influence the states visited, and is therefore not guaranteed to get training samples for states that it would wind up in had the agent itself been in control. To circumvent this problem, Levine and Koltun augment the DDP loss function to penalize actions that are unlikely for the agent to take, and thus create trajectories that are slightly optimized versions of the agent's own trajectories.

There have been several works published on variations of GPS; notably, Zhang et al. (2016) formulate GPS in the context of autonomous UAVs. The motivation is very much the same as in this work: current optimal control methods are typically expensive, especially in the context of lightweight rotorcraft such as multicopters; approximation with deep learning methods should therefore lower that computational cost and enable more sophisticated control. Zhang et al. (2016) apply their method using partial observations, alleviating the need for state estimation, or at least reducing reliance on its fidelity.

2.5.5. Deep Learning Quadcopter Control via Risk-Aware Active Learning (Andersson et al., 2017)

Perhaps the most similar work is due to Andersson et al. (2017), combining DAgger with deep learning and applying it to quadrotor control. Their variant of DAgger applies sample weighting to improve learning of actions for states that are "risky." The authors imply that the optimization can be co-opted to decide on what states are risky, though they leave this largely unspecified. Andersson et al. apply their method to a real-world system, specifically a micro-aerial vehicle with a load capacity of about 30 grams. They perform control offboard, sending motor commands over radio. One imagines that this would cause considerable lag in the control, though no such lag is mentioned in the work. Their work also leaves the imitated expert policy largely unspecified, and there is little effort spent on exploring the properties of the imitation learning itself. By contrast, this degree project focuses on improving the learning convergence rate, reliability, and performance of the learned policy.


3. Deep Imitation Learning of Optimal Trajectories

The algorithm used in this work is a variant of DAgger, as in section 2.4, imitating a DDP-based trajectory optimizer, as in section 2.3.

3.1. Finding Optimal Trajectories

Our implementation of DDP is written in the high-level programming language Python, and uses the just-in-time compiler Numba (Lam et al., 2015) to reduce execution time by at least one order of magnitude. The initial reference trajectory is taken from the PID controller described in section 2.2, executed at 400 Hz.

As described in section 2.3.1, DDP requires the Jacobian matrices of the dynamics with respect to state and action, as well as the Jacobians and Hessians of the loss function with respect to state and action. Let us therefore define these eight matrices f_x, f_u, l_x, l_u, l_xx, l_xu, l_uu, and l_ux.

3.1.1. Differentiation of Dynamics

Though it is possible to find the Jacobians f_x and f_u through numerical approximation methods, an analytical alternative has many advantages, in particular vis-à-vis numerical stability and performance. To that end, we employed symbolic differentiation software to compute these from one-step Euler integration with x_{t+1} = x_t + \Delta t\,\dot{x}. Let us first break this 13-by-13 matrix into submatrices and define them in turn,

f_x = \frac{dx_{t+1}}{dx_t} = I + \Delta t
\begin{pmatrix}
0 & 0 & I & 0 \\
0 & q_q & 0 & q_\omega \\
0 & v_q & -k_{Fd} I & 0 \\
0 & \omega_q & \omega_v & -k_{\tau d} I
\end{pmatrix}.

We first direct our attention to the derivatives of orientation q with respect to itself,

q_q = \frac{1}{2}
\begin{pmatrix}
0 & \omega_z & -\omega_y & \omega_x \\
-\omega_z & 0 & \omega_x & \omega_y \\
\omega_y & -\omega_x & 0 & \omega_z \\
-\omega_x & -\omega_y & -\omega_z & 0
\end{pmatrix},


and then orientation with respect to angular velocity ω,

q_\omega = \frac{1}{2}
\begin{pmatrix}
q_r & -q_k & q_j \\
q_k & q_r & -q_i \\
-q_j & q_i & q_r \\
-q_i & -q_j & -q_k
\end{pmatrix}.

Then the derivative of linear velocity with respect to orientation, keeping in mind that linear velocity is in the inertial frame; were it in the body frame, its derivative with respect to orientation would naturally be zero:

v_q = \frac{2F}{m}
\begin{pmatrix}
q_k & q_r & q_i & q_j \\
-q_r & q_k & q_j & -q_i \\
-q_i & -q_j & q_k & q_r
\end{pmatrix},

where F is the sum of the thrust forces, and is therefore a function of the control signal u. Let us define f_u by differentiating with respect to each control signal separately,

f_{u_i} = \frac{dx_{t+1}}{du_i} = \Delta t\,\big(0,\; v_{u_i},\; \omega_{u_i}\big)^T,

where

v_{u_i} = \frac{F'_i}{m}
\begin{pmatrix}
2(q_i q_k + q_j q_r) \\
2(q_j q_k - q_i q_r) \\
q_k^2 + q_r^2 - q_i^2 - q_j^2
\end{pmatrix}
\quad\text{and}\quad
\omega_{u_i} = \frac{1}{I}
\begin{pmatrix}
F'_i k_{y_i} \\
-F'_i k_{x_i} \\
k_{d_i} k_{\tau z} F'_i
\end{pmatrix}.

k_{x_i} and k_{y_i} are the X and Y coordinates of the i-th motor, and F'_i = 2 a k_{w\max}^2 u_i + k_{w\max} b is the derivative of the thrust force due to the i-th motor with respect to u_i.
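As an illustration of how such Jacobian blocks can be obtained mechanically, the following SymPy sketch differentiates the quaternion kinematics and reproduces the q_q and q_ω blocks above. The thesis only states that symbolic differentiation software was used; the choice of SymPy here is an assumption.

import sympy as sp

qi, qj, qk, qr = sp.symbols('q_i q_j q_k q_r')
wx, wy, wz = sp.symbols('omega_x omega_y omega_z')

# Quaternion kinematics: q_dot = (1/2) q * (omega, 0), vector part first.
q_dot = sp.Rational(1, 2) * sp.Matrix([
    qr*wx + qj*wz - qk*wy,
    qr*wy + qk*wx - qi*wz,
    qr*wz + qi*wy - qj*wx,
    -qi*wx - qj*wy - qk*wz])

q_q = q_dot.jacobian(sp.Matrix([qi, qj, qk, qr]))   # matches the q_q block above
q_w = q_dot.jacobian(sp.Matrix([wx, wy, wz]))       # matches the q_omega block above
print(sp.simplify(q_q))
print(sp.simplify(q_w))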

3.1.2. Differentiation of Loss

As previously noted, our loss is a quadratic function

l(x_t, u_t) = x_t^T Q x_t + u_t^T R u_t.

We constrain the matrices Q and R to be symmetric; therefore the Jacobians and Hessians with respect to x and u are

l_x = 2 x_t^T Q, \qquad l_{xx} = 2Q, \qquad l_{xu} = 0,
l_u = 2 u_t^T R, \qquad l_{uu} = 2R, \qquad l_{ux} = 0.

3.1.3. Damping Schedule

We found damping to be essential for meaningful trajectory optimization to occur. In particular, we used an exponential strategy, as is common for the Levenberg-Marquardt


algorithm (Marquardt, 1963), with damping ε = C^k (see section 2.3.2 for a definition of ε). The initial value for k was -5, and C = 2. k was incremented with each pass of DDP that failed to find a better solution, and decremented with each pass that improved the solution, until k reached a predefined maximum k_max = 200.

In contrast to Marquardt (1963), we restarted the damping schedule after reaching k_max, stopping only when no improvements were made at any k, or after a maximum number of passes. The result was that lower damping values were attempted more often, and larger steps were taken. Larger steps mean larger gradients, so the propagation to the earlier time steps is better, which is particularly important in our work as we only use the first few actions of the optimized trajectory. Optimization of only the end of the trajectory is wasted effort, as only the first few time steps are used at all. Nor can we avoid optimizing for the later time steps altogether, as they are needed to make the optimization less desperate; if the algorithm only has a few fractions of a second to minimize cost, it becomes worthwhile to sacrifice stability to minimize the average distance from the goal state.
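One way to read the schedule described above is the following sketch, where ddp_pass(eps) is a placeholder for a single backward and forward pass with damping eps that reports whether the trajectory improved; the exact restart bookkeeping in the actual implementation may differ.

def optimize_with_damping(ddp_pass, C=2.0, k0=-5, k_max=200, max_passes=1000):
    """Restarting exponential damping schedule (sketch)."""
    k = k0
    improved_since_restart = False
    for _ in range(max_passes):
        improved = ddp_pass(C ** k)
        if improved:
            improved_since_restart = True
            k -= 1                       # try a smaller damping next
        else:
            k += 1                       # back off towards heavier damping
        if k >= k_max:                   # restart the schedule from the bottom
            if not improved_since_restart:
                break                    # no improvement at any damping level
            k, improved_since_restart = k0, False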

3.1.4. Cost Matrix Design

Our cost function was constructed from two diagonal matrices as mentioned previously, where the loss is as in eq. (2.1),

l(x_t, u_t) = x_t^T Q x_t + u_t^T R u_t.

It provides an intuitive and straight-forward means of controlling the optimization process, and is easily differentiated as shown in section 3.1.2. The values used were found through experimentation with the optimization process and reasoning about the resulting trajectories.

The use of quaternions to describe orientation provides a convenient way to enforce axis alignment between the inertial and body frames, i.e. unit orientation. For example, our cost function encourages the multicopter to stay upright by taking the negation of the cosine similarity between the inertial and body Z axes. For a quaternion, the negated cosine similarity of the Z axis is

-\cos\alpha = q_i^2 + q_j^2 - q_k^2 - q_r^2,

where α is the angle between the inertial Z axis and the body Z axis, and is zero when the body is completely upright.

Generalizing this dissimilarity cost to all axes,

(-k_x + k_y + k_z)\,q_i^2 + (k_x - k_y + k_z)\,q_j^2 + (k_x + k_y - k_z)\,q_k^2 - (k_x + k_y + k_z)\,q_r^2,

where k_x, k_y and k_z are cost coefficients for each axis.¹ Since these are all quadratic terms, the cost can be encoded along the diagonal of the matrix Q. The cost is minimized when q_r = 1, which is the identity quaternion q = q^{-1}, that is to say, unit orientation.

¹The result for the Z axis follows from setting k_x = k_y = 0 and k_z = 1.
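As a sketch, a diagonal Q implementing the axis-alignment cost above could be assembled as follows; the position, velocity, and control weights are illustrative assumptions, since the actual values were found by experimentation and are not listed.

import numpy as np

w_pos, w_vel, w_rate = 10.0, 1.0, 0.1        # assumed position/velocity/rate weights
k_x, k_y, k_z = 0.0, 0.0, 1.0                # axis-alignment coefficients (Z only)

q_diag = [-k_x + k_y + k_z,                  # coefficient of q_i^2
           k_x - k_y + k_z,                  # coefficient of q_j^2
           k_x + k_y - k_z,                  # coefficient of q_k^2
         -(k_x + k_y + k_z)]                 # coefficient of q_r^2

# State ordering (r, q, v, omega): 3 + 4 + 3 + 3 = 13 diagonal entries.
Q = np.diag([w_pos] * 3 + q_diag + [w_vel] * 3 + [w_rate] * 3)
R = 1e-3 * np.eye(4)                         # assumed control cost

def loss(x, u):
    """Quadratic loss l(x, u) = x^T Q x + u^T R u."""
    return x @ Q @ x + u @ R @ u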


Figure 3.1. Striding and stacking illustration with N_stride = 2 and N_stack = 3. The actual trajectory is first strided to every other action-state pair, then three in a row are stacked, yielding two input samples.

3.1.5. Integration of Differential System of Equations

During forward integration in the DDP optimization process, a sub-step-based Euler integration was used. Given a state x and control signal u, a step of length Δt was taken in N_substep substeps with a constant control signal u, essentially performing piecewise linear interpolation. In practice, we set N_substep = 2 as a compromise between accuracy and efficiency.

In actual simulation, i.e. when examining the performance of the networks, an open-source initial value problem solver was used that features the Runge-Kutta method and adaptive step sizes to handle stiff problems. Though the solver is substantially more accurate, it is too slow to use during the DDP optimization process.
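The two integration schemes can be sketched as follows. SciPy's solve_ivp is assumed here as the open-source initial value solver, since the text does not name the library.

import numpy as np
from scipy.integrate import solve_ivp

def euler_substep(f, x, u, dt, n_substeps=2):
    """Sub-step Euler integration used inside the optimizer (sketch)."""
    h = dt / n_substeps
    for _ in range(n_substeps):
        x = x + h * f(x, u)
    return x

def simulate_step(f, x, u, dt):
    """Higher-fidelity step used when evaluating the networks: an adaptive
    Runge-Kutta initial value solver with the control held constant over dt."""
    sol = solve_ivp(lambda t, s: f(s, u), (0.0, dt), x, method='RK45')
    return sol.y[:, -1]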

3.2. Imitation Learning

We have used a variation of the DAgger method, as described in section 2.4. The network was implemented in the TensorFlow framework by Abadi et al. (2016). In rolling out the policies, we perturbed the state by sampling from N(x, σ_x) and drawing actions from N(u, σ_u), where σ_x = 10^{-3} for the state and σ_u = 10^{-4} for the actions. This was done in order to improve robustness and enforce neighboring-state exploration.

At each iteration, we constructed N_rollouts trajectories, each with N_steps time steps. As in the trajectory optimization, Δt = 1/400 s, i.e. 400 Hz.

3.2.1. Network Architecture

The architecture of the networks was as follows:

1. An input layer of N_stack 13-dimensional states and four-dimensional actions stacked with striding N_stride, and batch normalization;

2. N_layers hidden layers, each consisting of a fully-connected layer with N_units rectified linear units (ReLU), 5% dropout during training as recommended by Andersson et al. (2017), and batch normalization; and,


3. an output layer of four hyperbolic tangent units, rescaled and shifted so that they are between zero and one.

Figure 3.2. Agent trajectory {x_t, u_t}_{t=0}^T and expert trajectories {x_{i,t}, u_{i,t}}_{t=0}^T, illustrated by vertical trajectories starting from each state visited by the agent. Strided optimization means expert trajectories are computed every N_optstride states. Learning ahead means N_ahead steps of these expert trajectories are added to the dataset. Input striding and stacking is performed as if the agent had performed the expert actions and ended up in the expert states.

The loss function was mean squared error, optimized with the Adam optimizer due to Kingma and Ba (2014) with a learning rate linearly annealed between 10^{-3} and 10^{-5} over N_epochs epochs, with each epoch learning all data in batches of size N_batch = 512. The weights w^{(l)}_{ij} were drawn from w^{(l)}_{ij} ~ N(0, 10^{-1}). Batch normalization is as described in Ioffe and Szegedy (2015). The striding and stacking mechanism is illustrated in fig. 3.1.
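The following is a minimal sketch of the architecture above using the Keras API in TensorFlow. Since the network was written against TensorFlow directly (Abadi et al., 2016), the layer classes, the input dimensionality formula, and the omission of the learning-rate annealing are assumptions of this sketch.

import tensorflow as tf

N_stack, N_layers, N_units = 3, 3, 128                  # example hyperparameters
input_dim = N_stack * 13 + (N_stack - 1) * 4            # stacked states and actions (fig. 3.1)

inputs = tf.keras.Input(shape=(input_dim,))
h = tf.keras.layers.BatchNormalization()(inputs)
for _ in range(N_layers):
    h = tf.keras.layers.Dense(
        N_units, activation='relu',
        kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.1))(h)
    h = tf.keras.layers.Dropout(0.05)(h)                # 5% dropout during training
    h = tf.keras.layers.BatchNormalization()(h)
raw = tf.keras.layers.Dense(4, activation='tanh')(h)    # four tanh output units
outputs = tf.keras.layers.Lambda(lambda t: 0.5 * (t + 1.0))(raw)   # rescale into [0, 1]

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mse')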

3.2.2. Learning Ahead and Strided Optimization

A notable difference of our method from DAgger as formulated in Ross et al. (2011) is that we did not optimize every state. The motivation for this is that since we optimize whole trajectories, merely using the first action of each trajectory is wasteful; moreover, our trajectories have significantly shorter time steps. Assuming the agent learns to perform the expert action given some state, the agent is going to need to know what it should do at the next couple of states, which are the next steps of the trajectory. We therefore introduced strided optimization, where the expert trajectories are only computed for a subset of states, and learning ahead, where the agent is taught several steps of a single expert trajectory, both illustrated in fig. 3.2.
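A sketch of the input striding and stacking of fig. 3.1 follows; the exact sample layout is only specified by the figure, so this is one plausible reading rather than a reproduction of the actual implementation.

import numpy as np

def stack_inputs(states, actions, n_stack, n_stride):
    """Build stacked network inputs as in fig. 3.1 (sketch).

    states[t] and actions[t] are the state and action at time t. Each sample
    stacks n_stack states taken n_stride steps apart, together with the action
    taken n_stride steps before each later state. With n_stride = 2 and
    n_stack = 3 this yields samples of the form (x0, x2, u0, x4, u2).
    """
    span = n_stride * (n_stack - 1)
    samples = []
    for t0 in range(len(states) - span):
        idx = [t0 + i * n_stride for i in range(n_stack)]
        parts = [states[idx[0]]]
        for t in idx[1:]:
            parts.append(states[t])
            parts.append(actions[t - n_stride])
        samples.append(np.concatenate(parts))
    return np.array(samples)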


3.2.3. Trajectory Continuation

Deciding when to reset a trajectory is crucial for learning to occur; if one simply allows the agent to continue on the same trajectory indefinitely, it will either fail and end up falling further away from the goal state or, if successful, it will not see new states as it keeps near the goal state. On the other hand, resetting the state on each training iteration limits the explored states to those reachable in N_steps time steps. We therefore introduced three continuation criteria, all of which must be satisfied for a trajectory to continue:

1. the cost of the final state must be below the threshold Q_keepmax = 10^5;

2. the body Z axis must be above the inertial XY plane, i.e. upright; and

3. a random reset occurs with probability p_reset = 2%, to encourage exploration of new trajectories.

We additionally set a limit N_keepmax = 1 on how many trajectories are continued between training iterations. We chose p_reset so that the expected trajectory length is 25 seconds, i.e. p_reset = 2%. In strided optimization, we set p_reset = 2‰.
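A sketch of the three continuation criteria in code; the quaternion-based uprightness test reuses the cosine-similarity expression from section 3.1.4, and the flat (r, q, v, ω) state layout is an assumption.

import numpy as np

Q_KEEP_MAX = 1e5            # criterion 1 threshold
P_RESET = 0.02              # 2% (2 per mille when using strided optimization)

def should_continue(final_state, final_cost, p_reset=P_RESET, rng=np.random):
    """Return True if the trajectory may continue into the next training
    iteration, per the three criteria above (sketch)."""
    qi, qj, qk, qr = final_state[3:7]
    upright = (qk*qk + qr*qr - qi*qi - qj*qj) > 0.0   # body Z above inertial XY plane
    return (final_cost < Q_KEEP_MAX                   # criterion 1
            and upright                               # criterion 2
            and rng.random() > p_reset)               # criterion 3: random reset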


4. Experimental Results

We trained the networks to imitate an expert that takes the multicopter to the origin at hover, that is to say, with zero velocity. We argue that this is without loss of generality, because a translation of the coordinate system enables setpoints to be reached at any position. We consider variations of the following network design choices:

• network architecture: deep and narrow with N_layers = 5 and N_units = 32, shallow and wide with N_layers = 3 and N_units = 128, and a middle ground with N_layers = 2 and N_units = 64, similar to Andersson et al. (2017);

• rollouts and length: four shorter rollouts with N_rollouts = 4 and N_steps = 100, or two longer rollouts with N_rollouts = 2 and N_steps = 200;

• striding and stacking: combinations of no, small, medium and long striding with N_stride = {1, 4, 20, 40}, and no, some, and a lot of stacking with N_stack = {1, 3, 10}; and,

• learning ahead and strided optimization: N_ahead = 4, N_optstride = 5 and p_reset = 2‰ versus N_ahead = 1, N_optstride = 1, and p_reset = 2%. See section 3.2.2 for an explanation of these variables.

We compare the networks from two points of view: first, the faithfulness of the network's approximation to the expert: how well does the network predict the actions of the expert in novel states? And second, the stability of the network: how often does the network fail to return to a steady hover, and how fast does it arrive at that state?

4.1. Training Loss

The agent's ability to imitate the expert is quantified by its loss function, the mean squared error between the predicted action and the expert action. Neural network loss curves characteristically decrease rapidly at first, then taper off as the optimization converges. The same should in principle be true for our method, but only holds within one training iteration. After each iteration, DAgger aggregates more data, and so the loss function could be expected to have spikes at each new training iteration, resulting in a sawtooth-like loss curve. This was observed in our training, as shown in figs. 4.1a and 4.1b.

The spikes become smaller over time, as seen in fig. 4.1b. We reason that this is in part due to a "drop in the bucket" effect, where the dataset has become large enough that adding new samples has a relatively small impact on the overall loss, but also due to the network learning to predict the expert. This is evidenced by the first few iterations of fig. 4.1a, where spike height decreases rapidly with each training iteration, and in fig. 4.2, where the macro behavior of the loss function clearly converges towards an upper bound, suggesting that new data points fit the agent's prediction better as training progresses.

Figure 4.1. Plots of loss as a function of epoch in blue. Vertical grid lines are plotted every N_epochs = 25 epochs, so as to coincide with the start of a new training iteration. The average per iteration is plotted in orange. The sudden spikes at the beginning of every iteration are caused by adding new samples. (a) Early loss, from epoch 25 to 275. (b) Late loss, starting from epoch 10 000.

Figure 4.2. Macro behavior of the loss over all epochs, showing asymptotic convergence towards an upper bound. Actual loss per epoch is plotted in blue, with the average over 250 epochs in orange. Vertical grid lines are plotted so as to coincide with the start of every tenth training iteration.

4.2. Network Architecture versus Strided Optimization

We trained pairs of networks simultaneously for 48 hours, with six threads for each network, on a six-core Intel Core i7-6850K at 3.6 GHz. We used three variations of the architecture: wide with three layers of 128 units, middle with two layers of 64 units, and narrow with five layers of 32 units. We tested with and without strided optimization, as indicated by the optstride label. The resulting networks' performances are presented in fig. 4.3, showing a clear advantage for strided optimization. Not only are there fewer crashes, the final state is significantly closer to the goal. The middle-size network fails as often with striding, but is better at reaching its goal. It could be speculated that this is because of chance, i.e. the non-strided network has an unusually low failure rate and the strided one an unusually high one. This seems plausible given that the other non-strided networks perform worse, and the other strided networks perform quite a lot better.

Figure 4.3. Comparison of network performance and reliability for network architecture versus strided optimization as described in chapter 4. The horizontal bars show the percentage of failed trajectories out of 1 000 runs, where failed means the final state is more than 500 mm away from the goal. The vertical lines show distance to the goal after five seconds for successful trajectories, and are plotted at the 50th percentile with a 95% confidence interval.

4.2.1. Performance of Wide versus Narrow Networks

Figure 4.4. Plots of failure rates for each training iteration, showing an advantage for the narrow network. (a) Wide, strided optimization. (b) Narrow, strided optimization.

As shown in figs. 4.4a and 4.4b, the narrow network achieves a lower failure rate more quickly than the wide network. This is deceptive, however, as the resulting controller is less exact and has primarily learned to recover from errors, rather than learned to follow optimal trajectories. For a better illustration of this effect, see appendix B.


4.3. Improving Performance

In the interest of finding a better-performing network, we trained the network using a wider initial state distribution. The intuition is that for the agent to know what action to take, it must have seen an analogous state before; therefore, if the training process sees more varied initial states, it stands to reason that the agent should visit more diverse states. Secondly, we let that same network train for 48 additional hours, to see how much performance would improve. Both of these networks are presented in table 4.1.

Table 4.1. Performance of variants of the wide network. The highvar network had a larger initial state distribution. The longrun network is the same network with 48 more hours of training. Errors are in millimeters. High variance resulted in a slightly higher mean error and upper bound, though a significantly lower failure rate.

Variant                              Mean Error (mm)   95th Perc. (mm)   Failures
wide                                 263.75            469.72            31.50%
wide, optstride                      28.07             131.30            5.70%
wide, optstride, highvar             28.55             140.78            4.30%
wide, optstride, highvar, longrun    13.79             28.02             1.20%

4.4. Striding and Stacking

In all the above networks, we have had N_stack = N_stride = 2, meaning the agent knows what happened at most one millisecond ago. We explored various settings of these parameters, and found that none performed significantly better than our default setting. Longer striding and more stacking resulted in worse performance. We argue that this is due to the inputs becoming increasingly high-dimensional with more stacking, and increasingly varied with more striding. Indeed, we were able to achieve similar performance with no stacking or striding.

4.5. Rollouts and Lengths

The number of rollouts and their length play a lesser part in learning than perhaps expected. We found that these two parameters had no significant effect on learning; the fastest learning was with N_steps = 200 and N_rollouts = 2. We reason that this is because less time is spent on training the neural network and evaluating the agent, and more time is spent computing optimal trajectories. This suggests that our imitation learning process primarily needs expert actions over many states, and not that the states need to be visited by the agent.


4.6. Reproducibility

It could be argued that the process of learning is stochastic enough that "lucky shots" determine the final performance of a network. It is likely true that chance plays a non-negligible part in the convergence rate and performance of the learning process, and likely more so when not using strided optimization, as more time is then spent on each iteration. To investigate this, we repeated the training process three times for the narrow network with strided optimization, and saw no significant differences in performance between runs.


5. Discussion

In this chapter, we discuss the results from the previous chapter as well as the method itself, first from the point of view of the learning process, then the optimization process, and finally put this degree project in a wider societal and ethical context.

5.1. Deep Imitation Learning

The effects and considerations of training our agent can roughly be split into two parts: the learning process and the optimization process. In this section we present the former.

5.1.1. Network Architecture

Network architecture loosely translates to "ability to approximate." In imitation learning, and deep learning in a broader sense, the goal is to strike a balance between rote memorization of training data and an ability to generalize to patterns in that data. As such, these methods tend to be fairly data intensive, and so reducing the number of parameters forces the network to learn generalizations of the data rather than exact mappings.

The implication of generalizing poorly in the setting of control is that the controller will make mistakes with high frequency, causing the agent to generate trajectories that have failed or are suboptimal. The agent is then taught how to recover from its own mistakes, rather than how it should have behaved in the first place. This effect is further illustrated in appendix B.

5.1.2. Training Time and Strided Optimization

Our results clearly show that strided optimization benefits learning, and is for this degree project a necessity. The need for striding the optimization pass is perhaps lower if the trajectory rollouts are shorter, thus creating a shorter feedback cycle between expert and agent; however, our results also show that short rollouts tended to give worse learning performance.

Training time was also shown to be significant in improving agent performance, perhaps more than expected. Both of these factors are likely side effects of the fact that it is important for the agent to be trained on a lot of data.


5.2. Trajectory Optimization

Though not strictly the focus of this degree project, the trajectory optimization technique and its properties are vital to an effective imitation learning agent.

5.2.1. The Expert Requires An Expert

Imitation learning is bounded by the expert, as the agent will only be as good as the expert. A major challenge in the execution of this degree project has therefore been to implement a trajectory optimization method that is both fast and near-optimal. Inconsistencies in the expert confuse the learning process, and so it is arguably more important to have a less optimal but more consistent expert. For this reason, imitation learning necessitates the ability to construct effective controls a priori, drastically reducing the number of problems it can solve.

5.2.2. Reference Trajectory Considerations

Our trajectory optimizer is based on the control scheme described in section 2.2, a controller that sometimes fails to return to the goal state in an acceptable timeframe. It would therefore be interesting to investigate whether optimizing the agent's own trajectories would result in better performance. This would accomplish two things: first, the full trajectory would be optimized in one pass, accelerating learning substantially; and second, since the agent is an approximation of the optimal trajectory, there would be less to optimize, reducing optimization time. This is similar to Levine and Koltun (2013).

5.2.3. Alternative Optimization Methods

Clearly, the imitation learning agent's ability to converge on a policy depends on the policy's complexity. It would, for example, be impossible to learn a white-noise policy, unless the agent is formulated as a stochastic policy where the neural network predicts distribution parameters, as in Levine and Koltun (2013). The trajectory optimization algorithm is therefore a key component in the ability of the agent to learn a consistent policy, and the DDP implementation used in this degree project is a potential source of confusion for the agent.

In particular, DDP is in general not guaranteed to find an optimal trajectory for a control-constrained problem such as ours; we simply clip the controls. It might therefore be more appropriate to use an optimization technique that allows for constrained controls, like sequential quadratic programming or interior point methods. The main problem with this is the theoretical burden of computing the analytical gradient of the loss function over trajectories. We have done so for the BFGS optimization method, though this turned out to be prohibitively slow. Our derivations are shown in appendix C.

Another alternative is Newton-based optimization with trust regions. Such methods pose an even greater theoretical burden, as they require the analytical Hessians of the trajectory loss function. This is true of sequential quadratic programming as well, and is the reason we have not investigated the use of them.


5.2.4. Other Optimizer Improvements

Tassa et al. (2012) suggest some further improvements to the DDP algorithm that we have not incorporated in this work. In particular, the authors add a second line search parameter α that scales the correction δu* back until an improved cost is found, and add the regularization term λ to the diagonal of V'_xx rather than using the eigendecomposition method we have presented. Though we have not evaluated the former, the latter change reduced optimization effectiveness drastically.

A second improvement would be to look at the optimization's condition number. It is well known that numerically intensive methods can be improved by preconditioning, transforming the optimization into one that is easier to solve. We have not investigated such schemes, as this falls outside the scope of this degree project.

5.3. Societal and Ethical Aspects

Unmanned aerial vehicles, perhaps in the form of multicopters, are going to play a key part in the society of tomorrow. The methods presented and developed in this degree project are applicable to autonomous agents in a wider sense, and are not restricted to flying vehicles. It is, for example, possible to model a submarine using the same rigid body dynamics.

As is the case for the field of robotics in general, autonomous multicopters can and do find use as military technology. However, it would be short-sighted not to develop such a technology for fear of its byproducts; by that logic, nuclear physics should not have been explored, as it was the foundation for the development of the nuclear bomb. Autonomous vehicles can instead act as a democratizer, giving more people access to more goods and services at a lower cost for both the individual and society.

Multicopters are being investigated for use as emergency healthcare delivery vehicles, particularly for especially time-sensitive emergencies such as cardiac arrest, where getting a defibrillator kit to the victim in time can make the difference between life and death.

5.3.1. Sustainability

This degree project has concerned itself chiefly with optimization, and with improving the efficiency of a budding technology from an optimal control perspective. Optimality is subjective, and one typical goal of optimal control is to reduce power consumption, thereby reducing the ecological impact of multicopter flight. Better control also lowers the cost of multicopters to the individual, reducing economic barriers to the technology for those with fewer resources.

The applications of unmanned aerial vehicles also serve to reduce the overall ecological strain on the planet. For example, replacing last-mile delivery from the local post office to the recipient's door with multicopters would make deliveries in rural areas more efficient, as the alternative is often a car running a combustion engine on non-renewable fuels like gasoline. It should be noted that with the advent of the electric car, this impact is already significantly lower; however, aerial distances will almost always be shorter than on-road distances.

Perhaps the largest sustainability challenge for multicopters is the one shared by most high-power electric vehicles, be it in air, on ground, or in water: the production and disposal of batteries.


6. Conclusion

We have investigated the relationship between imitation learning in a deep neural network context and the widespread trajectory optimization algorithm DDP, and found some noteworthy results:

• Strided optimization together with learning ahead plays a crucial role in accelerating the convergence of the imitation learning process for moderately high-frequency control settings such as ours. We reason that this is due to a shorter cycle length between updating the agent and finding new optimal actions.

• It is likely that this acceleration is caused by generating more training data in a shorter time, lending credence to the approach of Levine and Koltun (2013) and Zhang et al. (2016). This is further evidenced by the fact that a larger initial state distribution during training leads to better performance, and that longer training time gave substantial improvements in performance and reliability.

• Network architecture affects the agent's ability to exactly reproduce actions from the dataset, and a simpler architecture leads to error-recovering behaviors rather than optimal control. Deeper architectures succeed better, but are also significantly slower to converge. We showed that this is because simpler architectures spend their limited expressive power on recovery behaviors, leading to undesirable oscillations.

Finally, the trajectory optimizer, the visualization environment, and the simulation tool have all been published as an open-source software package.¹

6.1. Future Work

Andersson et al. (2017) propose risk-aware learning, where more serious errors are corrected more often. This suggests that the training samples could perhaps be weighted intelligently instead. Such an approach exists in reinforcement learning under the name advantage learning; transplanted to imitation learning, it could be interpreted as weighting the training samples according to the advantage of following the expert trajectory over the actor, i.e. the cost difference between the expert trajectory and the actor trajectory at some time in the future. This embodies the idea that if the agent's actions are already acceptable for some region of states, in that its outcome is not significantly worse than the expert's, then training to mimic the expert more exactly for such states is less important.

¹ Available at https://github.com/lericson/pysquad.


This would make better use of the neural network's expressive capacity, as it would be able to expend its expressive power on the more serious errors.
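As a rough illustration of this idea, and not something implemented in this work, the sketch below weights each imitation sample by an assumed cost-to-go difference between the agent and the expert; the cost estimates, their provenance, and the function name are placeholders.

```python
import numpy as np
import tensorflow as tf

def fit_with_advantage_weights(model, states, expert_actions,
                               agent_costs, expert_costs, epochs=10):
    """Hypothetical advantage-weighted imitation: samples where following the
    expert would have been much cheaper than what the agent did get larger
    weights, so the network spends its capacity on the serious errors."""
    advantage = np.maximum(agent_costs - expert_costs, 0.0)
    weights = 1.0 + advantage / (advantage.mean() + 1e-8)  # keep weights O(1)
    model.compile(optimizer="adam", loss="mse")
    model.fit(states, expert_actions, sample_weight=weights, epochs=epochs)
```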

The most important next step from a broader perspective is to bring this technology to real-world robotic systems, and multicopters in particular. The challenge is then to find a way to train the agent so that it performs well in real-world situations. One approach is to provide a simulation that is sufficiently close to reality that the agent cannot tell the difference between simulation and reality. This approach is taken by Tobin et al. (2017) with domain randomization, where the agent is trained on a wide variety of perturbations of its environment, sufficiently variable that reality appears as just another variation.
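A minimal sketch of the domain randomization idea is shown below; the parameter names and ranges are illustrative placeholders, not values from this project or from Tobin et al. (2017).

```python
import numpy as np

def randomize_dynamics(rng=None):
    """Sample a perturbed set of simulator parameters for one training episode,
    so that the real world looks like just another perturbation to the agent."""
    rng = rng or np.random.default_rng()
    return {
        "mass": rng.uniform(0.025, 0.035),        # kg
        "arm_length": rng.uniform(0.040, 0.050),  # m
        "thrust_scale": rng.uniform(0.9, 1.1),    # scale on nominal k in F = k w^2
        "drag_coeff": rng.uniform(0.0, 0.02),
        "sensor_noise_std": rng.uniform(0.0, 0.01),
    }
```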

Acknowledgements

Thank you to the Royal Institute of Technology and RPL for enabling me to execute this project, and for the use of their computational resources. My thesis group and our supervisor Patric Jensfelt have all been instrumental in this work, both as a source of inspiration and in motivating me throughout.


Bibliography

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

C. W. Anderson, B. A. Draper, and D. A. Peterson. Behavioral Cloning of Student Pilots with Modular Neural Networks. In ICML, pages 25–32, 2000.

O. Andersson, M. Wzorek, and P. Doherty. Deep learning quadcopter control via risk-aware active learning. In AAAI, pages 3812–3818, 2017.

A. Attia and S. Dayan. Global overview of Imitation Learning. arXiv preprint arXiv:1801.06503, 2018. URL https://arxiv.org/abs/1801.06503.

M. Boyle. The Integration of Angular Velocity. Advances in Applied Clifford Algebras, 27(3):2345–2374, 2017.

M. Hehn and R. D'Andrea. Real-time Trajectory Generation for Quadrocopters. IEEE Transactions on Robotics, 31(4):877–892, 2015.

S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167, 2015. URL https://arxiv.org/pdf/1502.03167.pdf.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. URL https://arxiv.org/abs/1412.6980.

S. K. Lam, A. Pitrou, and S. Seibert. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, page 7. ACM, 2015.

M. Laskey, C. Chuck, J. Lee, J. Mahler, S. Krishnan, K. Jamieson, A. Dragan, and K. Goldberg. Comparing Human-Centric and Robot-Centric Sampling for Robot Deep Learning from Demonstrations. arXiv preprint arXiv:1610.00850, Oct. 2016. URL https://arxiv.org/abs/1610.00850.

M. Laskey, J. Lee, R. Fox, A. Dragan, and K. Goldberg. DART: Noise Injection for Robust Imitation Learning. arXiv preprint arXiv:1703.09327, Mar. 2017. URL https://arxiv.org/abs/1703.09327.

A. Laub. A Schur method for solving algebraic Riccati equations. IEEE Transactions on Automatic Control, 24(6):913–921, 1979.


K. Levenberg. Method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 2(2):164–168, 1944.

S. Levine and V. Koltun. Guided Policy Search. In International Conference on Machine Learning, pages 1–9, 2013.

D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.

L. Meier, D. Honegger, and M. Pollefeys. PX4: A node-based multithreaded open source robotics framework for deeply embedded platforms. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 6235–6240. IEEE, 2015.

D. Mellinger and V. Kumar. Minimum Snap Trajectory Generation and Control for Quadrotors. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 2520–2525. IEEE, 2011.

S. Ross, G. Gordon, and D. Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.

S. Shalev-Shwartz et al. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Y. Tassa, T. Erez, and E. Todorov. Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 4906–4913. IEEE, 2012.

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 23–30. IEEE, 2017.

T. Zhang, G. Kahn, S. Levine, and P. Abbeel. Learning Deep Control Policies for Autonomous Aerial Vehicles with MPC-guided Policy Search. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 528–535. IEEE, 2016.


A. Thrust Force as a Function of the PWM Signal

Unmanned multicopters are typically built using brushless DC motors, where the output torque is proportional to the motor current by some motor-specific constant K_T, i.e. the torque is τ = (I − I_0)K_T, where I is the current flowing through the motor and I_0 is the current through the motor when it is at rest. The voltage over the motor is given by its resistive part and its back-EMF part, V = IR + wK_e, where w is the angular velocity of the motor and K_e is, too, a motor-specific proportionality constant. In an idealized motor K_e = K_T, and the power consumption is P = IV = I²R + IwK_e. Assuming negligible resistive loss as well as negligible at-rest current,

P = \tau w \frac{K_e}{K_T} = \tau w.    (A.1)

The astute reader will notice that this is the angular counterpart to the classical mechanical relationship between power, force, and velocity, P = Fv.

The momentum theory of fluid dynamics relates the thrust force F to the power P by the relation

v = \sqrt{\frac{F}{2\rho A}}    (A.2)

where v is velocity, ρ is the density of air, and A is the area swept out by the propeller blades. Equations (A.1) and (A.2) establish a quadratic relationship between thrust, torque, and motor speed under the assumption that torque is proportional to force,

F = \frac{P}{v} = \tau w \sqrt{\frac{2\rho A}{F}} = wF\sqrt{\frac{k}{F}} \;\rightarrow\; F = kw^2.

The actual polynomial used in our simulations was a scaled version of a polynomial fitted to actual multicopter data.¹

Lastly, there is a distinction between the PWM signal u ∈ [0, 1] and the motor speed w; however, for this degree project we have assumed infinite torque from the motors, so w = k_wmax · u, where k_wmax is the maximum motor speed.

¹ https://wiki.bitcraze.io/misc:investigations:thrust, retrieved on 2018-03-01, with a = 1.0942 · 10⁻⁷, b = −2.1059 · 10⁻⁴, c = 0.15417 before rescaling of the thrust polynomial F = aw² + bw + c.
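For reference, a minimal sketch of this thrust model is given below, using the pre-rescaling coefficients from the footnote; the maximum motor speed w_max is an illustrative placeholder, not the value used in our simulations, and the function name is ours.

```python
def thrust_from_pwm(u, w_max=2500.0,
                    a=1.0942e-7, b=-2.1059e-4, c=0.15417):
    """Thrust as a function of the PWM signal u in [0, 1]: the motor speed is
    assumed to follow the PWM signal linearly (infinite torque), and thrust is
    given by the fitted polynomial F = a*w**2 + b*w + c (pre-rescaling)."""
    w = w_max * u                # w = k_wmax * u
    return a * w**2 + b * w + c  # thrust, in the units of the fitted data
```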


B. Distributions of Trajectories

[Figure B.1: four panels showing Distance (mm), Velocity (mm/s), Orientation, and Angular Velocity (rad/s) against Time (s).]

Figure B.1. Distribution of trajectories for the long-running wide network. The blue region surrounding the thick blue line represents the median and 95% confidence interval for trajectories that end within 500 mm of the goal position. Orientation 1.0 is zero rotation. Notably, there are trajectories with high-frequency sinusoidal orientation errors; this is due to high angular velocity.

Here we present the performance of the long-running wide network with optimization striding and increased initial state variance, and of a narrow network with better reliability statistics. All figures show 95% confidence bounds only for trajectories that ended within 500 mm of the goal position. The long-running and narrow networks can be seen in figs. B.1 and B.2, respectively. Though the narrow network has significantly better reliability (1.6% failures), its final position error mean and variance are higher. This is visible in the two figures, as the narrow network's trajectories are more chaotic. From this we conclude that the narrow network has spent considerable expressive power on learning to recover from errors, whereas the wide network, having failed less, has learned more constructive control and so is less able to recover when failure does happen.


[Figure B.2: four panels showing Distance (mm), Velocity (mm/s), Orientation, and Angular Velocity (rad/s) against Time (s).]

Figure B.2. Distribution of trajectories for the narrow network; see fig. B.1 for an explanation of the figure layout. Though the network has fewer catastrophic failures, it also performs less optimally when it succeeds.


C. Analytical Jacobian of the Loss Function

Consider our state-space system

x_{t+1} = f(x_t, u_t),    (C.1)

with state x_t ∈ R^N and action u_t ∈ R^K. Let c be a cost function over states and actions, defined as a sum of two quadratic forms given by Q ∈ R^{N×N} and R ∈ R^{K×K},

c(x, u) = x^T Q x + u^T R u.    (C.2)

We assume that the goal state is at the origin of the state space without loss of generality.¹ We seek to minimize this cost function over a trajectory, given an initial state x_0 and actions U = {u_t}_{t=0}^T. To this end, we define our optimization objective, the loss function, as the sum of the cost over the trajectory,

L(x_0, U) = d(x_{T+1}) + \sum_{t=0}^{T} c(x_t, u_t).    (C.3)

Typically, d(x) = k x^T Q x for some scalar k > 1. There are many weighting schemes for the cost function. The cost should arguably be the same for each state, as the final state is no more or less important than the states along the trajectory; however, increasing the weight on the final state has the intuitive meaning of adding importance to actually attaining the goal state.
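A direct transcription of eqs. (C.2) and (C.3) is sketched below, assuming a dynamics callable f(x, u) and NumPy arrays for Q and R; the terminal weight k = 10 is an arbitrary illustrative value, and the function name is ours.

```python
import numpy as np

def trajectory_loss(x0, U, f, Q, R, k=10.0):
    """Loss of eq. (C.3) with the quadratic cost of eq. (C.2) and terminal
    cost d(x) = k * x^T Q x."""
    x, total = np.asarray(x0, dtype=float), 0.0
    for u in U:
        total += x @ Q @ x + u @ R @ u   # running cost c(x_t, u_t)
        x = f(x, u)                      # x_{t+1} = f(x_t, u_t)
    return total + k * (x @ Q @ x)       # terminal cost d(x_{T+1})
```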

To minimize the loss function with respect to U, we need its derivative with respect to U. Let us perform the differentiation step by step in time, beginning from the final state, since the final action will have the smallest impact on the Jacobian as it only affects the final state x_{T+1}. Let f_x(y) denote the Jacobian of a vector function f with respect to its argument x, computed at y. The gradient of the loss function with respect to the last action is

L_{u_T} = \nabla_{u_T} \big( d(x_{T+1}) + c(x_T, u_T) \big).

As the gradient is a linear operator, it distributes over the two terms. Using the chain rule, we expand the first term,

d_{u_T}(x_{T+1}) = d_{x_{T+1}}(x_{T+1}) \, f_{u_T}(x_T, u_T),

with the second term being the gradient of the cost function, c_{u_T}(x_T, u_T). Note that we let the gradient be a row vector for mathematical convenience. Summing the two terms, we get

L_{u_T} = d_{x_{T+1}}(x_{T+1}) \, f_{u_T}(x_T, u_T) + c_{u_T}(x_T, u_T),

¹ Assume the goal is some state z_goal ≠ 0; then let x = z − z_goal, moving the goal state to the origin of the new state space.


where d_x(x) = k x^T(Q + Q^T), c_x(x, u) = x^T(Q + Q^T), and c_u(x, u) = u^T(R + R^T).

For L_{u_{T-1}}, we have a similar result but with an additional term, and an additional factor in the chain rule step. More precisely,

L_{u_{T-1}} = \nabla_{u_{T-1}} \big( d(x_{T+1}) + c(x_T, u_T) + c(x_{T-1}, u_{T-1}) \big).

Looking at the first term again, we have

d_{u_{T-1}}(x_{T+1}) = d_{x_{T+1}}(x_{T+1}) \, f_{x_T}(x_T, u_T) \, f_{u_{T-1}}(x_{T-1}, u_{T-1}),

then the second term,

c_{u_{T-1}}(x_T, u_T) = c_{x_T}(x_T, u_T) \, f_{u_{T-1}}(x_{T-1}, u_{T-1}),

and finally the last term is as above, c_{u_{T-1}}(x_{T-1}, u_{T-1}).

From this, we surmise the general form of the loss function's Jacobian:

L_{u_i} = c_{u_i}(x_i, u_i) + \sum_{j=i+1}^{T+1} c_{x_j}(x_j, u_j) \left( \prod_{k=i+1}^{j} f_{x_k}(x_k, u_k) \right) f_{u_i}(x_i, u_i)    (C.4)

where \prod_i f = f_i \cdots f_1 \cdot f_0. Computing this gradient is possible in O(T²) time.
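The sketch below computes this Jacobian by direct accumulation of the chain-rule products, one action at a time, which gives the stated O(T²) total cost. The Jacobian and gradient callables are assumed to be provided, and their names are ours, not the published package's.

```python
import numpy as np

def loss_jacobian(xs, us, f_x, f_u, c_x, c_u, d_x):
    """Gradient of the trajectory loss with respect to every action u_i.

    xs: states x_0 .. x_{T+1}; us: actions u_0 .. u_T.
    f_x(x, u), f_u(x, u): Jacobians of the dynamics f at (x, u).
    c_x(x, u), c_u(x, u): row-vector gradients of the running cost c.
    d_x(x): row-vector gradient of the terminal cost d."""
    T = len(us) - 1
    grads = []
    for i in range(T + 1):
        g = np.array(c_u(xs[i], us[i]), dtype=float)
        dx_du = f_u(xs[i], us[i])                 # dx_{i+1}/du_i
        for j in range(i + 1, T + 1):
            g = g + c_x(xs[j], us[j]) @ dx_du     # running-cost terms
            dx_du = f_x(xs[j], us[j]) @ dx_du     # extend chain to dx_{j+1}/du_i
        g = g + d_x(xs[T + 1]) @ dx_du            # terminal-cost term
        grads.append(g)
    return np.asarray(grads)
```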
