
Integrated and Adaptive Guidance and Control for Endoatmospheric Missiles via Reinforcement Learning

Brian Gaudet*

University of Arizona, 1127 E. Roger Way, Tucson Arizona, 85721

Isaac Charcos†

University of Arizona, 1127 E. Roger Way, Tucson Arizona, 85721

Roberto Furfaro‡

University of Arizona, 1127 E. Roger Way, Tucson Arizona, 85721

We apply the meta reinforcement learning framework to optimize an integrated and adaptive guidance and flight control system for an air-to-air missile, implementing the system as a deep neural network (the policy). The policy maps observations directly to commanded rates of change for the missile's control surface deflections, with the observations derived with minimal processing from the computationally stabilized line of sight unit vector measured by a strapdown seeker, estimated rotational velocity from rate gyros, and control surface deflection angles. The system induces intercept trajectories against a maneuvering target that satisfy control constraints on fin deflection angles, and path constraints on look angle and load. We test the optimized system in a six degrees-of-freedom simulator that includes a non-linear radome model and a strapdown seeker model. Through extensive simulation, we demonstrate that the system can adapt to a large flight envelope and off-nominal flight conditions that include perturbation of aerodynamic coefficient parameters and center of pressure locations. Moreover, we find that the system is robust to the parasitic attitude loop induced by radome refraction, imperfect seeker stabilization, and sensor scale factor errors. Finally, we compare our system's performance to two benchmarks: a proportional navigation guidance system benchmark in a simplified 3-DOF environment, which we take as an upper bound on performance attainable with separate guidance and flight control systems, and a longitudinal model of proportional navigation coupled with a three loop autopilot. We find that our system moderately outperforms the former, and outperforms the latter by a large margin.

I. Introduction

Design of flight control systems for supersonic air-to-air missiles is complicated by factors that include the large flight envelope, a changing center of mass during rocket burn, mismatches between the design and deployment environments, control saturation, and the inability to measure altitude, speed, and angle of attack. Moreover, the combination of look angle dependent radome refraction, imperfect seeker stabilization, and rate gyro bias results in a false indication of target motion, which gives rise to a parasitic attitude loop that reduces accuracy and can potentially destabilize the guidance and control (G&C) system [1–3]. Current practice in air-to-air missile G&C treats guidance and control as separately optimized systems. For example, the guidance system might map the line of sight (LOS) rotation rate and closing speed to a commanded acceleration. This commanded acceleration is then mapped to commanded control surface deflections by the flight control system. This control system, typically referred to as the missile autopilot, is implemented as three separate controllers for a skid to turn implementation: a roll stabilizer, a pitch controller, and a yaw controller [4]. The roll stabilizer drives the missile roll rate to zero, whereas the pitch and yaw controllers are each implemented as a three loop autopilot. In a given control channel (pitch or yaw), the three loop autopilot uses measured body frame acceleration and rotational velocity to cause the missile's realized acceleration to track the commanded acceleration without overshoot.

*Research Engineer, Department of Systems and Industrial Engineering. E-mail: [email protected], [email protected]
†Graduate Student, Department of Systems and Industrial Engineering. E-mail: [email protected]
‡Professor, Department of Systems and Industrial Engineering, Department of Aerospace and Mechanical Engineering. E-mail: [email protected]


The response of these controllers is dependent on dynamic pressure, and missiles operating over a large flight envelope typically require some form of gain scheduling. As it is not practical to measure altitude and speed during missile flight, these gains are selected based on the predicted trajectory of the interceptor, taking into account the initial conditions of the engagement scenario, and are fixed for the duration of the intercept. However, gain scheduling results in performance degradation when target maneuvers cause the actual trajectory to differ significantly from the predicted trajectory. An alternative is an adaptive autopilot that automatically adjusts autopilot parameters in a way that maximizes some performance metric.

One solution to improve the performance of a G&C system is to integrate the guidance and control subsystems, allowing the integrated system to exploit synergies between the guidance and control functions. For example, a system that maps observations from the navigation system directly to commanded control surface deflections can take into account the response of the vehicle to control surface deflections, actuator dynamics, and actuation and structural constraints, potentially boosting performance. Moreover, without the need for spectral separation between multiple loops, the combined guidance system can potentially use a lower effective time constant, reducing flight control response time. Importantly, the spectral separation between separate guidance and flight control systems might not be valid at the end of the engagement, where rapid changes in engagement geometry occur [5].

Recent work studying integrated G&C includes [5], where the authors develop a longitudinal integrated G&C system using continuous time predictive control, with the engagement state and disturbances estimated using a generalized extended state observer. A partially integrated (two loop) guidance and control system is developed in [6] using sliding mode control, and in [7] the authors use a back-stepping technique to develop an integrated system satisfying impact angle constraints. Most work to date in integrated G&C has used a longitudinal model, ignoring the effects of cross-coupling between the pitch and yaw channels [8]. One of the few papers to handle the full 6-DOF case is [9], where the authors use a two loop approach for a partially integrated G&C system in a surface-to-air missile application. Although there are two loops, the system is partially integrated in that the guidance loop is aware of the 6-DOF dynamics. Unfortunately, the guidance loop requires not only the full engagement state to be estimated, but also aerodynamic coefficients, and performance was not tested with uncertainty on these variables. Moreover, a radome model was not included in the simulation, so it is not clear if the parasitic attitude loop would destabilize the system, and the target does not attempt evasive maneuvers, which is a possibility for maneuverable ballistic missiles. An integrated and adaptive G&C system is developed in [10] using a longitudinal model with simplified non-linear dynamics and L1 adaptive control. It is worth noting that a single loop integrated and adaptive G&C system for an exo-atmospheric interceptor was developed in [11].

To date, research into integrated G&C for endoatmospheric missile applications has not taken advantage of recent advancements in deep learning methods and algorithms. Meta-RL [12–14] has been demonstrated to be effective in optimizing integrated and adaptive G&C systems that generate a direct closed-loop mapping from sensor outputs to actuation commands, and that allow the G&C system to adapt in real time to stochastic and time-varying environments [15, 16], including environments with highly coupled environmental and internal dynamics [11], coupled rotational dynamics [17], and path constraint satisfaction [18]. In the meta-RL framework, an agent instantiating a policy learns how to complete a task through episodic simulated experience over a continuum of environments. The policy is implemented as a deep neural network that maps observations to actions u = π_θ(o), and in our work is optimized using a customized version of proximal policy optimization (PPO) [19]. Adaptation is achieved by including a recurrent network layer with hidden state h in both the policy and value function networks. Maximizing the PPO objective function requires learning hidden layer parameters θ_h that result in h evolving in response to the history of o and u in a manner that facilitates fast adaptation to environments encountered during optimization, as well as novel environments outside of the training distribution. The optimized policy then adapts in real time to conditions experienced during deployment. Importantly, the network parameters remain fixed during deployment, with adaptation occurring through the evolution of h. Although it can take days to optimize a policy, the optimized policy stored on the flight computer can be run forward in a few milliseconds.

In this work we use the meta-RL framework to optimize an integrated and adaptive G&C system for an air-to-air missile application using a 6-DOF simulation environment. A system level view of the GN&C system is given in Fig. 1, where the policy replaces the traditional combination of a separate guidance and flight control system. Over the theater of operations defined by an engagement ensemble, the G&C system implements a closed loop mapping from filtered sensor outputs to commanded deflection rates for the missile's control surfaces. This mapping induces intercept trajectories that satisfy path constraints on load and look angle. Importantly, the system is optimized over an ensemble of aerodynamic models with variation in aerodynamic coefficients and center of pressure locations, resulting in a system that can adapt in real time to differences in the optimization and deployment environments. Our simulator models look angle dependent radome refraction and rate gyro scale factor errors, both of which contribute to the parasitic attitude loop.


The simulated target executes challenging maneuvers with a maximum acceleration capability 1/3 that of the missile. We compare our system's performance to that of proportional navigation (PN) [20] in a 3-DOF environment, where the missile's acceleration is set to the commanded acceleration but clipped to account for dynamic pressure. Since the task of the flight control system is to track this acceleration, we believe the 3-DOF performance provides an upper bound on achievable performance using separate guidance and control systems, and is a reasonable benchmark. In addition, we compare our system to a longitudinal model of a traditional G&C system where the guidance system's commanded acceleration is mapped to control surface deflections using a three loop autopilot for the pitch channel. In contrast to prior work, our G&C system uses only observations that are readily available from seeker and rate gyro outputs with minimal processing, and we simulate the computational stabilization of the strapdown seeker. Due to rate gyro bias, the stabilization is imperfect, and the combination of imperfect stabilization and radome refraction creates a parasitic attitude loop. Moreover, we consider challenging target maneuvers and path constraints on load and look angle, and we optimize and test the system using the full non-linear 6-DOF dynamics and an intuitive geometry based aerodynamics model derived from slender body and slender wing theory. However, we decided to simplify the problem by not modeling the rocket boost phase.

Fig. 1 System Diagram of GN&C system

The paper is organized as follows. In Section II we present the aerodynamic model, engagement scenarios, radome and rate gyro models, actuator model, and the equations of motion. Next, in Section III we give a brief summary of the meta-RL framework. This is followed by Section IV, where we formulate the meta-RL optimization problem for the air-to-air homing phase scenario described in Section II. In Section V we optimize and test the integrated G&C policy, and then compare the results to a 3-DOF benchmark using PN and a more realistic longitudinal benchmark using PN and an adaptive three loop autopilot, which is the standard for currently deployed missile systems.

II. Problem Formulation

A. Aerodynamic Model

As we could not obtain an aerodynamic coefficient model generated from computational fluid dynamics simulation of a missile geometry, our aerodynamic model is derived from slender body and slender wing theory. The missile geometry is shown in Fig. 2, which is drawn only approximately to scale. Here d = 0.31, h_W = 0.63, c_RW = 1.88, c_RT = 0.63, h_T = 0.63, x_HL = 6.09, x_L = 0.63, x_W = 1.25, x_CG = 3.13, and x_N = 0.94, with all values in meters. Our approach is similar to Zarchan's longitudinal model [21], but extended to 6-DOF. Specifically, we calculate an aerodynamic force coefficient vector and a center of pressure vector for each missile component (nose, wing, body, tail), a roll damping moment, and a yaw-roll coupling moment, and neglect interference effects. In this work we model the portion of the homing phase that occurs after rocket burnout, so the center of gravity remains constant.

Fig. 2 Missile Geometry

The center of pressure locations for the nose cp_NOSE, wing cp_WING, body cp_BODY, left tail cp_LT, right tail cp_RT, bottom tail cp_BT, and upper tail cp_UT are given in Eqs. (1a) through (1c), where each vector is in the missile body frame. Here a_N = 0.67 d x_N and a_B = d (x_L − x_N). The center of pressure location matrices used for the normal and side force computations are given in Eqs. (1d) and (1e), respectively.

$$\mathbf{cp}_{\rm NOSE} = \begin{bmatrix} 0.67\,x_N & 0 & 0 \end{bmatrix},\qquad \mathbf{cp}_{\rm WING} = \begin{bmatrix} x_N + x_W + 0.7\,c_{RW} - 0.2\,c_{TW} & 0 & 0 \end{bmatrix} \tag{1a}$$

$$\mathbf{cp}_{\rm BODY} = \begin{bmatrix} \dfrac{0.67\,a_N x_N + a_B\left(x_N + 0.5\,(x_L - x_N)\right)}{a_N + a_B} & 0 & 0 \end{bmatrix},\qquad \mathbf{cp}_{\rm LT} = \begin{bmatrix} x_{HL} & -h_T - d/2 & 0 \end{bmatrix} \tag{1b}$$

$$\mathbf{cp}_{\rm RT} = \begin{bmatrix} x_{HL} & h_T + d/2 & 0 \end{bmatrix},\quad \mathbf{cp}_{\rm BT} = \begin{bmatrix} x_{HL} & 0 & -h_T - d/2 \end{bmatrix},\quad \mathbf{cp}_{\rm UT} = \begin{bmatrix} x_{HL} & 0 & h_T + d/2 \end{bmatrix} \tag{1c}$$

$$\mathbf{cp}_N \in \mathbb{R}^{5\times 3} = \begin{bmatrix} \mathbf{cp}_{\rm NOSE}^T & \mathbf{cp}_{\rm WING}^T & \mathbf{cp}_{\rm BODY}^T & \mathbf{cp}_{\rm LT}^T & \mathbf{cp}_{\rm RT}^T \end{bmatrix}^T \tag{1d}$$

$$\mathbf{cp}_Y \in \mathbb{R}^{5\times 3} = \begin{bmatrix} \mathbf{cp}_{\rm NOSE}^T & \mathbf{cp}_{\rm WING}^T & \mathbf{cp}_{\rm BODY}^T & \mathbf{cp}_{\rm BT}^T & \mathbf{cp}_{\rm UT}^T \end{bmatrix}^T \tag{1e}$$

The normal force coefficient vectors for the nose C_N,NOSE, wing C_N,WING, left tail C_N,LT, body C_N,BODY, and right tail C_N,RT are given in Eqs. (2a) through (2b). Here α is the angle of attack, η = √(M² − 1), M is the Mach number, S_REF = π d²/4, S_WING = h_W (c_RW + c_TW)/2, S_PLAN = d (x_L − x_N) + 0.67 d x_N, and S_TAIL = h_W (c_RT + c_TT)/2. The normal force coefficient matrix is then given in Eq. (2c).

$$\mathbf{C}_{N_{\rm NOSE}} = \begin{bmatrix} 0 & 0 & 2\sin\alpha \end{bmatrix},\quad \mathbf{C}_{N_{\rm WING}} = \begin{bmatrix} 0 & 0 & \dfrac{8 S_{\rm WING}\sin\alpha}{\eta S_{\rm REF}} \end{bmatrix},\quad \mathbf{C}_{N_{\rm LT}} = \begin{bmatrix} 0 & 0 & \dfrac{8 S_{\rm TAIL}\sin(\alpha - \theta_{LT})}{\eta S_{\rm REF}} \end{bmatrix} \tag{2a}$$

$$\mathbf{C}_{N_{\rm BODY}} = \begin{bmatrix} 0 & 0 & \dfrac{1.5 S_{\rm PLAN}\,{\rm sign}(\alpha)\,(\sin\alpha)^2}{\eta S_{\rm REF}} \end{bmatrix},\qquad \mathbf{C}_{N_{\rm RT}} = \begin{bmatrix} 0 & 0 & \dfrac{8 S_{\rm TAIL}\sin(\alpha - \theta_{RT})}{\eta S_{\rm REF}} \end{bmatrix} \tag{2b}$$

$$\mathbf{C}_N \in \mathbb{R}^{5\times 3} = \begin{bmatrix} \mathbf{C}_{N_{\rm NOSE}}^T & \mathbf{C}_{N_{\rm WING}}^T & \mathbf{C}_{N_{\rm BODY}}^T & \mathbf{C}_{N_{\rm LT}}^T & \mathbf{C}_{N_{\rm RT}}^T \end{bmatrix}^T \tag{2c}$$

Similarly, the side force coefficient vectors are given in Eqs. (3a) through (3b), where β is the side slip angle. The side force coefficient matrix is then given in Eq. (3c).

$$\mathbf{C}_{Y_{\rm NOSE}} = \begin{bmatrix} 0 & 2\sin\beta & 0 \end{bmatrix},\quad \mathbf{C}_{Y_{\rm WING}} = \begin{bmatrix} 0 & \dfrac{8 S_{\rm WING}\sin\beta}{\eta S_{\rm REF}} & 0 \end{bmatrix},\quad \mathbf{C}_{Y_{\rm BT}} = \begin{bmatrix} 0 & \dfrac{8 S_{\rm TAIL}\sin(\beta - \theta_{BT})}{\eta S_{\rm REF}} & 0 \end{bmatrix} \tag{3a}$$

$$\mathbf{C}_{Y_{\rm BODY}} = \begin{bmatrix} 0 & \dfrac{1.5 S_{\rm PLAN}\,{\rm sign}(\beta)\,(\sin\beta)^2}{\eta S_{\rm REF}} & 0 \end{bmatrix},\qquad \mathbf{C}_{Y_{\rm UT}} = \begin{bmatrix} 0 & \dfrac{8 S_{\rm TAIL}\sin(\beta - \theta_{UT})}{\eta S_{\rm REF}} & 0 \end{bmatrix} \tag{3b}$$

$$\mathbf{C}_Y \in \mathbb{R}^{5\times 3} = \begin{bmatrix} \mathbf{C}_{Y_{\rm NOSE}}^T & \mathbf{C}_{Y_{\rm WING}}^T & \mathbf{C}_{Y_{\rm BODY}}^T & \mathbf{C}_{Y_{\rm BT}}^T & \mathbf{C}_{Y_{\rm UT}}^T \end{bmatrix}^T \tag{3c}$$


The axial force coefficient C_A is given as shown in Eq. (4a); we use k = 4 and C_A0 = 0.35.

$$C_A = C_{A_0} + k\,\left\| \begin{bmatrix} \sum \mathbf{C}_N & \sum \mathbf{C}_Y \end{bmatrix} \right\| \tag{4a}$$

We then compute the normal and side forces that are applied at the centers of pressure as shown in Eqs. (5a) and (5b), and the axial force in Eq. (5c).

$$\mathbf{F}_N = q\, S_{\rm REF}\, \mathbf{C}_N \tag{5a}$$

$$\mathbf{F}_Y = q\, S_{\rm REF}\, \mathbf{C}_Y \tag{5b}$$

$$F_A = q\, S_{\rm REF}\, C_A \tag{5c}$$

The body frame force F_B and torque L_B are then computed as given in Eqs. (6a) through (6e). Here Σ_rows denotes summation over matrix rows (axis=0 in Python), L_damp is the roll damping term suggested in [22], L_coupling is the roll induced from sideslip, and V is the missile speed.

$$\mathbf{F}_B = \begin{bmatrix} -F_A & -\sum \mathbf{F}_Y & -\sum \mathbf{F}_N \end{bmatrix} \tag{6a}$$

$$\hat{\mathbf{L}}_B = \sum_{\rm rows} (\mathbf{cp}_N - x_{CG}) \times \mathbf{F}_N + \sum_{\rm rows} (\mathbf{cp}_Y - x_{CG}) \times \mathbf{F}_Y \tag{6b}$$

$$\mathbf{L}_{\rm damp} = \begin{bmatrix} \left(\dfrac{M}{4} - \dfrac{2.15\,h_T}{d}\right)\dfrac{\hat{\mathbf{L}}_B[0]\;{\rm rad2deg}(\boldsymbol{\omega}[0])\;d}{2V} & 0 & 0 \end{bmatrix} \tag{6c}$$

$$\mathbf{L}_{\rm coupling} = \begin{bmatrix} 100\,\beta & 0 & 0 \end{bmatrix} \tag{6d}$$

$$\mathbf{L}_B = \hat{\mathbf{L}}_B + \mathbf{L}_{\rm damp} + \mathbf{L}_{\rm coupling} \tag{6e}$$
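To make the force and torque assembly concrete, the following minimal NumPy sketch evaluates Eqs. (5a) through (6b) from the coefficient and center of pressure matrices. The function and variable names (cp_N, C_N, x_cg, and so on) are our own illustration, x_cg is assumed to be a body frame position vector, and the roll damping and coupling terms of Eqs. (6c) and (6d) are omitted.

```python
import numpy as np

def body_force_and_torque(C_N, C_Y, C_A, cp_N, cp_Y, x_cg, q_dyn, S_ref):
    """Assemble the body frame force and undamped torque per Eqs. (5a)-(6b).

    C_N, C_Y   : (5, 3) normal and side force coefficient matrices
    C_A        : scalar axial force coefficient
    cp_N, cp_Y : (5, 3) center of pressure matrices (body frame, meters)
    x_cg       : (3,) center of gravity location in the body frame
    """
    F_N = q_dyn * S_ref * C_N          # per-component normal forces, Eq. (5a)
    F_Y = q_dyn * S_ref * C_Y          # per-component side forces,   Eq. (5b)
    F_A = q_dyn * S_ref * C_A          # axial force,                 Eq. (5c)

    # Body frame force, Eq. (6a): axial plus summed side and normal contributions
    F_B = np.array([-F_A, -F_Y[:, 1].sum(), -F_N[:, 2].sum()])

    # Torque about the center of gravity, Eq. (6b): moment arms crossed with forces,
    # summed over the rows (components) of each matrix
    L_hat = (np.cross(cp_N - x_cg, F_N).sum(axis=0)
             + np.cross(cp_Y - x_cg, F_Y).sum(axis=0))
    return F_B, L_hat
```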

A plot illustrating lift versus drag (normal and axial forces transformed to the wind frame) is given in Fig. 3. This agrees reasonably well with the lift to drag of a cylinder with tapered nose with a length to diameter ratio of 10, as shown in [23] Figs. 8 and 9. The airframe response to θ_LT = θ_RT = 10° and θ_BT = θ_UT = 20° is shown in Fig. 4 for an altitude of 5 km. Note the coupling of yaw into the roll rate from Eq. (6d). The airframe is highly sensitive to differential fin deflections, with θ_BT − θ_UT = θ_LT − θ_RT = 0.2° yielding ω_x = 200 deg/s before the roll damping (Eq. (6c)) kicks in.

Fig. 3 Lift versus Drag

We approximate the missile as a cylinder for purposes of computing the missile's inertia tensor, which is given in Eq. (7), where r = d/2, and m = 455 kg is the missile mass.

$$\mathbf{J} = m \begin{bmatrix} r^2/2 & 0 & 0 \\ 0 & (3 r^2 + x_L^2)/12 & 0 \\ 0 & 0 & (3 r^2 + x_L^2)/12 \end{bmatrix} \tag{7}$$


Fig. 4 Open Loop Airframe Response to Fin Deflection, Altitude = 5km

In order to model differences between the optimization aerodynamics and deployment aerodynamics, we perturb the aerodynamic coefficient matrices C_N (Eq. (2c)) and C_Y (Eq. (3c)), the center of pressure matrices cp_N (Eq. (1d)) and cp_Y (Eq. (1e)), and the axial coefficient C_A (Eq. (4a)). Specifically, at the start of each episode, we randomly sample from uniformly distributed random variables as shown in Table 1. The coefficients and centers of pressure are then perturbed as C_A = C_A (1 + ε^A_force), C_N = C_N (1 + ε^N_force), C_Y = C_Y (1 + ε^Y_force), cp_N = cp_N (1 + ε^N_cp), and cp_Y = cp_Y (1 + ε^Y_cp).

Table 1 Aerodynamic Coe�cient Perturbation

Parameters Drawn Uniformly with probability 0.5 and randomly set to +/- Max otherwise          Min      Max
Aerodynamic Lift Coefficient Vector Variation ε^N_force ∈ R^{5×3}                              -0.1     0.1
Aerodynamic Side Coefficient Vector Variation ε^Y_force ∈ R^{5×3}                              -0.1     0.1
Aerodynamic Axial Coefficient Vector Variation ε^A_force                                       -0.1     0.1
Aerodynamic Center of Pressure Vector Variation (Normal) ε^N_cp ∈ R^{5×3}                      -0.01    0.01
Aerodynamic Center of Pressure Vector Variation (Side) ε^Y_cp ∈ R^{5×3}                        -0.01    0.01
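The per-episode perturbation described above and in Table 1 can be sketched as follows. This is a minimal NumPy illustration under the stated sampling rule (uniform with probability 0.5, otherwise pinned to plus or minus the maximum); the function and variable names are ours, not the authors'.

```python
import numpy as np

def sample_perturbation(max_val, shape, rng):
    """With probability 0.5 draw uniformly in [-max_val, max_val];
    otherwise pin each element randomly to +max_val or -max_val (Table 1)."""
    if rng.random() < 0.5:
        return rng.uniform(-max_val, max_val, size=shape)
    return max_val * rng.choice([-1.0, 1.0], size=shape)

def perturb_aero_model(C_N, C_Y, C_A, cp_N, cp_Y, rng):
    """Apply the multiplicative perturbations drawn at the start of each episode."""
    C_N_p  = C_N  * (1.0 + sample_perturbation(0.10, C_N.shape,  rng))
    C_Y_p  = C_Y  * (1.0 + sample_perturbation(0.10, C_Y.shape,  rng))
    C_A_p  = C_A  * (1.0 + sample_perturbation(0.10, (),         rng))
    cp_N_p = cp_N * (1.0 + sample_perturbation(0.01, cp_N.shape, rng))
    cp_Y_p = cp_Y * (1.0 + sample_perturbation(0.01, cp_Y.shape, rng))
    return C_N_p, C_Y_p, C_A_p, cp_N_p, cp_Y_p
```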

B. Engagement Geometry and Initial Conditions

Fig. 5 Engagement

In this work we model a skewed head-on engagement scenario. Referring to Fig. 5*, the missile position vector, missile velocity vector, target position vector, and target velocity vector are shown as r_M, v_M, r_T, v_T. We can also define the relative position and velocity vectors r_TM = r_T − r_M and v_TM = v_T − v_M. The elevation angle θ_E is the angle between r_TM and its projection onto the x-y plane. We randomly generate the target's initial velocity vector such that v_T lies within a cone with axis r_TM and half apex angle θ_vT.

*In this figure, the illustrated vectors are not in general within the x-z plane.


A collision triangle is then defined in a plane that is not in general aligned with the coordinate frame shown in Fig. 5, and is illustrated in Fig. 6. Here we define the required lead angle L for the missile's velocity vector v_M as the angle that will put the missile on a collision triangle with the target, in terms of the target velocity v_T, line-of-sight angle λ, and the magnitude of the missile velocity, as shown in Equation (8a).

Fig. 6 Planar Heading Error

$$L = \arcsin\!\left(\frac{\|\mathbf{v}_T\|\sin(\beta + \lambda)}{\|\mathbf{v}_M\|}\right),\qquad v_{M_y} = \|\mathbf{v}_M\|\cos(L + \lambda),\qquad v_{M_z} = \|\mathbf{v}_M\|\sin(L + \lambda) \tag{8a}$$

This formulation is easily extended to a three dimensional engagement using the following approach (see the sketch after this paragraph):
1) define a plane normal as v̂_T × λ̂
2) rotate v_T and λ̂ onto the plane
3) calculate the required planar missile velocity (Eq. (8a))
4) rotate this velocity back into the original reference frame
Thus in R³ we define a heading error (HE) as the angle between the missile's initial velocity vector and the velocity vector associated with the lead angle required to put the missile on a collision heading with the target. Note that due to the missile aerodynamic forces and target acceleration, this is far from a perfect collision triangle, and the true heading error is greater than HE. Indeed, if we simulate with a zero heading error at a 10 km initial range, we observe a miss of 300 m, indicating a true heading error of three degrees.
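As an illustration of the planar lead angle of Eq. (8a) and the four-step extension above, the sketch below computes a lead-angle velocity direction in the plane containing v_T and the LOS. It is a simplified rendering under our own conventions (names such as lead_velocity are illustrative), not the authors' simulator code.

```python
import numpy as np

def lead_velocity(r_TM, v_T, v_M_mag):
    """Missile velocity that puts the missile on a collision triangle with a
    constant-velocity target (planar Eq. (8a) applied in the v_T / LOS plane)."""
    los_hat = r_TM / np.linalg.norm(r_TM)
    v_T_mag = np.linalg.norm(v_T)
    # Angle between the target velocity and the line of sight
    theta = np.arccos(np.clip(np.dot(v_T, los_hat) / v_T_mag, -1.0, 1.0))
    # Lead angle: match the velocity components perpendicular to the LOS, Eq. (8a)
    L = np.arcsin(np.clip(v_T_mag * np.sin(theta) / v_M_mag, -1.0, 1.0))
    # Rotate the LOS direction by L toward the target's motion (steps 1-4 above)
    n = np.cross(los_hat, v_T)
    if np.linalg.norm(n) < 1e-9:          # target flying along the LOS: no lead needed
        return v_M_mag * los_hat
    n_hat = n / np.linalg.norm(n)
    v_dir = np.cos(L) * los_hat + np.sin(L) * np.cross(n_hat, los_hat)
    return v_M_mag * v_dir
```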

The simulator randomly chooses between a target bang-bang and weave maneuver with equal probability, with the acceleration applied orthogonal to the target's velocity vector. The maneuvers have varying acceleration levels and random start time, duration, and switching time. At the start of each episode, with probability 0.5 the maneuvers in that episode use the target's maximum acceleration capability, and with probability 0.5 the acceleration is sampled uniformly between 0 and the maximum. We assume the target uses aerodynamic control surfaces (no thrust vector control). Consequently, the maximum target acceleration is reduced taking into account dynamic pressure. Specifically, we assume that the target can achieve the acceleration shown in Table 2 only at q_o^MAX, the dynamic pressure corresponding to its maximum speed at sea level, and we scale this maximum acceleration by the ratio q_o / q_o^MAX, where q_o = (1/2) ρ ‖v_T‖². Sample target maneuvers are shown in Fig. 7; note that in some cases the maneuver period is considerably shorter or longer, with the longest periods being twice the time of flight, in which case the bang-bang maneuver becomes a step maneuver. We assume the target is powered and can maintain a constant speed during the maneuver.

We can now list the range of engagement scenario parameters in Table 2. During optimization and testing, these parameters are drawn uniformly between their minimum and maximum values, except as noted. The generation of heading error is handled as follows. We first calculate the optimal missile velocity vector that puts the missile on a collision triangle with the target, as described previously. We then uniformly select a heading error HE between the bounds given in Table 2, and randomly perturb the direction of the missile's velocity vector such that arccos(v̂_M · v̂_Mp) < HE, where v_Mp is the perturbed missile velocity vector. Since the missile is launched by an aircraft, the initial angle of attack, side slip angle, and roll can vary as shown in Table 2. At the start of each episode, each component of the aerodynamic coefficient vectors (normal, side, and axial) and the center of pressure vectors described in Section II.A are independently perturbed by ε_force and ε_cp as shown in Table 1.


Fig. 7 Sample Target Maneuvers

Table 2 Simulator Initial Conditions for Optimization

Parameters Drawn Uniformly                                 Min      Max
Range ‖r_TM‖ (m)                                           5000     10000
Elevation Angle θ_E (degrees)                              -30      30
Missile Velocity Magnitude ‖v_M‖ (m/s)                     800      1000
Target Velocity Magnitude ‖v_T‖ (m/s)                      250      600
Target Velocity Cone Half Apex Angle θ_vT (degrees)        45       45
Heading Error (degrees)                                    0        5
Initial Angle of Attack α (degrees)                        -10      10
Initial Side Slip Angle β (degrees)                        -5       5
Initial Roll Angle (degrees)                               -30      30
Target Maximum Acceleration (m/s²)                         0        10 × 9.81
Target Bang-Bang Duration (s)                              1        8
Target Bang-Bang Initiation Time (s)                       0        6
Target Weave Period (s)                                    1        8
Target Weave Offset (s)                                    1        5

The missile G&C system is optimized to satisfy the path constraints tabulated in Table 3. The path constraints on attitude are not actually required, but reduced optimization run time by keeping the agent out of regions of unproductive state space. The minimum speed constraint is imposed to ensure the missile's speed does not fall to subsonic speeds, in which case the calculation of η = √(M² − 1) in Section II.A would not be correct. The look angle constraint implements a field of view constraint: it constrains the maximum angle between the body frame x-axis and the body frame LOS vector to be less than 80°.

Table 3 Path Constraints

Constraint                                                      Min      Max
Minimum Speed (m/s)                                             400      400
Pitch (degrees)                                                 -85      85
Yaw (degrees)                                                   -85      85
Roll (degrees)                                                  -100     100
X Component of Rotational Velocity Vector ω_x (degrees/s)       -6       6
Look Angle θ_L (degrees)                                        -        80
Load ‖[a_By, a_Bz]‖ (g)                                         -        35


C. Seeker, Rate Gyro and Accelerometer Models

Let r̂^N_TM = r_TM / ‖r_TM‖ be the inertial frame line of sight (LOS) unit vector, and let the radome aberration angle θ_R be the angle between the ground truth LOS and the apparent LOS, with the difference caused by radome refraction. The body frame LOS unit vector is calculated as r̂^B_TM = C_BN r̂^N_TM, where C_BN is the direction cosine matrix (DCM) mapping from the inertial to body frame. We assume a symmetrical radome, where the radome aberration angle θ_R is a function of look angle θ_L = arccos(r̂^B_TM · e_w), where e_w = [1, 0, 0] is the missile centerline (body frame x-axis). First, we calculate the azimuthal (θ_u) and elevation (θ_v) refraction errors as shown in Eq. (9a), where A_u, A_v, k_u, and k_v are sampled uniformly within the bounds given in Table 4 at the start of each simulation episode. The resulting aberration error θ_R is shown in Fig. 8 for the case of A_u = 10 mrad and various values of k. Note that the radome slope ∂θ_R/∂θ_L is given by the slope of the curves in the figure.

$$\theta_u = A_u\left(0.75\,\frac{\theta_L}{\pi/2} + 0.25\cos\!\left(2\pi k_u \theta_L\right)\right),\qquad \theta_v = A_v\left(0.75\,\frac{\theta_L}{\pi/2} + 0.25\cos\!\left(2\pi k_v \theta_L\right)\right) \tag{9a}$$

Fig. 8 Radome Aberration Angle as Function of Look Angle

Table 4 Radome Model Parameter Bounds

Variable     Lower Limit     Upper Limit
A_u          -1e-2           1e-2
A_v          -1e-2           1e-2
k_u          1.00            3.00
k_v          1.00            3.00
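A minimal sketch of the radome error model in Eq. (9a), assuming the per-episode parameters A_u, A_v, k_u, and k_v have already been sampled within the Table 4 bounds; the function and argument names are our own illustration.

```python
import numpy as np

def radome_refraction(look_angle, A_u, A_v, k_u, k_v):
    """Azimuthal and elevation refraction errors as a function of look angle, Eq. (9a)."""
    base = 0.75 * look_angle / (np.pi / 2.0)
    theta_u = A_u * (base + 0.25 * np.cos(2.0 * np.pi * k_u * look_angle))
    theta_v = A_v * (base + 0.25 * np.cos(2.0 * np.pi * k_v * look_angle))
    return theta_u, theta_v
```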

We then create the refracted body frame LOS unit vector r̃^B_TM = C(q_R) r̂^B_TM, where C(q_R) is the DCM corresponding to the 321 Euler rotation q_R = [0, θ_u, θ_v], noting that the aberration angle is given by θ_R = arccos(r̃^B_TM · r̂^B_TM). r̃^B_TM is then computationally stabilized by rotating it by the estimated missile rotation dq_obs: λ̂ = C(dq_obs) r̃^B_TM, where C(dq_obs) is the DCM corresponding to the rotation dq_obs, and the stabilized body frame LOS unit vector λ̂ is an output of the seeker model. In addition, the seeker model outputs a closing speed measurement v_c = −(r^B_TM · v^B_TM)/‖r^B_TM‖, a range measurement r = ‖r^B_TM‖, and a surrogate for the LOS rotation rate. The exact LOS rotation rate can be calculated as Ω_GT = (r_TM × v_TM)/(r_TM · r_TM). However, v_TM is not directly measurable from sensor outputs, although it can be estimated using a Kalman filter [24]. Instead, our G&C system uses a surrogate for the LOS rotation rate, Ω = (λ̂_t × λ̂_{t−Δt})/Δt. The intuition for the surrogate rotation rate is that the cross product increases with the angle between the two samples of λ̂.
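The surrogate LOS rotation rate is simple to form from two consecutive stabilized LOS unit vectors; the sketch below is our own illustration of that finite-difference cross product, not the authors' seeker code.

```python
import numpy as np

def surrogate_los_rate(lam_hat_now, lam_hat_prev, dt):
    """Surrogate LOS rotation rate from two samples of the stabilized LOS unit vector.

    The cross product magnitude grows with the angle swept between samples, so this
    behaves like an angular rate without requiring an estimate of v_TM.
    """
    return np.cross(lam_hat_now, lam_hat_prev) / dt
```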


Our rate gyro model corrupts the ground truth rotational velocity ω_GT with both scale factor bias and Gaussian noise, as shown in Eq. (10), where U(a, b, n) denotes an n-dimensional uniformly distributed random variable bounded by (a, b), with each dimension of the random variable independent, and N(μ, σ, n) denotes an n-dimensional normally distributed random variable with mean μ and standard deviation σ. We use ε_ω = 0.001 and σ_ω = 0.001.

$$\boldsymbol{\omega}_{\rm obs} = \boldsymbol{\omega}_{GT}\left(1 + U(-\epsilon_\omega, \epsilon_\omega, 3)\right) + N(0, \sigma_\omega, 3) \tag{10}$$

The estimated change in attitude dq_obs is parameterized as a quaternion, and is estimated by integrating ω as shown in Equation (11), where dq_obs is reset at the start of each episode, dq_obs,0 = [1, 0, 0, 0]. In our simulation model, we approximate this integration using fourth order Runge-Kutta integration. Since ω is corrupted with both a scale factor error and Gaussian noise, in general dq_obs ≠ dq_GT, and the previously discussed computational stabilization will be imperfect. The combination of imperfect stabilization and radome refraction results in a false indication of target motion. This results in increased miss distance, and can potentially destabilize the G&C system. This is discussed in more detail in [1, 2, 11].

$$\begin{bmatrix} \dot{dq}_0 \\ \dot{dq}_1 \\ \dot{dq}_2 \\ \dot{dq}_3 \end{bmatrix} = \frac{1}{2}\begin{bmatrix} dq_0 & -dq_1 & -dq_2 & -dq_3 \\ dq_1 & dq_0 & -dq_3 & dq_2 \\ dq_2 & dq_3 & dq_0 & -dq_1 \\ dq_3 & -dq_2 & dq_1 & dq_0 \end{bmatrix}\begin{bmatrix} 0 \\ \omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix} \tag{11}$$
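The quaternion kinematics of Eq. (11) (and the identical form in Eq. (19)) can be integrated with a standard fourth order Runge-Kutta step; the sketch below is a generic illustration with our own function names, not the authors' integrator, and the renormalization is a common practice we have added rather than something stated in the paper.

```python
import numpy as np

def quat_derivative(q, omega):
    """Quaternion kinematics, Eqs. (11)/(19): qdot = 0.5 * Q(q) * [0, omega]."""
    q0, q1, q2, q3 = q
    Q = np.array([[q0, -q1, -q2, -q3],
                  [q1,  q0, -q3,  q2],
                  [q2,  q3,  q0, -q1],
                  [q3, -q2,  q1,  q0]])
    return 0.5 * Q @ np.concatenate(([0.0], omega))

def rk4_quat_step(q, omega, dt):
    """One RK4 step, assuming omega is held constant over the step."""
    k1 = quat_derivative(q, omega)
    k2 = quat_derivative(q + 0.5 * dt * k1, omega)
    k3 = quat_derivative(q + 0.5 * dt * k2, omega)
    k4 = quat_derivative(q + dt * k3, omega)
    q_next = q + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return q_next / np.linalg.norm(q_next)   # renormalize to counter numerical drift
```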

We found that providing the meta-RL G&C system with an estimated body frame acceleration improved satisfaction of the load constraint. Similar to the rate gyro model, the observed body frame acceleration is corrupted with a scale factor error as shown in Eq. (12), where we use ε_acc = 0.001.

$$\mathbf{a}^B_{\rm obs} = \mathbf{a}^B_{GT}\left(1 + U(-\epsilon_{\rm acc}, \epsilon_{\rm acc}, 3)\right) \tag{12}$$

The navigation frequency is 25 Hz, i.e., signals from the radome, rate gyro, and accelerometer models are input to the policy every 0.04 s.

D. Actuator Model

For endoatmospheric applications, we have found that meta-RL optimization efficiency is improved by interpreting policy actions as commanded control rates, which leads to smoother changes in control surface deflections. The output of the guidance policy u = π(o) ∈ R⁴ described in Section IV.A is split into four scalar components: a commanded deflection rate applied to both the missile's upper and lower tail control surfaces Δθ^cmd_+, a deflection rate applied to both the missile's left and right tail control surfaces Δθ^cmd_−, a differential deflection rate applied to both the missile's upper and lower tail control surfaces Δθ^cmd_d+, and a differential deflection rate applied to both the missile's left and right tail control surfaces Δθ^cmd_d−. These commanded deflection rates are generated by scaling and clipping the policy action vector u as shown in Eqs. (13a) through (13b), where Δθ^max_+ = Δθ^max_− = 20 deg/s and Δθ^max_d+ = Δθ^max_d− = 0.1 deg/s.

$$\Delta\theta^{\rm cmd}_{+} = \Delta\theta^{\max}_{+}\,{\rm clip}(u[0], -\Delta\theta^{\max}_{+}, \Delta\theta^{\max}_{+}),\qquad \Delta\theta^{\rm cmd}_{-} = \Delta\theta^{\max}_{-}\,{\rm clip}(u[1], -\Delta\theta^{\max}_{-}, \Delta\theta^{\max}_{-}) \tag{13a}$$

$$\Delta\theta^{\rm cmd}_{d+} = \Delta\theta^{\max}_{d+}\,{\rm clip}(u[2], -\Delta\theta^{\max}_{d+}, \Delta\theta^{\max}_{d+}),\qquad \Delta\theta^{\rm cmd}_{d-} = \Delta\theta^{\max}_{d-}\,{\rm clip}(u[3], -\Delta\theta^{\max}_{d-}, \Delta\theta^{\max}_{d-}) \tag{13b}$$

These are then integrated to obtain the commanded deflections θ^cmd_+, θ^cmd_−, θ^cmd_d+, and θ^cmd_d−. The integrated commanded deflections are then mapped to fin deflections as shown in Eqs. (14a) through (14b), with [θ̃_LT, θ̃_RT, θ̃_BT, θ̃_UT] passed to the input of the actuator dynamics model. Here θ^max_+ = θ^max_− = 20 deg and θ^max_d+ = θ^max_d− = 0.1 deg. For meta-RL optimization the actuator dynamics is implemented as a first order lag with τ = 0.02 s, but we test the optimized system using the second order actuator dynamics model suggested in [25], with the transfer function shown in Eq. (15) applied to each deflection using ζ_ACT = 0.7 and ω_ACT = 150 rad/s. The output of the actuator dynamics is the set of control surface deflections [θ_LT, θ_RT, θ_BT, θ_UT] that are input to the aerodynamics model described in Section II.A. Note that the reason we do not optimize with the second order model is its high frequency pole, which necessitates reducing the integration step size to a value that would significantly slow optimization.

$$\tilde{\theta}^{\rm cmd}_{-} = {\rm clip}\!\left(\theta^{\rm cmd}_{-}, -\theta^{\max}_{-}, \theta^{\max}_{-}\right),\quad \tilde{\theta}^{\rm cmd}_{d-} = {\rm clip}\!\left(\theta^{\rm cmd}_{d-}, -\theta^{\max}_{d-}, \theta^{\max}_{d-}\right),\quad \tilde{\theta}^{\rm cmd}_{LT} = \tilde{\theta}_{-} - \tilde{\theta}_{d-},\quad \tilde{\theta}^{\rm cmd}_{RT} = \tilde{\theta}_{-} + \tilde{\theta}_{d-} \tag{14a}$$

$$\tilde{\theta}^{\rm cmd}_{+} = {\rm clip}\!\left(\theta^{\rm cmd}_{+}, -\theta^{\max}_{+}, \theta^{\max}_{+}\right),\quad \tilde{\theta}^{\rm cmd}_{d+} = {\rm clip}\!\left(\theta^{\rm cmd}_{d+}, -\theta^{\max}_{d+}, \theta^{\max}_{d+}\right),\quad \tilde{\theta}^{\rm cmd}_{BT} = \tilde{\theta}_{+} - \tilde{\theta}_{d+},\quad \tilde{\theta}^{\rm cmd}_{UT} = \tilde{\theta}_{+} + \tilde{\theta}_{d+} \tag{14b}$$

$$\frac{1}{1 + \dfrac{2\zeta_{\rm ACT}}{\omega_{\rm ACT}}\,s + \dfrac{s^2}{\omega_{\rm ACT}^2}} \tag{15}$$
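A minimal sketch of the action-to-fin mapping in Eqs. (13a)-(14b): the policy's four outputs are scaled and clipped to deflection rates, integrated over one navigation period, clipped to the deflection limits, and mixed into the four fin commands. The function signature, state-carrying vector, and use of degrees throughout are our own illustration; the actuator lag of Eq. (15) is not included.

```python
import numpy as np

RATE_MAX, DRATE_MAX = 20.0, 0.1     # deg/s, Eq. (13)
DEFL_MAX, DDEFL_MAX = 20.0, 0.1     # deg,   Eq. (14)

def fin_commands(u, theta_cmd, dt):
    """Map policy action u in R^4 to commanded fin deflections (degrees).

    theta_cmd = [theta_plus, theta_minus, theta_dplus, theta_dminus] carries the
    integrated commanded deflections between calls.
    """
    rate_lim = np.array([RATE_MAX, RATE_MAX, DRATE_MAX, DRATE_MAX])
    rates = rate_lim * np.clip(u, -rate_lim, rate_lim)        # scale and clip as written in Eq. (13)
    theta_cmd = theta_cmd + rates * dt                        # integrate the commanded rates
    defl_lim = np.array([DEFL_MAX, DEFL_MAX, DDEFL_MAX, DDEFL_MAX])
    t_plus, t_minus, t_dplus, t_dminus = np.clip(theta_cmd, -defl_lim, defl_lim)
    # Fin mixing, Eqs. (14a)-(14b): sum and differential commands per channel
    fins = np.array([t_minus - t_dminus,   # theta_LT
                     t_minus + t_dminus,   # theta_RT
                     t_plus  - t_dplus,    # theta_BT
                     t_plus  + t_dplus])   # theta_UT
    return fins, theta_cmd
```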

E. Equations of Motion

Let v_B ≡ [u, v, w] denote the body frame missile velocity. The angle of attack α and sideslip angle β are then computed as shown in Eq. (16a).

$$\alpha = \arctan\frac{w}{u},\qquad \beta = \arcsin\frac{v}{\|\mathbf{v}_B\|} \tag{16a}$$

The atmospheric density ρ is calculated using the exponential atmosphere model ρ = ρ_0 e^{−h/h_s}, where ρ_0 = 1.225 kg/m³ is the density at sea level and h_s = 7018.00344 m is the density scale height, and the Mach number M is calculated using the standard Earth atmospheric temperature model. The missile control surface deflections θ_ctrl = [θ_LT, θ_RT, θ_BT, θ_UT], dynamic pressure q_o = (1/2) ρ V² with V = ‖v_B‖, rotational velocity vector ω, M, α, and β are then input to the aerodynamic model described in Section II.A to compute the body frame force F_B and torque L_B.

The rotational velocity ω is updated by integrating the Euler rotational equations of motion, as shown in Equation (17), where J is the missile's inertia tensor, and with the skew symmetric operator [a×] defined in Eq. (18).

$$\mathbf{J}\dot{\boldsymbol{\omega}} = -[\boldsymbol{\omega}\times]\,\mathbf{J}\boldsymbol{\omega} + \mathbf{L}_B \tag{17}$$

$$[\mathbf{a}\times] \equiv \begin{bmatrix} 0 & -a_3 & a_2 \\ a_3 & 0 & -a_1 \\ -a_2 & a_1 & 0 \end{bmatrix} \tag{18}$$

The missile's attitude q is updated by integrating the differential kinematic equations shown in Equation (19), where the missile's attitude is parameterized using the quaternion representation and ω_i denotes the i-th component of the rotational velocity vector ω_B.

$$\begin{bmatrix} \dot{q}_0 \\ \dot{q}_1 \\ \dot{q}_2 \\ \dot{q}_3 \end{bmatrix} = \frac{1}{2}\begin{bmatrix} q_0 & -q_1 & -q_2 & -q_3 \\ q_1 & q_0 & -q_3 & q_2 \\ q_2 & q_3 & q_0 & -q_1 \\ q_3 & -q_2 & q_1 & q_0 \end{bmatrix}\begin{bmatrix} 0 \\ \omega_0 \\ \omega_1 \\ \omega_2 \end{bmatrix} \tag{19}$$

The missile body frame velocity v_B is updated by integrating the differential equation given in Eq. (20), where m is the missile mass, C_BN(q) is the DCM mapping from the inertial to body frame given the missile attitude q, and g = −9.81 m/s².

$$\dot{\mathbf{v}}_B = -[\mathbf{v}_B\times]\,\boldsymbol{\omega} + \frac{\mathbf{F}_B}{m} + [\mathbf{C}_{BN}(\mathbf{q})]^T \begin{bmatrix} 0 & 0 & g \end{bmatrix}^T \tag{20}$$

v_B is then rotated into the North-East-Down (NED) inertial frame as shown in Eq. (21a), and v_M is calculated by rotating v_NED 90° around the x-axis into the reference frame depicted in Fig. 5.

$$\mathbf{v}_{\rm NED} = \mathbf{C}_{BN}(\mathbf{q})\,\mathbf{v}_B \tag{21a}$$

$$\mathbf{v}_M = \begin{bmatrix} v_{{\rm NED}_0} & -v_{{\rm NED}_1} & -v_{{\rm NED}_2} \end{bmatrix} \tag{21b}$$


The missile inertial frame position r_M is then updated by integrating Eq. (22).

$$\dot{\mathbf{r}}_M = \mathbf{v}_M \tag{22}$$

The target is modeled as shown in Eq. (23a), where a_T is the target acceleration assuming the maneuvers described in Section II.B.

$$\dot{\mathbf{r}}_T = \mathbf{v}_T,\qquad \dot{\mathbf{v}}_T = \mathbf{a}_T \tag{23a}$$

The equations of motion are updated using fourth order Runge-Kutta integration. For ranges greater than 160 m, a timestep of 20 ms is used, and for the final 160 m of homing, a timestep of 0.2 ms is used in order to more accurately measure miss distance; this technique is borrowed from [26].

III. Background: Reinforcement Learning Framework

In the reinforcement learning framework, an agent learns through episodic interaction with an environment how to successfully complete a task using a policy that maps observations to actions. The environment initializes an episode by randomly generating a ground truth state, mapping this state to an observation, and passing the observation to the agent. The agent uses this observation to generate an action that is sent to the environment; the environment then uses the action and the current ground truth state to generate the next state and a scalar reward signal. The reward and the observation corresponding to the next state are then passed to the agent. The process repeats until the environment terminates the episode, with the termination signaled to the agent via a done signal. Trajectories collected over a set of episodes (referred to as rollouts) are collected during interaction between the agent and environment, and used to update the policy and value functions. The interface between agent and environment is depicted in Fig. 9, where the environment instantiates the models shown in Fig. 1.

Fig. 9 Environment-Agent Interface

A Markov Decision Process (MDP) is an abstraction of the environment, which in a continuous state and action space can be represented by a state space S, an action space A, a state transition distribution P(x_{t+1} | x_t, u_t), and a reward function r = R(x_t, u_t), where x ∈ S and u ∈ A, and r is a scalar reward signal. We can also define a partially observable MDP (POMDP), where the state x becomes a hidden state, generating an observation o using an observation function O(x) that maps states to observations. The POMDP formulation is useful when the observation consists of sensor outputs. In the following, we will refer to both fully observable and partially observable environments as POMDPs, as an MDP can be considered a POMDP with an identity function mapping states to observations.

Meta-RL differs from generic reinforcement learning in that the agent learns over an ensemble of POMDPs. These POMDPs can include different environmental dynamics, aerodynamic coefficients, actuator failure scenarios, mass and inertia tensor variation, and varying amounts of sensor distortion. Optimization within the meta-RL framework results in an agent that can quickly adapt to novel POMDPs, often with just a few steps of interaction with the environment. There are multiple approaches to implementing meta-RL. In [27], the authors design the objective function to explicitly make the model parameters transfer well to new tasks. In [12], the authors demonstrate state of the art performance using temporal convolutions with soft attention. And in [13], the authors use a hierarchy of policies to achieve meta-RL. In this work, we use an approach similar to [14] using a recurrent policy and value function. Note that it is possible to train over a wide range of POMDPs using a non-meta RL algorithm. Although such an approach typically results in a robust policy, the policy cannot adapt in real time to novel environments. In this work, we implement meta-RL using proximal policy optimization (PPO) [19] with both the policy and value function implementing recurrent layers in their networks. After training, although the recurrent policy's network weights are frozen, the hidden state will continue to evolve in response to a sequence of observations and actions, thus making the policy adaptive. In contrast, a policy without a recurrent layer has behavior that is fixed by the network parameters at test time.

The PPO algorithm used in this work is a policy gradient algorithm which has demonstrated state-of-the-art performance for many reinforcement learning benchmark problems. PPO approximates the Trust Region Policy Optimization method [28] by accounting for the policy adjustment constraint with a clipped objective function. The objective function used with PPO can be expressed in terms of the probability ratio p_k(θ) given by

$$p_k(\theta) = \frac{\pi_\theta(\mathbf{u}_k \,|\, \mathbf{o}_k)}{\pi_{\theta_{\rm old}}(\mathbf{u}_k \,|\, \mathbf{o}_k)} \tag{24}$$

The PPO objective function is shown in Equations (25a) through (25c). The general idea is to create two surrogate objectives, the first being the probability ratio p_k(θ) multiplied by the advantages A^π_w(o_k, u_k) (see Eq. (26)), and the second a clipped (using clipping parameter ε) version of p_k(θ) multiplied by A^π_w(o_k, u_k). The objective to be maximized J(θ) is then the expectation under the trajectories induced by the policy of the lesser of these two surrogate objectives.

$${\rm obj1} = p_k(\theta)\, A^{\pi}_{\mathbf{w}}(\mathbf{o}_k, \mathbf{u}_k) \tag{25a}$$

$${\rm obj2} = {\rm clip}\!\left(p_k(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) A^{\pi}_{\mathbf{w}}(\mathbf{o}_k, \mathbf{u}_k) \tag{25b}$$

$$J(\theta) = \mathbb{E}_{p(\boldsymbol{\tau})}\left[\min({\rm obj1}, {\rm obj2})\right] \tag{25c}$$

This clipped objective function has been shown to maintain a bounded Kullback-Leibler (KL) divergence [29] with respect to the policy distributions between updates, which aids convergence by ensuring that the policy does not change drastically between updates. Our implementation of PPO uses an approximation to the advantage function that is the difference between the empirical return and a state value function baseline, as shown in Equation (26), where γ is a discount rate and r the reward function, described in Section IV.A.

$$A^{\pi}_{\mathbf{w}}(\mathbf{x}_k, \mathbf{u}_k) = \left[\sum_{\ell=k}^{T} \gamma^{\ell-k}\, r(\mathbf{o}_\ell, \mathbf{u}_\ell)\right] - V^{\pi}_{\mathbf{w}}(\mathbf{x}_k) \tag{26}$$

Here the value function V^π_w is learned using the cost function given by

$$L(\mathbf{w}) = \frac{1}{2M}\sum_{i=1}^{M}\left( V^{\pi}_{\mathbf{w}}(\mathbf{o}^i_k) - \left[\sum_{\ell=k}^{T} \gamma^{\ell-k}\, r(\mathbf{u}^i_\ell, \mathbf{o}^i_\ell)\right] \right)^2 \tag{27}$$

In practice, policy gradient algorithms update the policy using a batch of trajectories (rollouts) collected by interaction with the environment. Each trajectory is associated with a single episode, with a sample from a trajectory collected at step k consisting of observation o_k, action u_k, and reward r_k(o_k, u_k). Finally, gradient ascent is performed on θ and gradient descent on w, with update equations given by

$$\mathbf{w}^{+} = \mathbf{w}^{-} - \beta_{\mathbf{w}} \nabla_{\mathbf{w}} L(\mathbf{w})\big|_{\mathbf{w}=\mathbf{w}^{-}} \tag{28}$$

$$\theta^{+} = \theta^{-} + \beta_{\theta} \nabla_{\theta} J(\theta)\big|_{\theta=\theta^{-}} \tag{29}$$

where β_w and β_θ are the learning rates for the value function V^π_w(o_k) and policy π_θ(u_k | o_k), respectively. In our implementation of PPO, we adaptively scale the observations and servo both ε and the learning rate to target a KL divergence of 0.001.
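A minimal PyTorch-style sketch of the clipped surrogate objective of Eqs. (24)-(25c); the tensor names and the sign convention (returning a loss to minimize) are our own, and the recurrent-policy, observation-scaling, and KL-servo details described above are omitted.

```python
import torch

def ppo_policy_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate, Eqs. (24)-(25c), returned as a loss (negated objective).

    log_prob_new : log pi_theta(u_k | o_k) under the current policy
    log_prob_old : log pi_theta_old(u_k | o_k) from the rollout
    advantages   : advantage estimates A(o_k, u_k), Eq. (26)
    """
    ratio = torch.exp(log_prob_new - log_prob_old.detach())                     # Eq. (24)
    obj1 = ratio * advantages                                                   # Eq. (25a)
    obj2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages      # Eq. (25b)
    return -torch.min(obj1, obj2).mean()                                        # Eq. (25c), negated
```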

IV. Methods

A. Meta-RL Problem Formulation

In this air-to-air missile application, an episode terminates when the closing velocity v_c turns negative, the missile speed falls below 400 m/s, or a path constraint is violated (see Table 3). The agent observation o is shown in Eq. (30), where λ̂ and Ω are the LOS and surrogate LOS rotation rate from the seeker model described in Section II.C, q_init is the initial attitude of the aircraft launching the missile (and therefore the missile itself), which we assume can be communicated to the missile prior to launch, dq_obs and ω_obs are the change in attitude and rotational velocity from the rate gyro model, a^B_obs is the measured body frame acceleration from the accelerometer model, and θ̃^cmd_+, θ̃^cmd_−, θ̃^cmd_d+, and θ̃^cmd_d− are as shown in Eqs. (14a) through (14b) in Section II.D. The policy actions are interpreted as described in Section II.D.

$$\mathbf{o} = \begin{bmatrix} \hat{\boldsymbol{\lambda}} & \boldsymbol{\Omega} & \mathbf{q}_{\rm init} + d\mathbf{q}_{\rm obs} & \boldsymbol{\omega}_{\rm obs} & \mathbf{a}^B_{\rm obs} & \tilde{\theta}^{\rm cmd}_{+} & \tilde{\theta}^{\rm cmd}_{-} & \tilde{\theta}^{\rm cmd}_{d+} & \tilde{\theta}^{\rm cmd}_{d-} \end{bmatrix} \tag{30}$$

The reward function is shown below in Equations (31a) through (31c). r_shaping is a shaping reward given at each step in an episode. These shaping rewards take the form of a Gaussian-like function of the norm of the line of sight rotation rate Ω. r_rollrate encourages the agent to minimize roll rate, with differential fin deflections only used to counteract the rolling moment induced by the yawing moment (see Section II.A). r_ctrl is a control effort penalty, again given at each step in an episode, and r_bonus is a bonus given at the end of an episode if the miss distance is below the threshold r_lim. Importantly, the current episode is terminated if a path constraint is violated, in which case the stream of positive shaping rewards is terminated, and the agent receives a negative reward. We use α = 1, β = −0.05, δ = 0.0, ε = 10, ζ = −10, r_lim = 3 m, and σ_Ω = 0.02. Although performance was maximized by setting δ to zero, the minimum speed path constraint discouraged excess commanded changes to control surface deflections. We use a discount rate of 0.90 for shaping rewards and 0.995 for the terminal reward.

$$r_{\rm shaping} = \alpha \exp\!\left(-\frac{\|\boldsymbol{\Omega}\|^2}{\sigma_\Omega^2}\right),\qquad r_{\rm rollrate} = \beta\,|\omega_x| \tag{31a}$$

$$r_{\rm ctrl} = \delta\,\left\|\begin{bmatrix} u_0 & u_1 & u_2 & u_3 \end{bmatrix}\right\|,\qquad r_{\rm bonus} = \begin{cases} \epsilon, & \text{if } \|\mathbf{r}_{TM}\| < r_{\rm lim} \text{ and done} \\ 0, & \text{otherwise} \end{cases} \tag{31b}$$

$$r_{\rm penalty} = \begin{cases} \zeta, & \text{if any path constraint violated} \\ 0, & \text{otherwise} \end{cases},\qquad r = r_{\rm shaping} + r_{\rm rollrate} + r_{\rm ctrl} + r_{\rm bonus} + r_{\rm penalty} \tag{31c}$$
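The per-step reward of Eqs. (31a)-(31c) is straightforward to compute; the sketch below uses the coefficient values quoted above, our own function signature, and includes the roll rate term in the total as described in the text.

```python
import numpy as np

ALPHA, BETA, DELTA, EPS_BONUS, ZETA = 1.0, -0.05, 0.0, 10.0, -10.0
R_LIM, SIGMA_OMEGA = 3.0, 0.02

def reward(los_rate, omega_x, u, miss_distance, done, constraint_violated):
    """Per-step reward, Eqs. (31a)-(31c)."""
    r_shaping = ALPHA * np.exp(-np.dot(los_rate, los_rate) / SIGMA_OMEGA**2)   # Eq. (31a)
    r_rollrate = BETA * abs(omega_x)
    r_ctrl = DELTA * np.linalg.norm(u)                                         # Eq. (31b)
    r_bonus = EPS_BONUS if (done and miss_distance < R_LIM) else 0.0
    r_penalty = ZETA if constraint_violated else 0.0                           # Eq. (31c)
    return r_shaping + r_rollrate + r_ctrl + r_bonus + r_penalty
```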

The policy and value functions are implemented using four layer neural networks with tanh activations on each hidden layer. Layer 2 for the policy and value function is a recurrent layer implemented using gated recurrent units [30]. The network architectures are as shown in Table 5, where n_hi is the number of units in layer i, obs_dim is the observation dimension, and act_dim is the action dimension. The policy and value functions are periodically updated during optimization after accumulating trajectory rollouts of 60 simulated episodes.

Table 5 Policy and Value Function network architecture

              Policy Network                   Value Network
Layer         # units           activation     # units           activation
hidden 1      10 · obs_dim      tanh           10 · obs_dim      tanh
hidden 2      √(n_h1 · n_h3)    tanh           √(n_h1 · n_h3)    tanh
hidden 3      10 · act_dim      tanh           5                 tanh
output        act_dim           linear         1                 linear
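A PyTorch sketch of the recurrent policy architecture in Table 5 (hidden layer 1, a GRU as layer 2, hidden layer 3, and a linear output). The layer sizes follow the table; everything else (class name, initialization, and how the hidden state is threaded through rollouts) is our own illustration.

```python
import math
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Four layer policy with a GRU second layer, per Table 5."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        n_h1 = 10 * obs_dim
        n_h3 = 10 * act_dim
        n_h2 = int(round(math.sqrt(n_h1 * n_h3)))   # geometric mean of layers 1 and 3
        self.fc1 = nn.Linear(obs_dim, n_h1)
        self.gru = nn.GRU(n_h1, n_h2, batch_first=True)
        self.fc3 = nn.Linear(n_h2, n_h3)
        self.out = nn.Linear(n_h3, act_dim)

    def forward(self, obs_seq, h=None):
        # obs_seq: (batch, time, obs_dim); h carries the adaptive hidden state
        x = torch.tanh(self.fc1(obs_seq))
        x, h = self.gru(x, h)             # recurrent layer 2; GRU output is tanh-bounded
        x = torch.tanh(self.fc3(x))
        return self.out(x), h
```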

V. Experiments

A. 3-DOF PN Guidance Benchmark

Our first benchmark is a PN guidance system implemented in a 3-DOF simulator. Clearly, such a system provides an upper bound for the achievable performance when the guidance is coupled with a three loop autopilot. The guidance law is given in Eq. (32a), with the commanded acceleration adjusted so that it is perpendicular to v_M in Eq. (32b). In Eq. (32c), we define a_ref as the magnitude of the maximum achievable lateral acceleration at sea level and 1000 m/s. We set a_ref = 74 g, which corresponds to the steady state acceleration at those conditions with a 20 degree horizontal and vertical fin deflection in the 6-DOF simulator. a_ref is then clipped based on dynamic pressure, where ρ(h) is the atmospheric density at altitude h and V = ‖v_M‖. Drag is modeled by reducing the missile speed V by integrating the differential equation Eq. (32d), where k = 4; this is similar to the drag model used in the aerodynamic model described in Section II.A. We use N = 3, with a_M passed through a low pass filter with a time constant of 0.3 s to model the guidance system time constant. This corresponds to the conditions used for guidance and flight control testing for a three loop autopilot [31] using a longitudinal model of the missile geometry used in this work, i.e., Fig. 2. The low navigation ratio and high guidance time constant are required to keep the system stable in the presence of radome refraction, which we do not model in our 3-DOF simulator. The performance of the benchmark using the engagement scenario described in Section II.B is shown in Table 6, which tabulates the percentage of episodes resulting in a miss distance of less than 1 m, 2 m, and 3 m, the terminal missile speed V_f, and the magnitude of missile acceleration that is perpendicular to the missile's velocity, ‖a_M‖, calculated over all steps of all episodes. The cases are "Nominal", which is identical to that used for meta-RL optimization (Section II.B), "High Altitude", which starts with the missile at an altitude of 15 km in each episode, and the "TACC=X" cases, which have the maximum target acceleration increased to X g. It is worth noting that when the guidance time constant was dropped to 0.1 s, the 3-DOF PN guidance performance approached that of the meta-RL guidance system.

$$\boldsymbol{\Omega} = \frac{\mathbf{r}_{TM}\times\mathbf{v}_{TM}}{\mathbf{r}_{TM}\cdot\mathbf{r}_{TM}},\qquad \mathbf{a}_{\rm COM} = -N v_c\,(\hat{\mathbf{r}}_{TM}\times\boldsymbol{\Omega}) \tag{32a}$$

$$\mathbf{a}_{\rm PAR} = (\mathbf{a}_{\rm COM}\cdot\hat{\mathbf{v}}_M)\,\hat{\mathbf{v}}_M,\qquad \mathbf{a}_{\rm PERP} = \mathbf{a}_{\rm COM} - \mathbf{a}_{\rm PAR} \tag{32b}$$

$$a_M = {\rm clip}\!\left(\|\mathbf{a}_{\rm PERP}\|,\; 0,\; \frac{\rho(h)\,V^2}{\rho(0)\,1000^2}\,a_{\rm ref}\right),\qquad \mathbf{a}_M = a_M\,\frac{\mathbf{a}_{\rm PERP}}{\|\mathbf{a}_{\rm PERP}\|} \tag{32c}$$

$$\dot{V} = -k\,\|\mathbf{a}_M\| \tag{32d}$$
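The 3-DOF PN benchmark of Eqs. (32a)-(32c) can be sketched as below; the projection removes the component of the command along the missile velocity, and the clip enforces the dynamic-pressure-limited lateral acceleration. Function and constant names are ours, and the 0.3 s guidance low pass filter is omitted.

```python
import numpy as np

RHO_0, H_S, G = 1.225, 7018.00344, 9.81
A_REF = 74.0 * G          # max lateral acceleration at sea level, 1000 m/s

def pn_acceleration(r_TM, v_TM, v_M, altitude, N=3.0):
    """Proportional navigation command, Eqs. (32a)-(32c)."""
    v_c = -np.dot(r_TM, v_TM) / np.linalg.norm(r_TM)                  # closing speed
    Omega = np.cross(r_TM, v_TM) / np.dot(r_TM, r_TM)                 # LOS rotation rate
    a_com = -N * v_c * np.cross(r_TM / np.linalg.norm(r_TM), Omega)   # Eq. (32a)
    v_M_hat = v_M / np.linalg.norm(v_M)
    a_perp = a_com - np.dot(a_com, v_M_hat) * v_M_hat                 # Eq. (32b)
    rho = RHO_0 * np.exp(-altitude / H_S)                             # exponential atmosphere
    a_max = rho * np.dot(v_M, v_M) / (RHO_0 * 1000.0**2) * A_REF
    a_mag = np.clip(np.linalg.norm(a_perp), 0.0, a_max)               # Eq. (32c)
    return a_mag * a_perp / max(np.linalg.norm(a_perp), 1e-9)
```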

Table 6 3-DOF PN Performance

                 Miss < 1m   Miss < 2m   Miss < 3m        V_f (m/s)          ‖a_M‖ (m/s²)
Case                 %           %           %          μ      σ     Min      μ      σ
Nominal              52          81          94        820     63    582      15     14
High Altitude        44          72          89        853     56    714      13     11
TACC=10              51          80          93        815     64    539      18     18

B. Longitudinal Three Loop Autopilot Benchmark

The longitudinal autopilot model was taken from Zarchan [32], with autopilot parameters derived from the linearized airframe model. However, we simulate using the non-linear dynamics modified to account for drag, using the same drag model as in our 6-DOF simulator. We use a planar implementation of the guidance law from Section V.A. We simulated a range of engagements at distances from 3120 m to 4687 m and an altitude range of 1250 m to 2500 m. The initial missile speed was constant at 1000 m/s, the target speed constant at 312 m/s, the heading error randomized between ±5°, and the initial target elevation angle θ_E (see Fig. 5) between 0 and 20 degrees. We found that reducing the missile speed by 20% or attempting to increase the initial distance to 10 km resulted in very large miss distances. It appears that this implementation of a three loop autopilot is less effective when drag is taken into account, with the missile speed dropping below 400 m/s for longer ranges to target. Moreover, despite clipping the missile commanded acceleration at 35 g, the combined system occasionally violated the load constraint. We used the same target maneuvers as for the 6-DOF simulation (with 8 g maximum target acceleration), but modified to work in a planar engagement. We gave the autopilot access to the ground truth missile speed and altitude, and the autopilot gains were set at each simulation step based on the missile's current speed and altitude. Since we did not model radome effects, we found the best performance was with a guidance time constant of 0.02 s, a first order actuator lag of 0.02 s, and N = 3. The navigation frequency was 100 Hz. The results of running 500 episodes are shown in Table 7.


Table 7 Longitudinal PN + Three Loop Autopilot Benchmark

            Miss < 1m   Miss < 2m   Miss < 3m   Miss < 5m       V_f (m/s)          ‖a_M‖ (m/s²)
Case            %           %           %           %         μ      σ     Min      μ      σ
Nominal        0.2         8.6        32.2        83.6        511     52    357      17     24

C. Meta-RL Optimization and Testing

Optimization uses the initial conditions and vehicle parameters given in the problem formulation (Section II). The agent quickly learns to satisfy the path constraints. Once the agent learns to satisfy the constraints, the agent adjusts its policy to maximize both shaping and terminal rewards while continuing to satisfy constraints. Learning curves are given in Figures 10 through 11. Fig. 10 plots the reward history (sum of shaping and terminal rewards), with the mean ("Mean R"), mean minus 1 standard deviation ("SD R"), and minimum ("Min R") rewards plotted on the primary y-axis, and the mean and maximum number of steps per episode plotted on the secondary y-axis. Similarly, Fig. 11 plots terminal miss distance statistics. These statistics are computed over a batch of rollouts (60 episodes). A trajectory starting at an altitude of 15 km that resulted in a 0.9 m miss is shown in Fig. 12.

Fig. 10 Optimization Reward History

Fig. 11 Optimization Miss Distance History

To test the meta-RL guidance system, we ran 5000 episodes over each of the cases shown in Table 8, where ‖a_M‖ is the norm of the body frame normal and side accelerations. The "Nominal", "High Altitude", and "TACC=X" cases are as described in Section V.A, the "Peak θ_R = X" cases have A_u and A_v increased to X in the radome model (see Section II.C), and the "PV=X_Y" cases have the maximum aerodynamic coefficient variation set to X and the aerodynamic center of pressure variation set to Y (see Table 1).


Table 8 Meta-RL Policy Performance

                   Miss < 1m   Miss < 2m   Miss < 3m        V_f (m/s)          ‖a_M‖ (m/s²)    Vio
Case                   %           %           %          μ      σ     Min      μ      σ        %
Nominal                76          90          95        800     69    398      23     26      0.0
High Altitude          64          85          92        832     61    621      16     17      0.0
TACC=10                73          88          93        800     69    398      23     27      0.1
TACC=15                62          79          87        793     72    398      27     29      0.2
TACC=20                54          72          81        783     78    399      30     33      0.6
Peak θ_R = 0.02        65          81          88        781     92    397      30     37      0.1
Peak θ_R = 0.05        37          52          50        710    141    394      54     59      0.8
PV=20_0                75          89          94        798     71    398      23     27      0.0
PV=40_0                66          83          90        792     77    398      26     30      0.1
PV=20_5                67          83          90        789     76    789      27     32      0.1
PV=40_5                58          74          83        775     96    396      32     38      0.3

D. Discussion

The meta-RL optimized G&C system adapts well to the large flight envelope (0 to 20 km altitude) used in our experiments, even without knowledge of the launch altitude and speed, which could in practice be communicated from the aircraft to the missile. Note that the missile dynamic pressure ranges from 5770 N/m² at 20 km and 400 m/s to 612,500 N/m² at sea level and 1000 m/s. From Table 8 we see that the G&C system adapts well to novel conditions not experienced during optimization. Although performance decreased with higher radome aberration angles, this could be improved using an active compensation approach [33], and it is possible that optimizing with higher radome slopes may improve tolerance to larger aberration angles. Importantly, the policy adapts well to variation in aerodynamic force coefficients and center of pressure locations, making the system robust to differences between the optimization and deployment environments. We see that performance deteriorates with higher target acceleration levels, and this could likely be improved by incorporating bias into the reward shaping function, similar to biased PN [34].

The meta-RL optimized G&C system outperformed the longitudinal benchmark of PN coupled with a three loop autopilot by a large margin, and also outperformed the 3-DOF PN benchmark, which serves as an upper bound on performance attainable using separate guidance and flight control systems. We believe that the meta-RL policy is able to outperform the 3-DOF benchmark due to a combination of three factors. First, we believe that the meta-RL policy takes advantage of the underdamped airframe, particularly early in the engagement. We see from Fig. 4 that the open loop airframe response is underdamped, with the achieved acceleration overshooting the steady state acceleration. Clearly, a G&C system that can take advantage of this behavior can obtain a performance advantage, and it appears from inspection of multiple trajectories that the system does use the underdamped airframe to its advantage (see, for example, Fig. 12). In contrast, in the traditional approach where a guidance system is coupled with a flight control system, decreasing the flight control system damping below 0.7 typically leads to decreased performance, potentially destabilizing the G&C system [35]. Second, using our reward formulation, the agent has an incentive to deviate from minimizing the LOS rotation rate if it increases the probability of receiving the terminal reward, particularly if the deviation occurs close to the end of an episode. Since minimizing the LOS rotation rate is not an optimal strategy for the case of a maneuvering target (hence the use of augmented proportional navigation [34]), giving the policy the flexibility to deviate from this goal can enhance performance. Again, in Fig. 12, we see evidence for this behavior. The third factor is the decreased flight control response time of the integrated system. Performance can likely be improved further by optimizing with a higher navigation update frequency.

The design process for meta-RL optimization of an integrated G&C system is more straightforward and robustthan traditional approaches. For example, a three loop autopilot is typically designed using a simplified model of theguidance system, with unrealistic assumptions such as a linearized airframe model, constant missile speed, constantaltitude, and a body lifting force that is linear in angle of attack [4], and is independent of guidance system optimization.Consequently, when the guidance and flight control systems are simulated together in a high fidelity simulator, autopilotgains will likely need to be adjusted, especially in flight regimes where the airframe linearization is inaccurate. Incontrast, the meta-RL optimization framework directly optimizes an integrated and adaptive G&C system using a high


Fig. 12 High Altitude Sample Trajectory

In practice, the simulator can instantiate a reduced order aerodynamics model [36] built from computational fluid dynamics simulations. Similarly, the simulator can instantiate high fidelity reduced order radome and noise models†, and the meta-RL optimization framework is compatible with hardware-in-the-loop simulation. Thus, the meta-RL optimization framework has the potential to reduce the time and cost required to develop a new missile system. Moreover, since the meta-RL optimized policy can adapt to novel conditions not seen during optimization, the framework should make successful flight tests more likely. Finally, there should not be any issues implementing the guidance policy on a flight computer: although it can take several days to optimize a policy, the deployed policy's forward pass consists of a few multiplications of small matrices and can be computed in a few milliseconds.
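As a rough illustration of that last claim, the sketch below runs one control update of a small recurrent policy in NumPy. The layer sizes and the simple (non-gated) recurrent cell are hypothetical stand-ins rather than the paper's actual network architecture; they serve only to show that a single forward pass reduces to a handful of small matrix-vector products.

import numpy as np

# Hypothetical dimensions for illustration only; the deployed policy's actual
# layer sizes are not restated here.
OBS_DIM, HIDDEN_DIM, ACT_DIM = 12, 64, 4

rng = np.random.default_rng(0)
W_in = 0.1 * rng.standard_normal((HIDDEN_DIM, OBS_DIM))      # observation -> hidden
W_rec = 0.1 * rng.standard_normal((HIDDEN_DIM, HIDDEN_DIM))  # hidden -> hidden (recurrent)
W_out = 0.1 * rng.standard_normal((ACT_DIM, HIDDEN_DIM))     # hidden -> commanded deflection rates

def policy_step(obs, h):
    """One control update: two small matrix-vector products, a tanh, and an
    output projection; well within a millisecond on modest flight hardware."""
    h = np.tanh(W_in @ obs + W_rec @ h)
    return W_out @ h, h

obs = np.zeros(OBS_DIM)   # placeholder observation vector
h = np.zeros(HIDDEN_DIM)  # recurrent hidden state carried across control steps
action, h = policy_step(obs, h)
print(action.shape)       # (4,): one commanded rate per control surface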

†Note that the RL optimization framework can be accelerated by running each episode in a rollout batch on a separate CPU.
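A minimal sketch of that parallelization in Python, assuming a hypothetical run_episode routine standing in for one simulated engagement, could look as follows; the worker count and batch size are illustrative.

import multiprocessing as mp
import numpy as np

def run_episode(seed):
    # Hypothetical stand-in: in the actual framework this would reset the 6-DOF
    # simulator with randomized initial conditions, roll out the current policy,
    # and return the episode's statistics.
    rng = np.random.default_rng(seed)
    return float(rng.normal()), int(rng.integers(200, 400))

if __name__ == "__main__":
    # Each episode in the rollout batch runs on its own worker process (one CPU core each).
    with mp.Pool(processes=8) as pool:
        batch = pool.map(run_episode, range(120))  # 120 episodes per batch (illustrative)
    print(f"mean episode return: {np.mean([r for r, _ in batch]):.3f}")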


VI. Conclusion
We created a missile aerodynamic model using slender body and slender wing theory, and developed a six degrees-of-freedom simulator to model a range of air-to-air missile head-on engagement scenarios, with the simulator including a radome and rate gyro model. The interception problem was then formulated in the meta reinforcement learning framework, using a reward function that minimizes the line of sight rotation rate while imposing path constraints on load and field of view, and an observation space that includes only observations that can be obtained from sensor outputs with minimal processing. The optimized policy implements an integrated and adaptive guidance and control system, directly mapping observations to commanded control surface deflection rates. We found that the optimized guidance and control system is robust to moderate levels of radome refraction and rate gyro scale factor bias, performs well against challenging target maneuvers, can adapt to a large flight envelope, and generalizes well to novel conditions not experienced during optimization, including large aerodynamic coefficient and center of pressure location perturbations. We found that our guidance and control system significantly outperformed a longitudinal benchmark implementing proportional navigation and a three loop autopilot, and also outperformed a three degrees-of-freedom proportional navigation benchmark (the latter being the performance bound on systems with separate guidance and flight control systems). Future work will attempt to enhance accuracy to that required for hit-to-kill applications, add a realistic radar noise model, integrate our radome refraction compensation software, and model the entire missile trajectory including the rocket burn phase.

References

[1] Zarchan, P., “Tactical and strategic missile guidance,” American Institute of Aeronautics and Astronautics, Inc., 2012, pp. 126–132. https://doi.org/10.2514/4.868948.

[2] Siouris, G. M., “Missile guidance and control systems,” Springer Science & Business Media, 2004, pp. 142–143. https://doi.org/10.1007/b97614.

[3] Shneydor, N. A., “Missile guidance and pursuit: kinematics, dynamics and control,” Elsevier, 1998, pp. 151–155. https://doi.org/10.1533/9781782420590.

[4] Siouris, G. M., “Missile guidance and control systems,” Springer Science & Business Media, 2004, pp. 129–144. https://doi.org/10.1007/b97614.

[5] Panchal, B., Mate, N., and Talole, S., “Continuous-time predictive control-based integrated guidance and control,” Journal of Guidance, Control, and Dynamics, Vol. 40, No. 7, 2017, pp. 1579–1595. https://doi.org/10.2514/1.g002661.

[6] Wang, X., and Wang, J., “Partial integrated guidance and control with impact angle constraints,” Journal of Guidance, Control, and Dynamics, Vol. 38, No. 5, 2015, pp. 925–936.

[7] He, S., Song, T., and Lin, D., “Impact angle constrained integrated guidance and control for maneuvering target interception,” Journal of Guidance, Control, and Dynamics, Vol. 40, No. 10, 2017, pp. 2653–2661.

[8] Siouris, G. M., “Missile guidance and control systems,” Springer Science & Business Media, 2004, p. 132. https://doi.org/10.1007/b97614.

[9] Padhi, R., Chawla, C., and Das, P. G., “Partial integrated guidance and control of interceptors for high-speed ballistic targets,” Journal of Guidance, Control, and Dynamics, Vol. 37, No. 1, 2014, pp. 149–163. https://doi.org/10.2514/1.61416.

[10] Erdos, D., Shima, T., Kharisov, E., and Hovakimyan, N., “L1 adaptive control integrated missile autopilot and guidance,” AIAA Guidance, Navigation, and Control Conference, 2012, p. 4465.

[11] Gaudet, B., Furfaro, R., Linares, R., and Scorsoglio, A., “Reinforcement Metalearning for Interception of Maneuvering Exoatmospheric Targets with Parasitic Attitude Loop,” Journal of Spacecraft and Rockets, Vol. 58, No. 2, 2021, pp. 386–399.

[12] Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P., “A Simple Neural Attentive Meta-Learner,” International Conference on Learning Representations, 2018.

[13] Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J., “Meta Learning Shared Hierarchies,” International Conference on Learning Representations, 2018.

[14] Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M., “Learning to reinforcement learn,” arXiv preprint arXiv:1611.05763, 2016.


[15] Gaudet, B., Linares, R., and Furfaro, R., “Six degree-of-freedom body-fixed hovering over unmapped asteroids via LIDAR altimetry and reinforcement meta-learning,” Acta Astronautica, 2020. https://doi.org/10.1016/j.actaastro.2020.03.026.

[16] Gaudet, B., Linares, R., and Furfaro, R., “Terminal adaptive guidance via reinforcement meta-learning: Applications to autonomous asteroid close-proximity operations,” Acta Astronautica, 2020. https://doi.org/10.1016/j.actaastro.2020.02.036.

[17] Gaudet, B., Linares, R., and Furfaro, R., “Deep reinforcement learning for six degree-of-freedom planetary landing,” Advances in Space Research, Vol. 65, No. 7, 2020, pp. 1723–1741. https://doi.org/10.1016/j.asr.2019.12.030.

[18] Gaudet, B., Drozd, K., Meltzer, R., and Furfaro, R., “Adaptive Approach Phase Guidance for a Hypersonic Glider via Reinforcement Meta Learning,” arXiv preprint arXiv:2107.14764, 2021.

[19] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O., “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

[20] Shneydor, N. A., “Missile guidance and pursuit: kinematics, dynamics and control,” Elsevier, 1998, pp. 101–124. https://doi.org/10.1533/9781782420590.

[21] Zarchan, P., “Tactical and strategic missile guidance,” American Institute of Aeronautics and Astronautics, Inc., 2012, pp. 473–482. https://doi.org/10.2514/4.868948.

[22] Mikhail, A. G., “Roll Damping for Finned Projectiles Including: Wraparound, Offset, and Arbitrary Number of Fins,” Tech. rep., Army Research Laboratory, Aberdeen Proving Ground, MD, 1995.

[23] Jorgensen, L. H., “A method for estimating static aerodynamic characteristics for slender bodies of circular and noncircular cross section alone and with lifting surfaces at angles of attack from 0 deg to 90 deg,” 1973.

[24] Kalman, R. E., and Bucy, R. S., “New results in linear filtering and prediction theory,” Journal of Basic Engineering, Vol. 83, No. 1, 1961, pp. 95–108.

[25] Zarchan, P., “Tactical and strategic missile guidance,” American Institute of Aeronautics and Astronautics, Inc., 2012, pp. 513–514. https://doi.org/10.2514/4.868948.

[26] Zarchan, P., “Tactical and strategic missile guidance,” American Institute of Aeronautics and Astronautics, Inc., 2012, pp. 18–21. https://doi.org/10.2514/4.868948.

[27] Finn, C., Abbeel, P., and Levine, S., “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks,” ICML, 2017.

[28] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P., “Trust region policy optimization,” International Conference on Machine Learning, 2015, pp. 1889–1897.

[29] Kullback, S., and Leibler, R. A., “On information and sufficiency,” The Annals of Mathematical Statistics, Vol. 22, No. 1, 1951, pp. 79–86.

[30] Chung, J., Gulcehre, C., Cho, K., and Bengio, Y., “Gated feedback recurrent neural networks,” International Conference on Machine Learning, 2015, pp. 2067–2075.

[31] Zarchan, P., “Tactical and strategic missile guidance,” American Institute of Aeronautics and Astronautics, Inc., 2012, pp. 552–566. https://doi.org/10.2514/4.868948.

[32] Zarchan, P., “Tactical and strategic missile guidance,” American Institute of Aeronautics and Astronautics, Inc., 2012, pp. 529–564. https://doi.org/10.2514/4.868948.

[33] Gaudet, B., “Adaptive Scale Factor Compensation for Missiles with Strapdown Seekers via Predictive Coding,” arXiv preprint arXiv:2009.00975, 2020.

[34] Shneydor, N. A., “Missile guidance and pursuit: kinematics, dynamics and control,” Elsevier, 1998, pp. 166–171. https://doi.org/10.1533/9781782420590.

[35] Zarchan, P., “Tactical and strategic missile guidance,” American Institute of Aeronautics and Astronautics, Inc., 2012, pp. 547–549. https://doi.org/10.2514/4.868948.

[36] Benner, P., Goyal, P., Kramer, B., Peherstorfer, B., and Willcox, K., “Operator inference for non-intrusive model reduction of systems with non-polynomial nonlinear terms,” Computer Methods in Applied Mechanics and Engineering, Vol. 372, 2020, p. 113433.
