Improved Adaptive-Reinforcement Learning Control for Morphing Unmanned Air Vehicles
James Doebbler, Monish D. Tandale, John Valasek (Texas A&M University)
Andrew J. Meade (Rice University)
AIAA Infotech@Aerospace Conference, 26-29 Sep 2005, Washington, DC
Overview
Brief Introduction to Morphing Air Vehicles
Adaptive-Reinforcement Learning Control Architecture
Simplified Model of a Morphing Vehicle
The Old Way
– Reinforcement Learning Module
– Structured Adaptive Model Inversion Control
– Numerical Example
The New Way
– Numerical Example
Results & Conclusions
Future Work
Student Research Team, 2005-2006
Aerospace Vehicle Systems Technology Program
[Figure: Advanced Concept Evolution roadmap, technology progress vs. time: SOA Metal Supercritical Wing, AE Tailored Wet Composite Wing, Composite Blended Wing, Composite Wing & Fuselage VLA, Active Aero & Structures, Low-Boom Supersonic Transport(s), Intelligent Personal Air Vehicles, Bio/Nano Self-Optimizing Aircraft]
Revolutionary Vehicles – Technologies

Today's Challenges:
• Develop light, strong, and structurally efficient air vehicles
• Improve aerodynamic efficiency
• Design fuel-efficient, low-emission propulsion systems
• Develop safe, fault-tolerant vehicle systems

Technology Solutions:
• Nanostructures: 100 times stronger than steel at 1/6 the weight
• Active flow control
• Distributed propulsion
• Electric propulsion, advanced fuel cells, high-efficiency electric motors
• Integrated advanced control systems and information technology
• Central “nervous system” and adaptive vehicle control
[Figures: Fuel Cell Propulsion (fuel and air in, exhaust out, anode catalyst, cathode catalyst, polymer electrolyte membrane, electric circuit), Active Flow Control, Adaptive Control, Nanotube]
Big Picture Research Goals
WHEN to reconfigure
HOW to reconfigure
LEARNING to reconfigure
Which Morphing?

Morphing for Mission Adaptation
– Large scale, relatively slow, in-flight shape change to enable a single vehicle to perform multiple diverse mission profiles

as opposed to:

Morphing for Control
– In-flight physical or virtual shape change to achieve multiple control objectives (maneuvering, flutter suppression, load alleviation, active separation control)

[Figure: Mission Adaptation vs. Control spectrum. John Davidson, NASA Langley, AFRL Morphing Controls Workshop, Feb 2004]
Our Approach

Make full use of the physical knowledge of the problem:
– Machine Learning: learning a reconfiguration policy
– Adaptive Control: adapting parameters in a known functional relationship

Learn when to reconfigure with machine learning
Control Architecture
– Synthetic jets for virtual shaping and separation control
– Multi-sensor MEMS arrays for flow control feedback
[Diagram: Adaptive Controller interacting with the Environment through Sensed Information Aggregation, Control Information Distribution, Reconfiguration Command Generation, System Performance Evaluation, and a Knowledge Base]
Adaptive-Reinforcement Learning Control (A-RLC)
Conceptual Control Architecture for Reconfigurable Aircraft

SAMI: Structured Adaptive Model Inversion (Traditional Control)
– Flight controller to handle wide variation in dynamic properties due to shape change

RL: Reinforcement Learning (Intelligent Control)
– Learn the morphing dynamics and the optimal shape at every flight condition in real-time
Morphing Vehicle Model
[Images: Lockheed Martin and NextGen morphing vehicle concepts]
Morphing Vehicle Evolution
2003: 2-D Plate | 2004: Rectangular Block | 2005: Ellipsoid | 2006: Delta Wing | Final Objective
Morphing Vehicle - TiiMY

Shape
– Ellipsoidal shape with varying axis dimensions
– Constant volume $V$ during morphing: $V = \frac{\pi}{6} x y z$
– 2 independent variables, y and z; the dependent dimension is $x = \frac{6V}{\pi y z}$

Morphing Dynamics
– Smart material: carbon nano-tubes or shape memory alloy
– Simple nonlinear differential equations: $\ddot y + \dot y\,y = V_y, \qquad \ddot z + \dot z\,z = V_z$

[Figure: ellipsoid with the y and z axis dimensions labeled]
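These dynamics are simple enough to step forward directly. A minimal simulation sketch, assuming the reconstructed form $\ddot y + \dot y\,y = V_y$ above and a fixed-step Euler integrator; the step size and the constant voltage command are illustrative, not from the paper:

```python
def morph_step(y, y_dot, V_y, dt=0.01):
    """One Euler step of the reconstructed morphing dynamics y'' + y'*y = V_y."""
    y_ddot = V_y - y_dot * y
    return y + dt * y_dot, y_dot + dt * y_ddot

# Drive the y-axis dimension with a constant voltage command (illustrative values).
y, y_dot = 3.0, 0.0
for _ in range(1000):
    y, y_dot = morph_step(y, y_dot, V_y=1.0)
print(y)  # y dimension after 10 s of simulated morphing
```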
Shape Morphing Animation - TiiMY
Morphing Time Histories
Optimal Shapes at Various Flight Conditions

Optimality is defined by identifying a cost function $J = J(\text{current shape}, \text{flight condition})$:
$J = J_y + J_z = (y - S_y(F))^2 + (z - S_z(F))^2$
where the optimal dimensions as functions of flight condition $F$ are
$S_y(F) = 3 + \frac{\cos(\pi F)}{2} \quad \text{and} \quad S_z(F) = 2 + \frac{e^{-0.5F}}{2}$
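Since $S_y(F)$ and $S_z(F)$ fully determine the optimum, evaluating the cost is direct. A sketch under the reconstruction above; the function names are mine:

```python
import math

def optimal_shape(F):
    """Reconstructed optimal y- and z-dimensions as functions of flight condition F."""
    S_y = 3.0 + math.cos(math.pi * F) / 2.0
    S_z = 2.0 + math.exp(-0.5 * F) / 2.0
    return S_y, S_z

def cost(y, z, F):
    """J = J_y + J_z: squared distance of the current shape from the optimal shape."""
    S_y, S_z = optimal_shape(F)
    return (y - S_y) ** 2 + (z - S_z) ** 2
```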
6-DOF Mathematical Model for Dynamic Behavior

Variables: position $p = [d_x\ d_y\ d_z]^T$, body-axis velocity $v_c = [u\ v\ w]^T$, attitude $\sigma = [\phi\ \theta\ \psi]^T$, angular rate $\omega = [p\ q\ r]^T$

Nonlinear 6-DOF Equations
– Kinematic level: $\dot p = J_l v_c, \quad \dot\sigma = J_a \omega$
– Acceleration level: $m\dot v_c + m\,\omega\times v_c = F_c + F_d, \quad I\dot\omega + \dot I\omega + \omega\times I\omega = M_c + M_d$, where the $\dot I\omega$ term captures the additional dynamics due to morphing

Drag Force
– Function of air density, the square of the velocity along an axis, and the projected area of the vehicle perpendicular to that axis
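The $\dot I\omega$ term is what distinguishes the morphing vehicle from an ordinary rigid body. A sketch of one integration step of the rotational dynamics, where Euler integration and the finite-difference inertia rate are my simplifications:

```python
import numpy as np

def omega_step(omega, I, I_prev, M_c, M_d, dt=0.01):
    """One Euler step of I*w_dot + I_dot*w + w x (I*w) = M_c + M_d.

    I_dot is approximated by finite-differencing the inertia matrix
    across the morphing shape change.
    """
    I_dot = (I - I_prev) / dt
    rhs = M_c + M_d - I_dot @ omega - np.cross(omega, I @ omega)
    return omega + dt * np.linalg.solve(I, rhs)
```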
Reinforcement Learning
Reinforcement Learning: Self Training

1. Actor takes an action based upon the states and the preference function
2. Critic updates the state value function and evaluates the action
3. Actor updates the preference function

Learning is done repetitively, by subjecting the agent to different scenarios
Reinforcement Learning

Actor-critic method
– On-policy TD method
– The actor selects actions
  Preference of an action: $p(s,a)$
  Greedy policy: $\pi_t(s,a) = \arg\max_a p(s,a)$
  Gibbs softmax policy: $\pi_t(s,a) = \Pr\{a_t = a \mid s_t = s\} = \dfrac{e^{p(s,a)}}{\sum_b e^{p(s,b)}}$
– The critic criticizes the actions
  State value function: $V(s_t)$
  TD error: $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
– Strengthen or weaken the tendency to select an action: $p(s_t,a_t) \leftarrow p(s_t,a_t) + \beta\,\delta_t$

[Diagram: Actor (Policy) and Critic (Value Function) interacting with the Environment; cost $(J_y, J_z)$, states $(F, y, z)$, actions $(V_y, V_z)$, TD error]
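A minimal tabular sketch of this actor-critic loop, with Gibbs softmax selection and the TD error driving both the critic and the actor; the table encodings and the step sizes $\alpha$, $\beta$, $\gamma$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_action(p, s):
    """Gibbs softmax: Pr{a_t = a | s_t = s} proportional to exp(p(s, a))."""
    w = np.exp(p[s] - p[s].max())          # shift for numerical stability
    return rng.choice(len(w), p=w / w.sum())

def actor_critic_update(p, V, s, a, r, s_next, gamma=0.95, alpha=0.1, beta=0.1):
    """One step: critic computes the TD error, then both tables move along it."""
    delta = r + gamma * V[s_next] - V[s]   # TD error
    V[s] += alpha * delta                  # critic: update state value function
    p[s, a] += beta * delta                # actor: update action preference
    return delta
```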
Q-Learning

Only has to learn the action-value function $Q(s,a)$
– How well the agent performs action $a$ in state $s$ under policy $\pi$
Q-learning is an off-policy temporal difference method
Proven convergence

Q-Learning():
  Initialize $Q(s,a)$ arbitrarily
  Repeat (for each episode):
    Initialize $s$
    Repeat (for each step of episode):
      Choose $a$ from $s$ using a policy derived from $Q$ (e.g., $\varepsilon$-greedy)
      Take action $a$, observe $r$, $s'$
      $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$
      $s \leftarrow s'$
    until $s$ is terminal
  return $Q$
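The pseudocode maps directly onto a tabular implementation. A sketch assuming a hypothetical `env` object whose `reset()` returns a state index and whose `step(a)` returns `(s_next, r, done)`:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=200,
               alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))            # initialize Q(s, a) arbitrarily
    for _ in range(episodes):                      # for each episode
        s, done = env.reset(), False
        while not done:                            # for each step of the episode
            a = (rng.integers(n_actions) if rng.random() < eps
                 else int(Q[s].argmax()))          # eps-greedy policy derived from Q
            s_next, r, done = env.step(a)          # take action a, observe r, s'
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```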
Function Approximation: K-Nearest Neighbor Policy Iteration (KNNPI)

The shape of the vehicle lives on continuous domains, so use the K-nearest neighbors method to approximate the action-value function $Q(s,a)$:
– Collect a set of state-action pair samples and learn $Q_{sample}(s_i,a_i)$ for each, using Q-learning
– For a new pair $(s_0,a_0)$, find $\{s_i\}$, the set of K nearest neighbors to $s_0$ in the sampled state space
– The action-value of the new pair is a weighted average over those neighbors, with weights determined by $Distance(s_0,a_0,s_i,a_i)$:
$Q(s_0,a_0) = \sum_{i=1}^{K} w_i\, Q_{sample}(s_i,a_i)$
– Susceptible to the absence of accurate information near the desired point
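A sketch of the interpolation step. The slides specify a distance-weighted average over the K nearest sampled neighbors but not the kernel, so the inverse-distance weighting here is an assumption:

```python
import numpy as np

def knn_q(s0_a0, samples, q_samples, K=5):
    """Interpolate Q(s0, a0) from the K nearest sampled state-action pairs.

    samples:   (N, d) array of sampled state-action pairs
    q_samples: (N,) array of their Q-learned action-values
    """
    d = np.linalg.norm(samples - s0_a0, axis=1)    # Distance(s0, a0, s_i, a_i)
    nearest = np.argsort(d)[:K]
    w = 1.0 / (d[nearest] + 1e-9)                  # inverse-distance weights (assumed kernel)
    return float(np.dot(w, q_samples[nearest]) / w.sum())
```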
Structured Adaptive Model Inversion Control
Adaptive Control
[Block diagram: model reference adaptive control. The reference $r$ drives both a Reference Model (output $y_m$) and the Controller; the Controller drives the Plant (output $y$); the error $e = y_m - y$ drives an Adaptation Law whose estimated parameters update the Controller]
Structured Model Reference Adaptive Control
Akella, Schaub, Junkins (Texas A&M)

Dynamics: 2nd-order differential equations
– Exact kinematic relationship between position and velocity: $\dot x = v$
– Acceleration-level relationships between forces and system parameters: $\dot v = a = \frac{F}{m}$, i.e. $F = ma$
Structured Adaptive Model Inversion
Subbarao, Junkins (Texas A&M)

Features
– Dynamic inversion inner-loop, with an MRAC outer-loop to handle system uncertainties
– Controls are solved for explicitly: for the system model $\dot x = A f(x) + B u$, reference trajectory $x_r$, and error $e = x - x_r$, the control law $u = B^{-1}(\dot x_r - A f(x) - \lambda e)$ makes the error dynamics $\dot e = -\lambda e$
– Undesirable dynamics are cancelled and replaced with user-specified desired dynamics
– Easily applicable to nonlinear systems
– Error dynamics can be specified
– Shown to be very effective for a wide variety of systems
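A sketch of the explicit control solve for this model class, assuming a square, invertible $B$ and a user-specified error gain $\lambda$:

```python
import numpy as np

def inversion_control(x, x_r, x_r_dot, f, A, B, lam=2.0):
    """Solve u = B^{-1}(x_r_dot - A f(x) - lam*e) for the model x_dot = A f(x) + B u,
    which forces the tracking error e = x - x_r to obey e_dot = -lam * e."""
    e = x - x_r
    return np.linalg.solve(B, x_r_dot - A @ f(x) - lam * e)
```

Substituting this u back into $\dot x = A f(x) + B u$ gives $\dot x - \dot x_r = -\lambda e$, i.e. the specified error dynamics.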
Structured Adaptive Model Inversion Features

Trajectory Tracking for Dynamic Systems
Plant:
– Nonlinear in states, affine in control; uncertain parameters appear linearly
Control:
– Dynamic Inversion and Sliding Mode Control
– Dynamic Inversion requires knowledge of system parameters, which are inherently uncertain
Adaptive Learning Parameters:
– Updated in real-time, and used for the Dynamic Inversion
Adaptation Mechanism:
– Driven by the error between the actual plant trajectory and the reference trajectory
Stability Analysis:
– Guarantees that the plant trajectory asymptotically converges to the reference trajectory in the presence of parametric uncertainties and initial condition errors
Structured Model & Minimal Parameterization

Kinematic Level Model: $\dot p = J_l v_c, \quad \dot\sigma = J_a\omega$
Acceleration Level Model: $m\dot v_c + m\,\omega\times v_c = F_c + F_d, \quad I\dot\omega + \dot I\omega + \omega\times I\omega = M_c + M_d$

Attitude Control: $I_a^*(\sigma)\ddot\sigma + C_a^*(\sigma,\dot\sigma)\dot\sigma = P_a^T(\sigma)\,M$

Minimal Parametrization of the Inertia Matrix: for any vector $a$,
$$\begin{bmatrix} I_{11} & I_{12} & I_{13}\\ I_{12} & I_{22} & I_{23}\\ I_{13} & I_{23} & I_{33}\end{bmatrix}\begin{bmatrix} a_1\\ a_2\\ a_3\end{bmatrix} = \begin{bmatrix} a_1 & 0 & 0 & a_2 & a_3 & 0\\ 0 & a_2 & 0 & a_1 & 0 & a_3\\ 0 & 0 & a_3 & 0 & a_1 & a_2\end{bmatrix}\begin{bmatrix} I_{11}\\ I_{22}\\ I_{33}\\ I_{12}\\ I_{13}\\ I_{23}\end{bmatrix}$$
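The identity is easy to check numerically. A sketch of the regressor $Y(a)$ with $\theta = [I_{11}, I_{22}, I_{33}, I_{12}, I_{13}, I_{23}]^T$ so that $I\,a = Y(a)\,\theta$:

```python
import numpy as np

def inertia_regressor(a):
    """Y(a) such that I @ a == Y(a) @ theta, theta = [I11, I22, I33, I12, I13, I23]."""
    a1, a2, a3 = a
    return np.array([[a1, 0.0, 0.0, a2, a3, 0.0],
                     [0.0, a2, 0.0, a1, 0.0, a3],
                     [0.0, 0.0, a3, 0.0, a1, a2]])

# Quick check against an arbitrary symmetric inertia matrix:
I11, I22, I33, I12, I13, I23 = 10.0, 12.0, 8.0, 0.5, -0.3, 0.2
I = np.array([[I11, I12, I13], [I12, I22, I23], [I13, I23, I33]])
theta = np.array([I11, I22, I33, I12, I13, I23])
a = np.array([0.1, -0.2, 0.3])
assert np.allclose(I @ a, inertia_regressor(a) @ theta)
```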
Control Law & Update Law

Using the minimal parametrization of the inertia matrix, with unknown parameters $\theta$:
$I_a^*(\sigma)\ddot\sigma + C_a^*(\sigma,\dot\sigma)\dot\sigma = Y(\sigma,\dot\sigma,\ddot\sigma)\,\theta$

With the control law
$M = P_a^{-T}\{Y(\sigma,\dot\sigma,\dot\sigma_r,\ddot\sigma_r)\hat\theta - C_{da}\dot\varepsilon - K_{da}\varepsilon\}$
the closed-loop dynamics take the form
$I_a^*\ddot\varepsilon + \{C_a^*(\sigma,\dot\sigma) + C_{da}\}\dot\varepsilon + K_{da}\varepsilon = Y(\sigma,\dot\sigma,\ddot\sigma)\,\tilde\theta$
and, along with the adaptive law $\dot{\hat\theta} = -\Gamma\,Y^T(\sigma,\dot\sigma,\dot\sigma_r,\ddot\sigma_r)\,\varepsilon$,
guarantees asymptotic stability of the tracking errors.
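A sketch of the adaptive law stepped forward in time; Euler integration is my simplification, and $\Gamma$, $Y$, and $\varepsilon$ follow the definitions above:

```python
import numpy as np

def adaptation_step(theta_hat, Y, eps, Gamma, dt=0.01):
    """Euler-integrate the adaptive law theta_hat_dot = -Gamma @ Y^T @ eps."""
    return theta_hat - dt * (Gamma @ Y.T @ eps)
```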
Adaptive–Reinforcement Learning Control
A-RLC Architecture
Numerical Example

[Plot: Flight Condition (0 to 6) vs. Flight Path Distance (0 to 100)]

Objective
– Demonstrate optimal shape morphing for multiple specified flight conditions
Method
– For every flight condition, learn the optimal policy that commands the voltage producing the optimal shape
– Minimize total cost over the entire flight trajectory
– Evaluate the learning performance after 200 learning episodes

The RL module is completely ignorant of the optimality relations and morphing control functions: it must learn on its own, from scratch
Numerical Example: Reinforcement Learning Definitions

Agent: Morphing air vehicle reinforcement learning module
Environment: Various flight conditions
Goal: Fly in the optimal shape that minimizes cost
States: Flight condition and shape of the vehicle: $S = \{(y,F)\in[2,4]\times[0,5];\ (z,F)\in[2,4]\times[0,5]\}$
Actions: Discrete voltages applied to change the shape of the vehicle; action set $A = \{(V_y,V_z): V_y\in\{0, 0.5, 1, \dots, 5\},\ V_z\in\{0, 0.5, 1, \dots, 4\}\}$
Rewards: Determined by the cost functions
Optimal control policy: Mapping from the state to the voltage leading to the optimal shape
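A sketch of the discretized action grid, assuming the 0.5 V increment implied by the listed values:

```python
import numpy as np

V_y = np.linspace(0.0, 5.0, 11)   # 0, 0.5, ..., 5 (assumed 0.5 V increment)
V_z = np.linspace(0.0, 4.0, 9)    # 0, 0.5, ..., 4 (assumed 0.5 V increment)
actions = [(vy, vz) for vy in V_y for vz in V_z]   # 99 discrete voltage pairs
```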
Learning Process
[Surface plots: error in the action preference function, over flight condition F and y dimension, after 20, 60, 100, and 200 learning episodes]
Example: Old Way
Comparison of True Optimal Shape and Learned Shape
KNN learns poorly for several flight conditions
Time Histories of Angular States
Time Histories of Adaptive Parameters
Trajectory Tracking Controls
Morphing Control Voltages
What Happened? (1)

F Still Equals d(mv)/dt
– Stiffness, frequency, damping ratio, and the cost function have a major effect on learning times and learning performance
– The slowest element in the system is critical

Some Assembly Required: Tuning
– Flight condition transitions, morphing dynamics, learning performance, and adaptive control must be balanced to achieve good performance
– Coordination and timing are everything
What Happened? (2)

Function Approximation
– Errors remained which could not be eliminated with additional training
– Use Galerkin-based Sequential Function Approximation (SFA) to approximate the action-value function $Q(s,a)$
Galerkin-Based Sequential Function Approximation (SFA)
– Ideal for approximating sparse multi-dimensional scattered data
– Adaptive, with no ad-hoc user parameters to adjust
– Provides information on the sensitivity to the inputs
– Matrix construction and evaluations are avoided
– Computational cost is reduced while efficiency is enhanced by the low-dimensional unconstrained optimization
Example: New Way
SFA learns the optimal shape well
Results: Normalized RMS Error

        Y dimension     Z dimension
KNN     1.42            0.821
SFA     1.27            0.661
        10% reduction   20% reduction
Conclusions

Improved function approximation provided a large improvement in performance
– The Galerkin-based Sequential Function Approximation method learns the optimal shape well

Shape changes for mission morphing can be treated as piecewise-constant parameter changes
– SAMI is a favorable method for trajectory tracking control

Morphing for control will require a different control strategy
– The piecewise-constant approximation is no longer valid
Future Research 1

Modify the morphing dynamics to represent SMA actuators
– Hysteretic behavior
Investigate control methodologies to handle faster shape changes
– Linear Parameter Varying (LPV) control
Investigate other function approximation methods
– Radial Basis Functions (GLO-MAP)
Modify the simulation to include a more advanced aircraft model
– Wing-Body, Wing-Body-Empennage, etc.
Future Research 2: Morphing Space-Based Antenna

STATE OF THE ART: Multiple antennas must be mounted on spacecraft to accommodate various ground station signals
RESEARCH GOALS: Demonstrate the feasibility of a reconfigurable antenna design that utilizes Reinforcement Learning to independently achieve optimal shape through the use of SMA actuators
BENEFITS: A single antenna capable of altering its geometry to achieve world-wide compatibility between receivers and transmitters
Future Research 3

Directly learn the voltage inputs required to achieve certain position states with time dependency
– Skips the mathematical modeling and computer simulation steps
– Avoids modeling and simulation errors
Questions?
Reinforcement Learning: Function Approximation

Parameterized functional form:
$V^\pi(s_t) = \sum_{j=1}^{N} \theta_{v_j}\,\Phi_j(s_t), \qquad p_a(s_t) = \sum_{j=1}^{N} \theta_{p_j}^{a}\,\Phi_j(s_t)$

Gradient descent learning, with both parameter vectors corrected along the evaluated basis functions $H_t$:
$\theta_v^t = \theta_v^{t-1} + \alpha\,\delta_t\,H_t^T, \qquad \theta_p^t = \theta_p^{t-1} + \beta\,\delta_t\,H_t^T$

The temporal difference error $\delta_t$ drives all learning in both actor and critic.
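A sketch of the critic half of this linear scheme; the basis evaluation $\Phi(s)$ is supplied by the caller, and the step size is illustrative:

```python
import numpy as np

def critic_update(theta_v, phi, phi_next, r, alpha=0.1, gamma=0.95):
    """TD(0) gradient step for the linear value function V(s) = theta_v . Phi(s)."""
    delta = r + gamma * theta_v @ phi_next - theta_v @ phi   # TD error
    return theta_v + alpha * delta * phi, delta
```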