Improved Adaptive-Reinforcement Learning Control for Morphing Unmanned Air Vehicles
James Doebbler, Monish D. Tandale, John Valasek (Texas A&M University)
Andrew J. Meade (Rice University)
AIAA Infotech@Aerospace Conference, 26-29 Sep 2005, Washington, DC
Overview
Brief Introduction to Morphing Air Vehicles
Adaptive-Reinforcement Learning Control Architecture
Simplified Model of a Morphing Vehicle
The Old Way
– Reinforcement Learning Module
– Structured Adaptive Model Inversion Control
– Numerical Example
The New Way
– Numerical Example
Results & Conclusions
Future Work
Student Research Team, 2005-2006
Aerospace Vehicle Systems Technology Program
[Figure: Advanced Concept Evolution roadmap, technology progress vs. time: SOA Metal Supercritical Wing, AE Tailored Wet Composite Wing, Composite Blended Wing, Composite Wing & Fuselage VLA, Active Aero & Structures, Low-Boom Supersonic Transport(s), Intelligent Personal Air Vehicles, Bio/Nano Self-Optimizing Aircraft]
Revolutionary Vehicles – Technologies

Today's Challenges:
• Develop light, strong, and structurally efficient air vehicles
• Improve aerodynamic efficiency
• Design fuel-efficient, low-emission propulsion systems
• Develop safe, fault-tolerant vehicle systems

Technology Solutions:
• Nanostructures: 100 times stronger than steel at 1/6 the weight
• Active flow control
• Distributed propulsion
• Electric propulsion, advanced fuel cells, high-efficiency electric motors
• Integrated advanced control systems and information technology
• Central “nervous system” and adaptive vehicle control
[Figures: Fuel Cell Propulsion (fuel and air in, exhaust out, anode catalyst, cathode catalyst, polymer electrolyte membrane, electric circuit), Active Flow Control, Adaptive Control, Nanotube]
Big Picture Research Goals
WHEN to reconfigure
HOW to reconfigure
LEARNING to reconfigure
Which Morphing?

Morphing for Mission Adaptation
– Large scale, relatively slow, in-flight shape change to enable a single vehicle to perform multiple diverse mission profiles

as opposed to:

Morphing for Control
– In-flight physical or virtual shape change to achieve multiple control objectives (maneuvering, flutter suppression, load alleviation, active separation control)

[Figure: Mission Adaptation vs. Control spectrum. John Davidson, NASA Langley, AFRL Morphing Controls Workshop, Feb 2004]
Our Approach

Make full use of the physical knowledge of the problem:
– Machine Learning: learning a reconfiguration policy
– Adaptive Control: adapting parameters in a known functional relationship

Learn when to reconfigure with machine learning
Control Architecture
– Synthetic jets for virtual shaping and separation control
– Multi-sensor MEMS arrays for flow control feedback
[Diagram: Adaptive Controller interacting with the Environment through Sensed Information Aggregation, Control Information Distribution, Reconfiguration Command Generation, System Performance Evaluation, and a Knowledge Base]
Adaptive-Reinforcement Learning Control (A-RLC)
Conceptual Control Architecture for Reconfigurable Aircraft

SAMI: Structured Adaptive Model Inversion (Traditional Control)
– Flight controller to handle wide variation in dynamic properties due to shape change

RL: Reinforcement Learning (Intelligent Control)
– Learn the morphing dynamics and the optimal shape at every flight condition in real-time
Morphing Vehicle Model
[Images: Lockheed Martin and NextGen morphing vehicle concepts]
Morphing Vehicle Evolution
2003: 2-D Plate | 2004: Rectangular Block | 2005: Ellipsoid | 2006: Delta Wing | Final Objective
Morphing Vehicle - TiiMY

Shape
– Ellipsoidal shape with varying axis dimensions
– Constant volume $V$ during morphing: $V = \frac{\pi}{6} x y z$
– 2 independent variables, y and z; the dependent dimension is $x = \frac{6V}{\pi y z}$

Morphing Dynamics
– Smart material: carbon nano-tubes or shape memory alloy
– Simple nonlinear differential equations: $\ddot y + \dot y\,y = V_y, \qquad \ddot z + \dot z\,z = V_z$

[Figure: ellipsoid with the y and z axis dimensions labeled]
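These dynamics are simple enough to step forward directly. A minimal simulation sketch, assuming the reconstructed form $\ddot y + \dot y\,y = V_y$ above and a fixed-step Euler integrator; the step size and the constant voltage command are illustrative, not from the paper:

```python
def morph_step(y, y_dot, V_y, dt=0.01):
    """One Euler step of the reconstructed morphing dynamics y'' + y'*y = V_y."""
    y_ddot = V_y - y_dot * y
    return y + dt * y_dot, y_dot + dt * y_ddot

# Drive the y-axis dimension with a constant voltage command (illustrative values).
y, y_dot = 3.0, 0.0
for _ in range(1000):
    y, y_dot = morph_step(y, y_dot, V_y=1.0)
print(y)  # y dimension after 10 s of simulated morphing
```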
Shape Morphing Animation - TiiMY
Morphing Time Histories
Optimal Shapes at Various Flight Conditions

Optimality is defined by identifying a cost function $J = J(\text{current shape}, \text{flight condition})$:
$J = J_y + J_z = (y - S_y(F))^2 + (z - S_z(F))^2$
where the optimal dimensions as functions of flight condition $F$ are
$S_y(F) = 3 + \frac{\cos(\pi F)}{2} \quad \text{and} \quad S_z(F) = 2 + \frac{e^{-0.5F}}{2}$
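Since $S_y(F)$ and $S_z(F)$ fully determine the optimum, evaluating the cost is direct. A sketch under the reconstruction above; the function names are mine:

```python
import math

def optimal_shape(F):
    """Reconstructed optimal y- and z-dimensions as functions of flight condition F."""
    S_y = 3.0 + math.cos(math.pi * F) / 2.0
    S_z = 2.0 + math.exp(-0.5 * F) / 2.0
    return S_y, S_z

def cost(y, z, F):
    """J = J_y + J_z: squared distance of the current shape from the optimal shape."""
    S_y, S_z = optimal_shape(F)
    return (y - S_y) ** 2 + (z - S_z) ** 2
```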
6-DOF Mathematical Model for Dynamic Behavior

Variables: position $p = [d_x\ d_y\ d_z]^T$, body-axis velocity $v_c = [u\ v\ w]^T$, attitude $\sigma = [\phi\ \theta\ \psi]^T$, angular rate $\omega = [p\ q\ r]^T$

Nonlinear 6-DOF Equations
– Kinematic level: $\dot p = J_l v_c, \quad \dot\sigma = J_a \omega$
– Acceleration level: $m\dot v_c + m\,\omega\times v_c = F_c + F_d, \quad I\dot\omega + \dot I\omega + \omega\times I\omega = M_c + M_d$, where the $\dot I\omega$ term captures the additional dynamics due to morphing

Drag Force
– Function of air density, the square of the velocity along an axis, and the projected area of the vehicle perpendicular to that axis
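The $\dot I\omega$ term is what distinguishes the morphing vehicle from an ordinary rigid body. A sketch of one integration step of the rotational dynamics, where Euler integration and the finite-difference inertia rate are my simplifications:

```python
import numpy as np

def omega_step(omega, I, I_prev, M_c, M_d, dt=0.01):
    """One Euler step of I*w_dot + I_dot*w + w x (I*w) = M_c + M_d.

    I_dot is approximated by finite-differencing the inertia matrix
    across the morphing shape change.
    """
    I_dot = (I - I_prev) / dt
    rhs = M_c + M_d - I_dot @ omega - np.cross(omega, I @ omega)
    return omega + dt * np.linalg.solve(I, rhs)
```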
Reinforcement Learning
Reinforcement Learning: Self Training

1. Actor takes an action based upon the states and the preference function
2. Critic updates the state value function and evaluates the action
3. Actor updates the preference function

Learning is done repetitively, by subjecting the agent to different scenarios
Reinforcement Learning

Actor-critic method
– On-policy TD method
– The actor selects actions
  Preference of an action: $p(s,a)$
  Greedy policy: $\pi_t(s,a) = \arg\max_a p(s,a)$
  Gibbs softmax policy: $\pi_t(s,a) = \Pr\{a_t = a \mid s_t = s\} = \dfrac{e^{p(s,a)}}{\sum_b e^{p(s,b)}}$
– The critic criticizes the actions
  State value function: $V(s_t)$
  TD error: $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
– Strengthen or weaken the tendency to select an action: $p(s_t,a_t) \leftarrow p(s_t,a_t) + \beta\,\delta_t$

[Diagram: Actor (Policy) and Critic (Value Function) interacting with the Environment; cost $(J_y, J_z)$, states $(F, y, z)$, actions $(V_y, V_z)$, TD error]
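A minimal tabular sketch of this actor-critic loop, with Gibbs softmax selection and the TD error driving both the critic and the actor; the table encodings and the step sizes $\alpha$, $\beta$, $\gamma$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_action(p, s):
    """Gibbs softmax: Pr{a_t = a | s_t = s} proportional to exp(p(s, a))."""
    w = np.exp(p[s] - p[s].max())          # shift for numerical stability
    return rng.choice(len(w), p=w / w.sum())

def actor_critic_update(p, V, s, a, r, s_next, gamma=0.95, alpha=0.1, beta=0.1):
    """One step: critic computes the TD error, then both tables move along it."""
    delta = r + gamma * V[s_next] - V[s]   # TD error
    V[s] += alpha * delta                  # critic: update state value function
    p[s, a] += beta * delta                # actor: update action preference
    return delta
```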
Q-Learning

Only has to learn the action-value function $Q(s,a)$
– How well the agent performs action $a$ in state $s$ under policy $\pi$
Q-learning is an off-policy temporal difference method
Proven convergence

Q-Learning():
  Initialize $Q(s,a)$ arbitrarily
  Repeat (for each episode):
    Initialize $s$
    Repeat (for each step of episode):
      Choose $a$ from $s$ using a policy derived from $Q$ (e.g., $\varepsilon$-greedy)
      Take action $a$, observe $r$, $s'$
      $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$
      $s \leftarrow s'$
    until $s$ is terminal
  return $Q$
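The pseudocode maps directly onto a tabular implementation. A sketch assuming a hypothetical `env` object whose `reset()` returns a state index and whose `step(a)` returns `(s_next, r, done)`:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=200,
               alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))            # initialize Q(s, a) arbitrarily
    for _ in range(episodes):                      # for each episode
        s, done = env.reset(), False
        while not done:                            # for each step of the episode
            a = (rng.integers(n_actions) if rng.random() < eps
                 else int(Q[s].argmax()))          # eps-greedy policy derived from Q
            s_next, r, done = env.step(a)          # take action a, observe r, s'
            target = r if done else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```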
Function Approximation: K-Nearest Neighbor Policy Iteration (KNNPI)

The shape of the vehicle lives on continuous domains, so use the K-nearest neighbors method to approximate the action-value function $Q(s,a)$:
– Collect a set of state-action pair samples and learn $Q_{sample}(s_i,a_i)$ for each, using Q-learning
– For a new pair $(s_0,a_0)$, find $\{s_i\}$, the set of K nearest neighbors to $s_0$ in the sampled state space
– The action-value of the new pair is a weighted average over those neighbors, with weights determined by $Distance(s_0,a_0,s_i,a_i)$:
$Q(s_0,a_0) = \sum_{i=1}^{K} w_i\, Q_{sample}(s_i,a_i)$
– Susceptible to the absence of accurate information near the desired point
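A sketch of the interpolation step. The slides specify a distance-weighted average over the K nearest sampled neighbors but not the kernel, so the inverse-distance weighting here is an assumption:

```python
import numpy as np

def knn_q(s0_a0, samples, q_samples, K=5):
    """Interpolate Q(s0, a0) from the K nearest sampled state-action pairs.

    samples:   (N, d) array of sampled state-action pairs
    q_samples: (N,) array of their Q-learned action-values
    """
    d = np.linalg.norm(samples - s0_a0, axis=1)    # Distance(s0, a0, s_i, a_i)
    nearest = np.argsort(d)[:K]
    w = 1.0 / (d[nearest] + 1e-9)                  # inverse-distance weights (assumed kernel)
    return float(np.dot(w, q_samples[nearest]) / w.sum())
```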
Structured Adaptive Model Inversion Control
Adaptive Control
[Block diagram: model reference adaptive control. The reference $r$ drives both a Reference Model (output $y_m$) and the Controller; the Controller drives the Plant (output $y$); the error $e = y_m - y$ drives an Adaptation Law whose estimated parameters update the Controller]
Structured Model Reference Adaptive Control
Akella, Schaub, Junkins (Texas A&M)

Dynamics: 2nd-order differential equations
– Exact kinematic relationship between position and velocity: $\dot x = v$
– Acceleration-level relationships between forces and system parameters: $\dot v = a = \frac{F}{m}$, i.e. $F = ma$
Structured Adaptive Model Inversion
Subbarao, Junkins (Texas A&M)

Features
– Dynamic inversion inner-loop, with an MRAC outer-loop to handle system uncertainties
– Controls are solved for explicitly: for the system model $\dot x = A f(x) + B u$, reference trajectory $x_r$, and error $e = x - x_r$, the control law $u = B^{-1}(\dot x_r - A f(x) - \lambda e)$ makes the error dynamics $\dot e = -\lambda e$
– Undesirable dynamics are cancelled and replaced with user-specified desired dynamics
– Easily applicable to nonlinear systems
– Error dynamics can be specified
– Shown to be very effective for a wide variety of systems
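A sketch of the explicit control solve for this model class, assuming a square, invertible $B$ and a user-specified error gain $\lambda$:

```python
import numpy as np

def inversion_control(x, x_r, x_r_dot, f, A, B, lam=2.0):
    """Solve u = B^{-1}(x_r_dot - A f(x) - lam*e) for the model x_dot = A f(x) + B u,
    which forces the tracking error e = x - x_r to obey e_dot = -lam * e."""
    e = x - x_r
    return np.linalg.solve(B, x_r_dot - A @ f(x) - lam * e)
```

Substituting this u back into $\dot x = A f(x) + B u$ gives $\dot x - \dot x_r = -\lambda e$, i.e. the specified error dynamics.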
Structured Adaptive Model Inversion Features

Trajectory Tracking for Dynamic Systems
Plant:
– Nonlinear in states, affine in control; uncertain parameters appear linearly
Control:
– Dynamic Inversion and Sliding Mode Control
– Dynamic Inversion requires knowledge of system parameters, which are inherently uncertain
Adaptive Learning Parameters:
– Updated in real-time, and used for the Dynamic Inversion
Adaptation Mechanism:
– Driven by the error between the actual plant trajectory and the reference trajectory
Stability Analysis:
– Guarantees that the plant trajectory asymptotically converges to the reference trajectory in the presence of parametric uncertainties and initial condition errors
Structured Model & Minimal Parameterization

Kinematic Level Model: $\dot p = J_l v_c, \quad \dot\sigma = J_a\omega$
Acceleration Level Model: $m\dot v_c + m\,\omega\times v_c = F_c + F_d, \quad I\dot\omega + \dot I\omega + \omega\times I\omega = M_c + M_d$

Attitude Control: $I_a^*(\sigma)\ddot\sigma + C_a^*(\sigma,\dot\sigma)\dot\sigma = P_a^T(\sigma)\,M$

Minimal Parametrization of the Inertia Matrix: for any vector $a$,
$$\begin{bmatrix} I_{11} & I_{12} & I_{13}\\ I_{12} & I_{22} & I_{23}\\ I_{13} & I_{23} & I_{33}\end{bmatrix}\begin{bmatrix} a_1\\ a_2\\ a_3\end{bmatrix} = \begin{bmatrix} a_1 & 0 & 0 & a_2 & a_3 & 0\\ 0 & a_2 & 0 & a_1 & 0 & a_3\\ 0 & 0 & a_3 & 0 & a_1 & a_2\end{bmatrix}\begin{bmatrix} I_{11}\\ I_{22}\\ I_{33}\\ I_{12}\\ I_{13}\\ I_{23}\end{bmatrix}$$
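The identity is easy to check numerically. A sketch of the regressor $Y(a)$ with $\theta = [I_{11}, I_{22}, I_{33}, I_{12}, I_{13}, I_{23}]^T$ so that $I\,a = Y(a)\,\theta$:

```python
import numpy as np

def inertia_regressor(a):
    """Y(a) such that I @ a == Y(a) @ theta, theta = [I11, I22, I33, I12, I13, I23]."""
    a1, a2, a3 = a
    return np.array([[a1, 0.0, 0.0, a2, a3, 0.0],
                     [0.0, a2, 0.0, a1, 0.0, a3],
                     [0.0, 0.0, a3, 0.0, a1, a2]])

# Quick check against an arbitrary symmetric inertia matrix:
I11, I22, I33, I12, I13, I23 = 10.0, 12.0, 8.0, 0.5, -0.3, 0.2
I = np.array([[I11, I12, I13], [I12, I22, I23], [I13, I23, I33]])
theta = np.array([I11, I22, I33, I12, I13, I23])
a = np.array([0.1, -0.2, 0.3])
assert np.allclose(I @ a, inertia_regressor(a) @ theta)
```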
Control Law & Update Law

Using the minimal parametrization of the inertia matrix, with unknown parameters $\theta$:
$I_a^*(\sigma)\ddot\sigma + C_a^*(\sigma,\dot\sigma)\dot\sigma = Y(\sigma,\dot\sigma,\ddot\sigma)\,\theta$

With the control law
$M = P_a^{-T}\{Y(\sigma,\dot\sigma,\dot\sigma_r,\ddot\sigma_r)\hat\theta - C_{da}\dot\varepsilon - K_{da}\varepsilon\}$
the closed-loop dynamics take the form
$I_a^*\ddot\varepsilon + \{C_a^*(\sigma,\dot\sigma) + C_{da}\}\dot\varepsilon + K_{da}\varepsilon = Y(\sigma,\dot\sigma,\ddot\sigma)\,\tilde\theta$
and, along with the adaptive law $\dot{\hat\theta} = -\Gamma\,Y^T(\sigma,\dot\sigma,\dot\sigma_r,\ddot\sigma_r)\,\varepsilon$,
guarantees asymptotic stability of the tracking errors.
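A sketch of the adaptive law stepped forward in time; Euler integration is my simplification, and $\Gamma$, $Y$, and $\varepsilon$ follow the definitions above:

```python
import numpy as np

def adaptation_step(theta_hat, Y, eps, Gamma, dt=0.01):
    """Euler-integrate the adaptive law theta_hat_dot = -Gamma @ Y^T @ eps."""
    return theta_hat - dt * (Gamma @ Y.T @ eps)
```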
Adaptive–Reinforcement Learning Control
A-RLC Architecture
Numerical Example

[Plot: Flight Condition (0 to 6) vs. Flight Path Distance (0 to 100)]

Objective
– Demonstrate optimal shape morphing for multiple specified flight conditions
Method
– For every flight condition, learn the optimal policy that commands the voltage producing the optimal shape
– Minimize total cost over the entire flight trajectory
– Evaluate the learning performance after 200 learning episodes

The RL module is completely ignorant of the optimality relations and morphing control functions: it must learn on its own, from scratch
Numerical Example: Reinforcement Learning Definitions

Agent: Morphing air vehicle reinforcement learning module
Environment: Various flight conditions
Goal: Fly in the optimal shape that minimizes cost
States: Flight condition and shape of the vehicle: $S = \{(y,F)\in[2,4]\times[0,5];\ (z,F)\in[2,4]\times[0,5]\}$
Actions: Discrete voltages applied to change the shape of the vehicle; action set $A = \{(V_y,V_z): V_y\in\{0, 0.5, 1, \dots, 5\},\ V_z\in\{0, 0.5, 1, \dots, 4\}\}$
Rewards: Determined by the cost functions
Optimal control policy: Mapping from the state to the voltage leading to the optimal shape
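A sketch of the discretized action grid, assuming the 0.5 V increment implied by the listed values:

```python
import numpy as np

V_y = np.linspace(0.0, 5.0, 11)   # 0, 0.5, ..., 5 (assumed 0.5 V increment)
V_z = np.linspace(0.0, 4.0, 9)    # 0, 0.5, ..., 4 (assumed 0.5 V increment)
actions = [(vy, vz) for vy in V_y for vz in V_z]   # 99 discrete voltage pairs
```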
Learning Process
[Surface plots: error in the action preference function, over flight condition F and y dimension, after 20, 60, 100, and 200 learning episodes]
Example: Old Way
Comparison of True Optimal Shape and Learned Shape
KNN learns poorly for several flight conditions
Time Histories of Angular States
Time Histories of Adaptive Parameters
Trajectory Tracking Controls
Morphing Control Voltages
What Happened? (1)

F Still Equals d(mv)/dt
– Stiffness, frequency, damping ratio, and the cost function have a major effect on learning times and learning performance
– The slowest element in the system is critical

Some Assembly Required: Tuning
– Flight condition transitions, morphing dynamics, learning performance, and adaptive control must be balanced to achieve good performance
– Coordination and timing are everything
What Happened? (2)

Function Approximation
– Errors remained which could not be eliminated with additional training
– Use Galerkin-based Sequential Function Approximation (SFA) to approximate the action-value function $Q(s,a)$
Galerkin-Based Sequential Function Approximation (SFA)
– Ideal for approximating sparse multi-dimensional scattered data
– Adaptive, with no ad-hoc user parameters to adjust
– Provides information on the sensitivity to the inputs
– Matrix construction and evaluations are avoided
– Computational cost is reduced while efficiency is enhanced by the low-dimensional unconstrained optimization
Example: New Way
SFA learns the optimal shape well
Results: Normalized RMS Error

        Y dimension     Z dimension
KNN     1.42            0.821
SFA     1.27            0.661
        10% reduction   20% reduction
Conclusions

Improved function approximation provided a large improvement in performance
– The Galerkin-based Sequential Function Approximation method learns the optimal shape well

Shape changes for mission morphing can be treated as piecewise-constant parameter changes
– SAMI is a favorable method for trajectory tracking control

Morphing for control will require a different control strategy
– The piecewise-constant approximation is no longer valid
Future Research 1

Modify the morphing dynamics to represent SMA actuators
– Hysteretic behavior
Investigate control methodologies to handle faster shape changes
– Linear Parameter Varying (LPV) control
Investigate other function approximation methods
– Radial Basis Functions (GLO-MAP)
Modify the simulation to include a more advanced aircraft model
– Wing-Body, Wing-Body-Empennage, etc.
Future Research 2: Morphing Space-Based Antenna

STATE OF THE ART: Multiple antennas must be mounted on spacecraft to accommodate various ground station signals
RESEARCH GOALS: Demonstrate the feasibility of a reconfigurable antenna design that utilizes Reinforcement Learning to independently achieve optimal shape through the use of SMA actuators
BENEFITS: A single antenna capable of altering its geometry to achieve world-wide compatibility between receivers and transmitters
Future Research 3

Directly learn the voltage inputs required to achieve certain position states with time dependency
– Skips the mathematical modeling and computer simulation steps
– Avoids modeling and simulation errors
Questions?
Reinforcement Learning: Function Approximation

Parameterized functional form:
$V^\pi(s_t) = \sum_{j=1}^{N} \theta_{v_j}\,\Phi_j(s_t), \qquad p_a(s_t) = \sum_{j=1}^{N} \theta_{p_j}^{a}\,\Phi_j(s_t)$

Gradient descent learning, with both parameter vectors corrected along the evaluated basis functions $H_t$:
$\theta_v^t = \theta_v^{t-1} + \alpha\,\delta_t\,H_t^T, \qquad \theta_p^t = \theta_p^{t-1} + \beta\,\delta_t\,H_t^T$

The temporal difference error $\delta_t$ drives all learning in both actor and critic.
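A sketch of the critic half of this linear scheme; the basis evaluation $\Phi(s)$ is supplied by the caller, and the step size is illustrative:

```python
import numpy as np

def critic_update(theta_v, phi, phi_next, r, alpha=0.1, gamma=0.95):
    """TD(0) gradient step for the linear value function V(s) = theta_v . Phi(s)."""
    delta = r + gamma * theta_v @ phi_next - theta_v @ phi   # TD error
    return theta_v + alpha * delta * phi, delta
```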