
Rapid Prototyping of Deep Learning Models on Radiation Hardened CPUs

P. Blacker, University of Surrey, Guildford, UK

C. P. Bridges, University of Surrey, Guildford, UK

S. Hadfield, University of Surrey, Guildford, UK

Abstract—Interest is increasing in the use of neural networks and deep learning for on-board processing tasks in the space industry [1]. However, development has lagged behind terrestrial applications for several reasons: space qualified computers have significantly less processing power than their terrestrial equivalents, reliability requirements are more stringent than for the majority of applications deep learning is currently used for, and the long requirements, design and qualification cycles in much of the space industry slow the adoption of recent developments.

GPUs are the first hardware choice for implementing neural networks on terrestrial computers, however no radiation hardened equivalent parts are currently available. Field Programmable Gate Array devices are capable of efficiently implementing neural networks and radiation hardened parts are available, however the process to deploy and validate an inference network is non-trivial and robust tools that automate the process are not available.

We present an open source tool chain that can automatically deploy a trained inference network from the TensorFlow framework directly to the LEON 3, and an industrial case study of the design process used to train and optimise a deep-learning model for this processor. This does not directly remove the three challenges described above, however it greatly accelerates prototyping and analysis of neural network solutions, allowing these options to be considered more easily than is currently possible.

Future improvements to the tools are identified, along with a summary of some of the obstacles to using neural networks and potential solutions to these in the future.

Index Terms—LEON, TensorFlow, on-board, planetary rover, CNN, deep learning, autonomy.

I. INTRODUCTION

Recent advances in the science and application of neural networks have been largely driven by improvements in the tools, both hardware and software, used to design and deploy these solutions. Powerful frameworks now exist to design and train network models, as well as automated tools to deploy trained inference models on a range of hardware targets. However, these tools are aimed at terrestrial hardware: desktop computers, mobile phones, and embedded computers. These targets are significantly more powerful than the current generation of radiation hardened space processors, the LEON 3 and RAD 750 [2], [3], meaning these tools are not able to target these processors.

Especially early in the life-cycle of a space mission, when low TRL (Technology Readiness Level) feasibility studies are being undertaken, the lack of automated development tools increases the difficulty of evaluating potential deep-learning options. Tools to quickly design and train networks do exist, however analysts also need to determine upper bounds on the CPU time and memory resources required by a model. At a time of growing interest in the use of deep learning on-board space missions there is a need for tools that make their analysis and implementation on standard space processors as straightforward as possible. It is hoped the TFMin library will enable early analysis of deep-learning solutions during the requirements stage, encouraging their adoption in final spacecraft designs.

TFMin has been developed during our own work investigating the use of deep learning for processing on-board Mars rovers, and serves to address this gap in the available tools. Network design and training has been conducted using the TF (TensorFlow) framework on desktop computers with powerful GPUs; the TFMin library extends these tools so that we can automatically deploy our trained inference networks onto the LEON 3 platform and analyse their memory and CPU requirements. This has allowed us to quickly analyse different network topologies and the effects of changes to the processor, such as instruction and data cache sizes.

Our work focusses on the LEON family of SPARC V8 processors, however the TFMin tool can be used with a wider range of targets: it generates C++11 code which is dependent upon the Eigen linear algebra library [4]. This means it can be deployed on any target platform that has a C++11 compliant compiler, including BAE's RAD 750 space processor as well as many smaller processors commonly used on nano and micro satellites. During testing, small TF models have been successfully deployed to the Teensy 3.2 development board (ARM Cortex-M4) [5]; from this we conclude it may be possible to deploy similar models to a wide range of small embedded computers, including the low cost ATMEL PA32 radiation hardened processor which has flight heritage, or the low cost Vorago rad-hard ARM Cortex-M0 [6]. We intend TFMin to be a valuable tool for OBDH developers, enabling them to quickly generate accurate performance metrics for neural networks on a wide range of processors used in space, as well as generating prototype C++ code which can be used as a starting point for the development of actual flight software.

This paper summarises work on the use of neural network models in space applications and the development of tools to automatically deploy these networks onto a range of hardware targets. We then introduce the TFMin library and describe its architecture and the internal optimisations that it uses. Finally we describe the terrain evaluation task we have used the library for, as a design and optimisation case study for deep learning solutions on the LEON 3 processor.

Deep learning inference networks have become a common part of terrestrial systems today; large, continuously trained models generally reside inside server clusters, but increasingly frozen inference models are being deployed onto mobile and embedded devices. TF officially supports two frameworks for the deployment of trained inference models: TensorFlow-Lite, and the TensorFlow XLA (Accelerated Linear Algebra) AOT (Ahead of Time) compiler. TensorFlow-Lite is a cut down version of the full TF library aimed at mobile phone grade computers; as such it is still too large for many embedded systems, having a library overhead of more than 100 MB. This framework has been used by [7] [8] to successfully process actual orbital imagery on a representative flight proven single board computer. It should be noted however that the more powerful Zynq chip (dual core ARM Cortex-A9) used to execute the inference was not rad-hard; it was merely protected by the less powerful rad-hard components on the CHREC Space Processor v1 board, limiting this application to lower radiation orbits near Earth or short duration missions. The only rad-hard components with a similar level of computing power are the latest generation of processors (RAD 5500 [9], LEON 4 [10]), which currently lack flight heritage and come at a high price. As such, TensorFlow-Lite deployment is currently only feasible for LEO missions and for high-risk, high-value GEO and deep space missions.

TensorFlow XLA is an LLVM (Low Level Virtual Machine) compiler tool which generates machine code directly from a trained model; the resulting compiled static libraries can be linked into larger projects with minimal library overheads. This makes it ideally suited to deploying NN (Neural Network) models on radiation hardened processors, and due to its use of the versatile LLVM compiler architecture [11] it is able to produce code for a range of CPU targets. The SPARC V8 architecture of the LEON family of processors is not officially supported by XLA, and after some initially promising results the authors failed to successfully integrate it with the Gaisler LLVM compiler [12]. Gaisler's Clang compiler is built on LLVM version 4.0, while the TF XLA compiler generates IR (Intermediate Representation) using features only present in versions 5.0 and above. By taking the LLVM IR produced by the XLA compiler and building it using the Gaisler Clang compiler, simple fully connected networks were successfully deployed to the LEON 3. However, the authors did not manage to compile working versions of large or convolutional networks using this technique. The XLA compiler is considered experimental by the TensorFlow team and it cannot simply be linked to the Gaisler tool chain, though this may change in the future. Another drawback of using XLA is the challenge of integration with aerospace V&V (Verification & Validation) practices: since it generates machine code directly, its output is difficult to inspect. It also does not produce human-editable code which could be used as a basis for developing formal flight software, so a different solution is preferable for aerospace applications.

MATLAB's ML (Machine Learning) toolbox provides two options for deploying trained inference models onto embedded systems: MATLAB Coder generates equivalent C++ code, while MathWorks HDL (Hardware Description Language) Coder can deploy models directly to FPGAs (Field Programmable Gate Arrays). Code automatically generated from MATLAB has significant runtime library dependencies, of a size comparable to TensorFlow-Lite, making it appropriate for the same range of hardware targets. MathWorks HDL Coder, although targeting lower level hardware than CPUs, is an interesting new option for analysing deep-learning models on space qualified hardware. FPGA implementations of DL models have the potential to deliver significantly higher performance than CPU implementations, although this approach is still an active area of research with open questions that need to be answered.

In addition to these tools there is always the option of manually implementing a NN model on any platform, and for smaller networks this may be practical. The fully-connected MNIST classification example model [13] included with TFMin is made of seven operations, so could easily be hand coded; SqueezeNet, on the other hand, is made of 93 operations, while other deep learning models have many hundreds. Importantly, these deeper models are often the most efficient way to solve complex problems, and given the need to iterate the design of DL models, manual implementation quickly becomes impractical.

Fig. 1. Common terrestrial and space processors and the tools able to automatically deploy neural networks to them.

Figure 1 shows a range of terrestrial and space qualified CPU processors and which automated NN deployment tools support them. Unsurprisingly, recent mobile phone grade processors, which include the next generation of radiation hardened devices, are well supported, being the main target of these tools. We can see that support for the current generation of radiation hardened devices is lacking, which is critical because these processors are, and for a few years will continue to be, the most common choice for new spacecraft. There is a need for a new tool to fill this gap for two reasons: the TF XLA compiler does not currently support the LEON family of processors, and even when it does, its use will not fit into current coding standards for flight software. We present the TFMin tool, which addresses both of these limitations by avoiding the need for XLA and providing validated C++ code which can then be passed on to a flight software development team.

II. TFMIN - GENERATING C++ FROM TENSORFLOW MODELS

The TFMin tool automatically converts trained NN models developed using TF into verifiable C++ code which can then be built for any system with a C++11 compliant compiler, including the LEON family of processors used on many geostationary and deep space missions. Being able to quickly deploy and analyse DNNs on computers representative of flight hardware allows developers to relate high level network changes to low level performance without the need to manually implement the network model. This conversion is done verbatim, so the C++ version is mathematically identical to the TF original; any quantisation and retraining optimisation would need to be done in TF before the model is converted. Additionally, unlike the LLVM intermediate representation generated by the XLA AOT compiler, TFMin generates human readable C++ code which can be independently verified and used as a starting point for formal flight code development.

TF is a functional programming tool hiding within a procedural Python/C++ framework: neural network models are created as large functions known as flow-graphs. These flow-graphs are networks of functions which pass n-dimensional tensors between each other, hence the name TensorFlow. The power of TF is its capability to automatically accelerate the evaluation of these flow-graphs using massively parallel GPUs, which is essential for training and, in many cases, for inference of large DNNs (deep neural networks).

Since DNNs are represented as functional flow-graphs they are mobile data structures, as opposed to algorithms hard-coded into source code. A Python program can procedurally generate flow-graphs, and it can also save, load, optimise, and convert them. This capability allows the TFMin library to extend TF with the ability to convert DNN flow-graphs into equivalent C++ code.

When TF evaluates a flow-graph the required output tensors are specified; these outputs are then traced backwards through the network to determine which operations need to be evaluated to compute the solution. Because of this, a single flow graph can be used for both inference and training: evaluating the training-accuracy output will execute a batch training run, while evaluating the estimate output will execute an inference run.

TFMin uses the same approach: a flow graph and a set of output tensors are provided, and the library works out which operations, weights, and inputs are required, ignoring any training or introspection operations. Model weights are written into a C++ header file as array literals, while the required operations from the flow-graph are built into a C++ object, as shown in Figure 2. As well as a standard inference method, two optional methods can be generated for verification and timing. Verification methods include input data along with the expected results of each operation, allowing the generated C++ model to validate itself against the TF original. Timing methods can be used the same way as the standard inference method while additionally providing total and per-operation execution times.
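The backward trace itself is a standard graph traversal. The sketch below illustrates the idea, assuming only that each tensor records the operation that produces it (tensor.op) and that each operation lists its input tensors (op.inputs), as TF graph objects do; it illustrates the selection step rather than TFMin's actual implementation.

def required_operations(output_tensors):
    """Collect the set of operations needed to evaluate the given outputs.

    Walks the flow-graph backwards from each requested output tensor, so any
    operation the outputs do not depend on (training steps, summaries, etc.)
    is never visited and therefore never exported.
    """
    required = set()
    stack = [t.op for t in output_tensors]
    while stack:
        op = stack.pop()
        if op in required:
            continue
        required.add(op)
        stack.extend(inp.op for inp in op.inputs)
    return required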

Fig. 2. Architecture of the TFMin TensorFlow to C++ code generator.

TF operations are converted into C++ code using a dictionary of op kernels: objects which match themselves to TF operations and generate equivalent code. The list of these op kernels has been designed to be peripheral to the core library, so it can easily be extended and optimised by developers to fine tune the generated code for their particular projects.
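The matching-and-emitting pattern can be illustrated as follows; the class and method names here are purely hypothetical, and the actual op kernel interface is defined by the TFMin base classes in the repository [14].

class IllustrativeReluKernel:
    """Hypothetical op kernel: claims a TF operation and emits C++ for it."""

    def matches(self, op):
        # Claim only the TF operations this kernel knows how to translate.
        return op.type == 'Relu'

    def generate_cpp(self, op, input_var, output_var):
        # Emit equivalent Eigen-based C++ for the matched operation.
        return "%s = %s.cwiseMax(0.0f);" % (output_var, input_var)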

A. Adding TFMin Exporter to a Project

There are two parts to the TFMin library: a Python module used to analyse flow graphs and convert them to code, and a header-only C++ library containing the virtual base-class that generated model classes are derived from. Integrating the library with an existing TF project in Python has been made as simple as possible: Listing 1 shows how an existing TF model can be exported to C++ using only two lines of Python code.

Listing 1. Python example exporting a trained model from TensorFlow to C++.

import tf_min as tfm

# Setup flow-graph and
# load or train model weights
# ...

gen = tfm.Exporter(flow_graph, ['output0', 'output1'])

gen.generate("cnn_model", "cnnModel", layout='RowMajor')


The TF flow graph object and a list of output tensor names are all that is needed; running the code above will generate three files: cnn_model_data.h containing the model weights, and cnn_model.cpp & cnn_model.h which define an object encapsulating the NN model itself.

These three files can then be included in a larger C++ application or library project. The interface has again been designed to be as simple as possible: data is passed to and from the model object using raw buffers in the layout specified in the Python code above. Listing 2 shows a minimal integration with the SqueezeNet example model. First an Eigen device object is created, which defines how the model will be evaluated; in this case a single threaded device is used, although multi-threaded devices are also supported. Secondly the model is instantiated, causing the required heap memory to be allocated. Thirdly the model is evaluated using the given device, with input and output buffers passed as pointers. Finally the heap memory is released when the model instance goes out of scope.

Listing 2. Example C++ project integration with TFMin generated code

#include <iostream>
#include "cnn_model.h"

using namespace std;

int main()
{
  float input[154587];
  float est_class[1000];

  Eigen::DefaultDevice device;
  SqueezeNet model;

  cout << "Infering with Model.\n";
  model.eval(device, input, est_class);
  cout << "Inference complete.\n";

  return 0;
}

B. Speed Analysis & Optimisation

As well as the standard eval() method shown above, two optional methods can be generated to analyse and verify the generated code.

The following section describes how to export additional methods for the generation of per-operation timing information and the measurement of a model's memory requirements. Full tutorials and examples can be found on the Git repository wiki pages [14]. Listing 3 shows how to export these two optional methods; it should be noted that enabling full validation can significantly increase the size of the generated binary, since the expected result of every operation needs to be stored.

The timing method performs the same task as eval() while additionally reporting the time each operation takes to execute. These results are printed to the terminal and returned in a data structure for automatic recording. The validation method prints either a pass or a fail for each operation as the model is being evaluated, terminating in the case of failure. This is especially useful for developers who are optimising or extending the operations supported by the library, as described in Section II-D.

Listing 3. Python example exporting a trained model with timing and validation methods.

val_input = [ ... ]

gen.generate("cnn_model", "cnnModel",
             validation_inputs={input_placeholder: val_input},
             validation_type='Full',
             timing=True,
             layout='RowMajor')

Figure 3 shows the timing results of the convolutional MNIST example network on a 50 MHz LEON 3 [13]; this tool can be used to identify which parts of a network topology have the greatest scope for being accelerated. In this simple CNN model the costliest operations are the two convolution filters and the largest fully connected layer, as would be expected. These operations predominantly use multiply and accumulate instructions, accessing regions of memory larger than the cache size of the processor. Using the TSIM simulator it was found that the cache hit rates were 70% and 82% for the convolution and matrix multiplication operations respectively. These are very poor hit rates and indicate that the speed of execution is being limited by the time taken to access data in RAM. The 4 KB data cache of the simulated LEON processor used is a significant factor in the time taken to calculate these layers. The pattern of memory use also has a large effect on cache hit rates and therefore performance: using a row-major layout results in this classifier executing in 68% of the time taken by the same model using a column-major layout.

Fig. 3. CPU time per operation of a convolutional MNIST classification model executed on a 50 MHz LEON 3.

Optimisation of neural network training and inference on different types of hardware is an ongoing research topic, including for the single and multi core CPU architectures of the LEON family. This work is not trying to push these boundaries, but it is important that the algorithms used are representative of current CPU implementations so that our performance results are realistic. This was achieved by basing our code on TensorFlow's own open source C++ implementation, which uses the Eigen linear algebra library [4]. TFMin could be extended by the development of layer implementations that are optimised for specific hardware, such as the work of Lai et al. [15], who created a library of optimised ARM Cortex-M NN layer implementations (CMSIS-NN). Combining the automatic code generation framework provided by TFMin with a set of hardware specific layer implementations would pair TFMin's fast work-flow with state of the art layer performance.

Using Eigen to perform the fundamental tensor operations allows the library to take advantage of verified and optimised algorithms. The most CPU intensive operation in most DNNs is convolution; TFMin implements this by reshaping the input and filter tensors so that the convolution can be calculated as a tensor contraction. The particular contraction used is a matrix multiplication, allowing Eigen's optimised GEMM algorithms to be used.
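TFMin's generated code expresses this through Eigen tensor contractions; the NumPy sketch below only illustrates the underlying reshape-then-multiply (im2col plus GEMM) idea for a single image with valid padding, and is not the code that TFMin emits.

import numpy as np

def conv2d_as_gemm(x, filters, stride=1):
    """Valid-padding 2D convolution computed as a single matrix multiply.

    x:       input image,  shape (H, W, C_in)
    filters: filter bank,  shape (kH, kW, C_in, C_out)
    Returns an output tensor of shape (H_out, W_out, C_out).
    """
    H, W, C_in = x.shape
    kH, kW, _, C_out = filters.shape
    H_out = (H - kH) // stride + 1
    W_out = (W - kW) // stride + 1

    # "im2col": gather every receptive field into one row of a patch matrix.
    patches = np.empty((H_out * W_out, kH * kW * C_in), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            window = x[i * stride:i * stride + kH, j * stride:j * stride + kW, :]
            patches[i * W_out + j] = window.reshape(-1)

    # The convolution is now a single GEMM between the patch matrix
    # and the reshaped filter bank.
    out = patches @ filters.reshape(kH * kW * C_in, C_out)
    return out.reshape(H_out, W_out, C_out)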

C. Memory Analysis & Optimisation

Measuring memory use on LEON 3 computers is more challenging than on fully featured desktop systems, however the requirements on both types of computer would be expected to be close to each other. For this reason, we recommend using the Valgrind tool Massif to measure the memory use of a DNN model, by compiling it for a desktop target and running it natively; it is important to note that both heap and stack memory need to be recorded to get an accurate measurement. Taking SqueezeNet v1.1 [16] as an example, we can see in Figure 4 that the model requires an approximately constant amount of memory to run, with a peak of 7.7 MB.

Fig. 4. Combined heap & stack memory use of the TFMin implementation of SqueezeNet v1.1 during an inference operation.

The result shown in Figure 4 includes the memory optimisation step performed by the TFMin library. The memory requirements of most NN models are fixed, which means it is possible to pre-calculate an optimal memory use pattern, removing the need for dynamic memory management during evaluation. This optimisation is performed by finding the first assignment and final use of each tensor and the amount of memory needed to store it. Each tensor therefore defines a block of memory required over a specific interval of the calculation; a packing algorithm is used to fit these blocks into the smallest possible fixed-size region of memory, Figure 6. Comparing SqueezeNet's optimised pattern against the original, the peak memory requirement is reduced to 20% of the original, Figure 5.
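One way such a pre-allocation can be computed is sketched below as a simple greedy first-fit scheme over tensor lifetimes; this is only an illustration of the idea and is not necessarily the packing algorithm used by TFMin.

def pack_tensor_blocks(blocks):
    """Greedy first-fit packing of tensor buffers into one pre-allocated arena.

    blocks: list of (first_use, last_use, size) tuples, one per intermediate
    tensor, where first_use/last_use are operation indices in the flow-graph.
    Returns (arena_size, offsets): offsets[i] is the byte offset assigned to
    block i inside a single fixed-size buffer.
    """
    def overlaps(a0, a1, b0, b1):
        # True when the closed interval [a0, a1] intersects [b0, b1].
        return not (a1 < b0 or a0 > b1)

    placed = []                      # (offset, size, first_use, last_use)
    offsets = [0] * len(blocks)

    # Place the largest tensors first; they are the hardest to fit.
    for i in sorted(range(len(blocks)), key=lambda i: -blocks[i][2]):
        first, last, size = blocks[i]
        offset = 0
        moved = True
        while moved:
            moved = False
            for p_off, p_size, p_first, p_last in placed:
                if (overlaps(first, last, p_first, p_last) and
                        overlaps(offset, offset + size - 1, p_off, p_off + p_size - 1)):
                    offset = p_off + p_size   # bump above the clashing block
                    moved = True
        placed.append((offset, size, first, last))
        offsets[i] = offset

    arena_size = max((off + size for off, size, _, _ in placed), default=0)
    return arena_size, offsets

For example, pack_tensor_blocks([(0, 2, 1024), (1, 3, 2048), (3, 4, 1024)]) re-uses the first tensor's space for the third, because their lifetimes do not overlap, needing a 3 kB arena rather than the 4 kB a naive allocation would require.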

Fig. 5. Combined heap & stack memory use of SqueezeNet v1.1 before and after optimisation.

Fig. 6. Tensor memory use patterns before (A) and after optimisation (B).

D. Supported Operations

TFMin uses a set of op kernel objects which match themselves to TF operations and generate equivalent C++ code; the library currently has 19 op kernels supporting the most common operations. This is a fraction of the approximately 800 operations available in the full TF framework, but is sufficient to implement all the networks described in this work. It is inevitable that developers will need to extend this set of operations to work with a wider range of network models. An extensible architecture has been used, allowing new op kernels to be supported by adding separate Python files to the op kernels folder.

An example and walk-through of this process is included in the "adding an operation" tutorial on the Git repository wiki [14].

E. Benchmarking CNNs on the LEON 3

Using the analysis techniques described above, several well known tasks and network models have been built for the LEON 3; these results provide a valuable guide to what is possible on this generation of radiation hardened processors.


TABLE I
EXECUTION SPEED AND MEMORY REQUIREMENTS OF NOTABLE DNN MODELS ON A LEON 3 PROCESSOR.

Model                     Peak RAM (KB)   Time (s)
MNIST Dense (float 32)    82.7            0.100
MNIST CNN (float 32)      200             0.121
SqueezeNet v1.1 [16]      7884            41.0

Several larger networks, AlexNet [17] and ResNet-50 [18], were also investigated but were found to be too large to implement on the current LEON 3 development boards available to us.

C++ code generated by TFMin was successfully built using both the GCC and LLVM based compilers provided by Gaisler Aeroflex [12]; no significant difference in execution speed was found between the two compilers. All evaluation binaries were built with maximum optimisation for speed and no debugging symbols. Compiler optimisation is especially important for the Eigen library, which has been designed to be built this way: a 20x speed increase was observed between maximum (-O3) and no optimisation.

• All LEON times were generated using a 50 MHz processor with a hardware floating point unit and 16 KB data and instruction cache sizes.

• MNIST classifier results can be reproduced using the free evaluation version of TSIM (the LEON simulator) available from Gaisler Aeroflex.

• The SqueezeNet result was generated using a Pender GR-XC3S-1500 development board with a Xilinx Spartan FPGA and 64 MB of SDRAM [19].

III. PLANETARY TERRAIN ASSESSMENT USE CASE

A. The Estimation Task

Determining which areas of terrain are safe and unsafe to drive over is an essential function of planetary rovers and terrestrial autonomous ground vehicles [20]. This experiment investigates the use of a deep learning model for this task and compares its accuracy and performance to an existing baseline algorithm. A cost mapping algorithm based upon the GESTALT navigation system used on the MER (Mars Exploration Rovers) [21] is used as this baseline; its input is a set of elevation values from within a rover footprint and it produces a scalar navigation cost metric. These cost values lie between zero and one, where zero is impassable and one is the easiest possible terrain. This algorithm is repeated across the DEM (digital elevation map) to generate a navigation cost map, Figure 7, which is subsequently used by a path planning algorithm to find a safe and efficient route to the chosen destination [20].

A deep learning regression model is fitted to the inputs and outputs of the baseline algorithm, taking a fixed size input matrix and estimating a single scalar. The existing baseline algorithm allows us to automatically generate large sets of training data from any existing DEM; this experiment used a 600 x 600 metre map generated from drone data with an 8 cm cell size, taken from the ERGO [22] [23] Morocco field trial.

Fig. 7. (left) Digital elevation map taken from the ERGO Morocco dataset, (right) cost map of navigability values.

A set of 36 million training cost values was generated from this map, which was then augmented using rotations and reflections, increasing it eight-fold to 288 million. A Huber loss [24] function was used with the ADAM optimiser [25] and a variable learning rate to refine the weights of the model during training.
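For reference, the Huber loss applies a quadratic penalty to small residuals and a linear penalty to large ones, making the regression less sensitive to outlying cost values; a minimal NumPy definition is sketched below (the training itself was run in TF).

import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for residuals below delta, linear above it."""
    r = np.abs(y_true - y_pred)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear).mean()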

During initial testing a problem was found with this training data: the distribution of navigation costs was not uniform, with a bias towards easy terrain. This makes sense when considering that the majority of the map is fairly flat, however this bias prevented the model from fitting well in areas containing difficult terrain. Splitting the training data into ten sets with evenly distributed navigation costs, then selecting samples equally from each of these sets, removed this bias so that a uniform distribution of cost values is seen during training, as sketched below. This training methodology, cost function and training data were then used to evaluate a range of CNN topologies.
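A minimal sketch of this balanced sampling scheme, assuming cost values in [0, 1] and the hypothetical array names used here; the actual training pipeline may differ in detail.

import numpy as np

def balanced_batches(samples, costs, batch_size=100, n_bins=10):
    """Yield batches whose navigation costs are uniformly distributed.

    samples: array of rover-footprint elevation windows, shape (N, ...)
    costs:   array of target navigation costs in [0, 1], shape (N,)
    """
    # Split the dataset into n_bins sets covering equal ranges of cost.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = [np.where((costs >= lo) & (costs < hi))[0]
            for lo, hi in zip(edges[:-1], edges[1:])]
    bins[-1] = np.where(costs >= edges[-2])[0]   # include cost == 1.0

    per_bin = batch_size // n_bins
    while True:
        # Draw the same number of samples from every cost bin.
        idx = np.concatenate([np.random.choice(b, per_bin) for b in bins if len(b)])
        yield samples[idx], costs[idx]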

B. Analysis of DNN Models

We present five different DNN models which were trained as described in Section III-A and compare their performance to a LEON 3 implementation of the baseline terrain assessment algorithm. LEON 3 performance metrics were generated using a 50 MHz processor with a hardware floating point unit and 16 kB instruction and data caches. All models were trained on the dataset for 200k steps with a batch size of 100, taking approximately two hours per model using an NVIDIA GTX 1070 GPU.

After considering many different CNN topologies, five models are presented with a range of layer counts and filter sizes. Model A, with five convolution layers and three fully connected layers, was the first to successfully fit the training data (Figure 9); Figure 8 shows the cost map generated using this model and the error between its estimates and the original training data. This network topology was then gradually reduced in size, decreasing the number of layers and reducing the size of the convolutional filters until it was no longer capable of fitting the training data. Models B to E represent the range of network topologies that were able to fit the training data, although there is a slight trend of increasing residual errors as the networks become smaller.


Fig. 8. (left) DNN generated cost map from ERGO DEM data, (right) residual error between DNN model and training data.

TABLE II
TERRAIN ASSESSMENT MODEL ACCURACY AND PERFORMANCE RESULTS ON A LEON 3.

                  A        B        C        D        E
Weights (kB)      1130     1094     300      154      110
Mean Error        0.0194   0.0186   0.0198   0.0206   0.0199
Runtime (s)       8.34     2.03     1.09     0.39     0.18
Baseline ratio    87.6     21.4     11.5     4.1      1.9
Binary (kB)       11371    9807     9496     8868     8815
Memory (kB)       884      595.5    360      286.4    252.3


Fig. 9. Terrain traversability estimation models A - E.

The baseline terrain assessment algorithm calculates three geometric properties which are then combined into the final navigation cost: the slope of the best fit plane, the largest distance between this plane and any point on the terrain, and the greatest height difference between any adjacent DEM cells. The most computationally intensive task is the calculation of the best fit plane: a singular value decomposition (SVD) is performed using the bidiagonal divide-and-conquer algorithm [26], producing the principal axes of a best fit ellipsoid, where the axis with the smallest singular value corresponds to the plane's normal. This baseline algorithm takes 95 milliseconds to execute on the LEON 3 CPU described in Table II.
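A minimal NumPy sketch of this plane-fitting step is shown below for illustration; the on-target implementation performs the SVD with the divide-and-conquer algorithm cited above.

import numpy as np

def plane_slope(points):
    """Slope (radians) of the least-squares plane through 3D terrain points.

    points: array of shape (N, 3) holding the DEM cell positions that fall
    inside the rover footprint.
    """
    centred = points - points.mean(axis=0)
    # The right singular vector with the smallest singular value is the
    # normal of the best-fit plane through the centred points.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    normal = vt[-1]
    # Slope is the angle between the plane normal and the vertical axis.
    return np.arccos(abs(normal[2]) / np.linalg.norm(normal))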

TABLE III
TERRAIN ASSESSMENT MODEL PERFORMANCE RESULTS ON AN INTEL I7.

                  A        B        C        D        E
Time (s)          0.0404   0.0060   0.0040   0.0015   0.0009
Baseline ratio    336.8    49.9     33.4     12.8     7.3


C. Use Case Findings

As can be seen in Table II, the baseline algorithm still outperforms the DNN model by a factor of two even in the best case. This result suggests that deep learning is not capable of estimating navigation cost maps more efficiently than the existing baseline algorithm when costs are estimated one cell at a time. Future work will develop this approach using fully convolutional networks which estimate multiple cost map elements at the same time, potentially producing several square metres of cost map in a single pass. Using this technique it is possible to surpass the performance of the baseline algorithm when larger areas of cost map need to be generated at once.

The value of using TFMin to generate accurate LEON 3 timing results for these models is highlighted when the same comparison between algorithms is made on an Intel i7 CPU, shown in Table III. The relative speeds of the two algorithms on the i7 are significantly different than on the LEON 3, caused by each algorithm interacting differently with the underlying CPU architecture and resources of each system.

There are many potential reasons for the observed difference in relative performance between the two systems. The i7's level two cache is larger than the entire DNN model, while the LEON 3 has no level two cache and its level one cache cannot store even a single layer of weights. The i7 also supports vectorised FMA (Fused Multiply-Add) operations, allowing it to execute significantly more floating point operations per clock cycle than the simpler FPU of the LEON 3. This is why it is important to analyse the performance of an algorithm on the same CPU architecture that it will run on in production, and why TFMin can be a valuable tool for reviewing DNN model performance early in the development cycle.

IV. CONCLUSIONS AND DISCUSSIONS

A new tool has been demonstrated enabling developers to quickly deploy and evaluate TensorFlow models on small resource constrained processors such as the LEON 3. LEON 3 benchmarks for several well known deep learning models (Table I) have been produced using this tool. The workflow and integration of this tool into an existing TensorFlow project have been described, along with the methods used to extract meaningful analyses from the deployed networks. Finally a simple use case has been shown which used the TFMin tool to quickly analyse a range of different network models, highlighting why it is important to compare algorithms on target hardware as opposed to more easily available desktop systems.


Our terrain assessment use case has shown that a conventional regression CNN can accurately reproduce the output of a traditional algorithm, although taking 1.9 times longer to execute. Since only a small improvement in the performance of the CNN model would allow it to surpass the baseline algorithm, our future work will focus on this area. However, even if a more efficient deep neural network is found, there are still significant challenges to its adoption on-board planetary rovers: aerospace verification and validation practices do not cover the use of deep neural networks, so we will investigate the use of a novel oversight system to address this gap.

Eight significant deep learning models have been built and analysed on the LEON 3 platform using TFMin, however there are still many improvements that could be made to this tool. Model optimisation and quantisation in TF is available as part of the TFLite module, although many of the standard layer operations only support floating point data. During this work experimental micro-controller support was added to the TF Lite framework; the authors are currently adding LEON support to this tool and will investigate the performance of these implementations compared to those from TFMin. Although the Eigen tensor operations used are generically optimised, they are not optimised for the LEON 3; producing a set of op kernels which generate LEON 3 optimised code would be necessary to achieve the best performance possible on this target. Finally, support still needs to be added for additional TensorFlow operations; it is hoped that the open-source community will be able to help with this task.

The Airbus deep-learning group is currently investigating the use of this tool, and it is hoped that its open-source licence will encourage wider use across the space industry.

ACKNOWLEDGMENT

The authors would like to thank Matthias Winter for originally suggesting the CNN approach to navigability assessment during his time working on the ExoMars rover. We would also like to thank the Airbus Defence and Space AOCS/GNC department at Stevenage for their assistance during this work, and the ERGO project and PERASPERA cluster for providing access to the Morocco dataset.

This work has been sponsored by Airbus.

REFERENCES

[1] T. Campbell, "A deep learning approach to autonomous relative terrain navigation," 2017.

[2] Gaisler Aeroflex, "Dual-core LEON3-FT SPARC V8 processor," http://www.gaisler.com/doc/gr712rc-datasheet.pdf.

[3] R. W. Berger, D. Bayles, R. Brown, S. Doyle, A. Kazemzadeh, K. Knowles, D. Moser, J. Rodgers, B. Saari, D. Stanley et al., "The RAD750(TM): a radiation hardened PowerPC processor for high performance spaceborne applications," in 2001 IEEE Aerospace Conference Proceedings (Cat. No. 01TH8542), vol. 5. IEEE, 2001, pp. 2263-2272.

[4] G. Guennebaud, B. Jacob et al., "Eigen," URL: http://eigen.tuxfamily.org, 2010.

[5] PJRC, "Teensy 3.2 & 3.1 - new features," https://www.pjrc.com/teensy/teensy31.html.

[6] Vorago Technologies, "Vorago products," https://www.voragotech.com/vorago-products.

[7] J. Manning, D. Langerman, B. Ramesh, E. Gretok, C. Wilson, A. George, J. MacKinnon, and G. Crum, "Machine-learning space applications on smallsat platforms with TensorFlow," in Proceedings of the 32nd Annual AIAA/USU Conference on Small Satellites, Logan, UT, USA, 2018, pp. 4-9.

[8] A. D. George and C. M. Wilson, "Onboard processing with hybrid and reconfigurable computing on small satellites," Proceedings of the IEEE, vol. 106, no. 3, pp. 458-470, 2018.

[9] BAE Systems, "BAE Systems current processors and single board computers," https://www.baesystems.com/en-us/download-en-us/20190124214317/1434554723043.pdf.

[10] M. Fernandez, R. Gioiosa, E. Quinones, L. Fossati, M. Zulianello, and F. J. Cazorla, "Assessing the suitability of the NGMP multi-core processor in the space domain," in Proceedings of the Tenth ACM International Conference on Embedded Software. ACM, 2012, pp. 175-184.

[11] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization. IEEE Computer Society, 2004, p. 75.

[12] J. Gaisler and K. Eisele, "BCC - Bare-C cross-compiler user's manual," Aeroflex Gaisler AB, 2012.

[13] L. Deng, "The MNIST database of handwritten digit images for machine learning research [best of the web]," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141-142, 2012.

[14] P. Blacker, "GitHub repository of the TFMin code generation tool," https://github.com/PeteBlackerThe3rd/TFMin.

[15] L. Lai, N. Suda, and V. Chandra, "CMSIS-NN: Efficient neural network kernels for ARM Cortex-M CPUs," arXiv preprint arXiv:1801.06601, 2018.

[16] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size," arXiv preprint arXiv:1602.07360, 2016.

[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.

[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.

[19] Pender, "Pender GR-XC3S," http://www.pender.ch/products_xc3s.shtml.

[20] M. Winter, C. Barcaly, V. Pereira, R. Lancaster, M. Caceres, N. McManamon, N. Silva, D. Lachat, and M. Campana, "ExoMars rover vehicle: detailed description of the GNC system," ASTRA, 2015.

[21] M. Maimone, A. Johnson, Y. Cheng, R. Willson, and L. Matthies, "Autonomous navigation results from the Mars Exploration Rover (MER) mission," in Experimental Robotics IX. Springer, 2006, pp. 3-13.

[22] ERGO Consortium, "European Robotic Goal-Oriented Autonomous Controller (ERGO)," https://www.h2020-ergo.eu/.

[23] R. Marc, P. Wclewski, and D. Lachat, "Autonomous multi-mode rover navigation for long-range planetary exploration using orbital and locally perceived data," Oct. 2018.

[24] S. Klanke and H. Ritter, "Variants of unsupervised kernel regression: General cost functions," Neurocomputing, vol. 70, no. 7-9, pp. 1289-1303, 2007.

[25] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[26] M. Gu and S. C. Eisenstat, "A divide-and-conquer algorithm for the bidiagonal SVD," SIAM Journal on Matrix Analysis and Applications, vol. 16, no. 1, pp. 79-92, 1995.