



Application of Deep Learning methods to analysis of Imaging Atmospheric Cherenkov Telescopes data

I. Shilon a,∗, M. Kraus a,∗∗, M. Büchele a, K. Egberts b, T. Fischer a, T.L. Holch c, T. Lohse c, U. Schwanke c, C. Steppa b, S. Funk a

a Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen Centre for Astroparticle Physics, Erwin-Rommel-Str. 1, D 91058 Erlangen, Germany

b Institut für Physik und Astronomie, Universität Potsdam, Karl-Liebknecht-Str. 24/25, D 14476 Potsdam, Germany
c Institut für Physik, Humboldt University of Berlin, Newtonstr. 15, D 12489 Berlin, Germany

Abstract

Ground-based γ-ray observations with Imaging Atmospheric Cherenkov Telescopes (IACTs) play a significant role in the discovery of very high energy (E > 100 GeV) γ-ray emitters. The analysis of IACT data demands a highly efficient background rejection technique, as well as methods to accurately determine the energy of the recorded γ-ray and the position of its source in the sky. We present results for background rejection and signal direction reconstruction from first studies of a novel data analysis scheme for IACT measurements. The new analysis is based on a set of Convolutional Neural Networks (CNNs) applied to images from the four H.E.S.S. phase-I telescopes. As the pixels of the H.E.S.S. cameras are arranged in a hexagonal array, we demonstrate two ways to use such image data to train CNNs: by resampling the images to a square grid and by applying modified convolution kernels that conserve the hexagonal grid properties.

The networks were trained on sets of Monte-Carlo simulated events and tested on both simulations and measured data from the H.E.S.S. array. A comparison of the CNN analysis with current state-of-the-art algorithms reveals a clear improvement in background rejection performance. When applied to H.E.S.S. observation data, the CNN direction reconstruction performs at a similar level to traditional methods. These results serve as a proof-of-concept for the application of CNNs to the analysis of events recorded by IACTs.

Keywords: Gamma-ray astronomy; IACT; Analysis technique; Deep learning; Convolutional neural networks; Recurrent neural networks

1. Introduction

Since the beginning of the century, γ-ray astrophysics has been progressing at a remarkable pace. The third generation instruments of ground-based Imaging Atmospheric Cherenkov Telescopes (IACTs) have been exploring the very high energy (VHE; E > 100 GeV) sky, increasing the number of known γ-ray-emitting celestial objects to more than 200 [1]. Current and future generations of IACTs primarily aim to investigate the origin and acceleration processes of Cosmic Rays (CRs) [2] and identify the nature of dark matter [3].

The IACT technique relies on the utilization of the Earth's atmosphere as a calorimeter.

∗ Corresponding author; Email: [email protected]
∗∗ Corresponding author; Email: [email protected]

When a VHE CR or γ-ray enters the atmosphere, it interacts with the nuclei in the air to initiate a cascade of particles and electromagnetic radiation, known as an Extensive Air Shower (EAS). If the primary particle is a γ-ray, it undergoes e+e− pair production which initiates a purely electromagnetic shower. The relativistic charged particles in the shower emit a very narrow cone of Cherenkov radiation, with an opening angle of ∼ 1◦, which is detectable at ground level. The small Cherenkov angle leads to a light pool with a diameter of typically 200-300 m at ground level and a nearly uniform light density, where the integral of the intensity is correlated to the primary particle's energy. Electromagnetic showers are characterized by an elliptically shaped shower image. If the primary particle is a CR, a hadronic shower develops. Although such hadronic showers often have electromagnetic sub-shower components as well, they lead to a typically more irregular shape of the image.

IACTs are able to detect and record images of the Cherenkov light emitted by the secondary particles in the EAS.




Such images generally allow one to gather sufficient information to separate the γ-ray signal from the dominant CR background and reconstruct the source position and energy of the primary γ-ray. To record a sufficiently accurate shower image, an IACT camera must be able to capture the very brief Cherenkov light flash, which lasts for a few nanoseconds. In addition, the optical Point-Spread Function (PSF) and camera pixel size should ideally be smaller than the angular dimension of the γ-ray shower. Increasing the IACT camera resolution enables a more accurate computation of the shower axis, which has an intrinsic transverse angular size of only a few arc-minutes. Nevertheless, stochastic fluctuations in the shower development impose a limiting factor on the performance of an IACT.

The analysis of EAS images improves significantly when observing the showers from several angles [4]. A current generation IACT system that utilizes such stereoscopic analysis of EAS images is the High Energy Stereoscopic System (H.E.S.S.) [5]. The H.E.S.S. array is located in the Khomas highland in Namibia (23◦16'18" S, 16◦30'01" E). It consists of four 12 m diameter Cherenkov Telescopes (CT1-4), built between 2002 and 2004, and a fifth, 28 m diameter telescope (CT5), built in 2012. The CT1-4 telescopes have a total field of view (FoV) of 5◦ and an energy threshold of about 100 GeV. Thanks to its large mirror surface, the minimum γ-ray energy that CT5 can trigger on is ∼ 30 GeV, with an FoV of 3.5◦. Each of the small telescopes is equipped with a camera containing 960 photo-multipliers (PMTs), while the camera of CT5 contains 2048 PMTs.

Analysis of IACT images relies on the extraction of relevant features from the camera pixel data. Whether those features are a vector of parameters representing the image, such as the image moments, or the full photo-electron intensity count in each pixel, an IACT analysis method must be able to perform each of the following three tasks in order to detect and study VHE γ-ray sources:

1. Background rejection: separate the γ-ray induced signal from the much more prevalent background of hadron-induced showers, through identification of shape features in the image.

2. Direction reconstruction: reconstruct the position of the origin of those events classified as signal, through calculation of the shower image axis. Observation of the EAS with a stereoscopic system significantly improves the direction reconstruction resolution.

3. Energy reconstruction: reconstruct the primary particle's energy for those events classified as signal, through the total image intensity and the shower impact point on the ground, relative to the telescopes.

H.E.S.S. currently applies three main reconstruction techniques in its analysis chain. The first relies solely on the so-called Hillas parameters [6], the image moments derived from the distribution of the measured intensities in the individual camera pixels, calculated after an image cleaning step [7]. Given the approximately elliptical shape of typical signal camera images, the arrival direction of an EAS event is reconstructed by tracing the major axis of the image, which corresponds to the projected direction of the shower in the camera FoV, to the γ-ray source in the sky. For stereoscopic observations, the major axes of EAS images from the participating telescopes are calculated in a common camera reference frame and the intersection points of all axis pairs are found. A weighted average of all intersection points, based on image amplitude and the angle between the axes, is then taken to provide an estimate of the arrival direction of the primary γ-ray [7]. A similar procedure, involving the intersections of the directions between the image centroid and the optical axis, is then performed in a common plane perpendicular to the pointing direction, to determine the shower impact point on the ground.
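A toy Python sketch of this geometric stereo reconstruction is given below: the major axes of the telescope images are intersected pairwise in a common camera frame and the intersection points are averaged. The specific weighting used here (image amplitudes times the sine of the angle between the two axes) is an illustrative choice and not the exact weighting of [7].

    import numpy as np

    def intersect_lines(p1, d1, p2, d2):
        # Intersection of two lines, each given by a point p and a direction d:
        # solve p1 + t1*d1 = p2 + t2*d2 as a 2x2 linear system.
        A = np.column_stack([d1, -d2])
        t = np.linalg.solve(A, p2 - p1)
        return p1 + t[0] * d1

    def hillas_stereo_direction(centroids, directions, amplitudes):
        # Intersect the image major axes of all telescope pairs and take a
        # weighted average of the intersection points.
        points, weights = [], []
        n = len(centroids)
        for i in range(n):
            for j in range(i + 1, n):
                d_i, d_j = directions[i], directions[j]
                sin_angle = abs(d_i[0] * d_j[1] - d_i[1] * d_j[0])
                if sin_angle < 1e-3:          # skip nearly parallel axes
                    continue
                points.append(intersect_lines(centroids[i], d_i, centroids[j], d_j))
                weights.append(amplitudes[i] * amplitudes[j] * sin_angle)
        return np.average(points, axis=0, weights=weights)

    # Two images whose major axes cross at the assumed source position (0.5, 0.2).
    centroids = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
    directions = [np.array([0.5, 0.2]) / np.hypot(0.5, 0.2),
                  np.array([-0.5, 0.2]) / np.hypot(0.5, 0.2)]
    print(hillas_stereo_direction(centroids, directions, amplitudes=[200.0, 150.0]))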

The two other techniques utilize likelihood fitting of camera pixel amplitudes to semi-analytical shower models (the Model++ method [8]) or template libraries from Monte-Carlo (MC) simulations (the ImPACT method [9]). A maximum likelihood fit is performed to find the best-fit shower parameters. These analysis methods show significantly better performance with respect to the Hillas analysis, particularly for direction reconstruction.

The Hillas and ImPACT reconstruction methods rely on a Boosted Decision Tree (BDT) [10] for the background rejection stage. The BDT uses a set of parameters, derived from Hillas-based event reconstruction, to classify events. These parameters include a comparison of the image width and length to the expected mean values for both γ-ray and hadron induced showers, the average spread in energy reconstruction between the triggered telescopes and the reconstructed height of the air-shower maximum. The BDTs are trained in a set of energy and zenith angle bins. The Model++ scheme uses a scaled shower goodness-of-fit parameter to separate signal from background.
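As an illustration only, the snippet below trains a gradient-boosted tree classifier (scikit-learn) on a toy set of Hillas-derived event parameters. The feature values are random placeholders; the actual HAP BDTs, their exact input parameters and their binning in energy and zenith angle are not reproduced here.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    rng = np.random.default_rng(0)
    n = 5000
    # Toy features per event: [MRSW-like, MRSL-like, energy-reconstruction spread,
    # depth of shower maximum]; the numbers below are random placeholders.
    gammas = rng.normal(loc=[0.0, 0.0, 0.1, 0.0], scale=[1.0, 1.0, 0.05, 1.0], size=(n, 4))
    protons = rng.normal(loc=[2.0, 1.5, 0.3, 1.0], scale=[1.5, 1.5, 0.15, 1.5], size=(n, 4))
    X = np.vstack([gammas, protons])
    y = np.concatenate([np.ones(n), np.zeros(n)])      # 1 = gamma-like, 0 = hadron-like

    bdt = GradientBoostingClassifier(n_estimators=100, max_depth=3)
    bdt.fit(X, y)
    print(bdt.predict_proba(X[:1]))                    # gamma-like score for one event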

The analysis of IACT data obviously relies on the correct extraction of relevant features from the EAS images.



Huge improvements over the last decade in computational power, particularly in the usage of GPUs for matrix operations, suggest that more computationally demanding algorithms could be utilized to boost the performance of such analysis chains. Specifically, Deep Learning (DL) techniques for object recognition in images and sequences of images, where a machine learns relevant features from the entire image matrix, are a clear candidate for such a family of algorithms. In general, DL concerns the application of complex Artificial Neural Networks (ANNs) to hierarchical learning tasks. For computer vision, Convolutional Neural Networks (CNNs) were designed and developed specifically to perform image recognition tasks. In this work we demonstrate how the application of DL techniques, relying on CNNs for recognition of features in EAS images, to H.E.S.S. data enhances the analysis of astrophysical point-sources observed by the H.E.S.S. CT1-4 telescopes.

In the following section, we provide a short description of the DL algorithms which we have used throughout this work. Section 3 provides details regarding the data-sets we created to train and test our networks. In section 4 we describe the data pre-processing approaches we have taken in order to feed H.E.S.S. data into a DL framework. Sections 5 and 6 describe the training process and provide test results for classification and direction reconstruction, respectively. Next, in section 7, we apply our DL models to real data to present our analysis results and compare them to current H.E.S.S. analysis methods, by utilizing the H.E.S.S. Analysis Package (HAP). We conclude our findings in section 8.

2. Deep Learning Methods for IACT Data

In its essence, DL is based on the use of ANNs. The basic form of an ANN is the multi-layer perceptron (MLP), which can be viewed as a weighted directed graph in which the "neurons" are graph nodes and directed edges with weights are connections between input and output. As the name suggests, an MLP graph contains a number of layers, starting with an input layer where each node receives all input variables x_i. The number of nodes in the output layer y_k is the number of desired output labels in the case of a regression model or the number of classes in a classification model. In between the input and output layers, additional layers are introduced, where the number of nodes in each layer and the number of layers are free hyper-parameters. These graph layers, commonly known as hidden layers, allow the introduction of non-linearity to the model by means of a non-linear activation function. In the context of DL, such layers are usually referred to as fully connected (FC) or dense layers.

All MLP neurons, except those in the input layer, receive their input from each of the nodes in the preceding layer. The output of each neuron in a hidden layer is given by f(w · x + b), where f is called the activation function, w is the vector of weights w_j and b is an additional degree of freedom, called the bias. Typical activation functions for DL networks are the hyperbolic tangent function and the Rectified Linear Unit (ReLU), defined by ReLU(x) = max(0, x) [11]. For the output layer f is the identity function. The learning process is accomplished by the 'backward propagation of errors' (or BackProp) algorithm [12]. The errors are calculated by means of a loss function, predefined in accordance with the learning task.
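As an illustration of this forward pass, the following minimal NumPy sketch evaluates a toy MLP with two hidden ReLU layers and an identity output layer; the layer sizes and random weights are arbitrary and serve only to show the f(w · x + b) structure.

    import numpy as np

    def relu(x):
        # Rectified Linear Unit: ReLU(x) = max(0, x)
        return np.maximum(0.0, x)

    def dense_layer(x, W, b, activation=relu):
        # Output of one fully connected layer: f(W x + b)
        return activation(W @ x + b)

    rng = np.random.default_rng(0)
    x = rng.normal(size=16)                                  # input variables x_i
    W1, b1 = rng.normal(size=(32, 16)), np.zeros(32)
    W2, b2 = rng.normal(size=(8, 32)), np.zeros(8)
    W3, b3 = rng.normal(size=(2, 8)), np.zeros(2)

    h1 = dense_layer(x, W1, b1)                              # first hidden layer
    h2 = dense_layer(h1, W2, b2)                             # second hidden layer
    y = dense_layer(h2, W3, b3, activation=lambda z: z)      # identity output layer
    print(y.shape)                                           # (2,), e.g. two class scores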

In a classical machine learning setting, one would "engineer" the features in the data-set in order to improve the performance of the ANN. As explained in the previous section, we wish to apply CNN based networks to H.E.S.S. data. CNNs are a specialized kind of neural network for processing data that has a known, grid-like structure [13]. In the 2D case, CNNs take a complete 2D grid of image pixels as input. Another desirable characteristic of CNNs is the fact that they automatically perform feature engineering of the input, meaning that CNNs learn to identify the relevant and important features in the training data in order to optimize their performance for a certain task.

A typical architecture of a CNN includes numerous Convolutional Layers (CLs), followed by a number of dense layers (usually ≤ 3). A CL typically comprises three stages: a convolution stage, an activation stage and a pooling stage. It should be noted that variations of this structure are common and in fact we make use of a different layer structure for our regression tasks (see Sec. 6).

The convolution stage relies on a set of learnable filters with a fixed size, which is spatially smaller than the 2D input. Each of the filters is convolved across the entire width and height dimensions of the 2D image matrix to produce linear outputs. The small spatial size of the filters allows the CNN to detect meaningful features (e.g. edges of a shape) which occupy only a small portion of the image. The convolution operation involves an element-wise multiplication between the filter and local patches of the image of the same size as the kernel. This local (or sparse) connectivity significantly reduces the number of free parameters and memory requirements of the model. Moreover, each parameter of the filter is "shared" across the image, in the sense that it is applied at every position of the image (disregarding boundary effects). Parameter sharing through patches of the image introduces translational equivariance, meaning that a shift in the input leads to the same shift in the output.
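The following NumPy sketch makes the convolution stage concrete: a small filter is multiplied element-wise with every local patch of a single-channel image (technically a cross-correlation, as is conventional in CNNs). The image size and the edge-detecting kernel are arbitrary choices for illustration.

    import numpy as np

    def conv2d_valid(image, kernel):
        # Naive 2D "convolution": the kernel is multiplied element-wise with
        # every local patch of the same size and the products are summed;
        # the same weights are shared at every position.
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.random.rand(64, 64)            # one resampled camera image
    edge_kernel = np.array([[-1., 0., 1.],
                            [-1., 0., 1.],
                            [-1., 0., 1.]])   # small filter sensitive to vertical edges
    feature_map = conv2d_valid(image, edge_kernel)
    print(feature_map.shape)                  # (62, 62): one feature map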




The convolution stage of a CL is followed by an activation stage, where each linear activation is fed into a nonlinear activation function. In the pooling stage, a pooling function [14] replaces the output of the layer at a certain location with a summary statistic of the nearby outputs, such as the maximum output in the neighbourhood of that location. The pooling operation makes the output of the convolution become approximately invariant to small translations in the input.

In many industry applications, CNNs are required to deal with coloured images. In such cases, the images are represented by a 3D grid, where the three layers along the depth dimension represent the RGB color components of each pixel. The convolution is done with 3D filters that convolve along the RGB layers, called channels in DL terminology. For IACT images, we may treat each of the telescope images as a single channel of the event image. In this case the depth is defined by the maximum number of participating telescopes t and the filters have dimensions of m × n × t, where m and n are the width and length of the filter, respectively, and in our case 2 ≤ t ≤ 4. This procedure will be further explained below.
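As a minimal sketch of the channel representation, the snippet below stacks four resampled telescope images along the depth axis and applies a single convolution and pooling layer in TensorFlow/Keras. The filter count, kernel size and 64 × 64 input resolution are illustrative and are not the hyper-parameters of the networks used in this work.

    import numpy as np
    import tensorflow as tf

    # Four resampled CT1-4 images stacked along the channel (depth) axis,
    # analogous to the RGB channels of a colour image.
    batch = np.random.rand(8, 64, 64, 4).astype("float32")   # (events, height, width, telescopes)

    conv = tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3),
                                  activation="relu", padding="same")
    pool = tf.keras.layers.MaxPool2D(pool_size=(2, 2))

    feature_maps = pool(conv(batch))
    print(feature_maps.shape)   # (8, 32, 32, 16): each 3 x 3 x 4 filter spans all telescopes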

In addition to the telescope channel representation, one may view the images of an EAS event as a temporal sequence of images. The ordering of the sequence can be determined by the time order of the triggering telescopes. With such an event representation, one may apply a Recurrent Neural Network (RNN) [13], where for example the output of the CLs is fed into a recurrent cell before it is sent to the dense layers. RNNs enable machines to be persistent by finding correlations between the different inputs in the sequence. This implies that an RNN has a "memory" that captures information from previously analysed data in the sequence. It should be noted that the temporal correlations learned by the recurrent cell can also be bi-directional (i.e. looking at both past and future data during the learning process). However, the computation requirements for such a case are heavier and we have not found sufficiently convincing reasoning for applying it here. The recurrent cell applied in this work is the so-called Long Short-Term Memory (LSTM) cell [15].

To implement our models, incorporating the algorithms described above, we have been utilizing the TensorFlow [16] DL framework. The models were trained on a machine with two Nvidia GeForce GTX 1080 GPUs. We took the data parallelization approach to accelerate the training process.

3. Training and Test Data-sets

The parameter distributions that describe a data-set used to train a neural network are incorporated in the prior probability of the resulting classifier or regressor. This statement is clear if one considers the network as a machine that predicts the function that is likely to have generated the data. Therefore, one must carefully consider the distributions of parameters that describe the training data-set before initiating the training process. For example, for a classification task, a training set should contain an equal number of examples from each class. In the case of IACT data, and in particular for regression tasks, one might consider training on sets with different energy spectra or offset distributions to help the learner converge towards the desired predictor. In addition, it is a common rule of thumb that deep networks perform better with larger training data-sets.

3.1. Training data-sets

Throughout this work, all data-sets that were used for training the different networks comprise events generated by Monte-Carlo (MC) simulations. These events are obtained by simulating the interaction of γ-rays (i.e. signal) and protons (i.e. background) with the atmosphere using the CORSIKA software [17]. Following the shower simulation, the response of the H.E.S.S. telescopes is simulated using the sim_telarray package [18] in order to generate the telescope images.

The three analysis goals described in Sec. 1 suggest two basic types of networks. The background rejection goal calls for a classification network, where an event is classified into one of two possible groups (i.e. signal and background). The two reconstruction goals are regression tasks, where the network is trained to predict a continuous parameter based on the image input. To that end, we chose two data-sets, one for classification training and another for regression training.

The classification training-set includes 2×10⁶ events, where the ratio of signal events to background events is one. The equal number of signal and background events in this data-set implies that the data-set is balanced in terms of the class-labels. For the regression tasks, we use a training-set comprised of 1×10⁶ MC γ-events. All events were simulated as diffuse emission around 20◦ zenith and 180◦ azimuth and with a power-law spectral index of −2.

The differences between the two data-sets are in the simulated view-cone (i.e. the solid angle around the telescopes' pointing position, within which showers are generated), the energy ranges and the telescopes' optical efficiency.



For the classification set, both particle types are simulated with a view-cone of 2.5◦; the signal events have energies from 20 GeV to 100 TeV, while background events are simulated with energies that range from 100 GeV to 200 TeV. The events in the regression data-set are simulated with a view-cone of 3◦ and have energies between 5 GeV and 100 TeV. The different parameter ranges come from our choice to rely on existing H.E.S.S. simulations in order to save valuable computation time.

Because we standardize our images as part of the pre-processing stage (see Sec. 4), the CNNs are blind to the optical efficiency of the telescopes. Nevertheless, for reference we state that the classification sets are created from the so-called H.E.S.S. phase1 simulations and the regression sets contain H.E.S.S. phase2b5 simulations, where the phases refer to a state of the H.E.S.S. array.

The raw simulated images are cleaned according to the standard H.E.S.S. cleaning scheme [7]. To take advantage of the stereoscopic observations of H.E.S.S., our simulation sets consist of events that survived the image cleaning in at least two of the CT1-4 telescopes (referred to as multiplicity-2). CT5 images are omitted from all data-sets.

We would like to point out that the training set numbers given above are without the usage of pre-selection cuts. This means, particularly due to the large energy range of the data-set, that many of the events (∼ 30% on average) in the training data are truncated and would not have passed the standard pre-selection cuts of the HAP chain (see Sec. 3.2). When applying pre-selection cuts to the training data, we observed enhanced performance solely for the direction reconstruction on simulated events. In this case, the number of events in the training set is 620 k.

In addition, the training sets were not binned in any parameter (other than the zenith angle). Taking the binned training approach could, in principle, increase the accuracy score of the classifier. However, this means that when coming to analyse real data, each event would have to be sent to its corresponding classifier. This requires knowledge about the particle properties (e.g. the energy for energy-binned training) prior to the reconstruction stage. As the energy and direction of the γ-ray are not necessarily known at the classification stage, this approach was not favoured at this proof-of-concept stage of the project.

3.2. Benchmark test data-sets

In addition to the training sets, we have also created independent test data-sets in order to serve as benchmarks and test the performance of the classifiers and regressors in a statistically significant way. The benchmark sets are used to demonstrate how well a classifier or a regressor generalizes to an arbitrary set of events from the simulation data, excluding events that were used in the training process. Therefore, the benchmark test data-sets are sub-sets of the relevant event distributions from the data-set from which the training sets were created. One should note that the benchmark sets did not serve as validation sets, i.e. these data-sets were not used to tune network hyper-parameters. The benchmark test results were obtained only after the work on a classifier or a regressor was completed.

Another purpose of creating benchmark sets is to make a comparison between the DL based results and the Hillas and ImPACT analyses of HAP. These analysis schemes are typically not able to correctly classify or reconstruct events that do not pass a set of defined pre-selection cuts, while in some circumstances a DL based analysis is able to do so. For the HAP analyses, cuts on the minimum image total amplitude (denoted as the size parameter) and the maximum distance between the Hillas ellipse centre-of-gravity and the camera centre (denoted as the local-distance parameter) are applied. The local-distance cut is used to reduce effects of image truncation at the edge of the camera. We have used a standard set of pre-selection cuts, where the minimum size parameter of an image is 60 p.e. and the maximum local distance of an image is 0.525 m (equivalent to 2◦ in the camera FoV). Together with the multiplicity cut, the pre-selection cuts mean that each surviving event must have at least two telescope images that survive the pre-selection cuts.
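A minimal sketch of such an event-level pre-selection is given below; the dictionary keys are hypothetical, and only the quoted cut values (60 p.e., 0.525 m, multiplicity 2) come from the text.

    def passes_preselection(images, min_size=60.0, max_local_distance=0.525,
                            min_multiplicity=2):
        # 'images' is a list of per-telescope dictionaries holding the Hillas
        # size (total amplitude in p.e.) and the local distance (centre-of-
        # gravity offset from the camera centre, in metres).  The event survives
        # if at least 'min_multiplicity' images individually pass both cuts.
        good = [img for img in images
                if img["size"] >= min_size
                and img["local_distance"] <= max_local_distance]
        return len(good) >= min_multiplicity

    event = [{"size": 120.0, "local_distance": 0.31},
             {"size": 85.0, "local_distance": 0.48},
             {"size": 40.0, "local_distance": 0.60}]
    print(passes_preselection(event))   # True: two images pass both cuts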

For the classification task, the size of the free benchmark set (i.e. without pre-selection cuts) is 2×320 k events, while the pre-selected benchmark set contains 2×196 k events. For the regression tasks, the benchmark set contains ∼500 k γ-events that pass the pre-selection cuts from a point-like source with an offset of 0.5◦ to the pointing position. The test results on simulated events we present were obtained using these benchmark test sets.

4. Image Pre-processing

The photomultiplier tubes (PMTs) in the cameras of the H.E.S.S. telescopes are arranged in a hexagonal grid. However, CNN implementations in standard DL frameworks are designed to process arrays of data points arranged in square grids. To process H.E.S.S. data with these frameworks, it is therefore necessary to pre-process the camera images and map the image data onto a square grid.



Figure 1: Telescope images of a γ-ray (top row) and a hadronic (bottom row) event. The images consist of numerous event images laid in a common camera plane, for visualization. Left column: original camera pixel intensities. The square images show the same event sampled with Gaussian smoothing (left) and by rebinning a hexagonal histogram (right). The square images are oversampled with a resolution of 100 × 100.

Converting data points from a hexagonal to a square grid is not trivial, as the two lattices respect different symmetries (6-fold symmetry for the hexagonal lattice and 4-fold symmetry for the square lattice). The challenge is therefore to perform discrete convolution operations on pre-processed telescope data using the methods provided by DL frameworks, while conserving the spatial symmetries and intensity distributions of the original data as much as possible.

There are in principle two possibilities to address this complication. The conservative approach is based on the resampling of the hexagonal data. The idea is to transform the data itself in such a way that the original image properties are approximately translated to a square grid. An alternative is the construction of custom convolution kernels that conserve the properties of the hexagonal grid. The data points are therefore rearranged into a square grid, to which the custom convolution kernels are applied.

While the resampling approach can only approximately conserve the hexagonal image properties (where the degree of distortion depends on the specific method and resampling resolution), it allows one to take advantage of the full set of functionalities and methods provided by modern DL frameworks and apply state-of-the-art CNN architectures. This is not the case for the custom hexagonal kernels, as such task-specific operations require manual adaptation of the common and available methods. To explore the applicability of such custom kernels to H.E.S.S. data, an extension to the PyTorch DL framework [19] has been developed [20], providing flexible implementations of convolution and pooling operations for hexagonally sampled input data. This approach shows promising initial results on simulated data, but is still being developed to be applied to real-world data. We present here the principles of the technique and plan to show applications of this algorithm in a future publication.

4.1. Comparison of resampling methods

There are several common approaches to generate square images from hexagonal data. For example, viewing the camera intensity map as a grid of points, each of which is located at the coordinates of the pixel centre and holds the pixel intensity, opens up a multitude of interpolation methods, such as linear and cubic interpolation. Interpolation methods are probably the most widely used approach for image processing. To interpolate, pixel values are interpreted as discrete samples of a continuous function - the image intensity distribution in the case of H.E.S.S.. Constructing such a function allows resampling the original image at an arbitrary set of points. The interpolated values obey some relationship to neighbouring original image values, depending on the choice of interpolation method.



For the H.E.S.S. images, problematic behaviour arises in the corners of the cameras because of the large distance between square and hexagonal pixel centres. This can be circumvented by introducing artificial zero-valued pixels in order to mask the camera corners. One should note that linear and cubic interpolation do not preserve the total image intensity.

On the other hand, interpreting the camera's PMTs as bins in a hexagonal histogram that collect single photons is more realistic than a centre-point approximation, and the histogram can be efficiently rebinned into a square histogram. Camera pixel values are related to the number of photons collected by the corresponding PMT's photocathode. Interpreting the image as a histogram is thus more physical than concentrating the pixel value at a single point. For our purposes we would like to know what an image taken with a camera consisting of a square grid of PMTs would look like. This can be achieved by rebinning the hexagonal histogram into a square one. This allows using an arbitrary resolution and implies the conservation of the total image intensity. Rebinning can be formulated as a single sparse matrix-vector multiplication, yielding low computation times. For further details on the rebinning method applied here, see [21].
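The sketch below illustrates rebinning as a sparse matrix-vector product. The overlap matrix M is filled with toy values here; in practice it would encode the fractional geometric overlap between each hexagonal camera pixel and each square output pixel (the construction used in [21] is not reproduced).

    import numpy as np
    from scipy import sparse

    # Hypothetical pre-computed overlap matrix M of shape (n_square, n_hex):
    # M[i, j] would be the fractional area of hexagonal pixel j falling inside
    # square output pixel i, so that every column sums to one.
    n_hex, n_square = 960, 64 * 64
    rows = np.random.randint(0, n_square, size=4 * n_hex)
    cols = np.repeat(np.arange(n_hex), 4)
    vals = np.full(4 * n_hex, 0.25)
    M = sparse.csr_matrix((vals, (rows, cols)), shape=(n_square, n_hex))

    hex_image = np.random.rand(n_hex)               # one camera image (p.e. per PMT)
    square_image = (M @ hex_image).reshape(64, 64)  # rebinning = sparse matrix-vector product
    print(np.allclose(square_image.sum(), hex_image.sum()))   # total intensity is conserved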

In [21] we also show the results of a thorough comparison between interpolation, rebinning, smoothing and oversampling methods, where smoothing refers to methods such as Gaussian smoothed sampling, and an example of oversampling IACT images can be found in [22]. The comparison investigated the influence of the resampling methods on artificial telescope images, generated by a 2D Gaussian function that was sampled on a camera-like hexagonal grid. The results of this study show that cubic interpolation excels at shape conservation. However, it is by far the most computationally expensive method. Linear interpolation and rebinning exhibit similar performance that is reasonably close to cubic interpolation. Nevertheless, rebinning is somewhat more computationally efficient relative to linear interpolation. Smoothing and oversampling methods were ruled out as resampling methods for H.E.S.S. images in the study. Fig. 1 shows examples of pre-processed images, resampled with a Gaussian filter and our rebinning method.

4.2. Resampling of H.E.S.S. images

Following our comparison study, and in order to explore different approaches, we generate resampled image inputs using both the rebinning and linear interpolation methods. The rebinned images were used for the classification models, where data-sets are generally large, and the interpolated images were used for the direction reconstruction models. The resolution of the resampled images significantly affects the performance of the analysis, where we observed that too high or too low resolutions degrade performance. In both cases, images were resampled with a resolution of 64 × 64 pixels (due to the cameras' aspect ratio the true resampling resolution is 64 × 62; to get a square image we pad the images appropriately). This leads to a ratio of roughly four between the number of resampled pixels and camera pixels.

As part of the pre-processing stage, we standardize each image, so that it has a mean intensity of zero and a standard deviation of one. Besides accelerating convergence, the standardization of the images effectively makes the networks invariant to different telescope optical efficiencies. This means that the classification and direction reconstruction rely solely on shape features and thus can be applied to the analysis of real data without accounting for the telescope optical efficiency, which degrades over time. For energy reconstruction, one may reconstruct the shower impact point on the ground using a CNN, thereby conserving the optical-efficiency invariance. We discuss possible implementations of a DL-based energy reconstruction in more detail in Sec. 8.
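A minimal sketch of this per-image preparation is shown below; the symmetric zero-padding of the 64 × 62 resampled image to 64 × 64 and the order of the two steps are illustrative assumptions.

    import numpy as np

    def pad_to_square(image, target=64):
        # Zero-pad a 64 x 62 resampled image symmetrically to 64 x 64.
        h, w = image.shape
        top, left = (target - h) // 2, (target - w) // 2
        out = np.zeros((target, target), dtype=image.dtype)
        out[top:top + h, left:left + w] = image
        return out

    def standardize(image, eps=1e-8):
        # Shift and scale each image to zero mean and unit standard deviation;
        # this removes the dependence on the absolute optical efficiency.
        return (image - image.mean()) / (image.std() + eps)

    prepared = standardize(pad_to_square(np.random.rand(64, 62)))
    print(prepared.shape, round(float(prepared.mean()), 6), round(float(prepared.std()), 3))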

4.3. Hexagonal Convolutions

The basis for designing custom hexagonal kernels, while relying on standard image processing algorithms that assume the input to be sampled on a square grid, is the rearrangement of the hexagonally sampled data and the hexagonal kernels. To that end, our implementation of the custom kernels assumes that the original input is squeezed into a square layout. Sub-kernels are then convolved with specific patches of the squeezed image according to the hexagonal layout. Squeezing means that the hexagonal lattice is zero-padded (according to the desired size of the hexagonal convolution kernels) and that the hexagonal lattice cells are interpreted as square cells. To complete the squeezing, each protruding lattice column is shifted by 1/2 of the vertical lattice spacing to get a well defined square grid. The horizontal lattice spacing is kept fixed. This process, depicted in the first step of Fig. 2, results in a square grid with equal length columns.

Next, the hexagonally-symmetric convolution is accounted for by splitting the hexagonal convolution kernel into hexagonal sub-kernels. These sub-kernels need to be re-defined on the squeezed grid in order to perform dilated convolutions using standard methods.


Figure 2: Schematic description of the individual steps of a hexagonal next-neighbour convolution. All kernel weights are set to 1. The input image consists of 4 × 4 hexagonal toy pixel values with ascending integer values and is padded with one layer of zeros in order to preserve the dimension of the input. In the first step, the hexagonal image is squeezed into a square layout. Then, the hexagonal kernel is split into two square sub-kernels: Kernel 1 (red & orange, 2 × 2, dilation (2, 1), stride (2, 1)) and Kernel 2 (green, 1 × 3, stride (1, 1)). Kernel 1 must be applied separately to even and odd columns of the squeezed image. The output of these two operations has to be interchangeably merged to preserve the spatial relationship of the columns. In the last step, the merged array is summed with the result of the squeezed image convolution with Kernel 2. The resulting square array is equivalent to the result of a hexagonal convolution map.

The shape, dilation and stride of each of the re-defined sub-kernels are chosen so that they are effectively applied only to pixels that are the hexagonal nth neighbours of the centre pixel of the corresponding hexagonal image patch, where n is a positive integer. Combining the sub-kernels' convolution outputs in an appropriate way yields a map that is equivalent to the result of applying a hexagonal kernel to the centre pixel of the image patch.

The detailed description of defining the custom sub-kernels is out of the scope of this paper, but is presented in [20]. Here we merely supply the reader with an example, shown in Fig. 2. In the figure, a toy input tensor is convolved with a hexagonal next-neighbour (NN) convolution kernel (i.e. where the kernel covers all direct neighbour cells of the centre cell in the hexagonal layout) and the individual steps for squeezing a hexagonal tensor and assigning the custom kernels to the squeezed image are schematically shown.
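For illustration, the sketch below performs a hexagonal next-neighbour convolution directly on a squeezed square array by looking up the six hexagonal neighbours with column-parity-dependent offsets. It is not the dilated sub-kernel decomposition of [20] shown in Fig. 2, and the parity convention (which columns are taken as shifted) is an assumption.

    import numpy as np

    # Offsets (dcol, drow) of the six hexagonal neighbours on the squeezed square
    # grid, depending on whether the column is one of the shifted ("odd") columns.
    EVEN_COL_NEIGHBOURS = [(0, -1), (0, 1), (-1, -1), (-1, 0), (1, -1), (1, 0)]
    ODD_COL_NEIGHBOURS = [(0, -1), (0, 1), (-1, 0), (-1, 1), (1, 0), (1, 1)]

    def hex_nn_conv(squeezed, w0=1.0, w1=1.0):
        # Hexagonal NN convolution: weight w0 on the centre pixel and w1 on each
        # of its six hexagonal neighbours; out-of-range neighbours act as zeros.
        rows, cols = squeezed.shape
        out = np.zeros_like(squeezed)
        for r in range(rows):
            for c in range(cols):
                acc = w0 * squeezed[r, c]
                offsets = ODD_COL_NEIGHBOURS if c % 2 else EVEN_COL_NEIGHBOURS
                for dc, dr in offsets:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += w1 * squeezed[rr, cc]
                out[r, c] = acc
        return out

    toy = np.arange(1.0, 17.0).reshape(4, 4)   # 4 x 4 toy image with ascending values
    print(hex_nn_conv(toy))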

5. Background Rejection Classifier

Compared to satellite-based detectors, IACTs provide large effective detector areas. However, the vast majority of events recorded by an IACT contain hadronic CR background. The ability to correctly reject background events without losing signal events is therefore a key aspect that determines the sensitivity of an IACT and hence serves as a primary goal for this work.

During this study we have learned that CNNs trained on simulated events exhibit different performance when tested on an MC test-set and when analysing real data. This statement holds for all three analysis tasks. For classification, the MC/real-data discrepancy manifests itself in the following way. When compared to the real-data performance of the HAP BDT classifier, we found that classifiers based on a standard CNN architecture tend to misclassify events that triggered three or four of the telescopes, although these classifiers were showing the best performance on the benchmark sets. However, by combining a CNN with an RNN this mismatch is considerably relaxed and the real-world performance of the classifier improves significantly. Therefore, we present here the results obtained with the latter, denoted by CRNN. This interim solution suppresses the discrepancy effects, but is certainly not an optimal and robust way to address it. The discrepancy between simulation and observation has important potential implications for the analysis of real data and we address them further in Sec. 8.

In the CRNN model we treat each telescope image as part of a sequence ordered by the size parameter of each image. This ordering assumes that EAS images with a higher size parameter are generated first, in telescopes that are closer to the impact point of the shower axis on the ground, and is expected to compensate for the lack of temporal data in the event images.

The CRNN architecture is based on the idea that a series of CLs can be used to find a vector representation of the 2D event images.



Figure 3: (a) ROC curves for the CRNN classifier (with and without pre-selection cuts) and the H.E.S.S. BDT classifier (pre-selected events only), with the corresponding AUC values: CRNN (pre-selected events) AUC = 0.9915, CRNN (no cuts) AUC = 0.9875, BDT (pre-selected events) AUC = 0.9642. (b) ζ-score distribution of the CRNN classifier for simulated signal and background events, obtained on the benchmark data-set with pre-selection cuts (accuracy = 96.14%).

Then, by ordering these vectors according to a size-based sequence, an RNN cell can discover correlations between the vector representations of the different event images. Lastly, the RNN cell output is fed into a dense network.

To implement this idea, we used a network containing three CLs for the image representation stage. A 0.5 dropout rate (see [23]) was applied to the vector outputs of the last CL. These vectors are then fed into an LSTM cell, according to the image size order. The output of the recurrent cell goes through another dropout node and is then fed into a two-layer dense network with dropout after each layer. To further reduce overfitting, we regularize each dense layer by applying weight decay with a constant of 0.004. The last layer of the network is a simple linear layer with two nodes. To classify inputs, one feeds the linear outputs to a softmax function to yield a probabilistic measure and predict the class of the input image.
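A Keras sketch of this CRNN layout is given below; the filter counts, dense-layer widths and LSTM size are placeholders, while the dropout rate of 0.5, the weight-decay constant of 0.004 and the two-node linear output follow the description above.

    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    l2 = regularizers.l2(0.004)                            # weight decay on the dense layers

    # Shared CNN that turns one telescope image into a feature vector.
    cnn = tf.keras.Sequential([
        layers.Conv2D(32, 3, activation="relu", padding="same", input_shape=(64, 64, 1)),
        layers.MaxPool2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPool2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.GlobalAveragePooling2D(),
    ])

    inputs = layers.Input(shape=(4, 64, 64, 1))            # up to four telescope images per event
    x = layers.TimeDistributed(cnn)(inputs)                # one vector per telescope image
    x = layers.Dropout(0.5)(x)
    x = layers.LSTM(64)(x)                                 # recurrent cell over the size-ordered sequence
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(128, activation="relu", kernel_regularizer=l2)(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(64, activation="relu", kernel_regularizer=l2)(x)
    x = layers.Dropout(0.5)(x)
    logits = layers.Dense(2)(x)                            # linear two-node output; softmax gives zeta
    model = tf.keras.Model(inputs, logits)
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))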

To calculate our test accuracy, we used a ζ threshold of 0.5, where ζ denotes the signal-class softmax value for a single event. An event is classified as a γ-ray when it receives a ζ-score > 0.5, otherwise it is labeled as background (ζ can thus be interpreted as the "probability to be a γ-ray"). For this threshold, our network achieves a total test-set accuracy of 95.4% on the base benchmark set and 96.1% on the pre-selected benchmark set (where total accuracy accounts for both γ-rays and protons which are correctly classified). We illustrate the general performance of the classifier by the ROC curve and the derived area-under-curve (AUC) metric. The test results on the classification benchmark sets are shown in Fig. 3. Fig. 3a shows the ROC curve, along with corresponding AUC values, of the CRNN classifier when applied to both classification benchmark sets, i.e. with and without pre-selection cuts, as well as the ROC curve of the standard H.E.S.S. BDT when classifying only the pre-selected events. Fig. 3b shows the CRNN ζ-score distribution of the events that pass pre-selection cuts.

Fig. 3a demonstrates that, unlike the BDT classifier, the CRNN classifier is quite robust against simulated images that are not fully contained in the camera. A BDT classification relies on the Hillas parameters, which may be significantly biased when a shower image is truncated. The CRNN, however, searches for common shape features in patches of the image and hence is less sensitive to distortions originating from truncation or broken pixels.

6. Direction Reconstruction Regressor

In contrast to the geometrical direction reconstruction that was described in Sec. 1, a neural network learns to predict the shower direction based on the features the convolutional filters extract from the images. We have seen that for regression, a CNN with a channel representation of the event images generally outperforms architectures that include a recurrent cell.

For the direction reconstruction task, we incorporate convolutional layers with a slightly different structure than that presented in Sec. 2. Here, the structure of the convolution layers comprises one convolution stage, one nonlinear stage, a second convolution stage, a second nonlinear stage and only then a pooling stage. We denote this layer structure by 2-1-CL.



Multiple stacked convolution and nonlinear stages can develop more complex features of the input volume before the pooling stage down-samples the non-linear convolution output. This is generally a good idea for larger and deeper networks, which are necessary for the regression task. The need for a deeper network can be understood by viewing the continuous labels of a regression network as an infinite set of discrete classes.

The network architecture for the interpolated-images input includes five 2-1-CLs followed by four dense layers. We use the same weight decay as in Sec. 5 for the dense layers and apply dropout after each dense layer with a rate of 0.8. The loss is calculated with the L1 distance to control the generic differences between features of low and high energy events, where the latter are present in much lower quantities in our data-sets.
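The following Keras sketch shows one possible realization of the 2-1-CL structure and the regression head with an L1 (mean absolute error) loss. Filter counts and dense-layer widths are placeholders, and whether the quoted dropout rate of 0.8 refers to the drop or the keep probability depends on the framework convention.

    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    def block_2_1_cl(x, filters):
        # One "2-1-CL": two convolution + nonlinearity stages, then one pooling stage.
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        return layers.MaxPool2D()(x)

    l2 = regularizers.l2(0.004)
    inputs = layers.Input(shape=(64, 64, 4))               # CT1-4 images as four channels
    x = inputs
    for filters in (32, 32, 64, 64, 128):                  # five 2-1-CL blocks (filter counts are placeholders)
        x = block_2_1_cl(x, filters)
    x = layers.Flatten()(x)
    for width in (512, 256, 128, 64):                      # four dense layers, dropout after each
        x = layers.Dense(width, activation="relu", kernel_regularizer=l2)(x)
        x = layers.Dropout(0.8)(x)                         # quoted rate of 0.8; convention may differ
    outputs = layers.Dense(2)(x)                           # source position in the nominal system
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mae")            # L1 distance as the training loss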

To label the training examples, we use the source position vector in a cartesian coordinate system that is defined by the optical axis of the telescopes, where one coordinate axis is aligned with the horizon. This coordinate system is referred to as the nominal system. The coordinate transformations from the nominal system to the Alt-Az and Ra-Dec systems, and vice-versa, are done by HAP.

A useful quantitative estimation of the performance of our direction reconstruction on γ-ray simulations is the angular resolution, defined here as the 68% containment radius (or the 68th percentile) of the reconstructed event positions from a point-like source. This can be taken as a measure of the device PSF. Figure 4 shows the angular resolution versus true simulated event energy obtained with two CNN-based regressors, the Hillas-based reconstruction and the ImPACT analysis. The CNN label in the legend refers to a regressor that was trained using pre-selected events, while the CNN_noCuts label refers to a regressor that was trained without applying pre-selection cuts to the training set. From the figure, both CNN regressors show a significant improvement in the angular resolution, particularly at the low energy range, when compared to the Hillas-based direction reconstruction. Between the regressors, an apparent improvement is achieved by applying pre-selection cuts to the training set (despite the fact that the remaining training set contains only 62% of the original events). Nevertheless, the CNN resolution is still slightly worse than the ImPACT reconstruction. In principle, as our network does not show signs of overfitting with the given architecture, it is also possible that a deeper network will further improve the angular resolution we have shown here.
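The angular resolution used here can be computed as in the short sketch below (in practice per bin of true energy, which yields the curves of Fig. 4); the small-angle treatment of the offsets is an assumption.

    import numpy as np

    def angular_resolution_r68(reco, true):
        # 68% containment radius: 68th percentile of the angular offsets between
        # reconstructed and true directions (small-angle approximation, degrees).
        offsets = np.linalg.norm(reco - true, axis=1)
        return np.percentile(offsets, 68.0)

    rng = np.random.default_rng(1)
    true_pos = np.zeros((10000, 2))
    reco_pos = rng.normal(scale=0.05, size=(10000, 2))     # toy scatter around a point source
    print(round(float(angular_resolution_r68(reco_pos, true_pos)), 3))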

Despite the fact that the CRNN classifier is not sensitive to relaxation of the pre-selection cuts, the direction regressor shows a similar behavior to the Hillas-based and ImPACT reconstructions with growing values of the local distance.

Figure 4: Angular resolution R68 vs. true simulated energy at 20◦ zenith angle, for pre-selected events from a point-like source. The results of two CNN regressors (legend labels CNN and CNN_noCuts) are shown in comparison to the Hillas-based (hillas_std) and ImPACT PSFs. The dotted curve refers to the regressor trained without applying pre-selection cuts to the training data, while the 'X'-decorated curve refers to the regressor trained on pre-selected events. All reconstructions are carried out on the same pre-selected benchmark set.

In particular, the angular resolution at energies above 10 TeV grows larger with the increase of the local-distance cut. This comes from the fact that the L1 loss is used in the training. As mentioned, such a loss is less sensitive to outliers, which in our case are represented by high energy events, due to the soft energy index of the training set. Using an L2 loss function for the direction reconstruction task generally tends to yield a regressor that performs better at high energies, but has nevertheless worse overall performance than a network learning with an L1 loss. As the real-data analysis presented in the following section is done on a source with a very soft spectrum, the choice of the L1 loss seems to be a reasonable one for this work. Nevertheless, to better generalize the regressor one may train on the low energy events using the L1 loss and on the high energy events using the L2 loss.

7. Source Analysis

To demonstrate that the results observed with MC simulated events translate well into real-world performance, we test the performance of our CRNN classifier and CNN direction regressor on two source observation sets of the blazar PKS 2155-304. The first set contains observations of the source in a non-flaring state, with the primary goal to test the background rejection performance of the CRNN classifier. We compare these results to those obtained by using a BDT-based background rejection. The second set consists of a single run that was taken during a strong flaring activity of the blazar, an ideal data-set to validate the CNN direction reconstruction performance due to the very low background rate.



To conduct the source data analyses, we rely on the H.E.S.S. calibration, instrument response functions and lookup tables used for the Hillas and ImPACT analyses as implemented in the HAP software. Pre-selection cuts, as described in Sec. 3.2, were applied to all reconstruction chains in the following.

7.1. Observation runs

PKS 2155-304 is a bright point-like VHE γ-ray source at redshift z = 0.116 [24]. For the background rejection test, we selected from the H.E.S.S. database 14 observation runs which pass the standard H.E.S.S. data quality criteria, with a mean observed zenith distance in the range [19.5◦, 20.5◦], in accordance with the zenith angle of the training set events. The runs were taken between 2004 and 2010 and sum up to a total live-time of 5.9 hours.

The direction reconstruction performance was tested on observations of the PKS 2155-304 flare that was recorded in 2006 [25], when the blazar showed an average flux of I(> 200 GeV) = (1.72 ± 0.05_stat ± 0.34_sys) × 10⁻⁹ cm⁻² s⁻¹, corresponding to ∼ 7 times the flux I(> 200 GeV) observed from the Crab Nebula (in comparison with the usual 20% Crab flux of the source). One observation run with an average zenith angle of 19.2◦ was chosen for this analysis.

7.2. Analysis

The separation power of the HAP BDT classifier relies heavily on the Mean Reduced Scaled Width and Length (MRSW and MRSL, respectively) approach introduced in [7]. These parameters are based on the fact that the width and the length distributions of a γ-ray shower image can be described well with MC simulations, in bins of energy, zenith, azimuth, offset and optical efficiency. The scaled parameters are used as a reference to identify γ-like background events that have an elliptical shape but whose width and length do not belong to the same distribution as the γ-ray events with a similar energy. The MRSW and MRSL are a weighted average of the standardized Hillas width and length of the image over all participating telescopes.
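A simplified sketch of the MRSW computation is given below; the per-telescope expectation values would come from the HAP lookup tables, and the weighting is reduced here to an optional per-telescope weight rather than the exact scheme of [7].

    import numpy as np

    def mean_reduced_scaled_width(widths, expected_means, expected_sigmas, weights=None):
        # Each telescope's Hillas width is standardized with the expected mean and
        # spread for a gamma-ray of comparable energy, zenith, offset and optical
        # efficiency (taken from lookup tables in the real analysis), and the
        # scaled values are averaged over the participating telescopes.
        scaled = (np.asarray(widths, dtype=float) - np.asarray(expected_means)) / np.asarray(expected_sigmas)
        return np.average(scaled, weights=weights)

    # Toy event seen by three telescopes.
    mrsw = mean_reduced_scaled_width(widths=[0.11, 0.13, 0.10],
                                     expected_means=[0.10, 0.12, 0.10],
                                     expected_sigmas=[0.02, 0.02, 0.02])
    print(round(float(mrsw), 2))   # close to zero for a gamma-like image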

A CNN-based neural network is not able to explicitly learn parameters that describe the global distribution of the training set, unless it was specifically trained to predict these parameters. Including all relevant parameters as training labels would require a more complicated and computationally demanding network architecture, due to the growing number of parameters needed to find a single function to model the relation between an image and all the different labels.

Configuration    Non    αNoff    σ      S/B
ImPACT_BDT       704    55.8     46.1   11.6
ImPACT_CRNN      832    62.3     50.8   12.4

Table 1: Event statistics, significance as calculated by the method of [26], and signal-to-noise ratio for the 14 runs of PKS 2155-304 data, for two analyses: an ImPACT direction reconstruction with a BDT classifier (top row) and an ImPACT direction reconstruction with a CRNN classifier (bottom row).

Since our labels do not relate to such information, to minimize the loss the network updates its parameters by looking only at intensity gradients in individual images. Thus, an elliptically shaped background image would be classified as signal, even if its MRSW or MRSL are outliers with respect to the image width and length distribution.

To deal with this, we use the HAP lookup tables to obtain the MRSW and MRSL values for each event in the observation runs and use them as additional rejection parameters prior to applying the CRNN classifier. The cut range is defined by the traditional Hillas analysis background rejection scheme, which does not apply a BDT but merely cuts on the MRSW and MRSL values (as defined in [7]).

This step was not performed in the verification of the CRNN classifier presented in Sec. 5, and the quantified estimated performance represents the true classification power of the CRNN classifier (i.e. the only cut used to classify simulated events with the CRNN classifier is the ζ-score cut of 0.5). The reason that the classification MC test shows excellent performance without the additional cuts stems from the fact that the ratio of signal to background events in the test set is 1:1. The effect of the misclassified γ-like background in the test is seen in the small excess of background counts with high ζ values in Fig. 3b. However, in reality, where the signal to background ratio is at least of the order 1:1000, the influence of γ-like background is much more severe. In the remainder of this section, a reference to the CRNN classifier implies the use of the shape cuts as well.

The next step in classifying real world events is an optimization of the ζ-score cut that will be used to separate events into their appropriate class. Usually, and as done in the case of the HAP BDT, one looks for an energy independent cut that yields a constant signal efficiency for all energies. We have not yet optimized our ζ cut in such a way, although we expect an optimized, energy independent value to improve performance. The cut value of 0.9 we chose is based on maximizing the signal to noise ratio in the observation runs of the non-flaring PKS 2155-304, without losing signal counts compared to the HAP BDT classifier.
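A minimal sketch of such a cut scan is given below, assuming hypothetical arrays zeta_on and zeta_off of classifier scores for on- and off-region events and the on/off exposure ratio alpha; it illustrates the procedure rather than the exact optimization we performed.

```python
import numpy as np

def scan_zeta_cut(zeta_on, zeta_off, alpha, cuts=np.linspace(0.5, 0.99, 50)):
    # For each candidate zeta cut, count the surviving on- and off-region
    # events and compute the excess and the signal-to-noise ratio S/B.
    results = []
    for cut in cuts:
        n_on = np.count_nonzero(zeta_on >= cut)
        n_off = np.count_nonzero(zeta_off >= cut)
        background = alpha * n_off
        excess = n_on - background
        s_over_b = excess / background if background > 0 else float("inf")
        results.append((cut, excess, s_over_b))
    return results
```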



Figure 5: The squared angular distance (θ²) distribution for excess events from one flare observation of PKS 2155-304, using the Hillas, ImPACT and CNN direction reconstruction methods.


The classifier performance comparison was done by running two separate ImPACT analyses on the 14 PKS 2155-304 runs: once on events that were flagged as signal by the standard HAP BDT scheme and once on events that were identified as signal by the CRNN classifier (together with the MRSW and MRSL cuts). The results are summarized in Table 1, where the event statistics N_on (number of events from the source on-region) and αN_off (the estimated background counts) are shown together with the significance σ and the signal to noise ratio S/B. The significance is calculated according to the Li & Ma method [26]. The table indicates that the CRNN classifier increases both the statistical significance of the source and the ratio of signal events to background events, while also increasing the number of excess counts (N_on − αN_off).
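For reference, the significance column of Table 1 follows Eq. (17) of Li & Ma [26], which can be written as:

```python
import numpy as np

def li_ma_significance(n_on, n_off, alpha):
    # Eq. (17) of Li & Ma (1983): significance of an on/off measurement
    # with on/off exposure ratio alpha.
    term_on = n_on * np.log((1.0 + alpha) / alpha * n_on / (n_on + n_off))
    term_off = n_off * np.log((1.0 + alpha) * n_off / (n_on + n_off))
    return np.sqrt(2.0 * (term_on + term_off))
```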

The application of our direction reconstruction to a real world source calls for accounting for several correction factors that affect the operation of the telescopes in reality. The largest correction is the telescope pointing correction, which compensates for small structural, mechanical and thermal deformations. Such considerations are not part of the MC simulations we have used to train our networks. Other factors include the atmospheric refraction index and focal length normalization factors. All the system corrections are applied by the HAP software.

The CNN direction reconstruction was done by applying the CRNN classifier to the PKS 2155-304 flare data-set and reconstructing the direction of the surviving events by means of the CNN regressor that was trained on pre-selected events (see Sec. 6). Nevertheless, the real-data performance of the two regressors presented in the previous section is very similar.

Figure 6: A two dimensional distribution of excess events observed in the direction of PKS 2155-304 for the one flare observation, using the CRNN classifier and CNN regressor.

Figure 5 shows the distribution of the squared angular distance θ² between the reconstructed event position and the test position, for excess events from the single PKS 2155-304 observation in flare state. The θ² distribution is plotted for the CNN regressor, a Hillas reconstruction and an ImPACT reconstruction. The Hillas and ImPACT reconstructions use the HAP BDT for background rejection, although the choice of classifier is less relevant for flare observations. The superior performance of the ImPACT method is clearly evident: the ImPACT PSF is 0.067°, precisely matching the results shown in [9]. The CNN and Hillas reconstructions are on the same level, with a PSF of 0.102° for both. The PSFs were calculated using 40 bins over a range of [0, 0.04] in θ².
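For illustration, the following sketch extracts a containment radius from the binned θ² distribution, assuming the PSF is quoted as the 68% containment radius of the excess; the array names and the exact definition are assumptions, not the HAP procedure.

```python
import numpy as np

def containment_radius(theta2_on, theta2_off, alpha,
                       n_bins=40, theta2_max=0.04, fraction=0.68):
    # Histogram the on- and off-region theta^2 values with the binning used
    # in the text (40 bins over [0, 0.04] deg^2), build the excess
    # distribution and return the radius containing the requested fraction.
    edges = np.linspace(0.0, theta2_max, n_bins + 1)
    h_on, _ = np.histogram(theta2_on, bins=edges)
    h_off, _ = np.histogram(theta2_off, bins=edges)
    excess = h_on - alpha * h_off
    cumulative = np.cumsum(excess) / excess.sum()
    idx = int(np.searchsorted(cumulative, fraction))
    return float(np.sqrt(edges[min(idx + 1, n_bins)]))  # radius in degrees
```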

Comparing Figs. 4 and 5, the different performance of the CNN regressor when analyzing simulated events versus real data is clearly evident, particularly when considering the soft spectrum of the source (as the CNN benchmark angular resolution is very similar to the ImPACT angular resolution for low energy events). This is yet another manifestation of the discrepancy between simulated images and real data, which is discussed in Sec. 8.

Fig. 6 shows a two-dimensional sky map of the excess counts in the direction of PKS 2155-304. Both figures show that the CNN direction reconstruction is able to detect a point-like source with a similar performance to a Hillas-based analysis. A 2D Gaussian fit of the peak in Fig. 6 finds a deviation of 11.8 arcsec from the test position, where the HAP estimated systematic errors are ∼30 arcsec [7].
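A fit of this kind could, for instance, be performed with a simple least-squares Gaussian model; the sketch below (using scipy, with illustrative names) is only an example of the idea and not the HAP fitting procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss2d(coords, amplitude, x0, y0, sigma_x, sigma_y):
    # Simple elliptical 2D Gaussian without rotation.
    x, y = coords
    return amplitude * np.exp(-0.5 * (((x - x0) / sigma_x) ** 2
                                      + ((y - y0) / sigma_y) ** 2))

def fit_excess_peak(x_grid, y_grid, excess_map, initial_guess):
    # Fit the excess sky map and return the fitted peak position; its offset
    # from the nominal source position gives the quoted pointing deviation.
    coords = np.vstack((x_grid.ravel(), y_grid.ravel()))
    popt, _ = curve_fit(gauss2d, coords, excess_map.ravel(), p0=initial_guess)
    return popt[1], popt[2]
```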



8. Summary and Outlook

We have demonstrated the improvement in background rejection performance of a convolutional recurrent neural network on a real world bright γ-ray source. Coupled to existing reconstruction techniques, this algorithm seems to have the potential for significant improvement of the analysis of VHE γ-ray sources. Because our classifier is robust against image defects and truncation effects, it could play an even more significant role in the future if a reconstruction technique that is able to deal with truncated images is introduced.

We have also presented a regression model to reconstruct the source sky-coordinates of a γ-ray EAS by applying a deep CNN to IACT images. This algorithm seems to perform reasonably well on simulated images of γ-rays, and can be further improved in the future, even by merely increasing the size of the training set or introducing deeper networks. Applying our direction regressor to real world data seems to be adequate for detecting a point-like source, but it needs to be improved to compete with state-of-the-art methods. Nevertheless, both source analysis results, namely the background rejection and the direction reconstruction, are adequate to serve as a proof-of-concept for the viability of a DL based IACT data analysis.

An IACT primary analysis goal that we have not presented here is the energy reconstruction of γ-events. However, it is known that the energy can be reconstructed with very high accuracy using the size parameter of the IACT images together with the EAS impact point on the ground. The reconstruction of the ground impact point is done geometrically in a very similar way to the reconstruction of the source position, where a finite series of linear transformations separates the two. This implies that the ability to reconstruct the source position is a good indicator of the ability to successfully reconstruct the energy of γ-events with DL techniques.

Energy reconstruction can be carried out in two principal ways. In the first, one would build a network that predicts the ground impact coordinates of the shower and use a lookup table into which the predicted values are input, together with the size parameters. Such an approach can be combined with the direction reconstruction network by adding two more nodes to the output layer. The second way is to build a network that predicts the energy by concatenating a vector that contains the size parameters per event with the vectors of features from a CNN block. We are currently exploring both ways. However, at this stage we merely have a preliminary model whose performance cannot be reliably estimated on real world data.
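To make the second option concrete, the following PyTorch sketch shows one possible form of such a network head; the layer sizes and names are illustrative and do not describe our preliminary model.

```python
import torch
import torch.nn as nn

class EnergyHead(nn.Module):
    # Concatenates the per-telescope Hillas size parameters with the feature
    # vector produced by the CNN block, then regresses a single energy value.
    def __init__(self, n_cnn_features, n_telescopes=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_cnn_features + n_telescopes, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # e.g. log10(E / TeV)
        )

    def forward(self, cnn_features, size_params):
        # cnn_features: (batch, n_cnn_features); size_params: (batch, n_telescopes)
        return self.fc(torch.cat([cnn_features, size_params], dim=1))
```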

A way to quantitatively estimate the energy reconstruction is by measuring a source flux. However, this requires dedicated effective areas that have to be generated by taking into account the cut performance of the background rejection method implemented in the analysis chain. MC simulations to create effective area lookup tables are planned for the near future.

In the future, when larger IACT arrays such as CTA [27] are built, one would like to be able to combine observations with different types of telescopes. In the case of H.E.S.S., hybrid reconstruction (i.e. including CT5 data in the event image stack) can be implemented by adding the "new" telescope images to the existing stack of images. With an RNN architecture this is done by simply extending the sequence length, as RNNs are flexible enough to deal with different input lengths in the same sequence. With the channel stack, one would need to resample all telescope images to a single, identical square grid. This could introduce the problem of oversampling one image while heavily downsampling another. Of course, memory might become an issue as well, as the length of the image stack per event grows.
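As an illustration of this flexibility, the short PyTorch sketch below packs events with different telescope multiplicities into one batch before an LSTM layer; the feature dimensions are arbitrary and the snippet is not taken from our implementation.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)

# Batch of two events: per-telescope feature vectors zero-padded to length 5,
# with true sequence lengths 4 (phase-I telescopes only) and 5 (including CT5).
features = torch.zeros(2, 5, 64)
lengths = torch.tensor([4, 5])

packed = pack_padded_sequence(features, lengths, batch_first=True,
                              enforce_sorted=False)
_, (h_n, _) = lstm(packed)  # h_n holds one summary vector per event
```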

Another issue with large arrays is the different telescope combinations dedicated to a specific observation run. Since the number of participating telescopes in a run is certainly smaller than the number of telescopes in the array, an observation run of a specific source will include a sub-set of the telescopes in the array. Then, a network needs to be trained on the particular telescope combination used for the run. It seems reasonable that run-wise simulations are the way to address this issue.

The disagreements we consistently observe between the performance of the CNN networks on simulated events versus real world events could suggest that a network trained on the complete intensity distribution in the image learns features from simulated images that do not exist in real-data images. This could indicate several difficulties with introducing a DL-based analysis. These difficulties emerge both when such a chain performs a full analysis or even when a single DL-based analysis task is combined with other state-of-the-art methods. In order to calculate fluxes, one relies on MC based effective areas, which are affected by all three analysis tasks. For example, the cut efficiency of the classifier in use directly affects the number of surviving signal events. Since a DL-based classifier acts differently on simulation versus observation data, the effective areas are not reliable when applied to observation events and the derived fluxes could be biased.

We also note that although a CRNN is seemingly not susceptible to the MC/real-data mismatch, it could certainly suffer from it as well.



For example, a network with the CRNN architecture that is trained to distinguish between simulated proton images and real-data images becomes astonishingly efficient at performing this task, with an accuracy of 99.5%. When testing the same classifier on a set comprised of MC γ's (which were not shown to the network during training) and MC protons, it assigns 99.6% of the events to the 'simulation' class (i.e. the MC protons class in the training set). This illustrates the risk of using simulations for training, as DL methods for computer vision are able to easily find features that do not exist in real-data images. In addition, the performance of the CRNN classifier could improve once the simulated images in the training sets contain similar features to real-data images.

Another issue arises when trying to determine the optimal network architecture for a given task. For example, as mentioned in Sec. 5, tuning hyper-parameters to optimize the network performance is done on a hold-out test-set, which is a sample of the training population. However, in our case such test results do not translate well to real world performance, because a network that generalizes well on simulated events does not necessarily perform accordingly when applied to observation data.

The MC/real-data discrepancy could indicate an issue with the shape conservation of the image resampling, an over-simplification of the telescope response simulation or a strong influence of random noise in real-data images (or a combination of the three). However, our earlier study of resampling methods leads us to rule out the first potential origin of the problem. Due to the importance of this matter we plan to study it thoroughly in the immediate future. For example, the distributions of parameters such as the Hillas parameters, MRSW, MRSL, etc. can be compared between simulations and real data to gain more insight into the depth of the problem. Studies of feature importance in a network trained on simulation vs. real data could assist in this as well. Using e.g. DL based autoencoders on real-data images and then applying the decoder part to simulated events could be another possible way to address this issue. We plan to report all these findings in a future publication.
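For instance, a two-sample Kolmogorov-Smirnov test could quantify such a distribution comparison; the sketch below uses placeholder arrays standing in for per-event MRSW values and is only meant to illustrate the idea.

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder distributions; in practice these would be per-event MRSW values
# from simulated gamma rays and from gamma-like excess events in observations.
mrsw_mc = np.random.normal(0.0, 1.0, size=5000)
mrsw_data = np.random.normal(0.1, 1.1, size=5000)

statistic, p_value = ks_2samp(mrsw_mc, mrsw_data)
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.2e}")
```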

Lastly, in order to be able to properly analyse a real world source, training the network in more zenith bins is required. Dividing the data into energy bins could improve performance as well, but the reduction in training set size will demand a stage of generating simulations in the specific bins, particularly for the higher energies. The run-time of the full analysis depends on the capabilities of the server at hand. Additional GPUs will reduce the run-time considerably. On our two-GPU machine the analysis run-times are similar to those of the ImPACT analysis.


Acknowledgements

We thank M. de Naurois, spokesperson for the H.E.S.S. Collaboration, and O. Reimer, chairperson of the Collaboration board, for allowing us to use data from the H.E.S.S. array in this publication. The authors are grateful to C. van Eldik for many helpful discussions.

References

[1] See the TeVCat online database for VHE γ-ray sources: http://tevcat.uchicago.edu/

[2] A.R. Bell, Cosmic ray acceleration, Astroparticle Physics 43 (2013) 56-70.

[3] L. Bergstrom, Dark matter and imaging air Cherenkov arrays, Astroparticle Physics 43 (2013) 50-55.

[4] W. Hofmann, Data analysis techniques for stereo IACT systems, AIP Conference Proceedings 515 (2010) 318.

[5] The H.E.S.S. Collaboration, see: www.mpi-hd.mpg.de/hfm/HESS/

[6] A.M. Hillas, Proceedings of the 19th International Cosmic Ray Conference 3 (1985) 445-448.

[7] F.A. Aharonian (H.E.S.S. collaboration), Observations of the Crab Nebula with H.E.S.S., Astron. & Astrophys. 457 (2006) 899-915.

[8] M. de Naurois and L. Rolland, A high performance likelihood reconstruction of gamma-rays for Imaging Atmospheric Cherenkov Telescopes, Astroparticle Physics 32 (2009) 231-252.

[9] R.D. Parsons and J.A. Hinton, A Monte Carlo Template based analysis for Air-Cherenkov Arrays, Astroparticle Physics 56 (2014) 26-34.

[10] S. Ohm, C. van Eldik and K. Egberts, Gamma-Hadron Separation in Very-High-Energy γ-ray astronomy using a multivariate analysis method, Astroparticle Physics 31 (2009) 383-391.

[11] X. Glorot, A. Bordes and Y. Bengio, Deep Sparse Rectifier Neural Networks, AISTATS (2011) 315-323.

[12] D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning representations by back-propagating errors, Nature 323 (1986) 533-536.

[13] For more details about CNNs, RNNs and DL see: I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, first ed., MIT Press, Cambridge, 2016 and references within.

[14] M.D. Zeiler and R. Fergus, Stochastic Pooling for Regularization of Deep Convolutional Neural Networks, CoRR (2013).

[15] S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation 9 (1997) 1735-1780.

[16] The Google Brain group, USENIX 12th Symposium on OSDI '16, 12 (2016) 265-283.

[17] D. Heck, J. Knapp, J. Capdevielle, G. Schatz and T. Thouw, CORSIKA: a Monte Carlo code to simulate extensive air showers, 1998.

[18] K. Bernlohr, Simulation of Imaging Atmospheric Cherenkov Telescopes with CORSIKA and sim_telarray, Astroparticle Physics 30 (2008) 149-158.

[19] See the PyTorch GitHub repository: https://github.com/pytorch/pytorch.

[20] T.L. Holch and C. Steppa, HexagDLy - Hexagonal Convolutions with PyTorch, https://github.com/ai4iacts/hexagdly (2018), doi:10.5281/zenodo.1166131.

[21] T.L. Holch, I. Shilon et al., Probing Convolutional Neural Networks for Event Reconstruction in γ-Ray Astronomy with Cherenkov Telescopes, Proceedings of ICRC2017 795 (2017).

[22] Q. Feng and T.Y. Lin, The analysis of VERITAS muon images using convolutional neural networks, Astroinformatics 12 (2016) 173-179.

[23] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Journal of Machine Learning Research 15 (2014) 1929-1958.

[24] F. Aharonian et al. (H.E.S.S. collaboration), H.E.S.S. observations of PKS 2155-304, A&A 430 (2005) 865-875.

[25] F. Aharonian et al. (H.E.S.S. collaboration), An Exceptional VHE Gamma-ray Flare of PKS 2155-304, ApJ 664 (2007) L71.

[26] T.P. Li and Y.Q. Ma, Analysis methods for results in gamma-ray astronomy, The Astrophysical Journal 272 (1983) 317-324.

[27] B.S. Acharya et al. (CTA collaboration), Introducing the CTA concept, Astropart. Phys. 43 (2013) 3-18.
