
Degree Project C in Physics, 15 credits - VT21

Investigations into neutrino flavor reconstruction from radio detector data using

convolutional neural networks

Oscar Ericsson

Supervisor: Christian Glaser
Subject Reader: Olga Botner

June 3, 2021

Abstract

As the IceCube Neutrino Observatory seeks to expand its sensitivity to high PeV-EeV energies by means of the radio technique, the need for fast, efficient and reliable reconstruction methods to recover neutrino properties from radio detector data has emerged. The first recorded investigation into the possibilities of using a neural network based approach to flavor reconstruction is presented. More specifically, a deep convolutional neural network was built and optimized for the purpose of differentiating νe charged current (CC) interaction events from events of all other flavors and interaction modes. The approach is found to be largely successful for neutrino energies above 10^18 eV, with a reported accuracy on νe CC events of >75% for neutrino energies >10^18.5 eV while maintaining a >60% accuracy for energies >10^18 eV. Predictive accuracy on non-νe CC events varies between 80% and 90% across the considered neutrino energy range 10^17 eV < E_ν < 10^19 eV. The dependence of the accuracy on νe CC events on neutrino energy is pronounced and attributed to the LPM effect, which alters the features of the radio signals significantly at energies above 10^18 eV, in contrast to non-νe CC events. The method shows promise as a first neural network based neutrino flavor reconstruction method, and results can likely be improved through further optimization.


Contents

1 Sammanfattning
2 Introduction
3 Physics
   3.1 Electromagnetic Showers
   3.2 Hadronic Showers
   3.3 Neutrino Detection with the Radio Technique
   3.4 The Landau-Pomeranchuk-Migdal effect
4 Neural Networks
   4.1 Loss functions and Gradient Descent
   4.2 Convolutional neural networks
   4.3 Training, Validating and Testing
   4.4 Overfitting
5 Method
   5.1 Generating Datasets with NuRadioMC
   5.2 Network Training
   5.3 Network Optimization
   5.4 Evaluation of the final network
6 Results
   6.1 Optimization
   6.2 Final Neural Network Architecture
   6.3 Network Performance
7 Discussion & Outlook
   7.1 Decisions regarding final network architecture
   7.2 Network Performance
8 Conclusions
A Training and Evaluation on Noiseless Data


1 Sammanfattning

As the energy sensitivity of the IceCube neutrino observatory is planned to be extended to PeV-EeV energies through the so-called radio technique, the need for fast, efficient and reliable reconstruction methods for recovering neutrino properties from radio detector data has become clear. The first documented investigation into the possibilities of using CNNs (Convolutional Neural Networks) for neutrino flavor reconstruction is presented. A deep CNN was developed and optimized with the aim of distinguishing νe CC (charged current) interactions from the remaining flavors and interaction modes. The method is shown to perform well for neutrinos with energies above 10^18 eV, with an accuracy on νe CC events of >75% for neutrino energies above 10^18.5 eV and >60% above 10^18 eV. The accuracy on non-νe CC events varies between 80% and 90% across the entire investigated energy interval 10^17 eV < E_ν < 10^19 eV. The energy dependence of the accuracy on νe CC events is explained by the LPM effect, which alters the shape of the radio pulses considerably at energies above 10^18 eV, in contrast to non-νe CC events whose signals largely retain their shape independently of energy. The method is shown to be a promising first attempt at neutrino flavor reconstruction using neural networks, and results can most likely be improved through further optimization.


2 Introduction

Completed in 2010, the IceCube Neutrino Observatory has opened new possibilities for extraterrestrial astronomy. As reports of the first recorded TeV-PeV neutrino events were released in 2013 [1; 2], the neutrino emerged as a key component in multi-messenger astronomy. Multi-messenger astronomy is the effort of making combined and coordinated astronomical observations of events and objects through multiple different information-carrying particles and phenomena, such as light, cosmic rays, gravitational waves and neutrinos.

Because of their exclusively weak interaction modes, neutrinos generally propagate freely throughout the universe, not subject to scattering off matter (to any large extent) or radiation, and their trajectories not bending due to magnetic fields. Their trajectories thus point back to their sources and information about their creation is retained. Neutrinos at energies above those recorded in IceCube thus far are thought to be produced in the interactions between ultra-high energy cosmic rays (UHECRs) and matter at the source or during propagation [3], although this has yet to be experimentally verified, and so the exact production and acceleration mechanisms of UHECRs and ultra-high energy neutrinos remain somewhat of a mystery to this day. By observing the high energy neutrino sky, we stand to gain a great deal of knowledge of the most violent and extraordinary events in the universe.

In August of 2020, the IceCube collaboration outlined the expansion plans of the detector under the name IceCube-Gen2 [3]. One major part of the expansion is a large radio detector array for the purpose of stretching the sensitivity of the observatory to lower fluxes of even higher energy neutrinos, in the high PeV-EeV range. The fundamental physical principles on which the detection of neutrinos in the radio wavelength regime relies are twofold: the emission of radiation at radio wavelengths as a result of particle cascades following neutrino interactions, and the properties of the polar ice sheets that allow radio waves to propagate over large distances.

The success of IceCube's past, present and future operations has been, is and always will be highly contingent upon reliable and efficient event reconstruction methods. That is, given some event registered in the detector, one needs to be able to accurately reconstruct the properties of the neutrino that triggered the event. The particularities of the IceCube detector and its location put constraints on the reconstruction methods; they need to be simple enough to be run by the limited hardware on site and fast enough to prevent pileup given the rate at which events are registered in the detector. The possibility of using neural network based reconstruction methods to this end has been brought to light recently, with reported success [4]. Neural networks are in general, once trained, efficient both in terms of computational time and resources. One caveat is, however, the need for sufficiently large datasets on which to train, i.e. data for which the target property is known and that the network can learn to generalize from. Luckily, this need is met by the NuRadioMC simulation code [19], a Python-based Monte Carlo framework that precisely simulates radio signals as registered in detectors following neutrino interactions in ice.

With the ability to generate realistic data from known neutrino properties in large quantities, NuRadioMC opens the door to investigations of neural network based reconstruction methods of neutrino properties from radio detector data. This work explores the possibilities of determining the neutrino flavor from the pulses registered in radio antennas deployed in the polar ice. Specifically, the capabilities of neural networks in distinguishing electron neutrinos having undergone charged current interactions from all other interactions are investigated. In essence, the posed problem can be seen as a binary image classification task: there is a large dataset of images and each image is to be assigned to one of two categories.


These are tasks in which neural networks have excelled [28], as is further confirmed in this work.


3 Physics

Within the framework of the Standard Model, neutrinos exist as exceptionally low-mass, charge-neutral fundamental particles belonging to the group of fermions called leptons. There are six known leptons: three types of neutrinos as well as three distinct charged leptons. The three charged leptons are the electron (e⁻), muon (µ⁻) and tau lepton (τ⁻), and the three associated neutrinos are denoted νe, νµ and ντ. The three neutrinos are said to be of three unique flavors, where the flavor is indicated by the subscript. The leptons are further grouped pairwise into generations in which each charged lepton is paired with its associated neutrino. There are thus three generations of leptons, which can be represented in the form

$$\begin{pmatrix} \nu_e \\ e^- \end{pmatrix}, \quad \begin{pmatrix} \nu_\mu \\ \mu^- \end{pmatrix}, \quad \begin{pmatrix} \nu_\tau \\ \tau^- \end{pmatrix}, \qquad (1)$$

and similarly for the corresponding antiparticles. The grouping into generations is justified by the fact that, to each generation, a quantum number called lepton number can be assigned, and the lepton numbers corresponding to the three generations of leptons seem to be individually conserved in all known interactions. The three lepton numbers L_e, L_µ and L_τ are defined as

$$L_e \equiv N(e^-) - N(e^+) + N(\nu_e) - N(\bar{\nu}_e), \qquad (2)$$

$$L_\mu \equiv N(\mu^-) - N(\mu^+) + N(\nu_\mu) - N(\bar{\nu}_\mu), \qquad (3)$$

$$L_\tau \equiv N(\tau^-) - N(\tau^+) + N(\nu_\tau) - N(\bar{\nu}_\tau), \qquad (4)$$

where N(e⁻) denotes the number of electrons in the system, etc. Here, the lepton antiparticles, or antileptons, have been introduced, distinguishable in the case of the charged leptons by having positive charge and in the case of the neutrinos through the barred notation ν̄_ℓ.

Leptons have the defining feature of not interacting via the strong interaction. Neutrinos, being charge neutral, thus only interact weakly with other particles (although of course present, gravitational attraction between particles can for all intents and purposes be neglected). The weak interaction is mediated by the W± and Z⁰ bosons. Their comparatively large masses, M_W = 80.379 ± 0.012 GeV/c² and M_Z = 91.1876 ± 0.0021 GeV/c² [7], mean that the range of the weak force is consequently very short, on the order of 2 × 10⁻³ fm. Owing to the short range of the weak interaction, cross-sections for neutrino interactions are very small. That is to say, neutrinos rarely interact with matter, allowing them to propagate relatively unobstructed through space. However, the cross-sections associated with neutrino interactions are energy dependent; they increase with increasing energy. At the very highest energies, on the scale of EeV, the cross-sections are large enough to prevent neutrinos from passing through the Earth unobstructed [18].
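
For orientation, the quoted range follows from a standard order-of-magnitude estimate (not spelled out in the text above): taking the W mass as the energy scale of the exchanged boson,

$$R \sim \frac{\hbar c}{M_W c^2} \approx \frac{197.3\ \mathrm{MeV\,fm}}{8.04 \times 10^{4}\ \mathrm{MeV}} \approx 2.5 \times 10^{-3}\ \mathrm{fm},$$

consistent with the value stated above.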

Naturally, weak interactions are commonly categorized according to the nature of the boson responsible for mediating the interaction. On the one hand there are the charged W± bosons, carrying unit charge between particles in interactions. For this reason, a W-exchange between two particles is referred to as a charged current (CC) interaction. On the other hand there is the charge-neutral Z⁰ boson, present in interactions where no charge is carried between particles. By analogy to the case of W-exchange, Z-exchange interactions warrant the name neutral current (NC) interactions. Examples of basic vertices are shown in Fig. 1. Note that lepton number L_ℓ is conserved at the vertices.


Figure 1: Basic vertices for charged and neutral current interactions.

Note also that these Feynman diagrams do not represent fully complete, allowed interactions, but merely single vertices that may occur as part of such interactions.

Interactions with the vertices depicted in Fig. 1 are the only possible interaction modes for neutrinos, and any attempt at detecting neutrinos or measuring their properties is thus confined to said vertices.

Going forward, neutrino-nucleon interactions at very high (E_ν > 1 TeV) neutrino energies will be of particular interest. Interactions at such high energies fall well within the regime of inelastic scattering reactions. These are reactions of the form

$$\nu_\ell + p \to \nu_\ell + X^+, \qquad (5)$$

$$\nu_\ell + n \to \nu_\ell + X^0 \qquad (6)$$

in the case of NC interactions, and

$$\nu_\ell + p \to \ell^- + X^{++}, \qquad (7)$$

$$\nu_\ell + n \to \ell^- + X^+ \qquad (8)$$

for CC interactions. Inelasticity here refers to the conversion, or fragmentation, of the nucleon into hadronic states X. In the above reactions the p are protons, the n neutrons, and the X represent any collection of hadrons, with total charge given by the superscript, that is allowed by conservation laws. The reactions are analogous for antineutrinos, exchanging the leptons ℓ⁻ for their antiparticles and being mindful of assigning charge to the hadronic states so as to conserve charge between initial and final states.

These interactions are best analyzed by considering the proton's constituent quarks. Fig. 2 illustrates a case of inelastic neutrino scattering. The upper part of the diagram is relatively straightforward: νe + d → e⁻ + u through the exchange of a W boson. The lower parts, depicting the recoil and remaining u-quarks converting into hadrons, involve quite complicated processes of fragmentation, which will in their totality be represented here only as line-filled circles as shown in the figure, where the X need not correspond to the same hadronic states. A detailed analysis of the fragmentation processes will not be needed for the purposes of this work. It will suffice to list and discuss only certain particles as parts of possible final states X resulting from interactions (5)-(8). [5]


Figure 2: An example of deep inelastic neutrino scattering on a proton.

3.1 Electromagnetic Showers

At the high neutrino energies considered in this work, the absorption process for photons is dominated by e± pair production in the vicinity of atomic nuclei, and e± energy loss in matter is dominated by bremsstrahlung. One could expect there to be some synergy between the two processes, and rightfully so. The consequence of the synergy is in fact quite pronounced: a cascade of e± and γ develops within the medium in which an incident high energy e± (or γ) interacts. Such cascades are referred to as electromagnetic showers. The fundamental features of electromagnetic showers and their development can be derived through a strikingly simple model introduced by Heitler [8] and will be presented here as summarized in more recent literature [9; 10].

Heitler's model of electromagnetic showers is based solely on the consideration of electrons, positrons and photons. Electrons and positrons lose energy due to bremsstrahlung, which is modeled as single photon emissions, and photons undergo pair production in the vicinity of atomic nuclei. An incident high energy electron, having entered a given medium, will travel an average distance d before emitting a photon. This distance is referred to as one splitting length and is related to the radiation length, L_R, in the medium. The radiation length has an important physical interpretation, as is apparent from the following line of reasoning:

The mean rate of energy loss for relativistic electrons with $E \gg m_e c^2 / (\alpha Z^{1/3})$, where α is the fine structure constant and Z the atomic number, can be shown to be

$$-\frac{dE}{dx} = \frac{E}{L_R}. \qquad (9)$$

The above equation can be integrated to yield

$$E = E_0 e^{-x/L_R}, \qquad (10)$$

where E_0 is the initial energy of the electron. It can be readily seen from Eq. 10 that L_R is the mean thickness of material in which the electron energy is reduced by a factor e. [5]


The Heitler model assumes that the photon emitted from the electron through bremsstrahlung carries away half the electron energy. This implies that the splitting length d must be the average distance over which the electron loses half of its energy, and further that d is related to the radiation length L_R through d = L_R ln 2. This can be seen by requiring

$$E_{x=d} = E_0 e^{-d/L_R} = \frac{1}{2} E_0, \qquad (11)$$

from which

$$e^{-d/L_R} = \frac{1}{2} \;\Longrightarrow\; -d = L_R \ln\frac{1}{2} \;\Longrightarrow\; d = L_R \ln 2. \qquad (12)$$

So, after traveling one splitting length through the medium, the incident electron radiates a photon with energy E_0/2. The photon and electron then propagate another splitting length each, whereby the electron again loses half its energy through bremsstrahlung and the photon is converted into an e± pair. This process continues until the energies of the final electrons and photons fall below the critical energy E_C, defined as the energy at which electron energy losses due to bremsstrahlung equal those due to ionization. Since ionization losses depend on the medium, so does E_C, and for all but the lightest elements E_C ≈ 600/Z MeV [5].

Simply by accounting for e⁻, e⁺ and photons and their interactions with matter at high energies, not only is Heitler's model able to describe the general shape and development of electromagnetic showers, it is also possible to derive estimates of some fundamental physical quantities from it. As an example, consider a shower developing over n splitting lengths. After each splitting length, the total number of particles N is increased twofold. Thus, after n splitting lengths (i.e. over a total depth x = n L_R ln 2), the total number of particles is predicted to be N = 2^n. See Fig. 3. Other quantities include the depth X_max, at which the shower has reached maximum size (i.e. number of particles), and the elongation rate Λ, which is a measure of the rate at which X_max increases with initial electron energy E_0 [10].
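
For orientation, two standard corollaries of the model follow directly from the assumptions above (they are not written out in the text): the multiplication stops once the energy per particle, E_0/2^n, has dropped to the critical energy E_C, so that

$$N_{\max} \approx \frac{E_0}{E_C}, \qquad n_{\max} = \frac{\ln(E_0/E_C)}{\ln 2}, \qquad X_{\max} = n_{\max}\, L_R \ln 2 = L_R \ln\frac{E_0}{E_C},$$

from which the elongation rate follows as the increase of X_max per decade of primary energy, Λ = dX_max/d(log₁₀ E_0) = L_R ln 10 ≈ 2.3 L_R.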

3.2 Hadronic Showers

Particle cascades initiated by fragmentation of the nucleon in neutrino-nucleon (νN) interactions, called hadronic showers, are harder to accurately model step by step. This is mainly because the fragmentation process by which the nucleon is converted into jets of hadrons is complicated, and the exact composition of the jets is generally different from event to event. A key feature of such cascades is, however, borne out of the simple fact that the decay chains of virtually all known hadrons involve pions at some stage, either through decay directly into pions or by decay into more massive secondary hadrons which in turn decay into pions. Additionally, hadrons interacting with the medium in which they are propagating before decaying contribute to pion production. The key feature is exactly that these cascades will to a large extent contain pions [10; 18].

To simplify matters, it can be assumed that the energy that goes into pion production in hadronic showers is divided equally between pion flavors, so that one third of the pions created are π⁰ and the remaining two thirds are π±. Neutral π⁰ have a mean lifetime of 8.4 × 10⁻¹⁷ s and decay mainly into two photons, and these photons induce electromagnetic showers in accordance with what is expected from the Heitler model. In contrast, charged pions have much longer mean lifetimes (2.60 × 10⁻⁸ s) and decay almost exclusively into muons. However, at very high energies and in dense media, the comparatively long lifetime means that pion interactions with the medium become increasingly more probable than decay over any given distance, and eventually the expected outcome [18]. The interactions produce further pions where, again, a third of the energy goes into π⁰ production and the π⁰ in turn decay into photons which initiate electromagnetic showers.


Figure 3: Schematic representation of an EM shower.

It is thus clear that charged pions also contribute to the funneling of energy into electromagnetic shower components at high energies, and that purely hadronic showers develop measurable electromagnetic characteristics over time. In fact, nearly all of the initial energy transferred to the nucleon eventually ends up in electromagnetic cascades through the described process.

3.3 Neutrino Detection with the Radio Technique

Already in the 1960s, it was predicted by G. A. Askaryan that electromagnetic showers that develop in dense media will give rise to measurable emissions of coherent radiation at radio wavelengths [20; 22]. This is known as the Askaryan effect. Radio emission of this kind was observed experimentally at SLAC in 2000 [23] and in ice by the ANITA collaboration in 2006 [24]. The power of the emitted radiation is found to be proportional to the square of the energy of the incident particle, which means that neutrinos at high PeV-EeV energies are expected to produce especially intense radio pulses if they interact within a medium. Measuring these emissions would be an effective approach to neutrino detection at high energies. This is the radio technique.

Conceptually, the Askaryan effect can be described as follows: As a particle shower develops, electrons scattered from the medium accumulate at the shower front, and positrons generated in the shower undergo annihilation at the shower front. The cumulative effect is a build-up of a negative charge excess in the shower front and an overall charge separation along the shower axis. This charge excess can quickly grow to upwards of 20% in dense media [24]. From a macroscopic point of view, at distances much larger than the dimensions of the shower, the shower looks like a time-varying dipole from which radiation is emitted. Just as in the case of Cherenkov radiation, when the propagation speed of the shower, and so of the charge excess, is larger than the speed of light in the medium, the coherent radiation is confined to a cone at the characteristic Cherenkov angle.


For shower energies in the EeV range, the emission intensity is typically largest at frequencies between 100 MHz and 1 GHz [25].

The medium often proposed for neutrino detection with the radio technique is ice because of the large attenuation length for radio waves. In particular, the Antarctic ice exhibits an attenuation length of more than 1 km in the 100 MHz - 1 GHz frequency range [26]. Combining this with the relatively low cost of radio instrumentation, the radio technique constitutes an incredibly cost-efficient way of building large detector volumes capable of detecting neutrinos at EeV energies and above.

3.4 The Landau-Pomeranchuk-Migdal effect

Detailed relativistic and quantum mechanical considerations of the processes that drive electromagnetic showers, bremsstrahlung and pair production, in dense media have shown that the cross-sections for these processes are significantly reduced at sufficiently high energies (>1 PeV), decreasing as E^(-1/2) [13; 14; 16]. This effect is called the Landau-Pomeranchuk-Migdal (LPM) effect. A detailed physical description of the effect falls outside the scope of the presented work. However, the effect has important implications for the development of electromagnetic showers, and thereby also for the Askaryan emissions generated by the showers. The effect on the shower dimensions is an elongation along the shower axis, which will be important to consider in more detail:

A decreasing probability of bremsstrahlung and pair production with increasing energy means a longer average interaction length for photons, electrons and positrons. With each particle propagating further on average, the linear size of the shower will increase. Furthermore, a departure from the Heitler model of electromagnetic showers is required in order to understand effects on the charge-excess profile, i.e. the number of electrons minus positrons over the spatial extent of a shower. The distribution of the shower energy among the involved particles in the early stages of shower development is in reality not equal, but rather somewhat stochastic. A particle with a smaller fraction of the total energy is affected less by the LPM effect and will thus, on average, travel a shorter distance than one with a larger energy. The particles can initiate two spatially separated sub-showers, leading to a more stochastic charge-excess profile [19]. See Fig. 4 for a graphic representation.

In contrast, LPM elongation in hadronic showers is much less pronounced at comparable energies [11]. On the one hand, significant production of neutral pions is generally associated with the later stages of hadronic shower development, and on the other hand, neutral pion decay at energies >PeV is suppressed because they are much more likely to interact in the medium than decay [12]. This means that neutral pions will generally be created at somewhat lower energies where the probability of decay into photons is much larger than that of interactions with the medium, and so the electromagnetic showers initiated by these photons will not be affected by the LPM effect to the same extent as showers induced by the outgoing electron in a νe CC interaction. See Fig. 5.

Since the LPM effect is inherent to electromagnetic showers and becomes more pronounced the higher the shower energy, combined with the fact that in νe CC interactions roughly 80% of the initial neutrino energy is carried by the outgoing electron instigating an electromagnetic shower [12], the features of radio pulses from νe CC interactions will generally be noticeably different from non-νe CC interactions at EeV neutrino energies, as can be seen by comparing the rightmost images in Fig. 4 and Fig. 5. This will come to serve as the main signature that will aid in differentiating νe CC events from non-νe CC events in the proposed flavor reconstruction task.


Figure 4: (Left) Charge-excess profiles for six distinct electromagnetic showers with initial energy 10^19 eV. The profiles show multiple peaks due to sub-showers resulting from the LPM effect. (Right) Modeled Askaryan pulses corresponding to the profiles at 1 km distance, seen from two different angles Θ. Figures from C. Glaser et al. (2020) [19] under Creative Commons license CC BY.

Figure 5: (Left) Charge-excess profiles for six hadronic showers with initial energy 10^17 eV. The profiles are more distinct and uniform relative to the EM showers, and the shower lengths are shorter. The green dotted line represents an event in which a high energy π⁰ decayed in the early stages of the shower development, initiating a high energy EM shower subject to significant influence from the LPM effect. (Right) Askaryan pulses corresponding to the different charge-excess profiles. Figures from C. Glaser et al. (2020) [19], redistributed under Creative Commons license CC BY.


4 Neural Networks

The machine learning landscape is vast, with many successful algorithms having been developed for the purpose of performing classification tasks. Artificial neural networks are one category of such learning algorithms. The name stems from the algorithms' likeness to the structure of the brain, being an interconnected network of neurons. In the brain, neurons behave in a rather simple way: electrical signals from other neurons are received via the synapses that connect them, and once enough signals have been registered by a neuron within a few milliseconds, the neuron fires its own signal. It is the extensive networks of billions of neurons that give rise to the complex computational capabilities of the brain. Oftentimes, such networks of neurons follow a layered structure, illustrated in Fig. 7. Modeling neurons and biological neural networks mathematically is the first step in creating working artificial neural networks. One of the simplest yet most successful models is the Perceptron, devised as early as 1957 [17]. It serves as a good starting point in understanding the essence of what neural networks are and how they are able to learn.

The Perceptron consists of a collection of artificial neurons. Much like their biological counterpart, artificial neurons produce some output after having received sufficient input. An artificial neuron can register input from multiple input connections, and each connection has a weight associated with it. Inputs can be denoted I_1, I_2, ..., I_n and the respective weights w_1, w_2, ..., w_n. The neuron calculates the weighted sum of the inputs, i.e. the dot product between the input vector and the weight vector, then passes the sum through the neuron's activation function. The output of the activation function is essentially the neuron's output signal, which can be interpreted by the architect as an attempt at a prediction, or fed as input to other neurons. A Perceptron is technically a single layer of such artificial neurons, much like in Fig. 7, where each neuron is connected to all inputs. In order to create a multi-layer network, several Perceptrons can be figuratively stacked on top of each other, creating a Multi-Layer Perceptron (MLP), where the inputs to each neuron in each layer are the outputs of the neurons of the previous layer. This is how neurons in a network are able to communicate. Such interconnected layers are called dense layers or fully connected layers. It is common to let the initial layer that is fed the input data simply pass the inputs through to the second layer of neurons, creating an input layer. The final layer of the network is then accordingly called the output layer. Layers in between the input and output layers are referred to as hidden layers. [6]

With the structure of an MLP in place, the next step is to implement conditions that enable learning.

4.1 Loss functions and Gradient Descent

The way learning is achieved in machine learning is to have the network perform regression on the training data. This means finding an optimal set of parameter values that minimizes some given cost or loss function. The simplest type of regression is linear regression, where the loss function is the Mean Square Error (MSE), i.e. the square of the distance between the true value and the predicted value. Recall that when the network is asked to make a prediction of a quantity given some input data, it computes a weighted sum of the inputs at each neuron until the output is reached, and the output value represents the network's prediction of the sought-after quantity. However, the weights are features of the network, so the task boils down to finding a set of network weights that minimizes the distance between the predictions and the true data points from the input features.


Figure 6: Visual representation of a simple Multi-Layer Perceptron with two output neurons. Each unit in the pass-through input layer is connected to each unit in the hidden layer, and similarly for the output layer. Each arrow represents a connection that has an associated weight.

Figure 7: Many layers of neurons making up a network structure. Drawing by S. Ramón y Cajal (public domain). From https://en.wikipedia.org/wiki/Cerebral_cortex


The network weights are updated in an iterative fashion to minimize the loss function through a process called gradient descent. The local gradient of the loss function is calculated with respect to the network weights, and a step, meaning a change in weight values, is taken in the direction of the negative gradient at that point. This process is repeated until a zero gradient, i.e. a minimum of the loss function, is reached, at which point the minimization task is completed. In practice, the vector containing all network weights is initialized in a random fashion according to a chosen distribution. The weights are then successively tweaked to minimize the loss function through gradient descent until convergence is reached. The size of the descent step, i.e. how much the weight vector is changed at each iteration, is set through the so-called learning rate hyperparameter. The term hyperparameter refers to specifications of the network and training that are not network connection weights (which are commonly called parameters). Mathematically, one iteration of gradient descent can be represented as

$$x_{\text{next}} = x - \eta\, \nabla_x \text{MSE}(x), \qquad (13)$$

where x is the weight vector and η is the learning rate hyperparameter. An appropriate value of the learning rate is crucial for efficient network training. If the learning rate is too high there is a risk of missing the minimum of the loss function altogether, making the algorithm diverge and not find good values of the network weights. On the other hand, if the learning rate is too low the network might take an unnecessarily long time to converge.
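
To make Eq. 13 concrete, the following sketch (illustrative NumPy code, not taken from the thesis) performs gradient descent on the MSE loss of a simple linear model; mse_gradient is a hypothetical helper and eta plays the role of the learning rate η.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of the mean squared error of a linear model y_hat = X @ w."""
    return 2.0 / len(y) * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # 100 samples, 3 input features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)     # noisy targets

w = rng.normal(size=3)                          # random initialization of the weights
eta = 0.1                                       # learning rate hyperparameter
for _ in range(200):                            # Eq. 13: w_next = w - eta * grad MSE(w)
    w = w - eta * mse_gradient(w, X, y)

print(w)                                        # close to true_w after convergence
```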

In the presented work, the goal was to create a network capable of performing classification, in particular binary classification. Instead of outputting the value of some quantity, the network should output a probability that a training sample belongs to a certain category. To this end, a regression algorithm called logistic regression can be used. In this case, each node in the network outputs a number between 0 and 1 based on the weighted sum of the input features. The weighted sum is the same as described previously, just the scalar product between the inputs and the network weights, but the sum is passed as the argument to the logistic function

$$\sigma(t) = \frac{1}{1 + e^{-t}}. \qquad (14)$$

The output can then be interpreted as the probability p of the inputs belonging to a given class, and the prediction is yes if the output is ≥ 0.5 and no if < 0.5. This can be written as

$$y = \begin{cases} 0 & \text{if } p < 0.5 \\ 1 & \text{if } p \geq 0.5 \end{cases} \qquad (15)$$

The goal of a logistic regression model is to find a set of network weights that output high probabilities for samples that belong to the given class and low probabilities for those that do not. This is achieved by using a loss function of the form

$$L(x) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1-p) & \text{if } y = 0 \end{cases} \qquad (16)$$

where p is the output probability, which depends on the network weights x. Since −log(p) grows very large when p → 0, the loss will grow large if the network's prediction is close to 0 for a sample belonging to the given class, i.e. y = 1, and similarly for the opposite case. The loss function on the entire collection of samples is simply the average loss across all m samples, given by

$$\bar{L}(x) = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right], \qquad (17)$$


where y_i is the class label (0 or 1) of sample i. This loss function is luckily convex, meaning it has only one minimum. This ensures that gradient descent is guaranteed to find the global minimum, given that the learning rate is set appropriately. [6]
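
As a small numerical illustration of Eqs. 14 and 17 (hypothetical code, not part of the thesis), the snippet below evaluates the logistic function and the average log loss for a handful of predictions and labels.

```python
import numpy as np

def sigmoid(t):
    """Logistic function of Eq. 14."""
    return 1.0 / (1.0 + np.exp(-t))

def average_log_loss(y_true, p):
    """Average loss of Eq. 17 over all m samples."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)          # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

t = np.array([2.0, -1.0, 0.3, -3.0])            # weighted sums produced by the network
p = sigmoid(t)                                  # predicted probabilities, Eq. 14
y_true = np.array([1, 0, 1, 0])                 # true class labels
print(p.round(3), average_log_loss(y_true, p))  # loss is small when p matches y_true
```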

4.2 Convolutional neural networks

Much like the hitherto discussed neural networks were inspired by and designed to mimic the computational and learning processes of the brain, convolutional neural networks (CNNs) are abstractions of the visual cortex in the brain. The visual cortex is responsible for our ability to distinguish features in the world around us, to separate, categorize and recognize patterns in visual information. The idea of a CNN is to be able to do just that, and CNNs have been found to be largely successful in many areas related to visual recognition. There are two key parts of CNNs that distinguish them from purely fully connected networks: convolutional layers and pooling layers.

Convolutional layers are made up of neurons, but each neuron in the layer is only connected to other neurons in its receptive field, i.e. each neuron only "sees" and is stimulated by a small number of inputs. When used as an input layer, the neurons in a convolutional layer are only connected to image pixels within a small area of the image, and another convolutional layer above the input layer has nodes that each connect only to nodes in the input layer that fall within a small area of that layer, and so on. Each neuron in the input layers is then associated with a small-scale feature of the image, while successive layers above combine these smaller features into larger ones. As real-world images in many cases show both small- and large-scale features, this partitioning of the image into different levels of features makes CNNs very successful.

The weights associated with a neuron's connections to other neurons in its receptive field make up a filter or kernel. The dot product between all inputs inside the receptive field and the filter is taken to produce a neuron output. The size of the filter can be chosen at will, and with it the neuron's receptive field. Moreover, each neuron in the layer shares the same filter weights, creating the effect of a filter of the specified size "sliding" across the input image (or lower layer), calculating the dot product between filter and input pixels at every step. The resulting matrix of dot products is called a feature map, highlighting the parts of an image that activate a filter the most [6]. Multiple filters, and thereby feature maps, can be associated with each convolutional layer, meaning multiple filters can be trained on the outputs of neurons inside the same receptive field. The filter weights that are most useful for the given recognition task are chosen by the network as it runs through all the samples in the dataset.

Each feature map in the end often reflects some feature in the input image, as the name would suggest. These features might be simple edges of different orientations, geometric shapes, or more complex patterns like scales and fur (given that the input images are of animals, of course). Generally, simpler features are represented in the feature maps in the early layers of the network, while more complex features are represented closer to the final output. The point of stacking filters in this fashion is to make the network capable of detecting multiple features in each part of the image; if limited to just one feature map per layer, the network might only detect edges, or circles, or brightness gradients, etc.

Pooling layers are quite similar to convolutional layers in how they are structured, although the operations they perform and the outputs they produce are very different. Pooling layers consist of neurons with limited receptive fields, but there are no trainable connection weights to speak of. Instead, pooling layers calculate and output the maximum (max pooling) or mean (average pooling) value inside the set receptive field, or pool size. This has the intended effect of reducing the dimensions of the input image, which is often done in order to reduce the computational load, the number of network parameters, and the memory usage [6].


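
As a toy illustration of the two operations (hypothetical NumPy code, unrelated to the actual detector data or network), a single filter slid across a short 1D signal produces a feature map, which a max pooling step then downsamples.

```python
import numpy as np

signal = np.array([0., 0., 0., 1., 1., 1., 0., 0.])   # a simple step-like 1D "image"
kernel = np.array([-1., 0., 1.])                       # edge-detecting filter weights

# Convolutional layer: dot product between the filter and each receptive field
feature_map = np.array([signal[i:i + len(kernel)] @ kernel
                        for i in range(len(signal) - len(kernel) + 1)])

# Max pooling layer: maximum value inside each pool of size 2
pool = 2
trimmed = feature_map[:len(feature_map) // pool * pool]
pooled = trimmed.reshape(-1, pool).max(axis=1)

print(feature_map)   # peaks where the signal rises, dips where it falls
print(pooled)        # the same information at half the resolution
```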

4.3 Training, Validating and Testing

There are three main phases of neural network development beyond the assembly of the actual network: training the network on the available dataset, validating the network performance during training, and testing the performance once the network is fully trained.

Training a neural network refers to the process of feeding a network instances from the allocated training data to learn from. During the training phase, the network's weights are continuously updated in order to minimize the loss function and thereby perform well on the given task. Many aspects of training can be configured by the user to best fit the particular network, dataset and task. For instance, how many data samples are to be used in calculating the gradients for each step during gradient descent, and how many iterations over the entire dataset are to be performed during training. One iteration over the dataset is called one epoch.

It can be useful to track a network's performance on the training data as it is training. A simple way to do this, for a classifier, is tracking the accuracy of the network on the training dataset, i.e. how often the network successfully classifies an event belonging to category A as such. This metric is called the training accuracy. If a network performs well on the given task, the training accuracy generally increases steadily over the course of training. In a similar way, the training loss can be defined as the average loss across the entire dataset after each epoch.

As the available data is passed through the network time and time again during training, a problem is likely to arise for sufficiently complex networks, or if the dataset is limited in size: the training accuracy will continue to increase, but when training is stopped and the network is shown new samples to make predictions on, it might perform a lot worse than would be expected from just the training accuracy. This indicates that the network has developed a bias towards the dataset it has trained on and learned from. This could mean memorizing certain samples or finding patterns in the data that do not translate well to other datasets. To better determine the performance of the network on new data not used for training, a subset of the available data can be reserved for evaluation of the network between epochs. This process is called validation, and the dataset that is reserved for validation inherits the name validation dataset. The accuracy on the validation dataset is referred to as the validation accuracy and is a more unbiased gauge of the network's performance. The validation loss is analogous to the training loss, although of course calculated across the validation data. By tracking the validation accuracy and loss metrics during training, networks can be optimized, e.g. by tweaking hyperparameters, at any point during training to perform well on the validation set.

Some bias is, however, also introduced when tailoring a network to perform well on a validation set, being a specific subset of all available data. It might turn out that some specific hyperparameter configurations that give good performance on the validation dataset do not work as well on a different set of samples. In light of this fact, yet another subset of data is often withheld from training and validation, called a test set. The test set is only presented to the network when it is fully trained and developed, to estimate its performance. This is the final and most unbiased measure of the accuracy of the network that can reasonably be acquired.


4.4 Overfitting

There are plenty of challenges to machine learning that require careful consideration if a successful network is to be developed. Many of the most important factors are linked to the data that the network is set to learn from. There needs to be a sufficient quantity of training data, the training data needs to be of high quality, i.e. not riddled with outliers or huge errors and uncertainties, and the training data needs to be representative of the target data that predictions are to be made from. There are, however, also challenges to overcome which are inherent to the network itself. One such issue is overfitting. Overfitting is a term used to describe the unwanted behavior of networks performing well on training data but poorly on the validation dataset. This hints at the fact that the network is not generalizing well from the training data. Different aspects of the network, the data and the interaction between the two can hamper the network's ability to make useful generalizations that fit new data. Perhaps the most common cause of overfitting in most machine learning applications is a limited amount of training data. In that case, if the network being trained is sufficiently complex to learn from the data, two things are likely to happen: the network might start searching for features and patterns in the data that in actuality are utterly useless for the purpose of making reliable predictions, and, in the case of a classification task where true labels are assigned to the data, the network can simply start memorizing the labels that go with each data point. The former can be illustrated nicely by the following example [6]:

A dataset containing information about different countries alongside a measure of the life satisfaction in each country, on a scale from 1-10, is fed to a neural network in order to make predictions about what factors correlate with a high life satisfaction. Surely, some information about the country could be expected to correlate with a high life satisfaction, like GDP for instance. Other factors most likely do not. Say the life satisfaction in all countries in the training data containing a w in their name is 7 or higher: Norway, Switzerland, Sweden, New Zealand. The network might conclude that having a w in the name is a good predictor for a country's life satisfaction. Not only does this supposed w-rule not make much intuitive sense, it also fails in predicting the life satisfaction for countries like Rwanda or Zimbabwe, which were not in the training dataset and might come to have lower ratings of life satisfaction. The w-rule was adopted by the network by pure chance because the dataset was limited, but in the "eyes" of the network it was a perfectly valid predictor.


5 Method

5.1 Generating Datasets with NuRadioMC

NuRadioMC is a Python-based Monte Carlo framework developed for the purpose of simulating the performance of ultra-high energy neutrino detectors utilizing the radio technique [19]. As is required of an accurate detector-simulating framework, NuRadioMC successfully accounts for the initial νN interaction in ice, the subsequent Askaryan radio emission, the propagation of the signal through the medium to the detector(s), and finally the detector response. The code is versatile, allowing for simulations of various detector hardware and layouts, neutrino properties, interaction specifics, as well as medium parameters.

Fig. 8 illustrates the detector configuration as specified in the simulation of the datasets used for all training and evaluation of neural networks. The components that register Askaryan radio emissions from neutrino interactions are the four downward-facing (blue) LPDA antennas and the single dipole antenna. In a standard xy-coordinate system with the dipole antenna at the origin, the positions of the LPDA antennas are (x, y) = (3, 0), (0, 3), (−3, 0) and (0, −3). The LPDA antennas are placed at a depth of 3 m under the ice surface and the dipole at 15 m.

Detector responses to two types of events were simulated: those resulting from νe charged current (CC) interactions, i.e. events with both electromagnetic and hadronic shower components, and those resulting from any other interaction that only produces a hadronic shower. Initial neutrino energies range from just above 10^17 eV to just below 10^19 eV, although the distribution is skewed heavily towards the upper end of the energy range. This is illustrated in Fig. 10a and 10b. For practical purposes, muon neutrinos were simulated to represent all interactions that lead to hadronic showers only, i.e. tau neutrino interactions and electron neutrino neutral current interactions.

Two sample signals from νµ interactions are shown in Fig. 9a and 9b. The event generation was done with the ARZ model in NuRadioMC, the most accurate modeling of the Askaryan emissions and effects of LPM elongation that can be achieved with the code [21]. In order to produce data that reflects realistic detector responses, NuRadioMC also simulates noise, which is added to the pure radio pulses given by the event generation. Data without added noise was also simulated.

The dimensions of the simulated data reflect the detector setup. Each event consists of five signal traces, one for each antenna, with 512 voltage values each, corresponding to a total trace length of 256 nanoseconds. 40 files of 10^5 events each were generated for each event category, which adds up to a total dataset size of 8 × 10^6 events.
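
In terms of array shapes, the description above corresponds roughly to the following sketch; the variable names are illustrative and the actual NuRadioMC file handling is not shown.

```python
import numpy as np

n_events_per_file = 100_000    # 10^5 events per file, 40 files per event category
n_antennas = 5                 # four LPDA antennas and one dipole
n_samples = 512                # 512 voltage samples, i.e. 256 ns per trace

# One file's worth of events: shape (events, antennas, samples)
events = np.zeros((n_events_per_file, n_antennas, n_samples), dtype=np.float32)

# Binary labels: 1 for nu_e CC events, 0 for all other flavors and interaction modes
labels = np.zeros(n_events_per_file, dtype=np.int8)

print(events.shape)            # (100000, 5, 512)
```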

5.2 Network Training

Training of all neural networks was done on a GPU cluster consisting of three NVIDIA Quadro RTX 6000 GPUs. Access was provided by the university. Metrics including accuracy, validation accuracy, loss and validation loss were tracked during training after each epoch through Weights & Biases (W&B) [27]. W&B is a third-party web-based machine learning toolkit that can be implemented directly into the model code. An example of what the training of a neural network might look like as tracked by W&B can be seen in Fig. 11.

As is the case for all computational resources, only a limited amount of data could be loaded into memory at any given instant. The dimensions of the signal events combined with the large number of events add up to a cumulative dataset that far exceeds the total memory capacity. The feeding of data to a network during training thus had to be optimized to ensure that the entire dataset could be used while not exceeding the memory capacity.


Figure 8: Detector layout as specified in the simulation of datasets using NuRadioMC. The image is a modified version of one found in A. Anker et al., White Paper: ARIANNA-200 high energy neutrino telescope (https://arxiv.org/pdf/2004.09841v1.pdf). Permission of use and modification granted by co-author C. Glaser.


Figure 9: Sample signals of νµ events.



Figure 10: The distribution of the number of events per neutrino energy bin for (a) νe CC events and (b) νµ NC and CC events. The horizontal bars mark the bin sizes. The total number of events in each figure is 4 × 10^6.

The data feeding was optimized by creating Datasets in TensorFlow [?] through the implementation of a custom generator, yielding event data and corresponding labels in batches of 64 events from a shuffled, combined array of one νe CC event file and one νµ event file. To further increase efficiency, prefetching and interleaving were performed using multiple CPU cores in parallel.
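
A sketch of such a pipeline is given below. It is a hypothetical reconstruction of the approach described above (a generator yielding shuffled batches of 64 events built from one νe CC file and one νµ file at a time, with prefetching); load_file and the file names are placeholders, the interleaving over CPU cores is omitted, and none of it is the actual thesis code.

```python
import numpy as np
import tensorflow as tf

# Placeholder loader: stands in for reading one simulated file of shape (events, 5, 512)
def load_file(path, n_events=1000):
    return np.random.normal(size=(n_events, 5, 512)).astype(np.float32)

nue_cc_files = ["nue_cc_00.hdf5", "nue_cc_01.hdf5"]   # illustrative file names
numu_files = ["numu_00.hdf5", "numu_01.hdf5"]

def gen(batch_size=64):
    """Yield shuffled batches of 64 events built from one file of each class at a time."""
    for f_sig, f_bkg in zip(nue_cc_files, numu_files):
        x_sig, x_bkg = load_file(f_sig), load_file(f_bkg)
        x = np.concatenate([x_sig, x_bkg])
        y = np.concatenate([np.ones(len(x_sig)), np.zeros(len(x_bkg))])
        order = np.random.permutation(len(x))            # shuffle the combined array
        for i in range(0, len(x), batch_size):
            idx = order[i:i + batch_size]
            yield x[idx], tf.keras.utils.to_categorical(y[idx], 2)

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(None, 5, 512), dtype=tf.float32),
        tf.TensorSpec(shape=(None, 2), dtype=tf.float32),
    ),
).prefetch(tf.data.AUTOTUNE)     # overlap data preparation (CPU) with training (GPU)
```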

Training of a given network was run on subsets of the available data, or the entirety of the dataset, and proceeded as long as the validation accuracy saw an increase over five epochs. The stopping condition was implemented using the EarlyStopping callback in Keras. If the model's validation accuracy improved from one epoch to the next, the model parameters were saved, overwriting the second-best model parameter values. This ensured that the best model parameters, corresponding to the best validation accuracy, were saved for later evaluation.
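
In Keras terms, the stopping and checkpointing described above correspond roughly to the following callbacks (a sketch under the stated assumptions; the model, the dataset objects and the file name are placeholders).

```python
from tensorflow import keras

callbacks = [
    # Stop when the validation accuracy has not improved for five epochs
    keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5),
    # Overwrite the saved weights whenever the validation accuracy improves,
    # so that the best-performing parameters are kept for later evaluation
    keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_accuracy",
                                    save_best_only=True),
]

# model.fit(train_dataset, validation_data=val_dataset,
#           epochs=100, callbacks=callbacks)
```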

5.3 Network Optimization

The application of deep learning methods, in particular deep convolutional neural networks, to image classification tasks has been thoroughly explored and found to be generally successful in recent years. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has become somewhat of a proving ground for large scale visual recognition models, where teams compete to develop the best performing model for various visual recognition and classification tasks. Results of the ILSVRC are published following each installment, including the competing teams' final network designs and performances. The 2014 installment of the ILSVRC [28] included an image classification task on a large dataset of 1000 categories. The runner-up in the image classification category was the VGG model developed by the Visual Geometry Group of Oxford University [29]. The model's design is relatively easily implemented in Keras, a popular Python-based deep learning library, and for that reason the network developed for the present work was largely inspired by it.

The goal of the present work was to construct a neural network capable of performing the given classification task with high accuracy, i.e. distinguishing the radio pulses resulting from νe CC interactions from the rest as registered by the antennas.


Figure 11: Metrics tracked and plotted with Weights & Biases during training of a sample neural network.

The optimization process in practice consisted of running a series of experiments on the network architecture with the aim of maximizing the validation accuracy.

Experimentation on and development of the network architecture was segmented into two parts: a convolutional part and a fully connected part. In a similar vein to how the VGG models are structured, blocks of convolutional layers followed by a pooling layer were stacked in succession, after which the output from the last convolutional block was batch normalized and flattened in order to be fed to a sequence of fully connected layers. The output layer was set to a two-node dense layer with Softmax [30] activation. For all networks, categorical cross-entropy was used as the loss function, which in the case of just two categories is equivalent to the loss function shown in Eq. 17, and Adam [31] was used as the optimizer.

The structure of the convolutional part of the network is shown schematically in Fig. 12. To investigate the effect of the number of convolutional blocks on accuracy, networks with m = 1, 2, 3, 4 and 5 blocks were trained on 4 × 10^5 events, validated after each epoch on 2 × 10^5 events, and the results compared. Each block was assigned n = 4 convolutional layers with MaxPooling at the end. All convolutional layers used a kernel size of 1 × 5, ReLU activation and same padding. The pooling size was set to 1 × 4 with same padding and strides 1 × 1. Convolutional and pooling kernels were one dimensional to reflect the properties of the 5 × 512 input data. Each row constitutes an individual antenna (five in total), meaning one-dimensional kernels ensure that convolution and pooling operations are performed on the signals registered in the antennas, one antenna at a time. The learning rate was set to 10^-4. As for the fully connected part of the networks, two iterations of a dropout layer with a 0.4 dropout frequency followed by a dense layer with 1024 nodes and ReLU activation were used.
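A minimal Keras sketch of this architecture is given below. The number of filters per convolutional layer is not specified in the text, so the value used here is an assumption, as is the trailing channel axis on the input; parameter counts will therefore not match the tabulated values exactly.

from tensorflow import keras
from tensorflow.keras import layers

def build_model(m=4, n=4, filters=32, dense_nodes=1024, learning_rate=1e-4):
    """VGG-style sketch: m blocks of n convolutional layers plus pooling."""
    inputs = keras.Input(shape=(5, 512, 1))      # five antennas, 512 samples
    x = inputs
    for _ in range(m):                           # m convolutional blocks
        for _ in range(n):                       # n convolutional layers each
            x = layers.Conv2D(filters, kernel_size=(1, 5), padding="same",
                              activation="relu")(x)
        # 1x4 pooling along the time axis only, so antennas stay separate
        x = layers.MaxPooling2D(pool_size=(1, 4), strides=(1, 1),
                                padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)
    for _ in range(2):                           # fully connected part
        x = layers.Dropout(0.4)(x)
        x = layers.Dense(dense_nodes, activation="relu")(x)
    outputs = layers.Dense(2, activation="softmax")(x)

    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model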

The impact of the number of convolutional layers in each convolutional block on accuracy was investigated by varying the number of convolutional layers in every block of an m = 4 convolutional-block network. Networks with n = 1 up to n = 6 were trained. The previously mentioned specifications remained unchanged.

A suitable structure for the fully connected part of the network was also investigated.



Figure 12: Illustration of the convolutional part of the network. Each block contains n convolutional layers and one pooling layer. m blocks are stacked in sequence.

The number of nodes was first varied in a two-layer structure similar to those of the VGG models. Networks with m = 4, n = 5 convolutional structures and N = q × 512 nodes per dense layer were trained, where q = 1/2, 1, 2, 3, 4, 5. Training was performed on 10^6 events to mitigate overfitting, and validation again on 2 × 10^5 events. The effect of adding additional dense layers was also explored by training networks with two, three and four dense layers respectively, all with N = 512 nodes each, on the full 7.6 × 10^6 event dataset. An additional test of n was also done by training networks with n = 4 and n = 5 on the full dataset, where both networks had m = 4 and two 512-node dense layers.

To make further attempts at improving the performance of the final model, the learning rate was reduced by a factor of two when the validation accuracy started to stagnate. The first reduction in learning rate was done manually and the following reductions were performed automatically through the ReduceLROnPlateau callback, where stagnation was defined as no improvement in validation accuracy over four epochs.
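The automatic part of this schedule corresponds to a callback of the following form, assuming validation accuracy is the monitored quantity.

from tensorflow import keras

# Halve the learning rate when the validation accuracy has not improved for
# four consecutive epochs, mirroring the stagnation criterion described above.
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="val_accuracy",
                                              factor=0.5, patience=4, mode="max")
# model.fit(..., callbacks=[reduce_lr, ...])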

To investigate to which degree the size of the training dataset impacts the network performance, an m = n = 4 network with two 512-node dense layers was trained on datasets consisting of 1, 2, 3, 4, 5, 6, 7 and 7.6 million events respectively. The best achieved validation accuracy during each training run was documented.

5.4 Evaluation of the final network

All evaluation of the final architecture was centered around the predict method of the Model class in Keras, by which a network outputs an array of category predictions given some input samples. The test samples used for evaluation were a subset of 200 000 events from the total dataset which had not been fed to the network during training or validation. This test set consisted of two category-separated files of 100 000 events each. The performance of the final network was analyzed in three main ways: confusion matrices were generated to gauge the per-category accuracy of the network in different neutrino energy ranges, the accuracy of the network on the two categories was analyzed as a function of neutrino energy, and finally the accuracy on each category was analyzed as a function of the signal-to-noise ratio (SNR), both across the entire test dataset and in distinct neutrino energy intervals. The confusion matrix was generated by having the network make predictions on all 200 000 events in the test dataset. The energy and SNR dependence analyses were done on events of one category at a time, yielding accuracy results for νe CC events and νµ events separately.
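A sketch of such an evaluation is shown below, assuming the test waveforms, their one-hot labels, the per-event neutrino energies and a trained model are available; all array names, the class ordering and the bin edges are illustrative.

import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate(model, test_data, test_labels, energies,
             log_e_min=17.0, log_e_max=19.0, bin_width=0.125):
    """Overall confusion matrix and energy-binned accuracy for a trained model.

    test_data: waveforms of shape (N, 5, 512, 1), test_labels: one-hot (N, 2),
    energies: per-event neutrino energies in eV from the simulation metadata.
    """
    probs = model.predict(test_data)             # softmax scores, shape (N, 2)
    pred = np.argmax(probs, axis=1)              # assumed: 0 = νe CC, 1 = other
    true = np.argmax(test_labels, axis=1)

    print(confusion_matrix(true, pred))          # 2x2 matrix over all events

    log_e = np.log10(energies)
    edges = np.arange(log_e_min, log_e_max + bin_width, bin_width)
    for lo, hi in zip(edges[:-1], edges[1:]):    # accuracy per energy bin
        sel = (log_e >= lo) & (log_e < hi)
        if sel.any():
            acc = np.mean(pred[sel] == true[sel])
            print(f"log10(E/eV) in [{lo:.3f}, {hi:.3f}): "
                  f"{acc:.1%} over {sel.sum()} events")
    return pred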



6 Results

6.1 Optimization

The results of the network architecture experiments are presented below. The experiment and model specifications, the best achieved validation accuracy, and the number of trainable parameters for each model are tabulated per experiment.

Table 1: Best achieved validation accuracy and the number of trainable parameters for models of varying m, i.e. the number of convolutional blocks as depicted in Fig. 12. All blocks have n = 4 layers and the models were trained on 4 × 10^5 events.

m    Best val. acc. (%)    Trainable parameters
1    63.1                  22 045 058
2    70.0                  11 631 298
3    72.5                  6 675 778
4    71.2                  5 202 498
5    71.5                  9 792 578



Table 2: Best achieved validation accuracy and the number of trainable parameters for the m = 3 and m = 4 models trained on 10^6 events. All blocks have n = 4 convolutional layers.

m    Best val. acc. (%)    Trainable parameters
3    75.2                  6 675 778
4    74.8                  5 202 498

Table 3: Best achieved validation accuracy and the number of trainable parameters for models of varying n, i.e. the number of convolutional layers in each convolutional block. All models have m = 4 blocks and were trained on 4 × 10^5 events.

n    Best val. acc. (%)    Trainable parameters
1    70.5                  3 895 907
2    71.2                  4 331 650
3    71.6                  4 767 330
4    71.2                  5 203 010
5    72.2                  5 638 690
6    70.3                  6 073 858

Table 4: Best achieved validation accuracy and the number of trainable parameters for models with a varying number of nodes in two dense layers. All models in this experiment have m = 4, n = 5 convolutional structures and were trained on 10^6 events.

Nodes per layer    Best val. acc. (%)    Trainable parameters
256                74.1                  2 686 498
512                74.4                  3 538 978
1024               73.8                  5 638 690
1536               74.1                  8 261 666
2048               74.0                  11 409 442
2650               74.2                  15 081 506

Table 5: Best achieved validation accuracy and the number of trainable parameters for models with a varying number of dense layers, all with 512 nodes each. All models in this experiment have m = 4, n = 5 convolutional structures and were trained on 10^6 events.

Layers    Best val. acc. (%)    Trainable parameters
2         74.4                  3 538 978
3         74.3                  3 801 634
4         74.4                  4 064 290



Table 6: Best achieved validation accuracy and the number of trainable parameters for models with a varying number of dense layers, all with 512 nodes each. All models in this experiment have m = 4, n = 5 convolutional structures and were trained on the full 7.6 × 10^6 event dataset.

Layers    Best val. acc. (%)    Trainable parameters
2         78.1                  3 538 978
3         78.0                  3 801 634
4         77.2                  4 064 290

Table 7: Best achieved validation accuracy and the number of trainable parameters for the n = 4 and n = 5 models with m = 4 and two 512-node dense layers, trained on the full 7.6 × 10^6 event dataset.

n    Best val. acc. (%)    Trainable parameters
4    78.2                  3 103 298
5    78.1                  3 538 978

Table 8: Best achieved validation accuracy for the final m = n = 4 model trained on different amounts of data.

Number of events (10^6)    Best val. acc. (%)
1.0                        74.3
2.0                        75.3
3.0                        76.1
4.0                        77.2
5.0                        77.0
6.0                        77.2
7.0                        77.7
7.6                        78.2

Figure 13: Validation accuracy and loss during training of networks with different numbers m of convolutional blocks. Training was done on 4 × 10^5 events and validation on 2 × 10^5 events. The number of convolutional layers in each block is n = 4 for the models plotted.



Figure 14: Training of an m = 3 and an m = 4 network to further study any differences in performance. The networks were trained on 10^6 events with n = 4.

Figure 15: Training of models of varying n, i.e. the number of convolutional layers in each convolutional block.

Figure 16: Training of an m = 3, n = 5 model on 4 × 10^5 events compared to the previously trained m = 4, n = 5 model. The best validation accuracy achieved by the m = 3, n = 5 model was 72.13%, as compared to 72.21% for the m = 4, n = 5 model.



Figure 17: Comparison of the performance of the m = n = 4 and m = 4, n = 5 networks.

Figure 18: Comparison of the performance of m = 4, n = 5 networks with two 512-node and two 2650-node dense layers, respectively.

Figure 19: Training of networks with two, three, and four dense layers with 512 nodes each.



6.2 Final Neural Network Architecture

Layer sequence: I → C1 → C2 → C3 → C4 → B → F → D1 → D2 → O

Figure 20: Final CNN architecture. The different layers are assigned colors and marked with letters: I for the input layer (cyan), Ci for convolutional block i (convolutional layers in white, MaxPooling in orange), B for batch normalization, F for flattening (purple), D for dense layers (gray), and O for the output layer (violet).

6.3 Network Performance





Figure 21: Confusion matrices showing the class-separated performance of the final network in four different energy ranges. The test set consisted of 200 000 events split evenly between the categories. After evaluation on the test set, the predictions were sorted according to neutrino energy and the accuracy in each energy range was retrieved. The main diagonal corresponds to accurate predictions, as can be inferred from the axis labels. The color bar is scaled according to the absolute number of events.




Figure 22: Predictive accuracy (blue) of the model on (a) νe CC events and (b) non-νe CC events, for the neutrino energy intervals marked by the horizontal bars. The accuracy values indicated by the dots are the total accuracy values for all events inside the energy interval, i.e. correct predictions divided by the number of events within the interval. The number of events in each energy bin (orange) is also shown. Predictions were made on test datasets of (a) 10^5 νe CC events and (b) 10^5 non-νe CC events.


Figure 23: Predictive accuracy of the model on (a) νe CC events and (b) νµ events with a signal-to-noise ratio (SNR) within the intervals marked by the horizontal bars. The number of events within each SNR bin is also plotted. The predictions were made on an independent test dataset unused during training and validation. The bin size is ΔSNR = 0.25.




Figure 24: Predictive accuracy of the model on (a) νe CC events and (b) νµ events with a signal-to-noise ratio (SNR) within the intervals marked by the horizontal bars. The number of events within each SNR bin is also plotted. The predictions were made on an independent test dataset unused during training and validation. The bin size is ΔSNR = 0.25.

7 Discussion & Outlook

7.1 Decisions regarding final network architecture

When deciding upon the final neural network architecture, both experimental results in terms of best global performance and behavior during training were considered. The reason for not making architecture decisions based exclusively on the best globally achieved accuracy is that networks of different structures and complexities start overfitting at different points in their training. This can be seen in Fig. 13 by comparing the m = 3 and m = 4 models. Validation accuracy is higher and validation loss lower for the m = 4 model up until the fifth epoch, at which point performance starts to deteriorate due to overfitting. Overfitting can be prevented by introducing more regularization, e.g. a higher dropout frequency, or by increasing the size of the dataset used for training. The latter was done for the two models in question and the results are shown in Fig. 14. It is clear that the performance of the two models is similar once trained on more data, and even though the m = 3 model once again shows a higher global validation accuracy, signs of overfitting can still be seen earlier for the m = 4 model. By observing the training accuracy and loss (the lower two plots in Fig. 14), the m = 4 model can be seen outperforming its counterpart at every stage of training, even before any signs of overfitting. Even though the differences in validation accuracy and loss are minuscule for this amount of training data, and even favor m = 3 by a marginal amount, the m = 4 model is nevertheless opted for, since by extrapolation of the training behavior to larger datasets m = 4 is expected to perform better.

The number of convolutional layers n in each convolutional block was set to n = 5, based on the results tabulated in Table 3. A comparison between m = 4, n = 5 and m = n = 4 shows, however, that m = n = 4 could possibly outperform an m = 4, n = 5 network, since the validation loss of the former is lower in the early stages of training. With the number of convolutional layers set to n = 5, the choice of m = 4 blocks over m = 3 is further supported by the comparison shown in Fig. 16, in which the m = 4, n = 5 model is seen to slightly outperform the m = 3, n = 5 model that would be expected to be the optimal choice based purely on the best global performance results from the experiments.

From the results of the experimentation on nodes per dense layer it can be seen that simply increasing the number of nodes does not necessarily increase the network's performance on the given task. Fig. 18 illustrates the similarity in performance between a 2 × 512-node and a 2 × 2650-node network. It could be argued also here that the earlier overfitting of the 2 × 2650-node network warrants further investigations into its performance on larger datasets. However, the 2 × 512-node network delivers comparable performance with just over one quarter of the number of trainable parameters. Since scaling back model complexity while keeping performance high is generally considered advantageous, a node count of 512 per dense layer was chosen.

As for the number of 512-node dense layers, Table 5 suggests that there is no significant difference between two layers and four layers. Fig. 19 does not paint a clear picture of the situation either, and so the argument resorted to in the end was the same as for choosing the node count: given comparable performance, fewer parameters are preferred.

The performance on validation data of the different network architectures differs only by small amounts in many of the performed experiments. Decisions about the final architecture are based on as little as a tenth or even a hundredth of a percent in some cases. This decision-making process completely overlooks two important points:

The first is that the achieved performance of a certain network may differ significantly depending on the size of the dataset used for training, as can be seen from Table 8. A network that outperformed another after training on a dataset of e.g. 4 × 10^5 events might not do so on larger training datasets. Ideally, then, all experiments should be performed on all available data to gauge performance on large datasets. Additionally, since the final network was trained on the entirety of the data, the grounds for adopting the features found to be favorable in the experiments would be stronger if the experiments were also run on all data.

The second point is that without any knowledge of the training-to-training uncertainty of the networks, no definitive statements can be made about how large the difference in best achieved validation accuracy between two networks needs to be in order for the difference to be truly significant. Of course, this could be addressed by simply training a few models on the same amount of data enough times each to explore the statistical fluctuations in the best achieved validation accuracy for each model. However, since more than just the best achieved validation accuracies from the experiments were taken into consideration (validation loss curves, for instance), and because multiple models were trained on the whole dataset and compared, any final network architecture produced by a more careful decision-making process is not expected to deviate from the final network presented in this work to any large extent.

7.2 Network Performance

The final convolutional neural network was trained on 7.6 × 10^6 events and its performance was evaluated. The confusion matrices presented in Fig. 21a, 21b, 21c and 21d give an overview of the category-separated accuracies of the final network in four different energy ranges. To produce the confusion matrices, the network was evaluated on a test dataset consisting of 10^5 νe CC events and equally many non-νe CC events. It is clear that, across the test dataset as a whole, the model manages to correctly identify non-νe CC events with a higher accuracy than νe CC events. The accuracy on νe CC events can be seen to differ greatly between the considered energy ranges, whereas the accuracy on non-νe CC events is more stable.



Fig. 22a and 22b show how the accuracy of the final network depends on neutrino energy for the two interaction categories. The predictive accuracy on νe CC events shown in Fig. 22a exhibits a clear dependence on neutrino energy across the considered energy range 10^17 < Eν < 10^19 eV, where increasing energy results in better performance. Above 10^18 eV the network performs well: it scores an average accuracy between 70% and 80% across bins of size Δlog10(Eν) = 0.125, meaning νe CC events are classified as such 70%-80% of the time. For neutrino energies below 10^18 eV the performance decreases significantly. An interesting region to consider is where the network scores an accuracy below 50%, at energies below 10^17.5 eV. The network predicts νe CC events to be non-νe CC events a majority of the time at these energies. This suggests that νe CC events at energies below 10^17.5 eV show features that the network has learned to associate with non-νe CC events during training; νe CC events look like non-νe CC events to the network. At energies above 10^18 eV, however, the two signal categories are comparatively much easier to distinguish from one another.

In contrast, the accuracy on non-νe CC events is high across the considered neutrino energy range, varying between 80% and 90%. This suggests that the signal features that the network associates with non-νe CC events are present and discernible to a larger extent regardless of the energy of the neutrino.

The energy dependence of the accuracy on νe CC events is largely attributed to the LPM effect, as described in section 3.4. Elongation of the electromagnetic shower profiles has pronounced effects on the shapes of the radio signals at energies above 10^18 eV, at which point the neural network is able to more reliably differentiate signals stemming from νe CC interactions from those of all other interactions. At lower energies, where the LPM effect is less pronounced, signals from νe CC interactions resemble those of all other interactions more closely, which sheds light on the network's difficulty in distinguishing the two categories.

Evidently, as can be seen by inspecting Fig. 23a, the predictive accuracy on νe CC events also depends on the signal-to-noise ratio of the input signal. The network performs poorly on events with amplitudes close to the noise level, whereas performance is better on events with larger SNR, reaching upwards of 75%-80% in the upper SNR range. On the other hand, the accuracy on events of all other interactions, shown in Fig. 23b, shows only a weak SNR dependence. These results hint at the possibility that the features that the network has learned to associate with νe CC events are harder to make out when the SNR is small, and when said features cannot be distinguished the network is prone to predicting the event to be from a non-νe CC interaction. This could also explain why the accuracy is relatively constant across the entire SNR range for non-νe CC events: the specific νe CC features are never present in these signals and the network predicts the event as non-νe CC regardless of SNR. Drawing conclusions about the specific features that the network has learned to associate with νe CC events requires an investigation of the feature maps produced by the network.

Furthermore, the accuracy on νe CC events is shown to be dependent on SNR in all energy ranges explored, as plotted in Fig. 24a and 24b. The network performs at its best when both the neutrino energy and the SNR are high.

The effect of increasing the size of the training dataset is clear: more data results in better performance. The relationship between data and performance could be quantified by fitting some function to the data points in Table 8, although the type of function to be fit is not entirely clear. A decent fit could conceivably be achieved with a function of the form A e^(-Bx) + C, since the increase in accuracy with more data can be expected to level out at the upper end, as there are other bottlenecks that limit the accuracy, like the LPM effect and the many νe CC events with SNR in the range where the model does not perform as well. However, the accuracy is determined in large part by the energy distribution of the training dataset, as has been shown, which means that any attempt to fit a relationship between accuracy and the amount of training data is likely to be valid only for a certain energy distribution. The only general statement that can be made is that performance increases with the size of the training dataset.
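Purely as an illustration, such a saturating fit to the values in Table 8 could be performed with SciPy as sketched below; given the caveats above, the fitted parameters should not be over-interpreted.

import numpy as np
from scipy.optimize import curve_fit

# Best validation accuracy versus training-set size, taken from Table 8
n_events = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 7.6])          # 10^6 events
accuracy = np.array([74.3, 75.3, 76.1, 77.2, 77.0, 77.2, 77.7, 78.2])  # percent

def saturating(x, A, B, C):
    # Accuracy that approaches the plateau C as the dataset grows (A < 0)
    return A * np.exp(-B * x) + C

params, cov = curve_fit(saturating, n_events, accuracy, p0=(-5.0, 0.5, 79.0))
A, B, C = params
print(f"fitted plateau C = {C:.1f}% (A = {A:.2f}, B = {B:.2f})")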

Seeing as the neutrino energy range considered in this work is quite large, developing multiple networks for different energy ranges might yield better results on flavor reconstruction than was achieved with a single network. However, this approach would require reliable estimates of the neutrino energy in order to know which network to pass the event to for flavor reconstruction. This is in fact a weakness of the developed network on νe CC events as well; without knowledge of the neutrino energy, the certainty of the flavor prediction cannot be stated.

The scope of the presented project has only allowed for investigations into a VGG-inspired deep convolutional neural network architecture for performing the neutrino flavor reconstruction. The deep learning landscape is vast and numerous successful network architectures have been devised, for image recognition tasks in particular. Further exploration of different types of neural networks for the stated reconstruction task could yield improved results. Additionally, a deeper understanding of deep learning and neural networks for image recognition could inform a discussion of why certain network features and properties work better than others for the given task.

8 Conclusions

The presented work constitutes the first ever reconstruction of neutrino flavor from simulated radio detector data. This was achieved through the use of a convolutional neural network. The NuRadioMC simulation framework provided precise simulations of the expected radio signals from neutrino interactions in large quantities, which is crucial for the implementation of neural networks. The deep neural network developed in this project was able to identify νe CC interactions as such, against any other type of neutrino interaction, with an accuracy of 70-80% between 10^18 eV and 10^19 eV. Further, the network was able to correctly identify non-νe CC events against νe CC events with an accuracy of 85-90% across the entire considered neutrino energy range 10^17 < Eν < 10^19 eV. LPM elongation of electromagnetic showers affects the shape of the emitted radio pulses, and the effect becomes increasingly pronounced with increasing energy. Thus, at the lower end of the considered energy range, pulses from νe CC interactions are expected to be more similar to pulses from all other neutrino interactions than at the upper end. This is supported by the finding that predictions on νe CC interactions against all other neutrino interactions are less reliable for neutrino energies below 10^18 eV. At neutrino energies below 10^17.5 eV the accuracy drops below 50%, reaching values as low as around 30%.

The signal-to-noise ratio is also found to affect the accuracy of the network, where a larger SNR results in more accurate predictions for all neutrino energies in the considered energy range. The best performance reached by the network was on events close to 10^19 eV with SNR between 9.5 and 10. On this subset of events the network scored an accuracy of just over 81%.

A vetting of simulated data against experimental data could help to further solidify the results, since their validity hinges on the training data being reflective of real data. The method delivers fast per-event reconstruction on an NVIDIA Quadro RTX 6000 GPU and requires about 43 MB of memory for storage. It is expected that the performance of the method can be improved by the use of other network architectures better suited to the given task. The performance of the proposed architecture is expected to increase further with a larger training dataset, possibly by of the order of one percent, although the exact relationship between training data and performance has not been analyzed.



A Training and Evaluation on Noiseless Data

The final network presented in Fig. 20 was also trained on a dataset of 7.6 × 10^6 noiseless events to explore how well the network is able to distinguish the two event classes based on the pure radio pulses produced by the ARZ model in NuRadioMC. The network was evaluated on 2 × 10^5 noiseless events and the accuracy as a function of neutrino energy was analyzed. As an additional test, the network trained on noiseless data was evaluated on noisy data, meaning flavor predictions were made on noisy data. In this case, the network predicted νe CC events 100% of the time for both classes, regardless of neutrino energy. It is concluded that training a network on noiseless data for the purpose of reconstructing the flavor from noisy signals is not a valid approach.





Figure 25: Confusion matrices showing the class-separated performance of the final network, trained on noiseless data, in four different energy ranges. The test set consisted of 200 000 noiseless events split evenly between the two event classes. After evaluation on the test set, the predictions were sorted according to neutrino energy and the accuracy in each energy range was retrieved. The main diagonal corresponds to accurate predictions, as can be inferred from the axis labels. The color bar is scaled according to the absolute number of events.




Figure 26: Predictive accuracy (blue) of the final network, trained on noiseless data, on (a) νe CC events and (b) non-νe CC events, for the neutrino energy intervals marked by the horizontal bars. The accuracy values indicated by the dots are the total accuracy values for all events inside the energy interval, i.e. correct predictions divided by the number of events within the interval. The number of events in each energy bin (orange) is also shown. Predictions were made on separate test datasets of (a) 10^5 νe CC events and (b) 10^5 non-νe CC events, respectively.



References

[1] IceCube Collaboration, Evidence for High-Energy Extraterrestrial Neutrinos at the IceCube Detector. Science. November 2013;342(6161):947.

[2] M. G. Aartsen, R. Abbasi, Y. Abdou, M. Ackermann, J. Adams, J. A. Aguilar et al., First Observation of PeV-Energy Neutrinos with IceCube. Physical Review Letters. July 2013;111(2):021103-1 - 7.

[3] M. G. Aartsen, R. Abbasi, M. Ackermann, J. Adams, J. A. Aguilar, M. Ahlers et al., IceCube-Gen2: The Window to the Extreme Universe. [website]. New York: Cornell University. [Submitted: 2020-08-10; read 2021-05-17]. Available at: https://arxiv.org/abs/2008.04323

[4] R. Abbasi, M. Ackermann, J. Adams, J. A. Aguilar, M. Ahlers et al., A Convolutional Neural Network based Cascade Reconstruction for the IceCube Neutrino Observatory. [website]. New York: Cornell University. [Submitted: 2021-01-27; read 2021-05-18]. Available at: https://arxiv.org/abs/2101.11589

[5] B. R. Martin, G. Shaw. Particle Physics. 3rd Edition. Chichester, West Sussex, United Kingdom: John Wiley & Sons, Ltd; 2013.

[6] A. Geron. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. 2nd Edition. Sebastopol, California, USA: O'Reilly Media, Inc.; 2019. p. 28-29, 279.

[7] P. A. Zyla et al. [Particle Data Group], PTEP 2020 (2020) no. 8, 083C01. doi:10.1093/ptep/ptaa104

[8] W. Heitler. The Quantum Theory of Radiation. 3rd Edition. Oxford, United Kingdom: Clarendon P.; 1954.

[9] H. Dehghani, S. J. Fatemi, P. Davoudifar, Studying depth of shower maximum using variable interaction length. Astrophysics and Space Science. March 2017;362(4):1-8.

[10] J. Matthews, A Heitler model of extensive air showers. Astroparticle Physics. January 2005;22(5-6):387-397.

[11] J. Alvarez-Muñiz, E. Zas, The LPM effect for EeV hadronic showers in ice: implications for radio detection of neutrinos. Physics Letters B. August 1998;434(3):396-406.

[12] J. Alvarez-Muñiz, W. R. Carvalho Jr., M. Tueros, E. Zas, Coherent Cherenkov radio pulses from hadronic showers up to EeV energies. Astroparticle Physics. October 2011;35(6):287-299.

[13] L. D. Landau, I. J. Pomeranchuk, The limits of applicability of the theory of Bremsstrahlung by electrons and of the creation of pairs at large energies. Dokl. Akad. Nauk SSSR. 1953;92:535-536.

[14] A. B. Migdal, Bremsstrahlung and Pair Production in Condensed Media at High Energies. Physical Review. September 1956;103(6):1811-1820.

[15] P. L. Anthony, R. Becker-Szendy, P. E. Bosted, M. Cavalli-Sforza, L. P. Keller, L. A. Kelley et al., An Accurate Measurement of the Landau-Pomeranchuk-Migdal Effect. Physical Review Letters. September 1995;75(10):1949-1952.



[16] T. Stanev, C. Vankov, R. E. Streitmatter, R. W. Ellsworth, T. Bowen, Development of ultrahigh-energy electromagnetic cascades in water and lead including the Landau-Pomeranchuk-Migdal effect. Physical Review D. March 1982;25(5):1291-1304.

[17] F. Rosenblatt, The Perceptron - a perceiving and recognizing automaton. Cornell Aeronautical Laboratory - Report 85-460-1. January 1957.

[18] J. Alvarez-Muñiz, E. Zas, The LPM effect for EeV hadronic showers in ice: implications for radio detection of neutrinos. Physics Letters B. August 1998;434(3):396-406.

[19] C. Glaser, D. García-Fernández, A. Nelles, J. Alvarez-Muñiz, S. W. Barwick, D. Z. Besson et al., NuRadioMC: simulating the radio emission of neutrinos from interaction to detector. Eur. Phys. J. C. January 2020;80(2):1-35.

[20] G. A. Askaryan, Excess negative charge of an electron-photon shower and its coherent radio emission. JETP. February 1962;14(2):441-3.

[21] J. Alvarez-Muñiz, P. M. Hansen, A. Romero-Wolf, E. Zas, Askaryan radiation from neutrino-induced showers in ice. Physical Review D. April 2020;101(8):083005.

[22] G. A. Askaryan, Coherent radio emission from cosmic showers in air and in dense media. JETP. September 1965;21(3):658-9.

[23] D. Saltzberg, P. W. Gorham, D. Walz, C. Field, R. Iverson, A. Odian et al., Observation of the Askaryan Effect: Coherent Microwave Cherenkov Emission from Charge Asymmetry in High Energy Particle Cascades. Phys. Rev. Lett. March 2001;86(13):2802-5.

[24] P. W. Gorham, S. W. Barwick, J. J. Beatty, D. Z. Besson, W. R. Binns, C. Chen et al., Observations of the Askaryan Effect in Ice. Phys. Rev. Lett. October 2007;99(17):171101.

[25] J. Alvarez-Muñiz, A. Romero-Wolf, E. Zas, Practical and accurate calculations of Askaryan radiation. Physical Review D. November 2011;84(10):103003.

[26] A. Silvestri, S. W. Barwick, J. J. Beatty, D. Z. Besson, W. R. Binns et al., Status of ANITA and ANITA-lite. In: Neutrinos and Explosive Events in the Universe. Dordrecht, Netherlands: Springer Publishing; 2005. p. 297-306.

[27] Biewald, L. Experiment Tracking with Weights and Biases. [website]. [written January 2020; retrieved 2021-06-03]. Available at: https://wandb.ai/site/academic

[28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma et al., ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision. December 2015;115(3):211-252.

[29] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition. Published as a conference paper at ICLR 2015. [website]. New York: Cornell University. [Submitted: 2015-04-10; read 2021-05-16]. Available at: https://arxiv.org/abs/1409.1556v6

[30] Chollet, F. and others. Keras API - Layer activation functions - softmax activation. [website]. [retrieved 2021-06-03]. Available at: https://keras.io/api/layers/activations/softmax-function

[31] Chollet, F. and others. Keras API - Optimizers - Adam. [website]. [retrieved 2021-06-03]. Available at: https://keras.io/api/optimizers/adam
