Application of Data Mining Algorithms in Atmospheric Neutrino Analyses with IceCube

Tim Ruhe, TU Dortmund

Outline

Data mining is more...

Why is IceCube interesting (from a machine learning point of view)?

Data preprocessing and dimensionality reduction

Training and validation of a learning algorithm

Results

Other detector configurations?

Summary & Outlook

Data Mining is more...

[Diagram: Examples (annotated; historical data, simulations) → Learning Algorithm → Model; New data (not annotated) → Application → Information, knowledge → Nobel prize(s)]

Preprocessing

Garbage in/Garbage out

Validation

Why is IceCube interesting from a machine learning point of view?

Huge amount of data

Highly imbalanced distribution of event classes (signal and background)

Huge amount of data to be processed by the learner (Big Data)

Real life problem

Preprocessing (1): Reducing the Data Volume Through Cuts

Background Rejection: 91.4%

Signal Efficiency: 57.1%

BUT: Remaining background is significantly harder to reject!
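The two figures of merit quoted for the cuts can be computed as in the minimal sketch below. It assumes a single hypothetical cut variable and a "keep if value >= threshold" rule; the toy values are made up for illustration and are not IceCube data.

```python
def cut_performance(signal, background, threshold):
    """Return (signal efficiency, background rejection) for the cut value >= threshold."""
    kept_signal = sum(1 for v in signal if v >= threshold)
    kept_background = sum(1 for v in background if v >= threshold)
    efficiency = kept_signal / len(signal)               # fraction of signal that survives
    rejection = 1.0 - kept_background / len(background)  # fraction of background removed
    return efficiency, rejection


# Hypothetical toy values of one cut variable (not taken from the talk).
signal = [0.9, 0.8, 0.4, 0.7, 0.2]
background = [0.1, 0.3, 0.5, 0.2, 0.05, 0.15, 0.6, 0.25, 0.1, 0.35]

eff, rej = cut_performance(signal, background, threshold=0.45)
print(f"signal efficiency = {eff:.1%}, background rejection = {rej:.1%}")
```

Tightening the threshold raises the rejection at the cost of efficiency, which is the trade-off the slide's numbers reflect.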

Preprocessing (2): Variable Selection

Check for missing values: exclude a variable if its fraction of missing values exceeds 30%.

Check for potential bias: exclude everything that is useless, redundant or a source of potential bias.

Check for correlations: exclude everything that has a correlation of 1.0.

2600 variables are reduced to 477 variables, which are then passed to the Automated Feature Selection.
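The manual exclusion rules can be sketched as a small pre-filter. This is an illustration under stated assumptions, not the code used in the analysis: variables are held in a hypothetical dict of lists, `None` marks a missing value, "useless" is approximated by "constant", and the bias check is left out because it needs physics judgment.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if either list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def prefilter(columns, max_missing=0.30, tol=1e-9):
    """Return the names of the variables that survive the three checks."""
    kept = []
    for name, values in columns.items():
        missing = sum(v is None for v in values) / len(values)
        if missing > max_missing:
            continue  # more than 30% missing values
        present = [v for v in values if v is not None]
        if len(set(present)) < 2:
            continue  # constant variable carries no information
        redundant = False
        for other in kept:
            pairs = [(a, b) for a, b in zip(values, columns[other])
                     if a is not None and b is not None]
            r = pearson([p[0] for p in pairs], [p[1] for p in pairs]) if len(pairs) > 1 else None
            if r is not None and r >= 1.0 - tol:
                redundant = True  # correlation of 1.0 with an already kept variable
                break
        if not redundant:
            kept.append(name)
    return kept


# Hypothetical toy variables.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with "a"
    "c": [None, None, 1.0, 2.0],  # 50% missing
    "d": [5.0, 5.0, 5.0, 5.0],    # constant
    "e": [1.0, 0.0, 2.0, 1.0],
}
print(prefilter(columns))
```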

Relevance vs. Redundancy: MRMR (continuous case)

Relevance: Redundancy:

MRMR: or
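The slide's formulas did not survive in this transcript, so the following is only a sketch of the general idea: greedily add the variable that maximizes relevance minus mean redundancy (the difference form). As a stated simplification, the absolute Pearson correlation is used for both terms; continuous MRMR variants commonly use an F-statistic for relevance instead. All names and toy values are hypothetical.

```python
import math


def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return abs(sxy / math.sqrt(sxx * syy))


def mrmr(features, label, k):
    """Greedy MRMR, difference form. features maps name -> list of values."""
    relevance = {name: abs_corr(values, label) for name, values in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, -math.inf
        for name in features:
            if name in selected:
                continue
            if selected:
                redundancy = sum(abs_corr(features[name], features[s])
                                 for s in selected) / len(selected)
            else:
                redundancy = 0.0  # nothing selected yet, so nothing to be redundant with
            score = relevance[name] - redundancy
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Hypothetical toy data: x2 is nearly a copy of x1, x3 carries different information.
label = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "x1": [0, 0, 1, 1, 1, 1, 2, 2],
    "x2": [0, 0, 1, 1, 1, 1, 2, 3],
    "x3": [0, 1, 0, 1, 1, 2, 1, 1],
}
print(mrmr(features, label, k=2))
```

In this toy case a ranking by relevance alone would pick x1 and x2; the redundancy penalty makes the selection prefer x3 as the second variable.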

Feature Selection Stability

Jaccard: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Average over many sets of variables:
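A common choice for this average, assumed here because the slide's formula is not preserved in the transcript, is the mean Jaccard index over all pairs of selected variable sets. The variable names in the example are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two variable sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty selections are identical
    return len(a & b) / len(a | b)


def average_jaccard(selections):
    """Mean Jaccard index over all pairs of selections (needs at least two)."""
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from three repetitions of the feature selection.
runs = [{"v1", "v2", "v3"}, {"v1", "v2", "v4"}, {"v1", "v2", "v3"}]
print(average_jaccard(runs))
```

A value close to 1 means that the feature selection returns nearly the same variables on every repetition.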

Comparing Forward Selection and MRMR

Training and Validation of a Random Forest

Use an ensemble of simple decision trees

Obtain final classification as an average over all trees:

$s = \frac{1}{n_\mathrm{trees}} \sum_{i=1}^{n_\mathrm{trees}} s_i$
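The averaging step can be illustrated with a few hand-written decision stumps. The stumps, the variables `x` and `y`, and the event are hypothetical; a real forest learns its trees from the training data.

```python
def forest_score(trees, event):
    """Average of the individual tree outputs (each 0 or 1) for one event."""
    return sum(tree(event) for tree in trees) / len(trees)


# Hypothetical stumps standing in for trained decision trees (1 = signal-like).
trees = [
    lambda e: 1 if e["x"] > 0.5 else 0,
    lambda e: 1 if e["y"] > 0.2 else 0,
    lambda e: 1 if e["x"] + e["y"] > 1.0 else 0,
    lambda e: 1 if e["y"] > 0.8 else 0,
]

event = {"x": 0.7, "y": 0.4}
score = forest_score(trees, event)
print(score)  # three of the four stumps vote "signal"
```

The score lies between 0 and 1, so a later cut on it selects events on which most trees agree.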

5-fold cross validation to validate the performance of the forest.
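A minimal sketch of how events can be split for a 5-fold cross validation: every event is used for testing exactly once and for training in the other four iterations. The function name, the shuffling and the seed are assumptions for illustration, not details from the talk.

```python
import random


def k_fold_indices(n_events, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_events))
    random.Random(seed).shuffle(indices)      # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(n_events=20, k=5):
    print(len(train), len(test))
```

The spread of the performance over the five iterations gives an estimate of how stable the forest is.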

Random Forest and Cross Validation in Detail (1)

Background muons (CORSIKA, Polygonato): 750,000 in total, 600,000 available for training, 27,000 sampled

Neutrinos (NuGen, $E^{-2}$ spectrum): 70,000 in total, 56,000 available for training, 27,000 sampled

Random Forest and Cross Validation in Detail (2)

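Background muons: 150,000 available for testing

Neutrinos: 14,000 available for testing

Train a forest with 500 trees on the sampled 27,000 background muons and 27,000 neutrinos, then apply it to the events held out for testing

Repeat (x5)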
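The balanced sampling step can be sketched with the event counts quoted on the slides. Drawing without replacement and the fixed seed are assumptions; the transcript does not say how the 27,000 events per class were drawn.

```python
import random


def balanced_training_sample(signal_ids, background_ids, n_per_class, rng):
    """Draw the same number of events from each class, without replacement."""
    return rng.sample(signal_ids, n_per_class), rng.sample(background_ids, n_per_class)


rng = random.Random(42)         # hypothetical seed, for reproducibility only
neutrino_ids = range(56_000)    # neutrinos available for training in one iteration
muon_ids = range(600_000)       # background muons available for training
train_signal, train_background = balanced_training_sample(
    neutrino_ids, muon_ids, n_per_class=27_000, rng=rng)
print(len(train_signal), len(train_background))
```

Training on equal numbers from both classes keeps the forest from being dominated by the far more numerous background events.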

Random Forest Output

We need an additional cut on the output of the Random Forest!

Random Forest Output: Cut at 500 trees

28830 ± 480 expected neutrino candidates

28830 ± 480 expected background muons

Background Rejection: 99.9999%

Signal Efficiency: 18.2%

Estimated Purity: (99.59 ± 0.37)%

Apply to experimental data. This yields 27,771 neutrino candidates.

Unfolding the spectrum

TRUEE

This is no Data Mining...

...but it ain't magic either

Moving on... IC79

212 neutrino candidates per day

66885 neutrino candidates in total

330 ± 200 background muons

Entire analysis chain can be applied on other detector configurations

...with minor changes (e.g. ice model)
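As a rough cross-check of the IC79 numbers, the implied livetime and purity follow from simple arithmetic. This sketch treats the candidate count as exact and the quoted background muons as the only contamination, which is an assumption and not necessarily how the purity was estimated in the analysis.

```python
n_candidates = 66885      # neutrino candidates in total
rate_per_day = 212        # neutrino candidates per day
n_background = 330        # expected background muons
sigma_background = 200    # uncertainty on the background expectation

livetime_days = n_candidates / rate_per_day
purity = 1 - n_background / n_candidates
sigma_purity = sigma_background / n_candidates  # propagates only the background error

print(f"implied livetime: {livetime_days:.1f} days")
print(f"purity: ({purity:.2%} ± {sigma_purity:.2%})")
```

Under these assumptions the purity comes out at about 99.5%, in line with the purities above 99% quoted in the summary.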

Summary and Outlook

99.9999% Background Rejection

Purities above 99% are routinely achieved

Future Improvements???

By starting at an earlier analysis level...

MRMR, Random Forest

Backup Slides

RapidMiner in a Nutshell

Developed at the Department of Computer Science at TU Dortmund (YALE)

Operator based, written in Java

It used to be open source

Many, many plugins due to a rather active community

One of the most widely used data mining tools

What I like about it

Data flow is nicely visualized and can be easily followed and comprehended

Rather easy to learn, even without programming experience

Large community (updates, bugfixes, plugins)

Professional tool (they actually make money with that!)

Good support

Many tutorials can be found online, even special ones

Most operators work like a charm

Extendable

Relevance vs. Redundancy: MRMR (discrete case)

Relevance: Redundancy:

MRMR: or

Mutual Information
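The formulas on this slide are not preserved in the transcript. For orientation, the widely used mutual-information formulation of MRMR is given below; it is assumed, not confirmed, that the slide showed these expressions. Here $S$ is the set of selected variables, $y$ the class label, and the symbols $D$ and $R$ are chosen for this sketch.

```latex
\begin{align}
  I(x; y) &= \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
    && \text{(mutual information)} \\
  D(S, y) &= \frac{1}{|S|} \sum_{x_i \in S} I(x_i; y)
    && \text{(relevance)} \\
  R(S) &= \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j)
    && \text{(redundancy)} \\
  \text{MRMR:} \quad & \max_S \left[ D(S, y) - R(S) \right]
    \quad \text{or} \quad \max_S \frac{D(S, y)}{R(S)}
\end{align}
```

The two MRMR criteria correspond to the "or" on the slide: relevance and redundancy are combined either as a difference or as a quotient.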

Feature Selection Stability

Jaccard: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Kuncheva: $I_C(A, B) = \frac{r n - k^2}{k (n - k)}$, with $r = |A \cap B|$ and $k = |A| = |B|$
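Kuncheva's index can be sketched as follows, taking n as the total number of available variables (the usual definition, assumed here) and requiring both selections to have the same size k. The variable names in the example are hypothetical.

```python
def kuncheva_index(a, b, n):
    """Kuncheva's consistency index for two equally sized selections out of n variables."""
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k:
        raise ValueError("both selections must contain the same number of variables")
    if not 0 < k < n:
        raise ValueError("selection size k must satisfy 0 < k < n")
    r = len(a & b)  # number of variables the two selections share
    return (r * n - k * k) / (k * (n - k))


print(kuncheva_index({"v1", "v2", "v3"}, {"v1", "v2", "v4"}, n=10))
```

Unlike the Jaccard index, this measure corrects for the overlap expected by chance, so two random selections score close to 0 and identical selections score 1.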

Ensemble methods

With Weight (e.g. Boosting)

Without Weight (e.g. Random Forest)

Random Forest: What is randomized?

Randomness 1: Events the tree is trained on (bagging)

Randomness 2: Variables that are available for a split
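The two sources of randomness can be sketched in a few lines. The function names, the number of events and the variable names are hypothetical; real implementations embed both steps in the tree-growing procedure.

```python
import random


def bootstrap_sample(n_events, rng):
    """Randomness 1 (bagging): draw n_events event indices with replacement."""
    return [rng.randrange(n_events) for _ in range(n_events)]


def candidate_variables(variables, n_candidates, rng):
    """Randomness 2: random subset of variables offered to a single split."""
    return rng.sample(variables, n_candidates)


rng = random.Random(1)                      # hypothetical seed
variables = ["v1", "v2", "v3", "v4", "v5", "v6"]

events_for_tree = bootstrap_sample(10, rng)            # some events repeat, some are left out
split_candidates = candidate_variables(variables, 2, rng)
print(events_for_tree, split_candidates)
```

Both steps decorrelate the individual trees, which is what makes averaging over them worthwhile.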

Are we actually better than simpler methods?