Multivariate Analysis with TMVA
Basic Concept and New Features
Helge Voss(*) (MPI–K, Heidelberg)
LHCb MVA Workshop, CERN, July 10, 2009
(*) On behalf of the author team: A. Hoecker, P. Speckmayer, J. Stelzer, J. Therhaag, E. v. Toerne, H. Voss
And the contributors: See acknowledgments on page xx
On the web: http://tmva.sf.net/ (home), https://twiki.cern.ch/twiki/bin/view/TMVA/WebHome (tutorial)
Event Classification

Suppose a data sample containing two types of events, with class labels Signal and Background. (We restrict ourselves here to the two-class case; many classifiers can in principle be extended to several classes, and otherwise analyses can be staged.)

We have discriminating variables x1, x2, … How do we set the decision boundary to select events of type S?

Rectangular cuts? A linear boundary? A nonlinear one?

[Figure: three scatter plots of signal (S) and background (B) in the (x1, x2) plane, separated by rectangular cuts, a linear boundary, and a nonlinear boundary]

How can we decide? And once we have decided on a class of boundaries, how do we find the "optimal" one?
Regression In High Energy Physics
Most HEP event reconstruction tasks require some sort of "function estimate" for which we have no analytical expression:

energy deposit in the calorimeter: shower profile → calibration
entry location of the particle in the calorimeter
distance between two overlapping photons in the calorimeter
… you can certainly think of more.

Maybe you can even imagine some final physics parameter that needs to be fitted to your event data, where you would rather use a non-analytic fit function from Monte Carlo events than an analytic parametrisation of the theory:

the reweighting method used in the LEP W-mass analysis …

Maybe your application can be successfully learned by a machine using Monte Carlo events?? (somewhat wild guessing; it might not be reasonable to do)
Regression
[Figure: three panels of f(x) versus x, illustrating a constant, a linear, and a nonlinear functional dependence]

Constant? Linear? Nonlinear? How do we estimate the "functional behaviour" from a given set of known measurements?

Assume, for example, D variables that somehow characterise the shower in your calorimeter. From Monte Carlo or testbeam measurements you have a data sample in which all D variables x are measured together with the corresponding particle energy.

The particle energy as a function of the D observables x is a surface in the (D+1)-dimensional space spanned by the D observables plus the energy.

Seems trivial?? Maybe … the human eye, and the brain behind it, have very good pattern-recognition capabilities! But what if you have more variables?
Event Classification: Why MVA?

• What do you think: is this event more "likely" signal or background?
• Assume four uncorrelated Gaussian-distributed input variables:
Event Classification
Each event, whether Signal or Background, has D measured variables that span a D-dimensional "feature space".

Find a mapping y(x) from the D-dimensional input/observable/"feature" space to a one-dimensional output of class labels.

Most general form: y = y(x) with x = {x1, …, xD} the input variables, i.e. y(x): R^D → R.

Then histogram the resulting y(x) values for signal and background.

Who can see what this would look like for the regression problem?
Event Classification

Each event, whether Signal or Background, has D measured variables spanning the "feature space". Again we seek a mapping y(x): R^D → R from the D-dimensional input/observable/"feature" space to a one-dimensional output of class labels, chosen such that y(B) → 0 and y(S) → 1.

y(x) is a "test statistic" in the D-dimensional space of input variables: y(x) = const defines a surface, the decision boundary, and y(x) is used to set the selection cut:

y(x) > cut: signal
y(x) = cut: decision boundary
y(x) < cut: background

The distributions of y(x) for signal and background, PDF_S(y) and PDF_B(y), characterise the classifier: their overlap determines the separation power, and the chosen cut fixes the efficiency and purity.
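A remark worth adding at this point (standard statistics, implicit in the slide rather than written on it): by the Neyman-Pearson lemma, the likelihood ratio

$$ y(\mathbf{x}) \;=\; \frac{P(\mathbf{x}\,|\,S)}{P(\mathbf{x}\,|\,B)} $$

is the optimal test statistic for the two-class problem; every classifier can be viewed as an attempt to approximate a monotone function of it.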
MVA and Machine Learning
The previous slide described the basic idea of "Multivariate Analysis" (MVA). (Remark: what about "standard cuts"? They fit the same picture, as one particular, rectangular, decision boundary.)

Finding y(x): R^D → R is "basically" the same task for regression and classification, given a certain model class for y(x):

in an automatic way, using "known" or "previously solved" events, i.e. learning from known "patterns"
such that the resulting y(x) has good generalisation properties when applied to "unknown" events

That is what the "machine" is supposed to be doing: supervised machine learning.

Of course there is no magic; we still need to:

choose the discriminating variables
choose the class of models (linear, nonlinear, flexible or less flexible)
tune the "learning parameters": the bias vs. variance trade-off
check the generalisation properties
consider the trade-off between statistical and systematic uncertainties
Overview/Outline
Idea: rather than just implementing new MVA techniques and making them available in ROOT (as e.g. TMultiLayerPerceptron does), provide:

one common platform/interface for all MVA classification and regression algorithms
easy use of, and switching between, classifiers
(common) data pre-processing capabilities
training and testing of all classifiers on the same data sample, with consistent evaluation
a common analysis, application, and evaluation framework (ROOT macros, C++ executables, python scripts, …)
application of the trained classifier/regressor using libTMVA.so and the corresponding "weight files", or standalone C++ code (the latter is currently disabled in TMVA 4.0.1)
Outline
The TMVA project
Survey of the available classifiers/regressors
Toy examples
General considerations and systematics
TMVA Development and Distribution
TMVA is a sourceforge (SF) package for world-wide access
Home page ………………. http://tmva.sf.net/
SF project page …………. http://sf.net/projects/tmva
View SVN ………………… http://tmva.svn.sourceforge.net/viewvc/tmva/trunc/TMVA/
Mailing list .……………….. http://sf.net/mail/?group_id=152074
Tutorial TWiki ……………. https://twiki.cern.ch/twiki/bin/view/TMVA/WebHome
Active project
Currently 6 core developers, and 42 registered contributors at SF
>4864 downloads since March 2006 (not counting CVS/SVN checkouts and ROOT users)
current version: 4.0.1 (29 June 2009)
we try to respond quickly to user requests, bug fixes, etc.
Integrated and distributed with ROOT since ROOT v5.11/03 (the newest version, 4.0.1, ships with ROOT production release 5.23)
New (and future) TMVA Developments
MV-Regression
Code re-design allowing for:
boosting any classifier (not only decision trees)
combination of any pre-processing algorithms
classifier training in multi-regions
multi-class classification (will slowly creep into some classifiers…)
cross-validation
New MVA-Technique
PDE-Foam (classification + regression)
Data preprocessing
Gauss-transformation
arbitrary combination of preprocessing
Numerous small add-ons etc..
The TMVA Classifiers (Regressors)

Currently implemented classifiers:

Rectangular cut optimisation
Projective (1D) likelihood estimator
Multidimensional likelihood estimators (PDE range search, PDE-Foam, k-Nearest Neighbour) (also as regressors)
Linear discriminant analysis (Fisher, H-Matrix, general linear (also as regressor))
Function discriminant (also as regressor)
Artificial neural networks (3 multilayer perceptron implementations; MLP also as regressor, with multi-dimensional target)
Boosted/bagged/randomised decision trees (also as regressor)
Rule ensemble fitting
Support vector machine
(Plugin mechanism for any other MVA classifier, e.g. NeuroBayes)

Currently implemented data preprocessing:

Linear decorrelation
Principal component analysis
Gauss transformation
Example for Data Preprocessing: Decorrelation

Removal of linear correlations by rotating the input variables:

using the "square root" of the correlation matrix
using Principal Component Analysis

Commonly realised for all classifiers in TMVA (centrally, in the DataSet class).

[Figure: scatter plots of the original variables, after SQRT decorrelation, and after PCA decorrelation]

Note that decorrelation is only complete if the correlations are linear and the input variables are Gaussian distributed.
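For reference, the "square root" transformation in matrix notation (standard linear algebra rather than a TMVA-specific formula): with $C$ the covariance matrix of the input variables, diagonalise with an orthogonal matrix $U$ and take the symmetric square root,

$$ C = U D U^{\mathsf T}, \qquad C^{1/2} = U D^{1/2} U^{\mathsf T}, \qquad \mathbf{x} \;\mapsto\; \big(C^{1/2}\big)^{-1}\,\mathbf{x}\,, $$

after which the transformed variables have unit covariance, i.e. no linear correlations.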
“Gaussian-isation”
Improve the decorrelation by pre-Gaussianisation of the variables.

First: transform each variable to achieve a uniform (flat) distribution:

$$ x_k^{\text{flat}}(i_{\text{event}}) \;=\; \int_{-\infty}^{\,x_k(i_{\text{event}})} p_k(x_k')\,\mathrm{d}x_k' \;, \qquad \forall\,\text{variables } k $$

where $x_k^{\text{flat}}$ is the transformed variable, $x_k$ the measured value, and $p_k$ the PDF of variable k.

Second: make it Gaussian via the inverse error function:

$$ x_k^{\text{Gauss}}(i_{\text{event}}) \;=\; \sqrt{2}\;\mathrm{erf}^{-1}\!\big( 2\, x_k^{\text{flat}}(i_{\text{event}}) - 1 \big) \;, \qquad \mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,\mathrm{d}t $$

The integral can be solved in an unbinned way by event counting, or by creating non-parametric PDFs (see the likelihood section later).

Third: decorrelate (and "iterate" this procedure).
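A minimal sketch of the two steps in ROOT-flavoured C++ (illustrative only, not TMVA's internal implementation; here the flattening integral is done by event counting on a sorted copy of the sample):

```cpp
#include <algorithm>
#include <vector>
#include "TMath.h"

// Gaussianise one variable: empirical-CDF flattening, then inverse error function.
std::vector<double> gaussianise(const std::vector<double>& x) {
  std::vector<double> sorted(x);
  std::sort(sorted.begin(), sorted.end());
  const double n = sorted.size();
  std::vector<double> out(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    // Step 1: flat transform, x_flat in (0,1), via event counting
    const double rank = std::lower_bound(sorted.begin(), sorted.end(), x[i]) - sorted.begin();
    const double flat = (rank + 0.5) / n;
    // Step 2: x_gauss = sqrt(2) * erf^-1(2 * x_flat - 1)
    out[i] = TMath::Sqrt2() * TMath::ErfInverse(2.0 * flat - 1.0);
  }
  return out;
}
```

Applied once per variable (and per class); after decorrelating the result, the procedure can be iterated as stated above.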
"Gaussian-isation"

[Figure: variable distributions, shown original, signal-Gaussianised, and background-Gaussianised]

We cannot simultaneously Gaussianise both signal and background ?!
Pre-processing
The decorrelation/Gauss transform can be determined from either:

Signal + Background together
Signal only
Background only

For all but the "Likelihood" methods (1D or multi-dimensional) we need to decide on one of these (TMVA default: signal + background).

For "Likelihood" methods, where we test the "Signal" hypothesis against the "Background" hypothesis, BOTH are used.
Rectangular Cut Optimisation
Simplest method: individual cuts on each variable

cuts out a single rectangular volume from the variable space

Technical challenge: how to find the optimal cuts? MINUIT fails here because the solution space is non-unique.

TMVA uses: Monte Carlo sampling, Genetic Algorithm, Simulated Annealing (a booking sketch follows below)

Huge speed improvement of the volume search by sorting events into a binary tree

Cuts usually benefit from prior decorrelation of the cut variables
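A minimal booking sketch, assuming a TMVA `factory` object as created in the code-flow example later in these slides (the option values are illustrative, not a recommendation):

```cpp
// Book rectangular cut optimisation fitted with the Genetic Algorithm;
// FitMethod may also be MC (Monte Carlo sampling) or SA (simulated annealing).
factory.BookMethod(TMVA::Types::kCuts, "CutsGA",
                   "FitMethod=GA:EffMethod=EffSel:VarProp=FSmart");
```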
Projective (1D) Likelihood Estimator (PDE Approach)
Probability density estimators for each input variable are combined into a likelihood estimator (ignoring correlations); the combined response is given below.

This is the optimal approach if the correlations are zero (or removed by linear decorrelation). Otherwise: significant performance loss.

Technical challenge: how to estimate the PDF shapes? Three ways:

parametric fitting (function): difficult to automate for arbitrary PDFs
event counting: automatic and unbiased, but suboptimal
nonparametric fitting: easy to automate, but can create artefacts or suppress information

TMVA uses binned shape interpolation with spline functions, or unbinned adaptive Gaussian kernel density estimation, and performs an automatic goodness-of-fit validation.
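Written out, the combined likelihood response for event $i$ (in the form used in the TMVA Users Guide) is the ratio

$$ y_{\mathcal L}(i) \;=\; \frac{\prod_{k=1}^{D} p_k^{S}\big(x_k(i)\big)}{\prod_{k=1}^{D} p_k^{S}\big(x_k(i)\big) \;+\; \prod_{k=1}^{D} p_k^{B}\big(x_k(i)\big)} $$

where $p_k^{S}$ and $p_k^{B}$ are the per-variable signal and background PDFs.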
Multidimensional PDE Approach
Use a single PDF per event class (signal, background), which spans Nvar dimensions.

Two different implementations of the same idea:

PDE Range-Search: count the number of signal and background events in the "vicinity" of the test event; a preset or adaptive volume V defines the "vicinity" (Carli-Koblitz, NIM A501, 576 (2003)).

[Figure: H0 and H1 event classes in the (x1, x2) plane with a box volume around the test event; in the sketched example $y_{\text{PDERS}}(i_{\text{event}}, V) = 0.86$]

Improve on simple event counting by using kernel functions to penalise large distances from the test event. Counting speed is increased with a binary search tree.

Genuine k-NN algorithm (author R. Ospanov): an intrinsically adaptive approach, with a very fast search thanks to kd-tree event sorting.

Regression: take as function value the distance-weighted average over the "volume".

PDE-Foam: self-adaptive phase-space binning that divides the space into hypercubes serving as "volumes".
[Figure: four (x1, x2) scatter plots of the H0 and H1 event classes, illustrating different projection axes]
Fisher’s Linear Discriminant Analysis (LDA)
LDA determines an axis in the input-variable hyperspace such that projecting events onto this axis pushes signal and background as far apart as possible.

The classifier response couldn't be simpler:

$$ y_{\text{Fi}}(i_{\text{event}}) \;=\; F_0 \;+\; \sum_{k=1}^{N_{\text{var}}} F_k\, x_k(i_{\text{event}}) $$

where the $F_k$ are the "Fisher coefficients" defining the projection axis.

A well-known, simple, and elegant classifier.
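For completeness (the textbook LDA result, stated here rather than on the slide): with $W$ the sum of the signal and background covariance matrices, and $\vec\mu_S$, $\vec\mu_B$ the class means, the projection axis is

$$ \vec F \;\propto\; W^{-1}\big(\vec\mu_S - \vec\mu_B\big)\,. $$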
Function discriminant analysis (FDA)
Fit any user-defined function of the input variables, requiring that signal events return 1 and background events return 0.

Parameter fitting: Genetic Algorithm, MINUIT, MC, and combinations thereof.

Easy reproduction of the Fisher result, but nonlinearities can be added. A very transparent discriminator.
Nonlinear Analysis: Artificial Neural Networks
Achieve a nonlinear classifier response by "activating" the output of each node with a nonlinear function:

$$ A(x) = \frac{1}{1 + e^{-x}} \qquad \text{(the "activation" function)} $$

Feed-forward Multilayer Perceptron: one input layer (the Nvar discriminating input variables), k hidden layers of sizes M1, …, Mk, and one output layer (2 output classes, signal and background):

$$ x_j^{(k)} \;=\; A\!\left( w_{0j}^{(k)} + \sum_{i=1}^{M_{k-1}} w_{ij}^{(k)}\, x_i^{(k-1)} \right), \qquad x_{i=1..N_{\text{var}}}^{(0)} = \text{inputs}, \quad x_{1,2}^{(k+1)} = \text{outputs} $$

Three different implementations in TMVA (all are multilayer perceptrons; a forward-pass sketch follows below):

MLP: TMVA's own MLP implementation, for increased speed and flexibility
TMlpANN: interface to ROOT's MLP implementation
CFMlpANN: ALEPH's Higgs-search ANN, translated from FORTRAN

Regression: uses the real-valued output of ONE node.
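A compact sketch of the forward pass defined by the formula above, in plain C++ (an illustrative helper, not TMVA internals; layer sizes and weights are whatever training produced):

```cpp
#include <cmath>
#include <vector>

// Sigmoid activation A(x) = 1 / (1 + exp(-x))
double activate(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One layer of a feed-forward MLP: out_j = A( w[j][0] + sum_i w[j][i+1] * in_i )
std::vector<double> layer(const std::vector<double>& in,
                          const std::vector<std::vector<double>>& w) {
  std::vector<double> out;
  for (const auto& wj : w) {            // wj[0] is the bias weight w_0j
    double s = wj[0];
    for (size_t i = 0; i < in.size(); ++i) s += wj[i + 1] * in[i];
    out.push_back(activate(s));
  }
  return out;                           // feed into the next layer, or read off y(x)
}
```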
Boosted Decision Trees (BDT)

DT: sequential application of cuts splits the data into nodes, where the final nodes (leaves) classify an event as signal or background.

BDT: combine a forest of DTs, with differently weighted events in each tree (the trees themselves can also be weighted):

e.g. "AdaBoost": incorrectly classified events receive a larger weight in the next DT
GradientBoost
"Bagging": random Poisson event weights (resampling with replacement)
"Randomised Forest": like bagging, plus the use of random subsets of the variables

Bottom-up "pruning" of a decision tree: remove statistically insignificant nodes to reduce overtraining of the tree.

Boosting/bagging/randomising create a set of "basis functions": the final classifier is a linear combination of these, which improves stability.
[Figure: the same decision tree before and after pruning]

Randomised forests don't need pruning and are VERY ROBUST. Some people even claim that boosted decision trees in general don't need to be pruned but, in our experience, this is not always true.

Regression: take as function value the average of the training events in the leaf node.
Learning with Rule Ensembles
[Illustration: one of the elementary cellular automaton rules (Wolfram 1983, 2002). It specifies the next colour of a cell depending on its own colour and those of its immediate neighbours; the rule outcomes are encoded in the binary representation 30 = 00011110₂.]

Following the RuleFit approach by Friedman-Popescu (Friedman-Popescu, Tech. Rep., Stat. Dept., Stanford U., 2003).

The model is a linear combination of rules, where a rule is a sequence of cuts:

$$ y_{\text{RF}}(\mathbf{x}) \;=\; a_0 \;+\; \underbrace{\sum_{m=1}^{M_R} a_m\, r_m(\hat{\mathbf{x}})}_{\text{sum of rules}} \;+\; \underbrace{\sum_{k=1}^{n} b_k\, \hat{x}_k}_{\text{linear Fisher term}} $$

where the $r_m$ are rules (the cut sequence gives $r_m = 1$ if all cuts are satisfied, $= 0$ otherwise), the $\hat{x}_k$ are the normalised discriminating event variables, and $y_{\text{RF}}$ is the RuleFit classifier response.

The problem to solve is:

create the rule ensemble: use a forest of decision trees
fit the coefficients $a_m$, $b_k$: gradient direct regularization, minimising the risk (Friedman et al.)
pruning removes "topologically equal" rules (same variables in the cut sequence)
Boosting
[Diagram: the training sample yields classifier C(0)(x); the sample is re-weighted and classifier C(1)(x) is trained on the weighted sample; re-weighting and training repeat for C(2)(x), C(3)(x), …, C(m)(x)]

$$ y(\mathbf{x}) \;=\; \sum_{i}^{N_{\text{Classifier}}} w_i\, C^{(i)}(\mathbf{x}) $$
Adaptive Boosting (AdaBoost)
[Diagram: the same boosting chain; training sample → C(0)(x), then successively re-weighted samples → C(1)(x), C(2)(x), C(3)(x), …, C(m)(x)]

AdaBoost re-weights events misclassified by the previous classifier by the factor

$$ \frac{1 - f_{\text{err}}}{f_{\text{err}}}\;, \qquad \text{with } f_{\text{err}} = \frac{\text{misclassified events}}{\text{all events}} $$

AdaBoost also weights the classifiers themselves, using the error rate of each individual classifier:

$$ y(\mathbf{x}) \;=\; \sum_{i}^{N_{\text{Classifier}}} \log\!\left( \frac{1 - f_{\text{err}}^{(i)}}{f_{\text{err}}^{(i)}} \right) C^{(i)}(\mathbf{x}) $$
AdaBoost: A simple demonstration
The example (somewhat artificial, but nice for a demonstration):

data sample with three "bumps"
weak classifier, i.e. one single simple "cut" ↔ a decision-tree stump

[Diagram: a decision-tree stump sending var(i) > x to one leaf (S) and var(i) <= x to the other (B)]

Two reasonable cuts: a) Var0 > 0.5 (εsignal = 66%, εbkg ≈ 0%, 16.5% of all events misclassified), or b) Var0 < -0.5 (εsignal = 33%, εbkg ≈ 0%, 33% of all events misclassified).

The training of a single decision-tree stump will find "cut a)".

[Figure: the three-bump distribution with cuts a) and b) marked]
AdaBoost: A simple demonstration
The first "tree", choosing cut a), will give an error fraction err = 0.165.

Before building the next "tree", the wrongly classified training events are weighted by (1 - err)/err ≈ 5, so the next "tree" essentially sees the following re-weighted data sample:

[Figure: the re-weighted distribution]

… and hence will choose "cut b)": Var0 < -0.5.

The combined classifier, Tree1 + Tree2: the (weighted) average of the responses of both trees to a test event separates signal from background as well as one would expect from the most powerful classifier.
AdaBoost: A simple demonstration
[Figure: classifier response with only 1 tree "stump" versus only 2 tree "stumps" combined with AdaBoost]
Boosting Fisher Discriminant
• Fisher discriminant: y(x) = a linear combination of the input variables.
• A linear combination of Fisher discriminants is again a Fisher discriminant; hence "simple boosting doesn't help".
• However: if one applies the CUT on the Fisher discriminant in each step (or any other nonlinear transformation), then it works!
Support Vector Machine
There are methods that create linear decision boundaries using only measures of distances; this leads to a quadratic optimisation process.

• Using variable transformations, i.e. moving into a higher-dimensional space (e.g. using var1*var1 in the Fisher discriminant), allows in principle nonlinear problems to be separated with linear decision boundaries.
• The decision boundary in the end is defined only by the training events that are closest to it → Support Vector Machine.
Support Vector Machines
Separable data:

[Figure: signal and background points in the (x1, x2) plane, the optimal hyperplane, its margin, and the support vectors]

Find the hyperplane that best separates signal from background. The best separation has the maximum distance (margin) between the hyperplane and the closest events (the support vectors). This gives a linear decision boundary.

The solution of largest margin depends only on the inner products of the support vectors (distances): a quadratic minimisation problem.

Non-separable data:

If the data are non-separable, add a misclassification cost parameter $C\cdot\sum_i \xi_i$ to the minimisation function.
Support Vector Machines
Non-linear cases:

Transform the variables into a higher-dimensional feature space where again a linear boundary (hyperplane) can separate the data.

[Figure: non-separable data in (x1, x2), mapped via Ф(x1, x2) into a feature space where a hyperplane separates the classes]

The explicit transformation need not be specified. Only the "scalar product" (inner product) is needed: x·x → Ф(x)·Ф(x).

Certain kernel functions can be interpreted as scalar products between transformed vectors in the higher-dimensional feature space, e.g. Gaussian, polynomial, and sigmoid kernels.

Choose a kernel and fit the hyperplane using the linear techniques developed above.

The kernel size parameter typically needs careful tuning! (Overtraining!) A booking sketch follows below.
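A minimal booking sketch (option names as in the TMVA Users Guide of this era; the values are illustrative, and are exactly the parameters that "need careful tuning"):

```cpp
// Book an SVM with a Gaussian kernel: Gamma steers the kernel width,
// C the misclassification cost for non-separable data.
factory.BookMethod(TMVA::Types::kSVM, "SVM", "Gamma=0.25:C=1:Tol=0.001");
```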
Data Preparation and Preprocessing
Data input format: ROOT TTree or ASCII
Supports selection of any subset, combination, or function of the available variables
Supports application of pre-selection cuts (possibly independent for signal and background)
Supports global event weights for signal or background input files
Supports use of any input variable as an individual event weight
Supports various methods for splitting into training and test samples:

Block-wise
Randomly
Periodically (e.g. 3 testing events, 2 training events, 3 testing events, 2 training events, …)
User-defined separate training and testing trees

Preprocessing of input variables (e.g. decorrelation)
Code Flow for Training and Application Phases
Scripts can be ROOT scripts, C++ executables, or python scripts (via PyROOT); a minimal example follows below.
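A minimal training macro in the Factory-based API of TMVA 4 (variable names, trees, and option strings are illustrative; consult the Users Guide for the full option set):

```cpp
#include "TFile.h"
#include "TTree.h"
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void trainExample(TTree* sigTree, TTree* bkgTree) {
  TFile* outFile = TFile::Open("TMVA.root", "RECREATE");
  TMVA::Factory factory("TMVAClassification", outFile, "!V:AnalysisType=Classification");

  factory.AddVariable("var1", 'F');               // declare the input variables
  factory.AddVariable("var2", 'F');
  factory.AddSignalTree(sigTree, 1.0);            // global event weights
  factory.AddBackgroundTree(bkgTree, 1.0);
  factory.PrepareTrainingAndTestTree("", "SplitMode=Random:NormMode=NumEvents");

  factory.BookMethod(TMVA::Types::kLikelihood, "Likelihood", "");
  factory.BookMethod(TMVA::Types::kMLP, "MLP", "HiddenLayers=N+1");
  factory.BookMethod(TMVA::Types::kBDT, "BDT", "NTrees=400:BoostType=AdaBoost");

  factory.TrainAllMethods();                      // writes the "weight files"
  factory.TestAllMethods();
  factory.EvaluateAllMethods();
  outFile->Close();
}
```

The application phase then uses the Reader with the weight files written above (the file name is an assumption based on the default naming scheme):

```cpp
#include "TMVA/Reader.h"

void applyExample() {
  float var1, var2;
  TMVA::Reader reader("!Color");
  reader.AddVariable("var1", &var1);              // must match the training setup
  reader.AddVariable("var2", &var2);
  reader.BookMVA("BDT", "weights/TMVAClassification_BDT.weights.xml");

  // inside the event loop: fill var1, var2, then
  double mvaValue = reader.EvaluateMVA("BDT");
  (void)mvaValue;                                 // cut on this value to select signal
}
```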
MVA Evaluation Framework
TMVA is not only a collection of classifiers, but an MVA framework
After training, TMVA provides ROOT evaluation scripts (through GUI)
Plot all signal (S) and background (B) input variables with and without pre-processing
Correlation scatters and linear coefficients for S & B
Classifier outputs (S & B) for test and training samples (spot overtraining)
Classifier Rarity distribution
Classifier significance with optimal cuts
B rejection versus S efficiency
Classifier-specific plots:

• Likelihood reference distributions
• Classifier PDFs (for probability output and Rarity)
• Network architecture, weights, and convergence
• Rule-fitting analysis plots
• Visualisation of decision trees
Evaluating the Classifier Training
[Figures: projective likelihood PDFs, MLP training convergence, BDTs, …; average no. of nodes before/after pruning: 4193 / 968]
Evaluating the Classifier Training
Classifier output distributions for test and training samples …
Evaluating the Classifier Training
Optimal cut for each classifier …

Using the TMVA graphical output one can determine the optimal cut on a classifier output (the working point in "statistical significance").
Evaluating the Classifier Training
Background rejection versus signal efficiency: the best plot to compare classifier performance.

An elegant variable is the Rarity: it transforms the background to a uniform distribution, so the height of the signal peak is a direct measure of classifier performance:

$$ R(y) \;=\; \int_{-\infty}^{y} \hat{y}_B(y')\,\mathrm{d}y' $$

If the background in data is non-uniform → problem in the training sample.
Evaluating the Classifiers (taken from TMVA output…)
Input variable ranking: how discriminating is a variable?

--- Fisher : Ranking result (top variable is best ranked)
--- Fisher : ---------------------------------------------
--- Fisher : Rank : Variable : Discr. power
--- Fisher : ---------------------------------------------
--- Fisher : 1    : var4     : 2.175e-01
--- Fisher : 2    : var3     : 1.718e-01
--- Fisher : 3    : var1     : 9.549e-02
--- Fisher : 4    : var2     : 2.841e-02
--- Fisher : ---------------------------------------------

Classifier correlation and overlap: do the classifiers select the same events as signal and background? If not, there is something to gain!

--- Factory : Inter-MVA overlap matrix (signal):
--- Factory : ------------------------------
--- Factory :             Likelihood  Fisher
--- Factory : Likelihood:     +1.000  +0.667
--- Factory : Fisher:         +0.667  +1.000
--- Factory : ------------------------------
Evaluating the Classifiers (taken from TMVA output…)
Evaluation results ranked by best signal efficiency and purity (area):

------------------------------------------------------------------------------
  MVA          Signal efficiency at bkg eff. (error):   | Sepa-   Signifi-
  Methods:       @B=0.01    @B=0.10    @B=0.30    Area  | ration: cance:
------------------------------------------------------------------------------
  Fisher      : 0.268(03)  0.653(03)  0.873(02)  0.882  | 0.444   1.189
  MLP         : 0.266(03)  0.656(03)  0.873(02)  0.882  | 0.444   1.260
  LikelihoodD : 0.259(03)  0.649(03)  0.871(02)  0.880  | 0.441   1.251
  PDERS       : 0.223(03)  0.628(03)  0.861(02)  0.870  | 0.417   1.192
  RuleFit     : 0.196(03)  0.607(03)  0.845(02)  0.859  | 0.390   1.092
  HMatrix     : 0.058(01)  0.622(03)  0.868(02)  0.855  | 0.410   1.093
  BDT         : 0.154(02)  0.594(04)  0.838(03)  0.852  | 0.380   1.099
  CutsGA      : 0.109(02)  1.000(00)  0.717(03)  0.784  | 0.000   0.000
  Likelihood  : 0.086(02)  0.387(03)  0.677(03)  0.757  | 0.199   0.682
------------------------------------------------------------------------------

Testing efficiency compared to training efficiency (overtraining check):

------------------------------------------------------------------------------
  MVA          Signal efficiency: from test sample (from training sample)
  Methods:       @B=0.01        @B=0.10        @B=0.30
------------------------------------------------------------------------------
  Fisher      : 0.268 (0.275)  0.653 (0.658)  0.873 (0.873)
  MLP         : 0.266 (0.278)  0.656 (0.658)  0.873 (0.873)
  LikelihoodD : 0.259 (0.273)  0.649 (0.657)  0.871 (0.872)
  PDERS       : 0.223 (0.389)  0.628 (0.691)  0.861 (0.881)
  RuleFit     : 0.196 (0.198)  0.607 (0.616)  0.845 (0.848)
  HMatrix     : 0.058 (0.060)  0.622 (0.623)  0.868 (0.868)
  BDT         : 0.154 (0.268)  0.594 (0.736)  0.838 (0.911)
  CutsGA      : 0.109 (0.123)  1.000 (0.424)  0.717 (0.715)
  Likelihood  : 0.086 (0.092)  0.387 (0.379)  0.677 (0.677)
------------------------------------------------------------------------------

Check for over-training: compare test and training efficiencies (large differences, as for BDT and PDERS above, indicate overtraining).
Subjective Summary of TMVA Classifier Properties
[Table: classifiers (Cuts, Likelihood, PDERS/k-NN, H-Matrix, Fisher, MLP, BDT, Rule fitting, SVM) rated against the criteria: performance with no/linear correlations and with nonlinear correlations; speed of training and of response; robustness against overtraining and against weak input variables; curse of dimensionality; transparency. The ratings themselves are graphical in the original slide.]

The properties of the function discriminant (FDA) depend on the chosen function.
Regression Example
Example: the target is a quadratic function of 2 variables, approximated by:

• a "linear regressor"
• a neural network

[Figure: the true quadratic surface and the two approximations]

A booking sketch for the regression mode follows below.
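A minimal regression booking sketch in the TMVA 4 API (tree and branch names are illustrative):

```cpp
#include "TFile.h"
#include "TTree.h"
#include "TMVA/Factory.h"
#include "TMVA/Types.h"

void trainRegression(TTree* tree) {
  TFile* outFile = TFile::Open("TMVAReg.root", "RECREATE");
  TMVA::Factory factory("TMVARegression", outFile, "!V:AnalysisType=Regression");

  factory.AddVariable("x1", 'F');
  factory.AddVariable("x2", 'F');
  factory.AddTarget("target");                    // the quantity to be estimated
  factory.AddRegressionTree(tree, 1.0);
  factory.PrepareTrainingAndTestTree("", "SplitMode=Random");

  factory.BookMethod(TMVA::Types::kLD, "LD", ""); // linear regressor
  factory.BookMethod(TMVA::Types::kMLP, "MLP", "HiddenLayers=N+5");

  factory.TrainAllMethods();
  factory.TestAllMethods();
  factory.EvaluateAllMethods();
  outFile->Close();
}
```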
NEW! TMVA Users Guide for 4.0, available on http://tmva.sf.net

TMVA Users Guide: 97 pp., incl. code examples (arXiv physics/0703039)

Web page with a quick-reference summary of all training options!
Some Toy Examples
The “Schachbrett” Toy (chess board)
Performance achieved without parameter tuning: PDERS and BDT are the best "out of the box" classifiers.

After some parameter tuning, SVM and ANN (MLP) also perform equally well.

[Figure: event distribution of the chess-board toy, events weighted by SVM response, and the rejection curves approaching the theoretical maximum]
More Toys: Linear-, Cross-, Circular Correlations
Illustrate the behaviour of linear and nonlinear classifiers
[Figure: three toy datasets: linear correlations (same for signal and background), linear correlations (opposite for signal and background), circular correlations (same for signal and background)]
How does linear decorrelation affect strongly nonlinear cases?

[Figure: the circular-correlations toy with original correlations, after SQRT decorrelation, and after PCA decorrelation]
Weight Variables by Classifier Output
How well do the classifiers resolve the various correlation patterns?

[Figure: events weighted by classifier output for Likelihood, Likelihood-D, PDERS, Fisher, MLP, and BDT, shown for the linear (same for S and B), cross-linear (opposite for S and B), and circular (same for S and B) correlation patterns]
Final Classifier Performance
Background rejection versus signal efficiency curve:
[Figure: background rejection versus signal efficiency for the linear, cross, and circular correlation examples]
Stability with Respect to Irrelevant Variables

Toy example with 2 discriminating and 4 non-discriminating variables:

[Figure: background rejection versus signal efficiency when the classifiers use only the two discriminating variables, compared with using all variables]
General Remarks and About Systematics
General Advice for (MVA) Analyses

There is no magic in MVA methods:

no need to be too afraid of "black boxes"
you typically still need careful tuning and some "hard work"

The most important thing at the start is finding good observables:

good separation power between S and B
little correlation amongst each other
no correlation with the parameters you try to measure in your signal sample!

Think also about possible combinations of variables: they can eliminate correlations.

Always apply straightforward preselection cuts and let the MVA do only the rest.

"Sharp features should be avoided": they cause numerical problems and loss of information when binning is applied. Simple variable transformations (e.g. log(var1)) can often smooth out such regions and let signal-background differences appear more clearly.

Treat regions of the detector that have different features "independently": merging them can introduce correlations where the variables would otherwise be uncorrelated!
Some Words about Systematic Errors

Typical worries are:

What happens if the estimated "probability density" is wrong?
Can the classifier, i.e. the discrimination function y(x), introduce systematic uncertainties?
What happens if the training data do not match "reality"?

Any wrong PDF leads to an imperfect discrimination function ("imperfect": calling it "wrong" wouldn't be right):

$$ y(\mathbf{x}) = \frac{P(\mathbf{x}\,|\,S)}{P(\mathbf{x}\,|\,B)} $$

→ loss of discrimination power, that's all! Classical cuts face exactly the same problem. But: in addition to cutting on features that are not correct, you can now also "exploit" correlations that are in fact not correct. HOWEVER: see the next slide.

Systematic errors are only introduced once "Monte Carlo events" with imperfect modelling are used to determine the efficiency, the purity, or the number of expected events: the same problem as in a classical "cut" analysis.

use control samples to test the MVA output distribution y(x)
the combined variable (the MVA output y(x)) might "hide" problems in ONE individual variable:
  train the classifier with only a few variables and compare with data
  check all input variables individually ANYWAY
Systematic "Error" in Correlations

• Use as training sample events that have correlations
• optimise CUTs, and train a proper MVA (e.g. Likelihood, BDT)
• Assume that in "real data" there are NO correlations → SEE what happens!!
Systematic "Error" in Correlations

• Compare "Data" (test sample) and Monte Carlo (both taken from the same underlying distribution)
Systematic "Error" in Correlations

• Compare "Data" (test sample) and Monte Carlo (both taken from underlying distributions that differ by the correlation!!!)

Differences are ONLY visible in the MVA output plots (and if you looked at cut sequences…).
Treatment of Systematic Uncertainties
"Calibration uncertainty" may shift the central value and hence worsen (or increase) the discrimination power of "var4".

Is there a strategy to become "less sensitive" to possible systematic uncertainties?

Classically, for a variable that is prone to uncertainties, one does not cut in the region of steepest gradient, i.e. one would not place the most important cut on an uncertain variable.

Try to make the classifier less sensitive to "uncertain variables", e.g. re-weight events in the training to decrease the separation in variables with large systematic uncertainty. (Certainly not yet a recipe that can strictly be followed; more an idea of what could perhaps be done.)
Treatment of Systematic Uncertainties
1st way:

[Figure: classifier output distributions for signal only, with and without the systematic shift]
Treatment of Systematic Uncertainties
2nd way:

[Figure: classifier output distributions for signal only, with and without the systematic shift]
Copyrights & Credits
TMVA is open source software
Use & redistribution of source permitted according to terms in BSD license
Acknowledgments: The fast development of TMVA would not have been possible without the contribution and feedback from many developers and users, to whom we are indebted: Andreas Hoecker (CERN, Switzerland), Jörg Stelzer (CERN, Switzerland), Peter Speckmayer (CERN, Switzerland), Jan Therhaag (Universität Bonn, Germany), Eckhard von Toerne (Universität Bonn, Germany), Helge Voss (MPI für Kernphysik Heidelberg, Germany), Tancredi Carli (CERN, Switzerland), Asen Christov (Universität Freiburg, Germany), Or Cohen (CERN, Switzerland and Weizmann, Israel), Krzysztof Danielowski (IFJ and AGH/UJ, Krakow, Poland), Dominik Dannheim (CERN, Switzerland), Sophie Henrot-Versille (LAL Orsay, France), Matthew Jachowski (Stanford University, USA), Kamil Kraszewski (IFJ and AGH/UJ, Krakow, Poland), Attila Krasznahorkay Jr. (CERN, Switzerland, and Manchester U., UK), Maciej Kruk (IFJ and AGH/UJ, Krakow, Poland), Yair Mahalalel (Tel Aviv University, Israel), Rustem Ospanov (University of Texas, USA), Xavier Prudent (LAPP Annecy, France), Arnaud Robert (LPNHE Paris, France), Doug Schouten (S. Fraser U., Canada), Fredrik Tegenfeldt (Iowa University, USA, until Aug 2007), Alexander Voigt (CERN, Switzerland), Kai Voss (University of Victoria, Canada), Marcin Wolter (IFJ PAN Krakow, Poland), Andrzej Zemla (IFJ PAN Krakow, Poland).

Backup slides on: (i) some illustrative toy examples, (ii) treatment of systematic uncertainties, (iii) sensitivity to weak input variables.
http://tmva.sf.net/
Questions / Discussion Points

How should systematic uncertainties be treated in MVA methods?

check multidimensional training samples, including their correlations
which differences do we care about?

Can one name a particular classifier that is:

always the best?
always the best for a specific class of models?

How could one teach the classifiers to optimise for certain regions of interest (e.g. very large background rejection only)?

is that actually necessary, or does the cut on the MVA output provide just this?
train according to the "smallest error on the measurement" rather than the best signal/background separation?

Does anyone have a good idea for a better treatment of negative weights?

How is the data sample best divided into "test" and "training" samples?

do we want an additional validation sample?
automatic tuning of parameters
cross-validation
Bias-Variance Trade-off: Overtraining

[Figure: prediction error versus model complexity. Low complexity gives high bias and low variance; high complexity gives low bias and high variance. The training-sample error keeps decreasing with complexity while the test-sample error rises again: overtraining! Illustrated alongside by a smooth versus an overly flexible decision boundary between S and B in the (x1, x2) plane.]
Cross Validation

Many (all) classifiers have tuning parameters "α" that need to be controlled against overtraining:

number of training cycles, number of nodes (neural net)
smoothing parameter h (kernel density estimator)
…

The more free parameters a classifier has to adjust internally, the more prone it is to overtraining; and more training data give better training results. But dividing the data set into a "training" and a "test" sample reduces the "training" data.

Cross validation: divide the data sample into, say, 5 sub-sets:

Train | Train | Test | Train | Train   (and the 4 other placements of the Test block)

Train 5 classifiers y_i(x, α), i = 1, …, 5, where classifier y_i(x, α) is trained without the i-th sub-sample. Calculate the test error:

$$ \mathrm{CV}(\alpha) \;=\; \frac{1}{N} \sum_{k}^{N_{\text{events}}} L\big( y_{i(k)}(\mathbf{x}_k, \alpha) \big) \;, \qquad L:\ \text{loss function} $$

Choose the tuning parameter α for which CV(α) is minimal, and train the final classifier using all the data. A generic sketch follows below.
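A generic k-fold skeleton in plain C++ (independent of TMVA, which at version 4.0 had no built-in cross-validation; `train` and `loss` are caller-supplied and thus hypothetical):

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Sample     = std::vector<std::pair<double, double>>;   // (x, target)
using Classifier = std::function<double(double)>;

// CV(alpha) as on the slide: each event is evaluated by the classifier
// that was trained without the sub-sample containing that event.
double crossValError(const Sample& data,
                     const std::function<Classifier(const Sample&)>& train,
                     const std::function<double(double, double)>& loss,
                     int k = 5) {
  double sum = 0.0;
  for (int fold = 0; fold < k; ++fold) {
    Sample trainSet;                                         // all but the i-th sub-sample
    for (std::size_t i = 0; i < data.size(); ++i)
      if (static_cast<int>(i % k) != fold) trainSet.push_back(data[i]);
    const Classifier y = train(trainSet);                    // y_i(x, alpha)
    for (std::size_t i = 0; i < data.size(); ++i)            // held-out events only
      if (static_cast<int>(i % k) == fold)
        sum += loss(y(data[i].first), data[i].second);       // L(y(x_k))
  }
  return sum / data.size();                                  // CV estimate of the loss
}
```

Scanning this over candidate α values and retraining on the full sample at the minimum implements the procedure described above.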
Minimisation
A robust global minimum finder is needed at various places in TMVA.

Default solution in HEP: (T)Minuit/Migrad [how much longer do we need to suffer …?]: a gradient-driven search, using a variable metric, that can use a quadratic Newton-type solution. It is a poor global minimum finder and gets quickly stuck in the presence of local minima.

Specific global optimisers implemented in TMVA:

Genetic Algorithm: biology-inspired optimisation algorithm
Simulated Annealing: slow "cooling" of the system to avoid "freezing" in a local solution
Brute force: Monte Carlo sampling; sample the entire solution space and choose the solution providing the minimum estimator (a good global minimum finder, but poor accuracy)

TMVA allows minimisers to be chained: for example, one can use MC sampling to detect the vicinity of a global minimum and then use Minuit to converge accurately onto it.
Minimisation Techniques

Monte Carlo / Grid Search: brute force methods (random Monte Carlo or grid search)
• sample the entire solution space and choose the solution providing the minimum estimator
• good global minimum finder, but poor accuracy

Minuit: default solution in HEP
• gradient-driven search, using a variable metric; can use a quadratic Newton-type solution
• poor global minimum finder, gets quickly stuck in the presence of local minima

Simulated Annealing: like heating up metal and slowly cooling it down ("annealing")
• atoms in metal move towards the state of lowest energy, while for sudden cooling they tend to freeze in intermediate, higher-energy states → slow "cooling" of the system avoids "freezing" in a local solution

Genetic Algorithm: biology-inspired
• "genetic" representation of points in the parameter space
• uses mutation and "crossover"
• finds approximately global minima
Binary Trees
A tree data structure in which each node has at most two children, typically called left and right.

Binary trees are used in TMVA to implement binary search trees and decision trees.

[Figure: a binary tree with root node, leaf nodes, and < / > branching; the depth of the tree is indicated]

Amount of computing time to search for events:

• box search: ∝ (N_events)^{N_var}
• BT search: ∝ N_events · N_var · ln₂(N_events)
Gauss-Decorr-Gauss etc..
[Figure: distributions after successive transformation chains: G (Gauss), GDG (Gauss-Decorrelate-Gauss), GDGDGDG (iterated)]