New approaches for boosting to uniformity
-
Upload
nikita-kazeev -
Category
Data & Analytics
-
view
29 -
download
0
Transcript of New approaches for boosting to uniformity
New approaches for boosting to uniformity
Alex Rogozhnikova,b, Aleksandar Bukvac , Vladimir Gligorovd ,Andrey Ustyuzhaninb,e,f and Mike Williamsg
a Lomonosov Moscow State University, Moscowb Yandex School of Data Analysis, Moscow
c Faculty of Physics, Belgraded Organisation Européenne pour la Recherche Nucléaire (CERN), Geneva
e Moscow Institute of Physics and Technology, Moscowf Imperial College, London
g Massachusetts Institute of Technology, Cambridge
AIST, 10 April 2015
Outline
I What is uniformity of classifier efficiency?I Why can it be important?I How to measure it?I How to achieve it?
What is uniformity?
Predictions are called uniform in variables var1, . . . , varn ifprediction and set of this variables is statistically independent.This (and only this) guarantees that the classifier will produce thesame selection efficiency in every region over var1, . . . , varn
mass
pre
dic
tions
(a) Uniform predictionsmass
pre
dic
tions
(b) Non-uniform
What is uniformity?
Predictions are called uniform in variables var1, . . . , varn ifprediction and set of this variables is statistically independent.This (and only this) guarantees that the classifier will produce thesame selection efficiency in every region over var1, . . . , varn
mass
pre
dic
tions
(a) Uniform predictionsmass
pre
dic
tions
(b) Non-uniform
Motivation
In general, when we want to measure some statistic, such as mean.In particle physics
I Dalitz-plot analysis (or any angular or amplitude analysis)I search for a new particle (not to get fake peak)I sensitivity for new signal in wide range of mass (lifetime ...),
(i.e. train only one classifier, not separate for each mass)
Desirable properties of metrics
The metric should ...
1. not depend strongly on the number of events used to testuniformity
2. not depend on the total weight
3. depend on order of predictions, not the exact values ofpredictions (example: Pearson correlation does not satisfy thisproperty)
4. be stable against free parameters (number of bins, k in knn)
Similarity-based approach for measuring uniformity
Idea: uniformity means that distribution of predictions in every binis equal.
Let's compare the global distribution (blue hist) with distibution in one bin(yellow hist). Yellow rectangle shows the events in selected bin overmass.
mass
pre
dic
tions
mass
pre
dic
tions
Similarity-based approach
Let F (x) = P(prediction < x) --- cdf of all predictions,Fbin(x) = P(prediction in bin < x) --- cdf of predictions in bin overmass. Define weightbin =
weight of events in binweight of all events
Kolmogorov-Smirnov measure (uninformative)∑bin
weightbinmaxx
|Fbin(x)− F (x)| ,
Cramér--von Mises similarity∑bin
weightbin
∫|Fbin(x)− F (x)|p dF (x)
Cut-based approach (1/2)
Select some set of efficiencies (in examples: 0.1, 0.3, 0.5, 0.7, 0.9),for each one can compute global cut and look at efficincies in eachbin:
mass
pre
dic
tions
0.0 0.2 0.4 0.6 0.8 1.0mass
0.0
0.2
0.4
0.6
0.8
1.0
pre
dic
tions
mass
pre
dic
tions
0.0 0.2 0.4 0.6 0.8 1.0mass
0.0
0.2
0.4
0.6
0.8
1.0
pre
dic
tions
Cut-based approach (2/2)
Standard deviation of efficiency
SDE2(eff) =∑bin
weightbin × (effbin − eff)2
SDE2 =1
k
∑eff∈[eff1...effk ]
SDE2(eff)
Theil index of x1, . . . , xn
Theil =1
N
∑i
xi
< x >ln
xi
< x >,
Theil index of efficiency
Theil(eff) =∑bin
weightbineffbineff
lneffbineff
Theil =1
k
∑eff∈[eff1...effk ]
Theil(eff).
Connection Between SDE and Distribution SimilarityApproach
Consider the SDE with global efficiencies= [1/N , 2/N , . . . ,N/N ].
limN→∞
SDE2 = limN→∞
1
N
∑eff
SDE2(eff) =∫ 1
0SDE2(eff)d eff =
∫ 1
0
∑bin
weightbin |effbin − eff|2 d eff
From the other side, we can write the expression forsimilarity-based measure (for p = 2)∑
bin
weightbin
∫|Fbin(x)− F (x)|2 dF (x) =∫ ∑
bin
weightbin |Fbin(x)− F (x)|2 dF (x)
Achieving uniformity
I Typical solution: choose such features that don't give an abilityto reconstruct 'mass' (or other selected 'uniform variables').
I Prediction quality suffers.
Previous work
uBoostJ. Stevens and M. Williams, uBoost: A boosting method for producinguniform selection efficiencies from multivariate classifiers, JINST 8, P12013(2013). [arXiv:1305.7248]
Problem: uBoost is 100 times more computationally expensive thanordinary boosting.
Idea: add uniformity to the loss function
FL doesn't take into account the quality of predictions, onlyuniformity.So what we use in practice is linear combination of FlatnessLossand AdaLoss:
loss = FL+ α Lada
First one penalizes non-uniformity, second one --- poor predictions,α is usually taken small.
Boosting: Gradient Boosting with plain AdaLoss
Gradient boosting on trees is widely used algorithm, it's built upondecision tree regressors with usage of some loss function.Usual AdaLoss (yi = +1 for signal,−1 for background):
Lada =∑
i∈eventsexp[−scorei yi ]
Gradient boosting tries to minimize the loss function by 'steepestdescent'-like procedure.Pseudo-residual of AdaLoss:
− ∂ Lada
∂ scorei= yi exp[−scorei yi ],
Boosting: Gradient Boosting with FlatnessLoss (uGBFL)
CvM measure of non-uniformity:∑bin
weightbin
∫|Fbin(x)− F (x)|p dF (x),
Let's modify this function:
FL =∑bin
weightbin
∫|Fbin(x)− F (x)|p dx
so that it becomes differentiable
∂
∂ scoreiFL ∼= wi p
∣∣Fbin(i)(x)− F (x)∣∣p−1 sgn[Fbin(i)(x)−F (x)]
∣∣∣∣∣x=scorei
Tradeoff uniformity vs quality
In uGBFL we can choose different values of α thus adjustingquality/uniformity.
Figure : ROC curves
Tradeoff uniformity vs quality
In uGBFL we can choose different values of α thus adjustingquality/uniformity.
(a) ROC vs #trees(greater is better)
(b) SDE vs #trees(less is better)
Summary on classifiers
New classifiers
I faster (than uBoost)
I can target uniformity in both signal and bck
I uGBFL is highly tunable and proved to be able to fight severecorrelation
uGBFL will be useful for:
I Dalitz-plot or angular/amplitude analyses (as shown);
I mass measurements since both signal and background can be keptunbiased near mass peak
I searches for new particles since efficiency can be kept optimal formass, lifetime values not used in the generated training samples
Further
Read:http://arxiv.org/abs/1410.4140
Use:https://github.com/anaderi/lhcb_trigger_ml