New approaches for boosting to uniformity


New approaches for boosting to uniformity

Alex Rogozhnikov (a,b), Aleksandar Bukva (c), Vladimir Gligorov (d), Andrey Ustyuzhanin (b,e,f) and Mike Williams (g)

a Lomonosov Moscow State University, Moscow
b Yandex School of Data Analysis, Moscow
c Faculty of Physics, Belgrade
d Organisation Européenne pour la Recherche Nucléaire (CERN), Geneva
e Moscow Institute of Physics and Technology, Moscow
f Imperial College, London

g Massachusetts Institute of Technology, Cambridge

[email protected]

AIST, 10 April 2015

Outline

- What is uniformity of classifier efficiency?
- Why can it be important?
- How to measure it?
- How to achieve it?

What is uniformity?

Predictions are called uniform in variables var1, …, varn if the prediction and this set of variables are statistically independent. This (and only this) guarantees that the classifier produces the same selection efficiency in every region over var1, …, varn.

[Figure: predictions vs. mass. (a) Uniform predictions; (b) Non-uniform]


Motivation

In general, whenever we want to measure some statistic, such as a mean. In particle physics:

- Dalitz-plot analysis (or any angular or amplitude analysis)

- search for a new particle (so as not to get a fake peak)

- sensitivity to a new signal in a wide range of mass (lifetime, ...), i.e. train only one classifier instead of a separate one for each mass

Desirable properties of metrics

The metric should ...

1. not depend strongly on the number of events used to test uniformity

2. not depend on the total weight

3. depend on the order of predictions, not their exact values (example: Pearson correlation does not satisfy this property)

4. be stable against free parameters (number of bins, k in kNN)

Similarity-based approach for measuring uniformity

Idea: uniformity means that the distribution of predictions in every bin is the same.

Let's compare the global distribution (blue histogram) with the distribution in one bin (yellow histogram). The yellow rectangle marks the events in the selected bin over mass.

[Figure: predictions vs. mass, comparing the global distribution with the distribution in one mass bin]

Similarity-based approach

Let F(x) = P(prediction < x) be the CDF of all predictions, and F_bin(x) = P(prediction in bin < x) the CDF of predictions in a bin over mass. Define

weight_bin = (weight of events in bin) / (weight of all events)

Kolmogorov–Smirnov measure (uninformative):

Σ_bin weight_bin · max_x |F_bin(x) − F(x)|

Cramér–von Mises similarity:

Σ_bin weight_bin · ∫ |F_bin(x) − F(x)|^p dF(x)
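The Cramér–von Mises similarity can be estimated directly from data. Here is a minimal numpy sketch (not the authors' implementation): it assumes unit event weights and equal-population mass bins, and approximates the integral over dF(x) by an average over all prediction values.

```python
import numpy as np

def cvm_nonuniformity(predictions, masses, n_bins=10, p=2):
    """Sum over mass bins of weight_bin * integral |F_bin(x) - F(x)|^p dF(x),
    with empirical CDFs and the integral estimated over all predictions."""
    predictions = np.asarray(predictions, dtype=float)
    masses = np.asarray(masses, dtype=float)
    n = len(predictions)
    sorted_all = np.sort(predictions)
    # global empirical CDF F, evaluated at every prediction
    F = np.searchsorted(sorted_all, predictions, side='right') / n
    # equal-population bins over mass
    edges = np.percentile(masses, np.linspace(0, 100, n_bins + 1))
    bin_idx = np.clip(np.searchsorted(edges, masses, side='right') - 1, 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        in_bin = np.sort(predictions[bin_idx == b])
        if len(in_bin) == 0:
            continue
        weight_bin = len(in_bin) / n
        # in-bin empirical CDF F_bin, evaluated at every prediction
        F_bin = np.searchsorted(in_bin, predictions, side='right') / len(in_bin)
        total += weight_bin * np.mean(np.abs(F_bin - F) ** p)
    return total
```

Predictions that are independent of mass give values close to zero; predictions correlated with mass give a clearly larger value.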

Cut-based approach (1/2)

Select some set of efficiencies (in the examples: 0.1, 0.3, 0.5, 0.7, 0.9); for each one, compute the global cut and look at the efficiencies in each bin:

[Figure: predictions vs. mass, with horizontal lines at the global cuts corresponding to each chosen efficiency]

Cut-based approach (2/2)

Standard deviation of efficiency:

SDE²(eff) = Σ_bin weight_bin × (eff_bin − eff)²

SDE² = (1/k) Σ_{eff ∈ [eff₁ … eff_k]} SDE²(eff)

Theil index of x₁, …, x_N:

Theil = (1/N) Σ_i (x_i / ⟨x⟩) ln(x_i / ⟨x⟩)

Theil index of efficiency:

Theil(eff) = Σ_bin weight_bin (eff_bin / eff) ln(eff_bin / eff)

Theil = (1/k) Σ_{eff ∈ [eff₁ … eff_k]} Theil(eff)
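The SDE construction can be sketched in a few lines of numpy. This is an illustrative version, not the paper's code: it assumes unit event weights and equal-population mass bins, and uses the same default set of global efficiencies as the slides.

```python
import numpy as np

def sde(predictions, masses, efficiencies=(0.1, 0.3, 0.5, 0.7, 0.9), n_bins=10):
    """Standard deviation of efficiency: for each global efficiency, cut on
    predictions, then compare per-bin efficiencies with the global one."""
    predictions = np.asarray(predictions, dtype=float)
    masses = np.asarray(masses, dtype=float)
    edges = np.percentile(masses, np.linspace(0, 100, n_bins + 1))
    bin_idx = np.clip(np.searchsorted(edges, masses, side='right') - 1, 0, n_bins - 1)
    total = 0.0
    for eff in efficiencies:
        # global cut that keeps a fraction `eff` of all events
        cut = np.percentile(predictions, 100 * (1 - eff))
        passed = predictions > cut
        sde2 = 0.0
        for b in range(n_bins):
            mask = bin_idx == b
            if not mask.any():
                continue
            weight_bin = mask.mean()
            eff_bin = passed[mask].mean()
            sde2 += weight_bin * (eff_bin - eff) ** 2
        total += sde2
    return np.sqrt(total / len(efficiencies))
```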

Connection Between SDE and Distribution Similarity Approach

Consider the SDE with global efficiencies = [1/N, 2/N, …, N/N]:

lim_{N→∞} SDE² = lim_{N→∞} (1/N) Σ_eff SDE²(eff) = ∫₀¹ SDE²(eff) d eff = ∫₀¹ Σ_bin weight_bin |eff_bin − eff|² d eff

From the other side, we can write the expression for the similarity-based measure (for p = 2):

Σ_bin weight_bin ∫ |F_bin(x) − F(x)|² dF(x) = ∫ Σ_bin weight_bin |F_bin(x) − F(x)|² dF(x)

Since eff_bin at global efficiency eff is determined by F_bin at the corresponding cut (eff = 1 − F(cut), eff_bin = 1 − F_bin(cut)), the two integrands agree: in the limit of many efficiencies, SDE² coincides with the p = 2 similarity measure.

Achieving uniformity

- Typical solution: choose features that don't give the classifier the ability to reconstruct 'mass' (or other selected 'uniform variables').

- Prediction quality suffers.

Previous work

uBoost
J. Stevens and M. Williams, "uBoost: A boosting method for producing uniform selection efficiencies from multivariate classifiers", JINST 8, P12013 (2013). [arXiv:1305.7248]

Problem: uBoost is 100 times more computationally expensive thanordinary boosting.

Idea: add uniformity to the loss function

FL doesn't take into account the quality of predictions, only uniformity. So what we use in practice is a linear combination of FlatnessLoss and AdaLoss:

loss = FL + α · L_ada

The first term penalizes non-uniformity, the second poor predictions; α is usually taken small.

Boosting: Gradient Boosting with plain AdaLoss

Gradient boosting on trees is a widely used algorithm; it is built upon decision tree regressors together with some loss function. The usual AdaLoss (y_i = +1 for signal, −1 for background):

L_ada = Σ_{i ∈ events} exp[−score_i · y_i]

Gradient boosting tries to minimize the loss function by a 'steepest descent'-like procedure. Pseudo-residual of AdaLoss:

−∂L_ada / ∂score_i = y_i exp[−score_i · y_i]
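To illustrate the boosting step, here is a toy numpy gradient-boosting loop in which each weak learner fits the AdaLoss pseudo-residual y_i·exp[−score_i·y_i]. The stump learner with its fixed percentile thresholds is a simplification of my own, not the paper's tree regressor.

```python
import numpy as np

def fit_stump(x, target):
    """Best single-feature threshold split, least-squares fit to `target`."""
    best_err, best = np.inf, None
    for j in range(x.shape[1]):
        for t in np.percentile(x[:, j], [20, 40, 60, 80]):
            left = x[:, j] <= t
            if left.all() or not left.any():
                continue
            lv, rv = target[left].mean(), target[~left].mean()
            err = ((target[left] - lv) ** 2).sum() + ((target[~left] - rv) ** 2).sum()
            if err < best_err:
                best_err, best = err, (j, t, lv, rv)
    return best

def gb_adaloss(x, y, n_trees=50, lr=0.3):
    """Gradient boosting where each stump fits the AdaLoss pseudo-residual."""
    score = np.zeros(len(y), dtype=float)
    stumps = []
    for _ in range(n_trees):
        residual = y * np.exp(-score * y)      # -dL_ada/dscore_i
        j, t, lv, rv = fit_stump(x, residual)
        score += lr * np.where(x[:, j] <= t, lv, rv)
        stumps.append((j, t, lv, rv))
    return score, stumps
```

After enough rounds, sign(score) separates signal from background on simple data.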

Boosting: Gradient Boosting with FlatnessLoss (uGBFL)

CvM measure of non-uniformity:

Σ_bin weight_bin ∫ |F_bin(x) − F(x)|^p dF(x)

Let's modify this function:

FL = Σ_bin weight_bin ∫ |F_bin(x) − F(x)|^p dx

so that it becomes differentiable:

∂FL / ∂score_i ≅ w_i · p · |F_bin(i)(x) − F(x)|^(p−1) · sgn[F_bin(i)(x) − F(x)], evaluated at x = score_i
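The per-event FL gradient can be evaluated with empirical CDFs. A hypothetical numpy sketch for unit event weights, where `bin_idx` assigns each event to its mass bin (both the function and that argument name are illustrative):

```python
import numpy as np

def flatness_gradient(scores, bin_idx, p=2):
    """Sketch of dFL/dscore_i for unit event weights:
    p * |F_bin(i)(s_i) - F(s_i)|^(p-1) * sgn(F_bin(i)(s_i) - F(s_i)),
    with both CDFs estimated empirically."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    sorted_all = np.sort(scores)
    # global empirical CDF F, evaluated at each event's own score
    F = np.searchsorted(sorted_all, scores, side='right') / n
    grad = np.zeros(n)
    for b in np.unique(bin_idx):
        mask = bin_idx == b
        in_bin = np.sort(scores[mask])
        # in-bin empirical CDF F_bin, evaluated at the same scores
        F_bin = np.searchsorted(in_bin, scores[mask], side='right') / len(in_bin)
        diff = F_bin - F[mask]
        grad[mask] = p * np.abs(diff) ** (p - 1) * np.sign(diff)
    return grad
```

In the combined loss = FL + α·L_ada, the boosting pseudo-residual would be the negative of this gradient plus α times the AdaLoss pseudo-residual. When every bin has the same score distribution, the gradient vanishes.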

Tests on Dalitz data

[Figure: (a) ROC vs #trees (greater is better); (b) SDE vs #trees (less is better)]

Tradeoff uniformity vs quality

In uGBFL we can choose different values of α, thus adjusting the quality/uniformity tradeoff.

Figure : ROC curves

Tradeoff uniformity vs quality

In uGBFL we can choose different values of α, thus adjusting the quality/uniformity tradeoff.

[Figure: (a) ROC vs #trees (greater is better); (b) SDE vs #trees (less is better)]

Summary on classifiers

New classifiers

- faster (than uBoost)

- can target uniformity in both signal and background

- uGBFL is highly tunable and proved able to fight severe correlation

uGBFL will be useful for:

- Dalitz-plot or angular/amplitude analyses (as shown);

- mass measurements, since both signal and background can be kept unbiased near the mass peak;

- searches for new particles, since efficiency can be kept optimal for mass and lifetime values not used in the generated training samples

Further

Read: http://arxiv.org/abs/1410.4140

Use: https://github.com/anaderi/lhcb_trigger_ml

Try yourself!

Figure: https://www.kaggle.com/c/higgs-boson