Analisi Statistica dei dati nella Fisica Nucl. e Subnucl. [Laboratorio]
(Statistical Data Analysis in Nuclear and Subnuclear Physics [Laboratory])

Gabriele Sirri, Istituto Nazionale di Fisica Nucleare

Page 1: Analisi Statistica dei dati nella Fisica Nucl. e Subnucl ...campus.unibo.it/230678/60/ASD-lab-corso.pdf · Analisi Statistica dei dati nella Fisica Nucl. e Subnucl. [Laboratorio]


This is a collection of the slides shown in the academic year 2015-2016, continuously updated and revised.


Credits: RooFit slides extracted or adapted from original presentations by Wouter Verkerke.


Terminology


Data modelling

• What’s it all about?

• Estimators

• The maximum likelihood estimator


Data modelling – analysis examples

• Typical questions

– We obtained a mass distribution from data

– What are the signal and background yields?

– What is the significance of the signal?

• Typical tasks

– Creation of an adequate model for the data

– Description of detector effects such as acceptances and resolutions

– Make sure the model is correctly implemented, e.g. with toy Monte Carlo studies

– Fit the model to the data

– Graphical representation of the data and fit results


Intermezzo – Functions vs probability density functions

• Why use probability density functions rather than ‘plain’ functions to describe your data?

– Models are easier to interpret. If the Blue and Green p.d.f.s are each guaranteed to be normalized to 1, then the fractions of Blue and Green can be cleanly interpreted as numbers of events

– Many statistical techniques only function properly with p.d.f.s (e.g. maximum likelihood)

– ‘Toy Monte Carlo’ events can be sampled from a p.d.f. because its value is always guaranteed to be >= 0

• So why isn’t everybody always using them?

– The normalization can be hard to calculate (e.g. it can be different for each set of parameter values p)

– In >1 dimension, (numeric) integration can be particularly hard

– RooFit aims to simplify these tasks
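The point about sampling toy Monte Carlo events from a non-negative density can be illustrated with a minimal accept-reject sketch in plain C++ (standard library only). The function name `sampleToyEvents` and its parameters are hypothetical choices for this illustration; this is not how RooFit generates events internally.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Draw n toy events from a non-negative (not necessarily normalized) density f
// on [lo, hi] by accept-reject sampling. fMax must be an upper bound of f on
// [lo, hi]. All names here are illustrative assumptions.
std::vector<double> sampleToyEvents(const std::function<double(double)>& f,
                                    double lo, double hi, double fMax,
                                    std::size_t n, unsigned seed) {
    assert(lo < hi && fMax > 0.0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> ux(lo, hi);
    std::uniform_real_distribution<double> uy(0.0, fMax);
    std::vector<double> events;
    events.reserve(n);
    while (events.size() < n) {
        const double x = ux(rng);
        const double fx = f(x);
        assert(fx >= 0.0 && fx <= fMax);  // sampling only works for 0 <= f <= fMax
        if (uy(rng) < fx) events.push_back(x);  // accept with probability f(x)/fMax
    }
    return events;
}
```

For example, `sampleToyEvents([](double x) { return x; }, 0.0, 1.0, 1.0, 20000, 1u)` draws events from a linearly rising density on [0, 1], whose sample mean should come out close to 2/3.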


Estimators

• An estimator is a procedure giving a value for a parameter or a property of a distribution as a function of the actual data values, e.g.

– Estimator of the mean: $\hat{\mu}(x) = \frac{1}{N} \sum_i x_i$

– Estimator of the variance: $\hat{V}(x) = \frac{1}{N} \sum_i (x_i - \hat{\mu})^2$

• A perfect estimator is

– Consistent: $\lim_{n \to \infty} \hat{a} = a$

– Unbiased: with finite statistics you get the right answer on average

– Efficient: the variance $V(\hat{a}) = \langle (\hat{a} - \langle \hat{a} \rangle)^2 \rangle$ reaches its smallest possible value, called the Minimum Variance Bound (MVB)

– There are no perfect estimators for real-life problems

Note: the Cramér-Rao theorem says there is a limit to the accuracy of an estimator, i.e. there is a minimum value of the variance (the MVB). Estimators are called efficient if V(estimator) = MVB.
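The two estimators above can be written out as a short sketch in plain C++. The function names are hypothetical; the third function shows the usual N/(N-1) correction, which removes the bias that the 1/N variance estimator has when the mean is estimated from the same data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Estimator of the mean: mu_hat = (1/N) * sum_i x_i
double estimateMean(const std::vector<double>& x) {
    assert(!x.empty());
    double sum = 0.0;
    for (double xi : x) sum += xi;
    return sum / static_cast<double>(x.size());
}

// Estimator of the variance with a 1/N normalization:
// V_hat = (1/N) * sum_i (x_i - mu_hat)^2
// With mu_hat taken from the same data its expectation is (N-1)/N times
// the true variance, so it is biased at finite N.
double estimateVariance(const std::vector<double>& x) {
    const double mu = estimateMean(x);
    double sum2 = 0.0;
    for (double xi : x) sum2 += (xi - mu) * (xi - mu);
    return sum2 / static_cast<double>(x.size());
}

// Bias-corrected variant: equivalent to dividing by N-1 instead of N.
double estimateVarianceUnbiased(const std::vector<double>& x) {
    assert(x.size() > 1);
    const double n = static_cast<double>(x.size());
    return estimateVariance(x) * n / (n - 1.0);
}
```

For the sample {1, 2, 3, 4} these give a mean of 2.5, a 1/N variance of 1.25 and a bias-corrected variance of 5/3. The bias of the 1/N form vanishes as N grows, which is the sense in which an estimator can be consistent without being unbiased.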


The Likelihood estimator

• Definition of Likelihood

– Given data D(x) and a model F(x;p):

$$L(p) = \prod_i F(x_i; p), \quad \text{i.e.} \quad L(p) = F(x_0;p) \cdot F(x_1;p) \cdot F(x_2;p) \cdots$$

– For convenience the negative log of the Likelihood is often used:

$$-\ln L(p) = -\sum_i \ln F(x_i; p)$$

• Parameters are estimated by maximizing the Likelihood, or equivalently minimizing –log(L):

$$\left. \frac{d \ln L(p)}{dp} \right|_{p_i = \hat{p}_i} = 0$$

• Functions used in likelihoods must be Probability Density Functions:

$$\int F(x;p)\,dx \equiv 1, \qquad F(x;p) \ge 0$$
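As a concrete instance of the definition, the sketch below evaluates –log(L) for a Gaussian model in plain C++. The names `gaussPdf` and `negLogLikelihood` are assumptions made for this illustration; RooFit builds the equivalent object for you.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Normalized Gaussian p.d.f. F(x; mean, sigma)
double gaussPdf(double x, double mean, double sigma) {
    const double pi = std::acos(-1.0);
    const double z = (x - mean) / sigma;
    return std::exp(-0.5 * z * z) / (sigma * std::sqrt(2.0 * pi));
}

// -ln L(p) = -sum_i ln F(x_i; p) for the parameters p = (mean, sigma)
double negLogLikelihood(const std::vector<double>& data, double mean, double sigma) {
    assert(sigma > 0.0);
    double nll = 0.0;
    for (double xi : data) nll -= std::log(gaussPdf(xi, mean, sigma));
    return nll;
}
```

With sigma held fixed, –log(L) as a function of the mean is smallest at the sample mean: for the data {1, 2, 3, 4} and sigma = 1, the value at mean = 2.5 is lower than at 2.4 or 2.6.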


Maximum Likelihood – Variance on ML parameter estimates

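• Estimator for the parameter variance is

$$\hat{\sigma}(p)^2 = \hat{V}(p) = \left( \left. \frac{d^2(-\ln L)}{dp^2} \right|_{p=\hat{p}} \right)^{-1}$$

– I.e. the variance is estimated from the 2nd derivative of –log(L) at the minimum

– Valid if the estimator is efficient and unbiased!

– From the Rao-Cramér-Fréchet inequality, where b is the bias as a function of p; the inequality becomes an equality in the limit of an efficient estimator:

$$V(\hat{p}) \ge \left( 1 + \frac{db}{dp} \right)^2 \Big/ \left\langle \frac{d^2(-\ln L)}{dp^2} \right\rangle$$

• Visual interpretation of variance estimate

– Taylor expand –log(L) around the minimum, where the first derivative vanishes:

$$-\ln L(p) = -\ln L(\hat{p}) + \left. \frac{d(-\ln L)}{dp} \right|_{p=\hat{p}} (p - \hat{p}) + \frac{1}{2} \left. \frac{d^2(-\ln L)}{dp^2} \right|_{p=\hat{p}} (p - \hat{p})^2 + \dots \approx -\ln L_{max} + \frac{(p - \hat{p})^2}{2 \hat{\sigma}_p^2}$$

$$\Rightarrow \quad -\ln L(\hat{p} \pm \hat{\sigma}_p) = -\ln L_{max} + \frac{1}{2}$$

[Figure: –log(L) versus p; the points where –log(L) lies 0.5 above its minimum mark $\hat{p} \pm \hat{\sigma}_p$]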
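The second-derivative error estimate can be checked numerically on a case with a known answer: the mean of a Gaussian with known sigma, where the error should be sigma/sqrt(N). The sketch below uses plain C++ with a central finite difference; all function names are hypothetical and the step size h is an assumed default.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// -ln L for a Gaussian with known sigma as a function of the mean p
// (terms that do not depend on p are dropped).
double gaussNll(const std::vector<double>& data, double p, double sigma) {
    assert(sigma > 0.0);
    double nll = 0.0;
    for (double xi : data) nll += 0.5 * (xi - p) * (xi - p) / (sigma * sigma);
    return nll;
}

// Central finite-difference estimate of d^2 f / dp^2 at p
double secondDerivative(const std::function<double(double)>& f, double p, double h) {
    return (f(p + h) - 2.0 * f(p) + f(p - h)) / (h * h);
}

// sigma_hat(p) = 1 / sqrt( d^2(-ln L)/dp^2 ), evaluated at the minimum pHat
double parabolicError(const std::function<double(double)>& nll, double pHat, double h = 1e-3) {
    const double d2 = secondDerivative(nll, pHat, h);
    assert(d2 > 0.0);  // pHat must be a minimum of -ln L
    return 1.0 / std::sqrt(d2);
}
```

For the data {1, 2, 3, 4} with sigma = 2, wrapping `gaussNll` in a lambda of the mean and calling `parabolicError` at the minimum 2.5 gives an error of 1, i.e. sigma/sqrt(N), and –log(L) evaluated one error away from the minimum is higher by 0.5, as in the visual interpretation above.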


Maximum Likelihood – Properties of MLEs

• In general, Maximum Likelihood Estimators are

– Consistent (give the right answer for N→∞)

– Mostly unbiased (bias ∝ 1/N, may need to worry at small N)

– Efficient for large N (you get the smallest possible error), so use of the 2nd derivative of –log(L) for the variance estimate is usually OK

– Invariant: a transformation of parameters will not change your answer, e.g. $\left( \hat{p} \right)^2 = \widehat{\left( p^2 \right)}$

• MLE efficiency theorem: the MLE will be unbiased and efficient if an unbiased efficient estimator exists


Maximum Likelihood – Extended ML

• Maximum likelihood information only parameterizes the shape of the distribution

$$L(p) = \prod_i F(x_i; p), \quad \text{i.e.} \quad L(p) = F(x_0;p) \cdot F(x_1;p) \cdot F(x_2;p) \cdots$$

– I.e. one can determine the fraction of signal events from an ML fit, but not the number of signal events

• Extended Maximum Likelihood adds an extra term

$$-\log L(p) = -\sum_{D} \log g(x_i, p) + N_{exp} - N_{obs} \log N_{exp}$$

– The last two terms are –log of Poisson(Nexp,Nobs) (modulo a constant)

– A clever choice of parameters allows us to extract Nsig and Nbkg in one pass ( Nexp=Nsig+Nbkg, fsig=Nsig/(Nsig+Nbkg) )
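A minimal sketch of the extended likelihood in plain C++, assuming a two-component model whose normalized signal and background p.d.f. values have already been evaluated per event. The names `extendedTerm` and `extendedNll` and the input layout are illustrative assumptions, not RooFit API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Extra term of the extended likelihood: -ln Poisson(nObs | nExp) up to a
// constant that does not depend on nExp:  nExp - nObs * ln(nExp)
double extendedTerm(double nExp, double nObs) {
    assert(nExp > 0.0);
    return nExp - nObs * std::log(nExp);
}

// Extended -ln L with yields nSig and nBkg as parameters.
// sigPdf[i] and bkgPdf[i] are the normalized signal and background p.d.f.
// values for event i; the shape is g = fSig*S + (1 - fSig)*B with
// fSig = nSig / (nSig + nBkg) and nExp = nSig + nBkg.
double extendedNll(const std::vector<double>& sigPdf, const std::vector<double>& bkgPdf,
                   double nSig, double nBkg) {
    assert(sigPdf.size() == bkgPdf.size());
    const double nExp = nSig + nBkg;
    const double fSig = nSig / nExp;
    double nll = 0.0;
    for (std::size_t i = 0; i < sigPdf.size(); ++i)
        nll -= std::log(fSig * sigPdf[i] + (1.0 - fSig) * bkgPdf[i]);
    return nll + extendedTerm(nExp, static_cast<double>(sigPdf.size()));
}
```

The extra term on its own is minimized when nExp equals nObs (for nObs = 100 it is lower at nExp = 100 than at 90 or 110), which is what ties the fitted total yield to the observed number of events.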


1. Introduction & Overview


Introduction – Focus: coding a probability density function

• Focus on one practical aspect of many data analyses in HEP: how do you formulate your p.d.f. in ROOT?

– For ‘simple’ problems (gauss, polynomial) this is easy

– But if you want to do unbinned ML fits, use non-trivial functions, or work with multidimensional functions, you quickly find that you need some tools to help you


Introduction – Why RooFit was developed

• BaBar experiment at SLAC: extract sin(2β) from time-dependent CP violation of B decay: e+e- → Υ(4S) → B B̄

– Reconstruct both Bs, measure decay time difference

– Physics of interest is in the decay-time-dependent oscillation

$$f_{sig} \cdot \mathrm{SigSel}(m; p_{sig}) \cdot \left[ \mathrm{SigDecay}(t; q_{sig}, \sin(2\beta)) \otimes \mathrm{SigResol}(t | dt; r_{sig}) \right] + (1 - f_{sig}) \cdot \mathrm{BkgSel}(m; p_{bkg}) \cdot \left[ \mathrm{BkgDecay}(t; q_{bkg}) \otimes \mathrm{BkgResol}(t | dt; r_{bkg}) \right]$$

• Many issues arise

– Standard ROOT function framework clearly insufficient to handle such complicated functions → must develop new framework

– Normalization of p.d.f. not always trivial to calculate → may need numeric integration techniques

– Unbinned fit, >2 dimensions, many events → computation performance important → must try to optimize code for acceptable performance

– Simultaneous fit to control samples to account for detector performance


Introduction – Relation to ROOT

• Extension to ROOT – (almost) no overlap with existing functionality

– RooFit adds: data modeling, toy MC data generation, data/model fitting, model visualization

– ROOT provides: C++ command line interface & macros, data management & histogramming, graphics interface, I/O support, MINUIT


RooFit core design philosophy

• Mathematical objects are represented as C++ objects

Mathematical concept                                           RooFit class
-------------------------------------------------------------  ---------------
variable               $x$                                     RooRealVar
function               $f(x)$                                  RooAbsReal
PDF                    $f(x)$                                  RooAbsPdf
space point            $\vec{x}$                               RooArgSet
integral               $\int_{x_{min}}^{x_{max}} f(x)\,dx$     RooRealIntegral
list of space points                                           RooAbsData


RooFit core design philosophy

• Represent relations between variables and functions as client/server links between objects

Math:

f(x,y,z)

RooFit diagram:

RooRealVar x, RooRealVar y, RooRealVar z → RooAbsReal f

RooFit code:

RooRealVar x("x","x",5) ;
RooRealVar y("y","y",5) ;
RooRealVar z("z","z",5) ;
RooBogusFunction f("f","f",x,y,z) ;


Object-oriented data modeling

• In RooFit every variable, data point, function and PDF is represented by a C++ object

– Objects are classified by the data/function type they represent, not by their role in a particular setup

– All objects are self-documenting

• Name – unique identifier of object

• Title – more elaborate description of object

// Objects representing a ‘real’ value
RooRealVar mass("mass","Invariant mass",5.20,5.30) ;        // initial range
RooRealVar width("width","B0 mass width",0.00027,"GeV") ;   // initial value, optional unit
RooRealVar mb0("mb0","B0 mass",5.2794,"GeV") ;

// PDF object, holding references to its variables
RooGaussian b0sig("b0sig","B0 sig PDF",mass,mb0,width) ;


2. Basic use


The simplest possible example

• We make a Gaussian p.d.f. with three variables: x, mean and sigma

// Objects representing a ‘real’ value: name of object, title of object,
// then an initial range or an initial value with optional unit
RooRealVar x("x","Observable",-10,10) ;
RooRealVar mean("mean","B0 mass",0.00027,"GeV") ;
RooRealVar sigma("sigma","B0 mass width",5.2794,"GeV") ;

// PDF object, holding references to its variables
RooGaussian model("model","signal pdf",x,mean,sigma) ;


Basics – Creating and plotting a Gaussian p.d.f

// Create an empty plot frame

RooPlot* xframe = w::x.frame() ;

// Plot model on frame

model.plotOn(xframe) ;

// Draw frame on canvas

xframe->Draw() ;

• Setup Gaussian PDF and plot

– Plot range taken from limits of x

– Axis label from gauss title

– Unit normalization

– A RooPlot is an empty frame capable of holding anything plotted versus its variable


Basics – Generating toy MC events

• Generate 10000 events from the Gaussian p.d.f. and show the distribution

• Both binned and unbinned datasets can be generated

// Generate an unbinned toy MC set
RooDataSet* data = w::gauss.generate(w::x,10000) ;

// Generate a binned toy MC set
RooDataHist* hdata = w::gauss.generateBinned(w::x,10000) ;

// Plot the generated data
RooPlot* xframe = w::x.frame() ;
data->plotOn(xframe) ;
xframe->Draw() ;


Basics – Importing data

• Unbinned data can also be imported from ROOT TTrees

// Import unbinned data
RooDataSet data("data","data",w::x,Import(*myTree)) ;

– Imports the TTree branch named “x”

– The branch can be of type Double_t, Float_t, Int_t or UInt_t; all data is converted to Double_t internally

– Specify a RooArgSet of multiple observables to import multiple observables

• Binned data can be imported from ROOT THx histograms

// Import binned data
RooDataHist data("data","data",w::x,Import(*myTH1)) ;

– Imports values, binning definition and SumW2 errors (if defined)

– Specify a RooArgList of observables when importing a TH2/3


Basics – ML fit of p.d.f to unbinned data

// ML fit of gauss to data (MINUIT printout omitted)
w::gauss.fitTo(*data) ;

// Parameters of gauss now reflect fitted values
w::mean.Print() ;
RooRealVar::mean = 0.0172335 +/- 0.0299542
w::sigma.Print() ;
RooRealVar::sigma = 2.98094 +/- 0.0217306

// Plot fitted PDF and toy data overlaid;
// the PDF is automatically normalized to the dataset
RooPlot* xframe = w::x.frame() ;
data->plotOn(xframe) ;
w::gauss.plotOn(xframe) ;


Basics – ML fit of p.d.f to unbinned data

• Can also choose to save full detail of fit

RooFitResult* r = w::gauss.fitTo(*data,Save()) ;
r->Print() ;

RooFitResult: minimized FCN value: 25055.6,
              estimated distance to minimum: 7.27598e-08
              coviarance matrix quality: Full, accurate covariance matrix

  Floating Parameter    FinalValue +/-  Error
  --------------------  --------------------------
                mean    1.7233e-02 +/-  3.00e-02
               sigma    2.9809e+00 +/-  2.17e-02

r->correlationMatrix().Print() ;

2x2 matrix is as follows

     |      0    |      1    |
-------------------------------
   0 |          1   0.0005869
   1 |  0.0005869           1


Basics – Integrals over p.d.f.s

• It is easy to create an object representing the integral over a normalized p.d.f. in a sub-range

w::x.setRange("sig",-3,7) ;
RooAbsReal* ig = w::gauss.createIntegral(w::x,NormSet(w::x),Range("sig")) ;
cout << ig->getVal() ;
0.832519
w::mean = -1 ;
cout << ig->getVal() ;
0.743677

• Similarly, one can also request the cumulative distribution function

$$C(x) = \int_{x_{min}}^{x} F(x')\,dx'$$

RooAbsReal* cdf = w::gauss.createCdf(w::x) ;


RooFit core design philosophy - Workspace

• The workspace serves as a container class for all objects created

Math:

f(x,y,z)

RooFit diagram:

RooWorkspace containing RooRealVar x, RooRealVar y, RooRealVar z → RooAbsReal f

RooFit code:

RooRealVar x("x","x",5) ;
RooRealVar y("y","y",5) ;
RooRealVar z("z","z",5) ;
RooBogusFunction f("f","f",x,y,z) ;
RooWorkspace w("w") ;
w.import(f) ;


Using the workspace

• Workspace

– A generic container class for all RooFit objects of your project

– Helps to organize analysis projects

• Creating a workspace

RooWorkspace w("w") ;

• Putting variables and functions into a workspace

– When importing a function or pdf, all its components (variables) are automatically imported too

RooRealVar x("x","x",-10,10) ;
RooRealVar mean("mean","mean",5) ;
RooRealVar sigma("sigma","sigma",3) ;
RooGaussian f("f","f",x,mean,sigma) ;

// imports f, x, mean and sigma
w.import(f) ;


Using the workspace

• Looking into a workspace

w.Print() ;

variables
---------
(mean,sigma,x)

p.d.f.s
-------
RooGaussian::f[ x=x mean=mean sigma=sigma ] = 0.249352

• Getting variables and functions out of a workspace

// Variety of accessors available
RooPlot* frame = w.var("x")->frame() ;
w.pdf("f")->plotOn(frame) ;


Using the workspace

• Alternative access to contents through namespace

– Uses CINT extension of C++, works in interpreted code only

// Export the workspace contents to a namespace
w.exportToCint() ;
RooPlot* frame = w::x.frame() ;
w::f.plotOn(frame) ;

• Writing workspace and contents to file

w.writeToFile("wspace.root") ;


Using the workspace

• Organizing your code – separate construction and use of models

void makeModel(RooWorkspace& w) {
  // Construct model here
}

void useModel(RooWorkspace& w) {
  // Make fit, plots etc here
}

void driver() {
  RooWorkspace w("w") ;
  makeModel(w) ;
  useModel(w) ;
}


RooFit core design philosophy - Factory

• The factory allows you to fill a workspace with pdfs and variables using a simplified scripting language

Math:

f(x,y,z)

RooFit diagram:

RooWorkspace containing RooRealVar x, RooRealVar y, RooRealVar z → RooAbsReal f

RooFit code:

RooWorkspace w("w") ;
w.factory("BogusFunction::f(x[5],y[5],z[5])") ;


Factory and Workspace

• One C++ object per math symbol provides the ultimate level of control over each object’s functionality, but results in lengthy user code for even simple macros

• Solution: add a factory that auto-generates objects from a math-like language, accessed through the factory() method of the workspace

• Example: reduce construction of a Gaussian pdf and its parameters from 4 lines of code to 1

RooRealVar x("x","x",-10,10) ;
RooRealVar mean("mean","mean",5) ;
RooRealVar sigma("sigma","sigma",3) ;
RooGaussian f("f","f",x,mean,sigma) ;

becomes

w.factory("Gaussian::f(x[-10,10],mean[5],sigma[3])") ;


Factory language – Goal and scope

• Aim of factory language is to be very simple.

• The goal is to construct pdfs, functions and variables

– This limits the scope of the factory language (and allows it to be kept simple)

– Objects can be customized after creation

• The language syntax has only three elements

1. Simplified expression for creation of variables

2. Expression for creation of functions and pdfs is a trivial 1-to-1 mapping of the C++ constructor syntax of the corresponding object

3. Multiple objects (e.g. a pdf and its variables) can be nested in a single expression

• Operator classes (sum,product) provide alternate syntax in factory that is closer to math notation


Factory syntax

• Rule #1 – Create a variable

x[-10,10]     // Create variable with given range
x[5,-10,10]   // Create variable with initial value and range
x[5]          // Create initially constant variable

• Rule #2 – Create a function or pdf object

ClassName::Objectname(arg1,[arg2],...)

– Leading ‘Roo’ in class name can be omitted

– Arguments are names of objects that already exist in the workspace

– Named objects must be of the correct type; if not, the factory issues an error

– Set and List arguments can be constructed with brackets {}

Gaussian::g(x,mean,sigma)    corresponds to    RooGaussian("g","g",x,mean,sigma)
Polynomial::p(x,{a0,a1})     corresponds to    RooPolynomial("p","p",x,RooArgList(a0,a1))


Factory syntax

• Rule #3 – Each creation expression returns the name of the object created

– Allows you to create input arguments to functions ‘in place’ rather than in advance

Gaussian::g(x[-10,10],mean[-10,10],sigma[3])

is equivalent to

x[-10,10]
mean[-10,10]
sigma[3]
Gaussian::g(x,mean,sigma)

• Miscellaneous points

– You can always use numeric literals where values or functions are expected

Gaussian::g(x[-10,10],0,3)

– It is not required to give component objects a name, e.g.

SUM::model(0.5*Gaussian(x[-10,10],0,3),Uniform(x)) ;


Model building – (Re)using standard components

• RooFit provides a collection of compiled standard PDF classes

– Basic: Gaussian, Exponential, Polynomial, Chebychev polynomial, …

– Physics inspired: ARGUS, Crystal Ball, Breit-Wigner, Voigtian, B/D-Decay, …

– Non-parametric: Histogram, KEYS

• Easy to extend the library: each p.d.f. is a separate C++ class

[Figure: example shapes of RooArgusBG, RooPolynomial, RooBMixDecay, RooHistPdf and RooGaussian]


Model building – (Re)using standard components

• List of most frequently used pdfs and their factory spec

Gaussian             Gaussian::g(x,mean,sigma)
Breit-Wigner         BreitWigner::bw(x,mean,gamma)
Landau               Landau::l(x,mean,sigma)
Exponential          Exponential::e(x,alpha)
Polynomial           Polynomial::p(x,{a0,a1,a2})
Chebychev            Chebychev::p(x,{a0,a1,a2})
Kernel Estimation    KeysPdf::k(x,dataSet)
Poisson              Poisson::p(x,mu)
Voigtian (=BW⊗G)     Voigtian::v(x,mean,gamma,sigma)


Model building – Making your own

• Interpreted expressions

w.factory("EXPR::mypdf('sqrt(a*x)+b',x,a,b)") ;

• Customized class, compiled and linked on the fly

w.factory("CEXPR::mypdf('sqrt(a*x)+b',x,a,b)") ;

• Custom class written by you

– Offers the option of providing analytical integrals and custom handling of toy MC generation (details in RooFit Manual)

• Compiled classes are faster in use, but require O(1-2) seconds startup overhead

– Best choice depends on use context


Model building – Adjusting parameterization

• RooFit pdf classes do not require their parameter arguments to be variables; one can plug in functions as well

• The simplest tool to perform reparameterization is an interpreted formula expression

w.factory("expr::w('(1-D)/2',D[0,1])") ;

– Note lower case: expr builds a function, EXPR builds a pdf

• Example: reparameterize a pdf that expects the mistag rate in terms of the dilution

w.factory("BMixDecay::bmix(t,mixState,tagFlav,
           tau,expr('(1-D)/2',D[0,1]),dw,....") ;


3. Composite models


Model building – (Re)using standard components

• Most realistic models are constructed as the sum of one or more p.d.f.s (e.g. signal and background)

• Facilitated through the operator p.d.f. RooAddPdf

[Figure: RooBMixDecay, RooPolynomial, RooHistPdf, RooArgusBG and RooGaussian shapes combined by RooAddPdf]


Adding p.d.f.s – Mathematical side

• From a math point of view, adding p.d.f.s is simple

– Two components F, G:

$$S(x) = f F(x) + (1-f) G(x)$$

– Generically for N components P0-PN:

$$S(x) = c_0 P_0(x) + c_1 P_1(x) + \dots + c_{n-1} P_{n-1}(x) + \left( 1 - \sum_{i=0,n-1} c_i \right) P_n(x)$$

• For N p.d.f.s, there are N-1 fraction coefficients, which should sum to less than 1

– The remainder is by construction 1 minus the sum of all other coefficients
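The generic sum can be sketched in a few lines of plain C++. The function name `sumPdfValue` and the input layout (component values already evaluated at a fixed x) are assumptions for this illustration; in RooFit this bookkeeping is done by RooAddPdf.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// S(x) = c_0 P_0(x) + ... + c_{n-1} P_{n-1}(x) + (1 - sum_i c_i) P_n(x)
// pdfValues holds P_0(x)..P_n(x) at a fixed x; fractions holds c_0..c_{n-1},
// i.e. one coefficient fewer than there are components.
double sumPdfValue(const std::vector<double>& pdfValues, const std::vector<double>& fractions) {
    assert(!pdfValues.empty());
    assert(fractions.size() + 1 == pdfValues.size());
    double value = 0.0;
    double fracSum = 0.0;
    for (std::size_t i = 0; i < fractions.size(); ++i) {
        value += fractions[i] * pdfValues[i];
        fracSum += fractions[i];
    }
    assert(fracSum <= 1.0);  // otherwise the last coefficient would be negative
    return value + (1.0 - fracSum) * pdfValues.back();
}
```

As a usage example, component values 0.135335, 0.011109 and 0 with fractions 0.5 and 0.1 give 0.5 · 0.135335 + 0.1 · 0.011109 + 0.4 · 0 ≈ 0.0687784.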


Adding p.d.f.s – Factory syntax

• Additions are created through a SUM expression

SUM::name(frac1*PDF1,frac2*PDF2,...,PDFN)

– Note that the last PDF does not have an associated fraction

• Complete example

w.factory("Gaussian::gauss1(x[0,10],mean1[2],sigma[1])") ;
w.factory("Gaussian::gauss2(x,mean2[3],sigma)") ;
w.factory("ArgusBG::argus(x,k[-1],9.0)") ;
w.factory("SUM::sum(g1frac[0.5]*gauss1, g2frac[0.1]*gauss2, argus)") ;


Extended ML fits

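• In an extended ML fit, an extra term is added to the likelihood: Poisson(Nobs,Nexp)

• This is most useful in combination with a composite pdf, where the first part describes the shape and the second the normalization:

$$F(x) = f \cdot S(x) + (1-f) \cdot B(x) \;;\quad N_{exp} = N$$

– Changing parameters from (f, N) to (N_S, N_B):

$$F(x) = \frac{N_S}{N_S + N_B} \cdot S(x) + \frac{N_B}{N_S + N_B} \cdot B(x) \;;\quad N_{exp} = N_S + N_B$$

– Write the model like this and the extended term is automatically included in –log(L):

SUM::name(Nsig*S,Nbkg*B)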


Component plotting - Introduction

• Plotting, toy event generation and fitting work identically for composite p.d.f.s

– Several optimizations specific to composite models are applied behind the scenes (e.g. delegating event generation to components)

• Extra plotting functionality specific to composite pdfs

– Component plotting

// Plot only argus components
w::sum.plotOn(frame,Components("argus"),LineStyle(kDashed)) ;

// Wildcards allowed
w::sum.plotOn(frame,Components("gauss*"),LineStyle(kDashed)) ;


Operations specific to composite pdfs

• Tree printing mode of workspace reveals component structure – w.Print("t")

RooAddPdf::sum[ g1frac * g1 + g2frac * g2 + [%] * argus ] = 0.0687785
  RooGaussian::g1[ x=x mean=mean1 sigma=sigma ] = 0.135335
  RooGaussian::g2[ x=x mean=mean2 sigma=sigma ] = 0.011109
  RooArgusBG::argus[ m=x m0=k c=9 p=0.5 ] = 0

– Can also make input files for GraphViz visualization (w::sum.graphVizTree("myfile.dot"))

– Graph output on ROOT Canvas in near future (pending ROOT integration of GraphViz package)


Convolution

• Many experimental observable quantities are well described by convolutions

– Typically a physics distribution smeared with experimental resolution (e.g. for B0 → J/ψ KS, an exponential decay distribution smeared with a Gaussian)

– By explicitly describing the observed distribution with a convolution p.d.f., one can disentangle detector and physics

• To the extent that enough information is in the data to make this possible

[Figure: physics distribution ⊗ resolution function = observed distribution]
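For reference, the convolution of a physics distribution with a resolution function can be written as follows. The symbols T (true physics distribution), R (resolution function) and the integration variable t' are notation chosen here for illustration; they do not appear in the slides.

```latex
(T \otimes R)(t) = \int_{-\infty}^{+\infty} T(t')\, R(t - t')\, dt'
```

In the decay-time example above, T would be the exponential decay distribution and R the Gaussian resolution, so the observed distribution at t collects contributions from all true decay times t' weighted by the probability of mismeasuring them by t - t'.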


4. Common fitting issues

• Understanding MINUIT output

• Instabilities and correlation coefficients


What happens when you do pdf->fitTo(*data)?


Fitting and likelihood minimization

// Construct function object representing –log(L)

RooAbsReal* nll = pdf.createNLL(data) ;

// Minimize nll w.r.t its parameters

RooMinuit m(*nll) ;

m.migrad() ;

m.hesse() ;

• What happens when you do pdf->fitTo(*data)?

– 1) Construct an object representing –log of the (extended) likelihood

– 2) Minimize –log(L) w.r.t. the floating parameters using MINUIT

• Can also do these two steps explicitly by hand


Let's take a closer look at

Minuit


A brief description of MINUIT functionality

$\hat{\sigma}(p)^2 = \hat{V}(p) = \left( \frac{d^2 (-\ln L)}{dp^2} \right)^{-1}$

• MIGRAD

– Finds the function minimum: calculates the function gradient, follows it to a (local) minimum, recalculates the gradient, and iterates until the minimum is found

• To see what MIGRAD does, it is very instructive to do RooMinuit::setVerbose(1). It will print a line for each step through parameter space

– Number of function calls required depends greatly on the number of floating parameters, the distance from the function minimum and the shape of the function

• HESSE

– Calculation of error matrix from 2nd derivatives at minimum

– Gives symmetric error. Valid under the assumption that the likelihood is (locally) parabolic

– Requires roughly N² likelihood evaluations (with N = number of floating parameters)
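The HESSE idea can be sketched in a few lines of standalone C++ (this is not MINUIT code; the function names are hypothetical). The sketch assumes Gaussian data with known width, so that –log(L) is exactly parabolic in the mean, and takes the second derivative by central finite differences.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// -log(L) for Gaussian data with known width sigma, as a function of the mean.
// Constant terms are dropped because they do not affect derivatives.
double gaussNll(const std::vector<double>& data, double mean, double sigma) {
    assert(sigma > 0.0);
    double nll = 0.0;
    for (double x : data) {
        const double pull = (x - mean) / sigma;
        nll += 0.5 * pull * pull;
    }
    return nll;
}

// HESSE-style symmetric error: sigma_p = sqrt( 1 / (d^2(-log L)/dp^2) ),
// with the second derivative taken by central finite differences at pHat.
double hesseError(const std::vector<double>& data, double pHat, double sigma,
                  double step = 1e-3) {
    const double f0 = gaussNll(data, pHat, sigma);
    const double fPlus = gaussNll(data, pHat + step, sigma);
    const double fMinus = gaussNll(data, pHat - step, sigma);
    const double secondDerivative = (fPlus - 2.0 * f0 + fMinus) / (step * step);
    return std::sqrt(1.0 / secondDerivative);
}
```

For this model the result should reproduce the textbook value sigma/√N. With N floating parameters the same procedure needs all mixed second derivatives, which is where the roughly N² function evaluations come from.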


A brief description of MINUIT functionality

• MINOS

– Calculates errors by explicitly finding the points (or the contour for >1D) where Δ(–log(L)) = 0.5

– Reported errors can be asymmetric

– Can be very expensive with a large number of floating parameters

• CONTOUR

– Finds contours of equal Δ(–log(L)) in two parameters and draws the corresponding shape

– Mostly an interactive analysis tool
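A one-parameter version of the MINOS procedure can be sketched as follows. This is standalone C++ with hypothetical names, not the MINUIT implementation: it assumes a Poisson counting experiment with n > 0 observed events, for which the minimum of –log(L) is at mu = n, and locates the two Δ(–log(L)) = 0.5 crossings by bisection.

```cpp
#include <cassert>
#include <cmath>

// -log(L) of a Poisson counting experiment with n observed events (n > 0)
// and expected mean mu, dropping the mu-independent log(n!) term.
double poissonNll(double mu, int n) {
    assert(mu > 0.0 && n > 0);
    return mu - n * std::log(mu);
}

// Bisection for the point between lo and hi where
// poissonNll(mu) - poissonNll(muHat) crosses 0.5. The caller must bracket the crossing.
double findCrossing(int n, double lo, double hi) {
    const double nllMin = poissonNll(static_cast<double>(n), n);  // muHat = n
    auto delta = [&](double mu) { return poissonNll(mu, n) - nllMin - 0.5; };
    assert(delta(lo) * delta(hi) < 0.0);
    for (int i = 0; i < 200; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (delta(lo) * delta(mid) <= 0.0) hi = mid; else lo = mid;
    }
    return 0.5 * (lo + hi);
}

struct MinosErrors { double negative; double positive; };

// MINOS-style interval: scan both sides of the minimum independently.
MinosErrors minosErrors(int n) {
    const double muHat = n;
    const double lower = findCrossing(n, 1e-6, muHat);
    const double upper = findCrossing(n, muHat, muHat + 10.0 * std::sqrt(muHat) + 10.0);
    return {lower - muHat, upper - muHat};
}
```

Because the Poisson likelihood is not parabolic, the positive error comes out larger than the parabolic estimate √n and the negative error smaller in magnitude, which is the asymmetry the MINOS bullet refers to.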


Note on MIGRAD function minimization

• For all but the most trivial scenarios it is not possible to automatically find reasonable starting values of parameters

– So you need to supply ‘reasonable’ starting values for your parameters

– You may also need to supply a ‘reasonable’ initial step size in parameters. (A step size 10x the range of the plot is clearly unhelpful)

– Using RooMinuit, the initial step size is the value of RooRealVar::getError(), so you can control this by supplying initial error values

Reason: There may exist multiple (local) minima in the likelihood or χ²

[Figure: –log(L) versus p, with a local minimum and the true minimum marked]
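The dependence on starting values can be demonstrated with a toy function. The sketch below is standalone C++ and uses a hypothetical quartic as a stand-in for –log(L), with plain fixed-rate gradient descent instead of the MIGRAD algorithm; it only illustrates that a downhill walk ends in whichever minimum is nearest to the start value.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for -log(L): a quartic with two minima
// (a shallow one at p > 0 and a deeper one at p < 0).
double toyNll(double p) { return p * p * p * p - 3.0 * p * p + p; }
double toyNllGradient(double p) { return 4.0 * p * p * p - 6.0 * p + 1.0; }

// Plain fixed-rate gradient descent; a much cruder scheme than MIGRAD,
// but it shares the property of only ever walking downhill from the start value.
double descend(double start, double rate = 0.01, int maxSteps = 100000) {
    assert(rate > 0.0);
    double p = start;
    for (int i = 0; i < maxSteps; ++i) {
        const double g = toyNllGradient(p);
        if (std::fabs(g) < 1e-10) break;  // gradient vanishes: a stationary point
        p -= rate * g;
    }
    return p;
}
```

Starting from p = 2 the walk stops in the shallow minimum at positive p, while starting from p = -2 it reaches the deeper one; nothing in the procedure itself signals that the first result is only a local minimum.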


Minuit function MIGRAD

• Purpose: find minimum

 **********
 ** 13 **MIGRAD 1000 1
 **********
 (some output omitted)
 MIGRAD MINIMIZATION HAS CONVERGED.
 MIGRAD WILL VERIFY CONVERGENCE AND ERROR MATRIX.
 COVARIANCE MATRIX CALCULATED SUCCESSFULLY
 FCN=257.304 FROM MIGRAD STATUS=CONVERGED 31 CALLS 32 TOTAL
 EDM=2.36773e-06 STRATEGY= 1 ERROR MATRIX ACCURATE
 EXT PARAMETER                                STEP         FIRST
 NO.  NAME    VALUE        ERROR        SIZE         DERIVATIVE
  1   mean    8.84225e-02  3.23862e-01  3.58344e-04  -2.24755e-02
  2   sigma   3.20763e+00  2.39540e-01  2.78628e-04  -5.34724e-02
 ERR DEF= 0.5
 EXTERNAL ERROR MATRIX. NDIM= 25 NPAR= 2 ERR DEF=0.5
  1.049e-01  3.338e-04
  3.338e-04  5.739e-02
 PARAMETER CORRELATION COEFFICIENTS
 NO.  GLOBAL   1      2
  1   0.00430  1.000  0.004
  2   0.00430  0.004  1.000

– Progress information: watch for errors here

– Parameter values and approximate errors reported by MINUIT

– Error definition (in this case 0.5 for a likelihood fit)


– Approximate error (covariance) matrix and correlation coefficients

– Value of χ² or likelihood at minimum (NB: χ² values are not divided by N d.o.f.)


– Status: should be ‘converged’ but can be ‘failed’

– Estimated Distance to Minimum (EDM): should be small, O(10⁻⁶)

– Error matrix quality: should be ‘accurate’, but can be ‘approximate’ in case of trouble


Minuit function HESSE

• Purpose: calculate error matrix from $d^2(-\ln L)/dp^2$

 **********
 ** 18 **HESSE 1000
 **********
 COVARIANCE MATRIX CALCULATED SUCCESSFULLY
 FCN=257.304 FROM HESSE STATUS=OK 10 CALLS 42 TOTAL
 EDM=2.36534e-06 STRATEGY= 1 ERROR MATRIX ACCURATE
 EXT PARAMETER                                INTERNAL     INTERNAL
 NO.  NAME    VALUE        ERROR        STEP SIZE    VALUE
  1   mean    8.84225e-02  3.23861e-01  7.16689e-05  8.84237e-03
  2   sigma   3.20763e+00  2.39539e-01  5.57256e-05  3.26535e-01
 ERR DEF= 0.5
 EXTERNAL ERROR MATRIX. NDIM= 25 NPAR= 2 ERR DEF=0.5
  1.049e-01  2.780e-04
  2.780e-04  5.739e-02
 PARAMETER CORRELATION COEFFICIENTS
 NO.  GLOBAL   1      2
  1   0.00358  1.000  0.004
  2   0.00358  0.004  1.000

Error matrix (covariance matrix), calculated from

$V_{ij} = \left( \frac{d^2 (-\ln L)}{dp_i \, dp_j} \right)^{-1}$


Correlation matrix ρ_ij, calculated from

$V_{ij} = \sigma_i \sigma_j \rho_{ij}$
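As a worked check of this relation, the numbers printed in the HESSE output (EXTERNAL ERROR MATRIX) can be inserted directly. The indices 1 and 2 refer to mean and sigma; rounding to the displayed digits is assumed.

```latex
\sigma_1 = \sqrt{V_{11}} = \sqrt{1.049\times10^{-1}} \approx 0.324,
\qquad
\sigma_2 = \sqrt{V_{22}} = \sqrt{5.739\times10^{-2}} \approx 0.240

\rho_{12} = \frac{V_{12}}{\sigma_1 \sigma_2}
          = \frac{2.780\times10^{-4}}{\sqrt{1.049\times10^{-1}\cdot 5.739\times10^{-2}}}
          \approx 0.0036
```

The two square roots reproduce the ERROR column of the parameter table, and the correlation agrees with the value 0.004 shown in the PARAMETER CORRELATION COEFFICIENTS block.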


Global correlation vector: correlation of each parameter with all other parameters
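For reference, the global correlation coefficient of parameter k is defined in terms of the covariance matrix V and its inverse. The expression below is the definition used in the MINUIT documentation as far as it is recalled here, so treat the exact form as an assumption to be checked against the manual.

```latex
\rho_k = \sqrt{\,1 - \frac{1}{V_{kk}\,\left(V^{-1}\right)_{kk}}\,}
```

For two parameters this reduces to $\rho_k = |\rho_{12}|$, which is consistent with the GLOBAL column (0.00358) of the HESSE output matching the pairwise correlation coefficient.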


Minuit function MINOS

• Error analysis through Δnll contour finding

 **********
 ** 23 **MINOS 1000
 **********
 FCN=257.304 FROM MINOS STATUS=SUCCESSFUL 52 CALLS 94 TOTAL
 EDM=2.36534e-06 STRATEGY= 1 ERROR MATRIX ACCURATE
 EXT PARAMETER             PARABOLIC    MINOS ERRORS
 NO.  NAME    VALUE        ERROR        NEGATIVE      POSITIVE
  1   mean    8.84225e-02  3.23861e-01  -3.24688e-01  3.25391e-01
  2   sigma   3.20763e+00  2.39539e-01  -2.23321e-01  2.58893e-01
 ERR DEF= 0.5

– PARABOLIC ERROR: symmetric error (repeated result from HESSE)

– MINOS ERRORS: can be asymmetric (in this example the ‘sigma’ error is slightly asymmetric)


Illustration of difference between HESSE and MINOS errors

• ‘Pathological’ example likelihood with multiple minima and non-parabolic behavior

[Figure: –log(L) curve comparing the MINOS error with the HESSE error, the latter obtained by extrapolating the parabolic approximation at the minimum]


Practical estimation – Fit convergence problems

• Sometimes fits don’t converge because, e.g.

– MIGRAD is unable to find a minimum

– HESSE finds negative second derivatives (which would imply negative errors)

• Reason is usually numerical precision and stability problems, but

– The underlying cause of fit stability problems is usually highly correlated parameters in the fit

• HESSE correlation matrix is the primary investigative tool

– In the limit of 100% correlation, the usual point solution becomes a line solution (or surface solution) in parameter space. The minimization problem is no longer well defined

Signs of trouble…

 PARAMETER CORRELATION COEFFICIENTS
 NO.  GLOBAL   1      2
  1   0.99835  1.000  0.998
  2   0.99835  0.998  1.000


Mitigating fit stability problems

• Strategy I – More orthogonal choice of parameters

– Example: fitting sum of 2 Gaussians of similar width

$F(x; f, m, s_1, s_2) = f\,G_1(x; s_1, m) + (1-f)\,G_2(x; s_2, m)$

HESSE correlation matrix:

 PARAMETER CORRELATION COEFFICIENTS
 NO.   GLOBAL   [ f]    [ m]    [s1]    [s2]
 [ f]  0.96973  1.000  -0.135   0.918   0.915
 [ m]  0.14407 -0.135   1.000  -0.144  -0.114
 [s1]  0.92762  0.918  -0.144   1.000   0.786
 [s2]  0.92486  0.915  -0.114   0.786   1.000

Widths s1, s2 strongly correlated with fraction f


Mitigating fit stability problems

– Different parameterization:

$f\,G_1(x; s_1, m_1) + (1-f)\,G_2(x; s_1 \cdot s_2, m_2)$

 PARAMETER CORRELATION COEFFICIENTS
 NO.   GLOBAL   [ f]    [ m]    [s1]    [s2]
 [ f]  0.96951  1.000  -0.134   0.917  -0.681
 [ m]  0.14312 -0.134   1.000  -0.143   0.127
 [s1]  0.98879  0.917  -0.143   1.000  -0.895
 [s2]  0.96156 -0.681   0.127  -0.895   1.000

– Correlation of width s2 and fraction f reduced from 0.92 to 0.68 (in absolute value)

– Choice of parameterization matters!

• Strategy II – Fix all but one of the correlated parameters

– If floating parameters are highly correlated, some of them may be redundant and not contribute additional degrees of freedom to your model


Mitigating fit stability problems -- Polynomials

• Warning: Regular parameterization of polynomials a0 + a1·x + a2·x² + a3·x³ nearly always results in strong correlations between the coefficients ai.

– Fit stability problems and inability to find the right solution are common at higher orders

• Solution: Use existing parameterizations of polynomials that have (mostly) uncorrelated variables

– Example: Chebychev polynomials
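For concreteness, Chebychev polynomials of the first kind can be evaluated with a simple recurrence. The sketch below is standalone C++ with hypothetical function names; the shape convention 1 + Σ c_k T_k(x) and the assumption that x has already been mapped to [-1, 1] are choices made for this illustration, and normalization is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Chebyshev polynomial of the first kind T_n(x), via the recurrence
// T_0 = 1, T_1 = x, T_{n+1} = 2 x T_n - T_{n-1}.
double chebyshevT(int n, double x) {
    assert(n >= 0);
    if (n == 0) return 1.0;
    double previous = 1.0, current = x;
    for (int k = 1; k < n; ++k) {
        const double next = 2.0 * x * current - previous;
        previous = current;
        current = next;
    }
    return current;
}

// Background shape 1 + sum_k coeffs[k-1] * T_k(x) for x mapped to [-1, 1]
// (unnormalized; normalization is left out of this sketch).
double chebyshevShape(const std::vector<double>& coeffs, double x) {
    double value = 1.0;
    for (std::size_t k = 0; k < coeffs.size(); ++k)
        value += coeffs[k] * chebyshevT(static_cast<int>(k) + 1, x);
    return value;
}
```

The T_k are mutually orthogonal on [-1, 1] (with the appropriate weight), which is why their coefficients tend to be far less correlated in a fit than the coefficients of plain powers of x.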


Minuit CONTOUR tool also useful to examine ‘bad’ correlations

• Example of 1,2 sigma contour of two uncorrelated variables

– Elliptical shape. In this example the parameters are uncorrelated

• Example of 1,2 sigma contour of two variables with problematic correlation

– Pdf = f·G1(x,0,3) + (1-f)·G2(x,0,s) with s=4 in data


Practical estimation – Bounding fit parameters

• Sometimes it is desirable to bound the allowed range of parameters in a fit

– Example: a fraction parameter is only defined in the range [0,1]

– MINUIT option ‘B’ maps finite range parameter to an internal infinite range using an arcsin(x) transformation:

[Figure: bounded external parameter space versus MINUIT internal parameter space (-∞,+∞), showing how an internal error maps to an external error]
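The arcsin mapping can be written out explicitly. The sketch below is standalone C++ with hypothetical function names; the formulas follow the transformation given in the MINUIT reference manual for a parameter bounded in [a, b], while the error propagation shown is a simple linearized version and not necessarily what MINUIT does internally.

```cpp
#include <cassert>
#include <cmath>

// External (bounded, in [a, b]) -> internal (unbounded) parameter value.
double toInternal(double pExt, double a, double b) {
    assert(a < b && pExt >= a && pExt <= b);
    return std::asin(2.0 * (pExt - a) / (b - a) - 1.0);
}

// Internal -> external: any real pInt lands inside [a, b].
double toExternal(double pInt, double a, double b) {
    assert(a < b);
    return a + 0.5 * (b - a) * (std::sin(pInt) + 1.0);
}

// Linearized error propagation: sigma_ext = |dP_ext/dP_int| * sigma_int.
// The derivative vanishes at the limits, so errors near a bound are unreliable.
double externalError(double pInt, double sigmaInt, double a, double b) {
    return std::fabs(0.5 * (b - a) * std::cos(pInt)) * sigmaInt;
}
```

The vanishing derivative at the limits is the reason why a fit result sitting at a bound tends to come with distorted or meaningless errors.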


8. Working with Likelihood

• Using discrete variable to classify data

• Simultaneous fits on multiple datasets


Plotting the likelihood

RooAbsReal* nll = w::model.createNLL(data) ;

RooPlot* frame = w::param.frame() ;

nll->plotOn(frame,ShiftToZero()) ;

• A likelihood function is a regular RooFit function

• Can e.g. plot it as usual


Constructing a χ² function

// Construct function object representing χ²

RooAbsReal* chi2 = pdf.createChi2(data) ;

// Minimize chi2 w.r.t its parameters

RooMinuit m(*chi2) ;

m.migrad() ;

m.hesse() ;

• Along similar lines it is also possible to construct a χ² function

– Only takes binned datasets (class RooDataHist)

– Normalized p.d.f. is multiplied by Ndata to obtain the expected bin contents entering the χ²

– MINUIT error definition for χ² is automatically adjusted to 1 (it is 0.5 for likelihoods), as the default error level is supplied through a virtual method of the function base class RooAbsReal
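Written out, the χ² described by these bullets has the following generic form. The symbols n_i (observed content of bin i), σ_i (its uncertainty) and p_i (p.d.f. probability in bin i) are notation introduced for this illustration, and taking σ_i = √n_i for Poisson counts is an assumption, not necessarily the RooFit default in every configuration.

```latex
\chi^2(\theta) = \sum_{i \in \text{bins}}
  \frac{\left( n_i - N_\text{data}\, p_i(\theta) \right)^2}{\sigma_i^2},
\qquad
p_i(\theta) = \int_{\text{bin } i} f(x;\theta)\, dx
```

For Gaussian-distributed bin contents $\chi^2 = -2 \ln L + \text{const}$, so a change $\Delta\chi^2 = 1$ corresponds to $\Delta(-\ln L) = 0.5$, which explains the two different MINUIT error definitions.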


Automatic optimizations in the calculation of the likelihood

• Several automatic computational optimizations are applied to the calculation of likelihoods inside RooNLLVar

– Components that have all constant parameters are pre-calculated

– Dataset variables not used by the PDF are dropped

– PDF normalization integrals are only recalculated when the ranges of their observables or the values of their parameters are changed

– Simultaneous fits: When a parameter changes, only the parts of the total likelihood that depend on that parameter are recalculated

• Lazy evaluation: calculation only done when integral value is requested

• Applicability of optimization techniques is re-evaluated for each use

– Maximum benefit for each use case

• ‘Typical’ large-scale fits see significant speed increase

– Factor of 3x – 10x not uncommon.


Statistical procedures involving likelihood

• ‘Simple’ Parameter and error estimation (MINUIT/HESSE/MINOS)

• Construct Bayesian credible intervals

– Likelihood appears in Bayes theorem for hypothesis with continuous parameters

• Construct (Profile) Likelihood Ratio intervals

– ‘Approximate confidence intervals’ (Wilks' theorem)

– Connection to MINOS errors

• NB: Can also construct Frequentist intervals (Neyman construction), but these are based on PDFs, not likelihoods


Likelihood minimization – class RooMinuit

• Class RooMinuit is an interface to the ROOT implementation of the MINUIT minimization and error analysis package.

• RooMinuit takes care of

– Passing value of minimized RooFit function to MINUIT

– Propagating changes in parameters both from RooRealVar to MINUIT and back from MINUIT to RooRealVar, i.e. it keeps the state of RooFit objects synchronous with the MINUIT internal state

– Propagating error analysis information back to RooRealVar parameter objects

– Exposing high-level MINUIT operations (MIGRAD, HESSE, MINOS, etc.) to RooFit users

– Making optional snapshots of complete MINUIT information (e.g. convergence state, full error matrix etc.)


Demonstration of RooMinuit use

// Start Minuit session on above nll

RooMinuit m(nll) ;

// MIGRAD likelihood minimization

m.migrad() ;

// Run HESSE error analysis

m.hesse() ;

// Set sx to 3, keep fixed in fit

sx.setVal(3) ;

sx.setConstant(kTRUE) ;

// MIGRAD likelihood minimization

m.migrad() ;

// Run MINOS error analysis

m.minos() ;

// Draw 1,2,3 ‘sigma’ contours in sx,sy

m.contour(sx,sy) ;


What happens if there are problems in the NLL calculation

[#0] WARNING:Minization -- RooFitGlue: Minimized function has error status.

Returning maximum FCN so far (99876) to force MIGRAD to back out of this region.

Error log follows. Parameter values: m=-7.397

RooGaussian::gx[ x=x mean=m sigma=sx ] has 3 errors

• Sometimes the likelihood cannot be evaluated due to an error condition.

– PDF probability is zero, or less than zero, at a coordinate where there is a data point (the point is ‘infinitely improbable’)

– Normalization integral of PDF evaluates to zero

• Most problematic during MINUIT operations. How to handle the error condition?

– All error conditions are gathered and reported in a consolidated way by RooMinuit

– Since MINUIT has no interface to deal with such situations, RooMinuit instead passes a large value to MINUIT to force it to retreat from the region of parameter space in which the problem occurred


What happens if there are problems in the NLL calculation

• Classic example in B physics: floating the end point of the ARGUS function

– Probability density of ARGUS above the end point is zero. If the end point is moved to a low value in the fit you end up with events above the end point → probability is zero → –log(L) = –log(0) = infinity

[Figure panels: pdf and data; –log(L) vs m0, dropping problematic events; –log(L) vs m0 with ‘wall’ (RooFit default)]


What happens if there are problems in the NLL calculation

[#0] WARNING:Minization -- RooFitGlue: Minimized function has error status.

Returning maximum FCN so far (-1e+30) to force MIGRAD to back out of this region.

Error log follows

Parameter values: m=-7.397

RooGaussian::gx[ x=x mean=m sigma=sx ]

getLogVal() top-level p.d.f evaluates to zero or negative number

@ x=x=9.09989, mean=m=-7.39713, sigma=sx=0.1

getLogVal() top-level p.d.f evaluates to zero or negative number

@ x=x=6.04652, mean=m=-7.39713, sigma=sx=0.1

getLogVal() top-level p.d.f evaluates to zero or negative number

@ x=x=2.48563, mean=m=-7.39713, sigma=sx=0.1

• Can request more verbose error logging to debug problem

– Add PrintEvalErrors(N) with N>1


Bayesian formalism

• Original Bayes Thm:

P(B|A) ∝ P(A|B) P(B).

• Let probability density function p(x|μ) be the conditional pdf for data x, given parameter μ. Then Bayes’ Thm becomes

p(μ|x) ∝ p(x|μ) p(μ).

• Substituting in a set of observed data, x0, and recognizing the likelihood, written as L(x0|μ) ≡ L(μ), then

p(μ|x0) ∝ L(x0|μ) p(μ).

[Figure: area that integrates X% of posterior]
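A minimal numerical version of this construction is sketched below in standalone C++ (hypothetical names, not RooStats). It assumes a Poisson likelihood for n observed events, a flat prior on [0, muMax], and a one-sided interval, i.e. an upper limit; the posterior is integrated on a grid.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Poisson likelihood L(mu) for n observed events (up to the constant 1/n!).
double poissonLikelihood(double mu, int n) {
    return std::pow(mu, n) * std::exp(-mu);
}

// Upper limit muUp such that the posterior integral from 0 to muUp equals
// `credibility`, for a flat prior p(mu) = const on [0, muMax].
double bayesianUpperLimit(int n, double credibility, double muMax = 50.0,
                          int nSteps = 500000) {
    assert(n >= 0 && credibility > 0.0 && credibility < 1.0);
    const double h = muMax / nSteps;
    // Cumulative integral of likelihood * prior via the midpoint rule
    std::vector<double> cumulative(nSteps + 1, 0.0);
    for (int i = 0; i < nSteps; ++i)
        cumulative[i + 1] = cumulative[i] + poissonLikelihood((i + 0.5) * h, n) * h;
    const double norm = cumulative[nSteps];  // normalizes the posterior
    for (int i = 1; i <= nSteps; ++i)
        if (cumulative[i] / norm >= credibility) return i * h;
    return muMax;
}
```

For n = 0 the posterior is exp(-mu), so the 90% upper limit should come out close to ln(10), a convenient cross-check of the integration.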


Illustration of nuisance parameters in Bayesian intervals

• Example: data with Gaussian model (mean,sigma)

[Figure panels: MLE fit to data; –logLR(mean,sigma); LR(mean,sigma); prior(mean,sigma); posterior(mean), obtained by integrating over sigma]


Bayesian formalism and integration

• Bayesian formalism often requires integration

• Straightforward to do in RooFit: integration functionality for pdfs also works for likelihood functions


Likelihood ratio intervals

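• Definition of Likelihood Ratio interval (identical to MINOS for 1 parameter)

$LR(x, \mu) = \frac{L(x, \mu)}{L(x, \hat{\mu})}$

[Figure: –log LR curve comparing the likelihood ratio interval with the HESSE error, the latter from extrapolation of the parabolic approximation at the minimum]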


Dealing with nuisance parameters in Likelihood ratio intervals

• Nuisance parameters in LR interval

– For each value of the parameter of interest, search the full subspace of nuisance parameters for the point at which the likelihood is maximized.

– Associate that value of the likelihood with that value of the parameter of interest → ‘Profile likelihood’

$\lambda(\mu) = \frac{L(\mu, \hat{\hat{\sigma}}(\mu))}{L(\hat{\mu}, \hat{\sigma})}$

• Numerator: best L(μ) for any value of σ

• Denominator: best L(μ,σ)

[Figure panels: MLE fit to data; –logLR(mean,sigma); –logPLR(mean)]
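For the Gaussian (mean, sigma) example the profiling step can be done analytically, which makes a compact illustration. The standalone C++ sketch below uses hypothetical function names and treats sigma as the nuisance parameter; it assumes the data are not all identical, so that the fitted variance is non-zero.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Conditional MLE of sigma^2 for a fixed mean: (1/N) * sum (x_i - mean)^2.
double conditionalSigma2(const std::vector<double>& data, double mean) {
    assert(!data.empty());
    double sum = 0.0;
    for (double x : data) sum += (x - mean) * (x - mean);
    return sum / data.size();
}

// -log of the profile likelihood ratio for the mean of a Gaussian with sigma
// as nuisance parameter. Substituting the conditional MLE into the likelihood gives
//   -log lambda(mean) = (N/2) * log( sigma2Hat(mean) / sigma2Hat(meanHat) ).
double negLogProfileLR(const std::vector<double>& data, double mean) {
    double meanHat = 0.0;
    for (double x : data) meanHat += x;
    meanHat /= data.size();
    const double n = static_cast<double>(data.size());
    return 0.5 * n * std::log(conditionalSigma2(data, mean) /
                              conditionalSigma2(data, meanHat));
}
```

The function is zero at the sample mean and grows on either side; the points where it reaches 0.5 delimit the approximate 68% interval on the mean with sigma profiled out. In a general model the inner maximization has no closed form and requires a minimizer call for every value of the parameter of interest.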


Working with profile likelihood

• A profile likelihood ratio can be represented by a regular RooFit function (albeit an expensive one to evaluate)

$\lambda(p) = \frac{L(p, \hat{\hat{q}})}{L(\hat{p}, \hat{q})}$

• Numerator: best L for given p; denominator: best L

RooAbsReal* nll = model.createNLL(data,NumCPU(8)) ;

RooAbsReal* pll = nll->createProfile(params) ;

RooPlot* frame = w::frac.frame() ;

nll->plotOn(frame,ShiftToZero()) ;

pll->plotOn(frame,LineColor(kRed)) ;


Dealing with nuisance parameters in Likelihood ratio intervals

[Figure: Likelihood Ratio and Profile Likelihood Ratio curves versus fsig]

• The profile likelihood ratio minimizes –log(L) for each value of fsig by changing the bkg shape params (a 6th order Chebychev pol.)


On the equivalence of profile likelihood and MINOS

• Demonstration of equivalence of (RooFit) profile likelihood and MINOS errors

– Macro to make above plots is 34 lines of code (+23 to beautify graphics appearance)


9. Intervals & Limits

• A brief introduction to RooStats


RooStats Project – Overview

• Goals:

– Standardize interface for major statistical procedures so that they can work on an arbitrary RooFit model & dataset and handle many parameters of interest and nuisance parameters.

– Implement most accepted techniques from Frequentist, Bayesian, and Likelihood-based approaches

– Provide utilities to perform combined measurements

• Design:

– Essentially all methods start with the basic probability density function or likelihood function. Building a good model is the hard part. Want to re-use it for multiple methods → Use RooFit to construct models

– Build series of tools that perform statistical procedures on RooFit models


RooStats Project – Structure

• RooFit (data modeling)

– Data modeling language (pdfs and likelihoods). Scales to arbitrary complexity

– Support for efficient integration, toy MC generation

– Workspace

• Persistent container for data models

• Completely self-contained (including custom code)

• Complete introspection and access to components

– Workspace factory provides easy scripting language to populate the workspace

• RooStats (limits, interval calculators & utilities)

– Profile Likelihood calculator

– Neyman construction (FC)

– Bayesian calculator (BAT & native MCMC)

– Utilities (combinations, construct pdfs corresponding to standard number counting problems)


RooStats Project – Organization

• Joint ATLAS/CMS project

• Core developers

– K. Cranmer (ATLAS)

– Gregory Schott (CMS)

– Wouter Verkerke (RooFit)

– Lorenzo Moneta (ROOT)

• Open project, you are welcome to join

– Max Baak, Mario Pelliccioni, Alfio Lazzaro contributing now

• Included since ROOT v5.22

– Example macros in $ROOTSYS/tutorials/roostats

• Documentation

– Code doc. via ROOT

– Users manual is in development


RooStats Project – Example

• Create a model - Example

$\mathrm{Poisson}(x \mid s \cdot r_s + b \cdot r_b) \cdot \mathrm{Gauss}(r_s, 1, 0.05) \cdot \mathrm{Gauss}(r_b, 1, 0.1)$

• Create workspace with above model (using factory)

RooWorkspace* w = new RooWorkspace("w");

// Adjacent string literals are concatenated, so each factory expression is one string
w->factory("Poisson::P(obs[150,0,300],"
           "sum::n(s[50,0,120]*ratioSigEff[1.,0,2.],"
           "b[100,0,300]*ratioBkgEff[1.,0.,2.]))");

w->factory("PROD::PC(P, Gaussian::sigCon(ratioSigEff,1,0.05),"
           "Gaussian::bkgCon(ratioBkgEff,1,0.1))");

• Contents of workspace from above operation

RooWorkspace(w) w contents

variables
---------
(b,obs,ratioBkgEff,ratioSigEff,s)

p.d.f.s
-------
RooProdPdf::PC[ P * sigCon * bkgCon ] = 0.0325554
RooPoisson::P[ x=obs mean=n ] = 0.0325554
RooAddition::n[ s * ratioSigEff + b * ratioBkgEff ] = 150
RooGaussian::sigCon[ x=ratioSigEff mean=1 sigma=0.05 ] = 1
RooGaussian::bkgCon[ x=ratioBkgEff mean=1 sigma=0.1 ] = 1


RooStats Project – Example

RooPlot* frame = w::obs.frame(100,200) ;

w::PC.plotOn(frame) ;

frame->Draw() ;

• Simple use of model


RooStats Project – Example

ProfileLikelihoodCalculator plc;

plc.SetPdf(w::PC);

plc.SetData(data); // contains [obs=160]

plc.SetParameters(w::s);

plc.SetTestSize(.1);

ConfInterval* lrint = plc.GetInterval(); // that was easy.

FeldmanCousins fc;

fc.SetPdf(w::PC);

fc.SetData(data); fc.SetParameters(w::s);

fc.UseAdaptiveSampling(true);

fc.FluctuateNumDataEntries(false);

fc.SetNBins(100); // number of points to test per parameter

fc.SetTestSize(.1);

ConfInterval* fcint = fc.GetInterval(); // that was easy.

UniformProposal up;

MCMCCalculator mc;

mc.SetPdf(w::PC);

mc.SetData(data); mc.SetParameters(w::s);

mc.SetProposalFunction(up);

mc.SetNumIters(100000); // steps in the chain

mc.SetTestSize(.1); // 90% CL

mc.SetNumBins(50); // used in posterior histogram

mc.SetNumBurnInSteps(40);

ConfInterval* mcmcint = mc.GetInterval();

• Confidence intervals calculated with model

– Profile likelihood

– FeldmanCousins

– Bayesian (MCMC)


RooStats Project – Example

double fcul = fcint->UpperLimit(w::s);

double fcll = fcint->LowerLimit(w::s);

• Retrieving and visualizing output


RooStats Project – Example

• Some notes on example

– Complete working example (with output visualization) shipped with ROOT distribution ($ROOTSYS/tutorials/roostats/rs101_limitexample.C)

– Interval calculators make no assumptions on internal structure of model. Can feed model of arbitrary complexity to same calculator (computational limitations still apply!)


Introduction to RooStats


RooStats

RooStatsTutorial_120323.pdf: https://indico.desy.de/getFile.py/access?contribId=15&resId=3&materialId=slides&confId=5065 (slides 1 to 14)


TMVA


TMVA (Toolkit for Multivariate Data Analysis): using TMVA as a classifier. Description of TMVAGui.

- Multivariate Methods, by Niklaus Berger

- Statistical methods for data analysis, by L. Lista (Multivariate discriminators with TMVA)


5. Multidimensional models

• Uncorrelated products of p.d.f.s

• Using composition to build p.d.f.s with correlations

• Products of conditional and plain p.d.f.s


Building realistic models

– Multiplication: [figure: p.d.f. in x * p.d.f. in y = p.d.f. in (x,y)]

– Composition: g(x;m,s) with m(y;a0,a1) substituted for m gives g(x,y;a0,a1,s)

Possible in any PDF. No explicit support in PDF code needed


Model building – Products of uncorrelated p.d.f.s

$$H(x,y) = F(x)\cdot G(y)$$

[Figure: two component p.d.f.s (e.g. RooBMixDecay, RooPolynomial, RooHistPdf, RooArgusBG, RooGaussian) multiplied (*) into a RooProdPdf]


Uncorrelated products – Mathematics and constructors

2D: $H(x,y) = F(x)\cdot G(y)$

nD: $H(x^{\{i\}}) = \prod_i F^{\{i\}}(x^{\{i\}})$

w.factory("Gaussian::gx(x[-5,5],mx[2],sx[1])") ;

w.factory("Gaussian::gy(y[-5,5],my[-2],sy[3])") ;

w.factory("PROD::gxy(gx,gy)") ;

• Mathematical construction of products of uncorrelated p.d.f.s is straightforward

– No explicit normalization required: if the input p.d.f.s are unit normalized, the product is also unit normalized (this is true only because of the absence of correlations)

• Corresponding factory operator is PROD
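The normalization statement above can be checked in one line. Assuming F and G are each unit normalized over their own observable, the double integral of the product factorizes:

```latex
\int\!\!\int H(x,y)\,dx\,dy
  = \int\!\!\int F(x)\,G(y)\,dx\,dy
  = \left(\int F(x)\,dx\right)\left(\int G(y)\,dy\right)
  = 1 \cdot 1 = 1
```

The factorization step is where the absence of correlations is used: if G also depended on x, the integral over y could not be pulled out of the integral over x.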


How it works – event generation for uncorrelated products

[Diagram: Delegate → Generate → Merge]

• If p.d.f.s are uncorrelated, each observable can be generated separately

– Reduced dimensionality of problem (important for e.g. accept/reject sampling)

– Actual event generation delegated to component p.d.f (can e.g. use internal generator if available)

– RooProdPdf just aggregates output in single dataset
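The delegate, generate, merge steps can be sketched in plain C++ without RooFit. This is only an illustration under stated assumptions: the function name `generateUncorrelated` is hypothetical, the two components are the Gaussians of the factory example (mean 2, width 1 in x; mean -2, width 3 in y), and the [-5,5] observable ranges are ignored for brevity.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// One generated event: the pair of observables (x, y).
using Event = std::pair<double, double>;

// Hypothetical stand-in for a product of two uncorrelated Gaussians.
// Each component draws its own observable; the product only merges them.
std::vector<Event> generateUncorrelated(std::size_t nEvents, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> gx(2.0, 1.0);   // component in x
    std::normal_distribution<double> gy(-2.0, 3.0);  // component in y

    std::vector<Event> data;
    data.reserve(nEvents);
    for (std::size_t i = 0; i < nEvents; ++i) {
        const double x = gx(rng);  // delegate: x comes from the x component
        const double y = gy(rng);  // delegate: y comes from the y component
        data.emplace_back(x, y);   // merge: one row of the output dataset
    }
    return data;
}
```

Because each observable is drawn in one dimension, no two-dimensional accept/reject sampling is needed, which is the efficiency gain the slide refers to.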


Fundamental multi-dimensional p.d.fs

• It is also possible to define multi-dimensional p.d.f.s that do not arise through a product construction

– For example:

EXPR::mypdf('sqrt(x+y)*sqrt(x-y)',x,y) ;

– But usually n-dim p.d.f.s are constructed more intuitively through product constructs. Also correlations can be introduced efficiently (more on that in a moment)

• Example of fundamental 2-D B-physics p.d.f. RooBMixDecay

– Two observables: decay time (t, continuous) and mixingState (m, discrete [-1,+1])


Plotting multi-dimensional PDFs

RooPlot* xframe = x.frame() ;

data->plotOn(xframe) ;

prod->plotOn(xframe) ;

xframe->Draw() ;

c->cd(2) ;

RooPlot* yframe = y.frame() ;

data->plotOn(yframe) ;

prod->plotOn(yframe) ;

yframe->Draw() ;

$$f(x) = \int \mathrm{pdf}(x,y)\,dy$$

$$f(y) = \int \mathrm{pdf}(x,y)\,dx$$

-Plotting a dataset D(x,y) versus x represents a projection over y

-To overlay PDF(x,y), you must plot Int(dy)PDF(x,y)

-RooFit automatically takes care of this!

•RooPlot remembers dimensions of plotted datasets


Introduction to slicing

[Figure: slice in x (taken at x = x.getVal()) and range in y]

• With multidimensional p.d.f.s it is also often useful to be able to plot a slice of a p.d.f

• In RooFit

– A slice is thin

– A range is thick

• Slices are mostly useful in discrete observables

– A slice in a continuous observable has no width and usually no data with the corresponding cut (e.g. “x=5.234”)

• Ranges work for both continuous and discrete observables

– Range of discrete observable can be list of >=1 state


Plotting a slice of a dataset

// Mixing dataset defines dt,mixState

RooDataSet* data ;

// Plot the entire dataset

RooPlot* frame = dt.frame() ;

data->plotOn(frame) ;

// Plot the mixed part of the data

RooPlot* frame_mix = dt.frame() ;

data->plotOn(frame_mix,

Cut("mixState==mixState::mixed")) ;

• Use the optional cut string expression

– Works the same for binned data sets


Plotting a slice of a p.d.f

RooPlot* dtframe = dt.frame() ;

data->plotOn(dtframe,Cut("mixState==mixState::mixed")) ;

bmix.plotOn(dtframe,Slice(mixState,"mixed")) ;

dtframe->Draw() ;

For slices both data and p.d.f normalize with respect to full dataset. If fraction ‘mixed’ in above example disagrees between data and p.d.f prediction, this discrepancy will show in plot


Plotting a range of a p.d.f and a dataset

RooPlot* xframe = x.frame() ;

data->plotOn(xframe) ;

model.plotOn(xframe) ;

y.setRange("sig",-1,1) ;

RooPlot* xframe2 = x.frame() ;

data->plotOn(xframe2,CutRange("sig")) ;

model.plotOn(xframe2,ProjectionRange("sig")) ;

model(x,y) = gauss(x)*gauss(y) + poly(x)*poly(y)

Works also with >2D projections (just specify projection range on all projected observables)

Works also with multidimensional p.d.fs that have correlations


Physics example of combined range and slice plotting

// Plot projection on mB

RooPlot* mbframe = mb.frame(40) ;

data->plotOn(mbframe) ;

model.plotOn(mbframe) ;

// Plot mixed slice projection on deltat

RooPlot* dtframe = dt.frame(40) ;

data->plotOn(dtframe,

Cut("mixState==mixState::mixed")) ;

model.plotOn(dtframe,Slice(mixState,"mixed")) ;

Example setup: Argus(mB)*Decay(dt) (background) + Gauss(mB)*BMixDecay(dt) (signal)

[Plots: projection on mB; projection on dt (mixed slice)]


Plotting a range - Example

Example setup: Argus(mB)*Decay(dt) (background) + Gauss(mB)*BMixDecay(dt) (signal)

mb.setRange("signal",5.27,5.30) ;

mbSliceData->plotOn(dtframe2,

Cut("mixState==mixState::mixed"),

CutRange("signal")) ;

model.plotOn(dtframe2,Slice(mixState,"mixed"),

ProjectionRange("signal")) ;

[Plots: mB with the "signal" range marked; dt (mixed slice); dt (mixed slice && "signal" range)]


Plotting a range - Example

// Generate 80K toy MC events from p.d.f to be projected

RooDataSet *toyMC =

model.generate(RooArgSet(dt,mixState,tagFlav,mB),80000);

// Apply desired cut on toy MC data

RooDataSet* mbSliceToyMC = (RooDataSet*) toyMC->reduce("mb>5.27") ;

// Plot the p.d.f., averaging over the selected toy MC data

model.plotOn(dtframe2,Slice(mixState),ProjWData(mb,*mbSliceToyMC)) ;

• We can also plot the finite-width slice with a different technique: toy MC integration

$$\int M(x,y,z)\,dy\,dz \;\approx\; \frac{1}{N}\sum_{i \in D(y,z)} M(x,y_i,z_i)$$
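The averaging on the right-hand side of the toy MC integration formula can be sketched in a few lines of standard C++. This is a minimal illustration, not RooFit's implementation: `model` is an arbitrary hypothetical function standing in for M(x,y,z), and `YZ` and `projectWithData` are names invented for this sketch.

```cpp
#include <cstddef>
#include <vector>

// A point (y, z) of the toy sample over which the model is averaged.
struct YZ { double y; double z; };

// Hypothetical 3D function standing in for M(x,y,z); illustration only.
double model(double x, double y, double z) { return x * x + y + z; }

// Toy MC estimate of the projection at fixed x:
// the mean of M(x, y_i, z_i) over all (y_i, z_i) in the sample.
double projectWithData(double x, const std::vector<YZ>& sample) {
    if (sample.empty()) return 0.0;
    double sum = 0.0;
    for (const YZ& p : sample) sum += model(x, p.y, p.z);
    return sum / static_cast<double>(sample.size());
}
```

Any selection applied to the sample before averaging (a range, a slice, or an arbitrary cut) is automatically reflected in the projection, which is what makes the technique flexible.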


Plotting non-rectangular PDF regions

4)3()5( 22 yx ‘donut’

• Why is this interesting? Because with this technique we can trivially implement projection over arbitrarily shaped regions.

– Any cut prescription that you can think of to apply to data works

• Example: Likelihood ratio projection plot

– Common technique in rare decay analyses

– PDF typically consists of an N-dimensional event selection PDF, where N is large (e.g. 6)

– Projection of data & PDF in any of the N dimensions doesn’t show a significant excess of signal events

– To demonstrate purity of selected signal, plot data distribution (with overlaid PDF) in one dimension, while selecting events with a cut on the likelihood ratio of signal and background in the remaining N-1 dimensions


Likelihood ratio plots

$$LR(x,y,z) = \frac{f\,S(x,y,z)}{f\,S(x,y,z) + (1-f)\,B(x,y,z)}$$

• Integrate over x

$$LR(y,z) = \frac{\int f\,S(x,y,z)\,dx}{\int \left[ f\,S(x,y,z) + (1-f)\,B(x,y,z) \right] dx}$$

• Plot LR vs (y,z)

• Idea: use information on S/(S+B) ratio in projected observables to define a cut

• Example: generalize previous toy model to 3 dimensions

• Express information on S/(S+B) ratio of model in terms of integrals over model components


Likelihood ratio plots

• Decide on s/(s+b) purity contour of LR(y,z)

– Example s/(s+b) > 50%

• Plot both data and model with corresponding cut.

– For data: calculate LR(y,z) for each event, plot only events with LR>0.5

– For model: using Monte Carlo integration technique:

$$\int_{LR(y,z)>0.5} M(x,y,z)\,dy\,dz \;\approx\; \frac{1}{N}\sum_{i \in D(y,z)} M(x,y_i,z_i)$$

where D(y,z) is a dataset with values of (y,z) sampled from the p.d.f. and filtered for events that meet LR(y,z)>0.5

[Plots: all events; only events with LR(y,z)>0.5]
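The data-side selection can be sketched as follows in standard C++. The sketch assumes that the x-integrated signal and background densities at each event's (y,z) are already available as numbers; the struct `Event` and the functions `likelihoodRatio` and `selectByLR` are hypothetical names, not RooFit API.

```cpp
#include <vector>

// Signal purity fsig*S / (fsig*S + (1-fsig)*B) for given density values S and B.
double likelihoodRatio(double fsig, double s, double b) {
    const double num = fsig * s;
    const double den = num + (1.0 - fsig) * b;
    return den > 0.0 ? num / den : 0.0;
}

// Per-event inputs: projected signal and background densities at (y_i, z_i).
struct Event { double sigDensity; double bkgDensity; };

// Keep only the events whose likelihood ratio exceeds the chosen cut.
std::vector<Event> selectByLR(const std::vector<Event>& events,
                              double fsig, double cut) {
    std::vector<Event> selected;
    for (const Event& e : events) {
        if (likelihoodRatio(fsig, e.sigDensity, e.bkgDensity) > cut)
            selected.push_back(e);
    }
    return selected;
}
```

Applying the same `selectByLR` step to a toy sample generated from the model gives the filtered dataset D(y,z) used in the Monte Carlo integration for the model curve.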


Likelihood ratio plot on model with correlations


Likelihood ratio plots – Coded example

// Construct likelihood ratio in projection on (y,z)

w.factory("expr::LR('fsig*psig/ptot',fsig,"

"PROJ::psig(sig,x),PROJ::ptot(model,x))") ;

// Generate toy dataset for MC integration over region with LR>68%

RooDataSet* tmpdata = model.generate(RooArgSet(x,y,z),10000) ;

tmpdata->addColumn(*w.function("LR")) ;

RooDataSet* projdata = (RooDataSet*) tmpdata->reduce(Cut("LR>0.68")) ;

// Add LR to observed data so we can cut on it

data->addColumn(*w.function("LR")) ;

RooDataSet* seldata = (RooDataSet*) data->reduce(Cut("LR>0.68")) ;

// Make plot for data and pdf

RooPlot* frame3 = x.frame(Title("Projection with LR(y,z)>68%")) ;

seldata->plotOn(frame3) ;

model.plotOn(frame3,ProjWData(*projdata)) ;

$$LR(y,z) = \frac{\int f\,S(x,y,z)\,dx}{\int \left[ f\,S(x,y,z) + (1-f)\,B(x,y,z) \right] dx}$$


Plotting in more than 2,3 dimensions

TH2* ph2 = (TH2*) pdf.createHistogram("ph2",x,YVar(y)) ;

TH2* dh2 = (TH2*) data.createHistogram("dg2",x,Binning(10),

YVar(y,Binning(10))) ;

ph2->Draw("SURF") ;

dh2->Draw("LEGO") ;

• No equivalent of RooPlot for >1 dimensions

– Usually >1D plots are not overlaid anyway

• Easy to use createHistogram() methods provided in both RooAbsData and RooAbsPdf to fill ROOT 2D,3D histograms


Building models – Introducing correlations

$$f(x;p) \;\Rightarrow\; f(x, p(y,q)) = f(x,y;q)$$

• Easiest way to do this is

– start with a 1-dim p.d.f. and change one of its parameters into a function that depends on another observable

– Natural way to think about it

• Example problem

– Observable is reconstructed mass M of some object.

– Fitting Gaussian g(M,mean,sigma) plus some background to dataset D(M)

– But reconstructed mass has bias depending on some other observable X

– Rewrite fit function as g(M,meanCorr(mtrue,X,alpha),sigma), where meanCorr is an (empirical) function that corrects for the bias depending on X


Introducing correlations through composition

$$f(x;p) \;\Rightarrow\; f(x, p(y,q)) = f(x,y;q)$$

w.factory("expr::mean('a*y+b',y[-10,10],a[0.7],b[0.3])") ;

w.factory("Gaussian::g(x[-10,10],mean,sigma[3])") ;

• RooFit pdf building blocks do not require variables as input, just real-valued functions

– Can substitute any variable with a function expression in parameters and/or observables

– Example: Gaussian with shifting mean

– No assumption made in function on a,b,x,y being observables or parameters, any combination will work


What does the example p.d.f look like?

Projection on Y

Projection on X

• Use example model with x,y as observables

• Note flat distribution in y. Unlikely to describe data, solutions:

1. Use as conditional p.d.f g(x|y,a,b)

2. Use in conditional form multiplied by another pdf in y: g(x|y)*h(y)


Conditional p.d.f.s – Formulation and construction

$$F(x|y;p) = \frac{f(x,y,p)}{\int f(x,y,p)\,dx}$$

• Mathematical formulation of a conditional p.d.f

– A conditional p.d.f is not normalized w.r.t its conditional observables

– Note that denominator in above expression depends on y and is thus in general different for each event

• Constructing a conditional p.d.f in RooFit

– Any RooFit p.d.f can be used as a conditional p.d.f as objects have no internal notion of distinction between parameters, observables and conditional observables

– Observables that should be used as conditional observables have to be specified in use context (generation, plotting, fitting etc…)
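The y-dependence of the normalization can be made concrete with a small numerical sketch in standard C++. It assumes the shifting-mean Gaussian used earlier (mean = a*y + b with a = 0.7, b = 0.3, sigma = 3, x in [-10,10]) and integrates over x with the trapezoidal rule; the function names are hypothetical and this is not how RooFit computes its integrals.

```cpp
#include <cmath>

// Unnormalized Gaussian in x whose mean depends on y: mean = a*y + b.
double shape(double x, double y, double a, double b, double sigma) {
    const double d = (x - (a * y + b)) / sigma;
    return std::exp(-0.5 * d * d);
}

// Normalization integral over x in [xlo, xhi] at fixed y (trapezoidal rule).
double normOverX(double y, double a, double b, double sigma,
                 double xlo, double xhi, int nSteps = 2000) {
    const double h = (xhi - xlo) / nSteps;
    double sum = 0.5 * (shape(xlo, y, a, b, sigma) + shape(xhi, y, a, b, sigma));
    for (int i = 1; i < nSteps; ++i) sum += shape(xlo + i * h, y, a, b, sigma);
    return sum * h;
}

// Conditional density F(x|y): the shape divided by its y-dependent integral over x.
double conditional(double x, double y, double a, double b, double sigma,
                   double xlo, double xhi) {
    return shape(x, y, a, b, sigma) / normOverX(y, a, b, sigma, xlo, xhi);
}
```

With these parameters the integral at y = 0 is about 7.5, while at y = 10 the Gaussian is partly outside the x range and the integral drops to about 6.1, so the denominator really does differ from event to event.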


Method 1 – Using a conditional p.d.f – fitting and plotting

pdf.fitTo(data,ConditionalObservables(y))

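$$F(x|y) = \frac{f(x,y)}{\int f(x,y)\,dx}$$

Projection of a plain p.d.f. (integrate over y):

$$P_p(x) = \frac{\int p(x,y)\,dy}{\int\!\!\int p(x,y)\,dx\,dy}$$

Projection of a conditional p.d.f. (sum over all y_i in dataset D):

$$P_p(x) = \frac{1}{N}\sum_{i=1,N}^{D} \frac{p(x,y_i)}{\int p(x,y_i)\,dx}$$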

• For fitting, indicate in fitTo() call what the conditional observables are

– You may notice a performance penalty if the normalization integral of the p.d.f needs to be calculated numerically. For a conditional p.d.f it must be evaluated again for each event

• Plotting: You cannot project a conditional F(x|y) on x without external information on the distribution of y

– Substitute integration with averaging over y values in data


How it works – event generation with conditional p.d.f.s

• Just like plotting, event generation of conditional p.d.f.s requires external input on the conditional observables

– Given an external input dataset P(dt)

– For each event in P, set the value of dt in F(t|dt) to dt_i and generate one event for observable t from F(t|dt_i)

– Store both t_i and dt_i in the output dataset
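The generation loop can be sketched in standard C++ as below. As a simplifying assumption, F(t|dt) is taken to be a pure Gaussian resolution of width dt centred at zero (no decay term); `generateConditional` is a hypothetical name, and every input dt_i must be strictly positive.

```cpp
#include <random>
#include <utility>
#include <vector>

// For each externally supplied per-event error dt_i, draw one t from F(t|dt_i)
// and store the pair (t_i, dt_i). F(t|dt) is assumed here to be a Gaussian of
// width dt centred at 0, for illustration only.
std::vector<std::pair<double, double>>
generateConditional(const std::vector<double>& dtValues, unsigned seed) {
    std::mt19937 rng(seed);
    std::vector<std::pair<double, double>> out;
    out.reserve(dtValues.size());
    for (double dti : dtValues) {
        std::normal_distribution<double> resolution(0.0, dti);  // F(t|dt_i)
        out.emplace_back(resolution(rng), dti);                 // (t_i, dt_i)
    }
    return out;
}
```

The output has exactly one event per input value, so the distribution of dt in the generated sample is by construction the one of the external dataset.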


Physics example with conditional p.d.f.s

$$F(t) = D(t;\tau) \otimes R(t; m, \sigma)$$

$$F(t|dt) = D(t;\tau) \otimes R(t; m, \sigma \cdot dt)$$

• Want to fit decay time distribution of B0 mesons (exponential) convoluted with Gaussian resolution

• However, resolution on decay time varies event by event (e.g. more or fewer tracks available).

– We have in the data an error estimate dt for each measurement from the decay vertex fitter (“per-event error”)

– Incorporate this information into this physics model

– Resolution in physics model is adjusted for each event to expected error.

– Overall scale factor σ can account for incorrect vertex error estimates (i.e. if the fitted σ>1 then dt was an underestimate of the true error)

– The physics p.d.f must be used as a conditional p.d.f because it gives no sensible prediction on the distribution of the per-event errors


Physics example with conditional p.d.f.s

$$F(t|dt) = D(t;\tau) \otimes R(t; m, \sigma \cdot dt)$$

[Plots: shape of F(t|dt) for small dt and for large dt]

// Plotting of decay(t|dterr)

RooPlot* frame2 = dt.frame() ;

data->plotOn(frame2) ;

decay_gm1.plotOn(frame2,ProjWData(*data)) ;

$$P_p(x) = \frac{1}{N}\sum_{i=1,N}^{D} \frac{p(x,y_i)}{\int p(x,y_i)\,dx}$$

Note that projecting over large datasets can be slow. You can speed this up by projecting with a binned copy of the projection data

• Some illustrations of decay model with per-event errors

– Shape of F(t|dt) for several values of dt

• Plot of D(t) and F(t|dt) projected over dt


Method 2 – Building products with conditional pdfs

• Use of conditional pdf in fitting, plotting, event generation has some practical drawbacks

– Need external dataset with distribution in conditional observable in all operations

• But there is also a fundamental issue

– If your model has both a signal and a background component, the model assumes that the distribution of the conditional observable (e.g. the per-event error) is the same for signal and background

– This may not be a valid assumption (‘Punzi effect’)

– Way out: Construct a product F(x|y)*G(y) separately for signal and background


Example with product of conditional and plain p.d.f.

// I - Use g as conditional pdf g(x|y)

w::g.fitTo(data,ConditionalObservables(w::y)) ;

// II - Construct product with another pdf in y

w.factory("Gaussian::h(y,0,2)") ;

w.factory("PROD::gxy(g|y,h)") ;

[Figure: gx(x|y) * gy(y) = model(x,y), with projection on x given by $\int g_x(x|y)\,g_y(y)\,dy$]


Example with product of conditional and plain p.d.f.

• Following the ‘conditional product’ formalism you can now choose different distributions for the conditional observable for signal and background, e.g.

$$F(t,dt) = S(t|dt)\cdot s(dt) + B(t|dt)\cdot b(dt)$$

• At this point F(t,dt) is a plain pdf: fitting, plotting and event generation work ‘as usual’ without external input

• You may want to use an empirical pdf for s(dt) or b(dt) if these distributions are difficult to model

– Histogram based pdf (RooHistPdf)

– Kernel estimation pdf (RooKeysPdf), see next slide


Special pdfs – Kernel estimation model

[Figure: sample of events; Gaussian pdf for each event; summed pdf for all events]

Adaptive kernel: width of Gaussian depends on local event density

w.import(myData,Rename("myData")) ;

w.factory("KeysPdf::k(x,myData)") ;

• Kernel estimation model

– Construct smooth pdf from unbinned data, using kernel estimation technique

• Example

• Also available for n-D data
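The idea of summing one Gaussian per event can be sketched in standard C++. Note the simplification: this sketch uses a single fixed kernel width for all events, whereas the adaptive kernel described above varies the width with the local event density. The function name `kernelEstimate` is hypothetical and unrelated to the RooKeysPdf interface.

```cpp
#include <cmath>
#include <vector>

// Fixed-width Gaussian kernel estimate at x: the average over all events of a
// unit-normalized Gaussian of the given width centred on each event.
double kernelEstimate(double x, const std::vector<double>& events, double width) {
    if (events.empty() || width <= 0.0) return 0.0;
    const double pi = std::acos(-1.0);
    const double norm = 1.0 / (width * std::sqrt(2.0 * pi));
    double sum = 0.0;
    for (double xi : events) {
        const double d = (x - xi) / width;
        sum += norm * std::exp(-0.5 * d * d);
    }
    return sum / static_cast<double>(events.size());
}
```

Since each kernel integrates to one and the sum is divided by the number of events, the estimate is itself unit normalized over the full real line.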


6. Fit validation, Toy MC studies

• Goodness-of-fit, χ²

• Toy Monte Carlo studies for fit validation


How do you know if your fit was ‘good’?

• Goodness-of-fit is a broad issue in statistics in general; here we just focus on a few specific tools implemented in RooFit

• For one-dimensional fits, a χ² is usually the right thing to do

– Some tools are implemented in RooPlot to calculate the χ²/ndf of a curve w.r.t. data

double chi2 = frame->chisquare(nFloatParam) ;

– Tools also exist to plot residual and pull distributions from curve and histogram in a RooPlot

frame->pullHist() ;

frame->residHist() ;
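What a χ²/ndf and a per-bin pull amount to can be sketched in standard C++. This is only an illustration of the arithmetic, not RooPlot's implementation: the functions `pull` and `reducedChi2` are hypothetical names, the three input vectors are assumed to have equal length, and every error is assumed to be non-zero.

```cpp
#include <cstddef>
#include <vector>

// Pull of one bin: (data - curve) divided by the error on the data point.
double pull(double data, double curve, double error) {
    return (data - curve) / error;
}

// chi2/ndf, with ndf = (number of bins) - (number of floating parameters).
double reducedChi2(const std::vector<double>& data,
                   const std::vector<double>& curve,
                   const std::vector<double>& error,
                   int nFloatParam) {
    double chi2 = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        const double p = pull(data[i], curve[i], error[i]);
        chi2 += p * p;  // chi2 is the sum of squared pulls
    }
    const int ndf = static_cast<int>(data.size()) - nFloatParam;
    return ndf > 0 ? chi2 / ndf : 0.0;
}
```

For example, three bins with pulls 1, -1 and 0 and one floating parameter give χ² = 2 with ndf = 2, so χ²/ndf = 1.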


GOF in >1D, other aspects of fit validity

• No special tools for >1 dimensional goodness-of-fit

– A χ² usually doesn’t work because empty bins proliferate with dimensions

– But if you have ideas you’d like to try, there exist generic base classes for implementation that provide the same level of computational optimization and parallelization as is done for likelihoods (RooAbsOptTestStatistic)

• But you can study many other aspects of your fit validity

– Is your fit unbiased?

– Does it (often) have convergence problems?

• You can answer these with a toy Monte Carlo study

– I.e. generate 10000 samples from your p.d.f., fit them all and collect and analyze the statistics of these 10000 fits.

– The RooMCStudy class helps out with the logistics


Advanced features – Task automation

[Diagram: Input model → Generate toy MC → Fit model (repeat N times) → Accumulate fit statistics → Distribution of parameter values, parameter errors, parameter pulls]

// Instantiate MC study manager

RooMCStudy mgr(inputModel) ;

// Generate and fit 100 samples of 1000 events

mgr.generateAndFit(100,1000) ;

// Plot distribution of sigma parameter

mgr.plotParam(sigma)->Draw()

• Support for routine task automation, e.g. goodness-of-fit study


How to efficiently generate multiple sets of ToyMC?

• Use RooMCStudy class to manage generation and fitting

• Generating features

– Generator overhead only incurred once: efficient for a large number of small samples

– Optional Poisson distribution for #events of generated experiments

– Optional automatic creation of ASCII data files

• Fitting

– Fit with generator PDF or different PDF

– Fit results (floating parameters & NLL) automatically collected in summary dataset

• Plotting

– Automated plotting for distribution of parameters, parameter errors, pulls and NLL

• Add-in modules for optional modifications of procedure

– Concrete tools for variation of generation parameters, calculation of likelihood ratios for each experiment

– Easy to write your own. You can intervene at any stage and offer proprietary data to be aggregated with fit results


A RooMCStudy example

// Setup PDF

RooRealVar x("x","x",-5,15) ;

RooRealVar mean("mean","mean of gaussian",-1,-10,10) ;

RooRealVar sigma("sigma","width of gaussian",4,0,20) ;

RooGaussian gauss("gauss","gaussian PDF",x,mean,sigma) ;

// Create manager

RooMCStudy mgr(gauss,gauss,x,"","mhv") ;

// Generate and fit 1000 experiments of 100 events each

mgr.generateAndFit(1000,100) ;

RooMCStudy::run: Generating and fitting sample 999

RooMCStudy::run: Generating and fitting sample 998

RooMCStudy::run: Generating and fitting sample 997

Constructor arguments, in order: generator PDF, fitting PDF, observables, generator options, fitting options

• Generating and fitting a simple PDF


A RooMCStudy example

// Plot the distribution of the value

RooPlot* mframe = mean.frame(-2,0) ;

mgr.plotParamOn(mframe) ;

mframe->Draw() ;

// Plot the distribution of the error

RooPlot* meframe = mgr.plotError(mean,0.,0.1) ;

meframe->Draw() ;

// Plot the distribution of the pull (the final kTRUE adds a Gaussian fit)

RooPlot* mpframe = mgr.plotPull(mean,-3,3,40,kTRUE) ;

mpframe->Draw() ;

• Plot the distribution of the value, error and pull of mean


A RooMCStudy example

// Plot the distribution of the NLL

mgr.plotNLL(mframe) ;

mframe->Draw() ;

• Plot the distribution of –log(L)

• NB: likelihood distributions cannot be used to deduce goodness-of-fit information!


A RooMCStudy example

mgr.fitParDataSet().get(10)->Print("v") ;

RooArgSet:::

1) RooRealVar::mean : 0.14814 +/- 0.191 L(-10 - 10)

2) RooRealVar::sigma : 4.0619 +/- 0.143 L(0 - 20)

3) RooRealVar::NLL : 2585.1 C

4) RooRealVar::meanerr : 0.19064 C

5) RooRealVar::meanpull : 0.77704 C

6) RooRealVar::sigmaerr : 0.14338 C

7) RooRealVar::sigmapull : 0.43199 C

TH2* h = mean.createHistogram("mean vs sigma",sigma) ;

mgr.fitParDataSet().fillHistogram(h,RooArgList(mean,sigma)) ;

h->Draw("BOX") ;

Pulls and errors have separate entries for easy access and plotting

• For other uses, use summarized fit results in RooDataSet form


Fit Validation Study – Practical example

$$F(m; N_{sig}, N_{bkg}, p_S, p_B) = N_{sig}\cdot G(m; p_S) + N_{bkg}\cdot A(m; p_B)$$

[Plot: distribution of Nsig(fit), with Nsig(generated) marked]

• Example fit model in 1-D (B mass)

– Signal component is Gaussian centered at B mass

– Background component is Argus function (models phase space near kinematic limit)

• Fit parameter under study: Nsig

– Results of simulation study: 1000 experiments with NSIG(gen)=100, NBKG(gen)=200

– Distribution of Nsig(fit)

– This particular fit looks unbiased…


Fit Validation Study – The pull distribution

• What about the validity of the error?

– Distribution of error from simulated experiments is difficult to interpret…

– We don’t have equivalent of Nsig(generated) for the error

• Solution: look at the pull distribution

– Definition:

$$\mathrm{pull}(N_{sig}) = \frac{N_{sig}^{fit} - N_{sig}^{true}}{\sigma_N^{fit}}$$

– Properties of pull:

• Mean is 0 if there is no bias

• Width is 1 if error is correct

– In this example: no bias, correct error within statistical precision of study

[Plots: distribution of σ(Nsig); distribution of pull(Nsig)]
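The pull properties can be checked with a toy study written in standard C++. This is a simplified stand-in for a generate-and-fit loop, under stated assumptions: the "fit" is just the sample mean of a Gaussian, its error is the sample standard deviation divided by sqrt(N), and `PullSummary` and `toyPullStudy` are hypothetical names. It requires at least two events per experiment.

```cpp
#include <cmath>
#include <random>

// Mean and width of the pull distribution accumulated over all toy experiments.
struct PullSummary { double mean; double width; };

PullSummary toyPullStudy(int nExperiments, int nEvents,
                         double trueMean, double trueSigma, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> gen(trueMean, trueSigma);
    double sumPull = 0.0, sumPull2 = 0.0;
    for (int e = 0; e < nExperiments; ++e) {
        // Generate one toy sample and accumulate its first two moments.
        double sum = 0.0, sum2 = 0.0;
        for (int i = 0; i < nEvents; ++i) {
            const double x = gen(rng);
            sum += x;
            sum2 += x * x;
        }
        // "Fit": estimate of the mean and of its error from the sample itself.
        const double fitMean = sum / nEvents;
        const double var = (sum2 - nEvents * fitMean * fitMean) / (nEvents - 1);
        const double fitError = std::sqrt(var / nEvents);
        // Pull = (fitted - true) / fitted error.
        const double pull = (fitMean - trueMean) / fitError;
        sumPull += pull;
        sumPull2 += pull * pull;
    }
    const double mean = sumPull / nExperiments;
    const double width = std::sqrt(sumPull2 / nExperiments - mean * mean);
    return {mean, width};
}
```

For an unbiased estimator with a correct error estimate, as in this Gaussian case, the returned mean should be compatible with 0 and the width with 1, within the statistical precision set by the number of experiments.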


Fit Validation Study – Low statistics example

• Special care should be taken when fitting small data samples

– Also if fitting for small signal component in large sample

• Possible causes of trouble

– χ² estimators may become approximate as Gaussian approximation of Poisson statistics becomes inaccurate

– ML estimators may no longer be efficient; error estimate from 2nd derivative may become inaccurate

– Bias term proportional to 1/N of ML and χ² estimators may no longer be small compared to 1/sqrt(N)

• In general, absence of bias, correctness of error can not be assumed. How to proceed?

– Use unbinned ML fits only – most robust at low statistics

– Explicitly verify the validity of your fit


Demonstration of fit bias at low N – pull distributions

[Plots for NSIG(gen)=20, NBKG(gen)=200: NSIG(fit) with NSIG(gen) marked; σ(NSIG); pull(NSIG)]

Distributions become asymmetric at low statistics

Pull mean is ~2σ away from 0: fit is positively biased!

• Low statistics example:

– Scenario as before but now with 200 bkg events and only 20 signal events (instead of 100)

• Results of simulation study

• Absence of bias, correct error at low statistics not obvious


New developments for automated studies

• A new alternative framework is being put in place to replace class RooMCStudy.

– Class RooStudyManager manages logistics of repeated studies, but does not implement content of study.

– Abstract concept of study interfaced through class RooAbsStudy

– Class RooGenFitStudy manages implementation of ‘generate-and-fit’ style studies (functionality of RooMCStudy)

• Greater flexibility in choice of study (you can put in anything you want)

• Support for multiple backend implementations

– Inline calculation (as done in RooMCStudy)

– Parallelized execution through PROOF (lite)

– Almost complete automation of support for batch submission

– Just need to change one line of your macro to change back-end


Demo of parallelization with PROOF-lite

RooStudyManager mcs(*w,gfs) ;

mcs.run(1000) ; // inline running

mcs.runProof(1000,"") ; // empty string is PROOF-lite

mcs.prepareBatchInput("default",1000,kTRUE) ;

• Example – Factor 8 speed up on a dual-quad core box.

– Works with the out-of-the-box ROOT distribution

– Also: graceful early termination when the user presses ‘Stop’

• Much larger gains can be made with ‘real’ PROOF farms

7. Constructing joint models

• Using discrete variables to classify data

• Simultaneous fits on multiple datasets


Datasets and discrete observables

Dataset A: X = 5.0, 3.7, 1.2, 4.3

Dataset B: X = 5.0, 3.7, 1.2

Dataset A+B:

  X     source
  5.0   A
  3.7   A
  1.2   A
  4.3   A
  5.0   B
  3.7   B
  1.2   B

• Discrete observables play an important role in management of datasets

– Useful to classify ‘sub datasets’ inside datasets

– Can collapse multiple, logically separate datasets into a single dataset by adding them and labeling the source with a discrete observable

– Allows operations such as simultaneous fits to be expressed as operations on a single dataset


Discrete variables in RooFit – RooCategory

// Define a category with explicitly numbered states
w.factory("b0flav[B0=-1,B0bar=1]") ;

// Define categories with labels only
w.factory("tagCat[Lepton,Kaon,NT1,NT2]") ;
w.factory("sample[CPV,BMixing]") ;

• Properties of RooCategory variables

– Finite set of named states → self-documenting

– Optional integer code associated with each state

• Used for classification of data, or to describe occasional discrete fundamental observable (e.g. B0 flavor)


Datasets and discrete observables – part 2

// Joint dataset: 'source' is the discrete observable labelling the input
RooDataSet simdata("simdata","simdata",x,Index(source),
                   Import("A",*dataA),Import("B",*dataB)) ;

• Example of constructing a joint dataset from 2 inputs

• But can also derive classification from info within dataset

– E.g. (10<x<20 = “signal”, 0<x<10 | 20<x<30 = “sideband”)

– Encode classification using real → discrete mapping functions

A universal real → discrete mapping function

// Mass variable
RooRealVar m("m","mass",0,10.) ;

// Define threshold category; "Background" is the default state
RooThresholdCategory region("region","Region of M",m,"Background") ;

// Define region boundaries
region.addThreshold(9.0,"SideBand") ;
region.addThreshold(7.9,"Signal") ;
region.addThreshold(6.1,"SideBand") ;
region.addThreshold(5.0,"Background") ;

[Figure: mass axis divided into Background, SideBand and Signal regions]

• Class RooThresholdCategory maps ranges of input RooRealVar to states of a RooCategory
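The range-to-state lookup described above can be sketched without RooFit. This is a minimal stand-in in plain C++, assuming that each threshold acts as the upper edge of its range and that values at or above the highest threshold fall back to the default state. `ThresholdMap` and `makeRegionMap` are hypothetical names, not RooFit classes.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Illustrative stand-in for a threshold lookup (not the RooFit class).
// Assumption: a value x belongs to the state of the lowest threshold
// that is strictly greater than x; otherwise the default state applies.
class ThresholdMap {
public:
    explicit ThresholdMap(std::string defaultState)
        : defaultState_(std::move(defaultState)) {}

    // Register 'state' for values below 'upperLimit'.
    void addThreshold(double upperLimit, const std::string& state) {
        thresholds_[upperLimit] = state;
    }

    // upper_bound returns the first threshold strictly above x.
    std::string lookup(double x) const {
        auto it = thresholds_.upper_bound(x);
        return it == thresholds_.end() ? defaultState_ : it->second;
    }

private:
    std::map<double, std::string> thresholds_;
    std::string defaultState_;
};

// Same boundaries as in the slide example.
ThresholdMap makeRegionMap() {
    ThresholdMap region("Background");
    region.addThreshold(9.0, "SideBand");
    region.addThreshold(7.9, "Signal");
    region.addThreshold(6.1, "SideBand");
    region.addThreshold(5.0, "Background");
    return region;
}
```

Under this reading, a mass of 7.0 maps to "Signal", 5.5 and 8.5 map to "SideBand", and everything below 5.0 or at or above 9.0 maps to "Background".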


Discrete multiplication function

// Define 'product' of tag and flav
RooSuperCategory prod("prod","prod",RooArgSet(tag,flav)) ;

flav: B0, B0bar

tag: Lepton, Kaon, NT1, NT2

prod = flav × tag:

  {B0;Lepton}   {B0bar;Lepton}
  {B0;Kaon}     {B0bar;Kaon}
  {B0;NT1}      {B0bar;NT1}
  {B0;NT2}      {B0bar;NT2}

• RooSuperCategory/RooMultiCategory provides category multiplication


Discrete → Discrete mapping function

// Define input category (state names chosen to match the wildcard rule below)
RooCategory tagCat("tagCat","Tagging category") ;
tagCat.defineType("Lepton") ;
tagCat.defineType("Kaon") ;
tagCat.defineType("NT1") ;
tagCat.defineType("NT2") ;

// Create mapped category
RooMappedCategory tagType("tagType","type",tagCat) ;

// Add mapping rules (wildcard expressions are allowed)
tagType.map("Lepton","CutBased") ;
tagType.map("Kaon","CutBased") ;
tagType.map("NT*","NeuralNet") ;

  tagCat    tagType
  Lepton    CutBased
  Kaon      CutBased
  NT1       NeuralNet
  NT2       NeuralNet

• RooMappedCategory provides cat → cat mapping


Exploring discrete data

// Tabulate contents of dataset by category state
RooTable* table = data->table(b0flav) ;
table->Print() ;

  Table b0flav : aData
  +-------+------+
  |    B0 | 4949 |
  | B0bar | 5051 |
  +-------+------+

// Extract contents by label
Double_t nB0 = table->get("B0") ;

// Extract contents fraction by label
Double_t b0Frac = table->getFrac("B0") ;

// Tabulate contents of selected part of dataset
data->table(tagCat,"x>8.23")->Print() ;

  Table tagCat : aData(x>8.23)
  +-------------+-----+
  |      Lepton | 668 |
  |        Kaon | 717 |
  | NetTagger-1 | 632 |
  | NetTagger-2 | 616 |
  +-------------+-----+

• Just as real variables of a dataset can be plotted, discrete variables can be tabulated


Exploring discrete data

// Tabulate RooSuperCategory states
data->table(b0Xtcat)->Print() ;

  Table b0Xtcat : aData
  +---------------------+------+
  |         {B0;Lepton} | 1226 |
  |      {B0bar;Lepton} | 1306 |
  |           {B0;Kaon} | 1287 |
  |        {B0bar;Kaon} | 1270 |
  |    {B0;NetTagger-1} | 1213 |
  | {B0bar;NetTagger-1} | 1261 |
  |    {B0;NetTagger-2} | 1223 |
  | {B0bar;NetTagger-2} | 1214 |
  +---------------------+------+

// Tabulate RooMappedCategory states
data->table(tcatType)->Print() ;

  Table tcatType : aData
  +----------------+------+
  |        Unknown |    0 |
  |      Cut based | 5089 |
  | Neural Network | 4911 |
  +----------------+------+

• Discrete functions built from categories in a dataset can be tabulated likewise


Fitting multiple datasets simultaneously

• Simultaneous fitting is an efficient way to incorporate information from a control sample into a signal sample

• Example problem: search for a rare decay

– Signal dataset has a small number of entries.

– Statistical uncertainty on the shape in the fit contributes significantly to the uncertainty on the fitted number of signal events

– However, the shape of the signal can be constrained from a control sample (e.g. another decay with similar properties that is not rare), so there is no need to rely on simulations


Fitting multiple datasets simultaneously

• Fit to control sample yields accurate information on shape of signal

• Q: What is the most practical way to combine the shape measurement on the control sample with the measurement of signal on the physics sample of interest?

• A: Perform a simultaneous fit

– Automatic propagation of errors & correlations

– Combined measurement (i.e. the error will reflect contributions from both the physics sample and the control sample)


Discrete observable as data subset classifier

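$-\log L = -\sum_{i=1}^{n} \log \mathrm{PDF}_A(D_A^{i}) \; - \sum_{i=1}^{m} \log \mathrm{PDF}_B(D_B^{i})$

[Figure: ‘SIG’ and ‘CTL’ samples with their fits, and the combined -log(L)]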

• Likelihood level definition of a simultaneous fit

• Minimize -logL(a,b,c)= -logL(a,b)+ -logL(b,c)

– Errors and correlations on the common parameter b are automatically propagated

Discrete observable as data subset classifier

• Likelihood level definition of a simultaneous fit

$-\log L = -\sum_{i=1}^{n} \log \mathrm{PDF}_A(D_A^{i}) \; - \sum_{i=1}^{m} \log \mathrm{PDF}_B(D_B^{i})$

• PDF level definition of a simultaneous fit

RooSimultaneous implements a ‘switch’ PDF:

case (indexCat) {

A: return pdfA ;

B: return pdfB ;

}

The likelihood of the switch PDF with a composite dataset automatically constructs the sum of likelihoods above:

$-\log L = -\sum_{i=1}^{n+m} \log \mathrm{simPDF}(D_{A+B}^{i})$


Practical fitting – Simultaneous fit technique

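[Figure: signal sample Dsig(x) with model Fsig(x;a,b), and control sample Dctl(x) with model Fctl(x;b,c)]

• Given data Dsig(x) and model Fsig(x;a,b), and data Dctl(x) and model Fctl(x;b,c)

– Construct –log[Lsig(a,b)] and –log[Lctl(b,c)] and minimize their sum, –log[L(a,b,c)] = –log[Lsig(a,b)] + –log[Lctl(b,c)]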


Constructing joint pdfs

// Pdfs for channels 'A' and 'B'
w.factory("Gaussian::pdfA(x[-10,10],mean[-10,10],sigma[3])") ;
w.factory("Uniform::pdfB(x)") ;

// Create discrete observable to label channels
w.factory("index[A,B]") ;

// Create joint pdf
w.factory("SIMUL::joint(index,A=pdfA,B=pdfB)") ;

// Create joint dataset (dataA and dataB are assumed to be filled elsewhere)
RooDataSet *dataA, *dataB ;
RooDataSet dataAB("dataAB","dataAB",w::x,Index(w::index),
                  Import("A",*dataA),Import("B",*dataB)) ;

• Operator class SIMUL to construct joint models at the pdf level

• Can also construct joint datasets


Building simultaneous fits in RooFit

// Signal sample pdf
w.factory("Gaussian::sig(x[-10,10],mean[0,-10,10],sigma[3,2,4])") ;
w.factory("Uniform::bkg(x)") ;
w.factory("SUM::model(Nsig[800,0,1000]*sig,Nbkg[0,1000]*bkg)") ;

// Control sample pdf (reuses x, mean and sigma, so the signal shape is shared)
w.factory("Gaussian::sig_control(x,mean,sigma)") ;
w.factory("Chebychev::bkg_control(x,a0[1])") ;
w.factory("SUM::model_control(Nsig_control[500,0,10000]*sig_control,"
          "Nbkg_control[500,0,10000]*bkg_control)") ;

// Joint pdf construction
w.factory("SIMUL::model_sim(index[sig,control],"
          "sig=model,control=model_control)") ;

// Joint data construction
RooDataSet simdata("simdata","simdata",w::x,Index(w::index),
                   Import("sig",*data),Import("control",*data_control)) ;

// Joint fit
RooFitResult* rs = w::model_sim.fitTo(simdata,Save()) ;

• Code that constructs the example shown two slides back


Constructing joint likelihood

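// Joint likelihood created from the joint pdf
RooAbsReal* nllJoint = w::joint.createNLL(dataAB) ;

// Alternative: create the component likelihoods first, then add them
// (the objects are renamed here so that the factory expression can find them)
RooAbsReal* nllA = w::pdfA.createNLL(*dataA) ; nllA->SetName("nllA") ; w.import(*nllA) ;
RooAbsReal* nllB = w::pdfB.createNLL(*dataB) ; nllB->SetName("nllB") ; w.import(*nllB) ;
w.factory("sum::nllJoint(nllA,nllB)") ;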

• When you have a simultaneous pdf you can create a joint likelihood from the joint pdf

• Also possible to make likelihood functions of the components first and then add them

• Likelihood constructed either way is the same.

• Minimization of joint likelihood == Joint fit
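The equivalence of the two constructions can be illustrated without RooFit. This is a minimal sketch in plain C++, assuming Gaussian models for both channels with separate means and a shared width. The function and type names (`gaussNll`, `sampleNll`, `LabelledEvent`, `jointNll`, `combine`) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// -log of a normalised Gaussian density at x.
double gaussNll(double x, double mean, double sigma) {
    const double pi = 3.14159265358979323846;
    const double z = (x - mean) / sigma;
    return 0.5 * z * z + std::log(sigma * std::sqrt(2.0 * pi));
}

// Negative log-likelihood of one sample under a Gaussian model.
double sampleNll(const std::vector<double>& data, double mean, double sigma) {
    double nll = 0.0;
    for (double x : data) nll += gaussNll(x, mean, sigma);
    return nll;
}

// One event of the combined dataset: observable plus channel label.
struct LabelledEvent {
    double x;
    char channel;  // 'A' or 'B'
};

// 'Switch' pdf: the channel label selects the model used for each event.
double jointNll(const std::vector<LabelledEvent>& data,
                double meanA, double meanB, double sigma) {
    double nll = 0.0;
    for (const LabelledEvent& e : data)
        nll += gaussNll(e.x, e.channel == 'A' ? meanA : meanB, sigma);
    return nll;
}

// Merge two samples into one labelled dataset.
std::vector<LabelledEvent> combine(const std::vector<double>& a,
                                   const std::vector<double>& b) {
    std::vector<LabelledEvent> all;
    for (double x : a) all.push_back({x, 'A'});
    for (double x : b) all.push_back({x, 'B'});
    return all;
}
```

For any parameter values, `jointNll(combine(a, b), ...)` agrees with `sampleNll(a, ...) + sampleNll(b, ...)` up to floating-point rounding, which is the statement of the bullets above.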


Other scenarios in which simultaneous fits are useful

• Preceding example was ‘asymmetric’

– Very large control sample, small signal sample

– Physics in each channel possibly different (but with some similar properties)

• There are also ‘symmetric’ use cases

– Fit multiple data sets that are functionally equivalent, but have slightly different properties (e.g. purity)

– Example: Split B physics data into blocks separated by flavor tagging technique (each technique results in a different sensitivity to CP physics parameters of interest).

– Split data into blocks by data taking run; mass resolutions in each run may be slightly different

– For symmetric use cases the pdf-level definition of a simultaneous fit is very convenient, as you usually start with a single dataset with subclassing information derived from its observables

• By splitting data into subsamples with p.d.f.s that can be tuned to describe the (slightly) varying properties you can increase the statistical sensitivity of your measurement


A more empirical approach to simultaneous fits

• Instead of investing a lot of time in developing multi-dimensional models → split data in many subsamples, fit all subsamples simultaneously to slight variations of a ‘master’ p.d.f

• Example: Given dataset D(x,y) where observable of interest is x.

– Distribution of x varies slightly with y

– Suppose we’re only interested in the width of the peak, which is supposed to be invariant under y (unlike the mean)

– Slice data in 10 bins of y and simultaneously fit each bin with a p.d.f that has a different Gaussian mean parameter, but the same width
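For pure Gaussian slices, this kind of fit has a closed-form maximum-likelihood solution, which makes the idea easy to check. This is a minimal sketch in plain C++, assuming no background and non-empty slices. `SliceFit` and `fitSharedWidth` are hypothetical names, and a real analysis would use a numerical fit instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Result of the sliced fit: one mean per slice, one common width.
struct SliceFit {
    std::vector<double> means;
    double sigma;
};

// Closed-form ML estimate for Gaussian slices with independent means and a
// shared width: each mean is the slice average, and sigma^2 is the pooled
// sum of squared residuals divided by the total number of events.
SliceFit fitSharedWidth(const std::vector<std::vector<double>>& slices) {
    SliceFit fit;
    fit.sigma = 0.0;
    double sumSq = 0.0;
    std::size_t nTotal = 0;
    for (const std::vector<double>& slice : slices) {
        double mean = 0.0;
        for (double x : slice) mean += x;
        if (!slice.empty()) mean /= slice.size();
        fit.means.push_back(mean);
        for (double x : slice) sumSq += (x - mean) * (x - mean);
        nTotal += slice.size();
    }
    if (nTotal > 0) fit.sigma = std::sqrt(sumSq / nTotal);
    return fit;
}
```

All slices contribute to the single width estimate, while each mean is determined by its own slice only.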


A more empirical approach to simultaneous fits

  Floating Parameter    FinalValue +/-  Error
  --------------------  --------------------------
  mean_bin1             -4.5302e+00 +/-  1.62e-02
  mean_bin2             -3.4928e+00 +/-  1.38e-02
  mean_bin3             -2.4790e+00 +/-  1.35e-02
  mean_bin4             -1.4174e+00 +/-  9.64e-03
  mean_bin5             -4.8945e-01 +/-  7.95e-03
  mean_bin6              4.0716e-01 +/-  9.67e-03
  mean_bin7              1.4733e+00 +/-  1.37e-02
  mean_bin8              2.4912e+00 +/-  1.44e-02
  mean_bin9              3.5028e+00 +/-  1.41e-02
  mean_bin10             4.5474e+00 +/-  1.68e-02
  sigma                  2.7319e-01 +/-  2.46e-03

• Fit to sample of preceding page would look like this

– Each mean is fitted to expected value (-4.5 + ibin)

– But joint measurement of sigma

– NB: Correlation matrix is mostly diagonal as all mean_binXX parameters are completely uncorrelated!


A more empirical approach to simultaneous fits

• The preceding example was simplistic for clarity of illustration, but more sensible use cases exist

– Example: Measurement of CP violation in B decay. The analyzing power of each event is diluted by a factor (1-2w), where w is the mistake rate of the flavor tagging algorithm

– A neural net flavor tagging algorithm provides a tagging probability for each event in data. Could use prob(NN) as w, but then we rely on good calibration of the NN, which we don’t want

– In a simultaneous fit to CPV+Mixing samples, can measure the average w from the latter. Now not relying on NN calibration, but not exploiting event-by-event variation in analyzing power.

– Improved scenario: divide the (CPV+mixing) data in 10 or 20 subsets corresponding to bins in prob(NN). Use an identical p.d.f, with only a separate parameter per subset to express the fitted mistag rate w_binXX.

– The simultaneous fit will now exploit the difference in analyzing power of events and be insensitive to the calibration of the flavor tagging NN.

– If the calibration of the NN is OK, the fitted mistag rate in each bin of prob(NN) will equal the average prob(NN) value for that bin
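The dilution factor quoted above can be written out explicitly. The relations below are the standard tagging-dilution formulas, not taken from the slides. The symbols $A_{obs}$, $A_{true}$ and $N$ (number of tagged events in a subsample) are introduced here for illustration.

```latex
A_{obs} = (1 - 2w)\,A_{true},
\qquad
\sigma(A_{true}) \;\propto\; \frac{1}{(1 - 2w)\sqrt{N}}
```

For example, w = 0 gives no dilution, w = 0.25 halves the observed asymmetry, and w = 0.5 (a random tag) removes all sensitivity. The statistical weight of a subsample therefore scales as $N(1-2w)^2$, which is why binning in prob(NN) lets the fit give more weight to well-tagged events.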


A more empirical approach to simultaneous fits

[Figure: three panels (Perfect NN, OK NN, Lousy NN), each plotting control sample measured power versus NN predicted power, with markers for an event with little analyzing power and an event with great analyzing power.]

In all 3 cases the fit is not biased by NN calibration.

Better precision on the CPV measurement when there are more sensitive events in the sample; worse precision when there are fewer sensitive events in the sample.


Building simultaneous fits from a template

// Template pdf – B0 decay with mixing
w.factory("TruthModel::tm(t[-20,20])") ;
w.factory("BMixDecay::sig(t,mixState[mixed=-1,unmixed=1],"
          "tagFlav[B0=1,B0bar=-1],tau[1.54,1,2],"
          "dm[0.472,0.1,0.8],w[0.1,0,0.5],dw[0],tm)") ;

// Construct index category
w.factory("tagCat[Lep,Kao,NT1,NT2]") ;

// Construct simultaneous pdf with separate mistag rate for each category
w.factory("SIMCLONE::model(sig,$SplitParam({w,dw},tagCat))") ;

• In the ‘symmetric’ use case the models assigned to each state are very similar in structure

– Usually just one parameter name is different

• The easiest way is to construct these from a template pdf and a prescription on how to tailor the template for each index state

• Use operator SIMCLONE instead of SIMUL


Building simultaneous fits from a template

• Result

RooWorkspace(w) w contents

variables
---------
(dm,dw,dw_Kao,dw_Lep,dw_NT1,dw_NT2,mixState,t,tagCat,tagFlav,tau,w,w_Kao,w_Lep,w_NT1,w_NT2)

p.d.f.s
-------
RooBMixDecay::sig[ mistag=w delMistag=dw mixState=mixState tagFlav=tagFlav tau=tau dm=dm t=t ] = 0.2
RooSimultaneous::model[ indexCat=tagCat Lep=sig_Lep Kao=sig_Kao NT1=sig_NT1 NT2=sig_NT2 ] = 0.2
RooBMixDecay::sig_Kao[ mistag=w_Kao delMistag=dw_Kao ... t=t ] = 0.2
RooBMixDecay::sig_Lep[ mistag=w_Lep delMistag=dw_Lep ... t=t ] = 0.2
RooBMixDecay::sig_NT1[ mistag=w_NT1 delMistag=dw_NT1 ... t=t ] = 0.2
RooBMixDecay::sig_NT2[ mistag=w_NT2 delMistag=dw_NT2 ... t=t ] = 0.2

analytical resolution models
----------------------------
RooTruthModel::tm[ x=t ] = 1


Adding parameter pdfs to the likelihood

// Original pdf f(x;mean,sigma)
w.factory("Gaussian::f(x[-10,10],mean[-10,10],sigma[3])") ;

// Multiply with a Gaussian constraint on the parameter 'mean'
w.factory("PROD::gprime(f,Gaussian(mean,1.15,0.30))") ;

$-\log L(\mu,\sigma) = -\sum_{i \in \mathrm{data}} \log f(x_i;\mu,\sigma) \; - \log \mathrm{Gauss}(\mu, 1.15, 0.30)$

• Systematic/external uncertainties can be modeled with regular RooFit pdf objects.

• To incorporate them in the likelihood, simply multiply with the original pdf

– Any pdf can be supplied. A Gaussian is most common, but one can also use class RooMultiVarGaussian to introduce a Gaussian uncertainty on multiple parameters including a correlation

• Advantage of including systematic uncertainties in likelihood: error automatically propagated to error reported by MINUIT
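The effect of the constraint term on the likelihood can be sketched numerically. This is a minimal sketch in plain C++ (not RooFit), assuming a Gaussian model and a Gaussian constraint on its mean. `gaussNll`, `dataNll` and `constrainedNll` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// -log of a normalised Gaussian density at x.
double gaussNll(double x, double mean, double sigma) {
    const double pi = 3.14159265358979323846;
    const double z = (x - mean) / sigma;
    return 0.5 * z * z + std::log(sigma * std::sqrt(2.0 * pi));
}

// Likelihood of the data alone.
double dataNll(const std::vector<double>& data, double mean, double sigma) {
    double nll = 0.0;
    for (double x : data) nll += gaussNll(x, mean, sigma);
    return nll;
}

// Data likelihood plus the external constraint term on 'mean':
// -log L = -sum_i log f(x_i; mean, sigma) - log Gauss(mean; cMean, cSigma)
double constrainedNll(const std::vector<double>& data, double mean, double sigma,
                      double cMean, double cSigma) {
    return dataNll(data, mean, sigma) + gaussNll(mean, cMean, cSigma);
}
```

Moving `mean` one constraint sigma away from the central value (for instance from 1.15 to 1.45 with a width of 0.30) raises the constraint term by exactly 0.5 units of -log L, so the minimizer sees the external knowledge as an additional parabola in that parameter.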


Adding uncertainties to a likelihood

• Example 1 – Width known exactly

• Example 2 – Gaussian uncertainty on width


Using the fit result output

RooAbsPdf* paramPdf =

fr->createHessePdf(RooArgSet(frac,mean,sigma));

• The fit result class contains the full MINUIT output

• Can construct a multi-variate Gaussian pdf representing the pdf on the parameters

– Returned pdf represents the HESSE parabolic approximation of the fit

• Can also multiply this pdf in parameters with a pdf in observables

– ‘Simultaneous fit’


Another approach to joint fitting

• An ‘asymmetric’ simultaneous fit may spend the majority of its CPU time calculating the likelihood of the control sample part

– Because control samples have many more events

– Example: joint fit between CPV golden modes and BMixing samples

• Alternate solution: Make the joint fit using the likelihood of the signal sample and a parameterized likelihood of the control sample

– Assumption: Likelihood can be described by a multi-variate Gaussian with correlations (i.e. log-likelihood is parabolic)

– Very easy to do in RooFit using RooFitResult::createHessePdf()

– Example on next page
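The parabolic assumption above can be written out explicitly. In the sketch below, $\vec p$ denotes the parameters of the control sample model, $\hat{\vec p}$ their fitted values and $V$ the covariance matrix from HESSE. These symbols are introduced here for illustration and do not appear in the slides.

```latex
-\log L_{ctl}(\vec p) \;\approx\; -\log L_{ctl}(\hat{\vec p})
  + \tfrac{1}{2}\,(\vec p - \hat{\vec p})^{T}\, V^{-1}\, (\vec p - \hat{\vec p})
```

The joint fit then minimizes $-\log L_{sig}$ plus this quadratic term, which costs a single matrix product per likelihood evaluation instead of a loop over all control sample events.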


Example of joint fit with parameterized likelihood

// --- Regular joint fit ---

// Joint pdf construction
w.factory("SIMUL::model_sim(index[sig,ctl],"
          "sig=model,ctl=model_ctl)") ;

// Joint data construction
RooDataSet simdata("simdata","simdata",w::x,Index(w::index),
                   Import("sig",*data),Import("ctl",*data_ctl)) ;

// Joint fit
RooFitResult* rs = w::model_sim.fitTo(simdata,Save()) ;

// --- Joint fit with parameterized L for ctl sample ---

// Fit to control sample only
RooFitResult* r = w::model_ctl.fitTo(*data_ctl,Save()) ;

// Make pdf of parameters and import in workspace
RooAbsPdf* ctrlParamPdf = r->createHessePdf(*w::model_ctl.getParameters(*data_ctl)) ;
ctrlParamPdf->SetName("ctrlParamPdf") ;
w.import(*ctrlParamPdf) ;
w.factory("PROD::model_sim2(model,ctrlParamPdf)") ;

// Joint fit with parameterized likelihood for control sample
// (result stored in rs2 to avoid redeclaring rs)
RooFitResult* rs2 = w::model_sim2.fitTo(*data,Save()) ;