Post on 22-Dec-2015
Maria Grazia Pia, INFN Genova
A Toolkit for Statistical Data Analysis
S. Donadio, F. Fabozzi, L. Lista, S. Guatelli, B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon, P. Viarengo
http://www.ge.infn.it/geant4/analysis/HEPstatistics
LCG Application Area Meeting, CERN, 5 May 2004
History and background
The motivation from Geant4
Validation of Geant4 physics models through comparison of simulation vs experimental data or reference databases
Fluorescence spectrum from Icelandic basalt (Mars-like rock): experimental data and simulation
ESA Bepi Colombo mission to Mercury Test beam at Bessy
Photon attenuation coefficient, Al
Geant4 Standard
Geant4 LowE
NIST
Electromagnetic models in Geant4 w.r.t. NIST reference
Historical introduction to EDF tests
In 1933 Kolmogorov published a short but landmark paper in the Italian Giornale dell’Istituto degli Attuari. He formally defined the empirical distribution function (EDF) and then enquired how close this would be to the true distribution F(x), when this is continuous.
It should be noted that Kolmogorov himself regarded his paper as the solution of an interesting probability problem, following the general interest of the time, rather than as a paper on statistical methodology.
After Kolmogorov's article, over a period of about 10 years, a number of distinguished mathematicians (Smirnov, Cramer, Von Mises, Anderson, Darling, …) laid the foundations of methods for testing fit to a distribution based on the EDF.
The ideas in this paper have formed a platform for a vast literature, both on interesting and important probability problems and on methods of using the Kolmogorov statistics for testing fit to a distribution. The production of literature continues with great strength today, showing no sign of decreasing.
Typical use cases in HEP
Regression testing
– Throughout the software life-cycle
Online DAQ
– Monitoring detector behaviour w.r.t. a reference
Simulation validation
– Comparison with experimental data
Reconstruction
– Comparison of reconstructed vs. expected distributions
Physics analysis
– Comparisons of experimental distributions (ATLAS vs. CMS Higgs?)
– Comparison with theoretical distributions (data vs. Standard Model)
Software tools
Commercial products used by “professional” statisticians
– SPSS, NCSS...
In HEP:
A lot of activity:
– workshops/conferences (CERN, Durham, SLAC etc.)
– books (F. James et al., L. Lyons, R. Barlow etc.)
– sophisticated statistical algorithms applied in various data analyses
...but, in spite of the relevant role played by statistics in HEP, very limited availability of software tools for statistics in our field
– and in open-source software in general
Let’s do it ourselves...
Provide tools for the statistical comparison of distributions
Create a hub to aggregate expertise and collaborative contributions from scientists interested in statistical methods
A project to develop an open-source software system for statistical analysis
see presentation at LCG-AA meeting, 27 November 2002
Vision: the basics
Rigorous software process
Have a vision for the project
– General purpose tool for statistical analysis
– Toolkit approach (choice open to users)
– Open source product
Build on a solid architecture
Clearly define scope, objectives
Flexible, extensible, maintainable system
Software quality
Architectural guidelines
The project adopts a solid architectural approach
– to offer the functionality and the quality needed by the users
– to be maintainable over a large time scale
– to be extensible, to accommodate future evolutions of the requirements
Component-based architecture
– to facilitate re-use and integration in diverse frameworks
Dependencies
– adopt a standard (AIDA) for the user layer
– no dependence on any specific analysis tool
Python
– the “glue” for interactivity
The approach adopted is compatible with the recommendations of the LCG Architecture Blueprint Report
Software process
Unified Software Development Process, specifically tailored to the project
– practical guidance and tools from the RUP
– both rigorous and lightweight
– mapping onto ISO 15504
– significant experience gained in the group from other projects
Incremental and iterative life-cycle model
The Goodness-of-Fit component
User Requirements
User requirements elicited, analysed and formally specified
– Functional (capability) and non-functional (constraint) requirements
– User Requirements Document available from the web site
Requirement traceability
• Requirements
• Design
• Implementation
• Test & test results
• Documentation
Simple user layer
Shields the user from the complexity of the underlying algorithms and design
Only deal with AIDA objects and choice of comparison algorithm
GoF algorithms
Algorithms for binned distributions
– Anderson-Darling test
– Chi-squared test
– Fisz-Cramer-von Mises test
– Tiku test (Cramer-von Mises test in chi-squared approximation)
Algorithms for unbinned distributions
– Anderson-Darling test
– Fisz-Cramer-von Mises test
– Goodman test (Kolmogorov-Smirnov test in chi-squared approximation)
– Kolmogorov-Smirnov test
– Kuiper test
– Tiku test (Cramer-von Mises test in chi-squared approximation)
Chi-squared test
Applies to binned distributions
It can also be useful for unbinned distributions, but the data must first be grouped into classes
Cannot be applied if the theoretical frequency in any class is < 5
– When a class falls below this minimum, one could try to merge contiguous classes until the minimum theoretical frequency is reached
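The class-merging remedy mentioned above can be sketched as follows. This is only an illustration, not the toolkit's implementation: the function names, the left-to-right merging order, and the choice to fold a sparse tail into the last merged class are all assumptions, and the sample frequencies are made up.

```python
from typing import List, Sequence, Tuple


def merge_sparse_classes(
    observed: Sequence[float],
    expected: Sequence[float],
    min_expected: float = 5.0,
) -> Tuple[List[float], List[float]]:
    """Merge contiguous classes until each expected frequency >= min_expected."""
    merged_obs: List[float] = []
    merged_exp: List[float] = []
    acc_obs = 0.0
    acc_exp = 0.0
    for o, e in zip(observed, expected):
        acc_obs += o
        acc_exp += e
        if acc_exp >= min_expected:
            # The accumulated class is populated enough: close it.
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs = 0.0
            acc_exp = 0.0
    # A sparse tail may remain: fold it into the last closed class, if any.
    if acc_exp > 0.0 or acc_obs > 0.0:
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp


def chi_squared(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Pearson chi-squared statistic for observed vs. theoretical frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical frequencies, for illustration only.
obs = [1, 3, 8, 12, 9, 4, 2, 1]
exp = [2.0, 3.5, 7.0, 11.0, 9.5, 4.0, 2.0, 1.0]
m_obs, m_exp = merge_sparse_classes(obs, exp)
statistic = chi_squared(m_obs, m_exp)
```

Merging reduces the number of classes, so the number of degrees of freedom used for the p-value has to be reduced accordingly.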
Unbinned distributions: SUPREMUM STATISTICS
More sophisticated algorithms
[Figure: original distributions and empirical distribution functions, vertical axis 0 to 1, with $D_{mn}$ marked]
• Kolmogorov-Smirnov test: $D_{mn} = \sup_x |F_m(x) - G_n(x)|$
• Goodman approximation of KS test: $\chi^2 = 4 D_{mn}^2 \, \frac{mn}{m+n}$
• Kuiper test: $D^* = \max_x [F_0(x) - F_T(x)] + \max_x [F_T(x) - F_0(x)]$
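A minimal sketch of the three supremum statistics may help fix the definitions. It is not the toolkit's code: the function names are hypothetical, and the Kuiper statistic is computed here between two empirical distributions, whereas the slide writes it in terms of $F_0$ and $F_T$.

```python
from bisect import bisect_right
from typing import Callable, Sequence, Tuple


def edf(sample: Sequence[float]) -> Callable[[float], float]:
    """Return the empirical distribution function x -> fraction of values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def supremum_statistics(
    sample_a: Sequence[float], sample_b: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (Kolmogorov-Smirnov D, Kuiper D*, Goodman chi-squared)."""
    f_m = edf(sample_a)
    g_n = edf(sample_b)
    # Both EDFs are step functions, so the extrema of their difference
    # are reached at the observed values.
    points = sorted(set(sample_a) | set(sample_b))
    diffs = [f_m(x) - g_n(x) for x in points]
    d_plus = max(0.0, max(diffs))    # largest excess of F_m over G_n
    d_minus = max(0.0, -min(diffs))  # largest excess of G_n over F_m
    d_mn = max(d_plus, d_minus)      # Kolmogorov-Smirnov distance
    kuiper = d_plus + d_minus        # Kuiper: sum of the two one-sided distances
    m, n = len(sample_a), len(sample_b)
    goodman_chi2 = 4.0 * d_mn ** 2 * m * n / (m + n)
    return d_mn, kuiper, goodman_chi2


# Made-up samples, for illustration only.
d, v, chi2 = supremum_statistics([1, 2, 3, 4], [3, 4, 5, 6])
```

For these two samples the EDFs differ by at most 0.5 and one never drops below the other, so the Kolmogorov-Smirnov and Kuiper distances coincide.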
TESTS CONTAINING A WEIGHTING FUNCTION
More powerful algorithms
Unbinned distributions
• Cramer-von Mises test: $\omega^2 = \int [F_0(x) - F_T(x)]^2 \, dF_T(x)$
• Anderson-Darling test: $A^2 = \int \frac{[F_0(x) - F_T(x)]^2}{F_T(x)\,[1 - F_T(x)]} \, dF_T(x)$
Binned distributions
• Fisz-Cramer-von Mises test: $t = \frac{n_1 n_2}{(n_1 + n_2)^2} \sum_i [F_1(x_i) - F_2(x_i)]^2$
• k-sample Anderson-Darling test: $A^2_{kn} = \frac{n-1}{n^2 (k-1)} \sum_i \frac{1}{n_i} \sum_j h_j \frac{(n F_{ij} - n_i H_j)^2}{H_j (n - H_j) - n h_j / 4}$
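As an illustration of a statistic that accumulates the squared EDF difference over the whole range rather than taking its maximum, here is a sketch of the two-sample Fisz-Cramer-von Mises statistic. It assumes the sum runs over every observation of the pooled sample; the function name is hypothetical and this is not the toolkit's implementation.

```python
from bisect import bisect_right
from typing import Sequence


def fisz_cramer_von_mises(
    sample_1: Sequence[float], sample_2: Sequence[float]
) -> float:
    """t = n1*n2/(n1+n2)^2 * sum_i [F1(x_i) - F2(x_i)]^2 over the pooled sample."""
    s1, s2 = sorted(sample_1), sorted(sample_2)
    n1, n2 = len(s1), len(s2)
    total = 0.0
    for x in s1 + s2:  # every pooled observation, ties included
        f1 = bisect_right(s1, x) / n1  # EDF of sample 1 at x
        f2 = bisect_right(s2, x) / n2  # EDF of sample 2 at x
        total += (f1 - f2) ** 2
    return n1 * n2 / (n1 + n2) ** 2 * total


# Made-up samples, for illustration only.
t = fisz_cramer_von_mises([1, 2, 3, 4], [3, 4, 5, 6])
```

Every point where the two EDFs disagree contributes to the sum, which is why such statistics respond to differences spread along the range of x and not only to one marked gap.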
Comparative documentation of tests
Test | Power | Characteristics
Anderson-Darling | High | Sensitive to tails
χ² | Low | General
Fisz-Cramer-von Mises | High | Symmetric, right-skewed distributions
Goodman | Medium | Approximation of K-S to χ² test statistics
Kolmogorov-Smirnov | Medium | Derives from Kolmogorov statistics
Kuiper | Medium | Sensitive to tails and median
Tiku | High | Converts CvM statistics to a χ²
More about a comparative evaluation of tests in the User Documentation on our web site
Topic still subject to research activity in the domain of statistics
Power of tests
The power of a test is the probability of correctly rejecting the null hypothesis, i.e. of rejecting it when it is false
χ² loses information in a test for unbinned distributions by grouping the data into cells
Kac, Kiefer and Wolfowitz (1955) showed that the Kolmogorov-Smirnov test requires $n^{4/5}$ observations compared to $n$ observations for χ² to attain the same power
Cramer-von Mises and Anderson-Darling statistics are expected to be superior to Kolmogorov-Smirnov’s, since they make a comparison of the two distributions all along the range of x, rather than looking for a marked difference at one point
In terms of power:
χ² < Supremum statistics tests < Tests containing a weight function
Unit test: χ² (1)
EXAMPLE FROM PICCOLO BOOK (STATISTICS - page 711)
The study concerns monthly birth and death distributions (binned data)
[Figure: frequency (0 to 45) vs. months (1 to 12) for the birth distribution and the death distribution]
χ² test statistic = 15.8
Expected χ² = 15.8
Exact p-value = 0.200758
Expected p-value = 0.200757
Unit test: χ² (2)
EXAMPLE FROM CRAMER BOOK (MATHEMATICAL METHODS OF STATISTICS - page 447)
The study concerns the sex distribution of children born in Sweden in 1935
[Figure: frequency (0 to 4500) vs. classes (1 to 12) for boys and girls]
χ² test statistic = 123.203
Expected χ² = 123.203
Exact p-value = 0
Expected p-value = 0
Unit test: K-S Goodman (1)
EXAMPLE FROM PICCOLO BOOK (STATISTICS - page 711)
The study concerns monthly birth and death distributions (unbinned data)
[Figure: cumulative function (0 to 1.2) vs. months]
χ² test statistic = 3.9
Expected χ² = 3.9
Exact p-value = 0.140974
Expected p-value = 0.140991
Unit test: K-S Goodman (2)
EXAMPLE FROM LANDENNA BOOK (NONPARAMETRIC TESTS BASED ON FREQUENCIES - page 287)
We consider body lengths of two independent groups of anopheles
[Figure: Distribution 1 and Distribution 2 vs. body lengths (73 to 98), vertical axis 0 to 1]
χ² test statistic = 1.5
Expected χ² = 1.5
Exact p-value = 0.472367
Expected p-value = 0.472367
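The p-value quoted for this unit test can be cross-checked by hand. The sketch below assumes, as is conventional for Goodman's approximation but not spelled out on the slide, that the statistic is referred to a chi-squared distribution with 2 degrees of freedom, whose upper-tail probability has the closed form exp(-χ²/2). The function name is hypothetical.

```python
from math import exp


def goodman_p_value(chi2: float) -> float:
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom.

    Assumption: Goodman's statistic 4*D^2*m*n/(m+n) follows this distribution
    under the null hypothesis.
    """
    return exp(-chi2 / 2.0)


# Statistic quoted for the body-length example: chi-squared = 1.5
p = goodman_p_value(1.5)
```

With χ² = 1.5 this gives exp(-0.75), which agrees with the quoted 0.472367 to six decimal places.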
Unit test: Kolmogorov-Smirnov (1)
EXAMPLE FROM http://www.physics.csbsju.edu/stats/KS-test.html
The study concerns how long a bee stays near a particular tree (Redwell/Whitney)
[Figure: cumulative distributions (0 to 1.2) vs. time (0 to 50 s) for Redwell and Whitney]
D test statistic = 0.2204
Expected D = 0.2204
Exact p-value = 0.0354675
Expected p-value = 0.035
Unit test: Kolmogorov-Smirnov (2)
EXAMPLE FROM LANDENNA BOOK (NONPARAMETRIC STATISTICAL METHODS - page 318-325)
We consider one clinical parameter of two independent groups of patients
[Figure: cumulative distributions of Distribution 1 and Distribution 2]
D test statistic = 0.65
Expected D = 0.65
Exact p-value = $2 \times 10^{-19}$
Expected p-value = $8 \times 10^{-19}$
Example of application results
Fluorescence spectrum from Icelandic basalt (Mars-like rock): experimental data and simulation
ESA Bepi Colombo mission to Mercury, test beam at Bessy
Anderson-Darling: $A_c$ (95%) = 0.752
Photon attenuation coefficient, Al: electromagnetic models in Geant4 w.r.t. NIST reference
[Figure legend: Geant4 Standard, Geant4 LowE, NIST]
$\chi^2_{N-L} = 13.1$, $\nu = 20$, $p = 0.87$
$\chi^2_{N-S} = 23.2$, $\nu = 15$, $p = 0.08$
Latest release: 30 March 2004
GPL License
User Documentation
Download
Installation
User Guide
Statistics Reference Guide
A toolkit for modeling multi-parametric fit problems
F. Fabozzi, L. Lista
INFN Napoli
Initially developed while rewriting a Fortran fitter for BaBar analysis
– Simultaneous estimate of:
B(B → J/ψ π) / B(B → J/ψ K)
direct CP asymmetry
– More control over the code was needed to explain a bias that appeared in the original fitter
Requirements
Provide tools for modeling parametric fit problems
Unbinned Maximum Likelihood (UML[*]) fit of:
– PDF parameters
– Yields of different sub-samples
– Both, mixed
χ² fits
Toy Monte Carlo to study the fit properties
– Fitted parameter distributions
– Pulls, bias, confidence level of fit results
[*] not Unified Modeling Language …
New components included in the Statistical Toolkit
Architecture open to extension and evolution
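The toy Monte Carlo requirement above can be illustrated with a deliberately simple case. The sketch assumes an exponential PDF, for which the unbinned maximum likelihood estimate of the lifetime is the sample mean and its estimated error is that mean divided by sqrt(n). All names, the seed and the sample sizes are hypothetical; this is not the toolkit's API.

```python
import random
from math import sqrt
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def fit_lifetime(sample: Sequence[float]) -> Tuple[float, float]:
    """Unbinned ML fit of an exponential PDF: closed-form estimate and its error."""
    tau_hat = sum(sample) / len(sample)
    return tau_hat, tau_hat / sqrt(len(sample))


def toy_pulls(
    true_tau: float, n_events: int, n_toys: int, seed: int = 12345
) -> List[float]:
    """Generate n_toys pseudo-experiments and return the pull of each fit."""
    rng = random.Random(seed)
    pulls = []
    for _ in range(n_toys):
        # expovariate takes the rate, i.e. the inverse of the lifetime
        sample = [rng.expovariate(1.0 / true_tau) for _ in range(n_events)]
        tau_hat, sigma = fit_lifetime(sample)
        pulls.append((tau_hat - true_tau) / sigma)
    return pulls


pulls = toy_pulls(true_tau=2.0, n_events=500, n_toys=400)
pull_mean = mean(pulls)     # close to 0 for an unbiased fit
pull_width = pstdev(pulls)  # close to 1 if the fit errors are well estimated
```

A pull mean far from 0 would point to a biased estimator, and a pull width far from 1 to misestimated errors, which is the kind of diagnosis the toy Monte Carlo is meant to support.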
For LCG users
The Statistical Toolkit is distributed with PI as an external product
– Currently the previous release, not yet the latest, is distributed
– Update foreseen
Integration in the Savannah system for problem reporting foreseen
Open to collaboration to facilitate the usage in the LCG community
– feedback, user requirements, suggestions are welcome, of course!
Please contact Andreas.Pfeiffer@cern.ch for further information about the Statistical Toolkit in PI distribution
References
Conference Proceedings:
– PhyStat Conference, SLAC, 2003
– IEEE Nuclear Science Symposium, Portland, 2003
Papers:
– S. Donadio et al., A toolkit for statistical data comparison, to be published in IEEE Trans. Nucl. Sci. (August 2004)
More papers in preparation
References kept up-to-date on the web site
http://www.ge.infn.it/geant4/analysis/HEPstatistics/
Will be moved to a new area out of Geant4-INFN web (automatic re-direction)
Acknowledgments
Work supported and partially funded by the European Space Agency (ESA) under Contract No. 16339/02/NL/FM
Geant4 beta testing
– P. Cirrone (INFN-LNS), S. Guatelli (INFN Genova), S. Parlati (INFN-LNGS)
Fred James (CERN) and Louis Lyons (Oxford)
– many useful suggestions, discussions, encouragement...
Conclusions
A project to develop an open source, general purpose software toolkit for statistical data analysis is in progress
– to provide a product of common interest to user communities
Rigorous software process
– to contribute to the quality of the product
Component-based architecture, OO methods + generic programming
– to ensure openness to evolution, maintainability, ease of use
GoF component
Component for modeling multi-parametric fit problems
Software released and results available
– toolkit in use for Geant4 physics validation
– incremental and iterative life-cycle