
    Multivariate Analysis from a Statistical Point of View

K. S. Cranmer

    University of Wisconsin-Madison, Madison, WI 53706, USA

Multivariate Analysis is an increasingly common tool in experimental high energy physics; however, many of the common approaches were borrowed from other fields. We clarify what the goal of a multivariate algorithm should be for the search for a new particle and compare different approaches. We also translate the Neyman-Pearson theory into the language of statistical learning theory.

    1. INTRODUCTION

Multivariate Analysis is an increasingly common tool in experimental high energy physics; however, most of the common approaches were borrowed from other fields. Each of these algorithms was developed for its own particular task, thus they look quite different at their core. It is not obvious that what these different algorithms do internally is optimal for the tasks which they perform within high energy physics. It is also quite difficult to compare these different algorithms due to the differences in the formalisms that were used to derive and/or document them. In Section 2 we introduce a formalism for a Learning Machine, which is general enough to encompass all of the techniques used within high energy physics. In Sections 3 & 4 we review the statistical statements relevant to new particle searches and translate them into the formalism of statistical learning theory. In the remainder of the note, we look at the main results of statistical learning theory and their relevance to some of the common algorithms used within high energy physics.

    2. FORMALISM

Formally, a Learning Machine is a family of functions F with domain I and range O parametrized by α. The domain can usually be thought of as, or at least embedded in, R^d, and we generically denote points in the domain as x. The points x can be referred to in many ways (e.g. patterns, events, inputs, examples, ...). The range is most commonly R, [0, 1], or just {0, 1}. Elements of the range are denoted by y and can be referred to in many ways (e.g. classes, target values, outputs, ...). The parameters α specify a particular function f_α ∈ F, and the structure of α depends upon the learning machine [1, 2].

In the modern theory of machine learning, the performance of a learning machine is usually cast in the more pessimistic setting of risk. In general, the risk, R, of a learning machine is written as

R(\alpha) = \int Q(x, y; \alpha)\, p(x, y)\, dx\, dy \qquad (1)

where Q measures some notion of loss between f_α(x) and the target value y. For example, when classifying events, the risk of mis-classification is given by Eq. 1 with Q(x, y; α) = |y - f_α(x)|. Similarly, for regression tasks^1 one takes Q(x, y; α) = (y - f_α(x))^2. Most of the classic applications of learning machines can be cast into this formalism; however, searches for new particles place some strain on the notion of risk.

^1 During the presentation, J. Friedman did not distinguish between these two tasks; however, in a region with p(x, 1) = b and p(x, 0) = 1 - b, the optimal f(x) for classification and regression differ: for classification the optimal f(x) is 1 if b > 1/2 and 0 otherwise, while for regression the optimal f(x) = b.
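To make the two loss functionals concrete, the following minimal sketch (not part of the original note; the toy joint density and the functions f are assumptions chosen purely for illustration) estimates the risk of Eq. 1 by Monte Carlo for both the mis-classification and the squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint density p(x, y): y = 1 and y = 0 are equally likely;
# x is Gaussian with a class-dependent mean.
n = 100_000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y, scale=1.0)

def f_classifier(x):
    # A simple parametrized function f_alpha(x): threshold at x = 1.
    return (x > 1.0).astype(float)

def f_regressor(x):
    # A smooth f_alpha(x) meant to approximate E[y | x].
    return 1.0 / (1.0 + np.exp(-2.0 * (x - 1.0)))

# Monte Carlo estimates of the risk R(alpha) in Eq. 1 for the two losses.
risk_classification = np.mean(np.abs(y - f_classifier(x)))  # Q = |y - f(x)|
risk_regression     = np.mean((y - f_regressor(x)) ** 2)    # Q = (y - f(x))^2

print(f"mis-classification risk ~ {risk_classification:.3f}")
print(f"squared-error risk      ~ {risk_regression:.3f}")
```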

    3. SEARCHES FOR NEW PARTICLES

The conclusion of an experimental search for a new particle is a statistical statement, usually a declaration of discovery or a limit on the mass of the hypothetical particle. Thus, the appropriate notion of performance for a multivariate algorithm used in a search for a new particle is that performance measure which will maximize the chance of declaring a discovery or provide the tightest limits on the hypothetical particle. In principle, it should be a fairly straightforward procedure to use the formal statistical statements to derive the most appropriate performance measure. This procedure is complicated by the fact that experimentalists (and statisticians) cannot settle on a formalism to use (i.e. Bayesians vs. Frequentists). As an example, let us consider the Frequentist theory developed by Neyman and Pearson [3]. This was the basis for the results of the search for the Standard Model Higgs boson at LEP [4].

The Neyman-Pearson theory (which we review briefly for completeness) begins with two hypotheses: the null hypothesis H0 and the alternate hypothesis H1 [3]. In the case of a new particle search, H0 is identified with the currently accepted theory (i.e. the Standard Model) and is usually referred to as the background-only hypothesis. Similarly, H1 is identified with the theory being tested and is usually referred to as the signal-plus-background hypothesis.



Next, one defines a region W ⊂ I such that if the data fall in W we accept the null hypothesis (and reject the alternate hypothesis)^2. Similarly, if the data fall in I - W we reject the null hypothesis and accept the alternate hypothesis. The probability to commit a Type I error is called the size of the test and is given by (note the alternate use of α)

\alpha = \int_{I - W} p(x|H_0)\, dx . \qquad (2)

The probability to commit a Type II error is given by

\beta = \int_{W} p(x|H_1)\, dx . \qquad (3)

Finally, the Neyman-Pearson lemma tells us that the region W of size α which minimizes the rate of Type II error (maximizes the power) is given by

W = \left\{ x \;\middle|\; \frac{p(x|H_1)}{p(x|H_0)} < k_\alpha \right\} , \qquad (4)

i.e., the null hypothesis is rejected when the likelihood ratio exceeds k_α.

^2 With m measurements, we should actually consider the data as (x_1, . . . , x_m) ∈ I^m, but, for ease of notation, let us only consider m = 1.
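As an illustration of Eqs. 2-4 (not from the original note), the sketch below works out the Neyman-Pearson test for two assumed toy hypotheses, unit-width Gaussians centered at 0 (H0) and 2 (H1); the likelihood ratio is monotone in x, so fixing the size α determines the cut and hence the power.

```python
import numpy as np
from scipy import stats

# Toy hypotheses (assumptions for illustration): under H0, x ~ N(0, 1);
# under H1, x ~ N(2, 1).  The likelihood ratio p(x|H1)/p(x|H0) increases
# monotonically with x, so the Neyman-Pearson test of size alpha rejects
# H0 for x above the (1 - alpha) quantile of p(x|H0).
alpha = 0.05
x_cut = stats.norm.ppf(1.0 - alpha, loc=0.0, scale=1.0)

# Type II error rate beta and power of the test under H1.
beta = stats.norm.cdf(x_cut, loc=2.0, scale=1.0)
print(f"reject H0 when x > {x_cut:.3f};  beta = {beta:.3f}, power = {1 - beta:.3f}")

# The same cut expressed as a cut on the likelihood ratio, i.e. the
# threshold k_alpha appearing in Eq. 4.
k_alpha = stats.norm.pdf(x_cut, loc=2.0) / stats.norm.pdf(x_cut, loc=0.0)
print(f"equivalent likelihood-ratio threshold k_alpha = {k_alpha:.3f}")
```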

4. THE NEYMAN-PEARSON THEORY IN THE CONTEXT OF RISK

In Section 2 we provided the loss functional Q appropriate for the classification and regression tasks; however, we did not provide a loss functional for searches for new particles. Having chosen the Neyman-Pearson theory as an explicit example, it is possible to develop a formal notion of risk.

Once the size of the test, α, has been agreed upon, the notion of risk is the probability of Type II error, β. In order to return to the formalism outlined in Section 2, identify H1 with y = 1 and H0 with y = 0. Let us consider learning machines with range R whose output we compose with a step function, θ(f_α(x) - k_α), so that by adjusting k_α we ensure that the acceptance region W has the appropriate size. The region W is the acceptance region for H0; thus it corresponds to W = {x | θ(f_α(x) - k_α) = 0} and I - W = {x | θ(f_α(x) - k_α) = 1}. We can also translate the quantities p(x|H0) and p(x|H1) into their learning-theory equivalents, p(x|0) = p(x, 0)/p(0) = ∫ δ(y) p(x, y) dy / ∫ p(x, 0) dx and p(x|1) = p(x, 1)/p(1) = ∫ δ(1 - y) p(x, y) dy / ∫ p(x, 1) dx, respectively. With these substitutions we can rewrite the Neyman-Pearson theory as follows. A fixed size α gives us the global constraint

\alpha = \frac{ \int \theta\big(f_\alpha(x) - k_\alpha\big)\, \delta(y)\, p(x, y)\, dx\, dy }{ \int p(x, 0)\, dx } \qquad (5)


and the risk is given by

\beta = \frac{ \int \left[ 1 - \theta\big(f_\alpha(x) - k_\alpha\big) \right] p(x, 1)\, dx }{ \int p(x, 1)\, dx } = \frac{ \int \theta\big(k_\alpha - f_\alpha(x)\big)\, \delta(1 - y)\, p(x, y)\, dx\, dy }{ \int p(x, 1)\, dx } . \qquad (6)

Extracting the integrand (up to the constant normalization ∫ p(x, 1) dx) we can write the loss functional as

Q(x, y; \alpha) = \theta\big(k_\alpha - f_\alpha(x)\big)\, \delta(1 - y) . \qquad (7)

Unfortunately, Eq. 1 does not allow for the global constraint imposed by k_α (which is implicitly a functional of f_α), but this could be accommodated by the methods of Euler and Lagrange. Furthermore, the constraint cannot be evaluated without explicit knowledge of p(x, y).
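In practice the constraint of Eq. 5 and the risk of Eq. 6 are evaluated on finite samples. The sketch below (an illustration with assumed Gaussian-distributed outputs f_α(x), not the note's own code) fixes k_α from the empirical background quantile and then estimates β on the signal sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy samples (an assumption for illustration): f_alpha(x) evaluated on
# labelled events, with signal (y = 1) scores shifted above background (y = 0).
f_bkg = rng.normal(0.0, 1.0, size=50_000)   # f_alpha(x) for y = 0
f_sig = rng.normal(1.5, 1.0, size=50_000)   # f_alpha(x) for y = 1

# Fix the size of the test: choose k_alpha so that the empirical Type I
# error rate (fraction of background events with f_alpha(x) > k_alpha)
# equals the target alpha.  This plays the role of the constraint in Eq. 5.
alpha_target = 0.01
k_alpha = np.quantile(f_bkg, 1.0 - alpha_target)

# Empirical Type II error rate (Eq. 6): signal events falling in the
# acceptance region W = {x | f_alpha(x) <= k_alpha}.
beta = np.mean(f_sig <= k_alpha)
print(f"k_alpha = {k_alpha:.3f},  "
      f"empirical alpha = {np.mean(f_bkg > k_alpha):.4f},  beta = {beta:.3f}")
```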

    4.1. Asymptotic Equivalence

Certain approaches to multivariate analysis leverage the many powerful theorems of statistics assuming one can explicitly refer to p(x, y). This dependence places a great deal of stress on the asymptotic ability to estimate p(x, y) from a finite set of samples {(x, y)_i}. There are many such techniques for estimating a multivariate density function p(x, y) given the samples [5, 6]. Unfortunately, for high-dimensional domains, the number of samples needed to enjoy the asymptotic properties grows very rapidly; this is known as the curse of dimensionality.

In the case that there is no (or negligible) interference between the signal process and the background processes, one can avoid the complications imposed by quantum mechanics and simply add probabilities. This is often the case with searches for new particles, thus the signal-plus-background hypothesis can be rewritten as p(x|H1) = n_s p_s(x) + n_b p_b(x), where n_s and n_b are normalization constants that sum to unity. This allows us to rewrite the contours of the likelihood ratio as contours of the signal-to-background ratio. In particular, the contours of the likelihood ratio p(x|H1)/p(x|H0) = k_α can be rewritten as p_s(x)/p_b(x) = (k_α - n_b)/n_s = k'_α.
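For completeness, the one-line algebra behind this statement, assuming the background-only hypothesis gives p(x|H0) = p_b(x):

```latex
\frac{p(x|H_1)}{p(x|H_0)}
  = \frac{n_s\, p_s(x) + n_b\, p_b(x)}{p_b(x)}
  = n_s\, \frac{p_s(x)}{p_b(x)} + n_b
  = k_\alpha
\quad\Longleftrightarrow\quad
\frac{p_s(x)}{p_b(x)} = \frac{k_\alpha - n_b}{n_s} \equiv k'_\alpha .
```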

The kernel estimation techniques described in this conference represent a particular statistical approach in which classification is achieved by cutting on a discriminant function D(x) [7]. The discriminant function D(x) = p_s(x)/(p_s(x) + p_b(x)) is one-to-one with p_s(x)/p_b(x) (which is in turn one-to-one with the likelihood ratio). These correspondences are only valid asymptotically, and the ability to accurately approximate p(x) from an empirical sample is often far from ideal. However, for particle physics applications, up to 5-dimensional multivariate analyses have shown good performance [8]. Furthermore, they have the added benefit that they can be easily understood.
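A minimal sketch of such a discriminant, using SciPy's generic Gaussian kernel density estimator on assumed toy samples (it is meant only to illustrate the construction of D(x), not the specific kernel estimation method of refs. [6, 7]):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)

# Toy 2-dimensional samples (assumptions for illustration): signal and
# background events drawn from different Gaussians.
sig = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(5_000, 2))
bkg = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(5_000, 2))

# Kernel density estimates of p_s(x) and p_b(x); gaussian_kde expects
# arrays of shape (n_dimensions, n_samples).
p_s = gaussian_kde(sig.T)
p_b = gaussian_kde(bkg.T)

def discriminant(x):
    """D(x) = p_s(x) / (p_s(x) + p_b(x)), monotone in p_s/p_b."""
    ps, pb = p_s(x.T), p_b(x.T)
    return ps / (ps + pb)

# Cutting on D(x) is equivalent to cutting on the estimated ratio p_s/p_b.
test_points = np.array([[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]])
print(discriminant(test_points))
```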


    4.2. Direct vs. Indirect Methods

The loss functional defined in Eq. 7 is derived from a minimization of the rate of Type II error. This is logically distinct from, but asymptotically equivalent to, approximating the likelihood ratio. In the case of no interference, this is logically distinct from, but asymptotically equivalent to, approximating the signal-to-background ratio. In fact, most multivariate algorithms are concerned with approximating an auxiliary function that is one-to-one with the likelihood ratio. Because the methods are not directly concerned with minimizing the rate of Type II error, they should be considered indirect methods. Furthermore, the asymptotic equivalence breaks down in most applications, and the indirect methods are no longer optimal. Neural networks, kernel estimation techniques, and support vector machines all represent indirect solutions to the search for new particles. The Genetic Programming (GP) approach presented in Section 6 is a direct method concerned with optimizing a user-defined performance measure.

    5. STATISTICAL LEARNING THEORY

The starting point for statistical learning theory is to accept that we might not know p(x, y) in any analytic or numerical form. This is, indeed, the case for particle physics, because only samples {(x, y)_i} can be obtained from the Monte Carlo convolution of a well-known theoretical prediction and a complex numerical description of the detector. In this case, the learning problem is based entirely on the training sample {(x, y)_i} with l elements. The risk functional is thus replaced by the empirical risk functional

R_{emp}(\alpha) = \frac{1}{l} \sum_{i=1}^{l} Q(x_i, y_i; \alpha) . \qquad (8)
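Eq. 8 is simple to evaluate; the following sketch (with a hypothetical f and a tiny made-up training sample) computes the empirical risk for the two losses of Section 2.

```python
import numpy as np

def empirical_risk(Q, f, x, y):
    """Eq. 8: average of the loss Q over the l training examples."""
    return np.mean([Q(f, xi, yi) for xi, yi in zip(x, y)])

# Two of the losses discussed in Section 2.
def zero_one_loss(f, xi, yi):
    return abs(yi - f(xi))

def squared_loss(f, xi, yi):
    return (yi - f(xi)) ** 2

# Tiny illustrative training sample {(x, y)_i} with l = 4 and a
# hypothetical classifier that thresholds at x = 0.5.
x_train = np.array([-1.0, 0.2, 0.4, 2.0])
y_train = np.array([0, 0, 1, 1])
f = lambda xi: float(xi > 0.5)

print(empirical_risk(zero_one_loss, f, x_train, y_train))
print(empirical_risk(squared_loss, f, x_train, y_train))
```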

There is a surprising result that the true risk R(α) can be bounded independent of the distribution p(x, y). In particular, for 0 ≤ Q(x, y; α) ≤ 1,

R(\alpha) \le R_{emp}(\alpha) + \sqrt{ \frac{ h \left( \log(2l/h) + 1 \right) - \log(\eta/4) }{ l } } \qquad (9)

where h is the Vapnik-Chervonenkis (VC) dimension and η is the probability that the bound is violated. As η → 0, h → ∞, or l → 0 the bound becomes trivial.
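The confidence term of Eq. 9 is easy to evaluate numerically; the sketch below (illustrative numbers only) shows how quickly the bound becomes trivial when the VC dimension h approaches the number of training examples l.

```python
import numpy as np

def vc_confidence_term(h, l, eta=0.05):
    """The square-root term of Eq. 9 for VC dimension h, l training
    examples, and probability eta that the bound is violated."""
    return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

def risk_bound(r_emp, h, l, eta=0.05):
    """Distribution-independent upper bound on the true risk R (Eq. 9)."""
    return r_emp + vc_confidence_term(h, l, eta)

# With l much larger than h the bound is informative ...
print(risk_bound(r_emp=0.10, h=100, l=1_000_000))
# ... but for h comparable to l it becomes trivial (an upper bound
# larger than 1 says nothing about a loss bounded by 1).
print(risk_bound(r_emp=0.10, h=10**5, l=10**5))
```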

The VC dimension is a fundamental property of a learning machine F, and is defined as the maximal cardinality of a set which can be shattered by F. That a set {x_i} can be shattered by F means that for each of the 2^h binary classifications of the points {x_i}, there exists an f_α ∈ F which satisfies y_i = f_α(x_i). A set of three points can be shattered by an oriented line as illustrated in Figure 1. Note that for a learning machine with VC dimension h, not every set of h elements must be shattered by F, but at least one.

[Figure 1: Example of an oriented line shattering 3 points. Solid and empty dots represent the two classes for y, and each of the 2^3 permutations is shown.]
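The shattering in Figure 1 can also be checked by brute force: sample many random oriented lines and record which of the 2^3 labelings of three non-collinear points they realize (an illustrative sketch, not part of the original note).

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

# Three points in general position (not collinear).
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# An "oriented line" labels a point 1 if w.x + b > 0 and 0 otherwise.
# Sample many random lines and record which of the 2^3 labelings appear.
realized = set()
for _ in range(20_000):
    w = rng.normal(size=2)
    b = rng.uniform(-2.0, 2.0)
    labels = tuple((pts @ w + b > 0).astype(int))
    realized.add(labels)

all_labelings = set(itertools.product([0, 1], repeat=3))
print(f"realized {len(realized)} of {len(all_labelings)} labelings")
print("shattered:", realized == all_labelings)
```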

Eq. 9 is a remarkable result which relates the number of training examples l, a fundamental property of the learning machine h, and the risk R, independent of the unknown distribution p(x, y). The bounds provided by Eq. 9 are relatively weak due to their stunning generality.

It is important to realize that with an independent testing sample one can evaluate the true risk arbitrarily well. This testing sample, by definition, is not known to the algorithm, so the bound is useful for the design of algorithms through structural risk minimization. However, neural networks and most other methods rely on an independent testing sample to aid in their design and validation. An independent testing sample is clearly a better way to assess the true risk of a multivariate algorithm; however, Eq. 9 does shed light on the issues of overtraining, suggests the number of training samples that are needed, and offers a tool to compare different algorithms.
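For example, a held-out test sample gives the usual estimate of the true mis-classification risk together with its binomial standard error (a minimal sketch with an assumed toy classifier and test sample):

```python
import numpy as np

def test_risk(f, x_test, y_test):
    """Estimate the true mis-classification risk on an independent test
    sample, with the binomial standard error on the estimate."""
    errors = np.abs(y_test - f(x_test))
    r_hat = errors.mean()
    se = np.sqrt(r_hat * (1.0 - r_hat) / len(y_test))
    return r_hat, se

# Toy test sample and classifier (assumptions for illustration).
rng = np.random.default_rng(4)
y_test = rng.integers(0, 2, size=10_000)
x_test = rng.normal(loc=2.0 * y_test, scale=1.0)
f = lambda x: (x > 1.0).astype(float)

r_hat, se = test_risk(f, x_test, y_test)
print(f"true risk estimate: {r_hat:.4f} +/- {se:.4f}")
```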

    5.1. VC Dimension of Neural Networks

In order to apply Eq. 9, one must determine the VC dimension of neural networks. This is a difficult problem in combinatorics and geometry, aided by algebraic techniques. Eduardo Sontag has an excellent review of these techniques and shows that the VC dimension of neural networks can, thus far, only be bounded fairly weakly [9]. In particular, if we define ρ as the number of weights and biases in the network, then the best bounds are ρ^2 < h < ρ^4. In a typical particle physics neural network one can expect 100 < ρ < 1000, which translates into a VC dimension as high as 10^12 and implies l > 10^13 for reasonable bounds on the risk. These bounds imply enormous numbers of training samples when compared to a typical training sample of 10^5. Sontag goes on to show that these shattered sets are incredibly special, and that the set of all shattered sets of cardinality greater than 2ρ + 1 is measure zero in general. Thus, perhaps a more relevant notion of the VC dimension of a neural network is given by 2ρ + 1.


6. GENETIC PROGRAMMING AND GENETIC ALGORITHMS

Genetic Programming (GP) and Genetic Algorithms (GAs) are based on a similar evolutionary metaphor in which individuals (potential solutions to the problem at hand) compete with respect to a user-defined performance measure. For new particle searches, the rate of Type II error, the significance, the exclusion potential, or G. Punzi's suggestion [10] are all reasonable performance measures. Ideally, one would use as a performance measure the same procedure that will be used to quote the results of the experiment. For instance, there is no reason (other than speed) that one could not include discriminating variables and systematic error in the optimization procedure (in fact, the author has done both).
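As a concrete (and deliberately simple) illustration of such performance measures, the sketch below evaluates the Type II error rate and the common approximate significance s/sqrt(s+b) for assumed signal and background yields; it is not the figure of merit of ref. [10].

```python
import numpy as np

def type_ii_error(eff_sig):
    """Rate of Type II error for a selection with signal efficiency eff_sig."""
    return 1.0 - eff_sig

def approximate_significance(s, b):
    """A common approximate significance, s / sqrt(s + b), for expected
    signal and background yields s and b after the selection."""
    return s / np.sqrt(s + b)

# Example: a selection keeping 30% of 100 expected signal events and
# 1% of 10000 expected background events (assumed numbers).
s, b = 0.30 * 100.0, 0.01 * 10_000.0
print(f"Type II error rate: {type_ii_error(0.30):.2f}")
print(f"approximate significance: {approximate_significance(s, b):.2f}")
```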

The use of GP for classification is fairly limited; however, it can be traced to the early works on the subject by Koza [11]. To the best of the author's knowledge, the first application of GP within particle physics will appear in [12]. The difference between the two algorithms is that GAs evolve a bit string which typically encodes parameters of a pre-existing program, function, or class of cuts, while GP directly evolves the programs or functions. For example, Field and Kanev [13] used Genetic Algorithms to optimize the lower and upper bounds of six 1-dimensional cuts on modified Fox-Wolfram shape variables. In that case, the phase-space region was a pre-defined 6-cube and the GA was simply evolving the parameters for the upper and lower bounds. The GP algorithm, on the other hand, is not constrained to a pre-defined shape or parametric form. Instead, the GP approach is concerned directly with the construction of an optimal, non-trivial phase-space region (i.e. an acceptance region W) with respect to a user-defined performance measure. GPs which only produce polynomial expressions form a vector space, which allows for a quick approximation of their VC dimension [9].
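A minimal genetic-algorithm sketch in the spirit of the Field-Kanev optimization, with assumed two-dimensional toy samples and s/sqrt(s+b) as the user-defined performance measure (this is an illustration only, not the PhysicsGP implementation of ref. [12]):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy event samples in d = 2 discriminating variables (an illustrative
# stand-in for the six shape variables of ref. [13]).
d = 2
sig = rng.normal(loc=1.0, scale=1.0, size=(2_000, d))
bkg = rng.normal(loc=0.0, scale=1.5, size=(20_000, d))

def fitness(cuts):
    """User-defined performance measure for one individual.  An individual
    encodes lower/upper bounds of a rectangular cut in each variable; the
    figure of merit is the approximate significance s / sqrt(s + b)."""
    lo = np.minimum(cuts[:, 0], cuts[:, 1])
    hi = np.maximum(cuts[:, 0], cuts[:, 1])
    s = np.all((sig > lo) & (sig < hi), axis=1).sum()
    b = np.all((bkg > lo) & (bkg < hi), axis=1).sum()
    return s / np.sqrt(s + b + 1e-9)

def evolve(pop_size=50, n_gen=40):
    # Each individual is a (d, 2) array of cut boundaries.
    pop = rng.uniform(-4.0, 4.0, size=(pop_size, d, 2))
    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])
        new_pop = [pop[np.argmax(scores)].copy()]            # elitism
        while len(new_pop) < pop_size:
            # Tournament selection of two parents.
            i = rng.integers(pop_size, size=2)
            j = rng.integers(pop_size, size=2)
            pa = pop[i[np.argmax(scores[i])]]
            pb = pop[j[np.argmax(scores[j])]]
            # Uniform crossover followed by Gaussian mutation.
            mask = rng.random(size=(d, 2)) < 0.5
            child = np.where(mask, pa, pb) + rng.normal(0.0, 0.1, size=(d, 2))
            new_pop.append(child)
        pop = np.array(new_pop)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(scores)], scores.max()

best_cuts, best_fom = evolve()
print("best figure of merit:", round(float(best_fom), 2))
print("best cut boundaries:\n", np.sort(best_cuts, axis=1))
```

A GP variant would replace the fixed rectangular parametrization above by evolving the selection function itself.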

    7. CONCLUSIONS

Clearly multivariate algorithms will have an increasingly important role in high energy physics, which necessitates that the field develop a coherent formalism and carefully consider what it means for a method to be optimal. Statistical learning theory offers a formalism that is general enough to describe all of the common multivariate analysis techniques, and it provides interesting results relating the risk, the number of training samples, and the learning capacity of the algorithm. However, independent testing samples and the global constraint on the rate of Type I error place some strain on the risk formalism. Finally, when one takes into account limited training data and systematic errors, it is not clear that indirect methods are truly optimizing an experiment's sensitivity. Direct methods, such as Genetic Programming, force analysts to be more clear about what statistical statements they plan to make and remove an artificial boundary between the goals of the experiment and the optimization procedure of the algorithm.

    Acknowledgments

This work was supported by a graduate research fellowship from the National Science Foundation and US Department of Energy Grant DE-FG0295-ER40896.

    References

[1] V. Vapnik and A. J. Cervonenkis. The uniform convergence of frequencies of the appearance of events to their probabilities. Dokl. Akad. Nauk SSSR, 1968 (in Russian).

[2] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 2nd edition, 2000.

[3] A. Stuart, J. K. Ord, and S. Arnold. Kendall's Advanced Theory of Statistics, Vol. 2A (6th edition). Oxford University Press, New York, 1994.

[4] Search for the Standard Model Higgs boson at LEP. Phys. Lett., B565:61-75, 2003.

[5] D. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley and Sons Inc., 1992.

[6] K. Cranmer. Kernel estimation in high-energy physics. Comput. Phys. Commun., 136:198-207, 2001.

[7] A. Askew. Event selection with adaptive Gaussian kernels. In PhyStat2003, 2003.

[8] L. Holmstrom et al. A new multivariate technique for top quark search. Comput. Phys. Commun., 88:195-210, 1995.

[9] E. Sontag. VC dimension of neural networks. In C. M. Bishop, editor, Neural Networks and Machine Learning, pages 69-95, Berlin, 1998. Springer-Verlag.

[10] G. Punzi. Sensitivity of searches for new signals and its optimization. In PhyStat2003, 2003.

[11] J. R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, 1992.

[12] K. Cranmer and R. S. Bowman. PhysicsGP: A genetic programming approach to event selection. Submitted to Comput. Phys. Commun.

[13] R. D. Field and Y. A. Kanev. Using collider event topology in the search for the six-jet decay of top quark-antiquark pairs. hep-ph/9801318, 1997.
