Statistics Statistics is..... making a silk purse out of a pig's ear, i.e. getting reliable results...

Statistics

Statistics is . . . . .

making a silk purse out of a pig's ear,

i.e. getting reliable results from data which is never quite as good as you hoped it was going to be.

N.B. The book used for these notes was “Statistics” by R J Barlow, Wiley 1989.

How long is a piece of string?

N measurements x1, . . . . . xN of the length of a piece of string Moments of distribution: mean and variance V = We suppose that is an estimator of the string length and an estimator of the

measurement errors.

NxxN i Nxx

N i22 )(

)1()()( 22' NxxsN i

a good estimatorshould be unbiased—its expectation value should equal the true value,should be consistent—the limit as N should be the true value andshould be efficient—its variance should be small.

(It turns out that the mean satisfies these criteria, but the variance does not—it is biased. However the expression is unbiased.)

But what do we mean by the “true value”?

x

The population model

Set up a model of the measurement, where each observation is subject to an error ei

which is a random variable of mean 0 and variance V, which added to gives xi.

This model dataset has an infinite number of randomly-distributed observations from which our real dataset has been sampled.

The expectation value < > is evidently . Because we know the probability distribution of the xi we can answer questions like " if I repeat this experiment 1000 times, what will be the magnitude and distribution of the values of - ?" and so to deduce the accuracy of .

x

xx

Statistical inference

This is an example of the process of statistical inference, where we set up a model of the measurement and then deduce its properties using suitable estimators. For more complicated models than the one used here we might use a fitting process to deduce the values of several parameters of the model, but the process is essentially the same.

Conceptually, we can repeat the measurement as often as we like. This allows us to answer the question, “what is the probability that I will get a value of in the range a to b” by repeating the (thought) experiment many times and finding the fraction which give a result in this range. This is the frequentist approach to defining the probability of an outcome.

x

The obvious problem is that the idea is highly artificial: no one would really try to answer these questions by actually repeating the experiment—sometimes in astronomy that is not even possible in principle. Observations of a classical gamma-ray burst could not be repeated because that star no longer exists!

On the other hand it defines clearly what we mean by probability, which otherwise is not easy.

The Calculus of Probabilities

Suppose a certain “event” has two possible outcomes A and B with probabilities P(A ) and P(B). Also let P(A B) be the probability of A given B.

Then, if A and B are mutually exclusive P(A) + P(B) = 1 P(A & B) = 0 P(A + B) = 1

If A and B are independent P(A B) = P(A)

Otherwise, more generally, P(A + B) = P(A) + P(B) – P(A & B)and P(A & B) = P(A) P(B A) = P(B) P(A B)

The last expression can be re-arranged to give Bayes’ TheoremP(A B) = P( A) P(B A) / P(B)

These axioms are the basis of the mathematical theory of probablility

The Bayesian approachSuppose P(m d) is the probability of our model, given the observed data (the

“posterior” probability), then by Bayes’ Theorem we can writeP(m d) = P( m) P(d m) / P(d)

where P( m) is the probability of the model before we take into account our new measurements (the “prior” probability),

P(d m) is the probability of getting our particular dataset if the model is true,

and P(d) is a normalisation constant which makes the sum of P(m d) =1, taken over all possible models.

Now P(m d) is just exactly what we would like to know—going back to our piece of string, P( x) is much better than P( x) which is what we get from the frequentist analysis. But nobody knows what this probability really means, and they attach different names—propensity, plausibility, confidence which make it plain it is somewhat subjective, unlike the frequentist probability which has an objective meaning.

In addition it is usually difficult to know what value to give the prior; normally it is given some rather bland value, implying little prior knowledge.

x

Probability theory and statistical inference

Model space data space Probability theory

Statistical inference

True model

statistical inference-- what can we say about the probability P(m d) of the model or its parameters from the properties of our dataset?

How do we calculate the probability P(d m) of getting a given dataset, from a knowledge of the model?

Parameter estimation

Probability distributions

The specification of the model will tell us the distribution function which we must use to calculate the probability P(r) of any particular outcome r for a measurement ( error ei or numbers of photons).

For example, if we are observing random events which occur at a mean rate of per sec, then we must use a Poissonian distribution with that mean value.

If we are measuring the heights of students in a class, with mean and variance 2, we must use a Gaussian distribution with those parameters (Central Limit Theorem).

!)exp();( rrP rP

2)2/)(exp(),;( 22 xxPG

Note that r is discrete but x is a continuous variable, so PG is a probability density:

dPG = PG. dx

For example, if we are observing random events which occur at a mean rate of per sec, then we must use a Poissonian distribution with that mean value.

!)exp();( rrP rP

Poisson distribution

The mean <r> = , of course, but in addition the variance also

Gaussian distribution

For large , the Poisson distribution tends to a Gaussian of mean and variance

The 2 distribution

Another important distribution is that of 2, the statistic often used in fitting. This is

where n=N-m is the number of degrees of freedom—N is the number of data points fitted, and m the number of parameters optimised—and is the

gamma function.

)2/()2/exp(2);( 222/2 nnP nn

The Central Limit Theorem

It is noticeable that Poissonian and 2 distributions tend to become more

symmetrical and to look more like a Gaussian at large or n,

respectively; they do in fact tend to Gaussians.

Another important case is where the distribution of a quantity (like the

heights of a population) depends on a number of independent factors;

it then always tends to be Gaussian-distributed.

The Central Limit Theorem states that any variable which depends on a

large number of independent factors will be Gaussian-distributed.

Model-fitting with Maximum Likelihood

Calculate P(d m) for the particular case of models of photon-counting.

Usually called the likelihood of the data—only difference is normalisation

L(d;a) = P(d m) =

so

);( axPi

i

));(ln(ln axPLi

i

Next we maximise the likelihood by using the values of a for which

0/ln aL

Example:

Model: constant source of counting rate a.

Since the statistics are Poissonian, we have

So the Maximum Likelihood value is just the mean.

xNxa

axaL

xaxaL

xaeaxP

ii

ii

iii

ixa

ii

/

0/1/ln

)!ln(lnln

!/.);(

In reality . . .

Of course the model is likely to be much more complicated,

e.g. a point source of strength a at position () in the field of view, together with a background b, would be modelled by

m(x, y) = b + a. h(x - , y - ) where h(a, b) is the point spread function of the image. Now lnL has to be minimised with respect to 4 parameters, a, b, and , so it must be done numerically.

Note that with ML/Poissonian statistics it is never correct to subtract the background before fitting as it changes the statistics.

Finally, in “real life” (i.e. LAT)

The source model is considered as:

This model is folded with the Instrument Response Functions (IRFs) to obtain the predicted counts in the measured quantity space (E’,p’,t’):

where

is the combined IRF. is the orientation vector of the spacecraft. The integral is performed over the Source Region, i.e. the sky region encompassing all sources contributing to the Region-of – Interest (ROI).

Extended ML fitting

The above formulation of the likelihood is correct only if the total number of photon events is predetermined by stopping the exposure when a desired value has been reached. In practice of course we normally collect events until a desired exposure time is reached in which case the total number of events is a parameter of the model. This changes the form of lnL slightly; the main effect is to increase the estimated errors.

Bayesian Model-fitting

From Bayes’ Theorem, P(m d) = P( m) P(d m) / P(d)

If we have no prior information about the model parameters, P( m) will be uniform over the parameter space and P(d) is only a normalising factor, so the parameter values that which were obtained from the ML analysis which maximise P(d m) will also maximise P(m d) and so are the Bayesian values.

2 (Least squares) Model-fittingIf our data is binned

and if the number of photon events per bin is large enough (>5, 10, 20, say)

then we can replace the Poisson distribution in the model by a Gaussian distribution

and for a one-parameter model22

2/)(

2/)(ln

)2./();(22

axL

eaxP

i i

axi

i

For a multi-parameter model , where mi and i

2 are the predicted mean and variance in the ith bin.

22 2/)(ln iii

i mxL

Maximising lnL is equivalent to minimising ,

so Least Squares fitting is just ML fitting with large numbers of events.

In fact, in place of lnL, the Cash statistic C = -2lnL can be used which makes the correspondence even closer.

222 /)( iii

i mx

Pros and cons of

Cons:(i) Data must be binned reducing resolution(ii) What value to use for the estimate of i

2?Commonly, xi is used but this introduces a bias Other choices have been proposed. As an example, Keith Arnaud did 1000

simulations of a Chandra power law spectrum of slope 1.7 with 1000 events in the spectrum and fitted this data using different fit statistics to get the following results for the fitted slope:

Standard 2 1.41 (-0.15, +0.16)Gehrels 2 1.88 (-0.22, +0.25)Churasov 2 1.69 (-0.15, +0.16)ML 1.70 (-0.13, +0.14)The superiority of ML is obvious; it is partly due to this issue of the estimate of the

variance, and partly to the use of the approximate distribution function.

Pros:2 has one important strength, however—its distribution is known which allows an estimate to be made of the goodness-of-fit of the model.

Calculating Optimal Parameter Values

The calculation of maximal/minimal statistics is usually done numerically using a standard computer package.

BEWARE!!

Suppose we need to minimise a statistic S using a model with m free parameters. The simplest method is, starting from a suitably-chosen point on the S-m surface, to calculate the direction of steepest descent and take a small step in that direction. Repeating this procedure, one will eventually arrive at a minimum.

The problem is that unless m is small, the S-m surface may well have multiple minima and the one found may be only a local minimum.

Always repeat the calculation from a large range of initial parameters and make sure the fit always converges on the same solution.

Confidence intervals

s.experimentsimilar of 90%in lie should ich within whrange thei.e. interval, confidence 90% thegives This

65.1 toingcorrespond 90%,say — valuedesiredany has them

between lies y thatprobabilit thesuch that and limits lsymmetricaset can we rem)Limit theo Centralby (e.g.on distributiGaussian a hasdataset theassume weIf

isstrength source theof estimatebest Our . dataset) not the mean, the(ofdeviation standard estimated with

ismean sample whosesource a ofintensity theof tsmeasuremen make weSuppose

xxxxx

xxx

xx

We now assume that this means that will lie within this range in 90% of experiments.

Reverend Bayes is back!!This is a Bayesian statement based on a uniform prior.

Suppose we estimate a parameter m to be x, and that we have an estimate of the error We assume that x is Gaussian-distributed and use a prior P(m) = 1

Note that this is correctly normalised—x must have a value between - ∞ and ∞ and P integrated over this range is 1.

We know m must be positive, so now we use a prior P(m) = 1, m ≥ 0= 0 otherwise

and now we must re-normalise so that

x > 0

2)|(

22 2/)( mxexmP

2/2'

)|(22

22

22 2/)(

0

2/)'(

2/)(

mx

mx

mx e

dxe

exmP

Dealing with negative intensities

We can use this result to advantage in those cases where the confidence interval includes physically-nonsensical negative values, e.g. an intensity of 0.06±0.10 units.

Using

gives the 90% confidence interval as 0.0 to 0.19 instead of -0.04 to 0.16.

See Barlow, p130

2/2'

)|(22

22

22 2/)(

0

2/)'(

2/)(

mx

mx

mx e

dxe

exmP

Confidence Intervals in Model-fittingThe fit statistic gives directly the relative probability attached to any set of

parameter values; this is true whether LnL or 2

If the prior is bland, this is also the model probability given the observed data. The parameter for which an error estimate is required is varied in steps, whilst at

each stage re-optimising the rest, until the change S in the fit statistic S reaches a required value, say 2.71

2 for one parameter is distributed as 2 (1); for two parameters as 2 (2) Confidence,% 68 90 95 99S (one parameter) 1.0 2.71 3.84 6.63S (two paras) 2.30 4.61 5.99 9.21

For ML, (lnL) is distributed as -2 /2 provided the number of data points n is large, so these values can be multiplied by -1/2..

If two parameters are varied at once, the line S =2.3 is an ellipse which traces out the 68% confidence region. If the axes of the ellipse are not parallel to the parameter axes, then the two parameters are correlated.

Correlated variables

Goodness of Fit

It is obvious from the expression

that we would expect the minimum value of 2 to be close to N and this is true—the distribution of 2 is

n = N-m is the number of degrees of freedom where N is the number of data points

and m is the number of parameters in the modelThis is true for all n, but if n is large enough it approaches a Gaussian of mean n

and variance 2n

222 /)( iii

i mx

)2/()2/exp(2);( 222/2 nnP nn

This assumes the model is a good one and so can be used as a test of that hypothesis, by comparing the “reduced” 2

min i.e. the best-fit value, with the expected value. If n is large, we compare the “reduced” 2

min / n with 1.

A good agreement implies the model is a good description of the data; if the reduced 2 is too large not only is the model poor, but the difference indicates how likely such a value is by chance.

Alternatively, a low value implies that the errors are overestimated.

This is a very valuable property of 2 fitting and, perhaps, the only reason to consider using it.

Unfortunately, there is no ML equivalent. The usual solution is to run e.g. 1000 Monte Carlo simulations of the experiment.

For example, a bremsstrahlung fit to a 20 channel spectrum gives 2min =35

There are 2 parameters so there are 18 dof.

2min is suspiciously high—P ~ 0.01 by chance

How many Parameters?

Adding more adjustable parameters will usually make a model fit better.

Example—try adding a power law component (2 extra parameters) to the bremsstrahlung

2min drops to 18 with 16 dof—obviously a better fit

How can you know when to stop?

The first answer is just to stop when you get an acceptable reduced 2, but the F test gives a more rigorous answer.

The F testSuppose an initial model with i degrees of freedom gives a 2 of Si and after

fitting extra parameters one gets Sf with f dof.

Calculate

and look up the significance of the result in a table of the F-distribution.

fi

f

f

fi

S

SSF

.

For the example, F = (35-18)/18 x 16/2 = 7.6.

The F-test table shows the chance of exceeding 6.23 by chance is 0.01 so the power law is significant at 99% confidence.

(It is essential to note that books usually give this test in its simpler, original form and the tables to go with it. You need tables of the distribution of F(as in Bevington, “Data reduction and error analysis for the physical sciences”)

Conditions for using the F-test

The F-test can only validly be used if two conditions are satisfied

the simpler model is nested within the more complex

the null hypothesis does not lie along a boundary in data space

(added spectral line, point source, power law)

See Protassov et al , Ap J 2002, 571, 545

The Maximum Likelihood “Test Statistic”

The F-statistic is closely related to the “test statistic” TS used to determine the significance of a possible point source, (see below)

TS = 2 ( lnL1 – lnL2 )If N is large, TS is approximately distributed as 2 (fi)

For a point source, this is 2 ()(position, flux, spectral index)or 2 ()position, intensity)

TS = 25 for 4 additional parameters corresponds to P = 2.10-5

Rule of thumb: TS = 2 (is actually 4.1 in this case)There are 20000 PSF’s in the LAT sky so 0.4 spurious sources expected.

1/fit good a is model theif and ln2just is f ffi SLSS

Hypothesis testing

We started by considering the problem of parameter estimation

We have now moved into hypothesis testing—e.g. that a spectrum includes a significant power law component.

An important tool of hypothesis testing is the Null Hypothesis, useful where a testable form of the hypothesis cannot be set up.

the Null hypothesis

Suppose that we count photons from a source in constant intervals, and find the mean number of events in each is . Then in one interval we measure some (greater) number, l. Is this due to a flare in the source?

First we formulate the null hypothesis-- there is no flare, the measurement l is simply drawn from the same population as all the others.

;lPl P

The probability of obtaining a count of l or greater is

which for =10, l=20 is 1%How significant this result is, depends on how many measurements in all we have made.

In 1000 intervals, 10 would give this value by chance alone

Source SearchingSearching an image for point sources is another example of hypothesis testing,

where one is looking for pixels (or groups of pixels) which contain a significant excess of counts.

One estimates the background b from some selected large region of the image and compares this with the number of counts n in each pixel.

The significance of a hypothetical source is then (n-b)/ √b in standard deviations, and the hypothesis that there is no source present can then be rejected at some level of confidence.

A group of pixels would be used to increase the sensitivity of the test when the image point spread function covers several pixels.

A maximum probability of ~10-6 ("5") is usually taken as a realistic value when searching for point sources in an image which can easily have 105 pixels.

Timing Analyses

Dave showed data from the Vela Pulsar using gtptest

Type of test: Chi-squared Test, 10 phase bins

Probability distribution: Chi-squared, 9 degrees of freedom

Test Statistic: 824.028880866426

Chance Probability Range: (0, 2.03757046903054e-99)

Type of test: Rayleigh Test



Chance Probability Range: (9.02305042259081e-11, 9.02395272685048e-11)

Type of test: Z2n Test, 10 harmonics



Chance Probability Range: (0, 2.0785338644267e-99)

Type of test: H Test, 10 maximum harmonics

Probability distribution: H Test-specific


Chance Probability Range: (0, 4e-08)

From Dave's timing talk

Testing Pulsation Significance – Output from gtptest on Vela

Pulsation tests

Chi-squared null hypothesis: all bins equal

Rayleigh : phases form a random walk

Z2n : pulsation is a sine wave with n harmonicsH Test :pulsation is a sine wave with automatic selection

of n

Why have more than one?

Because the waveform can vary from one or two sharp peaks/cycle to a sine wave

The tests are most sensitive for waveforms which match their null hypothesis

Also, the last three tests do not require the data to be binned

Statistics Statistics is..... making a silk purse out of a pig's ear, i.e. getting reliable results...

Documents

Transcript of Statistics Statistics is..... making a silk purse out of a pig's ear, i.e. getting reliable results...