Statistics
Statistics is . . . . .
making a silk purse out of a pig's ear,
i.e. getting reliable results from data which is never quite as good as you hoped it was going to be.
N.B. The book used for these notes was “Statistics” by R J Barlow, Wiley 1989.
How long is a piece of string?
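N measurements $x_1, \ldots, x_N$ of the length of a piece of string.
Moments of distribution:
mean: $\bar{x} = \frac{1}{N}\sum_i x_i$
variance: $V = s^2 = \frac{1}{N}\sum_i (x_i - \bar{x})^2$
We suppose that $\bar{x}$ is an estimator of the string length and $s^2$ an estimator of the measurement errors.
An alternative expression for the variance is $s'^2 = \frac{1}{N-1}\sum_i (x_i - \bar{x})^2$.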
A good estimator
should be unbiased: its expectation value should equal the true value,
should be consistent: its limit as N → ∞ should be the true value, and
should be efficient: its variance should be small.
(It turns out that the mean satisfies these criteria, but the variance $s^2 = \sum_i (x_i - \bar{x})^2 / N$ does not: it is biased. However, the expression $s'^2 = \sum_i (x_i - \bar{x})^2 / (N-1)$ is unbiased.)
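The bias of the 1/N variance can be seen in a quick simulation. The sketch below is only an illustration: the true length (10.0), the error size (sigma = 2.0), the sample size and the function names are all assumptions chosen for the demonstration, not values from the notes.

```python
import random

def variance_estimates(sample):
    """Return (s2, s2_prime): the 1/N and the 1/(N-1) variance estimates."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq = sum((x - mean) ** 2 for x in sample)
    return sum_sq / n, sum_sq / (n - 1)

def average_estimates(n_points=5, n_trials=20000, sigma=2.0, seed=1):
    """Average both estimators over many simulated experiments."""
    rng = random.Random(seed)
    total_s2 = total_s2_prime = 0.0
    for _ in range(n_trials):
        sample = [rng.gauss(10.0, sigma) for _ in range(n_points)]
        s2, s2_prime = variance_estimates(sample)
        total_s2 += s2
        total_s2_prime += s2_prime
    return total_s2 / n_trials, total_s2_prime / n_trials

biased, unbiased = average_estimates()
print(f"true variance 4.0: mean s^2 = {biased:.2f}, mean s'^2 = {unbiased:.2f}")
```

With only 5 points per experiment the 1/N estimate averages to roughly (N-1)/N of the true variance, while the 1/(N-1) estimate averages to the true variance.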
But what do we mean by the “true value”?
The population model
Set up a model of the measurement, where each observation is subject to an error $e_i$ which is a random variable of mean 0 and variance V, and which added to the true value $\mu$ gives $x_i$.
This model dataset has an infinite number of randomly-distributed observations from which our real dataset has been sampled.
The expectation value $\langle \bar{x} \rangle$ is evidently $\mu$. Because we know the probability distribution of the $x_i$ we can answer questions like "if I repeat this experiment 1000 times, what will be the magnitude and distribution of the values of $\bar{x} - \mu$?" and so deduce the accuracy of $\bar{x}$.
Statistical inference
This is an example of the process of statistical inference, where we set up a model of the measurement and then deduce its properties using suitable estimators. For more complicated models than the one used here we might use a fitting process to deduce the values of several parameters of the model, but the process is essentially the same.
Conceptually, we can repeat the measurement as often as we like. This allows us to answer the question, “what is the probability that I will get a value of $\bar{x}$ in the range a to b” by repeating the (thought) experiment many times and finding the fraction which give a result in this range. This is the frequentist approach to defining the probability of an outcome.
The obvious problem is that the idea is highly artificial: no one would really try to answer these questions by actually repeating the experiment—sometimes in astronomy that is not even possible in principle. Observations of a classical gamma-ray burst could not be repeated because that star no longer exists!
On the other hand it defines clearly what we mean by probability, which otherwise is not easy.
The Calculus of Probabilities
Suppose a certain “event” has two possible outcomes A and B with probabilities P(A) and P(B). Also let P(A|B) be the probability of A given B.
Then, if A and B are mutually exclusive
P(A) + P(B) = 1
P(A & B) = 0
P(A + B) = 1
If A and B are independent
P(A|B) = P(A)
Otherwise, more generally,
P(A + B) = P(A) + P(B) - P(A & B)
and P(A & B) = P(A) P(B|A) = P(B) P(A|B)
The last expression can be re-arranged to give Bayes’ Theorem
P(A|B) = P(A) P(B|A) / P(B)
These axioms are the basis of the mathematical theory of probability.
The Bayesian approach
Suppose P(m|d) is the probability of our model, given the observed data (the “posterior” probability), then by Bayes’ Theorem we can write
P(m|d) = P(m) P(d|m) / P(d)
where P(m) is the probability of the model before we take into account our new measurements (the “prior” probability),
P(d|m) is the probability of getting our particular dataset if the model is true,
and P(d) is a normalisation constant which makes the sum of P(m|d) = 1, taken over all possible models.
Now P(m|d) is just exactly what we would like to know: going back to our piece of string, $P(\mu|\bar{x})$ is much better than $P(\bar{x}|\mu)$, which is what we get from the frequentist analysis. But nobody knows what this probability really means, and they attach different names to it (propensity, plausibility, confidence) which make it plain that it is somewhat subjective, unlike the frequentist probability which has an objective meaning.
In addition it is usually difficult to know what value to give the prior; normally it is given some rather bland value, implying little prior knowledge.
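As a minimal sketch of how prior, likelihood and normalisation combine, the example below applies Bayes’ Theorem to three candidate models of a photon-counting rate. The candidate rates, the flat prior and the counts are all hypothetical, and using Poisson statistics for P(d|m) is an assumption made for the illustration.

```python
import math

def poisson_pmf(r, mu):
    """Poisson probability of observing r events when the mean is mu."""
    return math.exp(r * math.log(mu) - mu - math.lgamma(r + 1))

def posterior(models, prior, data):
    """P(m|d) = P(m) P(d|m) / P(d) over a discrete list of candidate rates."""
    likelihoods = [math.prod(poisson_pmf(r, mu) for r in data) for mu in models]
    unnormalised = [p * like for p, like in zip(prior, likelihoods)]
    evidence = sum(unnormalised)  # P(d): makes the posterior sum to 1
    return [u / evidence for u in unnormalised]

models = [2.0, 4.0, 6.0]           # hypothetical candidate mean rates
prior = [1 / 3, 1 / 3, 1 / 3]      # bland prior: no model preferred
data = [3, 5, 4, 6]                # hypothetical observed counts
post = posterior(models, prior, data)
for mu, p in zip(models, post):
    print(f"rate {mu}: posterior {p:.3f}")
```

Because the prior is flat, the ranking of the models is decided entirely by the likelihood P(d|m).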
Probability theory and statistical inference
[Diagram: model space and data space, with the true model marked in model space.]
Probability theory (from model space to data space): how do we calculate the probability P(d|m) of getting a given dataset, from a knowledge of the model?
Statistical inference (from data space to model space): what can we say about the probability P(m|d) of the model or its parameters from the properties of our dataset?
Parameter estimation
Probability distributions
The specification of the model will tell us the distribution function which we must use to calculate the probability P(r) of any particular outcome r for a measurement (e.g. an error $e_i$ or a number of photons).
For example, if we are observing random events which occur at a mean rate of $\mu$ per sec, then we must use a Poissonian distribution with that mean value:
$P_P(r; \mu) = \frac{\mu^r \exp(-\mu)}{r!}$
If we are measuring the heights of students in a class, with mean $\mu$ and variance $\sigma^2$, we must use a Gaussian distribution with those parameters (Central Limit Theorem):
$P_G(x; \mu, \sigma) = \frac{\exp(-(x-\mu)^2 / 2\sigma^2)}{\sigma\sqrt{2\pi}}$
Note that r is discrete but x is a continuous variable, so $P_G$ is a probability density:
$dP_G = P_G \, dx$
Poisson distribution
The mean $\langle r \rangle = \mu$, of course, but in addition the variance is also $\mu$.
Gaussian distribution
For large $\mu$, the Poisson distribution tends to a Gaussian of mean $\mu$ and variance $\mu$.
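The approach to a Gaussian can be checked numerically. This sketch compares the Poisson probabilities with a Gaussian of the same mean and variance and reports the largest discrepancy as a fraction of the Gaussian peak height; the three mean values and the function names are arbitrary choices for the illustration.

```python
import math

def poisson_pmf(r, mu):
    """Poisson probability of r events for mean mu."""
    return math.exp(r * math.log(mu) - mu - math.lgamma(r + 1))

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def max_relative_difference(mu):
    """Largest |Poisson - Gaussian| over r = 0 .. 3*mu + 10, divided by the Gaussian peak."""
    sigma = math.sqrt(mu)  # Poisson variance equals the mean
    peak = gaussian_pdf(mu, mu, sigma)
    r_max = int(3 * mu) + 10
    worst = max(abs(poisson_pmf(r, mu) - gaussian_pdf(r, mu, sigma))
                for r in range(r_max + 1))
    return worst / peak

differences = {mu: max_relative_difference(mu) for mu in (2.0, 10.0, 100.0)}
for mu, diff in differences.items():
    print(f"mean {mu}: largest difference = {diff:.3f} of the peak")
```

The discrepancy shrinks as the mean grows, which is the behaviour described above.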
The χ² distribution
Another important distribution is that of χ², the statistic often used in fitting. This is
$P(\chi^2; n) = \frac{2^{-n/2}}{\Gamma(n/2)} \, \chi^{n-2} \exp(-\chi^2/2)$
where n = N - m is the number of degrees of freedom (N is the number of data points fitted, and m the number of parameters optimised) and Γ is the gamma function.
The Central Limit Theorem
It is noticeable that Poissonian and χ² distributions tend to become more symmetrical and to look more like a Gaussian at large $\mu$ or n, respectively; they do in fact tend to Gaussians.
Another important case is where the distribution of a quantity (like the heights of a population) depends on a number of independent factors; it then tends to be Gaussian-distributed.
The Central Limit Theorem states that any variable which is the sum of a large number of independent factors will be approximately Gaussian-distributed.
Model-fitting with Maximum Likelihood
Calculate P(d|m) for the particular case of models of photon-counting.
This is usually called the likelihood of the data; the only difference is the normalisation.
$L(d; a) = P(d|m) = \prod_i P(x_i; a)$
so
$\ln L = \sum_i \ln P(x_i; a)$
Next we maximise the likelihood by using the values of a for which
$\partial \ln L / \partial a = 0$
Example:
Model: constant source of counting rate a.
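Since the statistics are Poissonian, we have
$P(x_i; a) = a^{x_i} e^{-a} / x_i!$
$\ln L = \sum_i \left( x_i \ln a - a - \ln(x_i!) \right)$
$\partial \ln L / \partial a = \sum_i \left( x_i / a - 1 \right) = 0$
$a = \sum_i x_i / N = \bar{x}$
So the Maximum Likelihood value is just the mean.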
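The result can be checked by brute force. The sketch below evaluates the Poisson ln L for a constant rate on a fine grid and picks the maximum; the counts are hypothetical and the grid search is just the simplest possible stand-in for a real optimiser.

```python
import math

def poisson_log_likelihood(a, counts):
    """ln L = sum_i (x_i ln a - a - ln x_i!) for a constant rate a."""
    return sum(x * math.log(a) - a - math.lgamma(x + 1) for x in counts)

def maximise_on_grid(counts, lo=0.1, hi=20.0, steps=19900):
    """Return the grid value of a with the largest ln L (grid spacing 0.001)."""
    best_a = lo
    best_lnl = poisson_log_likelihood(lo, counts)
    for k in range(1, steps + 1):
        a = lo + (hi - lo) * k / steps
        lnl = poisson_log_likelihood(a, counts)
        if lnl > best_lnl:
            best_a, best_lnl = a, lnl
    return best_a

counts = [3, 7, 4, 6, 5, 2, 8]          # hypothetical counts per interval
a_hat = maximise_on_grid(counts)
sample_mean = sum(counts) / len(counts)
print(f"ML rate = {a_hat:.3f}, sample mean = {sample_mean:.3f}")
```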
In reality . . .
Of course the model is likely to be much more complicated,
e.g. a point source of strength a at position $(x_0, y_0)$ in the field of view, together with a background b, would be modelled by
$m(x, y) = b + a \, h(x - x_0, y - y_0)$
where h is the point spread function of the image. Now ln L has to be maximised with respect to 4 parameters, a, b, $x_0$ and $y_0$, so it must be done numerically.
Note that with ML/Poissonian statistics it is never correct to subtract the background before fitting as it changes the statistics.
Finally, in “real life” (i.e. LAT)
The source model is considered as: [equation not preserved in this transcript]
This model is folded with the Instrument Response Functions (IRFs) to obtain the predicted counts in the measured quantity space (E’, p’, t’): [equation not preserved in this transcript]
The two symbols defined with these equations are the combined IRF and the orientation vector of the spacecraft. The integral is performed over the Source Region, i.e. the sky region encompassing all sources contributing to the Region-of-Interest (ROI).
Extended ML fitting
The above formulation of the likelihood is correct only if the total number of photon events is predetermined by stopping the exposure when a desired value has been reached. In practice of course we normally collect events until a desired exposure time is reached in which case the total number of events is a parameter of the model. This changes the form of lnL slightly; the main effect is to increase the estimated errors.
Bayesian Model-fitting
From Bayes’ Theorem, P(m|d) = P(m) P(d|m) / P(d)
If we have no prior information about the model parameters, P(m) will be uniform over the parameter space and P(d) is only a normalising factor, so the parameter values obtained from the ML analysis, which maximise P(d|m), will also maximise P(m|d) and so are the Bayesian values.
χ² (Least squares) Model-fitting
If our data is binned
and if the number of photon events per bin is large enough (>5, 10, 20, say)
then we can replace the Poisson distribution in the model by a Gaussian distribution
$P(x_i; a) = \frac{e^{-(x_i - a)^2 / 2\sigma^2}}{\sigma\sqrt{2\pi}}$
and for a one-parameter model
$\ln L = -\sum_i (x_i - a)^2 / 2\sigma^2$ + const
For a multi-parameter model
$\ln L = -\sum_i (x_i - m_i)^2 / 2\sigma_i^2$ + const
where $m_i$ and $\sigma_i^2$ are the predicted mean and variance in the ith bin.
Maximising ln L is equivalent to minimising
$\chi^2 = \sum_i (x_i - m_i)^2 / \sigma_i^2$
so Least Squares fitting is just ML fitting with large numbers of events.
In fact, in place of ln L, the Cash statistic C = -2 ln L can be used, which makes the correspondence even closer.
Pros and cons of χ²
Cons:
(i) Data must be binned, reducing resolution.
(ii) What value to use for the estimate of $\sigma_i^2$? Commonly, $x_i$ is used but this introduces a bias. Other choices have been proposed.
As an example, Keith Arnaud did 1000 simulations of a Chandra power law spectrum of slope 1.7 with 1000 events in the spectrum and fitted this data using different fit statistics to get the following results for the fitted slope:
Standard χ²    1.41 (-0.15, +0.16)
Gehrels χ²     1.88 (-0.22, +0.25)
Churazov χ²    1.69 (-0.15, +0.16)
ML             1.70 (-0.13, +0.14)
The superiority of ML is obvious; it is partly due to this issue of the estimate of the variance, and partly to the use of the approximate distribution function.
Pros:
χ² has one important strength, however: its distribution is known, which allows an estimate to be made of the goodness-of-fit of the model.
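The bias from using the observed counts as the variance estimate can be shown for the simplest model, a constant rate a. Minimising $\sum_i (x_i - a)^2 / x_i$ analytically gives $a = N / \sum_i (1/x_i)$, the harmonic mean of the counts, which never exceeds the arithmetic mean that ML returns. The counts below are hypothetical, and this toy case is only an illustration of the direction of the bias, not a reproduction of the simulations quoted above.

```python
def chi2_constant_fit(counts):
    """Best-fit constant when each bin's variance is estimated by its own count.

    Setting d/da sum (x_i - a)^2 / x_i = 0 gives a = N / sum(1 / x_i).
    All counts must be positive for this variance estimate to be usable.
    """
    return len(counts) / sum(1.0 / x for x in counts)

def ml_constant_fit(counts):
    """Poisson maximum-likelihood estimate of a constant rate: the sample mean."""
    return sum(counts) / len(counts)

counts = [18, 25, 16, 22, 19, 27, 14, 21]   # hypothetical counts per bin
a_chi2 = chi2_constant_fit(counts)
a_ml = ml_constant_fit(counts)
print(f"chi-squared fit (variance = x_i): {a_chi2:.2f}")
print(f"ML fit: {a_ml:.2f}")
```

Bins that fluctuate low are given a small variance and hence too much weight, which pulls the χ² estimate downwards.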
Calculating Optimal Parameter Values
The calculation of maximal/minimal statistics is usually done numerically using a standard computer package.
BEWARE!!
Suppose we need to minimise a statistic S using a model with m free parameters. The simplest method is, starting from a suitably-chosen point on the S-m surface, to calculate the direction of steepest descent and take a small step in that direction. Repeating this procedure, one will eventually arrive at a minimum.
The problem is that unless m is small, the S-m surface may well have multiple minima and the one found may be only a local minimum.
Always repeat the calculation from a large range of initial parameters and make sure the fit always converges on the same solution.
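The danger of local minima can be sketched with a one-parameter toy statistic. The function below is a hypothetical stand-in for S with two minima; steepest descent with a fixed small step finds whichever minimum is downhill from the starting point, so only a spread of starting values reveals the global one.

```python
def fit_statistic(p):
    """Hypothetical fit statistic with a local and a global minimum."""
    return p ** 4 - 3.0 * p ** 2 + p

def gradient(p):
    """Derivative of fit_statistic with respect to the parameter p."""
    return 4.0 * p ** 3 - 6.0 * p + 1.0

def steepest_descent(start, step=0.01, n_steps=5000):
    """Repeatedly take a small step in the downhill direction."""
    p = start
    for _ in range(n_steps):
        p -= step * gradient(p)
    return p

def multi_start(starts):
    """Run the descent from every starting point and keep the lowest minimum."""
    minima = [steepest_descent(s) for s in starts]
    best = min(minima, key=fit_statistic)
    return minima, best

minima, best = multi_start([-2.0, -0.5, 0.5, 2.0])
for p in minima:
    print(f"converged to p = {p:.3f}, S = {fit_statistic(p):.3f}")
print(f"best solution: p = {best:.3f}")
```

Here two of the four starting points end in the shallower minimum, which is exactly the situation the warning above describes.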
Confidence intervals
Suppose we make measurements of the intensity of a source whose sample mean is $\bar{x}$, with estimated standard deviation (of the mean, not the dataset) $\sigma$. Our best estimate of the source strength is $\bar{x} \pm \sigma$.
If we assume the dataset has a Gaussian distribution (e.g. by Central Limit Theorem) we can set symmetrical limits on either side of the mean such that the probability that $\bar{x}$ lies between them has any desired value, say 90%, corresponding to $\pm 1.65\sigma$.
This gives the 90% confidence interval, i.e. the range within which $\bar{x}$ should lie in 90% of similar experiments.
We now assume that this means that the true value $\mu$ will lie within the range $\bar{x} \pm 1.65\sigma$ in 90% of experiments.
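The factor for any confidence level follows from the inverse of the Gaussian cumulative distribution. A minimal sketch using the standard library is given below; the example mean and error are hypothetical.

```python
from statistics import NormalDist

def two_sided_z(confidence):
    """Number of standard deviations enclosing the given central probability."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)

def confidence_interval(mean, sigma_mean, confidence=0.90):
    """Symmetric Gaussian confidence interval about the sample mean."""
    z = two_sided_z(confidence)
    return mean - z * sigma_mean, mean + z * sigma_mean

z90 = two_sided_z(0.90)
low, high = confidence_interval(12.3, 0.4)   # hypothetical mean and error
print(f"90% corresponds to +/- {z90:.2f} sigma: interval {low:.2f} to {high:.2f}")
```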
Reverend Bayes is back!!
This is a Bayesian statement based on a uniform prior.
Suppose we estimate a parameter m to be x, and that we have an estimate of the error $\sigma$. We assume that x is Gaussian-distributed and use a prior P(m) = 1:
$P(m|x) = \frac{e^{-(x-m)^2 / 2\sigma^2}}{\sigma\sqrt{2\pi}}$
Note that this is correctly normalised: m must have a value between -∞ and ∞ and P integrated over this range is 1.
Now suppose we know m must be positive, so we use a prior
P(m) = 1, m ≥ 0
P(m) = 0 otherwise
and now we must re-normalise so that
$P(m|x) = \frac{e^{-(x-m)^2 / 2\sigma^2}}{\int_0^\infty e^{-(x-m')^2 / 2\sigma^2} \, dm'}$, m ≥ 0
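One way to use the re-normalised posterior is to find an upper limit U such that the probability of 0 ≤ m ≤ U equals the desired confidence. The sketch below does this with the standard library; taking the interval as running from zero up to U is an assumption made here (other conventions for placing the interval exist), and the measurement values are hypothetical.

```python
from statistics import NormalDist

def upper_limit_positive(x, sigma, confidence=0.90):
    """Upper limit U with P(0 <= m <= U | x) = confidence.

    The posterior is a Gaussian in m centred on x with width sigma,
    cut off at m = 0 and re-normalised over m >= 0.
    """
    unit = NormalDist()
    below_zero = unit.cdf(-x / sigma)   # Gaussian mass excluded by the prior
    target = below_zero + confidence * (1.0 - below_zero)
    return x + sigma * unit.inv_cdf(target)

def enclosed_probability(x, sigma, upper):
    """Posterior probability between 0 and upper, used as a self-check."""
    unit = NormalDist()
    below_zero = unit.cdf(-x / sigma)
    return (unit.cdf((upper - x) / sigma) - below_zero) / (1.0 - below_zero)

u_pos = upper_limit_positive(0.5, 1.0)    # hypothetical measurement 0.5 +/- 1.0
u_neg = upper_limit_positive(-1.0, 1.0)   # a negative measurement still gives a positive limit
print(f"upper limits: {u_pos:.3f} and {u_neg:.3f}")
```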
Dealing with negative intensities
We can use this result to advantage in those cases where the confidence interval includes physically-nonsensical negative values, e.g. an intensity of 0.06±0.10 units.
Using the re-normalised posterior
$P(m|x) = \frac{e^{-(x-m)^2 / 2\sigma^2}}{\int_0^\infty e^{-(x-m')^2 / 2\sigma^2} \, dm'}$
gives the 90% confidence interval as 0.0 to 0.19 instead of -0.04 to 0.16.
See Barlow, p130
Confidence Intervals in Model-fitting
The fit statistic gives directly the relative probability attached to any set of parameter values; this is true whether ln L or χ² is used.
If the prior is bland, this is also the model probability given the observed data.
The parameter for which an error estimate is required is varied in steps, whilst at each stage re-optimising the rest, until the change ΔS in the fit statistic S reaches a required value, say 2.71.
Δχ² for one parameter is distributed as χ²(1); for two parameters as χ²(2).
Confidence, %          68     90     95     99
ΔS (one parameter)     1.0    2.71   3.84   6.63
ΔS (two parameters)    2.30   4.61   5.99   9.21
For ML, Δ(ln L) is distributed as -χ²/2 provided the number of data points n is large, so these values can be multiplied by -1/2.
If two parameters are varied at once, the line ΔS = 2.3 is an ellipse which traces out the 68% confidence region. If the axes of the ellipse are not parallel to the parameter axes, then the two parameters are correlated.
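The tabulated thresholds can be reproduced from two closed forms: for one parameter the χ²(1) threshold is the square of the two-sided Gaussian factor, and for two parameters the χ²(2) tail is exp(-ΔS/2), so ΔS = -2 ln(1 - confidence). The sketch below is an illustration with hypothetical function names; 68% is taken as the one-sigma value 0.6827.

```python
import math
from statistics import NormalDist

def delta_s_one_parameter(confidence):
    """Chi-squared(1) threshold: the square of the two-sided Gaussian z."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * z

def delta_s_two_parameters(confidence):
    """Chi-squared(2) threshold from its exponential tail exp(-delta_s / 2)."""
    return -2.0 * math.log(1.0 - confidence)

levels = (0.6827, 0.90, 0.95, 0.99)
table = [(c, delta_s_one_parameter(c), delta_s_two_parameters(c)) for c in levels]
for conf, one, two in table:
    print(f"{100 * conf:5.1f}%  one parameter {one:.2f}  two parameters {two:.2f}")
```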
Correlated variables
Goodness of Fit
It is obvious from the expression
$\chi^2 = \sum_i (x_i - m_i)^2 / \sigma_i^2$
that we would expect the minimum value of χ² to be close to N, and this is true. The distribution of χ² is
$P(\chi^2; n) = \frac{2^{-n/2}}{\Gamma(n/2)} \, \chi^{n-2} \exp(-\chi^2/2)$
where n = N - m is the number of degrees of freedom, N is the number of data points and m is the number of parameters in the model.
This is true for all n, but if n is large enough it approaches a Gaussian of mean n and variance 2n.
This assumes the model is a good one and so can be used as a test of that hypothesis, by comparing $\chi^2_{min}$, i.e. the best-fit value, with the expected value. If n is large, we compare the “reduced” $\chi^2_{min} / n$ with 1.
Good agreement implies the model is a good description of the data. If the reduced χ² is too large, the model is poor, and the size of the excess indicates how unlikely such a value would be by chance.
Alternatively, a low value implies that the errors are overestimated.
This is a very valuable property of χ² fitting and, perhaps, the only reason to consider using it.
Unfortunately, there is no ML equivalent. The usual solution is to run e.g. 1000 Monte Carlo simulations of the experiment.
For example, a bremsstrahlung fit to a 20 channel spectrum gives $\chi^2_{min} = 35$.
There are 2 parameters so there are 18 dof.
$\chi^2_{min}$ is suspiciously high: P ~ 0.01 by chance.
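The chance probability in the example can be computed without tables. For an even number of degrees of freedom n the χ² tail has the closed form $e^{-\chi^2/2} \sum_{j=0}^{n/2-1} (\chi^2/2)^j / j!$; the sketch below uses it, so it is restricted to even n (a general routine would need the incomplete gamma function). The function name is hypothetical.

```python
import math

def chi2_tail_even_dof(chi2, n):
    """P(chi-squared >= chi2) for an even number n of degrees of freedom."""
    if n <= 0 or n % 2 != 0:
        raise ValueError("this closed form needs a positive even number of dof")
    half = chi2 / 2.0
    term = 1.0      # j = 0 term of the sum
    total = 1.0
    for j in range(1, n // 2):
        term *= half / j        # builds (chi2/2)^j / j! incrementally
        total += term
    return math.exp(-half) * total

p_value = chi2_tail_even_dof(35.0, 18)
print(f"P(chi-squared >= 35 | 18 dof) = {p_value:.4f}")
```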
How many Parameters?
Adding more adjustable parameters will usually make a model fit better.
Example: try adding a power law component (2 extra parameters) to the bremsstrahlung.
$\chi^2_{min}$ drops to 18 with 16 dof, obviously a better fit.
How can you know when to stop?
The first answer is just to stop when you get an acceptable reduced χ², but the F test gives a more rigorous answer.
The F test
Suppose an initial model with $\nu_i$ degrees of freedom gives a χ² of $S_i$ and after fitting extra parameters one gets $S_f$ with $\nu_f$ dof.
Calculate
$F = \frac{S_i - S_f}{S_f} \cdot \frac{\nu_f}{\nu_i - \nu_f}$
and look up the significance of the result in a table of the F-distribution.
For the example, F = (35-18)/18 x 16/2 = 7.6.
The F-test table shows that the probability of exceeding 6.23 by chance is 0.01, so the power law is significant at 99% confidence.
(It is essential to note that books usually give this test in its simpler, original form and the tables to go with it. You need tables of the distribution of F, as in Bevington, “Data reduction and error analysis for the physical sciences”.)
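The worked example can be reproduced in a few lines. When exactly two parameters are added, the F distribution with (2, d) degrees of freedom has the closed-form tail $(1 + 2F/d)^{-d/2}$, so no table is needed in that special case; for other numbers of added parameters this shortcut does not apply. The function names are hypothetical.

```python
def f_statistic(s_i, nu_i, s_f, nu_f):
    """F = (S_i - S_f) / S_f * nu_f / (nu_i - nu_f)."""
    return (s_i - s_f) / s_f * nu_f / (nu_i - nu_f)

def f_tail_two_extra_parameters(f, nu_f):
    """P(F > f) for an F distribution with (2, nu_f) degrees of freedom."""
    return (1.0 + 2.0 * f / nu_f) ** (-nu_f / 2.0)

f_value = f_statistic(35.0, 18, 18.0, 16)
chance = f_tail_two_extra_parameters(f_value, 16)
print(f"F = {f_value:.2f}, chance probability = {chance:.4f}")
```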
Conditions for using the F-test
The F-test can only validly be used if two conditions are satisfied:
the simpler model is nested within the more complex one
the null hypothesis does not lie along a boundary of the parameter space (as it does for an added spectral line, point source or power law)
See Protassov et al, Ap J 2002, 571, 545
The Maximum Likelihood “Test Statistic”
The F-statistic is closely related to the “test statistic” TS used to determine the significance of a possible point source (see below):
TS = 2 (ln L1 - ln L2)
($S_i - S_f$ is just 2 Δ(ln L), and if the model is a good fit $S_f / \nu_f \approx 1$.)
If N is large, TS is approximately distributed as χ²($\nu_i - \nu_f$).
For a point source, this is χ²(4) (position, flux, spectral index) or χ²(3) (position, intensity).
TS = 25 for 4 additional parameters corresponds to P = $2 \times 10^{-5}$.
Rule of thumb: TS = σ² (it is actually 4.1σ in this case).
There are 20000 PSFs in the LAT sky so 0.4 spurious sources are expected.
Hypothesis testing
We started by considering the problem of parameter estimation
We have now moved into hypothesis testing—e.g. that a spectrum includes a significant power law component.
An important tool of hypothesis testing is the Null Hypothesis, useful where a testable form of the hypothesis cannot be set up.
The Null Hypothesis
Suppose that we count photons from a source in constant intervals, and find the mean number of events in each is $\mu$. Then in one interval we measure some (greater) number, l. Is this due to a flare in the source?
First we formulate the null hypothesis: there is no flare, and the measurement l is simply drawn from the same population as all the others.
The probability of obtaining a count of l or greater is
$P(\geq l; \mu) = \sum_{r=l}^{\infty} P_P(r; \mu)$
which for $\mu$ = 10, l = 20 is about 0.35%.
How significant this result is depends on how many measurements in all we have made.
In 1000 intervals, 3 or 4 would give this value by chance alone.
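The tail probability under the null hypothesis can be evaluated directly by summing the Poisson terms below l and subtracting from 1. The sketch below does this for the flare example; the function names are hypothetical.

```python
import math

def poisson_pmf(r, mu):
    """Poisson probability of r events for mean mu."""
    return math.exp(r * math.log(mu) - mu - math.lgamma(r + 1))

def poisson_tail(l, mu):
    """P(count >= l) = 1 - sum of the Poisson terms for r = 0 .. l-1."""
    return 1.0 - sum(poisson_pmf(r, mu) for r in range(l))

def expected_chance_occurrences(l, mu, n_intervals):
    """Number of intervals expected to reach l or more by chance alone."""
    return n_intervals * poisson_tail(l, mu)

tail = poisson_tail(20, 10.0)
print(f"P(>= 20 | mean 10) = {tail:.5f}")
print(f"expected in 1000 intervals: {expected_chance_occurrences(20, 10.0, 1000):.1f}")
```

The second function makes the point about the number of trials explicit: the same tail probability is far less impressive when many intervals have been examined.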
Source Searching
Searching an image for point sources is another example of hypothesis testing, where one is looking for pixels (or groups of pixels) which contain a significant excess of counts.
One estimates the background b from some selected large region of the image and compares this with the number of counts n in each pixel.
The significance of a hypothetical source is then (n-b)/√b in standard deviations, and the hypothesis that there is no source present can then be rejected at some level of confidence.
A group of pixels would be used to increase the sensitivity of the test when the image point spread function covers several pixels.
A maximum probability of ~$10^{-6}$ ("5σ") is usually taken as a realistic value when searching for point sources in an image which can easily have $10^5$ pixels.
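A minimal sketch of the per-pixel test follows. It assumes the background is large enough for the Gaussian approximation to hold and uses a one-sided Gaussian tail for the chance probability; the pixel counts are hypothetical.

```python
import math
from statistics import NormalDist

def source_significance(n, b):
    """Excess over background in standard deviations: (n - b) / sqrt(b)."""
    return (n - b) / math.sqrt(b)

def chance_probability(sigmas):
    """One-sided Gaussian probability of a fluctuation of at least this size."""
    return NormalDist().cdf(-sigmas)

def expected_false_sources(sigmas, n_pixels):
    """Number of pixels expected to exceed the threshold with no source present."""
    return n_pixels * chance_probability(sigmas)

significance = source_significance(150, 100)   # hypothetical pixel: 150 counts on a background of 100
print(f"significance = {significance:.1f} sigma")
print(f"false sources expected in 1e5 pixels: {expected_false_sources(significance, 100000):.3f}")
```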
Timing Analyses
Dave showed data from the Vela Pulsar using gtptest
Type of test: Chi-squared Test, 10 phase bins
Probability distribution: Chi-squared, 9 degrees of freedom
Test Statistic: 824.028880866426
Chance Probability Range: (0, 2.03757046903054e-99)
Type of test: Rayleigh Test
Probability distribution: Chi-squared, 2 degrees of freedom
Test Statistic: 46.2571601550502
Chance Probability Range: (9.02305042259081e-11, 9.02395272685048e-11)
Type of test: Z2n Test, 10 harmonics
Probability distribution: Chi-squared, 20 degrees of freedom
Test Statistic: 1511.03487971911
Chance Probability Range: (0, 2.0785338644267e-99)
Type of test: H Test, 10 maximum harmonics
Probability distribution: H Test-specific
Test Statistic: 1475.03487971911
Chance Probability Range: (0, 4e-08)
From Dave's timing talk
Testing Pulsation Significance – Output from gtptest on Vela
Pulsation tests
Chi-squared null hypothesis: all bins equal
Rayleigh : phases form a random walk
Z2n : pulsation is a sine wave with n harmonics
H Test : pulsation is a sine wave with automatic selection of n
Why have more than one?
Because the waveform can vary from one or two sharp peaks/cycle to a sine wave
The tests are most sensitive for waveforms which match their null hypothesis
Also, the last three tests do not require the data to be binned
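The unbinned tests can be sketched directly from a list of event phases. The code below assumes the usual definition of the statistic, $Z^2_n = \frac{2}{N} \sum_{k=1}^{n} \left[ (\sum_j \cos 2\pi k \phi_j)^2 + (\sum_j \sin 2\pi k \phi_j)^2 \right]$, with the Rayleigh test as the n = 1 case; it is an illustration, not the gtptest implementation, and the phase lists are artificial.

```python
import math

def z2n(phases, n_harmonics):
    """Z^2_n statistic for event phases given in the range 0 to 1."""
    n_events = len(phases)
    total = 0.0
    for k in range(1, n_harmonics + 1):
        c = sum(math.cos(2 * math.pi * k * p) for p in phases)
        s = sum(math.sin(2 * math.pi * k * p) for p in phases)
        total += c * c + s * s
    return 2.0 * total / n_events

def rayleigh_chance_probability(statistic):
    """Chi-squared tail for 2 degrees of freedom: exp(-statistic / 2)."""
    return math.exp(-statistic / 2.0)

flat = [i / 100 for i in range(100)]    # evenly spread phases: no pulsation
peaked = [0.25] * 100                   # every event at the same phase
print(f"flat: {z2n(flat, 1):.3g}, peaked: {z2n(peaked, 1):.1f}")
print(f"chance probability for 46.2571601550502: {rayleigh_chance_probability(46.2571601550502):.3e}")
```

The last line applies the 2-dof tail to the Rayleigh test statistic quoted in the gtptest output above.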