NASSP Masters 5003F - Computational Astronomy - 2009

Lecture 24 – revision!
1. Random quantities.
2. Model fitting.
3. Signal detection.
4. Catalogs of sources.
5. Radio astronomy.
6. Polarized signals.
7. Fourier matters.
8. Interferometry.
9. X-ray astronomy.
10. Satellite observatories.
11. Astronomical data.


1. Random quantities
• Important properties of a random value Y:
– The probability density p(y).
– The mean or average:
$$\mu = \int dy\, y\, p(y)$$
– The variance:
$$\sigma^2 = \int dy\, (y-\mu)^2\, p(y)$$
(These are not guaranteed to exist!)
• It is important to distinguish between the ideal values of these and estimates of them which one can calculate from a sample $[y_1,y_2,...y_N]$ of Y.
• Although the ideal values are formally unattainable, in practice good estimates of them may be available. Eg:
– There may be formulae which predict p, μ and (often most importantly) $\sigma^2$ (see eg radio astronomy).
– Long-term calibration measurements can provide good estimates (true for most scientific instruments).


1. Random quantities

• Also often of interest is an integral over the probability density.

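[Figure: the probability density p(y) plotted against y, with a value y0 marked on the y axis.]

$$P(y > y_0) = \int_{y_0}^{\infty} dy\, p(y)$$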


1. Random quantities
• Estimating the three properties from N samples of Y:
– A frequency histogram serves as an estimate of p(y).
– Estimate of the mean:
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} y_i$$
– Estimate of the variance*:
$$\hat{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N} \left(y_i - \hat{\mu}\right)^2$$
• The result of every (non-trivial) transformation of a set of random numbers is itself a random number. (For example, the estimators for the mean and variance.)

*This formula was incorrect on slide 5 of lecture 3.
Note: the ‘hats’ here mean ‘estimate’.
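The two estimators above can be sketched in a few lines of Python. This is only an illustration: the function names, the Gaussian test sample and its parameters (mean 10, standard deviation 2) are assumptions made for the example, not part of the lecture.

```python
import random
import statistics

def estimate_mean(samples):
    """Estimate of the mean: (1/N) * sum of the samples."""
    return sum(samples) / len(samples)

def estimate_variance(samples):
    """Estimate of the variance, dividing by N-1 rather than N."""
    n = len(samples)
    mu_hat = estimate_mean(samples)
    return sum((y - mu_hat) ** 2 for y in samples) / (n - 1)

# Hypothetical sample: 10000 draws from a Gaussian with mu=10, sigma=2.
rng = random.Random(42)
samples = [rng.gauss(10.0, 2.0) for _ in range(10_000)]

mu_hat = estimate_mean(samples)
var_hat = estimate_variance(samples)
print(mu_hat, var_hat)  # estimates of mu and sigma^2; themselves random numbers
```

Re-running with a different seed gives slightly different values, which is the point made above: the estimators are themselves random quantities.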



• Measurements of some physical quantity:
– We have a set of N measurements $y_i$ which are usually made at different values of some other, essentially non-random quantities $r_i$ (eg at different times, positions, frequencies, etc).
• It is often convenient to treat each measurement $y_i$ as a sum of signal $s_i$, background $b_i$ and noise $n_i$.
– The division between signal and background is made largely on grounds of convenience, or interest – it is a bit like the distinction between ‘useful’ plants and weeds.


1. Random quantities
– The estimate $\hat{\mu}$ will not approach the ideal μ of the noise at large N unless s+b=0.
– The estimate $\hat{\sigma}^2$ of the noise will not approach the ideal $\sigma^2$ at large N unless s+b is constant.
• Uncertainty propagation: if some random variable z is a function f(y) of other random variables $y=[y_1,y_2,...y_N]$, then
$$\sigma_z^2 = \sum_{i=1}^{N}\sum_{j=1}^{N} \frac{\partial f}{\partial y_i}\,\frac{\partial f}{\partial y_j}\,\sigma_{i,j}^2$$



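• Often (not always!) the different $y_i$ are uncorrelated – ie, the value of one does not depend on another. In this case $\sigma_{i,j}^2=0$ for i≠j and so
$$\sigma_z^2 = \sum_{i=1}^{N} \left(\frac{\partial f}{\partial y_i}\right)^2 \sigma_{y_i}^2.$$
• Examples (all uncorrelated):
– A difference of two temperatures:
$$T_{\rm nett} = T_{\rm on} - T_{\rm off}, \qquad \sigma_{T_{\rm nett}}^2 = \sigma_{T_{\rm on}}^2 + \sigma_{T_{\rm off}}^2.$$
– The estimate of the mean:
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} y_i, \qquad \sigma_{\hat{\mu}}^2 = \frac{1}{N^2}\sum_{i=1}^{N} \sigma_{y_i}^2.$$
– A logarithm:
$$z = \ln y, \qquad \sigma_z^2 = \frac{\sigma_y^2}{y^2}.$$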
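The uncorrelated propagation formula can be checked numerically. The sketch below estimates each partial derivative by a central difference; the function name `propagate_uncorrelated`, the step size and all input values are assumptions chosen for illustration.

```python
import math

def propagate_uncorrelated(f, y, sigma, rel_step=1e-6):
    """sigma_z for z = f(y) with uncorrelated y_i.

    Each partial derivative df/dy_i is approximated by a central difference.
    """
    var_z = 0.0
    for i, (yi, si) in enumerate(zip(y, sigma)):
        h = rel_step * max(abs(yi), 1.0)
        up = list(y)
        up[i] = yi + h
        down = list(y)
        down[i] = yi - h
        dfdy = (f(up) - f(down)) / (2.0 * h)
        var_z += (dfdy * si) ** 2
    return math.sqrt(var_z)

# T_nett = T_on - T_off: the variances add.
sigma_nett = propagate_uncorrelated(lambda y: y[0] - y[1], [50.0, 45.0], [0.3, 0.4])

# Mean of N=4 values, each with sigma=1: sigma of the mean is 1/sqrt(N).
sigma_mean = propagate_uncorrelated(lambda y: sum(y) / len(y), [1.0, 2.0, 3.0, 4.0], [1.0] * 4)

# z = ln(y): sigma_z = sigma_y / y.
sigma_log = propagate_uncorrelated(lambda y: math.log(y[0]), [20.0], [2.0])

print(sigma_nett, sigma_mean, sigma_log)
```

The three results should agree with the closed-form examples (0.5, 0.5 and 0.1 for these inputs) to within the accuracy of the finite differences.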


2. Model fitting
• Just as we do not have access to μ or $\sigma^2$, neither do we have access to the ‘ideal’ quantities s(r) or b(r).
– Model fitting is the process of estimating these quantities.
• The usual practice is to propose a model
$$m(\mathbf{r},\Theta) = \hat{b}(\mathbf{r}) + \hat{s}(\mathbf{r},\Theta)$$
which depends on a small number M of parameters $\Theta=[\theta_1,\theta_2,...\theta_M]$.
• The background can usually be much better estimated than the signal, because background tends to occur at similar levels in all data points, whereas signal is localized.
– Hence it is often assumed that $\hat{b}=b$.


2. Model fitting
• Steps in the estimation of s:
1. Decide on a model.
2. Choose a fitting statistic U: ie some formula which has the following properties:
   1. It returns a single number.
   2. It is a function of the data values $y_i$ and the model m(r,Θ).
   3. U should be a smooth function of the parameters Θ, with a quadratic (=analytic) minimum.
   4. The parameter values at that minimum should be, in some reasonable sense of the words, the best fit parameters, ie provide the ‘best’ estimate of the ideal, unattainable s.
3. Calculate uncertainties, and the model plausibility.


2. Model fitting
• Types of U: first I’ll look at the method of least squares.
• This prescribes a formula for U which is essentially a ratio between variances. The estimate $\hat{\sigma}^2$ we make from our data goes in the numerator, whereas the denominator is the predicted $\sigma^2$.
– Before writing this down, let us reconsider the formula for $\hat{\sigma}^2$. Adapting the formula on slide 4 gives:
$$\hat{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left[y_i - m(\mathbf{r}_i,\Theta)\right]^2.$$
– However, in general, this gives too small a value. And the larger the number M of parameters, the greater this error. This is because, as M is increased, the model can better and better follow the ups and downs of the noisy part of y.


2. Model fitting
– A more accurate value is given by
$$\hat{\sigma}^2 = \frac{1}{N-M}\sum_{i=1}^{N}\left[y_i - m(\mathbf{r}_i,\Theta)\right]^2.$$
– The difference between the number N of data points and the number M of parameters to be fitted is called the number of degrees of freedom.
– Note that the standard formula for $\hat{\sigma}^2$ has N-1 degrees of freedom because $\hat{\mu}$ represents a fitted model with 1 parameter.
• Now at last we can construct our ratio. This U formula is
$$U = \frac{1}{N-M}\sum_{i=1}^{N}\frac{\left[y_i - m(\mathbf{r}_i,\Theta)\right]^2}{\sigma_i^2}.$$
– Another name for this is reduced chi squared or $\chi^2_{\rm red}$.


2. Model fitting
– $\chi^2$ itself is the same formula, but without dividing by the degrees of freedom:
$$\chi^2 = \sum_{i=1}^{N}\frac{\left[y_i - m(\mathbf{r}_i,\Theta)\right]^2}{\sigma_i^2}.$$
– It is a funny notation, because no-one seems to think or care what χ might be.
• Now we want to consider two related questions about this choice of U:
1. Can we tell from the best-fit value of U how good our choice of model was?
2. We have best-fit values of the parameters – can we also get their uncertainties?
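As a concrete illustration of $\chi^2$ and $\chi^2_{\rm red}$, the sketch below fits a straight line (M=2 parameters) to simulated data with known σ and evaluates both statistics at the best fit. The straight-line model, the closed-form weighted least-squares solution and the simulated data are assumptions chosen for the example; the lecture does not prescribe a particular model.

```python
import random

def fit_line(r, y, sigma):
    """Weighted least-squares straight line m(r) = a + b*r; returns (a, b)."""
    w = [1.0 / s**2 for s in sigma]
    sw = sum(w)
    sr = sum(wi * ri for wi, ri in zip(w, r))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    srr = sum(wi * ri * ri for wi, ri in zip(w, r))
    sry = sum(wi * ri * yi for wi, ri, yi in zip(w, r, y))
    det = sw * srr - sr * sr
    return (srr * sy - sr * sry) / det, (sw * sry - sr * sy) / det

def chi_squared(r, y, sigma, model):
    """Sum over data points of [(y_i - m(r_i)) / sigma_i]^2."""
    return sum(((yi - model(ri)) / si) ** 2 for ri, yi, si in zip(r, y, sigma))

# Hypothetical data: a line with intercept 1 and slope 2, Gaussian noise sigma=0.5.
rng = random.Random(1)
r = [float(i) for i in range(20)]
sigma = [0.5] * len(r)
y = [1.0 + 2.0 * ri + rng.gauss(0.0, 0.5) for ri in r]

intercept, slope = fit_line(r, y, sigma)
chi2 = chi_squared(r, y, sigma, lambda ri: intercept + slope * ri)
n_params = 2                              # M
chi2_red = chi2 / (len(r) - n_params)     # divide by the degrees of freedom N - M
print(intercept, slope, chi2, chi2_red)
```

Because the simulated noise really does have the σ used in the denominator, $\chi^2_{\rm red}$ should come out of order 1 here.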


2. Model fitting
• The model is our hypothesis. It represents what we think the signal s(r) is doing.
• There is an infinite choice of models. We cannot prove any given model is correct, only disprove it.
• But how do we do this?
• If the model choice is good, then the estimate $\hat{\sigma}^2$ ought to be close to $\sigma^2$; in other words, the value of $\chi^2_{\rm red}$ in this case ought to be close to 1.
• But suppose $\chi^2_{\rm red}$=1.04, or 2, or 7641. How ‘close’ to 1 is still good?


2. Model fitting
• Whatever U formula we chose, we cannot decide whether the model is good or not without knowing the probability distribution of U.
– Remember that U, as a function of random variables, is therefore itself a random variable, and therefore has a p(U).
• To test our model hypothesis, after we have done the fitting, and obtained $U_{\rm best\,fit}$, we have to:
1. Obtain p(U).
2. Calculate the probability
$$P(U \ge U_{\rm best\,fit}) = \int_{U_{\rm best\,fit}}^{\infty} dU\, p(U).$$
If P is small, our model is suspect; in doubt; probably no good; has failed the test.


2. Model fitting

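• Sometimes, for certain choices of U, this distribution is a known function. U=$\chi^2_{\rm red}$ is an example of this happy situation: in this case p(U) is an easily calculable function, and more importantly, so is $P(U \ge U_{\rm best\,fit})$:
$$P(U \ge U_{\rm best\,fit}) = Q\left(\frac{N-M}{2},\ \frac{(N-M)\,U_{\rm best\,fit}}{2}\right)$$
where Q is the (regularized) complementary incomplete gamma function, and $(N-M)\,U_{\rm best\,fit}$ is just the best-fit value of $\chi^2$.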
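For an even number of degrees of freedom ν=N-M, the function Q(ν/2, χ²/2) reduces to a finite sum, which allows a dependency-free sketch of the test. The restriction to even ν and the function name are choices made for this example only; for general ν one would call a library routine for the regularized upper incomplete gamma function (for example `scipy.special.gammaincc`).

```python
import math

def chi2_tail_even_dof(chi2, dof):
    """P(chi^2 >= chi2) = Q(dof/2, chi2/2), valid for even dof only.

    For integer a, Q(a, x) = exp(-x) * sum_{j=0}^{a-1} x^j / j!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form needs a positive, even dof")
    x = chi2 / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, dof // 2):
        term *= x / j
        total += term
    return math.exp(-x) * total

# How 'close' to 1 is still good? Try the reduced values from the slide with N - M = 10.
dof = 10
for chi2_red in (1.04, 2.0, 7641.0):
    print(chi2_red, chi2_tail_even_dof(chi2_red * dof, dof))
```

Note that the second argument is built from χ² itself, ie the reduced value multiplied by the number of degrees of freedom.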


2. Model fitting

• For Poisson data, if we replace $\sigma_i^2$ in the $\chi^2$ formula by $m_{i,\rm best\,fit}$, the probability distribution is unchanged (still $\chi^2$).
– Thus we can easily use this formula for U to assess the worth of the model.
• We cannot use it to obtain $m_{\rm best\,fit}$ in the first place though, because the result is known to be biased.
• An unbiased fitting formula has been devised by K Mighell:
$$U_{\rm Mighell} = \sum_{i=1}^{N}\frac{\left[y_i + \min(y_i,1) - m_i\right]^2}{y_i+1}$$
But the probability distribution for this is not known.
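Mighell's statistic is simple to evaluate directly. The sketch below is a plain transcription of the formula above; the function name and the example counts and model values are made up for illustration.

```python
def mighell_statistic(counts, model):
    """Sum over bins of [y_i + min(y_i, 1) - m_i]^2 / (y_i + 1)."""
    return sum((y + min(y, 1) - m) ** 2 / (y + 1) for y, m in zip(counts, model))

# Hypothetical Poisson counts and model predictions for three data points.
counts = [0, 1, 4]
model = [1.0, 2.0, 5.0]
print(mighell_statistic(counts, model))
```

To use it for fitting, one would minimize this value with respect to the model parameters, in the same way as for $\chi^2$.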


2. Model fitting

• Uncertainties in the parameters:
– If, for U=$\chi^2$, we calculate the matrix H of second derivatives of U with respect to the model parameters at the best fit, then the covariance $\sigma_{ij}^2$ between parameters i and j (ie the square of the uncertainty if i=j) is the i,jth element of $2H^{-1}$.

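• Another form of U is the likelihood:
– Suppose we know the probability distribution p(y) of y. Eg (Gaussian)
$$p(y) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y-\langle y\rangle)^2}{2\sigma^2}\right)$$
or (Poisson)
$$p(y) = \frac{\langle y\rangle^{y}\,\exp(-\langle y\rangle)}{y!}$$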


2. Model fitting
– If, for every data point $y_i$, we replace <y> in these expressions by our model $m(\mathbf{r}_i,\Theta)$, then multiply all the probabilities together for i in [1,N], we have an expression for the likelihood L of a given set of parameters Θ. The best fit parameter values are taken to be those which maximize L.
– Two further modifications are usually made:
1. Since the numbers are often very small, it is more convenient to work with log(L);
2. A minimization algorithm will need to work with the negative of the log-likelihood, ie the expression to minimize is –log(L).
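A minimal sketch of the –log(L) recipe for Poisson data follows. The one-parameter constant-rate model, the made-up counts and the brute-force grid search (standing in for a proper minimization algorithm) are all assumptions chosen to keep the example short.

```python
import math

def poisson_neg_log_likelihood(counts, model):
    """-log(L) for Poisson data: sum of [m_i - y_i*log(m_i) + log(y_i!)]."""
    return sum(m - y * math.log(m) + math.lgamma(y + 1)
               for y, m in zip(counts, model))

# Hypothetical counts, and a model with a single parameter theta: m_i = theta.
counts = [3, 7, 4, 6, 5]

# Crude grid search over theta in [1, 10] in steps of 0.01.
grid = [0.01 * k for k in range(100, 1001)]
best_theta = min(
    grid,
    key=lambda theta: poisson_neg_log_likelihood(counts, [theta] * len(counts)),
)
print(best_theta)
```

For this particular model the maximum-likelihood value can be found analytically (it is the sample mean of the counts), which gives a handy check on the numerical minimum.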


3. Signal detection.

• This is like model fitting, but we start with a special class of model: one which contains only background.
– Often we don’t need to do the fitting part, because we have obtained a good estimate of the background from other sets of data.
– All we have to do is test the model.
• Remember that a model is our hypothesis about what lies behind the data.
– This signal-less model is called the null hypothesis (‘null’ is from the Latin for ‘nothing’).


3. Signal detection.

• The model testing proceeds exactly the same as before:
1. Calculate U for the null hypothesis model.
   – eg for U=$\chi^2$ this would be
$$U_{\rm null} = \sum_{i=1}^{N}\frac{(y_i-b_i)^2}{\sigma_i^2}.$$
2. Calculate the probability $P(U \ge U_{\rm null})$.
3. If this value is small, then the null hypothesis fails the test. Deduction: there probably is some signal present.


3. Signal detection
• This is ok, but it may not be the most sensitive way to detect signals. This is because signal and background usually have different spatial scales:
– Background usually extends over many data points;
– Signal usually extends only over a few data points.
• Thus, if one uses the whole data set to calculate U, a small signal may be swamped among the background and noise.
– Some selection and filtering are usually done before applying the test.


3. Signal detection
• A likelihood ratio between the likelihood $L_{\rm best}$ calculated for the best fit of a normal model m=b+s(Θ), and the likelihood $L_{\rm null}$ for m=b alone, can also be used to test the null hypothesis.
– W Cash has shown that for
$$U_{\rm best} = 2\log\left(\frac{L_{\rm best}}{L_{\rm null}}\right),$$
$$P(U \ge U_{\rm best}) = Q\left(\frac{M}{2},\ \frac{U_{\rm best}}{2}\right),$$
where M is the number of fitted parameters.
• Don’t get confused though – this is a test of the null hypothesis, not of the best fit – a low value of this P means the null hypothesis is probably wrong and therefore there is some signal there.


3. Signal detection
• Significance:
– Suppose you only have 1 random value y, and you want to decide whether it contains ‘signal’ or just ‘background’.
  • Eg the assignment question in which you had some values of the Stokes parameter V, plus its uncertainty. In this case the ‘background’ was zero. The question was, is there some ‘signal’ (ie a non-zero value of V) present?
– The way to do this is to compare y-b with the uncertainty $\sigma_y$ (this σ is the ‘expected’ standard deviation, calculated from sources other than the single data point).
– The ratio $(y-b)/\sigma_y$ is called the significance.
  • If the ratio is about 1, then you might say “y is consistent with the background value.” Ie you can’t rule out the null hypothesis.
  • Sometimes for a ratio X, you will hear people call this an “X-sigma detection.” X=5 is a commonly used yardstick for a detection which is judged to be significant.
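The significance test amounts to one line of arithmetic. In the sketch below the Stokes V values and their common uncertainty are invented for illustration, and the 5-sigma cut is the yardstick quoted above rather than a universal rule.

```python
def significance(y, background, sigma_y):
    """The ratio (y - b) / sigma_y."""
    return (y - background) / sigma_y

# Hypothetical measurements of Stokes V; the 'background' (null value) is zero.
stokes_v = [0.02, -0.11, 0.62]
sigma_v = 0.10

for v in stokes_v:
    x = significance(v, 0.0, sigma_v)
    # abs() is used because V can be significantly non-zero with either sign.
    verdict = "detection" if abs(x) >= 5.0 else "consistent with background"
    print(f"V={v:+.2f}: {x:+.1f} sigma, {verdict}")
```

Taking the absolute value is a choice suited to a quantity like V that can have either sign; for a flux that can only be positive one would test the signed ratio.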


4. Catalogs of sources.
• Source detection is a special case, in which the model to be fitted is
$$m(\mathbf{r}) = b(\mathbf{r}) + \sum_{i=1}^{M} A_i\, s_i(\mathbf{r}-\mathbf{r}_i).$$
• The difficulty is that one usually doesn’t know the value of M.
– So one has to assume that the sources are not confused – in other words that the separation between sources is usually greater than the size or extent of each $s_i$. Then one can use
$$m(\mathbf{r}) = b(\mathbf{r}) + A\, s(\mathbf{r}-\mathbf{r}_0)$$
for m.


4. Catalogs of sources.
• The value of amplitude A at which the source is detected about ½ the time is the sensitivity.
– Different detection methods give differing sensitivities.
• The completeness is the estimated fraction of sources above a certain A which are detected.
– Eg “the survey is 90% complete above a flux of $10^{-13}$ erg cm$^{-2}$ s$^{-1}$.”
• Suppose:
1. We test the null hypothesis at N locations.
2. We decide to label as sources all locations where $P(U \ge U_{\rm null}) \le \alpha$.
Then our catalog of sources will contain about Nα false positives – ie places at which $P(U \ge U_{\rm null}) \le \alpha$ but where in reality there is no source.
– (But! We can’t know which are real and which are false.)
The fraction of false positives is a measure of the reliability of our survey.


4. Catalogs of sources.

• Of interest is the frequency distribution of source amplitudes, n(A).
– Often one talks of the source flux or flux density rather than amplitude:
  • Examples of units are erg cm$^{-2}$ s$^{-1}$ for x-ray and janskys for radio;
  • The symbol used is often S.
– Hence n(S) is more common notation.
• If the sources are distributed uniformly in space (known as a Euclidean distribution), n(S) will be a power law, with index=-2.5; in other words, $n(S) \propto S^{-2.5}$.


4. Catalogs of sources.

• n(S) is estimated in the usual way: via a histogram.

• Perhaps more often one makes a reverse cumulative histogram, to show the total number N(>S) of sources of flux ≥ S.

• If n is a power law, so is
$$N(>S) = \int_{S}^{\infty} ds\, n(s),$$
with an index greater by 1. So in the Euclidean case, the index of N is -1.5.
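A reverse cumulative histogram is easy to build by sorting the fluxes. The sketch below also draws a simulated Euclidean sample to check the -1.5 index; the function name, the sample size, the minimum flux of 1 (arbitrary units) and the inverse-transform sampling are assumptions made for this example.

```python
import random
from bisect import bisect_left

def reverse_cumulative_counts(fluxes, thresholds):
    """Number of sources with flux >= S, for each threshold S."""
    ordered = sorted(fluxes)
    return [len(ordered) - bisect_left(ordered, s) for s in thresholds]

# Simulated fluxes with n(S) proportional to S^-2.5 above S=1, so that
# N(>S) is proportional to S^-1.5. Drawn by inverting the cumulative law.
rng = random.Random(7)
fluxes = [(1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(20000)]

counts = reverse_cumulative_counts(fluxes, [1.0, 2.0, 4.0])
ratio = counts[1] / counts[0]      # expect about 2**-1.5
print(counts, ratio, 2.0 ** -1.5)
```

Doubling the flux threshold should reduce the count by a factor close to 2^1.5, apart from the Poisson scatter in the counts.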


5. Radio astronomy.
• Radio dishes are reflectors just the same as mirrors of optical telescopes. Just the terminology is sometimes a bit different.
– Instead of PSF one refers to the beam.
• They are sensitive to radiation from a small area in the sky, of angle ~ λ/D.
– The beam nearly always has sidelobes though.
• Unlike optical detectors, radio detectors are polarized – the output voltage varies between a maximum and zero, depending on the polarization state of the incoming radiation.
– Sometimes the detector is most sensitive to linearly polarized radiation at a given angle, sometimes to left- or right-circularly polarized radiation.
– Often these days, two detectors of opposite polarization are placed at the focus.


5. Radio astronomy.
• Radio signals are nearly always noise-like. As such they can be mimicked by:
– placing a resistor at a certain temperature across the input terminals to the detection electronics,
or
– pointing the antenna directly at a surface with the same temperature as the above resistor!
• Because the noise power spectral density from such a resistor is equal to kT watts Hz$^{-1}$ (k is Boltzmann’s constant and T is the temperature in kelvin), radio engineers are in the practice of expressing all noise powers in terms of temperatures.


5. Radio astronomy.
• Since powers are additive, so are the associated temperatures.
• Thus the total output noise temperature can be expressed as a sum of several terms:
$$T_{\rm total} = T_{\rm source} + T_{\rm background} + T_{\rm atmosphere} + T_{\rm system}$$
– Such temperatures are most of the time not ‘real’ in the sense that they are numbers you could read off a thermometer somewhere. They are just a handy way of expressing power spectral density.
  • For example, the ‘real’ temperature of the ground is about 300K, but $T_{\rm background}$ in the above sum won’t be 300K unless the antenna is pointed right at the ground. Normally you will only get a little contribution from this hot surface from reflections and from far sidelobes.


5. Radio astronomy.
• The uncertainty of a noise temperature is
$$\sigma_T = \frac{T}{\sqrt{\Delta\nu\, t}}$$
where Δν is the bandwidth and t is the integration time.
• For a point source of flux density S W m$^{-2}$ Hz$^{-1}$, the power spectral density is
$$w = \alpha A_e S$$
where α varies from 0 to 1 depending on the relative polarization of radiation and detector,
– (α=0.5 if the radiation is unpolarized)
and $A_e$ is the effective area of the antenna in m$^2$.

5. Radio astronomy.
• From these and from
$$w = kT$$
we have
$$S = \frac{k\,T_{\rm source}}{\alpha A_e}$$
and
$$\sigma_S = \frac{k\,T_{\rm total}}{\alpha A_e \sqrt{\Delta\nu\, t}}.$$
Both are in W m$^{-2}$ Hz$^{-1}$. You have to multiply by $10^{26}$ to convert to janskys.
This last is roughly equal to the limiting sensitivity of the telescope – ie the minimum flux point source which can be detected. (Technically you’d want to set the detection threshold at 5σ or so.)
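The chain from noise temperature to flux uncertainty can be sketched as a short calculation. The function name and all telescope numbers below (system temperature, effective area, bandwidth, integration time) are hypothetical, chosen only to show the arithmetic.

```python
import math

BOLTZMANN = 1.380649e-23   # J/K
W_TO_JANSKY = 1e26         # 1 Jy = 1e-26 W m^-2 Hz^-1

def flux_uncertainty_jy(t_total_k, area_eff_m2, bandwidth_hz, t_int_s, alpha=0.5):
    """sigma_S = k * T_total / (alpha * A_e * sqrt(bandwidth * time)), in janskys."""
    sigma_t = t_total_k / math.sqrt(bandwidth_hz * t_int_s)   # uncertainty of T_total
    sigma_s = BOLTZMANN * sigma_t / (alpha * area_eff_m2)     # W m^-2 Hz^-1
    return sigma_s * W_TO_JANSKY

# Hypothetical dish: T_total=30 K, A_e=500 m^2, 100 MHz bandwidth, 100 s integration,
# unpolarized radiation (alpha=0.5).
sigma_s = flux_uncertainty_jy(30.0, 500.0, 1e8, 100.0)
print(f"sigma_S = {sigma_s * 1e3:.3f} mJy, 5-sigma limit = {5 * sigma_s * 1e3:.3f} mJy")
```

Since the uncertainty falls only as the square root of bandwidth times time, quadrupling the integration time halves it.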


6. Polarized signals.

Stokes parameters I, Q, U and V.
I = total intensity.
Q = intensity of horizontal pol.
U = intensity of pol. at 45°.
V = intensity of left circular pol.

Therefore need 4 measurements to completely define the radiation.

Visualize with the “Poincaré sphere” of radius I.

[Figure: the Poincaré sphere, with Q, U and V axes.]

Polarization fraction d:
$$d = \frac{\sqrt{Q^2+U^2+V^2}}{I}$$

Polarization angle is
$$\theta = \frac{1}{2}\arctan\left(\frac{U}{Q}\right)$$
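The two derived quantities follow directly from the four Stokes parameters. In the sketch below the function names and the example values are illustrative; `math.atan2` is used in place of a plain arctan(U/Q) so that the quadrant is resolved and Q=0 causes no division by zero, which is a choice made for this example.

```python
import math

def polarization_fraction(i, q, u, v):
    """d = sqrt(Q^2 + U^2 + V^2) / I."""
    return math.sqrt(q * q + u * u + v * v) / i

def polarization_angle(q, u):
    """Polarization angle in radians: half the arctangent of U/Q."""
    return 0.5 * math.atan2(u, q)

# Hypothetical Stokes parameters (arbitrary intensity units).
i, q, u, v = 2.0, 0.6, 0.6, 0.1
d = polarization_fraction(i, q, u, v)
theta = polarization_angle(q, u)
print(d, math.degrees(theta))
```

With Q=U the angle comes out at 22.5 degrees, half of the 45 degrees measured in the Q-U plane of the Poincaré sphere.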


6. Polarized signals.
• Depolarization: basically due to mixing of many slightly different, uncorrelated source polarizations within the width of the beam.
• Faraday rotation of the polarization angle:
$$\Delta\theta \propto \lambda^2\, D\, N_e\, B \ \ \text{radians}$$
where D is the distance from the source, $N_e$ is the average number density of electrons along that path, and B is the average magnetic field (its component along the line of sight).
– Enhances depolarization because uneven $D N_e B$ within the beam amplifies differences between polarization angles.


7. Fourier matters.

• What is the difference between
– A Fourier series (FS);
– A Fourier transform (FT);
– A discrete* Fourier transform (DFT);
– The Fast Fourier Transform (FFT)?
• A Fourier series starts with a function f(t) defined on an interval [0,T]. Its FS to order N-1 is
$$f(t) \approx \sum_{j=0}^{N-1} A_j \exp\left(\frac{2\pi i j t}{T}\right)$$
where
$$A_j = \frac{2}{T}\int_0^T dt\, f(t)\exp\left(-\frac{2\pi i j t}{T}\right), \quad\text{except}\quad A_0 = \frac{1}{T}\int_0^T dt\, f(t).$$

*‘Discrete’ and ‘discreet’ mean different things – consult a dictionary!


7. Fourier matters.
• The FT of f(t) (sometimes indicated F{f}) is:
$$F(\omega) = \int_{-\infty}^{\infty} dt\, f(t)\exp(-i\omega t)$$
FTs are known for some functions, but not for others. They can’t in general be calculated exactly.
• The DFT is the nearest one can get to a FT which is calculable. If f(t) is sampled at N evenly-spaced points within the interval [0,T], then the DFT is defined as
$$F_k = \frac{1}{N}\sum_{j=0}^{N-1} f\left(\frac{jT}{N}\right)\exp\left(-\frac{2\pi i j k}{N}\right).$$
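A direct, order-N² evaluation of the DFT makes the definition concrete. The sketch below assumes the 1/N normalization and a negative sign in the exponent; sign and normalization conventions differ between texts and libraries, so treat both as assumptions of this example. The test signal (a cosine with 3 cycles across 16 samples) is also made up.

```python
import cmath
import math

def dft(samples):
    """Direct DFT: F_k = (1/N) * sum_j f_j * exp(-2*pi*i*j*k/N)."""
    n = len(samples)
    return [
        sum(samples[j] * cmath.exp(-2j * math.pi * j * k / n) for j in range(n)) / n
        for k in range(n)
    ]

# Hypothetical signal: a cosine with exactly 3 cycles across N=16 samples.
n = 16
samples = [math.cos(2.0 * math.pi * 3 * j / n) for j in range(n)]
spectrum = dft(samples)
for k, value in enumerate(spectrum):
    print(k, round(abs(value), 6))
```

The power appears in bins k=3 and k=N-3, each with amplitude one half. An FFT routine returns the same numbers (up to its own normalization convention) with far fewer operations.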


7. Fourier matters.

• The relation of f’ to f is the same as F’ to F:
$$f'(t) = \sum_{j=-\infty}^{\infty} f(t + jT)$$
In other words, f’ is f wrapped or aliased in a cyclic fashion at the interval boundaries 0 and T.
• The FFT is not some new formula, it is a fast computer algorithm for calculating the DFT.


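7. Fourier matters.
• The convolution integral is
$$(f*g)(t) = \int dt'\, f(t')\,g(t-t').$$
• Its FT is:
$$\mathcal{F}\{f*g\} = \mathcal{F}\{f\}\,\mathcal{F}\{g\}.$$
• The correlation integral between two functions f and g is closely related:
$$R(t) = \int dt'\, f(t')\,g(t'+t)$$
• Its FT is (for real-valued f):
$$\mathcal{F}\{R\} = \mathcal{F}\{f\}^{*}\,\mathcal{F}\{g\}$$
The superscript * here denotes complex conjugation.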


7. Fourier matters.

• From that it is pretty clear that the FT of an autocorrelation (correlation of a function against itself) must be real-valued, since it equals $|\mathcal{F}\{f\}|^2$.
– The FT of an autocorrelation of f is called the power spectrum of f.
• The normalized, zero-lag correlation is just the average of the product of f and g:
$$R_{\rm norm}(0) = \langle f g\rangle = \lim_{T\to\infty}\frac{1}{T}\int_{-T/2}^{T/2} dt\, f(t)\,g(t)$$


8. Interferometry.

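[Figure: coordinate system. Antennas 1 and 2 and the phase centre, with axes u and w; a source direction s at angle θ from the phase centre, with direction cosines l and $(1-l^2)^{1/2}$.]

Coordinate system:
– $u_{1,2}$ (u component of the baseline vector $\mathbf{b}_{1,2}$).
– $w_{1,2}$.
– v into the page.
– m into the page.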


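8. Interferometry.
• The zero-lag complex correlation <y1y2> between the signals from the two antennas gives the spatial coherence function of the radiation
$$E(u,v,w) = \int_{-1}^{1}\int_{-1}^{1} dl\,dm\; I'(l,m)\exp\left\{-2\pi i\left[ul+vm+w\sqrt{1-l^2-m^2}\right]\right\}$$
• where
$$I'(l,m) = \frac{A(l,m)\,I(l,m)}{\sqrt{1-l^2-m^2}}.$$
• Multiplying by exp(2πiw) gives the visibility function:
$$V(u,v) \approx \int\!\!\int dl\,dm\; I'(l,m)\exp\left\{-2\pi i\left[ul+vm\right]\right\}$$
• The approximation is good provided $\pi w(l^2+m^2) \ll 1$.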


8. Interferometry.

• The really important points are:
– The visibility function V is the inverse FT of the (slightly modified) sky intensity I’.
– The multiplication of <y1y2> by exp(2πiw) is in reality accomplished by delaying the signal from each jth antenna by $t_{\rm lag}=w_j\lambda/c$.
– Baselines are separations between the antennas, measured in wavelengths, and projected onto the reference plane.
– The phase centre is the direction normal to the reference plane.


8. Interferometry.

[Figure: an array of antennas, showing the reference plane, the phase centre, a baseline separation, and a delay between each antenna and the correlator.]

Signals are selectively delayed so that plane waves from the phase centre arrive at the correlator in phase.

8. Interferometry.

• The real-life process of interferometry:
1. The sky brightness distribution I.
2. Cross-correlation effects an inverse FT, giving the visibility function V.
3. Sampling with the sampling function S gives V×S.
4. Gridding and a discrete FT of V×S give the dirty image D=I*B; gridding and a discrete FT of S give the dirty beam B.
5. CLEAN or other deconvolution.


9. X-ray astronomy.

• Imaging:
– Wolter optics:
  • Grazing incidence, otherwise x-rays won’t reflect.
  • A paraboloid mirror followed by a hyperboloid.
  • The effective area of a Wolter mirror shows a pronounced fall-off as the angle of incidence departs from the optic axis. This is called vignetting.
  • The Wolter PSF also changes significantly across the field of view.
– A ‘Coded Mask’ is another way to ‘image’ x-rays, and the only way to image gamma rays.
  • Actually it more resembles interferometry – the image is obtained by a Fourier technique.

9. X-ray astronomy.

• Detection:
– State-of-the-art detectors are CCDs.
– At optical wavelengths the CCD accumulates charge from many low-energy photons.
– At x-ray wavelengths, all the charge comes from just 1 high-energy photon.
  • This allows the energy of the photon to be measured.
  • Patterns of charge in adjacent pixels have to be interpreted by software as ‘x-ray events’.
  • Thus the whole data processing is event-based rather than brightness-based; flux measurement is based on integers rather than floating-point values.
– Hence it is dominated by Poisson statistics.


9. X-ray astronomy.
• Frames:
– Longish time accumulating photons;
  • Too bright a source gives more than 1 photon per pixel per frame, called pileup.
– Shortish time reading the data out.
  • Serial readout – slow – out-of-time events (OOTEs) – ‘dead time’.
• Hardness ratios are a crude measure of an x-ray spectrum.
• With a ‘proper’ spectrum, if it is a power law, you have to be careful whether you are talking about number of photons per unit photon energy, or total energy per unit photon energy. The respective spectral indices differ by 1.


10. Satellite observatories.

• You need to know the orientation of the satellite in space.
– (Remember too that this usually varies with time.)
• For this, we define:
– a sky coordinate frame with the x axis at RA=0, dec=0, the y axis at RA=6 hr, dec=0 and a z axis at dec=+90°;
– a spacecraft coordinate frame comprising three orthogonal vectors $\mathbf{a}_x$, $\mathbf{a}_y$ and $\mathbf{a}_z$;
– an attitude matrix such that $\mathbf{a}_x$ occupies row 1, $\mathbf{a}_y$ row 2 and $\mathbf{a}_z$ row 3 (all vectors expressed in Cartesian coordinates of the sky frame).


10. Satellite observatories.

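• For example, if the spacecraft is pointing at RA=3 hr, dec=+30°, and if it has 90 degrees of roll, the corresponding attitude matrix is:
$$A = \begin{pmatrix} \dfrac{\sqrt{3}}{2\sqrt{2}} & \dfrac{\sqrt{3}}{2\sqrt{2}} & \dfrac{1}{2} \\[2ex] -\dfrac{1}{2\sqrt{2}} & -\dfrac{1}{2\sqrt{2}} & \dfrac{\sqrt{3}}{2} \\[2ex] \dfrac{1}{\sqrt{2}} & -\dfrac{1}{\sqrt{2}} & 0 \end{pmatrix}$$
If you try to check this, but get figures which don’t agree, keep in mind the possibility (even likelihood) that I made an error!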


10. Satellite observatories.
• The instruments on the spacecraft are, in general, not precisely aligned with the s/c coordinate axes.
• Thus they need their own coordinate systems.
• The difference between the two is called the boresight matrix for that instrument.
– The 3 vectors $\mathbf{b}_x$, $\mathbf{b}_y$ and $\mathbf{b}_z$ which define the instrument’s Cartesian coordinate system are expressed in the basis of the spacecraft coordinate system.
– $\mathbf{b}_x$ forms row 1, etc of the boresight matrix.


10. Satellite observatories.
– Typical off-diagonal numbers in a boresight matrix are small (ie the matrix is close to the identity), because the instruments are usually closely aligned with the spacecraft – as close as the engineers can get it.
• Changing bases:
– Suppose we have a source of x-rays at a certain place in the sky (RA α and dec δ). It is easy to express that direction as a vector u in the sky basis. The values will be:
$$\mathbf{u}_{\rm sky} = \left(\cos\delta\cos\alpha,\ \cos\delta\sin\alpha,\ \sin\delta\right).$$
(provided α and δ are in radians.)


10. Satellite observatories.
– If we want to calculate where that direction is in the spacecraft basis, all we need to do is multiply by the attitude matrix A:
$$\mathbf{u}_{\rm s/c} = A\,\mathbf{u}_{\rm sky}$$
– To understand why this is, consider that the 1st component of $\mathbf{u}_{\rm s/c}$ must be the dot (scalar) product of the s/c X axis vector $\mathbf{a}_x$ (which is the 1st row of A) with $\mathbf{u}_{\rm sky}$; and so on for the other two components of $\mathbf{u}_{\rm s/c}$.
– We can go further, and calculate the components of u in the instrument frame by multiplying by the boresight matrix in similar fashion:
$$\mathbf{u}_{\rm inst} = B\,\mathbf{u}_{\rm s/c} = BA\,\mathbf{u}_{\rm sky}$$
Just remember that matrix multiplication does not commute! AB is NOT the same as BA.
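The change of basis from sky to spacecraft to instrument can be sketched without any libraries. Everything named below is an assumption for illustration: in particular, the zero-roll orientation (y axis pointing east, z axis pointing north) and the sense of the roll angle are one possible convention and not necessarily the one used by any given mission, and the boresight matrix is a made-up small rotation about the z axis.

```python
import math

def unit_vector(ra, dec):
    """Direction (RA, dec in radians) as a Cartesian vector in the sky basis."""
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))

def attitude_matrix(ra, dec, roll):
    """Rows are a_x (pointing direction), a_y and a_z in sky coordinates.

    Assumed convention: at zero roll a_y points east and a_z north;
    roll is a right-handed rotation about a_x.
    """
    ax = unit_vector(ra, dec)
    east = (-math.sin(ra), math.cos(ra), 0.0)
    north = (-math.sin(dec) * math.cos(ra), -math.sin(dec) * math.sin(ra), math.cos(dec))
    c, s = math.cos(roll), math.sin(roll)
    ay = tuple(c * e + s * n for e, n in zip(east, north))
    az = tuple(-s * e + c * n for e, n in zip(east, north))
    return [ax, ay, az]

def mat_vec(m, v):
    """Each output component is the dot product of one row of m with v."""
    return tuple(sum(row[k] * v[k] for k in range(3)) for row in m)

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

ra, dec, roll = math.radians(45.0), math.radians(30.0), math.radians(90.0)  # RA = 3 hr
A = attitude_matrix(ra, dec, roll)

eps = 1e-3  # hypothetical misalignment: 1 milliradian about the s/c z axis
B = [[math.cos(eps), math.sin(eps), 0.0],
     [-math.sin(eps), math.cos(eps), 0.0],
     [0.0, 0.0, 1.0]]

u_sky = unit_vector(ra, dec)               # a source exactly on the pointing axis
u_sc = mat_vec(A, u_sky)                   # u_s/c = A u_sky
u_inst = mat_vec(mat_mul(B, A), u_sky)     # u_inst = B A u_sky
print(u_sc, u_inst)
```

A source on the pointing axis lands on the spacecraft x axis, and the boresight matrix then shifts it by the small misalignment angle in the instrument frame. Swapping the order to AB gives a different matrix.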


11. Astronomical data.
• Spectra:
– A very common form of spectrum (for both radio and x-ray sources) is the power law:
$$S \propto \nu^{\alpha}$$
α here is known as the spectral index.
– Beware though, there are two opposite conventions in use, which have opposite sign of α.
  • If you are reading information about spectral indices, make sure you can determine which convention is being used;
  • If you are writing about them, make the convention clear to the reader!
– A power law shows a straight line on a log-log plot.


11. Astronomical data.

• Time variation:
– An important fact is that the shortest time-scale Δt on which the flux varies carries information about the size D of the object.
  • D cannot be larger than cΔt because time variation implies that some property of the object is changing in concert. But ‘news’ of such a change cannot travel from one side of the object to another faster than light.
• Binned data:
– Dithering or randomization of position within the bin is useful to prevent Moiré effects on rebinning.


11. Astronomical data.

• Histograms:
– The value within any bin is an integer. Such integers are nearly always Poisson random variables.
– Thus, for the uncertainty in the height of a histogram bar, you should use the square root of its (untransformed) value.
– Naturally if you transform the histogram values (eg, divide them all by some constant), you should transform these square root uncertainties in the same way.


11. Astronomical data.
– Eg suppose you have a small histogram with four bins which have initial values [3,15,21,16]. Their uncertainties are [1.7,3.9,4.6,4.0]. If you divide all values by the total 55, you should also divide the uncertainties by 55. (This is simple propagation of uncertainties.)
• Cumulative histograms:
– Here again one can calculate uncertainties based on the square roots.
  • Have a care though because cumulative values are no longer uncorrelated – the value in one bin depends on others above or below it.
  • Because of this factor, it is easy to over-estimate the degree to which a cumulative histogram deviates from what is expected.
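The four-bin histogram example can be reproduced in a few lines. This is a sketch of the arithmetic only; like the text, it treats the total 55 as an exact constant when scaling the uncertainties.

```python
import math

counts = [3, 15, 21, 16]

# Poisson uncertainty of each bin: square root of the untransformed count.
uncertainties = [math.sqrt(c) for c in counts]

# Normalizing by the total: values and uncertainties are scaled by the same factor.
total = sum(counts)
fractions = [c / total for c in counts]
fraction_uncertainties = [u / total for u in uncertainties]

for c, u, f, fu in zip(counts, uncertainties, fractions, fraction_uncertainties):
    print(f"{c:3d} +/- {u:.1f}   ->   {f:.3f} +/- {fu:.3f}")
```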


11. Astronomical data.
• Data presentation and graphing:
– ALWAYS show uncertainties. Most data is useless without them.
– If you have some data values $y_i$ which you expect to be related to other values $x_i$ by some simple rule, of course you can graph y against x... however it is often preferable to transform the x and/or y values such that the transformed coordinates x’ and y’ have a linear relationship, then plot y’ against x’.
– Why? Because much of the time you are interested to see whether the supposed relationship is true or not.
1. It is much easier to tell by eye if points lie on a straight line, rather than trying to judge between various curves;
2. A straight line is also more straightforward to fit.


11. Astronomical data.
– Examples:
1. Faraday rotation. If you have a set of xs which are wavelengths and ys which are the associated polarization angles, then you expect these to be related by $y=A+Bx^2$, for some constants A and B. Better to transform y’=y, x’=$x^2$, which obey the linear relation y’=A+Bx’.
2. A power law. If x is wavelength once again and y is flux, a power law relation between them is $y=Ax^{\alpha}$. Transforming y’=log(y) and x’=log(x) gives a linear relation y’=log(A)+αx’.
– If you transform y, you should of course also transform the uncertainties $\sigma_y$ (using the propagation of error formula).
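The power-law example can be sketched as follows: transform to logarithms, transform the uncertainties with the propagation formula (for a natural log, σ of log y is $\sigma_y/y$), and fit a straight line. The noise-free data, the 5% uncertainties and the unweighted least-squares fit are simplifying assumptions made for this example.

```python
import math

def fit_straight_line(x, y):
    """Unweighted least-squares line y = a + b*x; returns (a, b)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

# Hypothetical power law y = A * x^alpha with A=3, alpha=-0.7, and 5% uncertainties.
wavelength = [1.0, 2.0, 4.0, 8.0, 16.0]
flux = [3.0 * w ** -0.7 for w in wavelength]
sigma_flux = [0.05 * f for f in flux]

# Transform to coordinates in which the relation is linear.
x_t = [math.log(w) for w in wavelength]
y_t = [math.log(f) for f in flux]
sigma_y_t = [s / f for s, f in zip(sigma_flux, flux)]   # sigma of ln(y) is sigma_y / y

log_a, alpha = fit_straight_line(x_t, y_t)
print(math.exp(log_a), alpha, sigma_y_t)
```

The intercept of the fitted line is log(A) and the slope is the index α. With real, noisy data one would weight the fit by the transformed uncertainties.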


The end.
