
Transcript of 13_loredo07


    Introduction to Bayesian Inference

    Tom Loredo

    Dept. of Astronomy, Cornell University
    http://www.astro.cornell.edu/staff/loredo/bayes/

    CASt Summer School June 7, 2007


    Outline

    1 The Big Picture

    2 Foundations: Logic & Probability Theory

    3 Inference With Parametric Models: Parameter Estimation, Model Uncertainty

    4 Simple Examples: Binary Outcomes, Normal Distribution, Poisson Distribution

    5 Application: Extrasolar Planets

    6 Probability & Frequency


    Scientific Method

    Science is more than a body of knowledge; it is a way of thinking.

    The method of science, as stodgy and grumpy as it may seem, is far more important than the findings of science.

    Carl Sagan

    Scientists argue!

    Argument: Collection of statements comprising an act of reasoning from premises to a conclusion

    A key goal of science: Explain or predict quantitative measurements

    Framework: Mathematical modeling

    Science uses rational argument to construct and appraise mathematical models for measurements


    Mathematical Models

    A model (in physics) is a representation of structure in a physical system and/or its properties. (Hestenes)

    [Diagram: REAL WORLD <-> MATHEMATICAL WORLD, connected by translation & interpretation]

    A model is a surrogate

    The model is not the modeled system! It stands in for the system for a particular purpose, and is subject to revision.

    A model is an idealization

    A model is a caricature of the system being modeled (Kac). It focuses on a subset of system properties of interest.


    A model is an abstraction

    Models identify common features of different things so that general ideas can be created and applied to different situations.

    We seek a mathematical model for quantifying uncertainty; it will share these characteristics with physical models.

    Asides

    Theories are frameworks guiding model construction (laws, principles).

    Physics as modeling is a leading school of thought in physics education research; e.g., http://modeling.la.asu.edu/


    The Role of Data

    Data do not speak for themselves!

    We don't just tabulate data, we analyze data.

    We gather data so they may speak for or against existing hypotheses, and guide the formation of new hypotheses.

    A key role of data in science is to be among the premises in scientific arguments.


    Bayesian Statistical Inference

    A different approach to all statistical inference problems (i.e., not just another method in the list: BLUE, maximum likelihood, χ² testing, ANOVA, survival analysis...)

    Foundation: Use probability theory to quantify the strength of arguments (i.e., a more abstract view than restricting PT to describe variability in repeated random experiments)

    Focuses on deriving consequences of modeling assumptions rather than devising and calibrating procedures


    Outline

    1 The Big Picture

    2 Foundations: Logic & Probability Theory

    3 Inference With Parametric Models: Parameter Estimation, Model Uncertainty

    4 Simple Examples: Binary Outcomes, Normal Distribution, Poisson Distribution

    5 Application: Extrasolar Planets

    6 Probability & Frequency


    Logic: Some Essentials

    Logic can be defined as the analysis and appraisal of arguments. (Gensler, Intro to Logic)

    Build arguments with propositions and logical operators/connectives

    Propositions: Statements that may be true or false

    P : Universe can be modeled with ΛCDM
    A : Ωtot ∈ [0.9, 1.1]
    B : ΩΛ is not 0
    B̄ : not B, i.e., ΩΛ = 0

    Connectives:
    A ∧ B : A and B are both true
    A ∨ B : A or B is true, or both are


    Arguments

    Argument: Assertion that an hypothesized conclusion, H, follows from premises, P = {A, B, C, ...} (take "," = "and")

    Notation:
    H|P : Premises P imply H
          H may be deduced from P
          H follows from P
          H is true given that P is true

    Arguments are (compound) propositions.

    Central role of arguments, so special terminology:
    A true argument is valid
    A false argument is invalid or fallacious


    Valid vs. Sound Arguments

    Content vs. form

    An argument is factually correct iff all of its premises are true (it has good content).

    An argument is valid iff its conclusion follows from its premises (it has good form).

    An argument is sound iff it is both factually correct and valid (it has good form and content).

    We want to make sound arguments. Logic and statistical methods address validity, but there is no formal approach for addressing factual correctness.


    Factual Correctness

    Although logic can teach us something about validity and invalidity, it can teach us very little about factual correctness. The question of the truth or falsity of individual statements is primarily the subject matter of the sciences.

    Hardegree, Symbolic Logic

    To test the truth or falsehood of premisses is the task of science. . . . But as a matter of fact we are interested in, and must often depend upon, the correctness of arguments whose premisses are not known to be true.

    Copi, Introduction to Logic


    Premises

    Facts: Things known to be true, e.g. observed data

    Obvious assumptions: Axioms, postulates, e.g., Euclid's first 4 postulates (line segment b/t 2 points; congruency of right angles...)

    Reasonable or working assumptions: E.g., Euclid's fifth postulate (parallel lines)

    Desperate presumption!

    Conclusions from other arguments


    Deductive and Inductive Inference

    Deduction (syllogism as prototype)

    Premise 1: A implies H
    Premise 2: A is true
    Deduction: H is true
    H|P is a valid argument

    Induction (analogy as prototype)

    Premise 1: A, B, C, D, E all share properties x, y, z
    Premise 2: F has properties x, y
    Induction: F has property z
    "F has z"|P is not valid, but may still be rational (likely, plausible, probable); some such arguments are stronger than others

    Boolean algebra (and/or/not over {0, 1}) quantifies deduction. Probability theory generalizes this to quantify the strength of inductive arguments.


    Real Number Representation of Induction

    P(H|P) ≡ strength of argument H|P
    P = 0: Argument is invalid
    P = 1: Argument is valid
    P ∈ (0, 1): Degree of deducibility

    A mathematical model for induction:

    AND (product rule): P(A ∧ B|P) = P(A|P) P(B|A ∧ P) = P(B|P) P(A|B ∧ P)

    OR (sum rule): P(A ∨ B|P) = P(A|P) + P(B|P) − P(A ∧ B|P)

    We will explore the implications of this model.


    Interpreting Bayesian Probabilities

    If we like there is no harm in saying that a probability expresses a degree of reasonable belief. . . . Degree of confirmation has been used by Carnap, and possibly avoids some confusion. But whatever verbal expression we use to try to convey the primitive idea, this expression cannot amount to a definition. Essentially the notion can only be described by reference to instances where it is used. It is intended to express a kind of relation between data and consequence that habitually arises in science and in everyday life, and the reader should be able to recognize the relation from examples of the circumstances when it arises.

    Sir Harold Jeffreys, Scientific Inference


    More On Interpretation

    Physics uses words drawn from ordinary language (mass, weight, momentum, force, temperature, heat, etc.) but their technical meaning is more abstract than their colloquial meaning. We can map between the colloquial and abstract meanings associated with specific values by using specific instances as calibrators.

    A Thermal Analogy

    Intuitive notion | Quantification  | Calibration
    Hot, cold        | Temperature, T  | Cold as ice = 273 K; Boiling hot = 373 K
    Uncertainty      | Probability, P  | Certainty = 0, 1;
                     |                 | p = 1/36: as plausible as snake eyes;
                     |                 | p = 1/1024: as plausible as 10 heads


    A Bit More On Interpretation

    Bayesian

    Probability quantifies uncertainty in an inductive inference. p(x) describes how probability is distributed over the possible values x might have taken in the single case before us:

    [Figure: p is distributed over x; x has a single, uncertain value]

    Frequentist

    Probabilities are always (limiting) rates/proportions/frequencies in an ensemble. p(x) describes variability, how the values of x are distributed among the cases in the ensemble:

    [Figure: x is distributed]


    Arguments Relating Hypotheses, Data, and Models

    We seek to appraise scientific hypotheses in light of observed data and modeling assumptions.

    Consider the data and modeling assumptions to be the premises of an argument with each of various hypotheses, Hi, as conclusions: Hi|Dobs, I. (I = background information, everything deemed relevant besides the observed data.)

    P(Hi|Dobs, I) measures the degree to which (Dobs, I) allow one to deduce Hi. It provides an ordering among arguments for various Hi that share common premises.

    Probability theory tells us how to analyze and appraise the argument, i.e., how to calculate P(Hi|Dobs, I) from simpler, hopefully more accessible probabilities.


    The Bayesian Recipe

    Assess hypotheses by calculating their probabilities p(Hi|...) conditional on known and/or presumed information, using the rules of probability theory.

    Probability Theory Axioms:

    OR (sum rule): P(H1 ∨ H2|I) = P(H1|I) + P(H2|I) − P(H1, H2|I)

    AND (product rule): P(H1, D|I) = P(H1|I) P(D|H1, I) = P(D|I) P(H1|D, I)


    Three Important Theorems

    Bayes's Theorem (BT)

    Consider P(Hi, Dobs|I) using the product rule:
    P(Hi, Dobs|I) = P(Hi|I) P(Dobs|Hi, I)
                 = P(Dobs|I) P(Hi|Dobs, I)

    Solve for the posterior probability:

    P(Hi|Dobs, I) = P(Hi|I) P(Dobs|Hi, I) / P(Dobs|I)

    The theorem holds for any propositions, but for hypotheses & data the factors have names:

    posterior = prior × likelihood / (norm. const.), where the norm. const. P(Dobs|I) = prior predictive
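    To make the recipe concrete, here is a minimal Python sketch (not from the talk) that applies Bayes's theorem to a small discrete hypothesis space; the prior and likelihood numbers are invented purely for illustration.

    ```python
    import numpy as np

    # Invented example: three exclusive, exhaustive hypotheses H_i with
    # assumed priors P(H_i|I) and likelihood values P(D_obs|H_i, I).
    prior = np.array([0.5, 0.3, 0.2])          # P(H_i|I); sums to 1
    likelihood = np.array([0.10, 0.40, 0.05])  # P(D_obs|H_i, I)

    # Law of total probability gives the prior predictive (the normalization).
    prior_predictive = np.sum(prior * likelihood)       # P(D_obs|I)

    # Bayes's theorem: posterior = prior * likelihood / prior predictive.
    posterior = prior * likelihood / prior_predictive   # P(H_i|D_obs, I)

    print("P(D_obs|I) =", prior_predictive)
    print("posterior  =", posterior, "sum =", posterior.sum())
    ```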


    Law of Total Probability (LTP)


    Consider exclusive, exhaustive {Bi} (I asserts one of them must be true):

    ∑i P(A, Bi|I) = ∑i P(Bi|A, I) P(A|I) = P(A|I)
                 = ∑i P(Bi|I) P(A|Bi, I)

    If we do not see how to get P(A|I) directly, we can find a set {Bi} and use it as a basis ("extend the conversation"):

    P(A|I) = ∑i P(Bi|I) P(A|Bi, I)

    If our problem already has Bi in it, we can use LTP to get P(A|I) from the joint probabilities (marginalization):

    P(A|I) = ∑i P(A, Bi|I)


    Example: Take A = Dobs, Bi = Hi; then

    P(Dobs|I) = ∑i P(Dobs, Hi|I) = ∑i P(Hi|I) P(Dobs|Hi, I)

    prior predictive for Dobs = average likelihood for the Hi (aka marginal likelihood)

    Normalization

    For exclusive, exhaustive Hi:  ∑i P(Hi|...) = 1


    Well-Posed Problems

    The rules express desired probabilities in terms of other probabilities.

    To get a numerical value out, at some point we have to put numerical values in.

    Direct probabilities are probabilities with numerical values determined directly by premises (via modeling assumptions, symmetry arguments, previous calculations, desperate presumption...).

    An inference problem is well posed only if all the needed probabilities are assignable based on the premises. We may need to add new assumptions as we see what needs to be assigned. We may not be entirely comfortable with what we need to assume! (Remember Euclid's fifth postulate!)

    Should explore how results depend on uncomfortable assumptions (robustness).


    Recap

    Bayesian inference is more than BT

    Bayesian inference quantifies uncertainty by reporting probabilities for things we are uncertain of, given specified premises. It uses all of probability theory, not just (or even primarily) Bayes's theorem.

    The Rules in Plain English

    Ground rule: Specify premises that include everything relevant that you know or are willing to presume to be true (for the sake of the argument!).

    BT: Make your appraisal account for all of your premises. Things you know are false must not enter your accounting.

    LTP: If the premises allow multiple arguments for a hypothesis, its appraisal must account for all of them. Do not just focus on the most or least favorable way a hypothesis may be realized.


    Outline

    1 The Big Picture

    2 Foundations: Logic & Probability Theory

    3 Inference With Parametric Models: Parameter Estimation, Model Uncertainty

    4 Simple Examples: Binary Outcomes, Normal Distribution, Poisson Distribution

    5 Application: Extrasolar Planets

    6 Probability & Frequency


    Inference With Parametric Models

    Models Mi (i = 1 to N), each with parameters θi, each imply a sampling dist'n (conditional predictive dist'n for possible data):

    p(D|θi, Mi)

    The θi dependence when we fix attention on the observed data is the likelihood function:

    Li(θi) ≡ p(Dobs|θi, Mi)

    We may be uncertain about i (model uncertainty) or θi (parameter uncertainty).


    Three Classes of Problems

    Parameter Estimation
    Premise = choice of model (pick a specific i). What can we say about θi?

    Model Assessment
    Model comparison: Premise = {Mi}. What can we say about i?
    Model adequacy/GoF: Premise = M1 ∨ "all alternatives". Is M1 adequate?

    Model Averaging
    Models share some common params: θi = {ψ, ηi}. What can we say about ψ w/o committing to one model? (Systematic error is an example.)


    Parameter Estimation

    Problem statement

    I = Model M with parameters θ (+ any add'l info)
    Hi = statements about θ; e.g. θ ∈ [2.5, 3.5], or θ > 0
    Probability for any such statement can be found using a probability density function (PDF) for θ:

    P(θ ∈ [θ, θ + dθ]|...) = f(θ) dθ = p(θ|...) dθ

    Posterior probability density

    p(θ|D, M) = p(θ|M) L(θ) / ∫ dθ p(θ|M) L(θ)


    Summaries of posterior

    Best fit values:
    Mode, θ̂, maximizes p(θ|D, M)
    Posterior mean, ⟨θ⟩ = ∫ dθ θ p(θ|D, M)

    Uncertainties:
    Credible region Δ of probability C:  C = P(θ ∈ Δ|D, M) = ∫_Δ dθ p(θ|D, M)
    A Highest Posterior Density (HPD) region has p(θ|D, M) higher inside than outside
    Posterior standard deviation, variance, covariances

    Marginal distributions:
    Interesting parameters ψ, nuisance parameters φ
    Marginal dist'n for ψ:  p(ψ|D, M) = ∫ dφ p(ψ, φ|D, M)
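    The summaries above are easy to compute numerically once the posterior is tabulated on a grid. Below is a minimal Python sketch, assuming a toy Gaussian likelihood with known σ and a flat prior; the data values are invented.

    ```python
    import numpy as np

    # Toy setup (invented): Gaussian likelihood for theta with known sigma, flat prior.
    data = np.array([4.8, 5.3, 5.1, 4.6, 5.2])
    sigma = 0.4
    theta = np.linspace(3.0, 7.0, 4001)

    logL = -0.5 * np.sum((data[:, None] - theta[None, :]) ** 2, axis=0) / sigma ** 2
    post = np.exp(logL - logL.max())        # flat prior, so posterior is proportional to L
    post /= np.trapz(post, theta)           # normalize the PDF on the grid

    mode = theta[np.argmax(post)]                             # best fit (mode)
    mean = np.trapz(theta * post, theta)                      # posterior mean
    sd = np.sqrt(np.trapz((theta - mean) ** 2 * post, theta))

    # Approximate HPD region: keep the highest-density grid points until they
    # enclose 68.3% of the posterior probability.
    order = np.argsort(post)[::-1]
    dtheta = theta[1] - theta[0]
    cum = np.cumsum(post[order]) * dtheta
    inside = order[: np.searchsorted(cum, 0.683) + 1]
    print(f"mode={mode:.3f}  mean={mean:.3f}  sd={sd:.3f}")
    print(f"68.3% HPD region ~ [{theta[inside].min():.3f}, {theta[inside].max():.3f}]")
    ```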


    Nuisance Parameters and Marginalization


    To model most data, we need to introduce parameters besides those of ultimate interest: nuisance parameters.

    Example

    We have data from measuring a rate r = s + b that is a sum of an interesting signal s and a background b.
    We have additional data just about b.
    What do the data tell us about s?


    Marginal posterior distribution


    p(s|D, M) = ∫ db p(s, b|D, M) ∝ p(s|M) ∫ db p(b|s) L(s, b) ≡ p(s|M) Lm(s)

    with Lm(s) the marginal likelihood for s. For a broad prior,

    Lm(s) ≈ p(b̂s|s) L(s, b̂s) δbs
      b̂s = best b given s
      δbs = b uncertainty given s

    The profile likelihood Lp(s) ≡ L(s, b̂s) gets weighted by a parameter-space volume factor.

    E.g., Gaussians: ŝ = r̂ − b̂,  σs² = σr² + σb²

    Background subtraction is a special case of background marginalization.
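    A quick numerical check of the Gaussian special case quoted above (a sketch, not from the slides): marginalizing b on a grid reproduces ŝ = r̂ − b̂ and σs² = σr² + σb². The measured values here are invented.

    ```python
    import numpy as np

    # Invented measurements: total rate r = r_hat +/- sig_r, background b = b_hat +/- sig_b.
    r_hat, sig_r = 10.0, 1.5
    b_hat, sig_b = 4.0, 1.0

    s = np.linspace(-5.0, 20.0, 501)
    b = np.linspace(-5.0, 15.0, 401)
    S, B = np.meshgrid(s, b, indexing="ij")

    # Joint posterior with flat priors: the two Gaussian factors of the slide.
    log_post = -0.5 * ((B - b_hat) / sig_b) ** 2 - 0.5 * ((S + B - r_hat) / sig_r) ** 2
    joint = np.exp(log_post - log_post.max())

    # Marginalize over the nuisance parameter b, then summarize s.
    marg = np.trapz(joint, b, axis=1)
    marg /= np.trapz(marg, s)
    mean_s = np.trapz(s * marg, s)
    sd_s = np.sqrt(np.trapz((s - mean_s) ** 2 * marg, s))

    print(mean_s, r_hat - b_hat)              # both ~6.0
    print(sd_s, np.hypot(sig_r, sig_b))       # both ~1.80
    ```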

    Model Comparison


    Problem statement

    I = (M1 ∨ M2 ∨ ...): Specify a set of models.
    Hi = Mi: The hypothesis chooses a model.

    Posterior probability for a model:

    p(Mi|D, I) = p(Mi|I) p(D|Mi, I) / p(D|I) ∝ p(Mi|I) L(Mi)

    But L(Mi) = p(D|Mi) = ∫ dθi p(θi|Mi) p(D|θi, Mi).

    Likelihood for a model = average likelihood for its parameters: L(Mi) = ⟨L(θi)⟩

    Varied terminology: Prior predictive = Average likelihood = Global likelihood = Marginal likelihood = (Weight of) Evidence for the model


    Odds and Bayes factors


    Ratios of probabilities for two propositions using the same premises are called odds:

    Oij ≡ p(Mi|D, I) / p(Mj|D, I) = [p(Mi|I) / p(Mj|I)] × [p(D|Mi, I) / p(D|Mj, I)]

    The data-dependent part is called the Bayes factor:

    Bij ≡ p(D|Mi, I) / p(D|Mj, I)

    It is a likelihood ratio; the BF terminology is usually reserved for cases when the likelihoods are marginal/average likelihoods.


    An Automatic Occam's Razor


    Predictive probabilities can favor simpler models

    p(D|Mi) = ∫ dθi p(θi|Mi) L(θi)

    [Figure: predictive probability P(D|H) vs. data D for a simple H and a complicated H, with Dobs marked on the D axis]


    The Occam Factor


    [Figure: prior p(θ|M) and likelihood L(θ) vs. θ, with likelihood width δθ much smaller than prior width Δθ]

    p(D|Mi) = ∫ dθi p(θi|M) L(θi) ≈ p(θ̂i|M) L(θ̂i) δθi ≈ L(θ̂i) δθi/Δθi
            = Maximum Likelihood × Occam Factor

    Models with more parameters often make the data more probable for the best fit.
    The Occam factor penalizes models for "wasted" volume of parameter space.
    This quantifies the intuition that models shouldn't require fine-tuning.
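    The Occam-factor decomposition can be checked numerically. The sketch below (assumptions: a flat prior of width Δθ and a Gaussian likelihood of width δθ, both invented) compares direct quadrature of p(D|M) with "maximum likelihood × Occam factor".

    ```python
    import numpy as np

    # Assumed toy model: flat prior over [lo, hi] (width Delta), Gaussian likelihood
    # of width delta centred on theta_hat, with delta << Delta.
    theta_hat, delta, L_max = 2.0, 0.1, 1.0
    lo, hi = -3.0, 7.0
    Delta = hi - lo

    theta = np.linspace(lo, hi, 200001)
    prior = np.full_like(theta, 1.0 / Delta)
    L = L_max * np.exp(-0.5 * ((theta - theta_hat) / delta) ** 2)

    # Marginal likelihood by direct quadrature ...
    evidence = np.trapz(prior * L, theta)

    # ... versus maximum likelihood times the Occam factor delta_theta / Delta,
    # where delta_theta = (integral of L) / L_max is the effective likelihood width.
    delta_theta = np.trapz(L, theta) / L_max
    approx = L_max * delta_theta / Delta

    print(evidence, approx)   # they agree, and both are << L_max since Delta >> delta
    ```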


    Model Averaging


    Problem statement

    I = (M1 ∨ M2 ∨ ...): Specify a set of models
    Models all share a set of "interesting" parameters, ψ
    Each has a different set of nuisance parameters ηi (or different prior info about them)
    Hi = statements about ψ

    Model averaging

    Calculate the posterior PDF for ψ:

    p(ψ|D, I) = ∑i p(Mi|D, I) p(ψ|D, Mi)
              ∝ ∑i L(Mi) ∫ dηi p(ψ, ηi|D, Mi)

    The model choice is a (discrete) nuisance parameter here.


    Theme: Parameter Space Volume


    Bayesian calculations sum/integrate over parameter/hypothesis space!

    (Frequentist calculations average over sample space & typically optimize over parameter space.)

    Marginalization weights the profile likelihood by a volume factor for the nuisance parameters.

    Model likelihoods have Occam factors resulting from parameter space volume factors.

    Many virtues of Bayesian methods can be attributed to this accounting for the size of parameter space. This idea does not arise naturally in frequentist statistics (but it can be added by hand).


    Roles of the Prior


    The prior has two roles:

    Incorporate any relevant prior information

    Convert the likelihood from an "intensity" to a "measure": accounts for the size of hypothesis space

    Physical analogy

    Heat:        Q = ∫ dV cv(r) T(r)
    Probability: P ∝ ∫ dθ p(θ|I) L(θ)

    Maximum likelihood focuses on the hottest hypotheses.
    Bayes focuses on the hypotheses with the most heat.
    A high-T region may contain little heat if its cv is low or if its volume is small.
    A high-L region may contain little probability if its prior is low or if its volume is small.


    Recap of Key Ideas


    Probability as generalized logic for appraising arguments
    Three theorems: BT, LTP, Normalization
    Calculations characterized by parameter space integrals
    Credible regions, posterior expectations
    Marginalization over nuisance parameters
    Occam's razor via marginal likelihoods


    Outline


    1 The Big Picture

    2 Foundations: Logic & Probability Theory

    3 Inference With Parametric Models: Parameter Estimation, Model Uncertainty

    4 Simple Examples: Binary Outcomes, Normal Distribution, Poisson Distribution

    5 Application: Extrasolar Planets

    6 Probability & Frequency


    Binary Outcomes: Parameter Estimation

    M = Existence of two outcomes, S and F; each trial has the same probability for S or F

    Hi = Statements about α, the probability for success on the next trial; seek p(α|D, M)

    D = Sequence of results from N observed trials: FFSSSSFSSSFS (n = 8 successes in N = 12 trials)

    Likelihood:

    p(D|α, M) = [p(success|α, M)]^n [p(failure|α, M)]^(N−n) = α^n (1 − α)^(N−n) = L(α)


    Prior

    Starting with no information about α beyond its definition, use as an "uninformative" prior p(α|M) = 1. Justifications:

    Intuition: Don't prefer any interval to any other of the same size

    Bayes's justification: "Ignorance" means that before doing the N trials, we have no preference for how many will be successes:

    P(n successes|M) = 1/(N + 1)  so  p(α|M) = 1

    Consider this a convention: an assumption added to M to make the problem well posed.


    Prior Predictive

    p(D|M) = ∫ dα α^n (1 − α)^(N−n)
            = B(n + 1, N − n + 1) = n! (N − n)! / (N + 1)!

    A Beta integral:  B(a, b) ≡ ∫ dx x^(a−1) (1 − x)^(b−1) = Γ(a)Γ(b)/Γ(a + b).


    Posterior

    p(α|D, M) = [(N + 1)! / (n! (N − n)!)] α^n (1 − α)^(N−n)

    A Beta distribution. Summaries:

    Best fit: α̂ = n/N = 2/3;  ⟨α⟩ = (n + 1)/(N + 2) ≈ 0.64

    Uncertainty: σα = [(n + 1)(N − n + 1) / ((N + 2)²(N + 3))]^(1/2) ≈ 0.12

    Find credible regions numerically, or with the incomplete beta function.

    Note that the posterior depends on the data only through n, not the N binary numbers describing the sequence. n is a (minimal) sufficient statistic.
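    A minimal Python check of these summaries for the n = 8, N = 12 example, using scipy's Beta distribution; the last line simply brackets the central 95% of the posterior.

    ```python
    from scipy import stats

    n, N = 8, 12                           # the slide's example data
    post = stats.beta(n + 1, N - n + 1)    # flat prior => Beta(n+1, N-n+1) posterior

    print("mode:", n / N)                  # 2/3
    print("mean:", post.mean())            # (n+1)/(N+2) ~ 0.64
    print("sd  :", post.std())             # ~ 0.12
    print("central 95% credible region:", post.interval(0.95))
    ```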


    Binary Outcomes: Model Comparison


    Equal Probabilities?

    M1: α = 1/2
    M2: α ∈ [0, 1] with a flat prior.

    Maximum Likelihoods:

    M1: p(D|M1) = 1/2^N = 2.44 × 10⁻⁴

    M2: L(α̂) = (2/3)^n (1/3)^(N−n) = 4.82 × 10⁻⁴

    p(D|M1) / p(D|α̂, M2) = 0.51

    Maximum likelihoods favor M2 (failures more probable).


    Bayes Factor (ratio of model likelihoods)


    p(D|M1) = 1/2^N;  and  p(D|M2) = n! (N − n)! / (N + 1)!

    B12 ≡ p(D|M1) / p(D|M2) = (N + 1)! / [n! (N − n)! 2^N] = 1.57

    Bayes factor (odds) favors M1 (equiprobable).

    Note that for n = 6, B12 = 2.93; for this small amount of data, we can never be very sure the results are equiprobable.

    If n = 0, B12 ≈ 1/315; if n = 2, B12 ≈ 1/4.8; for extreme data, 12 flips can be enough to lead us to strongly suspect the outcomes have different probabilities.

    (Frequentist significance tests can reject null for any sample size.)
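    The Bayes factors quoted above follow directly from B12 = (N + 1)! / [n! (N − n)! 2^N]; a short Python sketch reproduces them exactly.

    ```python
    from math import comb

    def bayes_factor_12(n, N):
        """B12 = p(D|M1)/p(D|M2) = (N+1)!/(n!(N-n)! 2^N) = (N+1)*C(N,n)/2^N."""
        return (N + 1) * comb(N, n) / 2 ** N

    N = 12
    for n in (8, 6, 2, 0):
        b12 = bayes_factor_12(n, N)
        print(f"n={n:2d}  B12={b12:.3f}  1/B12={1/b12:.1f}")
    # Reproduces the quoted values: 1.57 (n=8), 2.93 (n=6), ~1/4.8 (n=2), ~1/315 (n=0).
    ```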


    Binary Outcomes: Binomial Distribution


    Suppose D = n (the number of heads in N trials), rather than the actual sequence. What is p(α|n, M)?

    Likelihood

    Let S = a sequence of flips with n heads.

    p(n|α, M) = ∑S p(S|α, M) p(n|S, α, M)
              = α^n (1 − α)^(N−n) × [# of sequences with n successes]
              = α^n (1 − α)^(N−n) Cn,N

    Cn,N = # of sequences of length N with n heads.

    p(n|α, M) = [N! / (n! (N − n)!)] α^n (1 − α)^(N−n)

    The binomial distribution for n given α, N.

    Posterior


    p(α|n, M) = [N! / (n! (N − n)!)] α^n (1 − α)^(N−n) / p(n|M)

    p(n|M) = [N! / (n! (N − n)!)] ∫ dα α^n (1 − α)^(N−n) = 1/(N + 1)

    p(α|n, M) = [(N + 1)! / (n! (N − n)!)] α^n (1 − α)^(N−n)

    Same result as when the data specified the actual sequence.


    Another Variation: Negative Binomial


    Suppose D = N, the number of trials it took to obtain a predefined number of successes, n = 8. What is p(α|N, M)?

    Likelihood

    p(N|α, M) is the probability for n − 1 successes in N − 1 trials, times the probability that the final trial is a success:

    p(N|α, M) = [(N − 1)! / ((n − 1)! (N − n)!)] α^(n−1) (1 − α)^(N−n) × α
              = [(N − 1)! / ((n − 1)! (N − n)!)] α^n (1 − α)^(N−n)

    The negative binomial distribution for N given α, n.


    Posterior

    p(α|D, M) = C'n,N α^n (1 − α)^(N−n) / p(D|M)

    p(D|M) = C'n,N ∫ dα α^n (1 − α)^(N−n)  so  p(α|D, M) = [(N + 1)! / (n! (N − n)!)] α^n (1 − α)^(N−n)

    Same result as in the other cases.


    Final Variation: Meteorological Stopping


    Suppose D = (N, n), the number of samples and the number of successes in an observing run whose total number was determined by the weather at the telescope. What is p(α|D, M′)? (M′ adds info about the weather to M.)

    Likelihood

    p(D|α, M′) is the binomial distribution times the probability that the weather allowed N samples, W(N):

    p(D|α, M′) = W(N) [N! / (n! (N − n)!)] α^n (1 − α)^(N−n)

    Let C''n,N = W(N) (N choose n). We get the same result as before!


    Likelihood Principle


    To define L(Hi) = p(Dobs|Hi, I), we must contemplate what other data we might have obtained. But the real sample space may be determined by many complicated, seemingly irrelevant factors; it may not be well-specified at all. Should this concern us?

    Likelihood principle: The results of inferences depend only on how p(Dobs|Hi, I) varies w.r.t. hypotheses. We can ignore aspects of the observing/sampling procedure that do not affect this dependence.

    This is a sensible property that frequentist methods do not share. Frequentist probabilities are long-run rates of performance, and thus depend on details of the sample space that may be irrelevant in a Bayesian calculation.

    Example: Predict 10% of a sample is Type A; observe nA = 5 for N = 96.
    A significance test accepts α = 0.1 for binomial sampling: p(> χ²|α = 0.1) = 0.12
    A significance test rejects α = 0.1 for negative binomial sampling: p(> χ²|α = 0.1) = 0.03


    Gauss's Observation: Sufficiency


    Suppose our data consist of N measurements, di = μ + εi.

    Suppose the noise contributions are independent, and εi ~ N(0, σ²).

    p(D|μ, σ, M) = ∏i p(di|μ, σ, M)
                 = ∏i p(εi = di − μ|μ, σ, M)
                 = ∏i [1/(σ√(2π))] exp[−(di − μ)²/(2σ²)]
                 = [1/(σ^N (2π)^(N/2))] e^(−Q(μ)/(2σ²))


    Find the dependence of Q on μ by completing the square:


    Q(μ) = ∑i (di − μ)²
         = ∑i di² + Nμ² − 2Nμ d̄,    where d̄ ≡ (1/N) ∑i di
         = N(μ − d̄)² + Nr²,          where r² ≡ (1/N) ∑i (di − d̄)²

    The likelihood depends on {di} only through d̄ and r:

    L(μ, σ) = [1/(σ^N (2π)^(N/2))] exp[−Nr²/(2σ²)] exp[−N(μ − d̄)²/(2σ²)]

    The sample mean and variance are sufficient statistics.

    This is a miraculous compression of information: the normal dist'n is highly abnormal in this respect!
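    A short numerical illustration of this sufficiency (a sketch with simulated data, not from the talk): the likelihood evaluated from the full data set and from (d̄, r) alone agree to machine precision.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    N, mu_true, sigma = 20, 3.0, 1.5
    d = mu_true + sigma * rng.normal(size=N)          # simulated data (assumption)

    dbar = d.mean()
    r2 = np.mean((d - dbar) ** 2)                     # r^2 as defined above

    def logL_full(mu):
        """Log-likelihood from the full data set."""
        return (-N * np.log(sigma) - 0.5 * N * np.log(2 * np.pi)
                - 0.5 * np.sum((d - mu) ** 2) / sigma ** 2)

    def logL_sufficient(mu):
        """Log-likelihood using only the sufficient statistics dbar and r."""
        return (-N * np.log(sigma) - 0.5 * N * np.log(2 * np.pi)
                - 0.5 * (N * r2 + N * (mu - dbar) ** 2) / sigma ** 2)

    for mu in (2.0, 3.0, 4.0):
        print(mu, logL_full(mu), logL_sufficient(mu))   # identical up to roundoff
    ```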


    Uninformative prior

    Translation invariance implies p(μ) = C, a constant. This prior is improper unless bounded.

    Prior predictive/normalization

    p(D|σ, M) = ∫ dμ C exp[−N(μ − d̄)²/(2σ²)] = C (σ/√N) √(2π)

    . . . minus a tiny bit from tails, using a proper prior.


    Posterior

    p(μ|D, σ, M) = [1/((σ/√N) √(2π))] exp[−N(μ − d̄)²/(2σ²)]

    The posterior is N(d̄, w²), with standard deviation w = σ/√N.

    The 68.3% HPD credible region for μ is d̄ ± σ/√N.

    Note that C drops out; the limit of an infinite prior range is well behaved.


    Informative Conjugate Prior


    Use a normal prior, μ ~ N(μ0, w0²)

    Posterior

    Normal: N(μ̃, w̃²), but the mean and std. deviation "shrink" towards the prior.
    Define B = w²/(w² + w0²); B quantifies the shrinkage toward the prior.


    Problem specification

    Model: di = μ + εi, εi ~ N(0, σ²), with σ unknown
    Parameter space: (μ, σ); seek p(μ|D, M)

    Likelihood

    p(D|μ, σ, M) = [1/(σ^N (2π)^(N/2))] exp[−Nr²/(2σ²)] exp[−N(μ − d̄)²/(2σ²)]
                 ∝ (1/σ^N) e^(−Q/(2σ²))

    where Q = N[r² + (μ − d̄)²]


    Marginal Posterior

    p(μ|D, M) ∝ ∫ dσ (1/σ^(N+1)) e^(−Q/(2σ²))

    Let τ = Q/(2σ²), so σ = [Q/(2τ)]^(1/2) and |dσ| = (1/2) (Q/2)^(1/2) τ^(−3/2) dτ

    p(μ|D, M) ∝ Q^(−N/2) ∫ dτ τ^(N/2 − 1) e^(−τ)  ∝  Q^(−N/2)


    Write Q = Nr²[1 + ((μ − d̄)/r)²] and normalize:

    p(μ|D, M) = [(N/2 − 1)! / ((N/2 − 3/2)! √π)] (1/r) [1 + (1/N) ((μ − d̄)/(r/√N))²]^(−N/2)

    Student's t distribution, with t = (μ − d̄)/(r/√N)

    A bell curve, but with power-law tails

    Large N:  p(μ|D, M) ≈ e^(−N(μ − d̄)²/(2r²))
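    The σ-marginalization can be checked numerically. The sketch below (simulated data, 1/σ prior assumed as above) integrates σ out on a grid and compares the result with the Student's-t shape [1 + ((μ − d̄)/r)²]^(−N/2).

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    N = 10
    d = 5.0 + 2.0 * rng.normal(size=N)       # simulated data (assumption)
    dbar, r = d.mean(), d.std()              # np.std uses 1/N, matching r as defined

    mu = np.linspace(dbar - 5 * r, dbar + 5 * r, 2001)
    sigma = np.linspace(1e-3, 20.0, 4000)

    # Marginal posterior: integrate (1/sigma^(N+1)) exp(-Q/(2 sigma^2)) over sigma,
    # with Q = N[r^2 + (mu - dbar)^2]; the extra 1/sigma is the scale-parameter prior.
    Q = N * (r ** 2 + (mu[:, None] - dbar) ** 2)
    log_f = -(N + 1) * np.log(sigma[None, :]) - Q / (2 * sigma[None, :] ** 2)
    marg = np.trapz(np.exp(log_f - log_f.max()), sigma, axis=1)
    marg /= np.trapz(marg, mu)

    # Analytic shape: p(mu|D, M) proportional to [1 + ((mu - dbar)/r)^2]^(-N/2)
    analytic = (1.0 + ((mu - dbar) / r) ** 2) ** (-N / 2)
    analytic /= np.trapz(analytic, mu)

    print(np.max(np.abs(marg - analytic)))   # tiny compared with analytic.max()
    ```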


    Poisson Dist'n: Infer a Rate from Counts

    Problem: Observe n counts in T; infer the rate, r


    Likelihood

    L(r) ≡ p(n|r, M) = [(rT)^n / n!] e^(−rT)

    Prior

    Two standard choices:

    r known to be nonzero; it is a scale parameter:
    p(r|M) = [1/ln(ru/rl)] (1/r)

    r may vanish; require p(n|M) ≈ const:
    p(r|M) = 1/ru


    Prior predictive


    p(n|M) = (1/ru) (1/n!) ∫(0 to ru) dr (rT)^n e^(−rT)
           = [1/(ru T)] (1/n!) ∫(0 to ru T) d(rT) (rT)^n e^(−rT)
           ≈ 1/(ru T)    for ru ≫ n/T

    Posterior

    A gamma distribution:

    p(r|n, M) = [T (rT)^n / n!] e^(−rT)


    Gamma Distributions


    A 2-parameter family of distributions over nonnegative x, with shape parameter α and scale parameter s:

    p(x|α, s) = [1/(s Γ(α))] (x/s)^(α−1) e^(−x/s)

    Moments:  E(x) = αs,  Var(x) = αs²

    Our posterior corresponds to α = n + 1, s = 1/T.

    Mode r̂ = n/T; mean ⟨r⟩ = (n + 1)/T (shift down by 1 with the 1/r prior)

    Std. dev'n σr = √(n + 1)/T; credible regions are found by integrating (can use the incomplete gamma function)
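    A minimal Python sketch of these rate summaries using scipy's Gamma distribution; the counts n and time T are invented example numbers, not from the slides.

    ```python
    from scipy import stats

    n, T = 16, 4.0                                # invented counts and observing time
    post = stats.gamma(a=n + 1, scale=1.0 / T)    # flat-prior posterior: shape n+1, scale 1/T

    print("mode:", n / T)                         # n/T
    print("mean:", post.mean())                   # (n+1)/T
    print("sd  :", post.std())                    # sqrt(n+1)/T
    print("central 95% credible region:", post.interval(0.95))
    ```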


    The flat prior


    Bayes's justification: Not that ignorance of r implies p(r|I) = C. Rather, require the (discrete) predictive distribution to be flat:

    p(n|I) = ∫ dr p(r|I) p(n|r, I) = C  so  p(r|I) = C

    Useful conventions

    Use a flat prior for a rate that may be zero
    Use a log-flat prior (∝ 1/r) for a nonzero scale parameter
    Use proper (normalized, bounded) priors
    Plot the posterior with an abscissa that makes the prior flat


    The On/Off Problem

    Basic problem


    Look off-source; unknown background rate b. Count Noff photons in interval Toff.

    Look on-source; the rate is r = s + b with unknown signal s. Count Non photons in interval Ton.

    Infer s.

    Conventional solution

    b̂ = Noff/Toff;   σb = √Noff / Toff
    r̂ = Non/Ton;     σr = √Non / Ton
    ŝ = r̂ − b̂;       σs = √(σr² + σb²)

    But ŝ can be negative!


    Examples

    Spectra of X-Ray Sources


    [Figures: X-ray source spectra, from Bassani et al. 1989 and Di Salvo et al. 2001]


    Spectrum of Ultrahigh-Energy Cosmic Rays

    [Figures: the UHE cosmic-ray spectrum, flux × E³ vs. log10(E) (eV), from Nagano & Watson 2000 and the HiRes Team 2007, showing AGASA, HiRes-1 Monocular, and HiRes-2 Monocular data]


    N is Never Large


    Sample sizes are never large. If N is too small to get a sufficiently precise estimate, you need to get more data (or make more assumptions). But once N is "large enough," you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc.). N is never enough because if it were "enough" you'd already be on to the next problem for which you need more data.

    Similarly, you never have quite enough money. But that's another story.

    Andrew Gelman (blog entry, 31 July 2005)


    Backgrounds as Nuisance Parameters


    Background marginalization with Gaussian noise

    Measure the background rate b = b̂ ± σb with the source off. Measure the total rate r = r̂ ± σr with the source on. Infer the signal source strength s, where r = s + b. With flat priors,

    p(s, b|D, M) ∝ exp[−(b − b̂)²/(2σb²)] exp[−(s + b − r̂)²/(2σr²)]


    Marginalize b to summarize the results for s (complete the square to isolate the b dependence; then do a simple Gaussian integral over b):

    p(s|D, M) ∝ exp[−(s − ŝ)²/(2σs²)],   with ŝ = r̂ − b̂ and σs² = σr² + σb²

    Background subtraction is a special case of background marginalization.


    Bayesian Solution to On/Off Problem


    First consider off-source data; use it to estimate b:

    p(b|Noff, Ioff) = Toff (bToff)^Noff e^(−bToff) / Noff!

    Use this as a prior for b to analyze the on-source data. For the on-source analysis, Iall = (Ion, Noff, Ioff):

    p(s, b|Non, Iall) ∝ p(s) p(b) [(s + b)Ton]^Non e^(−(s+b)Ton)

    p(s|Iall) is flat, but p(b|Iall) = p(b|Noff, Ioff), so

    p(s, b|Non, Iall) ∝ (s + b)^Non b^Noff e^(−sTon) e^(−b(Ton + Toff))


    Now marginalize over b;


    p(s|Non, Iall) = ∫ db p(s, b|Non, Iall)
                   ∝ ∫ db (s + b)^Non b^Noff e^(−sTon) e^(−b(Ton + Toff))

    Expand (s + b)^Non and do the resulting integrals:

    p(s|Non, Iall) = ∑(i=0..Non) Ci Ton (sTon)^i e^(−sTon) / i!

    Ci ∝ (1 + Toff/Ton)^i (Non + Noff − i)! / (Non − i)!

    The posterior is a weighted sum of Gamma distributions, each assigning a different number of on-source counts to the source. (Evaluate via a recursive algorithm or the confluent hypergeometric function.)
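    The weighted-Gamma-sum posterior is straightforward to implement. The sketch below (example counts invented) evaluates the Ci in log space for numerical stability and cross-checks the result against a brute-force numerical marginalization of b.

    ```python
    import numpy as np
    from scipy.special import gammaln

    def on_off_posterior(s, N_on, N_off, T_on, T_off):
        """p(s|N_on, I_all) as the weighted sum of Gamma densities given above."""
        i = np.arange(N_on + 1)
        # log of the unnormalized weights C_i:
        logw = (i * np.log1p(T_off / T_on)
                + gammaln(N_on + N_off - i + 1) - gammaln(N_on - i + 1))
        C = np.exp(logw - logw.max())
        C /= C.sum()
        # Gamma(shape i+1, scale 1/T_on) densities in s, mixed with weights C_i.
        log_terms = (i[:, None] * np.log(np.maximum(s, 1e-300) * T_on)
                     - s[None, :] * T_on - gammaln(i[:, None] + 1) + np.log(T_on))
        return C @ np.exp(log_terms)

    N_on, N_off, T_on, T_off = 16, 9, 1.0, 3.0     # invented example counts/times
    s = np.linspace(0.0, 40.0, 2001)
    p_exact = on_off_posterior(s, N_on, N_off, T_on, T_off)

    # Brute-force check: marginalize b numerically from the joint posterior.
    b = np.linspace(0.0, 40.0, 4001)
    S, B = np.meshgrid(s, b, indexing="ij")
    joint = (S + B) ** N_on * B ** N_off * np.exp(-S * T_on - B * (T_on + T_off))
    p_brute = np.trapz(joint, b, axis=1)
    p_brute /= np.trapz(p_brute, s)

    print(np.trapz(p_exact, s))                    # ~1: the mixture is normalized
    print(np.max(np.abs(p_exact - p_brute)))       # small: the two routes agree
    ```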


    Example On/Off Posteriors: Short Integrations

    [Figures: example posteriors for s, with Ton = 1]


    Example On/Off Posteriors: Long Background Integrations

    [Figures: example posteriors for s, with Ton = 1]


    Multibin On/Off

    The more typical on/off scenario:


    Data = spectrum or image with counts in many bins

    Model M gives the signal rate sk(θ) in bin k, with parameters θ

    To infer θ, we need the likelihood:

    L(θ) = ∏k p(Non,k, Noff,k | sk(θ), M)

    For each k, we have an on/off problem as before, only we just need the marginal likelihood for sk (not the posterior). The same Ci coefficients arise.

    XSPEC and CIAO/Sherpa provide this as an option.

    CHASC approach does the same thing via data augmentation.


    Bayesian Computation

    Large sample size: Laplace approximation


    Approximate the posterior as multivariate normal; det(covariance) factors
    Uses ingredients available in χ²/ML fitting software (MLE, Hessian)
    Often accurate to O(1/N)

    Low-dimensional models (d ≲ 5)

    Posterior sampling: create an RNG that samples the posterior
    MCMC is the most general framework
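    A minimal sketch of the Laplace approximation for the marginal likelihood, assuming a toy straight-line model with Gaussian noise and a flat, proper prior (all numbers invented); a grid integration provides a check, which is feasible only because the model has two parameters.

    ```python
    import numpy as np
    from scipy.optimize import minimize

    # Toy problem (all invented): d = a + b*x with known noise sigma, flat prior on
    # (a, b) over a box, so the Laplace approximation should be essentially exact.
    rng = np.random.default_rng(3)
    x = np.linspace(0.0, 1.0, 15)
    sigma = 0.3
    d = 1.0 + 2.0 * x + sigma * rng.normal(size=x.size)
    a_rng, b_rng = (-5.0, 5.0), (-5.0, 5.0)
    log_prior = -np.log((a_rng[1] - a_rng[0]) * (b_rng[1] - b_rng[0]))

    def log_post(theta):
        a, b = theta
        return (log_prior - 0.5 * np.sum((d - a - b * x) ** 2) / sigma ** 2
                - x.size * np.log(sigma * np.sqrt(2 * np.pi)))

    res = minimize(lambda t: -log_post(t), x0=[0.0, 0.0])   # find the posterior mode

    def hessian(theta, eps=1e-4):
        """Finite-difference Hessian of -log posterior at theta."""
        f = lambda t: -log_post(t)
        H = np.zeros((2, 2))
        for i in range(2):
            for j in range(2):
                tpp = np.array(theta); tpp[i] += eps; tpp[j] += eps
                tpm = np.array(theta); tpm[i] += eps; tpm[j] -= eps
                tmp = np.array(theta); tmp[i] -= eps; tmp[j] += eps
                tmm = np.array(theta); tmm[i] -= eps; tmm[j] -= eps
                H[i, j] = (f(tpp) - f(tpm) - f(tmp) + f(tmm)) / (4 * eps ** 2)
        return H

    H = hessian(res.x)
    # log evidence ~ log p(mode) + (d/2) log(2 pi) - (1/2) log det(H), with d = 2.
    logZ_laplace = log_post(res.x) + np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(H))

    # Grid check over the prior box.
    a = np.linspace(*a_rng, 401)
    b = np.linspace(*b_rng, 401)
    chi2 = np.sum((d[None, None, :] - a[:, None, None]
                   - b[None, :, None] * x[None, None, :]) ** 2, axis=2) / sigma ** 2
    lp = log_prior - 0.5 * chi2 - x.size * np.log(sigma * np.sqrt(2 * np.pi))
    m = lp.max()
    logZ_grid = m + np.log(np.trapz(np.trapz(np.exp(lp - m), b, axis=1), a))

    print(logZ_laplace, logZ_grid)    # the two estimates agree closely
    ```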


    Outline

    1 The Big Picture


    2 Foundations: Logic & Probability Theory

    3 Inference With Parametric Models: Parameter Estimation, Model Uncertainty

    4 Simple Examples: Binary Outcomes, Normal Distribution, Poisson Distribution

    5 Application: Extrasolar Planets

    6 Probability & Frequency


    Extrasolar Planets

    Exoplanet detection/measurement methods:


    Direct: Transits, gravitational lensing, imaging, interferometric nulling

    Indirect: Keplerian reflex motion (line-of-sight velocity, astrometric wobble)

    The Sun's Wobble From 10 pc

    [Figure: the Sun's reflex wobble as seen from 10 pc; both axes run from −1000 to 1000]


    Radial Velocity Technique

    As of May 2007, 242 planets found, including 26 multiple-planet systems


    Vast majority (230) found via Doppler radial velocity (RV) measurements

    Analysis method: Identify the best candidate period via a periodogram; fit parameters with nonlinear least squares

    Issues: Multimodality & multiple planets, nonlinearity, stellar jitter, long periods, marginal detections, population studies...


    Keplerian Radial Velocity Model

    Parameters for a single planet:


    τ = orbital period (days)
    e = orbital eccentricity
    K = velocity amplitude (m/s)
    ω = longitude of pericenter
    Mp = mean anomaly at pericenter passage
    v0 = system center-of-mass velocity

    Velocity vs. time

    v(t) = v0 + K (e cos ω + cos[ω + ν(t)])

    The true anomaly ν(t) is found via Kepler's equation for the eccentric anomaly:

    E(t) − e sin E(t) = 2πt/τ − Mp;    tan(ν/2) = [(1 + e)/(1 − e)]^(1/2) tan(E/2)

    A strongly nonlinear model!

    Kepler's laws relate (K, τ, e) to the masses, semimajor axis a, and inclination i.
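    A minimal Python sketch of evaluating this velocity model: solve Kepler's equation by Newton iteration, convert to the true anomaly, and form v(t). The parameter values are illustrative (roughly the toy values used later in the talk), not fitted to anything.

    ```python
    import numpy as np

    def rv_model(t, tau, e, K, Mp, omega, v0):
        """Keplerian radial velocity v(t) for a single planet (sketch of the model above)."""
        M = 2.0 * np.pi * t / tau - Mp                 # mean anomaly
        E = M.copy()                                   # solve E - e sin E = M by Newton iteration
        for _ in range(50):
            E -= (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
        # true anomaly from the eccentric anomaly; only cos(omega + nu) is needed,
        # so the arctan branch ambiguity (a multiple of 2*pi in nu) is harmless
        nu = 2.0 * np.arctan(np.sqrt((1.0 + e) / (1.0 - e)) * np.tan(E / 2.0))
        return v0 + K * (e * np.cos(omega) + np.cos(omega + nu))

    # Illustrative parameter values.
    t = np.linspace(0.0, 1600.0, 400)                                  # days
    v = rv_model(t, tau=800.0, e=0.5, K=50.0, Mp=0.0, omega=1.0, v0=0.0)
    print(v.min(), v.max())
    ```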


    The Likelihood Function

    Keplerian velocity model with parameters θ = {K, τ, e, Mp, ω, v0}:

    di = v(ti; θ) + εi


    For measurement errors with std dev'n σi, and additional "jitter" with std dev'n σJ,

    L(θ, σJ) ≡ p({di}|θ, σJ)
             = ∏(i=1..N) [2π(σi² + σJ²)]^(−1/2) exp{ −[di − v(ti; θ)]² / [2(σi² + σJ²)] }
             ∝ ∏i (σi² + σJ²)^(−1/2) exp[−χ²(θ, σJ)/2]

    where χ²(θ, σJ) ≡ ∑i [di − v(ti; θ)]² / (σi² + σJ²)

    Ignore jitter for now...
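    As a sketch, the likelihood just written is only a few lines of Python; the function below takes precomputed model velocities v(ti; θ) (from, e.g., the earlier Kepler-equation sketch), and the numbers at the end are invented solely to exercise it.

    ```python
    import numpy as np

    def log_likelihood(d, v_pred, sig, sig_J):
        """Gaussian RV log-likelihood with an extra 'jitter' variance sig_J**2.

        d      : measured velocities
        v_pred : model velocities v(t_i; theta), computed elsewhere
        sig    : per-point measurement uncertainties sigma_i
        sig_J  : jitter standard deviation sigma_J
        """
        var = sig ** 2 + sig_J ** 2
        chi2 = np.sum((d - v_pred) ** 2 / var)
        return -0.5 * np.sum(np.log(2.0 * np.pi * var)) - 0.5 * chi2

    # Invented numbers, just to exercise the function.
    d = np.array([3.1, -2.0, 0.5, 4.2])
    v_pred = np.array([2.8, -1.5, 0.0, 3.9])
    sig = np.array([0.5, 0.6, 0.5, 0.7])
    print(log_likelihood(d, v_pred, sig, sig_J=0.0))
    print(log_likelihood(d, v_pred, sig, sig_J=2.0))   # jitter broadens the error bars
    ```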

    What To Do With It

    Parameter estimation

    Posterior dist'n for the parameters of model Mi with i planets:


    p(θ|D, Mi) ∝ p(θ|Mi) Li(θ)

    Summarize with the mode, means, and credible regions (found by integrating over θ)

    Detection

    Calculate the probability for no planets (M0), one planet (M1), .... Let I = {Mi}.

    p(Mi|D, I) ∝ p(Mi|I) L(Mi),  where L(Mi) = ∫ dθ p(θ|Mi) L(θ)

    The marginal likelihood L(Mi) includes the Occam factor.

    Design

    Predict a future datum dt at time t, accounting for model uncertainties:


    p(dt|D, Mi) = ∫ dθ p(dt|θ, Mi) p(θ|D, Mi).

    The 1st factor is Gaussian for dt with a known model; the 2nd factor & the integral account for θ uncertainty.

    Bayesian adaptive exploration

    Find the observing time likely to make the updated posterior shrink the most. Information theory: the best time has the largest dt uncertainty (maximum entropy in p(dt|D, Mi)).

    [Diagram: the observation-inference-design cycle, linking prior information & data, inference results, combined information, predictions, strategy, observation design, and new data]


    Periodogram Connections

    Assume circular orbits: θ = {K, τ, Mp, v0}


    Frequentist

    For given τ, maximize the likelihood over K and Mp (set v0 to the data mean, v̄) to get the profile likelihood:
    log Lp(τ, v̄), the Lomb-Scargle periodogram

    Bayesian

    For given τ, integrate (marginalize) likelihood × prior over K and Mp (set v0 to the data mean, v̄) to get the marginal posterior:
    log p(τ, v̄|D), the Lomb-Scargle periodogram

    Additionally marginalize over v0 to get the "floating-mean" LSP


    Kepler Periodogram for Eccentric Orbits

    The circular model is linear w.r.t. K sin Mp and K cos Mp, hence the coincidence of the profile and the marginal.


    The full model is nonlinear w.r.t. e, Mp, ω.

    Profiling is known to behave poorly for nonlinear parameters (e.g., asymptotic confidence regions have incorrect coverage for finite samples; it can even be inconsistent).

    Marginalization is valid with complete generality; it also has good frequentist properties.

    For given τ, marginalize likelihood × prior over all params but τ: log p(τ|D), the Kepler periodogram

    This requires a 5-d integral.

    Trick: For a K prior ∝ K (interim prior), three of the integrals can be done analytically; integrate numerically only over (e, Mp).


    A Bayesian Workflow for Exoplanets


    Use the Kepler periodogram to reduce the dimensionality to 3-d: (τ, e, Mp).

    Use the Kepler periodogram results (p(τ), moments of e, Mp) to define the initial population for adaptive, population-based MCMC.

    Once we have {τ, e, Mp}, get the associated {K, ω, v0} samples from their exact conditional distribution.

    Fix the interim prior by weighting the MCMC samples.

    (See work by Phil Gregory & Eric Ford for simpler but less efficient workflows.)


    Estimation Results for HD 222582

    24 Keck RV observations spanning 683 days; long period; high eccentricity

    Kepler Periodogram


    Differential Evolution MCMC Performance

    [Figures: marginal posterior for (τ, e); convergence diagnostic: autocorrelation]


    Reaches convergence much more quickly than simpler algorithms


    Multiple Planets in HD 208487

    Phil Gregory's parallel tempering algorithm found a 2nd planet in HD 208487.


    τ1 = 129.8 d; τ2 = 909 d

    [Figures: (a) velocity (m/s) vs. orbital phase folded at P1; (b) velocity (m/s) vs. orbital phase folded at P2; and the joint marginal for (e2, K2)]


    Adaptive Exploration for Toy Problem

    Data are Kepler velocities plus noise:


    di = V(ti; τ, e, K) + ei

    The 3 remaining geometrical params (t0, ω, i) are fixed.

    The noise probability is Gaussian with known σ = 8 m s⁻¹.

    Simulate data with typical Jupiter-like exoplanet parameters:

    τ = 800 d
    e = 0.5
    K = 50 m s⁻¹

    Goal: Estimate the parameters τ, e, and K as efficiently as possible.


    Cycle 1: Inference

    Use flat priors,

    p(τ, e, K|D, I) ∝ exp[−Q(τ, e, K)/(2σ²)]


    Q = sum of squared residuals using best-fit amplitudes.

    Generate {τj, ej, Kj} via posterior sampling.


    Cycle 1 Design: Predictions, Entropy


    Cycle 2: Observation


    Cycle 2: Inference


    Volume V2 = V1/5.8


    Evolution of Inferences

    Cycle 1 (10 samples)


    Cycle 2 (11 samples; V2 = V1/5.8)


    Cycle 3 (12 samples; V3 = V2/3.9)


    Cycle 4 (13 samples; V4 = V3/1.8)


    Outline

    1 The Big Picture

    2 Foundations: Logic & Probability Theory


    3 Inference With Parametric Models: Parameter Estimation, Model Uncertainty

    4 Simple Examples: Binary Outcomes, Normal Distribution, Poisson Distribution

    5 Application: Extrasolar Planets

    6 Probability & Frequency


    Probability & Frequency

    Frequencies are relevant when modeling repeated trials, or repeated sampling from a population or ensemble.


    Frequencies are observables:

    When available, frequencies can be used to infer probabilities for the next trial
    When unavailable, they can be predicted

    Bayesian/Frequentist relationships:

    General relationships between probability and frequency
    Long-run performance of Bayesian procedures
    Examples of Bayesian/frequentist differences


    Relationships Between Probability & Frequency

    Frequency from probability


    Bernoulli's law of large numbers: In repeated i.i.d. trials, given P(success|...) = α, predict

    Nsuccess/Ntotal → α   as Ntotal → ∞

    Probability from frequency

    Bayes's "An Essay Towards Solving a Problem in the Doctrine of Chances" (the first use of Bayes's theorem): the probability for success in the next trial of an i.i.d. sequence satisfies

    P(success|...) → E(Nsuccess/Ntotal)   as Ntotal → ∞


    Subtle Relationships For Non-IID Cases

    Predict the frequency in dependent trials

    rt = result of trial t; p(r1, r2, ..., rN|M) known; predict f:


    ⟨f⟩ = (1/N) ∑t P(rt = success|M)

    where, e.g., P(r1|M) = ∑(r2) ... ∑(rN) p(r1, r2, ..., rN|M)

    Expected frequency of an outcome in many trials = average probability for the outcome across trials.
    But one also finds that σf need not converge to 0.

    Infer probabilities for different but related trials
    Shrinkage: Biased estimators of the probability that share info across trials are better than unbiased/BLUE/MLE estimators.

    A formalism that distinguishes p from f from the outset is particularly valuable for exploring subtle connections. E.g., shrinkage is explored via hierarchical and empirical Bayes.


    Frequentist Performance of Bayesian Procedures

    Many results are known for parametric Bayes performance:

    Estimates are consistent if the prior doesn't exclude the true value.


    Credible regions found with flat priors are typically confidence regions to O(n^(−1/2)); reference priors can improve their performance to O(n^(−1)).

    Marginal distributions have better frequentist performance than conventional methods like profile likelihood. (Bartlett correction, ancillaries, and the bootstrap are competitive but hard.)

    Bayesian model comparison is asymptotically consistent (not true of significance/NP tests, AIC).

    For separate (not nested) models, the posterior probability for the true model converges to 1 exponentially quickly.

    Wald's complete class theorem: Optimal frequentist methods are Bayes rules (equivalent to Bayes for some prior)...

    Parametric Bayesian methods are typically good frequentist methods. (Not so clear in nonparametric problems.)
