Introduction to Bayesian Inference
Tom Loredo
Dept. of Astronomy, Cornell University
http://www.astro.cornell.edu/staff/loredo/bayes/
CASt Summer School June 7, 2007
Outline
1 The Big Picture
2 Foundations: Logic & Probability Theory
3 Inference With Parametric Models: Parameter Estimation, Model Uncertainty
4 Simple Examples: Binary Outcomes, Normal Distribution, Poisson Distribution
5 Application: Extrasolar Planets
6 Probability & Frequency
Scientific Method
Science is more than a body of knowledge; it is a way of thinking.
The method of science, as stodgy and grumpy as it may seem, is far more important than the findings of science.
Carl Sagan
Scientists argue!
Argument: Collection of statements comprising an act of reasoning from premises to a conclusion
A key goal of science: Explain or predict quantitative
measurements
Framework: Mathematical modeling
Science uses rational argument to construct and appraise mathematical models for measurements
Mathematical Models
A model (in physics) is a representation of structure in a physical
system and/or its properties. (Hestenes)
[Diagram: REAL WORLD and MATHEMATICAL WORLD linked by TRANSLATION & INTERPRETATION]
A model is a surrogate
The model is not the modeled system! It stands in for the system
for a particular purpose, and is subject to revision.
A model is an idealization
A model is a caricature of the system being modeled (Kac). It focuses on a subset of system properties of interest.
A model is an abstraction
Models identify common features of different things so that general
ideas can be created and applied to different situations.
We seek a mathematical model for quantifying uncertainty; it will share these characteristics with physical models.
Asides
Theories are frameworks guiding model construction (laws, principles).
Physics as modeling is a leading school of thought in physics education research; e.g., http://modeling.la.asu.edu/
The Role of Data
Data do not speak for themselves!
We don't just tabulate data, we analyze data.
We gather data so they may speak for or against existing hypotheses, and guide the formation of new hypotheses.
A key role of data in science is to be among the premises in scientific arguments.
Bayesian Statistical Inference
A different approach to all statistical inference problems (i.e., not just another method in the list: BLUE, maximum likelihood, chi-squared testing, ANOVA, survival analysis . . . )
Foundation: Use probability theory to quantify the strength of arguments (i.e., a more abstract view than restricting PT to describe variability in repeated "random" experiments)
Focuses on deriving consequences of modeling assumptions
rather than devising and calibrating procedures
Logic: Some Essentials
"Logic can be defined as the analysis and appraisal of arguments" (Gensler, Intro to Logic)
Build arguments with propositions and logical operators/connectives
Propositions: Statements that may be true or false
P : Universe can be modeled with ΛCDM
A : Ωtot ∈ [0.9, 1.1]
B : Λ is not 0
B̄ : "not B," i.e., Λ = 0
Connectives:
A ∧ B : A and B are both true
A ∨ B : A or B is true, or both are
Arguments
Argument: Assertion that an hypothesized conclusion, H, "follows" from premises, P = {A, B, C, . . .} (take "," = "and")
Notation:
H|P : Premises P imply H
      H may be deduced from P
      H follows from P
      H is true given that P is true
Arguments are (compound) propositions.
Central role of arguments leads to special terminology:
A true argument is valid; a false argument is invalid or fallacious
Valid vs. Sound Arguments
Content vs. form
An argument is factually correct iff all of its premises are true (it has good content).
An argument is valid iff its conclusion follows from its premises (it has good form).
An argument is sound iff it is both factually correct and valid (it has good form and content).
We want to make sound arguments. Logic and statistical methods address validity, but there is no formal approach for addressing factual correctness.
Factual Correctness
Although logic can teach us something about validity and invalidity, it can teach us very little about factual correctness. The question of the truth or falsity of individual statements is primarily the subject matter of the sciences.
(Hardegree, Symbolic Logic)
To test the truth or falsehood of premisses is the task of science. . . . But as a matter of fact we are interested in, and must often depend upon, the correctness of arguments whose premisses are not known to be true.
(Copi, Introduction to Logic)
Premises
Facts: Things known to be true, e.g., observed data
"Obvious" assumptions: Axioms, postulates, e.g., Euclid's first 4 postulates (line segment between 2 points; congruency of right angles . . . )
"Reasonable" or "working" assumptions: E.g., Euclid's fifth postulate (parallel lines)
Desperate presumption!
Conclusions from other arguments
Deductive and Inductive Inference
Deduction: Syllogism as prototype
Premise 1: A implies H
Premise 2: A is true
Deduction: H is true; H|P is valid
Induction: Analogy as prototype
Premise 1: A, B, C, D, E all share properties x, y, z
Premise 2: F has properties x, y
Induction: F has property z
"F has z"|P is not valid, but may still be rational (likely, plausible, probable); some such arguments are stronger than others
Boolean algebra (and/or/not over {0, 1}) quantifies deduction. Probability theory generalizes this to quantify the strength of inductive arguments.
Real Number Representation of Induction
P(H|P) ≡ strength of argument H|P
P = 0: Argument is invalid
P = 1: Argument is valid
P ∈ (0, 1): Degree of deducibility
A mathematical model for induction:
'AND' (product rule): P(A ∧ B|P) = P(A|P) P(B|A ∧ P) = P(B|P) P(A|B ∧ P)
'OR' (sum rule): P(A ∨ B|P) = P(A|P) + P(B|P) - P(A ∧ B|P)
We will explore the implications of this model.
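A minimal numeric check of these two rules (not from the slides); the joint distribution over two propositions is made up for illustration:

```python
# Verify the product and sum rules on a made-up joint distribution
# over two binary propositions A and B.
joint = {(True, True): 0.30, (True, False): 0.20,
         (False, True): 0.25, (False, False): 0.25}   # sums to 1

P_A = sum(p for (a, b), p in joint.items() if a)       # P(A|P)
P_B = sum(p for (a, b), p in joint.items() if b)       # P(B|P)
P_AB = joint[(True, True)]                             # P(A and B|P)
P_B_given_A = P_AB / P_A                               # P(B|A,P)

# Product rule: P(A and B|P) = P(A|P) P(B|A,P)
assert abs(P_AB - P_A * P_B_given_A) < 1e-12

# Sum rule: P(A or B|P) = P(A|P) + P(B|P) - P(A and B|P)
P_AorB = sum(p for (a, b), p in joint.items() if a or b)
assert abs(P_AorB - (P_A + P_B - P_AB)) < 1e-12
print(P_A, P_B, P_AB, P_AorB)
```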
Interpreting Bayesian Probabilities
If we like there is no harm in saying that a probability expresses a degree of reasonable belief. . . . "Degree of confirmation" has been used by Carnap, and possibly avoids some confusion. But whatever verbal expression we use to try to convey the primitive idea, this expression cannot amount to a definition. Essentially the notion can only be described by reference to instances where it is used. It is intended to express a kind of relation between data and consequence that habitually arises in science and in everyday life, and the reader should be able to recognize the relation from examples of the circumstances when it arises.
Sir Harold Jeffreys, Scientific Inference
More On Interpretation
Physics uses words drawn from ordinary language (mass, weight, momentum, force, temperature, heat, etc.), but their technical meaning is more abstract than their colloquial meaning. We can map between the colloquial and abstract meanings associated with specific values by using specific instances as calibrators.
A Thermal Analogy
Intuitive notion | Quantification | Calibration
Hot, cold | Temperature, T | Cold as ice = 273 K; boiling hot = 373 K
Uncertainty | Probability, P | Certainty = 0, 1; p = 1/36: plausible as "snake eyes"; p = 1/1024: plausible as 10 heads
A Bit More On Interpretation
Bayesian
Probability quantifies uncertainty in an inductive inference. p(x) describes how probability is distributed over the possible values x might have taken in the single case before us:
[Plot: P vs. x. p is distributed; x has a single, uncertain value.]
Frequentist
Probabilities are always (limiting) rates/proportions/frequencies in an ensemble. p(x) describes variability, how the values of x are distributed among the cases in the ensemble:
[Plot: P vs. x. x is distributed.]
Arguments Relating Hypotheses, Data, and Models
We seek to appraise scientific hypotheses in light of observed data and modeling assumptions.
Consider the data and modeling assumptions to be the premises of an argument with each of various hypotheses, Hi, as conclusions: Hi|Dobs, I. (I = background information, everything deemed relevant besides the observed data)
P(Hi|Dobs, I) measures the degree to which (Dobs, I) allow one to deduce Hi. It provides an ordering among arguments for various Hi that share common premises.
Probability theory tells us how to analyze and appraise the argument, i.e., how to calculate P(Hi|Dobs, I) from simpler, hopefully more accessible probabilities.
The Bayesian Recipe
Assess hypotheses by calculating their probabilities p(Hi| . . .) conditional on known and/or presumed information using the rules of probability theory.
Probability Theory Axioms:
'OR' (sum rule): P(H1 ∨ H2|I) = P(H1|I) + P(H2|I) - P(H1, H2|I)
'AND' (product rule): P(H1, D|I) = P(H1|I) P(D|H1, I) = P(D|I) P(H1|D, I)
Three Important Theorems
Bayes's Theorem (BT)
Consider P(Hi, Dobs|I) using the product rule:
P(Hi, Dobs|I) = P(Hi|I) P(Dobs|Hi, I)
              = P(Dobs|I) P(Hi|Dobs, I)
Solve for the posterior probability:
P(Hi|Dobs, I) = P(Hi|I) P(Dobs|Hi, I) / P(Dobs|I)
Theorem holds for any propositions, but for hypotheses & data the factors have names:
posterior = prior × likelihood / (normalizing constant);  P(Dobs|I) = prior predictive
Law of Total Probability (LTP)
Consider exclusive, exhaustive {Bi} (I asserts one of them must be true):
Σi P(A, Bi|I) = Σi P(Bi|A, I) P(A|I) = P(A|I)
              = Σi P(Bi|I) P(A|Bi, I)
If we do not see how to get P(A|I) directly, we can find a set {Bi} and use it as a basis ("extend the conversation"):
P(A|I) = Σi P(Bi|I) P(A|Bi, I)
If our problem already has Bi in it, we can use LTP to get P(A|I) from the joint probabilities ("marginalization"):
P(A|I) = Σi P(A, Bi|I)
Example: Take A = Dobs, Bi = Hi; then
P(Dobs|I) = Σi P(Dobs, Hi|I)
          = Σi P(Hi|I) P(Dobs|Hi, I)
The prior predictive for Dobs = average likelihood for the Hi (aka marginal likelihood).
Normalization
For exclusive, exhaustive Hi,  Σi P(Hi| · · · ) = 1
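A small numeric sketch of the three theorems for a discrete hypothesis space (the priors and likelihoods below are made up for illustration):

```python
import numpy as np

# Made-up priors P(Hi|I) and likelihoods P(Dobs|Hi, I) for three hypotheses.
prior = np.array([0.5, 0.3, 0.2])
like  = np.array([0.10, 0.40, 0.25])

# LTP / prior predictive: P(Dobs|I) = sum_i P(Hi|I) P(Dobs|Hi, I)
prior_predictive = np.sum(prior * like)

# Bayes's theorem: posterior = prior * likelihood / prior predictive
posterior = prior * like / prior_predictive

# Normalization: exclusive, exhaustive hypotheses sum to 1
assert np.isclose(posterior.sum(), 1.0)
print(prior_predictive, posterior)
```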
Well-Posed Problems
The rules express desired probabilities in terms of other probabilities.
To get a numerical value out, at some point we have to put numerical values in.
Direct probabilities are probabilities with numerical values determined directly by premises (via modeling assumptions, symmetry arguments, previous calculations, desperate presumption . . . ).
An inference problem is well posed only if all the needed probabilities are assignable based on the premises. We may need to add new assumptions as we see what needs to be assigned. We may not be entirely comfortable with what we need to assume! (Remember Euclid's fifth postulate!)
Should explore how results depend on uncomfortable assumptions (robustness).
Recap: Bayesian inference is more than BT
Bayesian inference quantifies uncertainty by reporting probabilities for things we are uncertain of, given specified premises.
It uses all of probability theory, not just (or even primarily) Bayes's theorem.
The Rules in Plain English
Ground rule: Specify premises that include everything relevant that you know or are willing to presume to be true (for the sake of the argument!).
BT: Make your appraisal account for all of your premises. Things you know are false must not enter your accounting.
LTP: If the premises allow multiple arguments for a hypothesis, its appraisal must account for all of them. Do not just focus on the most or least favorable way a hypothesis may be realized.
Inference With Parametric Models
Models Mi (i = 1 to N), each with parameters θi, each imply a sampling dist'n (conditional predictive dist'n for possible data):
p(D|θi, Mi)
The θi dependence when we fix attention on the observed data is the likelihood function:
Li(θi) ≡ p(Dobs|θi, Mi)
We may be uncertain about i (model uncertainty) or θi (parameter uncertainty).
Three Classes of Problems
Parameter Estimation
Premise = choice of model (pick a specific Mi). What can we say about θi?
Model Assessment
Model comparison: Premise = {Mi}. What can we say about i?
Model adequacy/GoF: Premise = M1 ∨ all alternatives. Is M1 adequate?
Model Averaging
Models share some common params: θi = {ψ, ηi}. What can we say about ψ w/o committing to one model? (Systematic error is an example)
Parameter Estimation
Problem statement
I = Model M with parameters θ (+ any add'l info)
Hi = statements about θ; e.g., "θ ∈ [2.5, 3.5]," or "θ > 0"
Probability for any such statement can be found using a probability density function (PDF) for θ:
P(θ ∈ [θ, θ + dθ]| · · · ) = f(θ) dθ = p(θ| · · · ) dθ
Posterior probability density
p(θ|D, M) = p(θ|M) L(θ) / ∫ dθ p(θ|M) L(θ)
Summaries of posterior
Best fit values:
Mode, θ̂, maximizes p(θ|D, M)
Posterior mean, ⟨θ⟩ = ∫ dθ θ p(θ|D, M)
Uncertainties:
Credible region Δ of probability C:
C = P(θ ∈ Δ|D, M) = ∫_Δ dθ p(θ|D, M)
Highest Posterior Density (HPD) region has p(θ|D, M) higher inside than outside
Posterior standard deviation, variance, covariances
Marginal distributions
Interesting parameters ψ, nuisance parameters η
Marginal dist'n for ψ: p(ψ|D, M) = ∫ dη p(ψ, η|D, M)
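As a concrete sketch (not from the slides), these summaries can be computed on a parameter grid; the unnormalized posterior below is made up:

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 2001)
dtheta = theta[1] - theta[0]
unnorm = theta**8 * (1 - theta)**4            # made-up unnormalized posterior
pdf = unnorm / (unnorm.sum() * dtheta)        # normalize on the grid

mode = theta[np.argmax(pdf)]                             # posterior mode
mean = np.sum(theta * pdf) * dtheta                      # posterior mean
sd   = np.sqrt(np.sum((theta - mean)**2 * pdf) * dtheta) # posterior std deviation

# Crude HPD region of probability C: keep the highest-density grid points
# until they enclose probability C.
C = 0.683
order = np.argsort(pdf)[::-1]
cum = np.cumsum(pdf[order]) * dtheta
inside = order[:np.searchsorted(cum, C) + 1]
print(mode, mean, sd, (theta[inside].min(), theta[inside].max()))
```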
Nuisance Parameters and Marginalization
To model most data, we need to introduce parameters besides those of ultimate interest: nuisance parameters.
Example
We have data from measuring a rate r = s + b that is a sum of an interesting signal s and a background b.
We have additional data just about b.
What do the data tell us about s?
Marginal posterior distribution
p(s|D, M) = ∫ db p(s, b|D, M)
          ∝ p(s|M) ∫ db p(b|s) L(s, b)
          ≡ p(s|M) Lm(s)
with Lm(s) the marginal likelihood for s. For a broad prior,
Lm(s) ≈ p(b̂_s|s) L(s, b̂_s) δb_s
  b̂_s = best b given s;  δb_s = b uncertainty given s
The profile likelihood Lp(s) ≡ L(s, b̂_s) gets weighted by a parameter-space volume factor.
E.g., Gaussians: ŝ = r̂ - b̂,  σ_s² = σ_r² + σ_b²
Background subtraction is a special case of background marginalization.
Model Comparison
Problem statement
I = (M1 ∨ M2 ∨ . . .): Specify a set of models.
Hi = Mi: Hypothesis chooses a model.
Posterior probability for a model:
p(Mi|D, I) = p(Mi|I) p(D|Mi, I) / p(D|I) ∝ p(Mi|I) L(Mi)
But L(Mi) = p(D|Mi) = ∫ dθi p(θi|Mi) p(D|θi, Mi).
Likelihood for model = average likelihood for its parameters: L(Mi) = ⟨L(θi)⟩
Varied terminology: Prior predictive = Average likelihood = Global likelihood = Marginal likelihood = (Weight of) Evidence for model
Odds and Bayes factors
Ratios of probabilities for two propositions using the same premises
are called odds:
Oij ≡ p(Mi|D, I) / p(Mj|D, I)
    = [p(Mi|I) / p(Mj|I)] × [p(D|Mi, I) / p(D|Mj, I)]
The data-dependent part is called the Bayes factor:
Bij ≡ p(D|Mi, I) / p(D|Mj, I)
It is a likelihood ratio; the BF terminology is usually reserved for cases when the likelihoods are marginal/average likelihoods.
An Automatic Occam's Razor
Predictive probabilities can favor simpler models
p(D|Mi) = ∫ dθi p(θi|M) L(θi)
[Figure: prior predictive P(D|H) vs. data D for "Simple H" and "Complicated H", with the observed data Dobs marked.]
The Occam Factor
[Figure: prior p(θ|M) and likelihood L(θ) vs. θ; the likelihood occupies width δθ within the prior range Δθ.]
p(D|Mi) = ∫ dθi p(θi|M) L(θi) ≈ p(θ̂i|M) L(θ̂i) δθi ≈ L(θ̂i) (δθi/Δθi)
        = Maximum Likelihood × Occam Factor
Models with more parameters often make the data more probable for the best fit.
The Occam factor penalizes models for "wasted" volume of parameter space.
Quantifies intuition that models shouldn't require fine-tuning.
Model Averaging
Problem statement
I = (M1 ∨ M2 ∨ . . .): Specify a set of models
Models all share a set of "interesting" parameters, ψ
Each has a different set of nuisance parameters ηi (or different prior info about them)
Hi = statements about ψ
Model averaging
Calculate posterior PDF for ψ:
p(ψ|D, I) = Σi p(Mi|D, I) p(ψ|D, Mi)
          ∝ Σi L(Mi) ∫ dηi p(ψ, ηi|D, Mi)
The model choice is a (discrete) nuisance parameter here.
Theme: Parameter Space Volume
Bayesian calculations sum/integrate over parameter/hypothesis space!
(Frequentist calculations average over sample space & typically optimize over parameter space.)
Marginalization weights the profile likelihood by a volume factor for the nuisance parameters.
Model likelihoods have Occam factors resulting from parameter space volume factors.
Many virtues of Bayesian methods can be attributed to this accounting for the size of parameter space. This idea does not arise naturally in frequentist statistics (but it can be added "by hand").
Roles of the Prior
Prior has two roles
Incorporate any relevant prior information
Convert likelihood from "intensity" to "measure": accounts for size of hypothesis space
Physical analogy
Heat:        Q = ∫ dV c_v(r) T(r)
Probability: P ∝ ∫ dθ p(θ|I) L(θ)
Maximum likelihood focuses on the hottest hypotheses.
Bayes focuses on the hypotheses with the most heat.
A high-T region may contain little heat if its c_v is low or if its volume is small.
A high-L region may contain little probability if its prior is low or if its volume is small.
Recap of Key Ideas
Probability as generalized logic for appraising arguments
Three theorems: BT, LTP, Normalization
Calculations characterized by parameter space integrals
Credible regions, posterior expectations
Marginalization over nuisance parameters
Occam's razor via marginal likelihoods
Binary Outcomes: Parameter Estimation
M = Existence of two outcomes, S and F; each trial has same probability for S or F
Hi = Statements about α, the probability for success on the next trial; seek p(α|D, M)
D = Sequence of results from N observed trials: FFSSSSFSSSFS (n = 8 successes in N = 12 trials)
Likelihood:
p(D|α, M) = p(failure|α, M) × p(success|α, M) × · · · = α^n (1 - α)^(N-n) ≡ L(α)
Prior
Starting with no information about α beyond its definition, use as an uninformative prior p(α|M) = 1. Justifications:
Intuition: Don't prefer any interval to any other of the same size
Bayes's justification: "Ignorance" means that before doing the N trials, we have no preference for how many will be successes:
P(n successes|M) = 1/(N + 1)  ⟹  p(α|M) = 1
Consider this a convention: an assumption added to M to make the problem well posed.
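A quick numeric check of Bayes's justification (a sketch, not from the slides): with the flat prior p(α|M) = 1, the predictive probability for n successes in N trials is the same for every n.

```python
import numpy as np
from math import comb

N = 12
alpha = np.linspace(0.0, 1.0, 20001)
dalpha = alpha[1] - alpha[0]

# P(n|M) = integral of C(N,n) alpha^n (1-alpha)^(N-n) over alpha, with flat prior
pred = [np.sum(comb(N, n) * alpha**n * (1 - alpha)**(N - n)) * dalpha
        for n in range(N + 1)]
print(pred)   # every entry is close to 1/(N+1) = 1/13 ~ 0.0769
```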
Prior Predictive
p(D|M) = ∫ dα α^n (1 - α)^(N-n)
       = B(n + 1, N - n + 1) = n! (N - n)! / (N + 1)!
A Beta integral: B(a, b) ≡ ∫ dx x^(a-1) (1 - x)^(b-1) = Γ(a)Γ(b)/Γ(a + b).
Posterior
p(α|D, M) = [(N + 1)! / (n! (N - n)!)] α^n (1 - α)^(N-n)
A Beta distribution. Summaries:
Best-fit: α̂ = n/N = 2/3;  ⟨α⟩ = (n + 1)/(N + 2) ≈ 0.64
Uncertainty: σ_α = sqrt[(n + 1)(N - n + 1) / ((N + 2)²(N + 3))] ≈ 0.12
Find credible regions numerically, or with the incomplete beta function
Note that the posterior depends on the data only through n, not the N binary numbers describing the sequence.
n is a (minimal) sufficient statistic.
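These summaries can be verified with scipy's Beta distribution (a sketch using the n = 8, N = 12 data above; the credible region shown is equal-tailed rather than HPD):

```python
from scipy import stats

n, N = 8, 12
post = stats.beta(n + 1, N - n + 1)      # posterior Beta(n+1, N-n+1)

print("mode :", n / N)                   # 0.667
print("mean :", post.mean())             # 0.643
print("sd   :", post.std())              # 0.124
print("68.3% equal-tail credible region:", post.interval(0.683))
```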
Binary Outcomes: Model Comparison
Equal Probabilities?
M1: α = 1/2
M2: α ∈ [0, 1] with flat prior.
Maximum Likelihoods
M1: p(D|M1) = 1/2^N = 2.44 × 10⁻⁴
M2: L(α̂) = (2/3)^n (1/3)^(N-n) = 4.82 × 10⁻⁴
p(D|M1) / p(D|α̂, M2) = 0.51
Maximum likelihoods favor M2 (failures more probable).
Bayes Factor (ratio of model likelihoods)
p(D|M1) = 1/2^N;  and p(D|M2) = n! (N - n)! / (N + 1)!
B12 ≡ p(D|M1)/p(D|M2) = (N + 1)! / (n! (N - n)! 2^N) = 1.57
Bayes factor (odds) favors M1 (equiprobable).
Note that for n = 6, B12 = 2.93; for this small amount of data, we can never be very sure results are equiprobable.
If n = 0, B12 ≈ 1/315; if n = 2, B12 ≈ 1/4.8; for extreme data, 12 flips can be enough to lead us to strongly suspect outcomes have different probabilities.
(Frequentist significance tests can reject null for any sample size.)
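A sketch reproducing these Bayes factors directly from the formula above:

```python
from math import factorial

def B12(n, N):
    """Bayes factor for M1 (alpha = 1/2) vs M2 (flat prior on alpha)."""
    return factorial(N + 1) / (factorial(n) * factorial(N - n)) / 2**N

print(B12(8, 12))   # ~1.57
print(B12(6, 12))   # ~2.93
print(B12(0, 12))   # ~1/315
print(B12(2, 12))   # ~1/4.8
```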
Binary Outcomes: Binomial Distribution
Suppose D = n (number of heads in N trials), rather than the actual sequence. What is p(α|n, M)?
Likelihood
Let S = a sequence of flips with n heads.
p(n|α, M) = Σ_S p(S|α, M) p(n|S, α, M)
          = Σ_S α^n (1 - α)^(N-n) [ # successes = n ]
          = α^n (1 - α)^(N-n) C_{n,N}
C_{n,N} = # of sequences of length N with n heads.
p(n|α, M) = [N! / (n! (N - n)!)] α^n (1 - α)^(N-n)
The binomial distribution for n given α, N.
Posterior
p(α|n, M) = [N! / (n! (N - n)!)] α^n (1 - α)^(N-n) / p(n|M)
p(n|M) = [N! / (n! (N - n)!)] ∫ dα α^n (1 - α)^(N-n) = 1/(N + 1)
p(α|n, M) = [(N + 1)! / (n! (N - n)!)] α^n (1 - α)^(N-n)
Same result as when data specified the actual sequence.
Another Variation: Negative Binomial
Suppose D = N, the number of trials it took to obtain a predefined number of successes, n = 8. What is p(α|N, M)?
Likelihood
p(N|α, M) is the probability for n - 1 successes in N - 1 trials, times the probability that the final trial is a success:
p(N|α, M) = [(N - 1)! / ((n - 1)! (N - n)!)] α^(n-1) (1 - α)^(N-n) × α
          = [(N - 1)! / ((n - 1)! (N - n)!)] α^n (1 - α)^(N-n)
The negative binomial distribution for N given α, n.
Posterior
p(α|D, M) = C'_{n,N} α^n (1 - α)^(N-n) / p(D|M)
p(D|M) = C'_{n,N} ∫ dα α^n (1 - α)^(N-n)
p(α|D, M) = [(N + 1)! / (n! (N - n)!)] α^n (1 - α)^(N-n)
Same result as other cases.
Final Variation: Meteorological Stopping
Suppose D = (N, n), the number of samples and number of successes in an observing run whose total number was determined by the weather at the telescope. What is p(α|D, M')? (M' adds info about the weather to M.)
Likelihood
p(D|α, M') is the binomial distribution times the probability that the weather allowed N samples, W(N):
p(D|α, M') = W(N) [N! / (n! (N - n)!)] α^n (1 - α)^(N-n)
Let C''_{n,N} = W(N) (N choose n). We get the same result as before!
Likelihood Principle
To define L(Hi) = p(Dobs|Hi, I), we must contemplate what other data we might have obtained. But the real sample space may be determined by many complicated, seemingly irrelevant factors; it may not be well-specified at all. Should this concern us?
Likelihood principle: The result of inferences depends only on how p(Dobs|Hi, I) varies w.r.t. hypotheses. We can ignore aspects of the observing/sampling procedure that do not affect this dependence.
This is a sensible property that frequentist methods do not share. Frequentist probabilities are long run rates of performance, and thus depend on details of the sample space that may be irrelevant in a Bayesian calculation.
Example: Predict 10% of sample is Type A; observe nA = 5 for N = 96.
Significance test accepts α = 0.1 for binomial sampling: p(>χ²|α = 0.1) = 0.12
Significance test rejects α = 0.1 for negative binomial sampling: p(>χ²|α = 0.1) = 0.03
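The likelihood-principle point in this example can be checked directly: as functions of the Type A fraction, the binomial and negative binomial likelihoods for the same (nA, N) differ only by a constant factor, so they lead to the same Bayesian inferences (a sketch; the quoted significance-test numbers are not recomputed here):

```python
import numpy as np
from math import comb

nA, N = 5, 96
alpha = np.linspace(0.01, 0.5, 50)

binom    = comb(N, nA) * alpha**nA * (1 - alpha)**(N - nA)
negbinom = comb(N - 1, nA - 1) * alpha**nA * (1 - alpha)**(N - nA)

# The ratio is constant in alpha: same likelihood *function*, hence the same
# posterior for alpha under either stopping rule.
print(np.allclose(binom / negbinom, binom[0] / negbinom[0]))   # True
```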
Gauss's Observation: Sufficiency
Suppose our data consist of N measurements, di = μ + εi.
Suppose the noise contributions are independent, and εi ~ N(0, σ²).
p(D|μ, σ, M) = Πi p(di|μ, σ, M)
             = Πi p(εi = di - μ|μ, σ, M)
             = Πi (1/(σ√(2π))) exp[-(di - μ)²/(2σ²)]
             = (1/(σ^N (2π)^(N/2))) e^{-Q(μ)/(2σ²)}
Find dependence of Q on μ by completing the square:
Q = Σi (di - μ)²
  = Σi di² + Nμ² - 2Nμ d̄   where d̄ ≡ (1/N) Σi di
  = N(μ - d̄)² + Nr²        where r² ≡ (1/N) Σi (di - d̄)²
Likelihood depends on {di} only through d̄ and r:
L(μ, σ) = (1/(σ^N (2π)^(N/2))) exp[-Nr²/(2σ²)] exp[-N(μ - d̄)²/(2σ²)]
The sample mean and variance are sufficient statistics.
This is a miraculous compression of information: the normal dist'n is highly abnormal in this respect!
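A numeric sketch of this sufficiency (not from the slides): two different datasets constructed to share the same sample mean and r give the same likelihood at every μ.

```python
import numpy as np

def log_like(d, mu, sigma):
    """Gaussian log-likelihood for data d at mean mu, known sigma."""
    return (-len(d) * np.log(sigma * np.sqrt(2 * np.pi))
            - np.sum((d - mu)**2) / (2 * sigma**2))

rng = np.random.default_rng(1)
d1 = rng.normal(5.0, 2.0, size=20)
# Build a second dataset with the same mean and the same r (rms about the mean).
z = rng.normal(size=20)
z = (z - z.mean()) / z.std()               # zero mean, unit rms
d2 = d1.mean() + d1.std() * z              # same d-bar and r as d1

mus = np.linspace(3, 7, 5)
print([round(log_like(d1, m, 2.0) - log_like(d2, m, 2.0), 10) for m in mus])
# all differences ~0: the likelihood depends on the data only through d-bar and r
```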
Uninformative prior
Translation invariance ⟹ p(μ) = C, a constant. This prior is improper unless bounded.
Prior predictive/normalization
p(D|σ, M) = ∫ dμ C exp[-N(μ - d̄)²/(2σ²)]
          = C (σ/√N) √(2π)
. . . minus a tiny bit from tails, using a proper prior.
Posterior
p(μ|D, σ, M) = (1/((σ/√N)√(2π))) exp[-N(μ - d̄)²/(2σ²)]
Posterior is N(d̄, w²), with standard deviation w = σ/√N.
68.3% HPD credible region for μ is d̄ ± σ/√N.
Note that C drops out: the limit of infinite prior range is well behaved.
Informative Conjugate Prior
Use a normal prior, μ ~ N(μ0, w0²)
Posterior
Normal N(μ̃, w̃²), but mean, std. deviation "shrink" towards the prior.
Define B = w²/(w² + w0²), so B . . .
Problem specification
Model: di = μ + εi, εi ~ N(0, σ²), σ is unknown
Parameter space: (μ, σ); seek p(μ|D, M)
Likelihood
p(D|μ, σ, M) = (1/(σ^N (2π)^(N/2))) exp[-Nr²/(2σ²)] exp[-N(μ - d̄)²/(2σ²)]
             ∝ (1/σ^N) e^{-Q/(2σ²)}
where Q = N[r² + (μ - d̄)²]
Marginal Posterior
p(μ|D, M) ∝ ∫ dσ (1/σ^(N+1)) e^{-Q/(2σ²)}
Let u = Q/(2σ²), so σ = √(Q/(2u)) and |dσ| = (u^(-3/2)/2) √(Q/2) du
p(μ|D, M) ∝ 2^(N/2) Q^(-N/2) ∫ du u^(N/2 - 1) e^{-u}
          ∝ Q^(-N/2)
Write Q = Nr² [1 + ((μ - d̄)/r)²] and normalize:
p(μ|D, M) = [((N/2) - 1)! / (((N/2) - (3/2))! √π)] (1/r) [1 + (1/N) ((μ - d̄)/(r/√N))²]^(-N/2)
Student's t distribution, with t = (μ - d̄)/(r/√N)
A bell curve, but with power-law tails
Large N:  p(μ|D, M) ≈ e^{-N(μ - d̄)²/(2r²)}
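A sketch checking this marginalization numerically, assuming a flat prior on μ and a 1/σ prior on σ (which gives the 1/σ^(N+1) integrand above), with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.normal(10.0, 3.0, size=15)
N, dbar = len(d), d.mean()
r2 = np.mean((d - dbar)**2)

mu = np.linspace(dbar - 5, dbar + 5, 400)
sigma = np.linspace(0.3, 30.0, 4000)[:, None]     # column, for broadcasting

# Flat prior on mu, 1/sigma prior on sigma; integrate sigma out on the grid.
Q = N * (r2 + (mu - dbar)**2)
joint = sigma**(-(N + 1)) * np.exp(-Q / (2 * sigma**2))
marg = joint.sum(axis=0)
marg /= marg.max()

closed = Q**(-N / 2)                              # the Q^(-N/2) form above
closed /= closed.max()
print(np.max(np.abs(marg - closed)))              # small: the two shapes agree
```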
Poisson Dist'n: Infer a Rate from Counts
Problem: Observe n counts in T; infer rate, r
Likelihood
L(r) ≡ p(n|r, M) = [(rT)^n / n!] e^{-rT}
Prior
Two standard choices:
r known to be nonzero; it is a scale parameter: p(r|M) = [1/ln(ru/rl)] (1/r)
r may vanish; require p(n|M) ≈ Const: p(r|M) = 1/ru
Prior predictive
p(n|M) = (1/ru) (1/n!) ∫₀^{ru} dr (rT)^n e^{-rT}
       = (1/(ru T)) (1/n!) ∫₀^{ru T} d(rT) (rT)^n e^{-rT}
       ≈ 1/(ru T)   for ru ≫ n/T
Posterior
A gamma distribution:
p(r|n, M) = [T (rT)^n / n!] e^{-rT}
Gamma Distributions
A 2-parameter family of distributions over nonnegative x, with shape parameter α and scale parameter s:
p(x|α, s) = (1/(s Γ(α))) (x/s)^(α-1) e^{-x/s}
Moments:
E(x) = αs      Var(x) = αs²
Our posterior corresponds to α = n + 1, s = 1/T.
Mode r̂ = n/T; mean ⟨r⟩ = (n + 1)/T (shift down 1 with 1/r prior)
Std. dev'n σ_r = √(n + 1)/T; credible regions found by integrating (can use the incomplete gamma function)
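A sketch of these summaries using scipy's gamma distribution (the counts n and interval T are made up; the credible region shown is equal-tailed):

```python
from scipy import stats

n, T = 16, 2.0                                # hypothetical: 16 counts in T = 2
post = stats.gamma(a=n + 1, scale=1.0 / T)    # posterior for r with flat prior

print("mode :", n / T)                        # (alpha - 1) * s = n/T
print("mean :", post.mean())                  # (n + 1)/T
print("sd   :", post.std())                   # sqrt(n + 1)/T
print("95% equal-tail credible region:", post.interval(0.95))
```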
The flat prior
Bayes's justification: Not that ignorance of r ⟹ p(r|I) = C
Require the (discrete) predictive distribution to be flat:
p(n|I) = ∫ dr p(r|I) p(n|r, I) = C  ⟹  p(r|I) = C
Useful conventions
Use a flat prior for a rate that may be zero
Use a log-flat prior (∝ 1/r) for a nonzero scale parameter
Use proper (normalized, bounded) priors
Plot posterior with abscissa that makes prior flat
The On/Off Problem
Basic problem
Look off-source; unknown background rate b. Count Noff photons in interval Toff.
Look on-source; rate is r = s + b with unknown signal s. Count Non photons in interval Ton.
Infer s.
Conventional solution
b̂ = Noff/Toff;  σ_b = √Noff/Toff
r̂ = Non/Ton;  σ_r = √Non/Ton
ŝ = r̂ - b̂;  σ_s = √(σ_r² + σ_b²)
But ŝ can be negative!
Examples
Spectra of X-Ray Sources
[Figures: Bassani et al. 1989; Di Salvo et al. 2001]
Spectrum of Ultrahigh-Energy Cosmic Rays
[Figures: Nagano & Watson 2000; HiRes Team 2007. Flux × E³ vs. log10(E) (eV) for AGASA, HiRes-1 Monocular, HiRes-2 Monocular.]
N is Never Large
Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is large enough, you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc. etc.). N is never enough because if it were "enough" you'd already be on to the next problem for which you need more data.
Similarly, you never have quite enough money. But that's another story.
(Andrew Gelman, blog entry, 31 July 2005)
Backgrounds as Nuisance Parameters
Background marginalization with Gaussian noise
Measure background rate b̂ ± σ_b with source off. Measure total rate r̂ ± σ_r with source on. Infer signal source strength s, where r = s + b. With flat priors,
p(s, b|D, M) ∝ exp[-(b - b̂)²/(2σ_b²)] × exp[-(s + b - r̂)²/(2σ_r²)]
Marginalize b to summarize the results for s (complete the square to isolate the b dependence; then do a simple Gaussian integral over b):
p(s|D, M) ∝ exp[-(s - ŝ)²/(2σ_s²)]
ŝ = r̂ - b̂;  σ_s² = σ_r² + σ_b²
Background subtraction is a special case of background marginalization.
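A numeric sketch of this marginalization (the rate estimates are made up); a grid integral over b is compared with the background-subtraction Gaussian:

```python
import numpy as np

r_hat, sig_r = 10.0, 1.5          # hypothetical on-source rate estimate
b_hat, sig_b = 6.0, 1.0           # hypothetical off-source (background) estimate

s = np.linspace(-2, 12, 600)
b = np.linspace(-4, 16, 2000)[:, None]    # wide grid so truncation is negligible

# Joint posterior with flat priors, then marginalize b on the grid.
joint = np.exp(-(b - b_hat)**2 / (2 * sig_b**2)
               - (s + b - r_hat)**2 / (2 * sig_r**2))
marg = joint.sum(axis=0)
marg /= marg.max()

# Background subtraction: s_hat = r_hat - b_hat, sig_s^2 = sig_r^2 + sig_b^2
s_hat, sig_s = r_hat - b_hat, np.sqrt(sig_r**2 + sig_b**2)
analytic = np.exp(-(s - s_hat)**2 / (2 * sig_s**2))
analytic /= analytic.max()
print(np.max(np.abs(marg - analytic)))    # ~0 up to grid error
```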
Bayesian Solution to On/Off Problem
First consider off-source data; use it to estimate b:
p(b|Noff, Ioff) = [Toff (bToff)^Noff / Noff!] e^{-bToff}
Use this as a prior for b to analyze on-source data. For on-source analysis Iall = (Ion, Noff, Ioff):
p(s, b|Non, Iall) ∝ p(s) p(b) [(s + b)Ton]^Non e^{-(s+b)Ton}
p(s|Iall) is flat, but p(b|Iall) = p(b|Noff, Ioff), so
p(s, b|Non, Iall) ∝ (s + b)^Non b^Noff e^{-sTon} e^{-b(Ton+Toff)}
Now marginalize over b:
p(s|Non, Iall) = ∫ db p(s, b|Non, Iall)
               ∝ ∫ db (s + b)^Non b^Noff e^{-sTon} e^{-b(Ton+Toff)}
Expand (s + b)^Non and do the resulting integrals:
p(s|Non, Iall) = Σ_{i=0}^{Non} Ci [Ton (sTon)^i / i!] e^{-sTon}
Ci ∝ (1 + Toff/Ton)^i (Non + Noff - i)! / (Non - i)!
Posterior is a weighted sum of Gamma distributions, each assigning a different number of on-source counts to the source. (Evaluate via recursive algorithm or confluent hypergeometric function.)
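A sketch implementing this posterior directly from the Ci expression above, with the weights normalized to sum to 1 and the terms evaluated in log space to avoid overflow (the counts and intervals are made up):

```python
import numpy as np
from scipy.special import gammaln

def on_off_posterior(s, N_on, N_off, T_on, T_off):
    """p(s | N_on, N_off, ...) as a weighted sum of gamma terms (flat prior on s)."""
    i = np.arange(N_on + 1)
    # log Ci (unnormalized): (1 + Toff/Ton)^i (Non + Noff - i)! / (Non - i)!
    logC = (i * np.log(1.0 + T_off / T_on)
            + gammaln(N_on + N_off - i + 1) - gammaln(N_on - i + 1))
    C = np.exp(logC - logC.max())
    C /= C.sum()                                 # normalize the weights
    s = np.atleast_1d(s)[:, None]
    # Each term is a gamma density in s: Ton (s Ton)^i e^(-s Ton) / i!
    log_terms = np.log(T_on) + i * np.log(s * T_on) - s * T_on - gammaln(i + 1)
    return np.sum(C * np.exp(log_terms), axis=1)

s = np.linspace(1e-6, 40, 4000)
pdf = on_off_posterior(s, N_on=16, N_off=9, T_on=1.0, T_off=2.0)
print(pdf.sum() * (s[1] - s[0]))    # ~1: the posterior is normalized
```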
Example On/Off Posteriors: Short Integrations (Ton = 1)
[Figure]
Example On/Off Posteriors: Long Background Integrations (Ton = 1)
[Figure]
Multibin On/Off
The more typical on/off scenario:
Data = spectrum or image with counts in many bins
Model M gives signal rate sk(θ) in bin k, parameters θ
To infer θ, we need the likelihood:
L(θ) = Πk p(Non,k, Noff,k|sk(θ), M)
For each k, we have an on/off problem as before, only we just need the marginal likelihood for sk (not the posterior). The same Ci coefficients arise.
XSPEC and CIAO/Sherpa provide this as an option.
CHASC approach does the same thing via data augmentation.
Bayesian Computation
Large sample size: Laplace approximation
Approximate posterior as multivariate normal; det(covar) factors
Uses ingredients available in χ²/ML fitting software (MLE, Hessian)
Often accurate to O(1/N)
Low-dimensional models (d ≲ 5): . . .
Posterior sampling: create RNG that samples posterior
MCMC is most general framework
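A minimal sketch of the posterior-sampling idea: a random-walk Metropolis MCMC sampler for a generic log-posterior (this is an illustration only, not the samplers used for the exoplanet results later):

```python
import numpy as np

def metropolis(log_post, x0, step, n_samples, rng=None):
    """Random-walk Metropolis: returns samples from exp(log_post)."""
    rng = rng or np.random.default_rng()
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    lp = log_post(x)
    out = np.empty((n_samples, x.size))
    for i in range(n_samples):
        prop = x + step * rng.normal(size=x.size)     # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:      # accept/reject
            x, lp = prop, lp_prop
        out[i] = x
    return out

# Example: sample a made-up 2-d Gaussian "posterior"; check the recovered mean.
log_post = lambda x: -0.5 * np.sum((x - np.array([1.0, -2.0]))**2 / np.array([0.5, 2.0]))
samples = metropolis(log_post, x0=[0.0, 0.0], step=0.8, n_samples=20000)
print(samples[5000:].mean(axis=0))    # ~ [1, -2] after burn-in
```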
Extrasolar Planets
Exoplanet detection/measurement methods:
Direct: Transits, gravitational lensing, imaging, interferometric nulling
Indirect: Keplerian reflex motion (line-of-sight velocity, astrometric wobble)
[Figure: The Sun's wobble from 10 pc]
Radial Velocity Technique
As of May 2007, 242 planets found, including 26 multiple-planet systems
Vast majority (230) found via Doppler radial velocity (RV) measurements
Analysis method: Identify best candidate period via periodogram; fit parameters with nonlinear least squares
Issues: Multimodality & multiple planets, nonlinearity, stellar jitter, long periods, marginal detections, population studies . . .
Keplerian Radial Velocity Model
Parameters for single planet
τ = orbital period (days)
e = orbital eccentricity
K = velocity amplitude (m/s)
ω = longitude of pericenter
Mp = mean anomaly of pericenter passage
v0 = system center-of-mass velocity
Velocity vs. time
v(t) = v0 + K(e cos ω + cos[ω + υ(t)])
True anomaly υ(t) found via Kepler's equation for the eccentric anomaly:
E(t) - e sin E(t) = 2πt/τ - Mp;   tan(υ/2) = [(1 + e)/(1 - e)]^{1/2} tan(E/2)
A strongly nonlinear model!
Kepler's laws relate (K, τ, e) to masses, semimajor axis a, inclination i
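A sketch of this velocity model, solving Kepler's equation by Newton iteration (the parameter values in the example are made up):

```python
import numpy as np

def rv_curve(t, tau, e, K, omega, Mp, v0):
    """Keplerian radial velocity v(t) for a single planet; t is an array of times."""
    M = 2 * np.pi * t / tau - Mp                  # mean anomaly
    E = M.copy()                                  # Newton iteration for E - e sin E = M
    for _ in range(50):                           # plenty of iterations for moderate e
        E -= (E - e * np.sin(E) - M) / (1 - e * np.cos(E))
    # true anomaly from eccentric anomaly
    nu = 2 * np.arctan(np.sqrt((1 + e) / (1 - e)) * np.tan(E / 2))
    return v0 + K * (e * np.cos(omega) + np.cos(omega + nu))

t = np.linspace(0.0, 1600.0, 200)                 # days
v = rv_curve(t, tau=800.0, e=0.5, K=50.0, omega=1.0, Mp=0.3, v0=0.0)
print(v.min(), v.max())
```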
The Likelihood Function
Keplerian velocity model with parameters θ = {K, τ, e, Mp, ω, v0}:
di = v(ti; θ) + εi
For measurement errors with std dev'n σi, and additional jitter with std dev'n J,
L(θ, J) ≡ p({di}|θ, J)
        = Π_{i=1}^{N} (1/√(2π(σi² + J²))) exp[-[di - v(ti; θ)]² / (2(σi² + J²))]
        = [Πi 1/√(2π(σi² + J²))] exp[-χ²(θ, J)/2]
where χ²(θ, J) ≡ Σi [di - v(ti; θ)]² / (σi² + J²)
Ignore jitter for now . . .
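A sketch of this jitter-inflated Gaussian log-likelihood, written in terms of the residuals di - v(ti; θ) (the residuals and uncertainties here are made up; set J = 0 to ignore jitter, as in what follows):

```python
import numpy as np

def rv_log_likelihood(resid, sig, J):
    """Gaussian log-likelihood for residuals d_i - v(t_i; theta),
    with 'jitter' variance J^2 added in quadrature to each sigma_i^2."""
    var = sig**2 + J**2
    return -0.5 * np.sum(np.log(2 * np.pi * var) + resid**2 / var)

# Hypothetical usage: made-up residuals and per-point uncertainties.
rng = np.random.default_rng(3)
sig = np.full(40, 8.0)                              # quoted errors (m/s)
resid = rng.normal(0, np.sqrt(sig**2 + 15.0**2))    # residuals with true jitter ~15 m/s
print(rv_log_likelihood(resid, sig, J=0.0))         # ignoring jitter
print(rv_log_likelihood(resid, sig, J=15.0))        # accounting for the jitter present
```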
What To Do With It
Parameter estimation
Posterior dist'n for parameters θ of model Mi with i planets:
p(θ|D, M) ∝ p(θ|M) Li(θ)
Summarize with mode, means, credible regions (found by integrating over θ)
Detection
Calculate probability for no planets (M0), one planet (M1), two planets (M2), . . . . Let I = {Mi}.
p(Mi|D, I) ∝ p(Mi|I) L(Mi)    where L(Mi) = ∫ dθ p(θ|Mi) L(θ)
Marginal likelihood L(Mi) includes the Occam factor
Design
Predict future datum dt at time t, accounting for model uncertainties:
p(dt|D, Mi) = ∫ dθ p(dt|θ, Mi) p(θ|D, Mi)
1st factor is Gaussian for dt with known model; 2nd term & integral account for uncertainty
Bayesian adaptive exploration
Find time likely to make updated posterior shrink the most.
Information theory: best time has largest dt uncertainty (maximum entropy in p(dt|D, Mi))
[Diagram: Observation/Inference/Design cycle: prior information & data, combined information, interim results, predictions, strategy, new data]
Periodogram Connections
Assume circular orbits: θ = {K, τ, Mp, v0}
Frequentist
For given τ, maximize likelihood over K and Mp (set v0 to data mean, v̄) ⟹ profile likelihood:
log Lp(τ; v̄) ∝ Lomb-Scargle periodogram
Bayesian
For given τ, integrate (marginalize) likelihood × prior over K and Mp (set v0 to data mean, v̄) ⟹ marginal posterior:
log p(τ, v̄|D) ∝ Lomb-Scargle periodogram
Additionally marginalize over v0 ⟹ floating-mean LSP
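A sketch of the frequentist side of this correspondence: scipy's classic Lomb-Scargle periodogram applied to mean-subtracted, simulated circular-orbit velocities (the 300 d period and noise level are made up):

```python
import numpy as np
from scipy.signal import lombscargle

# Simulated circular-orbit RVs: a sinusoid with a made-up 300 d period plus noise.
rng = np.random.default_rng(7)
t = np.sort(rng.uniform(0, 1600, 60))                    # observation times (days)
v = 5.0 + 50.0 * np.cos(2 * np.pi * t / 300.0 - 1.0) + rng.normal(0, 8.0, size=t.size)

periods = np.linspace(50, 1000, 2000)
omega = 2 * np.pi / periods                              # angular frequencies
power = lombscargle(t, v - v.mean(), omega)              # classic LSP on v - vbar

print(periods[np.argmax(power)])                         # peaks near 300 d
```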
Kepler Periodogram for Eccentric Orbits
The circular model is linear wrt K sin Mp, K cos Mp ⟹ coincidence of profile and marginal.
The full model is nonlinear wrt e, Mp, ω.
Profiling is known to behave poorly for nonlinear parameters (e.g., asymptotic confidence regions have incorrect coverage for finite samples; can even be inconsistent).
Marginalization is valid with complete generality; also has good frequentist properties.
For given τ, marginalize likelihood × prior over all params but τ: log p(τ|D) ⟹ Kepler periodogram
This requires a 5-d integral.
Trick: For K prior ∝ K (interim prior), three of the integrals can be done analytically ⟹ integrate numerically only over (e, Mp)
A Bayesian Workflow for Exoplanets
Use Kepler periodogram to reduce dimensionality to 3-d (τ, e, Mp).
Use Kepler periodogram results (p(τ), moments of e, Mp) to define initial population for adaptive, population-based MCMC.
Once we have {τ, e, Mp}, get associated {K, ω, v0} samples from their exact conditional distribution.
"Fix" the interim prior by weighting the MCMC samples.
(See work by Phil Gregory & Eric Ford for simpler but less efficient workflows.)
A Bayesian Workflow for Exoplanets
[Figure: workflow diagram]
Estimation Results for HD 222582
24 Keck RV observations spanning 683 days; long period; high e
Kepler Periodogram
[Figure]
Differential Evolution MCMC Performance
[Figures: marginal for (τ, e); convergence (autocorrelation)]
Reaches convergence much more quickly than simpler algorithms
Multiple Planets in HD 208487
Phil Gregory's parallel tempering algorithm found a 2nd planet in HD 208487.
τ1 = 129.8 d; τ2 = 909 d
[Figures: (a) velocity (m/s) vs. P1 orbital phase; (b) velocity (m/s) vs. P2 orbital phase; joint posterior for K2 (m/s) vs. e2]
Adaptive Exploration for Toy Problem
Data are Kepler velocities plus noise:
di = V(ti; τ, e, K) + ei
3 remaining geometrical params (t0, ω, i) are fixed.
Noise probability is Gaussian with known σ = 8 m s⁻¹.
Simulate data with typical Jupiter-like exoplanet parameters:
τ = 800 d
e = 0.5
K = 50 m s⁻¹
Goal: Estimate parameters τ, e and K as efficiently as possible.
Cycle 1: Inference
Use flat priors,
p(τ, e, K|D, I) ∝ exp[-Q(τ, e, K)/(2σ²)]
Q = sum of squared residuals using best-fit amplitudes.
Generate {τj, ej, Kj} via posterior sampling.
Cycle 1 Design: Predictions, Entropy
[Figure]
Cycle 2: Observation
[Figure]
Cycle 2: Inference
Volume V2 = V1/5.8
[Figure]
Evolution of Inferences
Cycle 1 (10 samples)
Cycle 2 (11 samples; V2 = V1/5.8)
Cycle 3 (12 samples; V3 = V2/3.9)
Cycle 4 (13 samples; V4 = V3/1.8)
[Figures]
Probability & Frequency
Frequencies are relevant when modeling repeated trials, or repeated sampling from a population or ensemble.
Frequencies are observables:
When available, can be used to infer probabilities for next trial
When unavailable, can be predicted
Bayesian/Frequentist relationships:
General relationships between probability and frequency
Long-run performance of Bayesian procedures
Examples of Bayesian/frequentist differences
Relationships Between Probability & Frequency
Frequency from probability
Bernoulli's law of large numbers: In repeated i.i.d. trials, given P(success| . . .) = α, predict
Nsuccess/Ntotal → α    as Ntotal → ∞
Probability from frequency
Bayes's "An Essay Towards Solving a Problem in the Doctrine of Chances"
First use of Bayes's theorem:
Probability for success in next trial of i.i.d. sequence:
E(α) → Nsuccess/Ntotal    as Ntotal → ∞
Subtle Relationships For Non-IID Cases
Predict frequency in dependent trials
rt = result of trial t; p(r1, r2, . . . , rN|M) known; predict f:
⟨f⟩ = (1/N) Σt p(rt = success|M)
where p(r1|M) = Σ_{r2} · · · Σ_{rN} p(r1, r2, . . . |M)
Expected frequency of outcome in many trials = average probability for outcome across trials.
But also find that σ_f needn't converge to 0.
Infer probabilities for different but related trials
Shrinkage: Biased estimators of the probability that share info across trials are better than unbiased/BLUE/MLE estimators.
A formalism that distinguishes p from f from the outset is particularly valuable for exploring subtle connections. E.g., shrinkage is explored via hierarchical and empirical Bayes.
Frequentist Performance of Bayesian Procedures
Many results known for parametric Bayes performance:
Estimates are consistent if the prior doesn't exclude the true value.
Credible regions found with flat priors are typically confidence regions to O(n^(-1/2)); "reference" priors can improve their performance to O(n^(-1)).
Marginal distributions have better frequentist performance than conventional methods like profile likelihood. (Bartlett correction, ancillaries, bootstrap are competitive but hard.)
Bayesian model comparison is asymptotically consistent (not true of significance/NP tests, AIC).
For separate (not nested) models, the posterior probability for the true model converges to 1 exponentially quickly.
Wald's complete class theorem: Optimal frequentist methods are Bayes rules (equivalent to Bayes for some prior) . . .
Parametric Bayesian methods are typically good frequentist methods. (Not so clear in nonparametric problems.)