
Transcript of 13_loredo07


    Introduction to Bayesian Inference

    Tom Loredo

    Dept. of Astronomy, Cornell University
    http://www.astro.cornell.edu/staff/loredo/bayes/

    CASt Summer School June 7, 2007


    Outline

    1 The Big Picture

    2 Foundations: Logic & Probability Theory

    3 Inference With Parametric Models: Parameter Estimation, Model Uncertainty

    4 Simple Examples: Binary Outcomes, Normal Distribution, Poisson Distribution

    5 Application: Extrasolar Planets

    6 Probability & Frequency


    Scientific Method

    Science is more than a body of knowledge; it is a way of thinking.

    The method of science, as stodgy and grumpy as it may seem, is far more important than the findings of science.

    Carl Sagan

    Scientists argue!

    Argument: Collection of statements comprising an act of reasoning from premises to a conclusion

    A key goal of science: Explain or predict quantitative measurements

    Framework: Mathematical modeling

    Science uses rational argument to construct and appraise mathematical models for measurements


    Mathematical Models

    A model (in physics) is a representation of structure in a physical system and/or its properties. (Hestenes)

    [Diagram: REAL WORLD <-> MATHEMATICAL WORLD, connected by translation & interpretation]

    A model is a surrogate

    The model is not the modeled system! It stands in for the system for a particular purpose, and is subject to revision.

    A model is an idealization

    A model is a caricature of the system being modeled (Kac). It focuses on a subset of system properties of interest.


    A model is an abstraction

    Models identify common features of different things so that general ideas can be created and applied to different situations.

    We seek a mathematical model for quantifying uncertainty; it will share these characteristics with physical models.

    Asides

    Theories are frameworks guiding model construction (laws, principles).

    Physics as modeling is a leading school of thought in physics education research; e.g., http://modeling.la.asu.edu/


    The Role of Data

    Data do not speak for themselves!

    We don't just tabulate data, we analyze data.

    We gather data so they may speak for or against existing hypotheses, and guide the formation of new hypotheses.

    A key role of data in science is to be among the premises in scientific arguments.


    Bayesian Statistical Inference

    A different approach to all statistical inference problems (i.e., not just another method in the list: BLUE, maximum likelihood, χ² testing, ANOVA, survival analysis...)

    Foundation: Use probability theory to quantify the strength of arguments (i.e., a more abstract view than restricting PT to describe variability in repeated random experiments)

    Focuses on deriving consequences of modeling assumptions rather than devising and calibrating procedures


    Outline

    1 The Big Picture

    2 Foundations: Logic & Probability Theory

    3 Inference With Parametric Models: Parameter Estimation, Model Uncertainty

    4 Simple Examples: Binary Outcomes, Normal Distribution, Poisson Distribution

    5 Application: Extrasolar Planets

    6 Probability & Frequency


    Logic: Some Essentials

    Logic can be defined as the analysis and appraisal of arguments. (Gensler, Intro to Logic)

    Build arguments with propositions and logical operators/connectives

    Propositions: Statements that may be true or false

    P : Universe can be modeled with ΛCDM
    A : Ωtot ∈ [0.9, 1.1]
    B : ΩΛ is not 0
    B̄ : not B, i.e., ΩΛ = 0

    Connectives:
    A ∧ B : A and B are both true
    A ∨ B : A or B is true, or both are


    Arguments

    Argument: Assertion that an hypothesized conclusion, H, follows from premises, P = {A, B, C, ...} (take "," = "and")

    Notation:
    H|P : Premises P imply H
          H may be deduced from P
          H follows from P
          H is true given that P is true

    Arguments are (compound) propositions.

    Central role of arguments, so special terminology:
    A true argument is valid
    A false argument is invalid or fallacious


    Valid vs. Sound Arguments

    Content vs. form

    An argument is factually correct iff all of its premises are true (it has good content).

    An argument is valid iff its conclusion follows from its premises (it has good form).

    An argument is sound iff it is both factually correct and valid (it has good form and content).

    We want to make sound arguments. Logic and statistical methods address validity, but there is no formal approach for addressing factual correctness.


    Factual Correctness

    Although logic can teach us something about validity and invalidity, it can teach us very little about factual correctness. The question of the truth or falsity of individual statements is primarily the subject matter of the sciences.

    Hardegree, Symbolic Logic

    To test the truth or falsehood of premisses is the task of science. . . . But as a matter of fact we are interested in, and must often depend upon, the correctness of arguments whose premisses are not known to be true.

    Copi, Introduction to Logic


    Premises

    Facts: Things known to be true, e.g. observed data

    Obvious assumptions: Axioms, postulates, e.g., Euclid's first 4 postulates (line segment b/t 2 points; congruency of right angles...)

    Reasonable or working assumptions: E.g., Euclid's fifth postulate (parallel lines)

    Desperate presumption!

    Conclusions from other arguments


    Deductive and Inductive Inference

    Deduction (syllogism as prototype)

    Premise 1: A implies H
    Premise 2: A is true
    Deduction: H is true
    H|P is a valid argument

    Induction (analogy as prototype)

    Premise 1: A, B, C, D, E all share properties x, y, z
    Premise 2: F has properties x, y
    Induction: F has property z
    "F has z"|P is not valid, but may still be rational (likely, plausible, probable); some such arguments are stronger than others

    Boolean algebra (and/or/not over {0, 1}) quantifies deduction. Probability theory generalizes this to quantify the strength of inductive arguments.


    Real Number Representation of Induction

    P(H|P) ≡ strength of argument H|P
    P = 0: Argument is invalid
    P = 1: Argument is valid
    P ∈ (0, 1): Degree of deducibility

    A mathematical model for induction:

    AND (product rule): P(A ∧ B|P) = P(A|P) P(B|A ∧ P) = P(B|P) P(A|B ∧ P)

    OR (sum rule): P(A ∨ B|P) = P(A|P) + P(B|P) − P(A ∧ B|P)

    We will explore the implications of this model.


    Interpreting Bayesian Probabilities

    If we like there is no harm in saying that a probability expresses a degree of reasonable belief. . . . Degree of confirmation has been used by Carnap, and possibly avoids some confusion. But whatever verbal expression we use to try to convey the primitive idea, this expression cannot amount to a definition. Essentially the notion can only be described by reference to instances where it is used. It is intended to express a kind of relation between data and consequence that habitually arises in science and in everyday life, and the reader should be able to recognize the relation from examples of the circumstances when it arises.

    Sir Harold Jeffreys, Scientific Inference


    More On Interpretation

    Physics uses words drawn from ordinary language (mass, weight, momentum, force, temperature, heat, etc.) but their technical meaning is more abstract than their colloquial meaning. We can map between the colloquial and abstract meanings associated with specific values by using specific instances as calibrators.

    A Thermal Analogy

    Intuitive notion | Quantification  | Calibration
    Hot, cold        | Temperature, T  | Cold as ice = 273 K; Boiling hot = 373 K
    Uncertainty      | Probability, P  | Certainty = 0, 1;
                     |                 | p = 1/36: as plausible as snake eyes;
                     |                 | p = 1/1024: as plausible as 10 heads


    A Bit More On Interpretation

    Bayesian

    Probability quantifies uncertainty in an inductive inference. p(x) describes how probability is distributed over the possible values x might have taken in the single case before us:

    [Figure: p is distributed over x; x has a single, uncertain value]

    Frequentist

    Probabilities are always (limiting) rates/proportions/frequencies in an ensemble. p(x) describes variability, how the values of x are distributed among the cases in the ensemble:

    [Figure: x is distributed]


    Arguments Relating Hypotheses, Data, and Models

    We seek to appraise scientific hypotheses in light of observed data and modeling assumptions.

    Consider the data and modeling assumptions to be the premises of an argument with each of various hypotheses, Hi, as conclusions: Hi|Dobs, I. (I = background information, everything deemed relevant besides the observed data.)

    P(Hi|Dobs, I) measures the degree to which (Dobs, I) allow one to deduce Hi. It provides an ordering among arguments for various Hi that share common premises.

    Probability theory tells us how to analyze and appraise the argument, i.e., how to calculate P(Hi|Dobs, I) from simpler, hopefully more accessible probabilities.


    The Bayesian Recipe

    Assess hypotheses by calculating their probabilities p(Hi|...) conditional on known and/or presumed information, using the rules of probability theory.

    Probability Theory Axioms:

    OR (sum rule): P(H1 ∨ H2|I) = P(H1|I) + P(H2|I) − P(H1, H2|I)

    AND (product rule): P(H1, D|I) = P(H1|I) P(D|H1, I) = P(D|I) P(H1|D, I)


    Three Important Theorems

    Bayes's Theorem (BT)

    Consider P(Hi, Dobs|I) using the product rule:
    P(Hi, Dobs|I) = P(Hi|I) P(Dobs|Hi, I)
                 = P(Dobs|I) P(Hi|Dobs, I)

    Solve for the posterior probability:

    P(Hi|Dobs, I) = P(Hi|I) P(Dobs|Hi, I) / P(Dobs|I)

    The theorem holds for any propositions, but for hypotheses & data the factors have names:

    posterior = prior × likelihood / (norm. const.), where the norm. const. P(Dobs|I) = prior predictive
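    To make the recipe concrete, here is a minimal Python sketch (not from the talk) that applies Bayes's theorem to a small discrete hypothesis space; the prior and likelihood numbers are invented purely for illustration.

    ```python
    import numpy as np

    # Invented example: three exclusive, exhaustive hypotheses H_i with
    # assumed priors P(H_i|I) and likelihood values P(D_obs|H_i, I).
    prior = np.array([0.5, 0.3, 0.2])          # P(H_i|I); sums to 1
    likelihood = np.array([0.10, 0.40, 0.05])  # P(D_obs|H_i, I)

    # Law of total probability gives the prior predictive (the normalization).
    prior_predictive = np.sum(prior * likelihood)       # P(D_obs|I)

    # Bayes's theorem: posterior = prior * likelihood / prior predictive.
    posterior = prior * likelihood / prior_predictive   # P(H_i|D_obs, I)

    print("P(D_obs|I) =", prior_predictive)
    print("posterior  =", posterior, "sum =", posterior.sum())
    ```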


    Law of Total Probability (LTP)


    Consider exclusive, exhaustive {Bi} (I asserts one of them must be true):

    ∑i P(A, Bi|I) = ∑i P(Bi|A, I) P(A|I) = P(A|I)
                 = ∑i P(Bi|I) P(A|Bi, I)

    If we do not see how to get P(A|I) directly, we can find a set {Bi} and use it as a basis ("extend the conversation"):

    P(A|I) = ∑i P(Bi|I) P(A|Bi, I)

    If our problem already has Bi in it, we can use LTP to get P(A|I) from the joint probabilities (marginalization):

    P(A|I) = ∑i P(A, Bi|I)


    Example: Take A = Dobs, Bi = Hi; then

    P(Dobs|I) = ∑i P(Dobs, Hi|I) = ∑i P(Hi|I) P(Dobs|Hi, I)

    prior predictive for Dobs = average likelihood for the Hi (aka marginal likelihood)

    Normalization

    For exclusive, exhaustive Hi:  ∑i P(Hi|...) = 1


    Well-Posed Problems

    The rules express desired probabilities in terms of other probabilities.

    To get a numerical value out, at some point we have to put numerical values in.

    Direct probabilities are probabilities with numerical values determined directly by premises (via modeling assumptions, symmetry arguments, previous calculations, desperate presumption...).

    An inference problem is well posed only if all the needed probabilities are assignable based on the premises. We may need to add new assumptions as we see what needs to be assigned. We may not be entirely comfortable with what we need to assume! (Remember Euclid's fifth postulate!)

    Should explore how results depend on uncomfortable assumptions (robustness).


    Recap

    Bayesian inference is more than BT

    Bayesian inference quantifies uncertainty by reporting probabilities for things we are uncertain of, given specified premises. It uses all of probability theory, not just (or even primarily) Bayes's theorem.

    The Rules in Plain English

    Ground rule: Specify premises that include everything relevant that you know or are willing to presume to be true (for the sake of the argument!).

    BT: Make your appraisal account for all of your premises. Things you know are false must not enter your accounting.

    LTP: If the premises allow multiple arguments for a hypothesis, its appraisal must account for all of them. Do not just focus on the most or least favorable way a hypothesis may be realized.


    Outline

    1 The Big Picture

    2 Foundations: Logic & Probability Theory

    3 Inference With Parametric Models: Parameter Estimation, Model Uncertainty

    4 Simple Examples: Binary Outcomes, Normal Distribution, Poisson Distribution

    5 Application: Extrasolar Planets

    6 Probability & Frequency


    Inference With Parametric Models

    Models Mi (i = 1 to N), each with parameters θi, each imply a sampling dist'n (conditional predictive dist'n for possible data):

    p(D|θi, Mi)

    The θi dependence when we fix attention on the observed data is the likelihood function:

    Li(θi) ≡ p(Dobs|θi, Mi)

    We may be uncertain about i (model uncertainty) or θi (parameter uncertainty).


    Three Classes of Problems

    Parameter Estimation
    Premise = choice of model (pick a specific i). What can we say about θi?

    Model Assessment
    Model comparison: Premise = {Mi}. What can we say about i?
    Model adequacy/GoF: Premise = M1 ∨ "all alternatives". Is M1 adequate?

    Model Averaging
    Models share some common params: θi = {ψ, ηi}. What can we say about ψ w/o committing to one model? (Systematic error is an example.)


    Parameter Estimation

    Problem statement

    I = Model M with parameters θ (+ any add'l info)
    Hi = statements about θ; e.g. θ ∈ [2.5, 3.5], or θ > 0
    Probability for any such statement can be found using a probability density function (PDF) for θ:

    P(θ ∈ [θ, θ + dθ]|...) = f(θ) dθ = p(θ|...) dθ

    Posterior probability density

    p(θ|D, M) = p(θ|M) L(θ) / ∫ dθ p(θ|M) L(θ)


    Summaries of posterior

    Best fit values:
    Mode, θ̂, maximizes p(θ|D, M)
    Posterior mean, ⟨θ⟩ = ∫ dθ θ p(θ|D, M)

    Uncertainties:
    Credible region Δ of probability C:  C = P(θ ∈ Δ|D, M) = ∫_Δ dθ p(θ|D, M)
    A Highest Posterior Density (HPD) region has p(θ|D, M) higher inside than outside
    Posterior standard deviation, variance, covariances

    Marginal distributions:
    Interesting parameters ψ, nuisance parameters φ
    Marginal dist'n for ψ:  p(ψ|D, M) = ∫ dφ p(ψ, φ|D, M)
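    The summaries above are easy to compute numerically once the posterior is tabulated on a grid. Below is a minimal Python sketch, assuming a toy Gaussian likelihood with known σ and a flat prior; the data values are invented.

    ```python
    import numpy as np

    # Toy setup (invented): Gaussian likelihood for theta with known sigma, flat prior.
    data = np.array([4.8, 5.3, 5.1, 4.6, 5.2])
    sigma = 0.4
    theta = np.linspace(3.0, 7.0, 4001)

    logL = -0.5 * np.sum((data[:, None] - theta[None, :]) ** 2, axis=0) / sigma ** 2
    post = np.exp(logL - logL.max())        # flat prior, so posterior is proportional to L
    post /= np.trapz(post, theta)           # normalize the PDF on the grid

    mode = theta[np.argmax(post)]                             # best fit (mode)
    mean = np.trapz(theta * post, theta)                      # posterior mean
    sd = np.sqrt(np.trapz((theta - mean) ** 2 * post, theta))

    # Approximate HPD region: keep the highest-density grid points until they
    # enclose 68.3% of the posterior probability.
    order = np.argsort(post)[::-1]
    dtheta = theta[1] - theta[0]
    cum = np.cumsum(post[order]) * dtheta
    inside = order[: np.searchsorted(cum, 0.683) + 1]
    print(f"mode={mode:.3f}  mean={mean:.3f}  sd={sd:.3f}")
    print(f"68.3% HPD region ~ [{theta[inside].min():.3f}, {theta[inside].max():.3f}]")
    ```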


    Nuisance Parameters and Marginalization


    To model most data, we need to introduce parameters besides those of ultimate interest: nuisance parameters.

    Example

    We have data from measuring a rate r = s + b that is a sum of an interesting signal s and a background b.
    We have additional data just about b.
    What do the data tell us about s?


    Marginal posterior distribution


    p(s|D, M) = ∫ db p(s, b|D, M) ∝ p(s|M) ∫ db p(b|s) L(s, b) ≡ p(s|M) Lm(s)

    with Lm(s) the marginal likelihood for s. For a broad prior,

    Lm(s) ≈ p(b̂s|s) L(s, b̂s) δbs
      b̂s = best b given s
      δbs = b uncertainty given s

    The profile likelihood Lp(s) ≡ L(s, b̂s) gets weighted by a parameter-space volume factor.

    E.g., Gaussians: ŝ = r̂ − b̂,  σs² = σr² + σb²

    Background subtraction is a special case of background marginalization.
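    A quick numerical check of the Gaussian special case quoted above (a sketch, not from the slides): marginalizing b on a grid reproduces ŝ = r̂ − b̂ and σs² = σr² + σb². The measured values here are invented.

    ```python
    import numpy as np

    # Invented measurements: total rate r = r_hat +/- sig_r, background b = b_hat +/- sig_b.
    r_hat, sig_r = 10.0, 1.5
    b_hat, sig_b = 4.0, 1.0

    s = np.linspace(-5.0, 20.0, 501)
    b = np.linspace(-5.0, 15.0, 401)
    S, B = np.meshgrid(s, b, indexing="ij")

    # Joint posterior with flat priors: the two Gaussian factors of the slide.
    log_post = -0.5 * ((B - b_hat) / sig_b) ** 2 - 0.5 * ((S + B - r_hat) / sig_r) ** 2
    joint = np.exp(log_post - log_post.max())

    # Marginalize over the nuisance parameter b, then summarize s.
    marg = np.trapz(joint, b, axis=1)
    marg /= np.trapz(marg, s)
    mean_s = np.trapz(s * marg, s)
    sd_s = np.sqrt(np.trapz((s - mean_s) ** 2 * marg, s))

    print(mean_s, r_hat - b_hat)              # both ~6.0
    print(sd_s, np.hypot(sig_r, sig_b))       # both ~1.80
    ```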

    Model Comparison


    Problem statement

    I = (M1 ∨ M2 ∨ ...): Specify a set of models.
    Hi = Mi: The hypothesis chooses a model.

    Posterior probability for a model:

    p(Mi|D, I) = p(Mi|I) p(D|Mi, I) / p(D|I) ∝ p(Mi|I) L(Mi)

    But L(Mi) = p(D|Mi) = ∫ dθi p(θi|Mi) p(D|θi, Mi).

    Likelihood for a model = average likelihood for its parameters: L(Mi) = ⟨L(θi)⟩

    Varied terminology: Prior predictive = Average likelihood = Global likelihood = Marginal likelihood = (Weight of) Evidence for the model


    Odds and Bayes factors


    Ratios of probabilities for two propositions using the same premises are called odds:

    Oij ≡ p(Mi|D, I) / p(Mj|D, I) = [p(Mi|I) / p(Mj|I)] × [p(D|Mi, I) / p(D|Mj, I)]

    The data-dependent part is called the Bayes factor:

    Bij ≡ p(D|Mi, I) / p(D|Mj, I)

    It is a likelihood ratio; the BF terminology is usually reserved for cases when the likelihoods are marginal/average likelihoods.


    An Automatic Occam's Razor


    Predictive probabilities can favor simpler models

    p(D|Mi) = ∫ dθi p(θi|Mi) L(θi)

    [Figure: predictive probability P(D|H) vs. data D for a simple H and a complicated H, with Dobs marked on the D axis]


    The Occam Factor


    [Figure: prior p(θ|M) and likelihood L(θ) vs. θ, with likelihood width δθ much smaller than prior width Δθ]

    p(D|Mi) = ∫ dθi p(θi|M) L(θi) ≈ p(θ̂i|M) L(θ̂i) δθi ≈ L(θ̂i) δθi/Δθi
            = Maximum Likelihood × Occam Factor

    Models with more parameters often make the data more probable for the best fit.
    The Occam factor penalizes models for "wasted" volume of parameter space.
    This quantifies the intuition that models shouldn't require fine-tuning.
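    The Occam-factor decomposition can be checked numerically. The sketch below (assumptions: a flat prior of width Δθ and a Gaussian likelihood of width δθ, both invented) compares direct quadrature of p(D|M) with "maximum likelihood × Occam factor".

    ```python
    import numpy as np

    # Assumed toy model: flat prior over [lo, hi] (width Delta), Gaussian likelihood
    # of width delta centred on theta_hat, with delta << Delta.
    theta_hat, delta, L_max = 2.0, 0.1, 1.0
    lo, hi = -3.0, 7.0
    Delta = hi - lo

    theta = np.linspace(lo, hi, 200001)
    prior = np.full_like(theta, 1.0 / Delta)
    L = L_max * np.exp(-0.5 * ((theta - theta_hat) / delta) ** 2)

    # Marginal likelihood by direct quadrature ...
    evidence = np.trapz(prior * L, theta)

    # ... versus maximum likelihood times the Occam factor delta_theta / Delta,
    # where delta_theta = (integral of L) / L_max is the effective likelihood width.
    delta_theta = np.trapz(L, theta) / L_max
    approx = L_max * delta_theta / Delta

    print(evidence, approx)   # they agree, and both are << L_max since Delta >> delta
    ```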


    Model Averaging


    Problem statement

    I = (M1 ∨ M2 ∨ ...): Specify a set of models
    Models all share a set of "interesting" parameters, ψ
    Each has a different set of nuisance parameters ηi (or different prior info about them)
    Hi = statements about ψ

    Model averaging

    Calculate the posterior PDF for ψ:

    p(ψ|D, I) = ∑i p(Mi|D, I) p(ψ|D, Mi)
              ∝ ∑i L(Mi) ∫ dηi p(ψ, ηi|D, Mi)

    The model choice is a (discrete) nuisance parameter here.


    Theme: Parameter Space Volume


    Bayesian calculations sum/integrate over parameter/hypothesis space!

    (Frequentist calculations average over sample space & typically optimize over parameter space.)

    Marginalization weights the profile likelihood by a volume factor for the nuisance parameters.

    Model likelihoods have Occam factors resulting from parameter space volume factors.

    Many virtues of Bayesian methods can be attributed to this accounting for the size of parameter space. This idea does not arise naturally in frequentist statistics (but it can be added by hand).


    Roles of the Prior


    The prior has two roles:

    Incorporate any relevant prior information

    Convert the likelihood from an "intensity" to a "measure": accounts for the size of hypothesis space

    Physical analogy

    Heat:        Q = ∫ dV cv(r) T(r)
    Probability: P ∝ ∫ dθ p(θ|I) L(θ)

    Maximum likelihood focuses on the hottest hypotheses.
    Bayes focuses on the hypotheses with the most heat.
    A high-T region may contain little heat if its cv is low or if its volume is small.
    A high-L region may contain little probability if its prior is low or if its volume is small.


    Recap of Key Ideas


    Probability as generalized logic for appraising arguments
    Three theorems: BT, LTP, Normalization
    Calculations characterized by parameter space integrals
    Credible regions, posterior expectations
    Marginalization over nuisance parameters
    Occam's razor via marginal likelihoods


    Outline


    1 The Big Picture

    2 Foundations: Logic & Probability Theory

    3 Inference With Parametric Models: Parameter Estimation, Model Uncertainty

    4 Simple Examples: Binary Outcomes, Normal Distribution, Poisson Distribution

    5 Application: Extrasolar Planets

    6 Probability & Frequency


    Binary Outcomes: Parameter Estimation

    M = Existence of two outcomes, S and F; each trial has the same probability for S or F

    Hi = Statements about α, the probability for success on the next trial; seek p(α|D, M)

    D = Sequence of results from N observed trials: FFSSSSFSSSFS (n = 8 successes in N = 12 trials)

    Likelihood:

    p(D|α, M) = [p(success|α, M)]^n [p(failure|α, M)]^(N−n) = α^n (1 − α)^(N−n) = L(α)


    Prior

    Starting with no information about α beyond its definition, use as an "uninformative" prior p(α|M) = 1. Justifications:

    Intuition: Don't prefer any interval to any other of the same size

    Bayes's justification: "Ignorance" means that before doing the N trials, we have no preference for how many will be successes:

    P(n successes|M) = 1/(N + 1)  so  p(α|M) = 1

    Consider this a convention: an assumption added to M to make the problem well posed.


    Prior Predictive

    p(D|M) = ∫ dα α^n (1 − α)^(N−n)
            = B(n + 1, N − n + 1) = n! (N − n)! / (N + 1)!

    A Beta integral:  B(a, b) ≡ ∫ dx x^(a−1) (1 − x)^(b−1) = Γ(a)Γ(b)/Γ(a + b).


    Posterior

    p(α|D, M) = [(N + 1)! / (n! (N − n)!)] α^n (1 − α)^(N−n)

    A Beta distribution. Summaries:

    Best fit: α̂ = n/N = 2/3;  ⟨α⟩ = (n + 1)/(N + 2) ≈ 0.64

    Uncertainty: σα = [(n + 1)(N − n + 1) / ((N + 2)²(N + 3))]^(1/2) ≈ 0.12

    Find credible regions numerically, or with the incomplete beta function.

    Note that the posterior depends on the data only through n, not the N binary numbers describing the sequence. n is a (minimal) sufficient statistic.
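    A minimal Python check of these summaries for the n = 8, N = 12 example, using scipy's Beta distribution; the last line simply brackets the central 95% of the posterior.

    ```python
    from scipy import stats

    n, N = 8, 12                           # the slide's example data
    post = stats.beta(n + 1, N - n + 1)    # flat prior => Beta(n+1, N-n+1) posterior

    print("mode:", n / N)                  # 2/3
    print("mean:", post.mean())            # (n+1)/(N+2) ~ 0.64
    print("sd  :", post.std())             # ~ 0.12
    print("central 95% credible region:", post.interval(0.95))
    ```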


    Binary Outcomes: Model Comparison


    Equal Probabilities?

    M1: α = 1/2
    M2: α ∈ [0, 1] with a flat prior.

    Maximum Likelihoods:

    M1: p(D|M1) = 1/2^N = 2.44 × 10⁻⁴

    M2: L(α̂) = (2/3)^n (1/3)^(N−n) = 4.82 × 10⁻⁴

    p(D|M1) / p(D|α̂, M2) = 0.51

    Maximum likelihoods favor M2 (failures more probable).


    Bayes Factor (ratio of model likelihoods)


    p(D|M1) = 1/2^N;  and  p(D|M2) = n! (N − n)! / (N + 1)!

    B12 ≡ p(D|M1) / p(D|M2) = (N + 1)! / [n! (N − n)! 2^N] = 1.57

    Bayes factor (odds) favors M1 (equiprobable).

    Note that for n = 6, B12 = 2.93; for this small amount of data, we can never be very sure the results are equiprobable.

    If n = 0, B12 ≈ 1/315; if n = 2, B12 ≈ 1/4.8; for extreme data, 12 flips can be enough to lead us to strongly suspect the outcomes have different probabilities.

    (Frequentist significance tests can reject null for any sample size.)
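    The Bayes factors quoted above follow directly from B12 = (N + 1)! / [n! (N − n)! 2^N]; a short Python sketch reproduces them exactly.

    ```python
    from math import comb

    def bayes_factor_12(n, N):
        """B12 = p(D|M1)/p(D|M2) = (N+1)!/(n!(N-n)! 2^N) = (N+1)*C(N,n)/2^N."""
        return (N + 1) * comb(N, n) / 2 ** N

    N = 12
    for n in (8, 6, 2, 0):
        b12 = bayes_factor_12(n, N)
        print(f"n={n:2d}  B12={b12:.3f}  1/B12={1/b12:.1f}")
    # Reproduces the quoted values: 1.57 (n=8), 2.93 (n=6), ~1/4.8 (n=2), ~1/315 (n=0).
    ```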


    Binary Outcomes: Binomial Distribution


    Suppose D = n (the number of heads in N trials), rather than the actual sequence. What is p(α|n, M)?

    Likelihood

    Let S = a sequence of flips with n heads.

    p(n|α, M) = ∑S p(S|α, M) p(n|S, α, M)
              = α^n (1 − α)^(N−n) × [# of sequences with n successes]
              = α^n (1 − α)^(N−n) Cn,N

    Cn,N = # of sequences of length N with n heads.

    p(n|α, M) = [N! / (n! (N − n)!)] α^n (1 − α)^(N−n)

    The binomial distribution for n given α, N.

    Posterior


    p(α|n, M) = [N! / (n! (N − n)!)] α^n (1 − α)^(N−n) / p(n|M)

    p(n|M) = [N! / (n! (N − n)!)] ∫ dα α^n (1 − α)^(N−n) = 1/(N + 1)

    p(α|n, M) = [(N + 1)! / (n! (N − n)!)] α^n (1 − α)^(N−n)

    Same result as when the data specified the actual sequence.


    Another Variation: Negative Binomial


    Suppose D = N, the number of trials it took to obtain a predefined number of successes, n = 8. What is p(α|N, M)?

    Likelihood

    p(N|α, M) is the probability for n − 1 successes in N − 1 trials, times the probability that the final trial is a success:

    p(N|α, M) = [(N − 1)! / ((n − 1)! (N − n)!)] α^(n−1) (1 − α)^(N−n) × α
              = [(N − 1)! / ((n − 1)! (N − n)!)] α^n (1 − α)^(N−n)

    The negative binomial distribution for N given α, n.


    Posterior

    p(α|D, M) = C'n,N α^n (1 − α)^(N−n) / p(D|M)

    p(D|M) = C'n,N ∫ dα α^n (1 − α)^(N−n)  so  p(α|D, M) = [(N + 1)! / (n! (N − n)!)] α^n (1 − α)^(N−n)

    Same result as in the other cases.


    Final Variation: Meteorological Stopping


    Suppose D = (N, n), the number of samples and the number of successes in an observing run whose total number was determined by the weather at the telescope. What is p(α|D, M′)? (M′ adds info about the weather to M.)

    Likelihood

    p(D|α, M′) is the binomial distribution times the probability that the weather allowed N samples, W(N):

    p(D|α, M′) = W(N) [N! / (n! (N − n)!)] α^n (1 − α)^(N−n)

    Let C''n,N = W(N) (N choose n). We get the same result as before!


    Likelihood Principle


    To define L(Hi) = p(Dobs|Hi, I), we must contemplate what other data we might have obtained. But the real sample space may be determined by many complicated, seemingly irrelevant factors; it may not be well-specified at all. Should this concern us?

    Likelihood principle: The results of inferences depend only on how p(Dobs|Hi, I) varies w.r.t. hypotheses. We can ignore aspects of the observing/sampling procedure that do not affect this dependence.

    This is a sensible property that frequentist methods do not share. Frequentist probabilities are long-run rates of performance, and thus depend on details of the sample space that may be irrelevant in a Bayesian calculation.

    Example: Predict 10% of a sample is Type A; observe nA = 5 for N = 96.
    A significance test accepts α = 0.1 for binomial sampling: p(> χ²|α = 0.1) = 0.12
    A significance test rejects α = 0.1 for negative binomial sampling: p(> χ²|α = 0.1) = 0.03


    Gauss's Observation: Sufficiency


    Suppose our data consist of N measurements, di = μ + εi.

    Suppose the noise contributions are independent, and εi ~ N(0, σ²).

    p(D|μ, σ, M) = ∏i p(di|μ, σ, M)
                 = ∏i p(εi = di − μ|μ, σ, M)
                 = ∏i [1/(σ√(2π))] exp[−(di − μ)²/(2σ²)]
                 = [1/(σ^N (2π)^(N/2))] e^(−Q(μ)/(2σ²))


    Find the dependence of Q on μ by completing the square:


    Q(μ) = ∑i (di − μ)²
         = ∑i di² + Nμ² − 2Nμ d̄,    where d̄ ≡ (1/N) ∑i di
         = N(μ − d̄)² + Nr²,          where r² ≡ (1/N) ∑i (di − d̄)²

    The likelihood depends on {di} only through d̄ and r:

    L(μ, σ) = [1/(σ^N (2π)^(N/2))] exp[−Nr²/(2σ²)] exp[−N(μ − d̄)²/(2σ²)]

    The sample mean and variance are sufficient statistics.

    This is a miraculous compression of information: the normal dist'n is highly abnormal in this respect!
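    A short numerical illustration of this sufficiency (a sketch with simulated data, not from the talk): the likelihood evaluated from the full data set and from (d̄, r) alone agree to machine precision.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    N, mu_true, sigma = 20, 3.0, 1.5
    d = mu_true + sigma * rng.normal(size=N)          # simulated data (assumption)

    dbar = d.mean()
    r2 = np.mean((d - dbar) ** 2)                     # r^2 as defined above

    def logL_full(mu):
        """Log-likelihood from the full data set."""
        return (-N * np.log(sigma) - 0.5 * N * np.log(2 * np.pi)
                - 0.5 * np.sum((d - mu) ** 2) / sigma ** 2)

    def logL_sufficient(mu):
        """Log-likelihood using only the sufficient statistics dbar and r."""
        return (-N * np.log(sigma) - 0.5 * N * np.log(2 * np.pi)
                - 0.5 * (N * r2 + N * (mu - dbar) ** 2) / sigma ** 2)

    for mu in (2.0, 3.0, 4.0):
        print(mu, logL_full(mu), logL_sufficient(mu))   # identical up to roundoff
    ```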


    Uninformative prior

    Translation invariance implies p(μ) = C, a constant. This prior is improper unless bounded.

    Prior predictive/normalization

    p(D|σ, M) = ∫ dμ C exp[−N(μ − d̄)²/(2σ²)] = C (σ/√N) √(2π)

    . . . minus a tiny bit from tails, using a proper prior.


    Posterior

    p(μ|D, σ, M) = [1/((σ/√N) √(2π))] exp[−N(μ − d̄)²/(2σ²)]

    The posterior is N(d̄, w²), with standard deviation w = σ/√N.

    The 68.3% HPD credible region for μ is d̄ ± σ/√N.

    Note that C drops out; the limit of an infinite prior range is well behaved.


    Informative Conjugate Prior


    Use a normal prior, μ ~ N(μ0, w0²)

    Posterior

    Normal: N(μ̃, w̃²), but the mean and std. deviation "shrink" towards the prior.
    Define B = w²/(w² + w0²); B quantifies the shrinkage toward the prior.


    Problem specification

    Model: di = μ + εi, εi ~ N(0, σ²), with σ unknown
    Parameter space: (μ, σ); seek p(μ|D, M)

    Likelihood

    p(D|μ, σ, M) = [1/(σ^N (2π)^(N/2))] exp[−Nr²/(2σ²)] exp[−N(μ − d̄)²/(2σ²)]
                 ∝ (1/σ^N) e^(−Q/(2σ²))

    where Q = N[r² + (μ − d̄)²]


    Marginal Posterior

    p(μ|D, M) ∝ ∫ dσ (1/σ^(N+1)) e^(−Q/(2σ²))

    Let τ = Q/(2σ²), so σ = [Q/(2τ)]^(1/2) and |dσ| = (1/2) (Q/2)^(1/2) τ^(−3/2) dτ

    p(μ|D, M) ∝ Q^(−N/2) ∫ dτ τ^(N/2 − 1) e^(−τ)  ∝  Q^(−N/2)


    Write Q = Nr²[1 + ((μ − d̄)/r)²] and normalize:

    p(μ|D, M) = [(N/2 − 1)! / ((N/2 − 3/2)! √π)] (1/r) [1 + (1/N) ((μ − d̄)/(r/√N))²]^(−N/2)

    Student's t distribution, with t = (μ − d̄)/(r/√N)

    A bell curve, but with power-law tails

    Large N:  p(μ|D, M) ≈ e^(−N(μ − d̄)²/(2r²))
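    The σ-marginalization can be checked numerically. The sketch below (simulated data, 1/σ prior assumed as above) integrates σ out on a grid and compares the result with the Student's-t shape [1 + ((μ − d̄)/r)²]^(−N/2).

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    N = 10
    d = 5.0 + 2.0 * rng.normal(size=N)       # simulated data (assumption)
    dbar, r = d.mean(), d.std()              # np.std uses 1/N, matching r as defined

    mu = np.linspace(dbar - 5 * r, dbar + 5 * r, 2001)
    sigma = np.linspace(1e-3, 20.0, 4000)

    # Marginal posterior: integrate (1/sigma^(N+1)) exp(-Q/(2 sigma^2)) over sigma,
    # with Q = N[r^2 + (mu - dbar)^2]; the extra 1/sigma is the scale-parameter prior.
    Q = N * (r ** 2 + (mu[:, None] - dbar) ** 2)
    log_f = -(N + 1) * np.log(sigma[None, :]) - Q / (2 * sigma[None, :] ** 2)
    marg = np.trapz(np.exp(log_f - log_f.max()), sigma, axis=1)
    marg /= np.trapz(marg, mu)

    # Analytic shape: p(mu|D, M) proportional to [1 + ((mu - dbar)/r)^2]^(-N/2)
    analytic = (1.0 + ((mu - dbar) / r) ** 2) ** (-N / 2)
    analytic /= np.trapz(analytic, mu)

    print(np.max(np.abs(marg - analytic)))   # tiny compared with analytic.max()
    ```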


    Poisson Dist'n: Infer a Rate from Counts

    Problem: Observe n counts in T; infer the rate, r


    Likelihood

    L(r) ≡ p(n|r, M) = [(rT)^n / n!] e^(−rT)

    Prior

    Two standard choices:

    r known to be nonzero; it is a scale parameter:
    p(r|M) = [1/ln(ru/rl)] (1/r)

    r may vanish; require p(n|M) ≈ const:
    p(r|M) = 1/ru


    Prior predictive


    p(n|M) = (1/ru) (1/n!) ∫(0 to ru) dr (rT)^n e^(−rT)
           = [1/(ru T)] (1/n!) ∫(0 to ru T) d(rT) (rT)^n e^(−rT)
           ≈ 1/(ru T)    for ru ≫ n/T

    Posterior

    A gamma distribution:

    p(r|n, M) = [T (rT)^n / n!] e^(−rT)


    Gamma Distributions


    A 2-parameter family of distributions over nonnegative x, with shape parameter α and scale parameter s:

    p(x|α, s) = [1/(s Γ(α))] (x/s)^(α−1) e^(−x/s)

    Moments:  E(x) = αs,  Var(x) = αs²

    Our posterior corresponds to α = n + 1, s = 1/T.

    Mode r̂ = n/T; mean ⟨r⟩ = (n + 1)/T (shift down by 1 with the 1/r prior)

    Std. dev'n σr = √(n + 1)/T; credible regions are found by integrating (can use the incomplete gamma function)
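    A minimal Python sketch of these rate summaries using scipy's Gamma distribution; the counts n and time T are invented example numbers, not from the slides.

    ```python
    from scipy import stats

    n, T = 16, 4.0                                # invented counts and observing time
    post = stats.gamma(a=n + 1, scale=1.0 / T)    # flat-prior posterior: shape n+1, scale 1/T

    print("mode:", n / T)                         # n/T
    print("mean:", post.mean())                   # (n+1)/T
    print("sd  :", post.std())                    # sqrt(n+1)/T
    print("central 95% credible region:", post.interval(0.95))
    ```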


    The flat prior


    Bayes's justification: Not that ignorance of r implies p(r|I) = C. Rather, require the (discrete) predictive distribution to be flat:

    p(n|I) = ∫ dr p(r|I) p(n|r, I) = C  so  p(r|I) = C

    Useful conventions

    Use a flat prior for a rate that may be zero
    Use a log-flat prior (∝ 1/r) for a nonzero scale parameter
    Use proper (normalized, bounded) priors
    Plot the posterior with an abscissa that makes the prior flat


    The On/Off Problem

    Basic problem


    Look off-source; unknown background rate b. Count Noff photons in interval Toff.

    Look on-source; the rate is r = s + b with unknown signal s. Count Non photons in interval Ton.

    Infer s.

    Conventional solution

    b̂ = Noff/Toff;   σb = √Noff / Toff
    r̂ = Non/Ton;     σr = √Non / Ton
    ŝ = r̂ − b̂;       σs = √(σr² + σb²)

    But ŝ can be negative!


    Examples

    Spectra of X-Ray Sources


    [Figures: X-ray source spectra, from Bassani et al. 1989 and Di Salvo et al. 2001]


    Spectrum of Ultrahigh-Energy Cosmic Rays

    [Figures: the UHE cosmic-ray spectrum, flux × E³ vs. log10(E) (eV), from Nagano & Watson 2000 and the HiRes Team 2007, showing AGASA, HiRes-1 Monocular, and HiRes-2 Monocular data]


    N is Never Large


    Sample sizes are never large. If N is too small to get a sufficiently precise estimate, you need to get more data (or make more assumptions). But once N is "large enough," you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc.). N is never enough because if it were "enough" you'd already be on to the next problem for which you need more data.

    Similarly, you never have quite enough money. But that's another story.

    Andrew Gelman (blog entry, 31 July 2005)


    Backgrounds as Nuisance Parameters


    Background marginalization with Gaussian noise

    Measure the background rate b = b̂ ± σb with the source off. Measure the total rate r = r̂ ± σr with the source on. Infer the signal source strength s, where r = s + b. With flat priors,

    p(s, b|D, M) ∝ exp[−(b − b̂)²/(2σb²)] exp[−(s + b − r̂)²/(2σr²)]


    Marginalize b to summarize the results for s (complete the square to isolate the b dependence; then do a simple Gaussian integral over b):

    p(s|D, M) ∝ exp[−(s − ŝ)²/(2σs²)],   with ŝ = r̂ − b̂ and σs² = σr² + σb²

    Background subtraction is a special case of background marginalization.


    Bayesian Solution to On/Off Problem


    First consider off-source data; use it to estimate b:

    p(b|Noff, Ioff) = Toff (bToff)^Noff e^(−bToff) / Noff!

    Use this as a prior for b to analyze the on-source data. For the on-source analysis, Iall = (Ion, Noff, Ioff):

    p(s, b|Non, Iall) ∝ p(s) p(b) [(s + b)Ton]^Non e^(−(s+b)Ton)

    p(s|Iall) is flat, but p(b|Iall) = p(b|Noff, Ioff), so

    p(s, b|Non, Iall) ∝ (s + b)^Non b^Noff e^(−sTon) e^(−b(Ton + Toff))


    Now marginalize over b;


    p(s|Non, Iall) = ∫ db p(s, b|Non, Iall)
                   ∝ ∫ db (s + b)^Non b^Noff e^(−sTon) e^(−b(Ton + Toff))

    Expand (s + b)^Non and do the resulting integrals:

    p(s|Non, Iall) = ∑(i=0..Non) Ci Ton (sTon)^i e^(−sTon) / i!

    Ci ∝ (1 + Toff/Ton)^i (Non + Noff − i)! / (Non − i)!

    The posterior is a weighted sum of Gamma distributions, each assigning a different number of on-source counts to the source. (Evaluate via a recursive algorithm or the confluent hypergeometric function.)
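    The weighted-Gamma-sum posterior is straightforward to implement. The sketch below (example counts invented) evaluates the Ci in log space for numerical stability and cross-checks the result against a brute-force numerical marginalization of b.

    ```python
    import numpy as np
    from scipy.special import gammaln

    def on_off_posterior(s, N_on, N_off, T_on, T_off):
        """p(s|N_on, I_all) as the weighted sum of Gamma densities given above."""
        i = np.arange(N_on + 1)
        # log of the unnormalized weights C_i:
        logw = (i * np.log1p(T_off / T_on)
                + gammaln(N_on + N_off - i + 1) - gammaln(N_on - i + 1))
        C = np.exp(logw - logw.max())
        C /= C.sum()
        # Gamma(shape i+1, scale 1/T_on) densities in s, mixed with weights C_i.
        log_terms = (i[:, None] * np.log(np.maximum(s, 1e-300) * T_on)
                     - s[None, :] * T_on - gammaln(i[:, None] + 1) + np.log(T_on))
        return C @ np.exp(log_terms)

    N_on, N_off, T_on, T_off = 16, 9, 1.0, 3.0     # invented example counts/times
    s = np.linspace(0.0, 40.0, 2001)
    p_exact = on_off_posterior(s, N_on, N_off, T_on, T_off)

    # Brute-force check: marginalize b numerically from the joint posterior.
    b = np.linspace(0.0, 40.0, 4001)
    S, B = np.meshgrid(s, b, indexing="ij")
    joint = (S + B) ** N_on * B ** N_off * np.exp(-S * T_on - B * (T_on + T_off))
    p_brute = np.trapz(joint, b, axis=1)
    p_brute /= np.trapz(p_brute, s)

    print(np.trapz(p_exact, s))                    # ~1: the mixture is normalized
    print(np.max(np.abs(p_exact - p_brute)))       # small: the two routes agree
    ```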


    Example On/Off Posteriors: Short Integrations

    [Figures: example posteriors for s, with Ton = 1]


    Example On/Off Posteriors: Long Background Integrations

    [Figures: example posteriors for s, with Ton = 1]


    Multibin On/Off

    The more typical on/off scenario:


    Data = spectrum or image with counts in many bins

    Model M gives the signal rate sk(θ) in bin k, with parameters θ

    To infer θ, we need the likelihood:

    L(θ) = ∏k p(Non,k, Noff,k | sk(θ), M)

    For each k, we have an on/off problem as before, only we just need the marginal likelihood for sk (not the posterior). The same Ci coefficients arise.

    XSPEC and CIAO/Sherpa provide this as an option.

    CHASC approach does the same thing via data augmentation.


    Bayesian Computation

    Large sample size: Laplace approximation


    Approximate the posterior as multivariate normal; det(covariance) factors
    Uses ingredients available in χ²/ML fitting software (MLE, Hessian)
    Often accurate to O(1/N)

    Low-dimensional models (d ≲ 5)

    Posterior sampling: create an RNG that samples the posterior
    MCMC is the most general framework
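    A minimal sketch of the Laplace approximation for the marginal likelihood, assuming a toy straight-line model with Gaussian noise and a flat, proper prior (all numbers invented); a grid integration provides a check, which is feasible only because the model has two parameters.

    ```python
    import numpy as np
    from scipy.optimize import minimize

    # Toy problem (all invented): d = a + b*x with known noise sigma, flat prior on
    # (a, b) over a box, so the Laplace approximation should be essentially exact.
    rng = np.random.default_rng(3)
    x = np.linspace(0.0, 1.0, 15)
    sigma = 0.3
    d = 1.0 + 2.0 * x + sigma * rng.normal(size=x.size)
    a_rng, b_rng = (-5.0, 5.0), (-5.0, 5.0)
    log_prior = -np.log((a_rng[1] - a_rng[0]) * (b_rng[1] - b_rng[0]))

    def log_post(theta):
        a, b = theta
        return (log_prior - 0.5 * np.sum((d - a - b * x) ** 2) / sigma ** 2
                - x.size * np.log(sigma * np.sqrt(2 * np.pi)))

    res = minimize(lambda t: -log_post(t), x0=[0.0, 0.0])   # find the posterior mode

    def hessian(theta, eps=1e-4):
        """Finite-difference Hessian of -log posterior at theta."""
        f = lambda t: -log_post(t)
        H = np.zeros((2, 2))
        for i in range(2):
            for j in range(2):
                tpp = np.array(theta); tpp[i] += eps; tpp[j] += eps
                tpm = np.array(theta); tpm[i] += eps; tpm[j] -= eps
                tmp = np.array(theta); tmp[i] -= eps; tmp[j] += eps
                tmm = np.array(theta); tmm[i] -= eps; tmm[j] -= eps
                H[i, j] = (f(tpp) - f(tpm) - f(tmp) + f(tmm)) / (4 * eps ** 2)
        return H

    H = hessian(res.x)
    # log evidence ~ log p(mode) + (d/2) log(2 pi) - (1/2) log det(H), with d = 2.
    logZ_laplace = log_post(res.x) + np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(H))

    # Grid check over the prior box.
    a = np.linspace(*a_rng, 401)
    b = np.linspace(*b_rng, 401)
    chi2 = np.sum((d[None, None, :] - a[:, None, None]
                   - b[None, :, None] * x[None, None, :]) ** 2, axis=2) / sigma ** 2
    lp = log_prior - 0.5 * chi2 - x.size * np.log(sigma * np.sqrt(2 * np.pi))
    m = lp.max()
    logZ_grid = m + np.log(np.trapz(np.trapz(np.exp(lp - m), b, axis=1), a))

    print(logZ_laplace, logZ_grid)    # the two estimates agree closely
    ```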


    Outline

    1 The Big Picture


    2 Foundations: Logic & Probability Theory

    3 Inference With Parametric Models: Parameter Estimation, Model Uncertainty

    4 Simple Examples: Binary Outcomes, Normal Distribution, Poisson Distribution

    5 Application: Extrasolar Planets

    6 Probability & Frequency


    Extrasolar Planets

    Exoplanet detection/measurement methods:


    Direct: Transits, gravitational lensing, imaging, interferometric nulling

    Indirect: Keplerian reflex motion (line-of-sight velocity, astrometric wobble)

    The Sun's Wobble From 10 pc

    [Figure: the Sun's reflex wobble as seen from 10 pc; both axes run from −1000 to 1000]


    Radial Velocity Technique

    As of May 2007, 242 planets found, including 26 multiple-planet systems


    Vast majority (230) found via Doppler radial velocity (RV) measurements

    Analysis method: Identify the best candidate period via a periodogram; fit parameters with nonlinear least squares

    Issues: Multimodality & multiple planets, nonlinearity, stellar jitter, long periods, marginal detections, population studies...


    Keplerian Radial Velocity Model

    Parameters for a single planet:


    τ = orbital period (days)
    e = orbital eccentricity
    K = velocity amplitude (m/s)
    ω = longitude of pericenter
    Mp = mean anomaly at pericenter passage
    v0 = system center-of-mass velocity

    Velocity vs. time

    v(t) = v0 + K (e cos ω + cos[ω + ν(t)])

    The true anomaly ν(t) is found via Kepler's equation for the eccentric anomaly:

    E(t) − e sin E(t) = 2πt/τ − Mp;    tan(ν/2) = [(1 + e)/(1 − e)]^(1/2) tan(E/2)

    A strongly nonlinear model!

    Kepler's laws relate (K, τ, e) to the masses, semimajor axis a, and inclination i.
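    A minimal Python sketch of evaluating this velocity model: solve Kepler's equation by Newton iteration, convert to the true anomaly, and form v(t). The parameter values are illustrative (roughly the toy values used later in the talk), not fitted to anything.

    ```python
    import numpy as np

    def rv_model(t, tau, e, K, Mp, omega, v0):
        """Keplerian radial velocity v(t) for a single planet (sketch of the model above)."""
        M = 2.0 * np.pi * t / tau - Mp                 # mean anomaly
        E = M.copy()                                   # solve E - e sin E = M by Newton iteration
        for _ in range(50):
            E -= (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
        # true anomaly from the eccentric anomaly; only cos(omega + nu) is needed,
        # so the arctan branch ambiguity (a multiple of 2*pi in nu) is harmless
        nu = 2.0 * np.arctan(np.sqrt((1.0 + e) / (1.0 - e)) * np.tan(E / 2.0))
        return v0 + K * (e * np.cos(omega) + np.cos(omega + nu))

    # Illustrative parameter values.
    t = np.linspace(0.0, 1600.0, 400)                                  # days
    v = rv_model(t, tau=800.0, e=0.5, K=50.0, Mp=0.0, omega=1.0, v0=0.0)
    print(v.min(), v.max())
    ```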


    The Likelihood Function

    Keplerian velocity model with parameters θ = {K, τ, e, Mp, ω, v0}:

    di = v(ti; θ) + εi


    For measurement errors with std dev'n σi, and additional "jitter" with std dev'n σJ,

    L(θ, σJ) ≡ p({di}|θ, σJ)
             = ∏(i=1..N) [2π(σi² + σJ²)]^(−1/2) exp{ −[di − v(ti; θ)]² / [2(σi² + σJ²)] }
             ∝ ∏i (σi² + σJ²)^(−1/2) exp[−χ²(θ, σJ)/2]

    where χ²(θ, σJ) ≡ ∑i [di − v(ti; θ)]² / (σi² + σJ²)

    Ignore jitter for now...
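    As a sketch, the likelihood just written is only a few lines of Python; the function below takes precomputed model velocities v(ti; θ) (from, e.g., the earlier Kepler-equation sketch), and the numbers at the end are invented solely to exercise it.

    ```python
    import numpy as np

    def log_likelihood(d, v_pred, sig, sig_J):
        """Gaussian RV log-likelihood with an extra 'jitter' variance sig_J**2.

        d      : measured velocities
        v_pred : model velocities v(t_i; theta), computed elsewhere
        sig    : per-point measurement uncertainties sigma_i
        sig_J  : jitter standard deviation sigma_J
        """
        var = sig ** 2 + sig_J ** 2
        chi2 = np.sum((d - v_pred) ** 2 / var)
        return -0.5 * np.sum(np.log(2.0 * np.pi * var)) - 0.5 * chi2

    # Invented numbers, just to exercise the function.
    d = np.array([3.1, -2.0, 0.5, 4.2])
    v_pred = np.array([2.8, -1.5, 0.0, 3.9])
    sig = np.array([0.5, 0.6, 0.5, 0.7])
    print(log_likelihood(d, v_pred, sig, sig_J=0.0))
    print(log_likelihood(d, v_pred, sig, sig_J=2.0))   # jitter broadens the error bars
    ```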

    What To Do With It

    Parameter estimation

    Posterior dist'n for the parameters of model Mi with i planets:


    p(θ|D, Mi) ∝ p(θ|Mi) Li(θ)

    Summarize with the mode, means, and credible regions (found by integrating over θ)

    Detection

    Calculate the probability for no planets (M0), one planet (M1), .... Let I = {Mi}.

    p(Mi|D, I) ∝ p(Mi|I) L(Mi),  where L(Mi) = ∫ dθ p(θ|Mi) L(θ)

    The marginal likelihood L(Mi) includes the Occam factor.

    Design

    Predict a future datum dt at time t, accounting for model uncertainties:


    p(dt|D, Mi) = ∫ dθ p(dt|θ, Mi) p(θ|D, Mi).

    The 1st factor is Gaussian for dt with a known model; the 2nd factor & the integral account for θ uncertainty.

    Bayesian adaptive exploration

    Find the observing time likely to make the updated posterior shrink the most. Information theory: the best time has the largest dt uncertainty (maximum entropy in p(dt|D, Mi)).

    [Diagram: the observation-inference-design cycle, linking prior information & data, inference results, combined information, predictions, strategy, observation design, and new data]


    Periodogram Connections

    Assume circular orbits: θ = {K, τ, Mp, v0}


    Frequentist

    For given τ, maximize the likelihood over K and Mp (set v0 to the data mean, v̄) to get the profile likelihood:
    log Lp(τ, v̄), the Lomb-Scargle periodogram

    Bayesian

    For given τ, integrate (marginalize) likelihood × prior over K and Mp (set v0 to the data mean, v̄) to get the marginal posterior:
    log p(τ, v̄|D), the Lomb-Scargle periodogram

    Additionally marginalize over v0 to get the "floating-mean" LSP


    Kepler Periodogram for Eccentric Orbits

    The circular model is linear w.r.t. K sin Mp and K cos Mp, hence the coincidence of the profile and the marginal.


    The full model is nonlinear w.r.t. e, Mp, ω.

    Profiling is known to behave poorly for nonlinear parameters (e.g., asymptotic confidence regions have incorrect coverage for finite samples; it can even be inconsistent).

    Marginalization is valid with complete generality; it also has good frequentist properties.

    For given τ, marginalize likelihood × prior over all params but τ: log p(τ|D), the Kepler periodogram

    This requires a 5-d integral.

    Trick: For a K prior ∝ K (interim prior), three of the integrals can be done analytically; integrate numerically only over (e, Mp).


    A Bayesian Workflow for Exoplanets


    Use the Kepler periodogram to reduce the dimensionality to 3-d: (τ, e, Mp).

    Use the Kepler periodogram results (p(τ), moments of e, Mp) to define the initial population for adaptive, population-based MCMC.

    Once we have {τ, e, Mp}, get the associated {K, ω, v0} samples from their exact conditional distribution.

    Fix the interim prior by weighting the MCMC samples.

    (See work by Phil Gregory & Eric Ford for simpler but less efficient workflows.)


    Estimation Results for HD 222582

    24 Keck RV observations spanning 683 days; long period; high eccentricity

    Kepler Periodogram


    Differential Evolution MCMC Performance

    [Figures: marginal posterior for (τ, e); convergence diagnostic: autocorrelation]


    Reaches convergence much more quickly than simpler algorithms


    Multiple Planets in HD 208487

    Phil Gregory's parallel tempering algorithm found a 2nd planet in HD 208487.


    τ1 = 129.8 d; τ2 = 909 d

    [Figures: (a) velocity (m/s) vs. orbital phase folded at P1; (b) velocity (m/s) vs. orbital phase folded at P2; and the joint marginal for (e2, K2)]


    Adaptive Exploration for Toy Problem

    Data are Kepler velocities plus noise:


    di = V(ti; τ, e, K) + ei

    The 3 remaining geometrical params (t0, ω, i) are fixed.

    The noise probability is Gaussian with known σ = 8 m s⁻¹.

    Simulate data with typical Jupiter-like exoplanet parameters:

    τ = 800 d
    e = 0.5
    K = 50 m s⁻¹

    Goal: Estimate the parameters τ, e, and K as efficiently as possible.


    Cycle 1: Inference

    Use flat priors,

    p(τ, e, K|D, I) ∝ exp[−Q(τ, e, K)/(2σ²)]


    Q = sum of squared residuals using best-fit amplitudes.

    Generate {τj, ej, Kj} via posterior sampling.


    Cycle 1 Design: Predictions, Entropy


    Cycle 2: Observation


    Cycle 2: Inference


    Volume V2 = V1/5.8


    Evolution of Inferences

    Cycle 1 (10 samples)


    Cycle 2 (11 samples; V2 = V1/5.8)


    Cycle 3 (12 samples; V3 = V2/3.9)


    Cycle 4 (13 samples; V4 = V3/1.8)


    Outline

    1 The Big Picture

    2 Foundations: Logic & Probability Theory


    3 Inference With Parametric Models: Parameter Estimation, Model Uncertainty

    4 Simple Examples: Binary Outcomes, Normal Distribution, Poisson Distribution

    5 Application: Extrasolar Planets

    6 Probability & Frequency


    Probability & Frequency

    Frequencies are relevant when modeling repeated trials, or repeated sampling from a population or ensemble.


    Frequencies are observables:

    When available, frequencies can be used to infer probabilities for the next trial
    When unavailable, they can be predicted

    Bayesian/Frequentist relationships:

    General relationships between probability and frequency
    Long-run performance of Bayesian procedures
    Examples of Bayesian/frequentist differences


    Relationships Between Probability & Frequency

    Frequency from probability


    Bernoulli's law of large numbers: In repeated i.i.d. trials, given P(success|...) = α, predict

    Nsuccess/Ntotal → α   as Ntotal → ∞

    Probability from frequency

    Bayes's "An Essay Towards Solving a Problem in the Doctrine of Chances" (the first use of Bayes's theorem): the probability for success in the next trial of an i.i.d. sequence satisfies

    P(success|...) → E(Nsuccess/Ntotal)   as Ntotal → ∞


    Subtle Relationships For Non-IID Cases

    Predict the frequency in dependent trials

    rt = result of trial t; p(r1, r2, ..., rN|M) known; predict f:


    ⟨f⟩ = (1/N) ∑t P(rt = success|M)

    where, e.g., P(r1|M) = ∑(r2) ... ∑(rN) p(r1, r2, ..., rN|M)

    Expected frequency of an outcome in many trials = average probability for the outcome across trials.
    But one also finds that σf need not converge to 0.

    Infer probabilities for different but related trials
    Shrinkage: Biased estimators of the probability that share info across trials are better than unbiased/BLUE/MLE estimators.

    A formalism that distinguishes p from f from the outset is particularly valuable for exploring subtle connections. E.g., shrinkage is explored via hierarchical and empirical Bayes.


    Frequentist Performance of Bayesian Procedures

    Many results are known for parametric Bayes performance:

    Estimates are consistent if the prior doesn't exclude the true value.


    Credible regions found with flat priors are typically confidence regions to O(n^(−1/2)); reference priors can improve their performance to O(n^(−1)).

    Marginal distributions have better frequentist performance than conventional methods like profile likelihood. (Bartlett correction, ancillaries, and the bootstrap are competitive but hard.)

    Bayesian model comparison is asymptotically consistent (not true of significance/NP tests, AIC).

    For separate (not nested) models, the posterior probability for the true model converges to 1 exponentially quickly.

    Wald's complete class theorem: Optimal frequentist methods are Bayes rules (equivalent to Bayes for some prior)...

    Parametric Bayesian methods are typically good frequentist methods. (Not so clear in nonparametric problems.)
