STAT 422 & GS01 0013
BAYESIAN DATA ANALYSIS
INSTRUCTORS:
GARY ROSNER ([email protected])
LUIS NIETO-BARAJAS ([email protected])
1. Decision Theory
1.1 Foundations and axioms of coherence
Ø The OBJECTIVE of Statistics, and in particular of Bayesian Statistics, is to
provide a methodology to adequately analyze the available information
(data analysis) and to decide in a reasonable way how best to proceed
(decision theory).
Ø DIAGRAM of Statistics:
[Diagram: Population and Sample connected by two arrows, sampling (population → sample) and inference (sample → population); data analysis feeds into decision making.]
Ø Types of INFERENCE:
                   Classic   Bayesian
   Parametric      √√√       √√
   Nonparametric   √√        √
Ø Statistics is based on PROBABILITY THEORY. Formally, probability is a
function that satisfies certain conditions (axioms of probability), but in
general it can be understood as a measure or quantification of uncertainty.
Ø Although there is only one mathematical definition of probability, there are
several interpretations of it: classical, frequentist and subjective.
BAYESIAN THEORY is based on the subjective interpretation of probability
and has its roots in Bayes' Theorem, due to the Reverend Thomas Bayes
(1702-1761).
Ø Statistical inference is a way of making decisions. Classical methods of
inference ignore important aspects of the decision-making process,
whereas Bayesian methods of inference do take them into account.
Ø What is a decision problem? We face a decision problem when we have to
select from two or more ways of proceeding.
Ø MAKING DECISIONS is a fundamental aspect of the life of a professional
person. For instance, an administrator must constantly make decisions in an
environment with uncertainty: decisions about the best project to undertake,
the opportunity of investing some money, etc.
Ø DECISION THEORY proposes a method of making decisions based on some
basic principles about the coherent choice among alternative options.
Ø ELEMENTS OF A DECISION PROBLEM under uncertainty:
A decision problem is defined by the quadruplet (D, E, C, ≤), where:
q D : Space of decisions. The set of possible alternatives; it has to be exhaustive
(it contains all possibilities) and exclusive (choosing one element of D
excludes the choice of any other).
D = {d1, d2, ..., dk}.
q E : Space of uncertain events. Contains uncertain events relevant to the
decision problem.
Ei = {Ei1, Ei2, ..., Eimi}, i = 1, 2, ..., k.
q C : Space of consequences. The set of possible consequences; it describes the
consequence of choosing each decision.
C = {c1, c2, ..., ck}.
q ≤ : Preference relation among the different options. It is defined in such a way
that d1 ≤ d2 if d2 is preferred over d1.
• REMARK: For the moment we will consider discrete spaces (of decisions,
events and consequences), although the theory also applies to continuous
spaces.
Ø DECISION TREE (under uncertainty):
Ø EXAMPLE 1: A physician needs to decide whether to carry out surgery on a
person he believes has a tumor or to treat with chemotherapy. If the patient
does not have a tumor, the life expectancy is 20 years. If he has a tumor,
undergoes surgery, and survives, he is given 10 years of life; whereas if he
has a tumor and does not undergo surgery, he is only given 2 years of life.
[Decision tree: a decision node branches into the options d1, ..., di, ..., dk; each
di leads to an uncertainty (random) node with branches Ei1, Ei2, ..., Eimi,
ending in the consequences ci1, ci2, ..., cimi. There is no full information
about the consequences of making a decision.]
D = {d1, d2}, where d1 = surgery, d2 = therapy
E = {E11, E12, E13, E21, E22}, where E11 = survival | tumor, E12 = survival |
no tumor, E13 = dead, E21 = tumor, E22 = no tumor
C = {c11, c12, c13, c21, c22}, where c11 = 10, c12 = 20, c13 = 0, c21 = 2, c22 = 20
(years of life)
Ø In practice, most decision problems have a much more complex structure.
For instance, one may have to decide whether or not to carry out an
experiment and, if one does the experiment, make another decision
according to its result (sequential decision problems).
[Decision tree for Example 1: Surgery → Survive → (Tumor: 10 yrs., No tumor:
20 yrs.) and Surgery → Dead: 0 yrs.; Therapy → (Tumor: 2 yrs., No tumor: 20 yrs.).]
Ø Frequently, the set of uncertain events is the same for all decisions, that is,
Ei = {Ei1, Ei2, ..., Eimi} = {E1, E2, ..., Em} = E for all i. In this case, the
problem can be represented as:
        E1    ...   Ej    ...   Em
d1      c11   ...   c1j   ...   c1m
⋮       ⋮           ⋮           ⋮
di      ci1   ...   cij   ...   cim
⋮       ⋮           ⋮           ⋮
dk      ck1   ...   ckj   ...   ckm
Ø The OBJECTIVE of a decision problem under uncertainty is then to choose the
best decision di from the set D without knowing which of the events Eij
of Ei will occur.
Ø Although the events that form each Ei are uncertain, in the sense that we do
not know which of them will occur, in general we have an idea of the
probability of each of them. For instance, for a 25-year-old person, which is
more probable: to live 10 more years, to die in 1 month, or to reach 90 years?
Ø Sometimes it is difficult to order our preferences among all the possible
different consequences. It might be simpler to assign a utility measure to
each of the consequences and then order them according to their utility.
Ø QUANTIFICATION of uncertain events and of consequences.
q The information that the decision maker has about the possible occurrence
of the events can be quantified through a probability function on the space
E.
q In the same way, it is possible to quantify the preferences of the decision
maker among the different consequences through a utility function, in such a
way that cij ≤ ci'j' ⇔ u(cij) ≤ u(ci'j').
Ø Alternatively, it is possible to represent the decision tree as follows:
[Decision tree with quantified branches: each decision di leads to the events
Ei1, Ei2, ..., Eimi with probabilities P(Ei1|di), P(Ei2|di), ..., P(Eimi|di), ending in
the utilities u(ci1), u(ci2), ..., u(cimi).]
[Example of consequences: earn much money & have little available time;
earn little money & have much available time; earn regular money & have
regular available time.]
Ø How to make the best decision?
If in some way we were able to make the uncertainty disappear, we could
order our preferences according to the utility of each decision. Then the
best decision would be the one that has the maximum utility.
Ø STRATEGIES: In principle, we will study four strategies or criteria proposed in
the literature for making decisions.
1) Optimistic: Assume that what will occur is the best consequence of each
option.
2) Pessimistic (or minimax): Assume that what will occur is the worst
consequence of each option.
3) Conditional or most probable: Assume that what will occur is the most
probable consequence.
4) Expected utility: Assume that what will occur is the average
consequence of each option.
q Whichever strategy one adopts, the best option is the one that maximizes the
utility in the resulting tree “without uncertainty.”
Ø EXAMPLE 2: In a parliamentary election in the UK, there were two parties
competing: Conservative and Labor. A gambling house offered the
following options:
a) To someone who bets in favor of the Conservative party, the house was
willing to pay $7 for each bet of $4 if the election favors the
Conservatives; otherwise, the gambler will lose the bet.
b) To someone who bets in favor of the Labor party, the house was willing
to pay $5 for each bet of $4 if the Labor party wins; otherwise, the
gambler will lose his money.
o D = {d1, d2},
where d1 = bet in favor of the Conservative party,
d2 = bet in favor of the Labor party
o E = {E1, E2},
where E1 = the Conservative party wins,
E2 = the Labor party wins
o C = {c11, c12, c21, c22}. If the bet is of $k,
then c11 = −k + (7/4)k = (3/4)k,
c12 = −k,
c21 = −k,
c22 = −k + (5/4)k = (1/4)k.
Which party do I bet on?
Assume that the utility is proportional to the money won, i.e.,
u(cij) = cij, and let π = P(E1) and 1 − π = P(E2).
u(d, E)    E1 (prob. π)    E2 (prob. 1 − π)
d1         (3/4)k          −k
d2         −k              (1/4)k
1) Optimistic: d1 (bet in favor of the Conservatives)
2) Pessimistic: d1 or d2 (either one)
3) Conditional: d1 or d2 (it depends on the value of π)
If π > 1/2, we take E1 as a “sure event” ⇒ d1
If π ≤ 1/2, we take E2 as a “sure event” ⇒ d2
4) Expected utility: d1 or d2 (depending on π)
The expected utilities are:
u(d1) = π(3/4)k + (1 − π)(−k) = ((7/4)π − 1)k
u(d2) = π(−k) + (1 − π)(1/4)k = (1/4 − (5/4)π)k
Then, the best decision would be:
If u(d1) > u(d2) ⇔ π > 5/12 ⇒ d1
If u(d1) < u(d2) ⇔ π < 5/12 ⇒ d2
If u(d1) = u(d2) ⇔ π = 5/12 ⇒ d1 or d2
We can see this graphically if we define the functions
g1(π) = ((7/4)π − 1)k = u(d1), and
g2(π) = (1/4 − (5/4)π)k = u(d2);
then, if k = 1:
[Plot: g1(π) and g2(π) over π ∈ [0, 1], with values from −1 to 1; the lines cross
at π = 5/12, and both are negative for π ∈ (1/5, 4/7).]
The bold line represents the best solution to the decision problem given
by the expected utility criterion.
Remark: If π ∈ (1/5, 4/7), the expected utility of the best decision is
negative!
Question 1: Would you bet if π ∈ (1/5, 4/7)?
Question 2: What value does the house believe π has?
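q The four criteria are easy to script. Below is a minimal sketch in Python for
Example 2 with k = 1; the value of π is an assumption for illustration (any
π > 5/12 makes the expected utility criterion pick d1):

# Four decision criteria for Example 2 (k = 1).
# Rows: d1 = bet Conservative, d2 = bet Labor; columns: events E1, E2.

def strategies(utilities, probs):
    """Return the index of the best option under each of the four criteria."""
    options = range(len(utilities))
    optimistic = max(options, key=lambda i: max(utilities[i]))
    pessimistic = max(options, key=lambda i: min(utilities[i]))
    j = probs.index(max(probs))                # most probable event
    conditional = max(options, key=lambda i: utilities[i][j])
    expected = max(options,
                   key=lambda i: sum(p * u for p, u in zip(probs, utilities[i])))
    return optimistic, pessimistic, conditional, expected

u = [[3/4, -1.0],    # d1: (3/4)k if the Conservatives win, -k otherwise
     [-1.0, 1/4]]    # d2: -k if the Conservatives win, (1/4)k otherwise
pi = 0.45            # assumed P(E1)
print(strategies(u, [pi, 1 - pi]))   # (0, 0, 1, 0): d1, d1 (tie), d2, d1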
Ø INADMISSIBILITY of an option: An option d1 is inadmissible if there exists
another option d2 such that d2 is at least as preferred as d1 no matter what
happens (for every uncertain event) and there is at least one case (uncertain
event) in which d2 is strictly preferred over d1.
Ø AXIOMS OF COHERENCE. These are a series of principles that establish the
conditions for taking coherent decisions and that clarify the possible
ambiguity in the process of making a decision. There are four axioms of
coherence:
1. COMPARABILITY. This axiom establishes that we should at least be able
to express preferences between two different options and, therefore,
between two possible consequences. That is, not all options nor all
consequences are equivalent.
q For all pairs of options d1 and d2 in D, one and only one of the
following conditions is true:
d2 is preferred over d1 ⇔ d1 < d2
d1 is preferred over d2 ⇔ d2 < d1
d1 and d2 are equally preferred ⇔ d1 ∼ d2
2. TRANSITIVITY. This axiom establishes that preferences must be
transitive to avoid contradictions.
q If d1, d2 and d3 are three options and d1<d2 and d2<d3, then it must
happen that d1<d3. Similarly, if d1∼d2 and d2∼d3, then d1∼d3.
3. SUBSTITUTION AND DOMINATION. This axiom establishes that if you
have two situations such that for every result of the first situation there
exists a result preferred in the second situation, then the second situation is
preferred over the first one no matter the result.
q If d1 and d2 are two options and E is an uncertain event and it happens
that d1<d2 when E occurs and d1<d2 when E does not occur, then d1<d2
(no matter the uncertain events). Similarly, if d1∼d2 when E occurs and
d1∼d2 when E does not occur, then d1∼d2.
4. REFERENCE EVENTS. This axiom establishes that to be able to take
reasonable decisions, it is necessary to measure the information and the
preferences of the decision maker in a quantitative form. Thus, there must
be a measure (P) based on reference events.
q The decision maker can imagine a way to generate points z at random in the
two-dimensional unit square in such a way that, for any two regions R1 and
R2 in that square, the event {z ∈ R1} is more plausible than the event
{z ∈ R2} only if the area of R1 is larger than the area of R2.
1.2 Maximum expected utility principle
Ø IMPLICATIONS from the coherence axioms:
q In general, every option di can be written as the set of all possible
consequences given the uncertain events, that is,
di = {cij | Eij, j = 1, ..., mi}.
1) The consequences can be seen as particular cases of options:
c ∼ dc = {c | Ω},
where Ω is the sure event.
From this formulation, we can compare consequences:
c1 ≤ c2 ⇔ {c1 | Ω} ≤ {c2 | Ω}.
Therefore, it is possible to find two consequences c∗ (the worst) and c*
(the best) such that, for any other consequence c, c∗ ≤ c ≤ c*.
2) The uncertain events can also be seen as particular cases of options:
E ∼ dE = {c* | E, c∗ | Ec},
where c∗ and c* are the worst and the best consequences.
In this way, we can also compare the uncertain events:
E ≤ F ⇔ {c* | E, c∗ | Ec} ≤ {c* | F, c∗ | Fc}.
In this case we would say that E is not more plausible than F.
Ø QUANTIFICATION OF THE CONSEQUENCES: The quantification of a
consequence c will be a number u(c) measured on the scale [0,1]. This
quantification will be based on reference events.
o Definition: Utility.
The utility u(c) of a consequence c is the probability q assigned to
the best consequence c* such that the consequence c is equally preferred to
the option
{c* | Rq, c∗ | Rqc}
(this can also be written as {c* | q, c∗ | 1−q}), where Rq is a reference event in
the unit square with area q. From this definition, for every consequence c
there exists an option based on reference events such that
c ∼ {c* | u(c), c∗ | 1−u(c)}.
q Based on coherence axioms 1, 2 and 3, there always exists a number
u(c) ∈ [0,1] that satisfies the condition above, because
c∗ ∼ {c* | R0, c∗ | R0c} ⇒ u(c∗) = 0
c* ∼ {c* | R1, c∗ | R1c} ⇒ u(c*) = 1
Thus, for all c such that c∗ ≤ c ≤ c*, 0 ≤ u(c) ≤ 1.
q EXAMPLE 3: Utility of money. Let us assume that the worst and the best
consequences when playing a game of chance are:
c∗ = $0 (the worst) and c* = $1,000 (the best)
The idea is to determine a utility for every consequence c such
that c∗ ≤ c ≤ c*. Consider the lottery: What do you prefer, to win c surely
(i.e., {c | 1}), or to win c* with probability q and c∗ with probability 1−q
(i.e., {c* | q, c∗ | 1−q})?
If the number of consequences is large or even infinite, the utility function
can be approximated by a model resulting in one of the following forms:
[Figure: utility curves u(c) for c ∈ [0, 1,000]: risk aversion (concave), risk
indifference (linear), risk loving (convex). To avoid falling into a paradox, it is
convenient that the utility function be risk averse.]
q REMARK: In some cases it is more convenient to define the utility function
on a scale different from [0,1], for example, on a time scale, in negative
numbers, in number of products sold, in years of life, etc. It is possible to prove
that a utility function defined on a scale different from [0,1] can be seen as
a linear transformation of the original utility defined on [0,1].
Ø QUANTIFICATION OF THE UNCERTAIN EVENTS: The quantification of the
uncertain events E will also be based on the reference events.
o Definition: Probability.
The probability P(E) of an event E is the area of a region R of the unit
square chosen in such a way that the options {c* | E, c∗ | Ec} and {c* | R, c∗ | Rc}
are equally preferred (equivalent).
o In other words, if dE = {c* | E, c∗ | Ec} and dRq = {c* | Rq, c∗ | Rqc} are such that
dE ∼ dRq, then P(E) = q.
q EXAMPLE 4: Assign a probability to an event E. Suppose that we are facing
the problem of deciding which treatment is best for a patient and that the
worst and the best consequences are:
c∗ = 0 (the worst) and c* = 20 (the best)
Let E = tumor. To determine the probability of E, we consider the
following lotteries: What do you prefer, to win c* with probability q and c∗
with probability 1−q (i.e., {c* | q, c∗ | 1−q}), or to win c* if E occurs and c∗ if
E does not occur (i.e., {c* | E, c∗ | Ec})? P(E) is the value of q at which the
two options become equally preferred.
Finally, this procedure is applied to each of the events, say, E1, E2, ..., Ek. If
the number of events is large or even infinite, the probability function can
be approximated by a model (discrete or continuous) having the following
shape:
[Figure: a continuous model, a density P(θ) over θ ∈ [a, b]; if Eθ = {θ}, then
E = {θ : θ ∈ [a, b]}.]
q REMARK: The probability assigned to an event is always conditional on the
information available at the moment of the assignment; i.e., there are no
absolute probabilities.
Ø DERIVING THE EXPECTED UTILITY:
Up to now we have quantified the consequences and the uncertain events.
Finally we want to assign a number to the options in such a way that the
best option is the one assigned the highest number.
o Theorem: Bayesian decision criterion.
Consider the decision problem defined by D = {d1, d2, ..., dk}, where di =
{cij | Eij, j = 1, ..., mi}, i = 1, ..., k. Let P(Eij | di) be the probability of occurrence
of Eij if option di is selected, and let u(cij) be the utility of the consequence
cij. Then, the quantification of the option di is its expected utility, i.e.,
u(di) = Σj=1..mi u(cij) P(Eij | di).
The optimal decision is d* such that u(d*) = maxi u(di).
PROOF.
di = {cij | Eij, j = 1, ..., mi} = {ci1 | Ei1, ci2 | Ei2, ..., cimi | Eimi}.
Additionally we know that
cij ∼ {c* | Rij, c∗ | Rijc} = {c* | u(cij), c∗ | 1 − u(cij)},
where u(cij) = Area(Rij), and
dE = {c* | E, c∗ | Ec} ∼ {c* | RE, c∗ | REc} = {c* | P(E), c∗ | 1 − P(E)},
where P(E) = Area(RE).
INSTRUCTORS: G. RONER & L. NIETO-BARAJAS
Stat 422 & GS01 0013 Bayesian Data Analysis
20
Then, combining both expressions, we get
di = {(c* | Ri1, c∗ | Ri1c) | Ei1, ..., (c* | Rimi, c∗ | Rimic) | Eimi}
  = {c* | Ri1 ∩ Ei1, c∗ | Ri1c ∩ Ei1, ..., c* | Rimi ∩ Eimi, c∗ | Rimic ∩ Eimi}
  = {c* | (Ri1 ∩ Ei1) ∪ ... ∪ (Rimi ∩ Eimi), c∗ | (Ri1c ∩ Ei1) ∪ ... ∪ (Rimic ∩ Eimi)}.
Finally,
u(di) = Area((Ri1 ∩ Ei1) ∪ ... ∪ (Rimi ∩ Eimi))
     = Σj=1..mi Area(Rij) Area(Eij) = Σj=1..mi u(cij) P(Eij).
Ø IN SUMMARY: If we accept the coherence axioms, we will necessarily
proceed in the following way:
1) Assign a utility u(c) for all c in C.
2) Assign a probability P(E) for all E in E.
3) Select the (optimal) option that maximizes the expected utility.
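q As a small illustration, the whole procedure fits in a few lines of Python.
This is a sketch, not part of the notes; the probabilities attached to
Example 1 below are invented for the demonstration:

# Maximum expected utility: u(d_i) = sum_j u(c_ij) * P(E_ij | d_i).
# An option is a list of (utility, probability) pairs over its events.

def expected_utility(option):
    return sum(u * p for u, p in option)

def best_option(options):
    """Index of the option that maximizes expected utility."""
    return max(range(len(options)), key=lambda i: expected_utility(options[i]))

# Example 1 (surgery vs. therapy), with assumed probabilities P(E | d):
surgery = [(10, 0.4), (20, 0.5), (0, 0.1)]   # survive/tumor, survive/no tumor, dead
therapy = [(2, 0.5), (20, 0.5)]              # tumor, no tumor
print(best_option([surgery, therapy]))       # 0, i.e., surgery (14 vs. 11 years)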
1.3 The learning process and the predictive distribution
Ø The natural reaction of someone who needs to take a decision for which the
consequences depend on the occurrence of uncertain events E, is to try to
reduce the uncertainty by obtaining more information about E.
Ø How do we reduce the uncertainty of an event E?
Obtain additional information (Z) about E.
Ø THE IDEA is to gather information that will help us reduce the uncertainty
of the events, that is, to improve the knowledge we have about E.
Ø Where do we obtain additional information from?
Surveys, previous studies, experiments, etc.
Ø The main problem in statistical inference is to produce a methodology that
allows us to understand and interpret available information with the aim of
improving our initial knowledge.
Ø How do we improve our knowledge about E?
P(E) → P(E | Z)
Use Bayes' Theorem.
o BAYES' THEOREM: Let {Ej, j ∈ J} be a finite partition of Ω, i.e.,
Ej ∩ Ek = ∅ for all j ≠ k and ∪j∈J Ej = Ω. Let Z ≠ ∅ be an event. Then,
P(Ei | Z) = P(Z | Ei) P(Ei) / Σj∈J P(Z | Ej) P(Ej), i = 1, 2, ..., k.
PROOF.
P(Ei | Z) = P(Ei ∩ Z) / P(Z) = P(Z | Ei) P(Ei) / P(Z).
Given that Z = Z ∩ Ω = Z ∩ (∪j∈J Ej) = ∪j∈J (Z ∩ Ej), with
(Z ∩ Ej) ∩ (Z ∩ Ek) = ∅ for all j ≠ k,
⇒ P(Z) = P(∪j∈J (Z ∩ Ej)) = Σj∈J P(Z ∩ Ej) = Σj∈J P(Z | Ej) P(Ej).
Ø Comments:
1) An alternative way of writing Bayes' Theorem is
P(Ei | Z) ∝ P(Z | Ei) P(Ei),
where P(Z) is the proportionality constant.
2) The P(Ej) are called prior (a priori) probabilities and the P(Ej | Z) are called
final (a posteriori) probabilities. Moreover, P(Z | Ej) is called the likelihood,
and P(Z) the marginal probability of the additional information.
Ø Remember that all of these prior and final quantifications of the events
arise because we want to reduce uncertainty in a decision problem.
Assume that for a given problem we have the following:
P(Eij): initial quantification of the events
u(cij): quantification of the consequences
Z: additional information about the events
P(E) → P(E | Z) (Bayes' Theorem)
In this case we have two situations:
1) Initial situation (a priori), with initial expected utility:
P(Eij), u(cij), Σj u(cij) P(Eij)
2) Final situation (a posteriori), with final expected utility:
P(Eij | Z), u(cij), Σj u(cij) P(Eij | Z)
Ø What would happen if in some way we manage to obtain yet more
information about E? Assume that we first have access to Z1 (additional
information about E) and later we obtain Z2 (more information about E).
There exist two ways of updating all the available information about E:
1) Sequential updating:
P(E) → P(E | Z1) → P(E | Z1, Z2)
The steps are:
Step 1: P(E | Z1) = P(Z1 | E) P(E) / P(Z1),
Step 2: P(E | Z1, Z2) = P(Z2 | Z1, E) P(E | Z1) / P(Z2 | Z1).
2) Simultaneous updating:
P(E) → P(E | Z1, Z2)
How do we do it?
Single step: P(E | Z1, Z2) = P(Z1, Z2 | E) P(E) / P(Z1, Z2).
q These two ways of updating the information are equivalent:
P(E | Z1, Z2) = P(Z2 | Z1, E) P(E | Z1) / P(Z2 | Z1)
  = [P(Z1, Z2, E) / P(Z1, E)] [P(Z1, E) / P(Z1)] / [P(Z1, Z2) / P(Z1)]
  = P(Z1, Z2, E) / P(Z1, Z2)
  = P(Z1, Z2 | E) P(E) / P(Z1, Z2).
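q A quick numerical check of this equivalence (a sketch in Python; the prior
and the Z1 likelihood anticipate Example 5 below, while the Z2 likelihood
and the conditional independence of Z1 and Z2 given E are assumptions):

# Sequential vs. simultaneous Bayesian updating over a discrete partition.

def bayes_update(prior, likelihood):
    """prior: {E: P(E)}, likelihood: {E: P(Z | E)} -> posterior {E: P(E | Z)}."""
    pz = sum(likelihood[e] * prior[e] for e in prior)    # marginal P(Z)
    return {e: likelihood[e] * prior[e] / pz for e in prior}

prior  = {"E1": 0.6, "E2": 0.3, "E3": 0.1}
lik_z1 = {"E1": 0.2, "E2": 0.6, "E3": 0.6}               # P(Z1 | E)
lik_z2 = {"E1": 0.5, "E2": 0.1, "E3": 0.4}               # P(Z2 | E), assumed

# Sequential: P(E) -> P(E | Z1) -> P(E | Z1, Z2)
sequential = bayes_update(bayes_update(prior, lik_z1), lik_z2)

# Simultaneous: P(Z1, Z2 | E) = P(Z1 | E) P(Z2 | E) under conditional independence
simultaneous = bayes_update(prior, {e: lik_z1[e] * lik_z2[e] for e in prior})

print(sequential)      # both give P(E1 | Z1, Z2) ≈ 0.588, P(E2 | ·) ≈ 0.176, ...
print(simultaneous)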
Ø EXAMPLE 5: A patient goes to a physician with a certain disease. Suppose
that the patient’s disease falls in one of the following three categories:
E1 = Frequent disease (cold)
E2 = Relatively frequent disease (flu)
E3 = Not frequent disease (pneumonia)
The physician knows through expertise that
P(E1)=0.6, P(E2)=0.3, P(E3)=0.1 (prior probabilities)
The physician examines the patient and obtains additional information (Z =
symptoms) about the possible disease of the patient. According to the
symptoms, the physician determines that
P(Z | E1)=0.2, P(Z | E2)=0.6, P(Z | E3)=0.6 (likelihood)
What is the most probable disease for this patient?
Using Bayes' Theorem we obtain:
P(Z) = Σj=1..3 P(Z | Ej) P(Ej) = (0.2)(0.6) + (0.6)(0.3) + (0.6)(0.1) = 0.36
P(E1 | Z) = (0.2)(0.6)/0.36 = 1/3
P(E2 | Z) = (0.6)(0.3)/0.36 = 1/2
P(E3 | Z) = (0.6)(0.1)/0.36 = 1/6
Therefore, it is most probable that the patient has a relatively frequent
disease (E2).
Inference problems.
Ø PARAMETRIC INFERENCE. Let F = {f(x | θ), θ ∈ Θ} be a parametric family
indexed by the parameter θ ∈ Θ. Let X1, ..., Xn be a random sample (r.s.) of
observations from f(x | θ) ∈ F. The inference problem consists of estimating
the true value of the parameter θ.
q A statistical inference problem can be seen as a decision problem with the
following elements:
D = space of decisions, according to the specific problem
E = Θ (parameter space)
C = {(d, θ) : d ∈ D, θ ∈ Θ}
≤ : will be represented by a utility function or a loss function.
Ø The sample gives additional information about the uncertain events θ ∈ Θ.
The problem consists of how to update the information.
Ø As we saw in the coherence axioms, the decision maker is capable of
quantifying his or her knowledge about the uncertain events through a
probability function. We then define:
f(θ): the prior distribution (a priori). It quantifies the initial
knowledge about θ.
f(x | θ): the process that generates the sample information. It gives additional
information about θ.
f(x | θ): the likelihood function. It contains all the information about θ given
by the sample X = (X1, ..., Xn).
q All this information about θ is combined to obtain a final (a posteriori)
knowledge after having observed the sample. The way to do it is by means
of Bayes' Theorem:
f(θ | x) = f(x | θ) f(θ) / f(x),
where f(x) = ∫Θ f(x | θ) f(θ) dθ or f(x) = Σθ f(x | θ) f(θ).
As f(θ | x) is a function of θ, we can write
f(θ | x) ∝ f(x | θ) f(θ)
Finally,
f(θ | x) is the posterior distribution (a posteriori). It summarizes all the
available knowledge about θ (prior + sample).
Ø REMARK: Since we are uncertain about the true value of θ, θ is treated as a
random quantity, and the density function that generates information
relevant to θ is actually a conditional density.
o Definition: A random sample (r.s.) of size n from a population
f(x | θ) that depends on θ is a set X1, ..., Xn of random variables conditionally
independent given θ, i.e.,
f(x1, ..., xn | θ) = f(x1 | θ) ⋯ f(xn | θ).
In this case, the likelihood function is the conditional joint density of the
sample, seen as a function of the parameter, i.e.,
f(x | θ) = Πi=1..n f(xi | θ).
Ø PREDICTIVE DISTRIBUTION: The predictive distribution is the marginal
density function f(x); it allows us to determine which values of the
random variable are more probable.
q What we know about X is conditioned on the value of the parameter θ, i.e.,
f(x | θ) (its conditional density). As θ is an unknown quantity, f(x | θ) cannot
be used directly to describe the behavior of the r.v. X.
q Prior predictive distribution. Although the true value of θ is unknown, we
always have some information about θ (through its prior distribution
f(θ)). This information can be combined to yield information about
the values of X. The way to do it is:
f(x) = ∫ f(x | θ) f(θ) dθ or f(x) = Σθ f(x | θ) f(θ)
q Assuming that additional (sample) information X1, X2, ..., Xn from the
density f(x | θ) is available, it is possible to reach a final knowledge about θ
through its posterior distribution f(θ | x).
q Posterior predictive distribution. Suppose we want to obtain information
regarding the possible values of a new random variable XF from the same
population f(x | θ). If XF is (conditionally) independent of the sample
X1, X2, ..., Xn, then
f(xF | x) = ∫ f(xF | θ) f(θ | x) dθ or f(xF | x) = Σθ f(xF | θ) f(θ | x)
Ø EXAMPLE 6: Tossing a coin. Consider a random experiment that consists of
tossing a coin. Let X be the r.v. that takes the value 1 if the coin falls
heads and 0 if tails, i.e., X ∼ Ber(θ). Strictly, we have X | θ ∼ Ber(θ), where θ
is the probability of success (heads):
f(x | θ) = θx (1 − θ)1−x I{0,1}(x).
The prior knowledge we have about the coin is that it could be fair or
unfair (two heads).
P(fair coin) = 0.95 & P(unfair coin) = 0.05
How do we quantify this knowledge about θ?
Fair coin ⇔ θ = 1/2
Unfair coin ⇔ θ = 1
so θ ∈ {1/2, 1}; therefore,
P(θ = 1/2) = 0.95 and P(θ = 1) = 0.05.
That is,
f(θ) = 0.95 if θ = 1/2, and 0.05 if θ = 1.
Suppose that the coin is tossed once and we get a head, i.e., X1 = 1. Then
the likelihood is
P(X1 = 1 | θ) = θ1 (1 − θ)0 = θ.
Combining the prior knowledge with the likelihood we get
P(X1 = 1) = P(X1 = 1 | θ = 1/2) P(θ = 1/2) + P(X1 = 1 | θ = 1) P(θ = 1)
          = (0.5)(0.95) + (1)(0.05) = 0.525,
P(θ = 1/2 | X1 = 1) = P(X1 = 1 | θ = 1/2) P(θ = 1/2) / P(X1 = 1)
                    = (0.5)(0.95) / 0.525 = 0.9048,
P(θ = 1 | X1 = 1) = P(X1 = 1 | θ = 1) P(θ = 1) / P(X1 = 1)
                  = (1)(0.05) / 0.525 = 0.0952;
in other words,
f(θ | x1 = 1) = 0.9048 if θ = 1/2, and 0.0952 if θ = 1.
Now, the prior predictive distribution is
P(X = 1) = P(X = 1 | θ = 1/2) P(θ = 1/2) + P(X = 1 | θ = 1) P(θ = 1)
         = (0.5)(0.95) + (1)(0.05) = 0.525,
P(X = 0) = P(X = 0 | θ = 1/2) P(θ = 1/2) + P(X = 0 | θ = 1) P(θ = 1)
         = (0.5)(0.95) + (0)(0.05) = 0.475;
that is,
f(x) = 0.525 if x = 1, and 0.475 if x = 0.
And the posterior predictive distribution is
P(XF = 1 | x1 = 1) = P(XF = 1 | θ = 1/2) P(θ = 1/2 | x1 = 1)
                   + P(XF = 1 | θ = 1) P(θ = 1 | x1 = 1)
                   = (0.5)(0.9048) + (1)(0.0952) = 0.5476,
P(XF = 0 | x1 = 1) = (0.5)(0.9048) + (0)(0.0952) = 0.4524;
that is,
f(xF | x1 = 1) = 0.548 if xF = 1, and 0.452 if xF = 0.
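q The two-point prior makes Example 6 easy to reproduce by machine; a
minimal sketch in Python of the computations above:

# Example 6: theta in {1/2, 1}, prior P(theta = 1/2) = 0.95, P(theta = 1) = 0.05.

prior = {0.5: 0.95, 1.0: 0.05}

def bernoulli(x, theta):
    """f(x | theta) for x in {0, 1}."""
    return theta if x == 1 else 1 - theta

def posterior(prior, x):
    """P(theta | x) via Bayes' Theorem."""
    marginal = sum(bernoulli(x, t) * p for t, p in prior.items())
    return {t: bernoulli(x, t) * p / marginal for t, p in prior.items()}

def predictive(dist, x):
    """f(x) = sum over theta of f(x | theta) * dist(theta)."""
    return sum(bernoulli(x, t) * p for t, p in dist.items())

post = posterior(prior, 1)
print(post)                   # {0.5: 0.9048, 1.0: 0.0952}
print(predictive(prior, 1))   # prior predictive P(X = 1) = 0.525
print(predictive(post, 1))    # posterior predictive P(XF = 1 | x1 = 1) ≈ 0.5476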
Ø EXAMPLE 7: Amount of tyrosine. The consequences of a certain treatment can
be determined by the amount of tyrosine (θ) in the urine. The prior
information about this quantity in patients shows that it is around
39 mg/24 hrs. and that the percentage of times this quantity exceeds
49 mg/24 hrs. is 25%.
According to this information, “it can be argued” that the normal
distribution models this behavior “reasonably well,” so
θ ∼ N(µ, τ2), where µ = E(θ) is the mean and τ2 = Var(θ) is the variance.
How? Since θ is centered around 39, take µ = 39; then
P(θ > 49) = P(Z > (49 − 39)/τ) = 0.25 ⇒ z0.25 = (49 − 39)/τ,
and as z0.25 = 0.675 (from tables), τ = 10/0.675 = 14.81.
Therefore, θ ∼ N(39, 219.47).
In order to assess the conditions of a patient, the amount of tyrosine will be
measured. Due to measurement errors, the measured value will not be, in
general, the true value, but a random variable with normal distribution
centered at θ and with a standard deviation of σ=2 (that depends on the
precision of the instrument).
X|θ ∼ N(θ, 4) & θ ∼ N(39, 219.47)
It can be proven that the prior predictive distribution is of the form
X ∼ N(39, 223.47).
What can be said with this predictive distribution?
P(X > 60) = P(Z > (60 − 39)/√223.47) = P(Z > 1.4047) = 0.0808,
which means that it is very unlikely for a patient to show a measurement
larger than 60 mg/24 hrs.
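q This tail probability is easy to verify numerically; a sketch assuming scipy
is available (the small difference from 0.0808 comes from rounding τ in
the tables):

from math import sqrt
from scipy.stats import norm

# Prior predictive X ~ N(39, 223.47), with Var(X) = sigma^2 + tau^2 = 4 + 219.47.
print(norm.sf(60, loc=39, scale=sqrt(223.47)))   # ~ 0.080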
With the objective of improving the prior information, 3 measurements
were obtained from the same patient: x1 = 40.62, x2 = 41.8, and x3 = 40.44.
It can be proven that if
X | θ ∼ N(θ, σ2) and θ ∼ N(θ0, σ02), then θ | x ∼ N(θ1, σ12),
where
θ1 = (n x̄/σ2 + θ0/σ02) / (n/σ2 + 1/σ02) and σ12 = 1 / (n/σ2 + 1/σ02).
Continuing with the example:
x̄ = 40.9533, θ0 = 39, σ2 = 4, σ02 = 219.47, n = 3
θ1 = 40.9415, σ12 = 1.3252 ∴ θ | x ∼ N(40.9415, 1.3252)
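q A sketch of this normal–normal update in Python (standard library only),
reproducing the numbers above:

# X | theta ~ N(theta, sigma2), theta ~ N(theta0, sigma02) => theta | x ~ N(theta1, sigma12).

def normal_posterior(xbar, n, sigma2, theta0, sigma02):
    """Posterior mean and variance of theta, given a sample mean xbar of size n."""
    precision = n / sigma2 + 1 / sigma02               # posterior precision
    theta1 = (n * xbar / sigma2 + theta0 / sigma02) / precision
    return theta1, 1 / precision

x = [40.62, 41.8, 40.44]
xbar = sum(x) / len(x)                                 # 40.9533
theta1, sigma12 = normal_posterior(xbar, len(x), 4.0, 39.0, (10 / 0.675) ** 2)
print(theta1, sigma12)                                 # ≈ 40.9415, 1.3253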
1.4 Informative, noninformative and conjugate prior distributions.
Ø There exist several classes of prior distributions. In terms of the amount of
information they carry, we classify them as informative and
noninformative.
Ø INFORMATIVE PRIOR DISTRIBUTIONS: These are prior distributions that
contain relevant information about the occurrence of the uncertain events
θ.
Ø EXAMPLE 8: The prior distribution for θ in Example 7 is an example of an
informative prior. In the same context as Example 7, suppose now that
there exist only 3 possible values (categories) for the amount of tyrosine:
θ1 = low, θ2 = medium, and θ3 = high. Assume further that θ2 is three
times as frequent as θ1 and that θ3 is twice as frequent as θ1. We can
then specify the prior distribution for the amount of tyrosine by
letting pi = P(θi), i = 1, 2, 3. Then
p2 = 3p1 and p3 = 2p1. Moreover, p1 + p2 + p3 = 1
⇒ p1 + 3p1 + 2p1 = 1 ⇔ 6p1 = 1 ∴ p1 = 1/6, p2 = 1/2 and p3 = 1/3.
Ø NONINFORMATIVE PRIOR DISTRIBUTIONS: These are prior distributions that
do not give us any relevant information about the occurrence of the
uncertain events θ.
Ø There are several criteria to define a noninformative prior:
1) Principle of insufficient reason: Bayes (1763) and Laplace (1814,
1952). According to this principle, in the absence of evidence to the contrary,
all possibilities should have the same prior probability.
o In particular, if θ can take a finite number of values, say m, the
noninformative prior is
f(θ) = (1/m) I{θ1, θ2, ..., θm}(θ).
o What would happen if the number of possible values (m) that θ can take
goes to infinity?
f(θ) ∝ constant.
In this case it is said that f(θ) is an improper prior distribution, because
it does not satisfy all the properties of a proper density.
2) Invariant prior distributions: Jeffreys (1946) proposed a noninformative
prior distribution that is invariant under re-parameterizations. That is, if
πθ(θ) is the noninformative prior for θ, then πϕ(ϕ) = πθ(θ(ϕ)) |J(ϕ)| should
be the noninformative prior for ϕ = ϕ(θ). This prior is generally improper.
o Jeffreys' rule is as follows: Let F = {f(x | θ) : θ ∈ Θ}, Θ ⊂ ℝd, be a
parametric model for X. Jeffreys' noninformative prior for θ with
respect to the model F is
π(θ) ∝ |det I(θ)|1/2, θ ∈ Θ,
where I(θ) = −EX|θ[∂2 log f(X | θ) / ∂θ ∂θ'] is Fisher's information matrix.
o EXAMPLE 9: Let X be a r.v. whose conditional distribution given θ is
Ber(θ), i.e., f(x | θ) = θx (1 − θ)1−x I{0,1}(x), θ ∈ (0,1).
log f(x | θ) = x log(θ) + (1 − x) log(1 − θ) + log I{0,1}(x)
∂ log f(x | θ)/∂θ = x/θ − (1 − x)/(1 − θ)
∂2 log f(x | θ)/∂θ2 = −x/θ2 − (1 − x)/(1 − θ)2
I(θ) = −EX|θ[−X/θ2 − (1 − X)/(1 − θ)2] = E(X)/θ2 + (1 − E(X))/(1 − θ)2
     = 1/θ + 1/(1 − θ) = 1/(θ(1 − θ))
π(θ) ∝ θ−1/2 (1 − θ)−1/2 I(0,1)(θ)
∴ π(θ) = Beta(θ | 1/2, 1/2).
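o The Fisher-information computation in Example 9 can be reproduced
symbolically; a sketch using sympy (assuming it is installed):

import sympy as sp

theta, x = sp.symbols('theta x', positive=True)

# log f(x | theta) for the Bernoulli model, x in {0, 1}
log_f = x * sp.log(theta) + (1 - x) * sp.log(1 - theta)
d2 = sp.diff(log_f, theta, 2)            # second derivative in theta

# I(theta) = -E[d2] over x, using P(X = 1) = theta, P(X = 0) = 1 - theta
fisher = sp.simplify(-(d2.subs(x, 1) * theta + d2.subs(x, 0) * (1 - theta)))
print(fisher)                            # equivalent to 1/(theta*(1 - theta))
print(sp.sqrt(fisher))                   # Jeffreys prior ∝ sqrt(I(theta))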
3) Reference criterion: Bernardo (1986) proposed a methodology for
obtaining prior distributions, called minimum informative or reference
priors, based on the idea that the data contain all the relevant information in
an inference problem.
o The reference prior is the distribution that maximizes the expected
distance between the prior and the posterior distributions as the
sample size goes to infinity.
o Examples of reference priors are given in the list of formulas.
Ø CONJUGATE PRIORS: Conjugate priors arose from trying to quantify the
prior knowledge in such a way that the posterior distribution is easy to
obtain analytically. Due to technological developments, this
justification is no longer critical.
o Definition: Conjugate family. A family of distributions for θ is said to
be conjugate with respect to a certain probabilistic model f(x|θ) if for
any prior distribution belonging to such a family, the posterior
distribution also belongs to the same family.
o EXAMPLE 10: Let X1, X2, ..., Xn be a r.s. from Ber(θ) and let θ ∼ Beta(a, b) be
the prior distribution for θ. Then,
f(x | θ) = θΣxi (1 − θ)n−Σxi Πi=1..n I{0,1}(xi)
f(θ) = [Γ(a + b) / (Γ(a) Γ(b))] θa−1 (1 − θ)b−1 I(0,1)(θ)
⇒ f(θ | x) ∝ θa+Σxi−1 (1 − θ)b+n−Σxi−1 I(0,1)(θ)
∴ f(θ | x) = [Γ(a1 + b1) / (Γ(a1) Γ(b1))] θa1−1 (1 − θ)b1−1 I(0,1)(θ),
where a1 = a + Σxi and b1 = b + n − Σxi. That is, θ | x ∼ Beta(a1, b1).
o More examples of conjugate families can be found in the list of
formulas.
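o The conjugate update itself is a one-liner; a minimal sketch in Python (the
prior parameters and data are assumptions for illustration):

# Beta-Bernoulli: Beta(a, b) prior + s successes in n trials -> Beta(a + s, b + n - s).

def beta_update(a, b, data):
    s = sum(data)
    return a + s, b + len(data) - s

print(beta_update(2.0, 2.0, [1, 0, 1, 1, 0, 1]))   # (6.0, 4.0), i.e., Beta(6, 4)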
1.5 Parametric inference problems
Ø The typical parametric inference problems are: point estimation, interval
estimation, and hypothesis testing.
Ø POINT ESTIMATION. The point estimation problem, as a decision problem, is
described as follows:
o D = E = Θ.
o v(θ̃, θ): the loss incurred when estimating θ with θ̃. Consider three loss
functions:
1) Squared loss function:
v(θ̃, θ) = a(θ̃ − θ)2, where a > 0.
In this case, the optimal decision, the one that minimizes the expected loss, is
θ̃ = E(θ): the best estimator of θ under squared loss is the mean of the
(available) distribution of θ.
2) Absolute loss function:
v(θ̃, θ) = a|θ̃ − θ|, where a > 0.
In this case, the optimal decision that minimizes the expected loss is
θ̃ = Med(θ): the best estimator of θ under absolute loss is the median of the
(available) distribution of θ.
3) Neighborhood loss function:
v(θ̃, θ) = 1 − IBε(θ̃)(θ),
where Bε(θ̃) denotes the neighborhood of radius ε centered at θ̃.
In this case, the optimal decision that minimizes the expected loss as ε → 0 is
θ̃ = Mode(θ): the best estimator of θ under neighborhood loss is the mode of
the (available) distribution of θ.
Ø EXAMPLE 11: Let X1, X2, ..., Xn be a r.s. from the population Ber(θ). Assume
that the available prior information can be described with a Beta
distribution, i.e., θ ∼ Beta(a, b). As shown in the previous example, the
posterior distribution is again a Beta distribution, i.e.,
θ | x ∼ Beta(a + Σi=1..n Xi, b + n − Σi=1..n Xi).
The idea is to produce a point estimate of θ:
1) If we use a squared loss:
θ̃ = E(θ | x) = (a + Σxi) / (a + b + n);
2) If we use a neighborhood loss:
θ̃ = Mode(θ | x) = (a + Σxi − 1) / (a + b + n − 2).
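Ø The three optimal point estimates for this Beta posterior can be computed
directly; a sketch using scipy for the median (the prior a = b = 2 and the
data, 4 successes in 6 trials, are assumptions):

from scipy.stats import beta

a, b, n, s = 2.0, 2.0, 6, 4            # assumed prior parameters and data
a1, b1 = a + s, b + n - s              # posterior Beta(a1, b1)

mean = a1 / (a1 + b1)                  # optimal under squared loss
median = beta.ppf(0.5, a1, b1)         # optimal under absolute loss
mode = (a1 - 1) / (a1 + b1 - 2)        # optimal under neighborhood loss (eps -> 0)
print(mean, median, mode)              # 0.6, ≈ 0.607, 0.625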
Ø INTERVAL ESTIMATION. The problem of interval estimation can be
described as a decision problem in the following way:
o D = {D : D ⊂ Θ},
where D is a probability interval of size (1 − α) if ∫D f(θ) dθ = 1 − α.
Remark: for fixed α ∈ (0,1), the probability interval is not unique.
o E = Θ.
o v(D, θ) = ‖D‖ − ID(θ), the loss of estimating θ with D, where ‖D‖ denotes
the length of D.
This loss function represents the idea that, for a fixed α, a smaller
interval is preferred. Therefore,
q the interval D* of smallest length satisfies the property of being
a highest density interval, that is,
if θ1 ∈ D* and θ2 ∉ D* ⇒ f(θ1) ≥ f(θ2).
q How do we produce a highest density interval?
Follow these steps:
o Locate the highest point of the density of θ.
o From this point, trace descending horizontal lines until the region where
the density lies above the line accumulates a probability of (1 − α).
The best interval estimate of θ is the interval D* with the smallest length.
[Figure: Gamma(shape = 2, scale = 1) density over x ∈ [0, 10] (density from 0
to 0.4), with a (1 − α) highest density interval marked by horizontal cuts.]
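q A numerical sketch of this construction for the Gamma(2, 1) density of the
figure (scipy assumed; the 95% level is an assumption for illustration):

from scipy.stats import gamma
from scipy.optimize import brentq

dist = gamma(a=2, scale=1)             # Gamma(shape = 2, scale = 1), mode at x = 1
alpha = 0.05

def mass_above(height):
    """Probability mass of {x : f(x) >= height} and the interval itself."""
    lo = brentq(lambda x: dist.pdf(x) - height, 1e-9, 1.0)   # crossing left of the mode
    hi = brentq(lambda x: dist.pdf(x) - height, 1.0, 50.0)   # crossing right of the mode
    return dist.cdf(hi) - dist.cdf(lo), (lo, hi)

# Lower the horizontal line until the enclosed mass is 1 - alpha.
height = brentq(lambda h: mass_above(h)[0] - (1 - alpha), 1e-6, dist.pdf(1) - 1e-6)
print(mass_above(height)[1])           # ≈ (0.04, 4.78), the 95% highest density interval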
Ø HYPOTHESIS TESTING. The problem of hypothesis testing is a simple
decision problem and consists of selecting between two models or
hypotheses H0 and H1. In this case,
o D = E = {H0, H1}
o v(d, θ): the loss function, of the form
                Truth
Choice      H0      H1
H0          v00     v01
H1          v10     v11
where v00 and v11 are the losses when making a correct decision (generally
v00 = v11 = 0),
v10 is the loss of rejecting H0 (accepting H1) when H0 is correct, and
v01 is the loss of not rejecting H0 (accepting H0) when H0 is false.
Let p0 = P(H0) be the probability associated with the hypothesis H0 at the
moment of making the decision (prior or posterior). Then, the expected loss
of each choice is:
E(v | H0) = v00 p0 + v01 (1 − p0) = v01 − (v01 − v00) p0
E(v | H1) = v10 p0 + v11 (1 − p0) = v11 − (v11 − v10) p0
The graphical representation is given below,
where p* = (v01 − v11) / (v10 − v11 + v01 − v00).
Finally, the optimal solution is the one that minimizes the expected loss:
if E(v | H0) < E(v | H1) ⇔ p0/(1 − p0) > (v01 − v11)/(v10 − v00) ⇔ p0 > p* ⇒ H0:
choose H0 if p0 is large enough compared with 1 − p0;
if E(v | H0) > E(v | H1) ⇔ p0/(1 − p0) < (v01 − v11)/(v10 − v00) ⇔ p0 < p* ⇒ H1:
choose H1 if p0 is small enough compared with 1 − p0;
if p0 = p* ⇒ H0 or H1:
indifference between H0 and H1 if p0 is neither large enough nor small
enough compared with 1 − p0.
[Plot: E(v | H0) and E(v | H1) as straight lines in p0 ∈ [0, 1]; E(v | H0) goes from
v01 at p0 = 0 to v00 at p0 = 1, and E(v | H1) from v11 to v10; the lines cross at
p0 = p*, with H1 optimal for p0 < p* and H0 optimal for p0 > p*.]
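q A sketch of this decision rule in Python (the losses are illustrative
assumptions, not from the notes):

# Bayesian hypothesis test: choose the hypothesis with the smaller expected loss.

def test(p0, v00=0.0, v11=0.0, v10=1.0, v01=5.0):
    """p0 = P(H0); vij = loss of choosing Hi when Hj is true."""
    p_star = (v01 - v11) / (v10 - v11 + v01 - v00)   # indifference point
    loss_h0 = v00 * p0 + v01 * (1 - p0)              # E(v | choose H0)
    loss_h1 = v10 * p0 + v11 * (1 - p0)              # E(v | choose H1)
    return ("H0" if p0 > p_star else "H1"), p_star, loss_h0, loss_h1

print(test(0.3))   # ('H1', 0.833..., 3.5, 0.3): p0 < p*, so H1 is chosen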