STAT 422 & GS01 0013
BAYESIAN DATA ANALYSIS
INSTRUCTORS:
GARY ROSNER ([email protected])
LUIS NIETO-BARAJAS ([email protected])
1. Decision Theory
1.1 Foundations and axioms of coherence
Ø The OBJECTIVE of Statistics, and in particular of Bayesian Statistics, is to
provide a methodology to adequately analyze the available information
(data analysis) and to decide in a reasonable way how best to proceed
(decision theory).
Ø DIAGRAM of Statistics:
[Diagram: Population and Sample connected by two arrows, sampling (population → sample) and inference (sample → population); data analysis feeds into decision making.]
Ø Types of INFERENCE:
                   Classic   Bayesian
   Parametric      √√√       √√
   Nonparametric   √√        √
Ø Statistics is based on PROBABILITY THEORY. Formally, probability is a
function that satisfies certain conditions (axioms of probability), but in
general it can be understood as a measure or quantification of uncertainty.
Ø Although there is only one mathematical definition of probability, there are
several interpretations of it: classical, frequentist and subjective.
BAYESIAN THEORY is based on the subjective interpretation of probability
and has its roots in Bayes' Theorem, due to the Reverend Thomas Bayes
(1702-1761).
Ø Statistical inference is a way of making decisions. Classical methods of
inference ignore important aspects of the decision-making process,
whereas Bayesian methods of inference do take them into account.
Ø What is a decision problem? We face a decision problem when we have to
select from two or more ways of proceeding.
Ø MAKING DECISIONS is a fundamental aspect of the life of a professional
person. For instance, an administrator must constantly make decisions in an
environment with uncertainty: decisions about the best project to undertake,
the opportunity of investing some money, etc.
Ø DECISION THEORY proposes a method of making decisions based on some
basic principles about the coherent choice among alternative options.
Ø ELEMENTS OF A DECISION PROBLEM under uncertainty:
A decision problem is defined by the quadruplet (D, E, C, ≤), where:
q D : Space of decisions. The set of possible alternatives; it has to be exhaustive
(it contains all possibilities) and exclusive (choosing one element of D
excludes the choice of any other).
D = {d1, d2, ..., dk}.
q E : Space of uncertain events. Contains uncertain events relevant to the
decision problem.
Ei = {Ei1, Ei2, ..., Eimi}, i = 1, 2, ..., k.
q C : Space of consequences. The set of possible consequences; it describes the
consequence of choosing each decision.
C = {c1, c2, ..., ck}.
q ≤ : Preference relation among the different options. It is defined in such a way
that d1 ≤ d2 if d2 is preferred over d1.
• REMARK: For the moment we will consider discrete spaces (of decisions,
events and consequences), although the theory also applies to continuous
spaces.
Ø DECISION TREE (under uncertainty):
Ø EXAMPLE 1: A physician needs to decide whether to carry out surgery on a
person he believes has a tumor or to treat with chemotherapy. If the patient
does not have a tumor, the life expectancy is 20 years. If he has a tumor,
undergoes surgery, and survives, he is given 10 years of life; whereas if he
has a tumor and does not undergo surgery, he is only given 2 years of life.
[Decision tree: a decision node branches into the options d1, ..., di, ..., dk; each
di leads to an uncertainty (random) node with branches Ei1, Ei2, ..., Eimi,
ending in the consequences ci1, ci2, ..., cimi. There is no full information
about the consequences of making a decision.]
D = {d1, d2}, where d1 = surgery, d2 = therapy
E = {E11, E12, E13, E21, E22}, where E11 = survival | tumor, E12 = survival |
no tumor, E13 = dead, E21 = tumor, E22 = no tumor
C = {c11, c12, c13, c21, c22}, where c11 = 10, c12 = 20, c13 = 0, c21 = 2, c22 = 20
(years of life)
Ø In practice, most decision problems have a much more complex structure.
For instance, one may have to decide whether or not to carry out an
experiment and, if one does the experiment, make another decision
according to its result (sequential decision problems).
[Decision tree for Example 1: Surgery → Survive → (Tumor: 10 yrs., No tumor:
20 yrs.) and Surgery → Dead: 0 yrs.; Therapy → (Tumor: 2 yrs., No tumor: 20 yrs.).]
Ø Frequently, the set of uncertain events is the same for all decisions, that is,
Ei = {Ei1, Ei2, ..., Eimi} = {E1, E2, ..., Em} = E for all i. In this case, the
problem can be represented as:
        E1    ...   Ej    ...   Em
d1      c11   ...   c1j   ...   c1m
⋮       ⋮           ⋮           ⋮
di      ci1   ...   cij   ...   cim
⋮       ⋮           ⋮           ⋮
dk      ck1   ...   ckj   ...   ckm
Ø The OBJECTIVE of a decision problem under uncertainty is then to choose the
best decision di from the set D without knowing which of the events Eij
of Ei will occur.
Ø Although the events that form each Ei are uncertain, in the sense that we do
not know which of them will occur, in general we have an idea of the
probability of each of them. For instance, for a 25-year-old person, which is
more probable: to live 10 more years, to die in 1 month, or to reach 90 years?
Ø Sometimes it is difficult to order our preferences among all the possible
different consequences. It might be simpler to assign a utility measure to
each of the consequences and then order them according to their utility.
Ø QUANTIFICATION of uncertain events and of consequences.
q The information that the decision maker has about the possible occurrence
of the events can be quantified through a probability function on the space
E.
q In the same way, it is possible to quantify the preferences of the decision
maker among the different consequences through a utility function, in such a
way that cij ≤ ci'j' ⇔ u(cij) ≤ u(ci'j').
Ø Alternatively, it is possible to represent the decision tree as follows:
[Decision tree with quantified branches: each decision di leads to the events
Ei1, Ei2, ..., Eimi with probabilities P(Ei1|di), P(Ei2|di), ..., P(Eimi|di), ending in
the utilities u(ci1), u(ci2), ..., u(cimi).]
[Example of consequences: earn much money & have little available time;
earn little money & have much available time; earn regular money & have
regular available time.]
Ø How to make the best decision?
If in some way we were able to make the uncertainty disappear, we could
order our preferences according to the utility of each decision. Then the
best decision would be the one that has the maximum utility.
Ø STRATEGIES: In principle, we will study four strategies or criteria proposed in
the literature for making decisions.
1) Optimistic: Assume that what will occur is the best consequence of each
option.
2) Pessimistic (or minimax): Assume that what will occur is the worst
consequence of each option.
3) Conditional or most probable: Assume that what will occur is the most
probable consequence.
4) Expected utility: Assume that what will occur is the average
consequence of each option.
q Whichever strategy one adopts, the best option is the one that maximizes the
utility in the resulting tree “without uncertainty.”
Ø EXAMPLE 2: In a parliamentary election in the UK, there were two parties
competing: Conservative and Labor. A gambling house offered the
following options:
a) To someone who bets in favor of the Conservative party, the house was
willing to pay $7 for each bet of $4 if the election favors the
Conservatives; otherwise, the gambler will lose the bet.
b) To someone who bets in favor of the Labor party, the house was willing
to pay $5 for each bet of $4 if the Labor party wins; otherwise, the
gambler will lose his money.
o D = {d1, d2},
where d1 = bet in favor of the Conservative party,
d2 = bet in favor of the Labor party
o E = {E1, E2},
where E1 = the Conservative party wins,
E2 = the Labor party wins
o C = {c11, c12, c21, c22}. If the bet is of $k,
then c11 = −k + (7/4)k = (3/4)k,
c12 = −k,
c21 = −k,
c22 = −k + (5/4)k = (1/4)k.
Which party do I bet on?
Assume that the utility is proportional to the money won, i.e.,
u(cij) = cij, and let π = P(E1) and 1 − π = P(E2).
u(d, E)    E1 (prob. π)    E2 (prob. 1 − π)
d1         (3/4)k          −k
d2         −k              (1/4)k
1) Optimistic: d1 (bet in favor of the Conservatives)
2) Pessimistic: d1 or d2 (either one)
3) Conditional: d1 or d2 (it depends on the value of π)
If π > 1/2, we take E1 as a “sure event” ⇒ d1
If π ≤ 1/2, we take E2 as a “sure event” ⇒ d2
4) Expected utility: d1 or d2 (depending on π)
The expected utilities are:
u(d1) = π(3/4)k + (1 − π)(−k) = ((7/4)π − 1)k
u(d2) = π(−k) + (1 − π)(1/4)k = (1/4 − (5/4)π)k
Then, the best decision would be:
If u(d1) > u(d2) ⇔ π > 5/12 ⇒ d1
If u(d1) < u(d2) ⇔ π < 5/12 ⇒ d2
If u(d1) = u(d2) ⇔ π = 5/12 ⇒ d1 or d2
We can see this graphically if we define the functions
g1(π) = ((7/4)π − 1)k = u(d1), and
g2(π) = (1/4 − (5/4)π)k = u(d2);
then, if k = 1:
[Plot: g1(π) and g2(π) over π ∈ [0, 1], with values from −1 to 1; the lines cross
at π = 5/12, and both are negative for π ∈ (1/5, 4/7).]
The bold line represents the best solution to the decision problem given
by the expected utility criterion.
Remark: If π ∈ (1/5, 4/7), the expected utility of the best decision is
negative!
Question 1: Would you bet if π ∈ (1/5, 4/7)?
Question 2: What value does the house believe π has?
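q The four criteria are easy to script. Below is a minimal sketch in Python for
Example 2 with k = 1; the value of π is an assumption for illustration (any
π > 5/12 makes the expected utility criterion pick d1):

# Four decision criteria for Example 2 (k = 1).
# Rows: d1 = bet Conservative, d2 = bet Labor; columns: events E1, E2.

def strategies(utilities, probs):
    """Return the index of the best option under each of the four criteria."""
    options = range(len(utilities))
    optimistic = max(options, key=lambda i: max(utilities[i]))
    pessimistic = max(options, key=lambda i: min(utilities[i]))
    j = probs.index(max(probs))                # most probable event
    conditional = max(options, key=lambda i: utilities[i][j])
    expected = max(options,
                   key=lambda i: sum(p * u for p, u in zip(probs, utilities[i])))
    return optimistic, pessimistic, conditional, expected

u = [[3/4, -1.0],    # d1: (3/4)k if the Conservatives win, -k otherwise
     [-1.0, 1/4]]    # d2: -k if the Conservatives win, (1/4)k otherwise
pi = 0.45            # assumed P(E1)
print(strategies(u, [pi, 1 - pi]))   # (0, 0, 1, 0): d1, d1 (tie), d2, d1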
Ø INADMISSIBILITY of an option: An option d1 is inadmissible if there exists
another option d2 such that d2 is at least as preferred as d1 no matter what
happens (for every uncertain event) and there is at least one case (uncertain
event) in which d2 is strictly preferred over d1.
Ø AXIOMS OF COHERENCE. These are a series of principles that establish the
conditions for taking coherent decisions and that clarify the possible
ambiguity in the process of making a decision. There are four axioms of
coherence:
1. COMPARABILITY. This axiom establishes that we should at least be able
to express preferences between two different options and, therefore,
between two possible consequences. That is, not all options nor all
consequences are equivalent.
q For all pairs of options d1 and d2 in D, one and only one of the
following conditions is true:
d2 is preferred over d1 ⇔ d1 < d2
d1 is preferred over d2 ⇔ d2 < d1
d1 and d2 are equally preferred ⇔ d1 ∼ d2
2. TRANSITIVITY. This axiom establishes that preferences must be
transitive to avoid contradictions.
q If d1, d2 and d3 are three options and d1<d2 and d2<d3, then it must
happen that d1<d3. Similarly, if d1∼d2 and d2∼d3, then d1∼d3.
3. SUBSTITUTION AND DOMINATION. This axiom establishes that if you
have two situations such that for every result of the first situation there
exists a result preferred in the second situation, then the second situation is
preferred over the first one no matter the result.
q If d1 and d2 are two options and E is an uncertain event and it happens
that d1<d2 when E occurs and d1<d2 when E does not occur, then d1<d2
(no matter the uncertain events). Similarly, if d1∼d2 when E occurs and
d1∼d2 when E does not occur, then d1∼d2.
4. REFERENCE EVENTS. This axiom establishes that to be able to take
reasonable decisions, it is necessary to measure the information and the
preferences of the decision maker in a quantitative form. Thus, there must
be a measure (P) based on reference events.
q The decision maker can imagine a way to generate points z at random in the
two-dimensional unit square in such a way that, for any two regions R1 and
R2 in that square, the event {z ∈ R1} is more plausible than the event
{z ∈ R2} only if the area of R1 is larger than the area of R2.
1.2 Maximum expected utility principle
Ø IMPLICATIONS from the coherence axioms:
q In general, every option di can be written as the set of all possible
consequences given the uncertain events, that is,
di = {cij | Eij, j = 1, ..., mi}.
1) The consequences can be seen as particular cases of options:
c ∼ dc = {c | Ω},
where Ω is the sure event.
From this formulation, we can compare consequences:
c1 ≤ c2 ⇔ {c1 | Ω} ≤ {c2 | Ω}.
Therefore, it is possible to find two consequences c∗ (the worst) and c*
(the best) such that, for any other consequence c, c∗ ≤ c ≤ c*.
2) The uncertain events can also be seen as particular cases of options:
E ∼ dE = {c* | E, c∗ | Ec},
where c∗ and c* are the worst and the best consequences.
In this way, we can also compare the uncertain events:
E ≤ F ⇔ {c* | E, c∗ | Ec} ≤ {c* | F, c∗ | Fc}.
In this case we would say that E is not more plausible than F.
Ø QUANTIFICATION OF THE CONSEQUENCES: The quantification of a
consequence c will be a number u(c) measured on the scale [0,1]. This
quantification will be based on reference events.
o Definition: Utility.
The utility u(c) of a consequence c is the probability q assigned to
the best consequence c* such that the consequence c is equally preferred to
the option
{c* | Rq, c∗ | Rqc}
(this can also be written as {c* | q, c∗ | 1−q}), where Rq is a reference event in
the unit square with area q. From this definition, for every consequence c
there exists an option based on reference events such that
c ∼ {c* | u(c), c∗ | 1−u(c)}.
q Based on coherence axioms 1, 2 and 3, there always exists a number
u(c) ∈ [0,1] that satisfies the condition above, because
c∗ ∼ {c* | R0, c∗ | R0c} ⇒ u(c∗) = 0
c* ∼ {c* | R1, c∗ | R1c} ⇒ u(c*) = 1
Thus, for all c such that c∗ ≤ c ≤ c*, 0 ≤ u(c) ≤ 1.
q EXAMPLE 3: Utility of money. Let us assume that the worst and the best
consequences when playing a game of chance are:
c∗ = $0 (the worst) and c* = $1,000 (the best)
The idea is to determine a utility for every consequence c such
that c∗ ≤ c ≤ c*. Consider the lottery: What do you prefer, to win c surely
(i.e., {c | 1}), or to win c* with probability q and c∗ with probability 1−q
(i.e., {c* | q, c∗ | 1−q})?
If the number of consequences is large or even infinite, the utility function
can be approximated by a model resulting in one of the following forms:
[Figure: utility curves u(c) for c ∈ [0, 1,000]: risk aversion (concave), risk
indifference (linear), risk loving (convex). To avoid falling into a paradox, it is
convenient that the utility function be risk averse.]
q REMARK: In some cases it is more convenient to define the utility function
on a scale different from [0,1], for example, on a time scale, in negative
numbers, in number of products sold, in years of life, etc. It is possible to prove
that a utility function defined on a scale different from [0,1] can be seen as
a linear transformation of the original utility defined on [0,1].
Ø QUANTIFICATION OF THE UNCERTAIN EVENTS: The quantification of the
uncertain events E will also be based on the reference events.
o Definition: Probability.
The probability P(E) of an event E is the area of a region R of the unit
square chosen in such a way that the options {c* | E, c∗ | Ec} and {c* | R, c∗ | Rc}
are equally preferred (equivalent).
o In other words, if dE = {c* | E, c∗ | Ec} and dRq = {c* | Rq, c∗ | Rqc} are such that
dE ∼ dRq, then P(E) = q.
q EXAMPLE 4: Assign a probability to an event E. Suppose that we are facing
the problem of deciding which treatment is best for a patient and that the
worst and the best consequences are:
c∗ = 0 (the worst) and c* = 20 (the best)
Let E = tumor. To determine the probability of E, we consider the
following lotteries: What do you prefer, to win c* with probability q and c∗
with probability 1−q (i.e., {c* | q, c∗ | 1−q}), or to win c* if E occurs and c∗ if
E does not occur (i.e., {c* | E, c∗ | Ec})? P(E) is the value of q at which the
two options become equally preferred.
Finally, this procedure is applied to each of the events, say, E1, E2, ..., Ek. If
the number of events is large or even infinite, the probability function can
be approximated by a model (discrete or continuous) having the following
shape:
[Figure: a continuous model, a density P(θ) over θ ∈ [a, b]; if Eθ = {θ}, then
E = {θ : θ ∈ [a, b]}.]
q REMARK: The probability assigned to an event is always conditional on the
information available at the moment of the assignment; i.e., there are no
absolute probabilities.
Ø DERIVING THE EXPECTED UTILITY:
Up to now we have quantified the consequences and the uncertain events.
Finally we want to assign a number to the options in such a way that the
best option is the one assigned the highest number.
o Theorem: Bayesian decision criterion.
Consider the decision problem defined by D = {d1, d2, ..., dk}, where di =
{cij | Eij, j = 1, ..., mi}, i = 1, ..., k. Let P(Eij | di) be the probability of occurrence
of Eij if option di is selected, and let u(cij) be the utility of the consequence
cij. Then, the quantification of the option di is its expected utility, i.e.,
u(di) = Σj=1..mi u(cij) P(Eij | di).
The optimal decision is d* such that u(d*) = maxi u(di).
PROOF.
di = {cij | Eij, j = 1, ..., mi} = {ci1 | Ei1, ci2 | Ei2, ..., cimi | Eimi}.
Additionally we know that
cij ∼ {c* | Rij, c∗ | Rijc} = {c* | u(cij), c∗ | 1 − u(cij)},
where u(cij) = Area(Rij), and
dE = {c* | E, c∗ | Ec} ∼ {c* | RE, c∗ | REc} = {c* | P(E), c∗ | 1 − P(E)},
where P(E) = Area(RE).
INSTRUCTORS: G. RONER & L. NIETO-BARAJAS
Stat 422 & GS01 0013 Bayesian Data Analysis
20
Then, combining both expressions, we get
di = {(c* | Ri1, c∗ | Ri1c) | Ei1, ..., (c* | Rimi, c∗ | Rimic) | Eimi}
  = {c* | Ri1 ∩ Ei1, c∗ | Ri1c ∩ Ei1, ..., c* | Rimi ∩ Eimi, c∗ | Rimic ∩ Eimi}
  = {c* | (Ri1 ∩ Ei1) ∪ ... ∪ (Rimi ∩ Eimi), c∗ | (Ri1c ∩ Ei1) ∪ ... ∪ (Rimic ∩ Eimi)}.
Finally,
u(di) = Area((Ri1 ∩ Ei1) ∪ ... ∪ (Rimi ∩ Eimi))
     = Σj=1..mi Area(Rij) Area(Eij) = Σj=1..mi u(cij) P(Eij).
Ø IN SUMMARY: If we accept the coherence axioms, we will necessarily
proceed in the following way:
1) Assign a utility u(c) for all c in C.
2) Assign a probability P(E) for all E in E.
3) Select the (optimal) option that maximizes the expected utility.
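q As a small illustration, the whole procedure fits in a few lines of Python.
This is a sketch, not part of the notes; the probabilities attached to
Example 1 below are invented for the demonstration:

# Maximum expected utility: u(d_i) = sum_j u(c_ij) * P(E_ij | d_i).
# An option is a list of (utility, probability) pairs over its events.

def expected_utility(option):
    return sum(u * p for u, p in option)

def best_option(options):
    """Index of the option that maximizes expected utility."""
    return max(range(len(options)), key=lambda i: expected_utility(options[i]))

# Example 1 (surgery vs. therapy), with assumed probabilities P(E | d):
surgery = [(10, 0.4), (20, 0.5), (0, 0.1)]   # survive/tumor, survive/no tumor, dead
therapy = [(2, 0.5), (20, 0.5)]              # tumor, no tumor
print(best_option([surgery, therapy]))       # 0, i.e., surgery (14 vs. 11 years)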
1.3 The learning process and the predictive distribution
Ø The natural reaction of someone who needs to take a decision for which the
consequences depend on the occurrence of uncertain events E, is to try to
reduce the uncertainty by obtaining more information about E.
Ø How do we reduce the uncertainty of an event E?
Obtain additional information (Z) about E.
Ø THE IDEA is to gather information that will help us reduce the uncertainty
of the events, that is, to improve the knowledge we have about E.
Ø Where do we obtain additional information from?
Surveys, previous studies, experiments, etc.
Ø The main problem in statistical inference is to produce a methodology that
allows us to understand and interpret available information with the aim of
improving our initial knowledge.
Ø How do we improve our knowledge about E?
P(E) → P(E | Z)
Use Bayes' Theorem.
o BAYES' THEOREM: Let {Ej, j ∈ J} be a finite partition of Ω, i.e.,
Ej ∩ Ek = ∅ for all j ≠ k and ∪j∈J Ej = Ω. Let Z ≠ ∅ be an event. Then,
P(Ei | Z) = P(Z | Ei) P(Ei) / Σj∈J P(Z | Ej) P(Ej), i = 1, 2, ..., k.
PROOF.
P(Ei | Z) = P(Ei ∩ Z) / P(Z) = P(Z | Ei) P(Ei) / P(Z).
Given that Z = Z ∩ Ω = Z ∩ (∪j∈J Ej) = ∪j∈J (Z ∩ Ej), with
(Z ∩ Ej) ∩ (Z ∩ Ek) = ∅ for all j ≠ k,
⇒ P(Z) = P(∪j∈J (Z ∩ Ej)) = Σj∈J P(Z ∩ Ej) = Σj∈J P(Z | Ej) P(Ej).
Ø Comments:
1) An alternative way of writing Bayes' Theorem is
P(Ei | Z) ∝ P(Z | Ei) P(Ei),
where P(Z) is the proportionality constant.
2) The P(Ej) are called prior (a priori) probabilities and the P(Ej | Z) are called
final (a posteriori) probabilities. Moreover, P(Z | Ej) is called the likelihood,
and P(Z) the marginal probability of the additional information.
Ø Remember that all of these prior and final quantifications of the events
arise because we want to reduce uncertainty in a decision problem.
Assume that for a given problem we have the following:
P(Eij): initial quantification of the events
u(cij): quantification of the consequences
Z: additional information about the events
P(E) → P(E | Z) (Bayes' Theorem)
In this case we have two situations:
1) Initial situation (a priori), with initial expected utility:
P(Eij), u(cij), Σj u(cij) P(Eij)
2) Final situation (a posteriori), with final expected utility:
P(Eij | Z), u(cij), Σj u(cij) P(Eij | Z)
Ø What would happen if in some way we manage to obtain yet more
information about E? Assume that we first have access to Z1 (additional
information about E) and later we obtain Z2 (more information about E).
There exist two ways of updating all the available information about E:
1) Sequential updating:
P(E) → P(E | Z1) → P(E | Z1, Z2)
The steps are:
Step 1: P(E | Z1) = P(Z1 | E) P(E) / P(Z1),
Step 2: P(E | Z1, Z2) = P(Z2 | Z1, E) P(E | Z1) / P(Z2 | Z1).
2) Simultaneous updating:
P(E) → P(E | Z1, Z2)
How do we do it?
Single step: P(E | Z1, Z2) = P(Z1, Z2 | E) P(E) / P(Z1, Z2).
q These two ways of updating the information are equivalent:
P(E | Z1, Z2) = P(Z2 | Z1, E) P(E | Z1) / P(Z2 | Z1)
  = [P(Z1, Z2, E) / P(Z1, E)] [P(Z1, E) / P(Z1)] / [P(Z1, Z2) / P(Z1)]
  = P(Z1, Z2, E) / P(Z1, Z2)
  = P(Z1, Z2 | E) P(E) / P(Z1, Z2).
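q A quick numerical check of this equivalence (a sketch in Python; the prior
and the Z1 likelihood anticipate Example 5 below, while the Z2 likelihood
and the conditional independence of Z1 and Z2 given E are assumptions):

# Sequential vs. simultaneous Bayesian updating over a discrete partition.

def bayes_update(prior, likelihood):
    """prior: {E: P(E)}, likelihood: {E: P(Z | E)} -> posterior {E: P(E | Z)}."""
    pz = sum(likelihood[e] * prior[e] for e in prior)    # marginal P(Z)
    return {e: likelihood[e] * prior[e] / pz for e in prior}

prior  = {"E1": 0.6, "E2": 0.3, "E3": 0.1}
lik_z1 = {"E1": 0.2, "E2": 0.6, "E3": 0.6}               # P(Z1 | E)
lik_z2 = {"E1": 0.5, "E2": 0.1, "E3": 0.4}               # P(Z2 | E), assumed

# Sequential: P(E) -> P(E | Z1) -> P(E | Z1, Z2)
sequential = bayes_update(bayes_update(prior, lik_z1), lik_z2)

# Simultaneous: P(Z1, Z2 | E) = P(Z1 | E) P(Z2 | E) under conditional independence
simultaneous = bayes_update(prior, {e: lik_z1[e] * lik_z2[e] for e in prior})

print(sequential)      # both give P(E1 | Z1, Z2) ≈ 0.588, P(E2 | ·) ≈ 0.176, ...
print(simultaneous)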
Ø EXAMPLE 5: A patient goes to a physician with a certain disease. Suppose
that the patient’s disease falls in one of the following three categories:
E1 = Frequent disease (cold)
E2 = Relatively frequent disease (flu)
E3 = Not frequent disease (pneumonia)
The physician knows through expertise that
P(E1)=0.6, P(E2)=0.3, P(E3)=0.1 (prior probabilities)
The physician examines the patient and obtains additional information (Z =
symptoms) about the possible disease of the patient. According to the
symptoms, the physician determines that
P(Z | E1)=0.2, P(Z | E2)=0.6, P(Z | E3)=0.6 (likelihood)
What is the most probable disease for this patient?
Using Bayes' Theorem we obtain:
P(Z) = Σj=1..3 P(Z | Ej) P(Ej) = (0.2)(0.6) + (0.6)(0.3) + (0.6)(0.1) = 0.36
P(E1 | Z) = (0.2)(0.6)/0.36 = 1/3
P(E2 | Z) = (0.6)(0.3)/0.36 = 1/2
P(E3 | Z) = (0.6)(0.1)/0.36 = 1/6
Therefore, it is most probable that the patient has a relatively frequent
disease (E2).
Inference problems.
Ø PARAMETRIC INFERENCE. Let F = {f(x | θ), θ ∈ Θ} be a parametric family
indexed by the parameter θ ∈ Θ. Let X1, ..., Xn be a random sample (r.s.) of
observations from f(x | θ) ∈ F. The inference problem consists of estimating
the true value of the parameter θ.
q A statistical inference problem can be seen as a decision problem with the
following elements:
D = space of decisions, according to the specific problem
E = Θ (parameter space)
C = {(d, θ) : d ∈ D, θ ∈ Θ}
≤ : will be represented by a utility function or a loss function.
Ø The sample gives additional information about the uncertain events θ ∈ Θ.
The problem consists of how to update the information.
Ø As we saw in the coherence axioms, the decision maker is capable of
quantifying his or her knowledge about the uncertain events through a
probability function. We then define:
f(θ): the prior distribution (a priori). It quantifies the initial
knowledge about θ.
f(x | θ): the process that generates the sample information. It gives additional
information about θ.
f(x | θ): the likelihood function. It contains all the information about θ given
by the sample X = (X1, ..., Xn).
q All this information about θ is combined to obtain a final (a posteriori)
knowledge after having observed the sample. The way to do it is by means
of Bayes' Theorem:
f(θ | x) = f(x | θ) f(θ) / f(x),
where f(x) = ∫Θ f(x | θ) f(θ) dθ or f(x) = Σθ f(x | θ) f(θ).
As f(θ | x) is a function of θ, we can write
f(θ | x) ∝ f(x | θ) f(θ)
Finally,
f(θ | x) is the posterior distribution (a posteriori). It summarizes all the
available knowledge about θ (prior + sample).
Ø REMARK: Since we are uncertain about the true value of θ, θ is treated as a
random quantity, and the density function that generates information
relevant to θ is actually a conditional density.
o Definition: A random sample (r.s.) of size n from a population
f(x | θ) that depends on θ is a set X1, ..., Xn of random variables conditionally
independent given θ, i.e.,
f(x1, ..., xn | θ) = f(x1 | θ) ⋯ f(xn | θ).
In this case, the likelihood function is the conditional joint density of the
sample, seen as a function of the parameter, i.e.,
f(x | θ) = Πi=1..n f(xi | θ).
Ø PREDICTIVE DISTRIBUTION: The predictive distribution is the marginal
density function f(x); it allows us to determine which values of the
random variable are more probable.
q What we know about X is conditioned on the value of the parameter θ, i.e.,
f(x | θ) (its conditional density). As θ is an unknown quantity, f(x | θ) cannot
be used directly to describe the behavior of the r.v. X.
q Prior predictive distribution. Although the true value of θ is unknown, we
always have some information about θ (through its prior distribution
f(θ)). This information can be combined to yield information about
the values of X. The way to do it is:
f(x) = ∫ f(x | θ) f(θ) dθ or f(x) = Σθ f(x | θ) f(θ)
q Assuming that additional (sample) information X1, X2, ..., Xn from the
density f(x | θ) is available, it is possible to reach a final knowledge about θ
through its posterior distribution f(θ | x).
q Posterior predictive distribution. Suppose we want to obtain information
regarding the possible values of a new random variable XF from the same
population f(x | θ). If XF is (conditionally) independent of the sample
X1, X2, ..., Xn, then
f(xF | x) = ∫ f(xF | θ) f(θ | x) dθ or f(xF | x) = Σθ f(xF | θ) f(θ | x)
Ø EXAMPLE 6: Tossing a coin. Consider a random experiment that consists of
tossing a coin. Let X be the r.v. that takes the value 1 if the coin falls
heads and 0 if tails, i.e., X ∼ Ber(θ). Strictly, we have X | θ ∼ Ber(θ), where θ
is the probability of success (heads):
f(x | θ) = θx (1 − θ)1−x I{0,1}(x).
The prior knowledge we have about the coin is that it could be fair or
unfair (two heads).
P(fair coin) = 0.95 & P(unfair coin) = 0.05
How do we quantify this knowledge about θ?
Fair coin ⇔ θ = 1/2
Unfair coin ⇔ θ = 1
so θ ∈ {1/2, 1}; therefore,
P(θ = 1/2) = 0.95 and P(θ = 1) = 0.05.
That is,
f(θ) = 0.95 if θ = 1/2, and 0.05 if θ = 1.
Suppose that the coin is tossed once and we get a head, i.e., X1 = 1. Then
the likelihood is
P(X1 = 1 | θ) = θ1 (1 − θ)0 = θ.
Combining the prior knowledge with the likelihood we get
P(X1 = 1) = P(X1 = 1 | θ = 1/2) P(θ = 1/2) + P(X1 = 1 | θ = 1) P(θ = 1)
          = (0.5)(0.95) + (1)(0.05) = 0.525,
P(θ = 1/2 | X1 = 1) = P(X1 = 1 | θ = 1/2) P(θ = 1/2) / P(X1 = 1)
                    = (0.5)(0.95) / 0.525 = 0.9048,
P(θ = 1 | X1 = 1) = P(X1 = 1 | θ = 1) P(θ = 1) / P(X1 = 1)
                  = (1)(0.05) / 0.525 = 0.0952;
in other words,
f(θ | x1 = 1) = 0.9048 if θ = 1/2, and 0.0952 if θ = 1.
Now, the prior predictive distribution is
P(X = 1) = P(X = 1 | θ = 1/2) P(θ = 1/2) + P(X = 1 | θ = 1) P(θ = 1)
         = (0.5)(0.95) + (1)(0.05) = 0.525,
P(X = 0) = P(X = 0 | θ = 1/2) P(θ = 1/2) + P(X = 0 | θ = 1) P(θ = 1)
         = (0.5)(0.95) + (0)(0.05) = 0.475;
that is,
f(x) = 0.525 if x = 1, and 0.475 if x = 0.
And the posterior predictive distribution is
P(XF = 1 | x1 = 1) = P(XF = 1 | θ = 1/2) P(θ = 1/2 | x1 = 1)
                   + P(XF = 1 | θ = 1) P(θ = 1 | x1 = 1)
                   = (0.5)(0.9048) + (1)(0.0952) = 0.5476,
P(XF = 0 | x1 = 1) = (0.5)(0.9048) + (0)(0.0952) = 0.4524;
that is,
f(xF | x1 = 1) = 0.548 if xF = 1, and 0.452 if xF = 0.
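q The two-point prior makes Example 6 easy to reproduce by machine; a
minimal sketch in Python of the computations above:

# Example 6: theta in {1/2, 1}, prior P(theta = 1/2) = 0.95, P(theta = 1) = 0.05.

prior = {0.5: 0.95, 1.0: 0.05}

def bernoulli(x, theta):
    """f(x | theta) for x in {0, 1}."""
    return theta if x == 1 else 1 - theta

def posterior(prior, x):
    """P(theta | x) via Bayes' Theorem."""
    marginal = sum(bernoulli(x, t) * p for t, p in prior.items())
    return {t: bernoulli(x, t) * p / marginal for t, p in prior.items()}

def predictive(dist, x):
    """f(x) = sum over theta of f(x | theta) * dist(theta)."""
    return sum(bernoulli(x, t) * p for t, p in dist.items())

post = posterior(prior, 1)
print(post)                   # {0.5: 0.9048, 1.0: 0.0952}
print(predictive(prior, 1))   # prior predictive P(X = 1) = 0.525
print(predictive(post, 1))    # posterior predictive P(XF = 1 | x1 = 1) ≈ 0.5476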
Ø EXAMPLE 7: Amount of tyrosine. The consequences of a certain treatment can
be determined by the amount of tyrosine (θ) in the urine. The prior
information about this quantity in patients shows that it is around
39 mg/24 hrs. and that the percentage of times this quantity exceeds
49 mg/24 hrs. is 25%.
According to this information, “it can be argued” that the normal
distribution models this behavior “reasonably well,” so
θ ∼ N(µ, τ2), where µ = E(θ) is the mean and τ2 = Var(θ) is the variance.
How? Since θ is centered around 39, take µ = 39; then
P(θ > 49) = P(Z > (49 − 39)/τ) = 0.25 ⇒ z0.25 = (49 − 39)/τ,
and as z0.25 = 0.675 (from tables), τ = 10/0.675 = 14.81.
Therefore, θ ∼ N(39, 219.47).
In order to assess the conditions of a patient, the amount of tyrosine will be
measured. Due to measurement errors, the measured value will not be, in
general, the true value, but a random variable with normal distribution
centered at θ and with a standard deviation of σ=2 (that depends on the
precision of the instrument).
X|θ ∼ N(θ, 4) & θ ∼ N(39, 219.47)
It can be proven that the prior predictive distribution is of the form
X ∼ N(39, 223.47).
What can be said with this predictive distribution?
P(X > 60) = P(Z > (60 − 39)/√223.47) = P(Z > 1.4047) = 0.0808,
which means that it is very unlikely for a patient to show a measurement
larger than 60 mg/24 hrs.
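q This tail probability is easy to verify numerically; a sketch assuming scipy
is available (the small difference from 0.0808 comes from rounding τ in
the tables):

from math import sqrt
from scipy.stats import norm

# Prior predictive X ~ N(39, 223.47), with Var(X) = sigma^2 + tau^2 = 4 + 219.47.
print(norm.sf(60, loc=39, scale=sqrt(223.47)))   # ~ 0.080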
With the objective of improving the prior information, 3 measurements
were obtained from the same patient: x1 = 40.62, x2 = 41.8, and x3 = 40.44.
It can be proven that if
X | θ ∼ N(θ, σ2) and θ ∼ N(θ0, σ02), then θ | x ∼ N(θ1, σ12),
where
θ1 = (n x̄/σ2 + θ0/σ02) / (n/σ2 + 1/σ02) and σ12 = 1 / (n/σ2 + 1/σ02).
Continuing with the example:
x̄ = 40.9533, θ0 = 39, σ2 = 4, σ02 = 219.47, n = 3
θ1 = 40.9415, σ12 = 1.3252 ∴ θ | x ∼ N(40.9415, 1.3252)
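q A sketch of this normal–normal update in Python (standard library only),
reproducing the numbers above:

# X | theta ~ N(theta, sigma2), theta ~ N(theta0, sigma02) => theta | x ~ N(theta1, sigma12).

def normal_posterior(xbar, n, sigma2, theta0, sigma02):
    """Posterior mean and variance of theta, given a sample mean xbar of size n."""
    precision = n / sigma2 + 1 / sigma02               # posterior precision
    theta1 = (n * xbar / sigma2 + theta0 / sigma02) / precision
    return theta1, 1 / precision

x = [40.62, 41.8, 40.44]
xbar = sum(x) / len(x)                                 # 40.9533
theta1, sigma12 = normal_posterior(xbar, len(x), 4.0, 39.0, (10 / 0.675) ** 2)
print(theta1, sigma12)                                 # ≈ 40.9415, 1.3253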
1.4 Informative, noninformative and conjugate prior distributions.
Ø There exist several classes of prior distributions. In terms of the amount of
information they carry, we classify them as informative and
noninformative.
Ø INFORMATIVE PRIOR DISTRIBUTIONS: These are prior distributions that
contain relevant information about the occurrence of the uncertain events
θ.
Ø EXAMPLE 8: The prior distribution for θ in Example 7 is an example of an
informative prior. In the same context as Example 7, suppose now that
there exist only 3 possible values (categories) for the amount of tyrosine:
θ1 = low, θ2 = medium, and θ3 = high. Assume further that θ2 is three
times as frequent as θ1 and that θ3 is twice as frequent as θ1. We can
then specify the prior distribution for the amount of tyrosine by
letting pi = P(θi), i = 1, 2, 3. Then
p2 = 3p1 and p3 = 2p1. Moreover, p1 + p2 + p3 = 1
⇒ p1 + 3p1 + 2p1 = 1 ⇔ 6p1 = 1 ∴ p1 = 1/6, p2 = 1/2 and p3 = 1/3.
Ø NONINFORMATIVE PRIOR DISTRIBUTIONS: These are prior distributions that
do not give us any relevant information about the occurrence of the
uncertain events θ.
Ø There are several criteria to define a noninformative prior:
1) Principle of insufficient reason: Bayes (1763) and Laplace (1814,
1952). According to this principle, in the absence of evidence to the contrary,
all possibilities should have the same prior probability.
o In particular, if θ can take a finite number of values, say m, the
noninformative prior is
f(θ) = (1/m) I{θ1, θ2, ..., θm}(θ).
o What would happen if the number of possible values (m) that θ can take
goes to infinity?
f(θ) ∝ constant.
In this case it is said that f(θ) is an improper prior distribution, because
it does not satisfy all the properties of a proper density.
2) Invariant prior distributions: Jeffreys (1946) proposed a noninformative
prior distribution that is invariant under re-parameterizations. That is, if
πθ(θ) is the noninformative prior for θ, then πϕ(ϕ) = πθ(θ(ϕ)) |J(ϕ)| should
be the noninformative prior for ϕ = ϕ(θ). This prior is generally improper.
o Jeffreys' rule is as follows: Let F = {f(x | θ) : θ ∈ Θ}, Θ ⊂ ℝd, be a
parametric model for X. Jeffreys' noninformative prior for θ with
respect to the model F is
π(θ) ∝ |det I(θ)|1/2, θ ∈ Θ,
where I(θ) = −EX|θ[∂2 log f(X | θ) / ∂θ ∂θ'] is Fisher's information matrix.
o EXAMPLE 9: Let X be a r.v. whose conditional distribution given θ is
Ber(θ), i.e., f(x | θ) = θx (1 − θ)1−x I{0,1}(x), θ ∈ (0,1).
log f(x | θ) = x log(θ) + (1 − x) log(1 − θ) + log I{0,1}(x)
∂ log f(x | θ)/∂θ = x/θ − (1 − x)/(1 − θ)
∂2 log f(x | θ)/∂θ2 = −x/θ2 − (1 − x)/(1 − θ)2
I(θ) = −EX|θ[−X/θ2 − (1 − X)/(1 − θ)2] = E(X)/θ2 + (1 − E(X))/(1 − θ)2
     = 1/θ + 1/(1 − θ) = 1/(θ(1 − θ))
π(θ) ∝ θ−1/2 (1 − θ)−1/2 I(0,1)(θ)
∴ π(θ) = Beta(θ | 1/2, 1/2).
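o The Fisher-information computation in Example 9 can be reproduced
symbolically; a sketch using sympy (assuming it is installed):

import sympy as sp

theta, x = sp.symbols('theta x', positive=True)

# log f(x | theta) for the Bernoulli model, x in {0, 1}
log_f = x * sp.log(theta) + (1 - x) * sp.log(1 - theta)
d2 = sp.diff(log_f, theta, 2)            # second derivative in theta

# I(theta) = -E[d2] over x, using P(X = 1) = theta, P(X = 0) = 1 - theta
fisher = sp.simplify(-(d2.subs(x, 1) * theta + d2.subs(x, 0) * (1 - theta)))
print(fisher)                            # equivalent to 1/(theta*(1 - theta))
print(sp.sqrt(fisher))                   # Jeffreys prior ∝ sqrt(I(theta))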
3) Reference criterion: Bernardo (1986) proposed a methodology for
obtaining prior distributions, called minimum informative or reference
priors, based on the idea that the data contain all the relevant information in
an inference problem.
o The reference prior is the distribution that maximizes the expected
distance between the prior and the posterior distributions as the
sample size goes to infinity.
o Examples of reference priors are given in the list of formulas.
Ø CONJUGATE PRIORS: Conjugate priors arose from trying to quantify the
prior knowledge in such a way that the posterior distribution is easy to
obtain analytically. Due to technological developments, this
justification is no longer critical.
o Definition: Conjugate family. A family of distributions for θ is said to
be conjugate with respect to a certain probabilistic model f(x|θ) if for
any prior distribution belonging to such a family, the posterior
distribution also belongs to the same family.
o EXAMPLE 10: Let X1, X2, ..., Xn be a r.s. from Ber(θ) and let θ ∼ Beta(a, b) be
the prior distribution for θ. Then,
f(x | θ) = θΣxi (1 − θ)n−Σxi Πi=1..n I{0,1}(xi)
f(θ) = [Γ(a + b) / (Γ(a) Γ(b))] θa−1 (1 − θ)b−1 I(0,1)(θ)
⇒ f(θ | x) ∝ θa+Σxi−1 (1 − θ)b+n−Σxi−1 I(0,1)(θ)
∴ f(θ | x) = [Γ(a1 + b1) / (Γ(a1) Γ(b1))] θa1−1 (1 − θ)b1−1 I(0,1)(θ),
where a1 = a + Σxi and b1 = b + n − Σxi. That is, θ | x ∼ Beta(a1, b1).
o More examples of conjugate families can be found in the list of
formulas.
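o The conjugate update itself is a one-liner; a minimal sketch in Python (the
prior parameters and data are assumptions for illustration):

# Beta-Bernoulli: Beta(a, b) prior + s successes in n trials -> Beta(a + s, b + n - s).

def beta_update(a, b, data):
    s = sum(data)
    return a + s, b + len(data) - s

print(beta_update(2.0, 2.0, [1, 0, 1, 1, 0, 1]))   # (6.0, 4.0), i.e., Beta(6, 4)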
1.5 Parametric inference problems
Ø The typical parametric inference problems are: point estimation, interval
estimation, and hypothesis testing.
Ø POINT ESTIMATION. The point estimation problem, as a decision problem, is
described as follows:
o D = E = Θ.
o v(θ̃, θ): the loss incurred when estimating θ with θ̃. Consider three loss
functions:
1) Squared loss function:
v(θ̃, θ) = a(θ̃ − θ)2, where a > 0.
In this case, the optimal decision, the one that minimizes the expected loss, is
θ̃ = E(θ): the best estimator of θ under squared loss is the mean of the
(available) distribution of θ.
2) Absolute loss function:
v(θ̃, θ) = a|θ̃ − θ|, where a > 0.
In this case, the optimal decision that minimizes the expected loss is
θ̃ = Med(θ): the best estimator of θ under absolute loss is the median of the
(available) distribution of θ.
3) Neighborhood loss function:
v(θ̃, θ) = 1 − IBε(θ̃)(θ),
where Bε(θ̃) denotes the neighborhood of radius ε centered at θ̃.
In this case, the optimal decision that minimizes the expected loss as ε → 0 is
θ̃ = Mode(θ): the best estimator of θ under neighborhood loss is the mode of
the (available) distribution of θ.
Ø EXAMPLE 11: Let X1, X2, ..., Xn be a r.s. from the population Ber(θ). Assume
that the available prior information can be described with a Beta
distribution, i.e., θ ∼ Beta(a, b). As shown in the previous example, the
posterior distribution is again a Beta distribution, i.e.,
θ | x ∼ Beta(a + Σi=1..n Xi, b + n − Σi=1..n Xi).
The idea is to produce a point estimate of θ:
1) If we use a squared loss:
θ̃ = E(θ | x) = (a + Σxi) / (a + b + n);
2) If we use a neighborhood loss:
θ̃ = Mode(θ | x) = (a + Σxi − 1) / (a + b + n − 2).
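Ø The three optimal point estimates for this Beta posterior can be computed
directly; a sketch using scipy for the median (the prior a = b = 2 and the
data, 4 successes in 6 trials, are assumptions):

from scipy.stats import beta

a, b, n, s = 2.0, 2.0, 6, 4            # assumed prior parameters and data
a1, b1 = a + s, b + n - s              # posterior Beta(a1, b1)

mean = a1 / (a1 + b1)                  # optimal under squared loss
median = beta.ppf(0.5, a1, b1)         # optimal under absolute loss
mode = (a1 - 1) / (a1 + b1 - 2)        # optimal under neighborhood loss (eps -> 0)
print(mean, median, mode)              # 0.6, ≈ 0.607, 0.625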
Ø INTERVAL ESTIMATION. The problem of interval estimation can be
described as a decision problem in the following way:
o D = {D : D ⊂ Θ},
where D is a probability interval of size (1 − α) if ∫D f(θ) dθ = 1 − α.
Remark: for fixed α ∈ (0,1), the probability interval is not unique.
o E = Θ.
o v(D, θ) = ‖D‖ − ID(θ), the loss of estimating θ with D, where ‖D‖ denotes
the length of D.
This loss function represents the idea that, for a fixed α, a smaller
interval is preferred. Therefore,
q the interval D* of smallest length satisfies the property of being
a highest density interval, that is,
if θ1 ∈ D* and θ2 ∉ D* ⇒ f(θ1) ≥ f(θ2).
q How do we produce a highest density interval?
Follow these steps:
o Locate the highest point of the density of θ.
o From this point, trace descending horizontal lines until the region where
the density lies above the line accumulates a probability of (1 − α).
The best interval estimate of θ is the interval D* with the smallest length.
[Figure: Gamma(shape = 2, scale = 1) density over x ∈ [0, 10] (density from 0
to 0.4), with a (1 − α) highest density interval marked by horizontal cuts.]
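q A numerical sketch of this construction for the Gamma(2, 1) density of the
figure (scipy assumed; the 95% level is an assumption for illustration):

from scipy.stats import gamma
from scipy.optimize import brentq

dist = gamma(a=2, scale=1)             # Gamma(shape = 2, scale = 1), mode at x = 1
alpha = 0.05

def mass_above(height):
    """Probability mass of {x : f(x) >= height} and the interval itself."""
    lo = brentq(lambda x: dist.pdf(x) - height, 1e-9, 1.0)   # crossing left of the mode
    hi = brentq(lambda x: dist.pdf(x) - height, 1.0, 50.0)   # crossing right of the mode
    return dist.cdf(hi) - dist.cdf(lo), (lo, hi)

# Lower the horizontal line until the enclosed mass is 1 - alpha.
height = brentq(lambda h: mass_above(h)[0] - (1 - alpha), 1e-6, dist.pdf(1) - 1e-6)
print(mass_above(height)[1])           # ≈ (0.04, 4.78), the 95% highest density interval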
Ø HYPOTHESIS TESTING. The problem of hypothesis testing is a simple
decision problem and consists of selecting between two models or
hypotheses H0 and H1. In this case,
o D = E = {H0, H1}
o v(d, θ): the loss function, of the form
                Truth
Choice      H0      H1
H0          v00     v01
H1          v10     v11
where v00 and v11 are the losses when making a correct decision (generally
v00 = v11 = 0),
v10 is the loss of rejecting H0 (accepting H1) when H0 is correct, and
v01 is the loss of not rejecting H0 (accepting H0) when H0 is false.
Let p0 = P(H0) be the probability associated with the hypothesis H0 at the
moment of making the decision (prior or posterior). Then, the expected loss
of each choice is:
E(v | H0) = v00 p0 + v01 (1 − p0) = v01 − (v01 − v00) p0
E(v | H1) = v10 p0 + v11 (1 − p0) = v11 − (v11 − v10) p0
The graphical representation is given below,
where p* = (v01 − v11) / (v10 − v11 + v01 − v00).
Finally, the optimal solution is the one that minimizes the expected loss:
if E(v | H0) < E(v | H1) ⇔ p0/(1 − p0) > (v01 − v11)/(v10 − v00) ⇔ p0 > p* ⇒ H0:
choose H0 if p0 is large enough compared with 1 − p0;
if E(v | H0) > E(v | H1) ⇔ p0/(1 − p0) < (v01 − v11)/(v10 − v00) ⇔ p0 < p* ⇒ H1:
choose H1 if p0 is small enough compared with 1 − p0;
if p0 = p* ⇒ H0 or H1:
indifference between H0 and H1 if p0 is neither large enough nor small
enough compared with 1 − p0.
[Plot: E(v | H0) and E(v | H1) as straight lines in p0 ∈ [0, 1]; E(v | H0) goes from
v01 at p0 = 0 to v00 at p0 = 1, and E(v | H1) from v11 to v10; the lines cross at
p0 = p*, with H1 optimal for p0 < p* and H0 optimal for p0 > p*.]
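q A sketch of this decision rule in Python (the losses are illustrative
assumptions, not from the notes):

# Bayesian hypothesis test: choose the hypothesis with the smaller expected loss.

def test(p0, v00=0.0, v11=0.0, v10=1.0, v01=5.0):
    """p0 = P(H0); vij = loss of choosing Hi when Hj is true."""
    p_star = (v01 - v11) / (v10 - v11 + v01 - v00)   # indifference point
    loss_h0 = v00 * p0 + v01 * (1 - p0)              # E(v | choose H0)
    loss_h1 = v10 * p0 + v11 * (1 - p0)              # E(v | choose H1)
    return ("H0" if p0 > p_star else "H1"), p_star, loss_h0, loss_h1

print(test(0.3))   # ('H1', 0.833..., 3.5, 0.3): p0 < p*, so H1 is chosen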