Post on 28-Feb-2022
Decision Theory
Framework for formalizing inference problems
I Inference targets: unknown quantities of interest
I Observations: access to information about unknown quantities
I Decision rules: take action based on observations
I Loss and risk: assessments of performance, comparison of decision rules
General Setting
Set X of possible observations, called the sample space
I Single measurement X = {0, 1}, N, or R
I Multiple measurements X = {0, 1}^n, N^n, or R^n
Family P = {f(x|θ) : θ ∈ Θ} of pdf/pmf on X
I Indices θ ∈ Θ called parameters, “states of nature”
I Index set Θ called parameter space, assumed known
Inference
I Observe X ∈ X (random) with X ∼ f(x|θ) ∈ P, where θ is unknown
I Goal: Learn about unknown θ based on the observed value x of X
Principal Inference Problems
1. Point estimation
I Observe X ∼ f(x|θ). Obtain an estimate θ̂ of θ, where θ̂ ∈ Θ.
2. Hypothesis testing: Given subset Θ0 ⊆ Θ of interest
I Observe X ∼ f(x|θ). Decide if θ ∈ Θ0 or θ ∈ Θ0^c.
3. Confidence set estimation
I Observe X ∼ f(x|θ). Find a small set C ⊆ Θ that is likely to contain θ.
Inference as Deterministic/Stochastic Procedure
Two complementary perspectives
I Inference process as deterministic map from data x to estimates/decisions
I Stochastic behavior of these maps when applied to random observation X
Actions and Decisions
Inference amounts to making a decision about the parameter θ based on the observed value x ∈ X of X ∼ f(·|θ). We use the term "data" for realized values of random quantities
I Decision space A = set of all possible decisions
I Decision rule is a map d : X → A from data to decisions
I Family D of allowable decision rules
Decision space A and family D will depend on
I Nature of the inference problem at hand
I Criteria such as invariance, smoothness, unbiasedness
I Computational constraints
Loss and Risk
Definition: A loss function is a map ℓ : Θ × A → R. Interpret ℓ(θ, a) as the cost if we make decision a when the true state of nature is θ
Definition: The risk function of a decision rule d : X → A is defined by
R(θ, d) = Eθ ℓ(θ, d(X)), θ ∈ Θ
I Eθh(X) is the expectation of h(X) when X ∼ f(x|θ)
I R(θ, d) = expected loss of rule d when applied to observation X ∼ f(x|θ)
I Continuous case: R(θ, d) = ∫ ℓ(θ, d(x)) f(x|θ) dx
I Discrete case: R(θ, d) = ∑x ℓ(θ, d(x)) f(x|θ)
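As a quick numerical sanity check, a risk function can be approximated by Monte Carlo. The setup below (a normal observation, the identity rule, squared loss, and all specific numbers) is chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_risk(theta, d, loss, sampler, n_rep=100_000):
    # Monte Carlo estimate of R(theta, d) = E_theta[loss(theta, d(X))]
    x = sampler(theta, n_rep)
    return loss(theta, d(x)).mean()

# Sanity check: X ~ N(theta, 1), d(x) = x, squared loss.
# Then R(theta, d) = E[(X - theta)^2] = Var(X) = 1 for every theta.
r = mc_risk(theta=2.0,
            d=lambda x: x,
            loss=lambda t, a: (t - a) ** 2,
            sampler=lambda t, n: rng.normal(t, 1.0, n))
```

The same skeleton works for any rule and loss: only `d`, `loss`, and `sampler` change.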
Framework for Point Estimation
Goal: Estimate the parameter θ based on observation X ∼ f(x|θ) ∈ P
I Typically A = Θ, i.e., the decision space is the parameter space
I Decision rule d : X → Θ is an estimator. Common to write d(x) as θ̂(x)
Common loss functions
I Squared loss ℓ(θ, θ′) = (θ − θ′)²
I Absolute loss ℓ(θ, θ′) = |θ − θ′|
I qth power loss ℓ(θ, θ′) = ‖θ − θ′‖^q, some q > 0
I Zero-one loss ℓ(θ, θ′) = I(θ ≠ θ′)
I Kullback–Leibler loss ℓ(θ, θ′) = ∫ f(x|θ) log(f(x|θ)/f(x|θ′)) dx
Framework for Hypothesis Testing
Given: Partition Θ = Θ0 ∪Θ1 of parameter space
Goal: Decide if θ ∈ Θ0 or θ ∈ Θ1 based on X ∼ f(x|θ) ∈ P
I Decision space A = {0, 1} where a indicates decision θ ∈ Θa
I Decision rule d : X → {0, 1}
I Zero-one loss ℓ(θ, a) = I(θ ∉ Θa) (1 if decision is incorrect, 0 otherwise)
Under zero-one loss the risk function has the form
R(θ, d) = Eθ ℓ(θ, d(X)) =
  Pθ(d(X) = 1) if θ ∈ Θ0 (Type I error)
  Pθ(d(X) = 0) if θ ∈ Θ1 (Type II error)
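Both error probabilities are easy to compute for a simple cutoff test. The model, the cutoff c = 1.645, and the parameter values below are illustrative choices, not from the notes:

```python
from math import erf, sqrt

def Phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# X ~ N(theta, 1); Theta0 = {theta <= 0}, Theta1 = {theta > 0}
# Rule d(x) = 1 if x > c, else 0 (c = 1.645 is an illustrative cutoff)
c = 1.645
type_I  = 1.0 - Phi(c - 0.0)   # R(theta, d) at theta = 0 (theta in Theta0)
type_II = Phi(c - 2.0)         # R(theta, d) at theta = 2 (theta in Theta1)
```

The risk function is just these two probabilities, traced out as θ ranges over Θ0 and Θ1.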
Framework for Interval Estimation
Goal: Find small confidence set C ⊆ Θ likely to contain θ based on X ∼ f(x|θ) ∈ P
I Decision space A ⊆ 2^Θ, e.g., intervals, rectangles, balls
I Decision rule C : X → A
I Weighted 0-1 loss ℓ(θ, C) = I(θ ∉ C) + λ Vol(C), some λ > 0
Under the weighted zero-one loss the risk function has the form
R(θ, C) = Eθ ℓ(θ, C(X)) = Pθ(θ ∉ C(X)) + λ Eθ[Vol(C(X))]
Note that
I Minimizing risk entails a trade-off between coverage probability and size of the confidence set
I In frequentist setting observation X is random, but parameter θ is not
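The trade-off can be made concrete for a hypothetical interval rule C(x) = [x − h, x + h] with X ∼ N(θ, 1): the weighted risk is 2Φ(−h) + 2λh, which happens to be free of θ. The value λ = 0.1 is an arbitrary illustration:

```python
from math import erf, sqrt

def Phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# For X ~ N(theta, 1) and C(x) = [x - h, x + h]:
# P_theta(theta not in C(X)) = P(|X - theta| > h) = 2 Phi(-h), Vol(C) = 2h
lam = 0.1
def weighted_risk(h):
    return 2.0 * Phi(-h) + 2.0 * lam * h

# miss probability falls in h while volume grows: an interior optimum
best_h = min((0.01 * k for k in range(1, 501)), key=weighted_risk)
```

Small h gives poor coverage, large h gives a bloated set; the grid search locates the balance point.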
Frequentist and Bayesian Perspectives on Inference
Different approaches stemming in part from different interpretations of probability
Frequentist
I Probability defined through repetitions of a random experiment
I True parameter θ is a fixed element of Θ, but otherwise unknown
I Analysis and interpretation of inference based on (potentially unrealized) replications of the basic experiment
Bayesian
I Probability understood as a (potentially subjective) measure of belief
I Belief about the true parameter before and after an experiment represented respectively by prior and posterior distributions on the parameter space Θ
I Experiment regarded as unique. Inference based on updating the prior based on data, without reference to other experiments or repetition
Overview of Bayesian Inference
Basic ingredients
I Family P = {f(x|θ) : θ ∈ Θ} of sampling densities on X
I Prior density π(θ) on parameter space Θ
I Joint density f(x, θ) = f(x|θ)π(θ), marginal density m(x) = ∫ f(x, θ) dθ
I Observation model: First θ drawn from π, then X drawn from f(x|θ)
Idea: Prior density π(θ) reflects belief/information about parameters before the experiment is conducted. Given data x, update the prior using Bayes' formula to obtain the posterior density
π(θ|x) = f(x|θ)π(θ) / m(x)
Key point: All inferences about θ (point estimates, hypothesis tests, interval estimates) are based on the posterior density
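A standard concrete instance of this update is the Beta–Bernoulli model; the prior and data below are hypothetical:

```python
# Conjugate sketch (illustrative): Beta(a, b) prior on theta and
# X1, ..., Xn ~ Bern(theta). Since f(x|theta) ∝ theta^s (1-theta)^(n-s)
# with s = sum of the x_i, Bayes' formula gives posterior Beta(a+s, b+n-s).
a, b = 1.0, 1.0                      # uniform prior on (0, 1)
data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
s, n = sum(data), len(data)          # s = 7, n = 10
a_post, b_post = a + s, b + n - s    # posterior is Beta(8, 4)
post_mean = a_post / (a_post + b_post)
```

All subsequent inference (point estimates, tests, intervals) would then be read off the Beta(8, 4) posterior.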
Comparing Decision Rules
Recall: Risk of decision rule d : X → Θ under loss ℓ summarized by risk function
R(θ, d) = Eθ ℓ(θ, d(X))
Question: Given two decision rules d1 and d2, how should we compare their associated risk functions R(θ, d1) and R(θ, d2)?
I Frequentist perspective: Consider maximum risk over θ ∈ Θ
I Bayesian perspective: Consider average risk over prior π
Point Estimation Under Squared Loss
Given family P = {f(x|θ) : θ ∈ Θ} with Θ ⊆ R, and an estimator θ̂ : X → Θ
I The bias of θ̂ at θ is biasθ(θ̂) = Eθ[θ̂(X)] − θ
I The variance of θ̂ at θ is Varθ(θ̂) = Eθ[θ̂(X) − Eθ θ̂(X)]²
I Say θ̂ is unbiased if biasθ(θ̂) = 0 for all θ
Bias-Variance Decomposition: Under the squared loss ℓ(θ, a) = (θ − a)²
R(θ, θ̂) = Varθ(θ̂) + (biasθ(θ̂))²
Upshot: For an estimator θ̂ to perform well it should
I Be centered near the true parameter (small bias)
I Not be too spread out (small variance)
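The decomposition can be checked empirically; the linear estimator 0.8X + 0.3 below is an arbitrary biased rule, used only so that both terms are nonzero:

```python
import numpy as np

# Monte Carlo check of risk = variance + bias^2 (illustrative estimator,
# not from the notes): theta_hat = 0.8 X + 0.3 with X ~ N(theta, 1)
rng = np.random.default_rng(1)
theta, n_rep = 2.0, 200_000
est = 0.8 * rng.normal(theta, 1.0, n_rep) + 0.3

mse = ((est - theta) ** 2).mean()   # empirical risk under squared loss
bias = est.mean() - theta           # population value: 0.8*2 + 0.3 - 2 = -0.1
var = est.var()                     # population value: 0.8^2 * 1 = 0.64
# the identity mse = var + bias^2 holds exactly for these empirical versions
```

The population risk here is 0.64 + (−0.1)² = 0.65, and the empirical quantities settle near it.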
Example: Estimation of a Normal Mean
Observation: X ∼ N (θ, 1) with θ ∈ R
Goal: Estimate θ under the squared error loss.
Consider two estimators
I θ̂1(x) = x, risk function R(θ, θ̂1) = 1
I θ̂2(x) = 3, risk function R(θ, θ̂2) = (θ − 3)²
Neither risk function dominates the other
Example: Probability of Success in Bernoulli Trial
Observation: X1, . . . , Xn ∼ Bern(θ) with θ ∈ (0, 1)
Goal: Estimate θ under the squared error loss.
Consider two estimators
θ̂1(x) = x̄n with R(θ, θ̂1) = θ(1 − θ)/n
θ̂2(x) = (n x̄n + √n/2)/(n + √n) with R(θ, θ̂2) = 1/(4(1 + √n)²) (constant)
Neither risk function dominates the other
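A short script confirms that neither risk function dominates the other (n = 10 is an arbitrary choice):

```python
from math import sqrt

n = 10
def risk1(theta):                 # R(theta, theta_hat_1) = theta(1-theta)/n
    return theta * (1 - theta) / n
risk2 = 1.0 / (4.0 * (1.0 + sqrt(n)) ** 2)   # constant risk of theta_hat_2

grid = [k / 100 for k in range(1, 100)]
where_1_wins = [t for t in grid if risk1(t) < risk2]   # near theta = 0 or 1
where_2_wins = [t for t in grid if risk2 < risk1(t)]   # near theta = 1/2
```

Each list is nonempty: the sample mean is better near the boundary of (0, 1), the shrunken estimator near the middle.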
Maximum Risk and Bayes Risk
Idea: Single number summaries of overall risk
Definition: Given family P = {f(x|θ) : θ ∈ Θ} and loss function ` : Θ×A → R
(i) The maximum risk of a decision rule d : X → A is
Rm(d) = supθ∈Θ R(θ, d)
(ii) The Bayes risk of a decision rule d : X → A under prior density π is
Rπ(d) = ∫ R(θ, d) π(θ) dθ
Example: Probability of Success in Bernoulli Trial
Recall: Observe X1, . . . , Xn ∼ Bern(θ). Estimators θ̂1, θ̂2 for θ with
R(θ, θ̂1) = θ(1 − θ)/n and R(θ, θ̂2) = n/(4(n + √n)²) = 1/(4(1 + √n)²)
A. Maximum risk: Prefer estimator θ̂2, as
Rm(θ̂1) = 1/(4n) > 1/(4(1 + √n)²) = Rm(θ̂2)
B. Bayes risk: Under the uniform prior π(θ) = 1, prefer estimator θ̂1 for n ≥ 20, as
Rπ(θ̂1) = 1/(6n) < 1/(4(1 + √n)²) = Rπ(θ̂2)
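The cutoff n ≥ 20 can be verified directly from the three closed-form risks:

```python
from math import sqrt

def max_risk_1(n):   return 1.0 / (4.0 * n)    # sup over theta of theta(1-theta)/n
def bayes_risk_1(n): return 1.0 / (6.0 * n)    # integral of theta(1-theta)/n over (0,1)
def risk_2(n):       return 1.0 / (4.0 * (1.0 + sqrt(n)) ** 2)  # constant in theta

# smallest sample size at which theta_hat_1 has the smaller Bayes risk
first_n = next(n for n in range(1, 200) if bayes_risk_1(n) < risk_2(n))
```

On maximum risk θ̂2 wins at every n, while on Bayes risk the preference flips exactly at n = 20.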
Minimax Rules and Bayes Rules
Definition: The minimax risk for a family of decision rules D is
R∗m = infd∈D Rm(d) = infd∈D supθ∈Θ R(θ, d)
A rule d ∈ D is said to be minimax if Rm(d) = R∗m.
Definition: The optimal Bayes risk for a family of decision rules D under a prior π is
R∗π = infd∈D Rπ(d) = infd∈D ∫ R(θ, d) π(θ) dθ
A rule d ∈ D is said to be a Bayes rule for π if Rπ(d) = R∗π . Note: R∗π depends on π
Fact: Minimax risk is always bounded below by the optimal Bayes risk: for every prior distribution π on Θ one has R∗m ≥ R∗π
Finding Bayes Rules by Minimizing Posterior Risk
Given: Family P = {f(x|θ) : θ ∈ Θ} and prior density π on Θ. Recall the posterior density of θ given X = x is
π(θ|x) = f(x|θ)π(θ) / m(x), where m(x) = ∫ f(x|θ)π(θ) dθ
Definition: The posterior risk of a decision a ∈ A given x under π is
Rπ(a|x) = ∫Θ ℓ(θ, a) π(θ|x) dθ = E[ℓ(θ, a) | X = x]
Fact: Under mild conditions, the decision rule
dπ(x) = argmina∈A Rπ(a|x)
is a Bayes rule for π, provided that it is contained in D
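A brute-force sketch of this recipe: discretize Θ and A, and pick the action with the smallest posterior risk. The Beta(2, 3) density is a hypothetical stand-in for π(θ|x):

```python
import numpy as np

# Grid approximation with an illustrative Beta(2, 3) posterior
theta = np.linspace(1e-4, 1 - 1e-4, 2001)
dt = theta[1] - theta[0]
post = theta ** (2 - 1) * (1 - theta) ** (3 - 1)
post /= post.sum() * dt                  # normalize numerically

# posterior risk of action a under squared loss: integral of (theta-a)^2 pi(theta|x)
actions = np.linspace(0.0, 1.0, 1001)
post_risk = np.array([((theta - a) ** 2 * post).sum() * dt for a in actions])
a_star = actions[np.argmin(post_risk)]
```

Under squared loss the minimizer should land on the posterior mean, which is 2/5 for Beta(2, 3); this anticipates case A on the next slide.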
Bayesian Point Estimators Under Different Loss Functions
Given: Family P = {f(x|θ) : θ ∈ Θ} with Θ = R and prior density π(θ).
A. Under squared loss ℓ(θ, θ′) = (θ − θ′)², the Bayes estimator is the posterior mean
θ̂π(x) = ∫Θ θ π(θ|x) dθ
B. Under absolute loss ℓ(θ, θ′) = |θ − θ′|, the Bayes estimator is the posterior median
θ̂π(x) = u such that ∫_{−∞}^u π(θ|x) dθ = 1/2
C. Under zero-one loss ℓ(θ, θ′) = I(θ ≠ θ′), the Bayes estimator is the posterior mode
θ̂π(x) = argmaxθ∈Θ π(θ|x)
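All three estimators can be computed numerically from any posterior on a grid; a hypothetical Beta(3, 5) posterior is used here:

```python
import numpy as np

# Illustrative Beta(3, 5) posterior evaluated on a fine grid
a, b = 3.0, 5.0
theta = np.linspace(1e-6, 1 - 1e-6, 200_001)
dt = theta[1] - theta[0]
dens = theta ** (a - 1) * (1 - theta) ** (b - 1)
dens /= dens.sum() * dt                           # normalize numerically

post_mean = (theta * dens).sum() * dt             # squared loss
cdf = np.cumsum(dens) * dt
post_median = theta[np.searchsorted(cdf, 0.5)]    # absolute loss
post_mode = theta[np.argmax(dens)]                # zero-one loss
```

For Beta(3, 5) the closed forms are mean 3/8, mode (3−1)/(3+5−2) = 1/3, and median ≈ 0.364, so the three loss functions genuinely produce three different point estimates.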
Bayes Rules with Constant Risk are Minimax
Theorem: Let dπ be the Bayes rule for a family D under a prior π. If the risk function R(θ, dπ) is constant in θ, then dπ is minimax for D.
Note: If dπ is minimax then π is said to be a least favorable prior
Example: Consider X1, . . . , Xn ∼ Bern(θ). Consider the point estimator
θ̂(x) = (n x̄n + √n/2)/(n + √n)
under the squared error loss
I Have seen that the risk R(θ, θ̂) = 1/(4(1 + √n)²) is constant in θ
I Can show θ̂ is the posterior mean for θ under a Beta(√n/2, √n/2) prior
I By the Theorem, θ̂ is minimax
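The constancy of the risk can be verified exactly, since S = ∑Xi is Binomial(n, θ) (n = 25 below is an arbitrary choice):

```python
from math import comb, sqrt

def exact_risk(theta, n):
    # Exact R(theta, theta_hat) = E_theta[(theta_hat(S) - theta)^2],
    # summing over S ~ Bin(n, theta); theta_hat = (S + sqrt(n)/2)/(n + sqrt(n))
    c = sqrt(n) / 2.0
    return sum(comb(n, s) * theta ** s * (1 - theta) ** (n - s)
               * ((s + c) / (n + 2.0 * c) - theta) ** 2
               for s in range(n + 1))

n = 25
risks = [exact_risk(t, n) for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
# every value equals 1 / (4 (1 + sqrt(n))^2), i.e. 1/144 for n = 25
```

The algebraic reason: the Beta(√n/2, √n/2) prior makes the θ-dependence of variance and squared bias cancel exactly.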
Admissibility
Setting: General inference problem with family D of candidate decision rules
Definition: A decision rule d ∈ D is inadmissible if there is some d′ ∈ D such that
(i) R(θ, d′) ≤ R(θ, d) for all θ ∈ Θ
(ii) R(θ, d′) < R(θ, d) for some θ ∈ Θ
If no such d′ exists, then d is said to be admissible
I Admissibility depends on the family D and the loss function `
I A rule d is either admissible or inadmissible
I Admissible rules are candidates for good/reasonable rules
I There may be many admissible rules
I Admissibility is a weak criterion. Obviously silly rules can be admissible.
Example
Observations: X1, . . . , Xn i.i.d. Bern(θ) with θ ∈ (0, 1)
Goal: Estimate θ under squared loss. Candidate estimators
I θ̂1(x) = x̄ with R(θ, θ̂1) = θ(1 − θ)/n
I θ̂2(x) = x1 with R(θ, θ̂2) = θ(1 − θ)
I θ̂3(x) = 1/2 with R(θ, θ̂3) = (θ − 1/2)²
Fact
1. θ̂1 is admissible
2. θ̂2 is inadmissible (bettered by θ̂1)
3. θ̂3 is admissible (lazy, but unbeatable when θ = 1/2)
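The dominance relations are easy to confirm on a grid of θ values (n = 10 is illustrative):

```python
n = 10
grid = [k / 100 for k in range(1, 100)]     # theta in {0.01, ..., 0.99}

r1 = [t * (1 - t) / n for t in grid]        # sample mean
r2 = [t * (1 - t) for t in grid]            # first observation only
r3 = [(t - 0.5) ** 2 for t in grid]         # constant guess 1/2

# theta_hat_1 dominates theta_hat_2: never worse, strictly better somewhere
dominated = (all(a <= b for a, b in zip(r1, r2))
             and any(a < b for a, b in zip(r1, r2)))
# but at theta = 1/2 the constant guess has risk 0, so nothing dominates it
```

This is the sense in which admissibility is weak: the "lazy" rule θ̂3 survives because its risk touches zero at a single point.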
Admissibility of Bayes Rules
Thm: Consider a Bayesian decision problem in which
I Θ ⊆ Rp is open
I π(θ) > 0 for every θ ∈ Θ
I The Bayes rule dπ for π has finite Bayes risk Rπ(dπ)
If R(θ, d) is a continuous function of θ for each d ∈ D, then dπ is admissible.
Idea: If there were a rule d′ with R(θ, d′) ≤ R(θ, dπ) for all θ and strict inequality for some θ, then continuity of the risk and positivity of the prior would force the Bayes risk of d′ to be strictly less than that of dπ, contradicting the fact that dπ is a Bayes rule.