Bayesian Inference
• Reading Assignments
R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley & Sons, 2nd edition, 2001 (2.1, 2.4-2.6, 3.1-3.2, hard copy).
S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach (Chapter 14, hard copy).
S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (Chapter 3, hard copy).
• Case Studies
H. Schneiderman and T. Kanade, "A Statistical Method for 3D Object Detection Applied to Faces and Cars", Computer Vision and Pattern Recognition Conference, pp. 45-51, 1998 (on-line).
K. Sung and T. Poggio, "Example-based learning for view-based human face detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 39-51, 1998 (on-line).
A. Madabhushi and J. Aggarwal, "A Bayesian approach to human activity recognition", 2nd International Workshop on Visual Surveillance, pp. 25-30, June 1999 (hard copy).
M. Jones and J. Rehg, "Statistical color models with application to skin detection", Technical Report, Compaq Research Labs (on-line).
J. Yang and A. Waibel, "A Real-time Face Tracker", Proceedings of WACV'96, 1996 (on-line).
C. Stauffer and E. Grimson, "Adaptive background mixture models for real-time tracking", IEEE Computer Vision and Pattern Recognition Conference, vol. 2, pp. 246-252, 1998 (on-line).
• Why bother about probabilities?
- Accounting for uncertainty is a crucial component in decision making (e.g., classification) because of ambiguity in our measurements.
- Probability theory is the proper mechanism for accounting for uncertainty.
- Need to take into account reasonable preferences about the state of the world, for example:
"If the fish was caught in the Atlantic ocean, then it is more likely to be salmon than sea bass"
- We will discuss techniques for building probabilistic models and for extracting information from a probabilistic model.
• Probabilistic Inference
- If we could define all possible values for the probability distribution, then we could read off any probability we were interested in.
- In general, it is not practical to define all possible entries for the joint probability function.
- Probabilistic inference consists of computing probabilities that are not explicitly stored by the reasoning system (e.g., marginals, conditionals).
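The idea above can be sketched with a toy joint distribution over two binary variables; the numbers below are illustrative only, and the marginal/conditional computations are exactly the "probabilities not explicitly stored" mentioned above.

```python
# Toy joint distribution over two binary variables:
# T = toothache, C = cavity. The probabilities are illustrative.
joint = {
    (True,  True):  0.108,   # P(T=1, C=1)
    (True,  False): 0.016,   # P(T=1, C=0)
    (False, True):  0.072,   # P(T=0, C=1)
    (False, False): 0.804,   # P(T=0, C=0)
}

def marginal_toothache(t):
    """P(T=t), obtained by summing the joint over C (marginalization)."""
    return sum(p for (tv, cv), p in joint.items() if tv == t)

def conditional_cavity(t):
    """P(C=1/T=t) = P(T=t, C=1) / P(T=t)."""
    return joint[(t, True)] / marginal_toothache(t)

p_toothache = marginal_toothache(True)
p_cavity_given_toothache = conditional_cavity(True)
```

Neither P(T) nor P(C/T) is stored explicitly; both are derived from the joint on demand.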
• Belief
- The conditional probability of an event given some evidence.
We may not know for sure what affects a particular patient, but we believe that there is, say, an 80% chance (i.e., a probability of 0.8) that the patient has a cavity if he or she has a toothache.
• Bayes rule
- Very often we want to compute the value of P(hypothesis/evidence).
- Bayes' rule provides a way of computing a conditional probability from its inverse conditional probability:
P(B/A) = P(A/B) P(B) / P(A)
- The denominator P(A) can be considered as a normalization constant (it can be chosen so that the posteriors P(B/A) sum to 1 over all hypotheses B).
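A minimal numeric sketch of the rule (the probability values below are made up for illustration); the denominator is computed by total probability, which is exactly the normalization just mentioned:

```python
def bayes(p_a_given_b, p_b, p_a_given_not_b):
    """P(B/A) via Bayes' rule.

    The denominator P(A) is obtained by total probability:
    P(A) = P(A/B)P(B) + P(A/~B)P(~B), which normalizes the posterior.
    """
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1.0 - p_b)
    return p_a_given_b * p_b / p_a

# Illustrative values: P(toothache/cavity) = 0.9, P(cavity) = 0.2,
# P(toothache/no cavity) = 0.05.
posterior = bayes(0.9, 0.2, 0.05)
```

By construction, P(B/A) and P(~B/A) sum to 1 for any consistent inputs.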
• An example: separate sea bass from salmon
- Some definitions.
State of nature ω (random variable): ω1 for sea bass, ω2 for salmon.
Probabilities P(ω1) and P(ω2): prior knowledge of how likely it is to get a sea bass or a salmon (priors).
Probability density function p(x): how frequently we will measure a pattern with feature value x (e.g., x is a lightness measurement) (evidence).
Conditional probability density function (pdf) p(x/ωj): how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj (likelihood).
Conditional probability P(ωj/x): the probability that the fish belongs to class ωj given measurement x (posterior).
- Decision rule using priors only
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
P(error) = min[P(ω1), P(ω2)]
- Classification can be improved by using additional information (i.e., lightness measurements).
- Decision rule using conditional pdf
- The joint pdf of finding a pattern in category ωj having feature value x is:
p(x, ωj) = P(ωj/x)p(x) = p(x/ωj)P(ωj)
- Bayes' formula is:
P(ωj/x) = p(x/ωj)P(ωj) / p(x) = (likelihood x prior) / evidence
where p(x) = p(x/ω1)P(ω1) + p(x/ω2)P(ω2) is essentially a scale factor.
Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2
(or) Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2
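The second form of the rule can be sketched for the fish example with assumed 1D Gaussian class-conditional densities; the means and variances below are invented for illustration, while the priors follow the notes (P(ω1) = 2/3, P(ω2) = 1/3).

```python
import math

def gauss(x, mu, sigma):
    """1D Gaussian class-conditional density p(x/ω)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# ω1 = sea bass, ω2 = salmon; lightness models are illustrative assumptions.
PRIOR = {1: 2 / 3, 2: 1 / 3}
MU    = {1: 7.0, 2: 4.0}    # assumed mean lightness per class
SIGMA = {1: 1.0, 2: 1.0}

def decide(x):
    """Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2."""
    g1 = gauss(x, MU[1], SIGMA[1]) * PRIOR[1]
    g2 = gauss(x, MU[2], SIGMA[2]) * PRIOR[2]
    return 1 if g1 > g2 else 2
```

Note that dividing both sides by p(x) would not change any decision, which is why the evidence can be ignored when classifying.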
[Figure: class-conditional densities and posterior probabilities, assuming P(ω1) = 2/3, P(ω2) = 1/3]
• Probability of error
P(error/x) = P(ω1/x) if we decide ω2; P(ω2/x) if we decide ω1
(or) P(error/x) = min[P(ω1/x), P(ω2/x)]
- Does the above decision rule minimize the probability of error?
P(error) = ∫ P(error, x) dx = ∫ P(error/x) p(x) dx (integrating over all x, from −∞ to ∞)
- Yes: since the rule minimizes P(error/x) at every x, it also minimizes the integral, i.e., the average probability of error.
• Where do the probabilities come from?
- The Bayesian rule is optimal if the pmf or pdf is known.
- There are two competing answers to the above question:
(1) Relative frequency (objective) approach.
Probabilities can only come from experiments.
(2) Bayesian (subjective) approach.
Probabilities may reflect degree of belief and can be based on opinion as well as experiments.
Example (objective): classify cars on UNR campus into two categories:
(1) C1: more than $50K
(2) C2: less than $50K
* Suppose we use one feature x: height of car
* From Bayes rule, we can compute our belief:
P(C1/x) = P(x/C1)P(C1) / P(x)
* Need to estimate P(x/C1), P(x/C2), P(C1), and P(C2)
* Determine prior probabilities
(1) ask drivers at the gate how much their car cost
(2) measure the height of the car
* Suppose we end up with 1209 samples: #C1=221, #C2=988
* P(C1) = 221/1209 = 0.183 and P(C2) = 1 − P(C1) = 0.817
* Determine class conditional probabilities (discretize car height into bins and use a normalized histogram)
* Calculate the posterior probability for each bin using the Bayes rule:
P(C1/x = 1.0) = P(x = 1/C1)P(C1) / P(x = 1) = P(x = 1/C1)P(C1) / [P(x = 1/C1)P(C1) + P(x = 1/C2)P(C2)] = 0.438
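The bin-wise computation can be sketched as follows. The priors come from the sample counts in the notes; the class-conditional values for the bin x = 1.0 are placeholders, since the real ones come from the height histograms in a figure not reproduced here (so the resulting posterior will not match 0.438 exactly):

```python
# Priors from the sample counts: 221 of 1209 cars are in C1.
p_c1 = 221 / 1209            # ≈ 0.183
p_c2 = 1 - p_c1              # ≈ 0.817

# Placeholder class-conditional values for the bin x = 1.0
# (illustrative; the true values come from the normalized histograms).
p_x_given_c1 = 0.30
p_x_given_c2 = 0.09

evidence = p_x_given_c1 * p_c1 + p_x_given_c2 * p_c2   # P(x = 1)
post_c1 = p_x_given_c1 * p_c1 / evidence               # P(C1/x = 1.0)
post_c2 = p_x_given_c2 * p_c2 / evidence               # P(C2/x = 1.0)
```

The same three lines are repeated for every histogram bin to get the full posterior curve.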
• Functional structure of a general statistical classifier
assign x to ωi if gi(x) > gj(x) for all j ≠ i
(discriminant functions)
• Minimum error-rate case
gi(x) = P(ωi/x)
• Is the choice of gi unique?
- Replacing gi(x) with f(gi(x)), where f() is monotonically increasing, does not change the classification results. The following discriminants are all equivalent:
gi(x) = p(x/ωi)P(ωi) / p(x)
gi(x) = p(x/ωi)P(ωi)
gi(x) = ln p(x/ωi) + ln P(ωi)
• Decision regions/boundaries
- Decision rules divide the feature space into decision regions R1, ..., Rc.
- The boundaries of the decision regions are the decision boundaries.
• Discriminant functions for the Gaussian density
- Assume the following discriminant function:
gi(x) = ln p(x/ωi) + ln P(ωi)
- If p(x/ωi) ~ N(µi, Σi), then
gi(x) = −(1/2)(x − µi)ᵗΣi⁻¹(x − µi) − (d/2) ln 2π − (1/2) ln|Σi| + ln P(ωi)
Case 1: Σi = σ²I
(1) features are uncorrelated
(2) each feature has the same variance
- If we disregard (d/2) ln 2π and (1/2) ln|Σi| (constants):
gi(x) = −||x − µi||²/(2σ²) + ln P(ωi)
where ||x − µi||² = (x − µi)ᵗ(x − µi); the ln P(ωi) term favors the a-priori more likely category.
- Expanding the above expression:
gi(x) = −(1/(2σ²))[xᵗx − 2µiᵗx + µiᵗµi] + ln P(ωi)
- Disregarding xᵗx (constant), we get a linear discriminant:
gi(x) = wiᵗx + wi0
where wi = (1/σ²)µi and wi0 = −(1/(2σ²))µiᵗµi + ln P(ωi)
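A small sketch of the Case 1 discriminant (the means, priors, and σ² below are invented for illustration), checking that the linear form gives the same decisions as the full expression gi(x) = −||x − µi||²/(2σ²) + ln P(ωi) it was derived from:

```python
import math

# Illustrative two-class problem with Σi = σ²I.
MU = {1: [0.0, 0.0], 2: [3.0, 3.0]}
PRIOR = {1: 0.6, 2: 0.4}
SIGMA2 = 1.0   # shared variance σ²

def g_full(x, i):
    """gi(x) = -||x - µi||²/(2σ²) + ln P(ωi)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, MU[i]))
    return -d2 / (2 * SIGMA2) + math.log(PRIOR[i])

def g_linear(x, i):
    """gi(x) = wiᵗx + wi0 with wi = µi/σ², wi0 = -µiᵗµi/(2σ²) + ln P(ωi)."""
    w = [m / SIGMA2 for m in MU[i]]
    w0 = -sum(m * m for m in MU[i]) / (2 * SIGMA2) + math.log(PRIOR[i])
    return sum(wc * xc for wc, xc in zip(w, x)) + w0

def classify(x, g):
    return 1 if g(x, 1) > g(x, 2) else 2
```

The two forms differ only by the class-independent term −xᵗx/(2σ²), so their decisions always agree.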
- Decision boundary is determined by hyperplanes; setting gi(x) = gj(x):
wᵗ(x − x0) = 0
where w = µi − µj and x0 = (1/2)(µi + µj) − [σ²/||µi − µj||²] ln[P(ωi)/P(ωj)] (µi − µj)
- Some comments about this hyperplane:
* It passes through x0.
* It is orthogonal to the line linking the means.
* What happens when P(ωi) = P(ωj)?
* If P(ωi) ≠ P(ωj), x0 shifts away from the more likely mean.
* If σ is very small, the position of the boundary is insensitive to P(ωi) and P(ωj).
- Minimum distance classifier: when P(ωi) is the same for all c classes.
gi(x) = −||x − µi||²/(2σ²)
Case 2: Σi = Σ
- The clusters have hyperellipsoidal shape and the same size (centered at µi).
- If we disregard (d/2) ln 2π and (1/2) ln|Σ| (constants):
gi(x) = −(1/2)(x − µi)ᵗΣ⁻¹(x − µi) + ln P(ωi)
- Minimum distance classifier using the Mahalanobis distance: when P(ωi) is the same for all c classes.
gi(x) = −(1/2)(x − µi)ᵗΣ⁻¹(x − µi)
- Expanding the above expression and disregarding the quadratic term:
gi(x) = wiᵗx + wi0 (linear discriminant)
where wi = Σ⁻¹µi and wi0 = −(1/2)µiᵗΣ⁻¹µi + ln P(ωi)
- Decision boundary is determined by hyperplanes; setting gi(x) = gj(x):
wᵗ(x − x0) = 0
where w = Σ⁻¹(µi − µj) and x0 = (1/2)(µi + µj) − [ln[P(ωi)/P(ωj)] / ((µi − µj)ᵗΣ⁻¹(µi − µj))] (µi − µj)
- We can make a number of comments about this hyperplane:
* It passes through x0.
* It is NOT orthogonal to the line linking the means.
* What happens when P(ωi) = P(ωj)?
* If P(ωi) ≠ P(ωj), x0 shifts away from the more likely mean.
Case 3: Σi = arbitrary
- The clusters have different shapes and sizes (centered at µi).
- If we disregard (d/2) ln 2π (constant):
gi(x) = xᵗWix + wiᵗx + wi0 (quadratic discriminant)
where Wi = −(1/2)Σi⁻¹, wi = Σi⁻¹µi, and wi0 = −(1/2)µiᵗΣi⁻¹µi − (1/2) ln|Σi| + ln P(ωi)
- Decision boundary is determined by hyperquadrics; setting gi(x) = gj(x).
- Decision regions can be disconnected.
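A sketch of the Case 3 discriminant using gi(x) = −(1/2)(x − µi)ᵗΣi⁻¹(x − µi) − (1/2) ln|Σi| + ln P(ωi); the 2D parameters are illustrative, and the covariances are taken diagonal so the inverse and determinant stay simple:

```python
import math

# Illustrative per-class parameters; "var" holds the diagonal of Σi.
CLASSES = {
    1: {"mu": [0.0, 0.0], "var": [1.0, 1.0],  "prior": 0.5},
    2: {"mu": [2.0, 2.0], "var": [4.0, 0.25], "prior": 0.5},
}

def g(x, i):
    """Quadratic discriminant for a diagonal Σi:
    gi(x) = -(1/2) Σd (xd - µd)²/σd² - (1/2) ln|Σi| + ln P(ωi)."""
    c = CLASSES[i]
    maha = sum((xd - md) ** 2 / vd for xd, md, vd in zip(x, c["mu"], c["var"]))
    logdet = sum(math.log(v) for v in c["var"])   # ln|Σi| for a diagonal matrix
    return -0.5 * maha - 0.5 * logdet + math.log(c["prior"])

def classify(x):
    return max(CLASSES, key=lambda i: g(x, i))
```

Because the covariances differ, the xᵗWix term no longer cancels between classes, and the boundary is curved rather than a hyperplane.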
• Practical difficulties
- In practice, we do not know P(ωi) or p(x/ωi).
- We must design our classifier using a set of training data instead.
• Possible solutions
(1) Estimate P(ωi) and p(x/ωi) using the training data.
- Usually, the estimation of P(ωi) is not very difficult.
- Estimating p(x/ωi) from training data poses serious difficulties:
* insufficient number of samples
* the dimensionality of x is large
(2) Assume that p(x/ωi) has a parametric form (e.g., Gaussian).
- In this case, we just need to estimate some parameters (e.g., µ, Σ).
• Main methods for parameter estimation
Maximum Likelihood: It assumes that the parameters are fixed; the best estimate of their value is defined to be the one that maximizes the probability of obtaining the samples actually observed (i.e., the training data).
Bayesian Estimation: It assumes that the parameters are random variables having some known a priori distribution; observation of the samples (i.e., the training data) converts this to a posterior density, which is used to determine the true value of the parameters.
• Maximum Likelihood (ML) Estimation
Assumptions
- The training data is divided into c sets D1, D2, ..., Dc (i.e., c classes).
- The data in each set are drawn independently.
- p(x/ωj) is the joint density of class j, which has a known parametric form with parameters θj (e.g., θj = (µj, Σj)ᵗ for Gaussian data).
Problem
- Given D = {x1, x2, ..., xn}, estimate θ.
- The same procedure will be applied to each data set Dj (i.e., we will solve c separate problems).
ML approach
- The ML estimate is the value θ̂ that maximizes p(D/θ) (i.e., the θ that best supports the training data - maximizes the probability of the observed data).
p(D/θ) = p(x1, x2, ..., xn/θ)
(Note: p(D/θ) depends on θ only; it is not a density in θ, since D is fixed.)
- Since the data are drawn independently, the above probability can be written as follows:
p(D/θ) = Π(k=1 to n) p(xk/θ)
Example: Let us assume we have two coins, one of type I (fair) and one of type II (unfair). Suppose P(h/I) = P(t/I) = 0.5 and P(h/II) = 0.8 and P(t/II) = 0.2. We observe a series of flips of a single coin, and we wish to know what type of coin we are dealing with. Suppose we observe four heads and one tail in sequence:
P(hhhht/I) = P(h/I)P(h/I)P(h/I)P(h/I)P(t/I) = 0.03125
P(hhhht/II) = P(h/II)P(h/II)P(h/II)P(h/II)P(t/II) = 0.08192
Using the ML approach, the coin is of type II (we assume that P(I) = P(II) = 0.5).
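The coin computation above can be checked directly by multiplying the per-flip probabilities:

```python
def seq_likelihood(seq, p_heads):
    """p(D/θ): probability of an observed flip sequence for a given coin,
    using independence of the flips."""
    lik = 1.0
    for flip in seq:
        lik *= p_heads if flip == "h" else (1.0 - p_heads)
    return lik

data = "hhhht"
lik_I  = seq_likelihood(data, 0.5)   # type I (fair):   P(h/I) = 0.5
lik_II = seq_likelihood(data, 0.8)   # type II (unfair): P(h/II) = 0.8
ml_choice = "I" if lik_I > lik_II else "II"
```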
• How can we find the maximum?
∇θ p(D/θ) = 0 (i.e., find the solutions and check the sign of the second derivative)
- It is easier to consider ln p(D/θ):
∇θ ln p(D/θ) = 0 or Σ(k=1 to n) ∇θ ln p(xk/θ) = 0
- The solution θ̂ maximizes p(D/θ) or ln p(D/θ):
θ̂ = arg maxθ ln p(D/θ)
• ML - Gaussian case: Unknown µ
- Consider ln p(x/µ) where p(x/µ) ~ N(µ, Σ):
ln p(x/µ) = −(1/2)(x − µ)ᵗΣ⁻¹(x − µ) − (d/2) ln 2π − (1/2) ln|Σ|
- Setting x = xk:
ln p(xk/µ) = −(1/2)(xk − µ)ᵗΣ⁻¹(xk − µ) − (d/2) ln 2π − (1/2) ln|Σ|
- Computing the gradient, we have:
∇θ ln p(xk/µ) = Σ⁻¹(xk − µ)
- Setting Σ(k=1 to n) ∇θ ln p(xk/µ) = 0 we have:
Σ(k=1 to n) Σ⁻¹(xk − µ) = 0
- The solution µ̂ is given by:
µ̂ = (1/n) Σ(k=1 to n) xk
- The maximum likelihood estimate is simply the sample mean.
• ML - Gaussian case: Unknown µ and Σ
- Let us consider the 1D Gaussian p(x) ~ N(µ, σ²) (i.e., θ = (θ1, θ2) = (µ, σ²)):
ln p(xk/θ) = −(1/2) ln 2πθ2 − (1/(2θ2))(xk − θ1)²
- Computing ∇θ ln p(xk/θ) we have:
∂ ln p(xk/θ)/∂θ1 = (1/θ2)(xk − θ1)
∂ ln p(xk/θ)/∂θ2 = −1/(2θ2) + (xk − θ1)²/(2θ2²)
- Setting ∇θ ln p(D/θ) = 0 we have:
Σ(k=1 to n) (1/θ2)(xk − θ1) = 0
−Σ(k=1 to n) 1/(2θ2) + Σ(k=1 to n) (xk − θ1)²/(2θ2²) = 0
- The solutions θ̂1 = µ̂ and θ̂2 = σ̂² are:
µ̂ = (1/n) Σ(k=1 to n) xk
σ̂² = (1/n) Σ(k=1 to n) (xk − µ̂)²
- In the general case (multivariate Gaussian), the solutions are:
µ̂ = (1/n) Σ(k=1 to n) xk
Σ̂ = (1/n) Σ(k=1 to n) (xk − µ̂)(xk − µ̂)ᵗ
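The 1D solutions can be checked numerically on synthetic data (the "true" parameters below are arbitrary choices). Note that the ML variance divides by n, not n − 1, so it is slightly biased:

```python
import random

random.seed(0)
mu_true, sigma_true = 2.0, 1.5
xs = [random.gauss(mu_true, sigma_true) for _ in range(5000)]

n = len(xs)
mu_hat = sum(xs) / n                                   # µ̂: the sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n    # σ̂²: ML variance (1/n)
```

With 5000 samples, both estimates land close to the true values µ = 2.0 and σ² = 2.25.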
• Maximum a posteriori estimators (MAP)
- Maximize p(θ/D) (or, equivalently, p(D/θ)p(θ)):
p(θ/D) = p(D/θ)p(θ) / p(D) = [Π(k=1 to n) p(xk/θ)] p(θ) / p(D)
- We can maximize p(θ/D) or, equivalently, [Π(k=1 to n) p(xk/θ)] p(θ).
- MAP is equivalent to maximum likelihood when p(θ) is uniform (i.e., when all parameter values are a priori equally likely).
Example: Assuming P(I) = 0.75 and P(II) = 0.25 in the previous example, we have the following estimates:
P(hhhht/I)P(I) = 0.0234375
P(hhhht/II)P(II) = 0.02048
Using the MAP approach, the coin is of type I.
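The MAP version of the coin example can be checked the same way, showing how the prior flips the decision relative to ML:

```python
def seq_likelihood(seq, p_heads):
    """p(D/θ): probability of an observed flip sequence for a given coin."""
    lik = 1.0
    for flip in seq:
        lik *= p_heads if flip == "h" else (1.0 - p_heads)
    return lik

data = "hhhht"
# MAP scores: likelihood times prior (the normalizing P(D) is the same
# for both hypotheses, so it can be ignored when comparing).
score_I  = seq_likelihood(data, 0.5) * 0.75   # p(D/I)  P(I)
score_II = seq_likelihood(data, 0.8) * 0.25   # p(D/II) P(II)
map_choice = "I" if score_I > score_II else "II"
```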
Example: θ = µ and p(µ) ~ N(µ0, σµ²)
∂/∂µ [Σ(k=1 to n) ln p(xk/θ) + ln p(θ)] = 0
Σ(k=1 to n) (1/σ²)(xk − µ) − (1/σµ²)(µ − µ0) = 0 or µ̂ = [µ0 + (σµ²/σ²) Σ(k=1 to n) xk] / [1 + n(σµ²/σ²)]
- If σµ²/σ² >> 1, then µ̂ ≈ (1/n) Σ(k=1 to n) xk (same as the ML estimate)
- What if n → ∞?
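The shrinkage behavior of this estimate can be seen numerically (all parameter values below are invented): the MAP estimate sits between the prior mean µ0 and the sample mean, and since n multiplies σµ²/σ² in the denominator, it approaches the ML estimate as n grows, which answers the n → ∞ question.

```python
import random

random.seed(1)
# Known variances and prior mean (illustrative choices); the data are
# deliberately centered away from µ0 so the shrinkage is visible.
sigma2, sigma_mu2, mu0 = 1.0, 0.5, 0.0
xs = [random.gauss(3.0, 1.0) for _ in range(50)]
n = len(xs)

ratio = sigma_mu2 / sigma2
mu_map = (mu0 + ratio * sum(xs)) / (1 + n * ratio)   # MAP estimate of µ
mu_ml = sum(xs) / n                                  # ML estimate (sample mean)
```

With µ0 = 0 the two estimates differ by the exact factor n·ratio / (1 + n·ratio) = 25/26, i.e., the MAP estimate is pulled slightly toward the prior mean.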
• Maximum likelihood and model correctness
- If the model chosen for p(x/θ) is the correct one, maximum likelihood will give very good results.
- If the model is wrong, maximum likelihood will give poor results.
• Density estimation
- Bayesian inference is based on estimating the density function:
p(y/x) = p(x/y)p(y) / p(x)
- There are three types of models for density estimation:
* parametric
* non-parametric
* semi-parametric
• Parametric models
- This model assumes that the density function has a particular parametric form (e.g., Gaussian).
- This is appropriate if knowledge about the problem domain suggests a specific functional form.
- Maximum likelihood estimation is usually used to estimate the parameters of the model.
• Non-parametric models
- This approach makes as few assumptions about the form of the density as possible.
- Non-parametric methods perform poorly unless huge data sets are available.
Parzen windows
- The density function p(x) is estimated by averaging M kernel functions, each of which is determined by one of the M data points.
- The kernel functions are usually symmetric and unimodal (e.g., Gaussian of fixed variance):
p(x) = (1/M) Σ(m=1 to M) [1/(2πσ²)^(N/2)] exp(−||x − xm||²/(2σ²))
- A disadvantage of this approach is that the number of kernel functions and parameters grows with the size of the data.
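A 1D sketch of the estimator (N = 1, with toy samples and an assumed kernel width σ):

```python
import math

def parzen(x, samples, sigma=0.5):
    """Parzen estimate p(x): the average of M Gaussian kernels,
    one centered at each data point (1D case, N = 1)."""
    M = len(samples)
    norm = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return sum(norm * math.exp(-(x - xm) ** 2 / (2 * sigma ** 2))
               for xm in samples) / M

# Toy data with two clumps; the estimate will have two bumps.
samples = [0.0, 0.2, 0.4, 3.0, 3.1]
```

Because each kernel integrates to 1 and the average of M such kernels is taken, the estimate is itself a valid density.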
Histogram
- A histogram quantizes the data feature space into K regular bins of equal volume.
- The density function is approximated based on the fraction of data that falls in each bin.
* If K is too small, the histogram will not be able to model the distribution.
* If K is too large, lots of data is needed to populate the histogram.
- Like Parzen windows, histograms scale poorly.
• Semi-parametric models
- Particularly useful for estimating density functions of unknown structure from limited data.
* The number of parameters can be varied depending upon the nature of the true probability density.
* The number of parameters is not determined by the size of the data set.
Mixtures
- It is defined as a weighted sum of K components, where each component is a parametric density function p(x/k):
p(x) = Σ(k=1 to K) πk p(x/k)
- The component densities p(x/k) are usually taken to be of the same parametric form (e.g., Gaussians).
- The weights πk are the mixing parameters and they sum to unity:
Σ(k=1 to K) πk = 1
- πk could be regarded as the prior probability of an observation being generated by the k-th mixture component.
- Assuming Gaussian mixtures, the following parameters need to be estimated:
(1) the means of the Gaussians
(2) the covariance matrices of the Gaussians
(3) the mixing parameters
- Maximum-likelihood estimation is not possible in closed analytic form, as it was in the case of a single Gaussian.
- There exists an iterative learning algorithm (the Expectation-Maximization (EM) algorithm) which attempts to maximize the likelihood.
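A minimal EM sketch for a two-component 1D Gaussian mixture on synthetic data; this is the standard E-step/M-step update, written out directly (the data-generating parameters and initial guesses are illustrative):

```python
import math
import random

random.seed(0)
# Synthetic 1D data from two well-separated Gaussians.
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(5.0, 1.0) for _ in range(200)])

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Initial guesses for the means, variances, and mixing parameters πk.
mus, variances, pis = [1.0, 4.0], [1.0, 1.0], [0.5, 0.5]

for _ in range(50):
    # E-step: responsibility of each component k for each point x,
    # i.e., the posterior P(k/x) under the current parameters.
    resp = []
    for x in data:
        w = [pis[k] * gauss(x, mus[k], variances[k]) for k in range(2)]
        s = sum(w)
        resp.append([wk / s for wk in w])
    # M-step: re-estimate each component from its weighted data.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        variances[k] = sum(r[k] * (x - mus[k]) ** 2
                           for r, x in zip(resp, data)) / nk
        pis[k] = nk / len(data)
```

Each iteration is guaranteed not to decrease the likelihood, and with well-separated components the means converge close to the generating values 0 and 5.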