
Bayesian Inference

• Reading Assignments

R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (2.1, 2.4-2.6, 3.1-3.2, hard copy).

Russell and Norvig, Artificial Intelligence: A Modern Approach (Chapter 14, hard copy).

S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (Chapter 3, hard copy).

• Case Studies

H. Schneiderman and T. Kanade, "A Statistical Method for 3D Object Detection Applied to Faces and Cars", Computer Vision and Pattern Recognition Conference, pp. 45-51, 1998 (on-line).

K. Sung and T. Poggio, "Example-based learning for view-based human face detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 39-51, 1998 (on-line).

A. Madabhushi and J. Aggarwal, "A Bayesian approach to human activity recognition", 2nd International Workshop on Visual Surveillance, pp. 25-30, June 1999 (hard copy).

M. Jones and J. Rehg, "Statistical color models with application to skin detection", Technical Report, Compaq Research Labs (on-line).

J. Yang and A. Waibel, "A Real-time Face Tracker", Proceedings of WACV'96, 1996 (on-line).

C. Stauffer and E. Grimson, "Adaptive background mixture models for real-time tracking", IEEE Computer Vision and Pattern Recognition Conference, Vol. 2, pp. 246-252, 1998 (on-line).


• Why bother with probabilities?

- Accounting for uncertainty is a crucial component in decision making (e.g., classification) because of ambiguity in our measurements.

- Probability theory is the proper mechanism for accounting for uncertainty.

- We need to take into account reasonable preferences about the state of the world, for example:

"If the fish was caught in the Atlantic ocean, then it is more likely to be salmon than sea bass."

- We will discuss techniques for building probabilistic models and for extracting information from a probabilistic model.

• Probabilistic Inference

- If we could define all possible values of the probability distribution, then we could read off any probability we were interested in.

- In general, it is not practical to define all possible entries of the joint probability function.

- Probabilistic inference consists of computing probabilities that are not explicitly stored by the reasoning system (e.g., marginals, conditionals).

• Belief

- The conditional probability of an event given some evidence.

We may not know for sure what ails a particular patient, but we believe there is, say, an 80% chance (i.e., a probability of 0.8) that the patient has a cavity if he or she has a toothache.


• Bayes rule

- Very often we want to compute the value of P(hypothesis/evidence).

- Bayes' rule provides a way of computing a conditional probability from its inverse conditional probability:

P(B/A) = P(A/B) P(B) / P(A)

- The denominator P(A) can be considered a normalization constant (it is the value that makes the posteriors P(B/A) sum to 1).

• An example: separate sea bass from salmon

- Some definitions.

State of nature ω (random variable): ω_1 for sea bass, ω_2 for salmon.

Probabilities P(ω_1) and P(ω_2): prior knowledge of how likely it is to get a sea bass or a salmon (priors).

Probability density function p(x): how frequently we will measure a pattern with feature value x (e.g., x is a lightness measurement) (evidence).

Conditional probability density function (pdf) p(x/ω_j): how frequently we will measure a pattern with feature value x given that the pattern belongs to class ω_j (likelihood).

Conditional probability P(ω_j/x): the probability that the fish belongs to class ω_j given measurement x (posterior).


- Decision rule using priors only

Decide ω_1 if P(ω_1) > P(ω_2); otherwise decide ω_2

P(error) = min[P(ω_1), P(ω_2)]

- Classification can be improved by using additional information (i.e., lightness measurements).

- Decision rule using conditional pdf

- The joint pdf of finding a pattern in category ω_j with feature value x is:

p(x, ω_j) = P(ω_j/x) p(x) = p(x/ω_j) P(ω_j)

- Bayes' formula is:

P(ω_j/x) = p(x/ω_j) P(ω_j) / p(x) = (likelihood × prior) / evidence

where p(x) = p(x/ω_1) P(ω_1) + p(x/ω_2) P(ω_2) is essentially a scale factor.

Decide ω_1 if P(ω_1/x) > P(ω_2/x); otherwise decide ω_2

(or, equivalently)

Decide ω_1 if p(x/ω_1) P(ω_1) > p(x/ω_2) P(ω_2); otherwise decide ω_2
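The two forms of the decision rule above can be made concrete with a short sketch. The Gaussian class-conditional densities and all numerical values below are hypothetical (they are not the lecture's data); the point is only to show how the posteriors and the Bayes decision are computed:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x/omega_j) for the lightness feature x,
# modeled as 1D Gaussians with made-up parameters.
priors = {"sea bass": 2/3, "salmon": 1/3}            # P(omega_1), P(omega_2)
likelihoods = {
    "sea bass": norm(loc=7.0, scale=1.0),            # p(x/omega_1)
    "salmon":   norm(loc=4.0, scale=1.2),            # p(x/omega_2)
}

def posteriors(x):
    """Return P(omega_j/x) for each class via Bayes' rule."""
    joint = {c: likelihoods[c].pdf(x) * priors[c] for c in priors}   # p(x/omega_j) P(omega_j)
    evidence = sum(joint.values())                                    # p(x), the scale factor
    return {c: joint[c] / evidence for c in joint}

x = 5.5                                      # a lightness measurement
post = posteriors(x)
print(post, "->", max(post, key=post.get))   # decide the class with the largest posterior
```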


[Figure omitted: posterior probabilities for the lightness feature, assuming P(ω_1) = 2/3 and P(ω_2) = 1/3.]

• Probability of error

P(error/x) = P(ω_1/x) if we decide ω_2
P(error/x) = P(ω_2/x) if we decide ω_1

(or)

P(error/x) = min[P(ω_1/x), P(ω_2/x)]

- Does the above decision rule minimize the probability of error?

P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} P(error/x) p(x) dx


• Where do the probabilities come from?

- The Bayesian rule is optimal if the pmf or pdf is known.

- There are two competing answers to the above question:

(1) Relative frequency (objective) approach.

Probabilities can only come from experiments.

(2) Bayesian (subjective) approach.

Probabilities may reflect degrees of belief and can be based on opinion as well as experiments.

Example (objective): classify cars on UNR campus into two categories:

(1) C1: more than $50K
(2) C2: less than $50K

* Suppose we use one feature x: height of car

* From Bayes rule, we can compute our belief:

P(C1/x) = P(x/C1) P(C1) / P(x)

* Need to estimate P(x/C1), P(x/C2), P(C1), and P(C2)

* Determine prior probabilities

(1) ask drivers at the gate how much their car cost
(2) measure the height of the car

* Suppose we end up with 1209 samples: #C1=221, #C2=988


* P(C1) = 221/1209 = 0.183 and P(C2) = 1 − P(C1) = 0.817

* Determine the class-conditional probabilities (discretize car height into bins and use normalized histograms)


* Calculate the posterior probability for each bin using the Bayes rule, e.g.:

P(C1/x = 1.0) = P(x = 1/C1) P(C1) / P(x = 1) = P(x = 1/C1) P(C1) / [P(x = 1/C1) P(C1) + P(x = 1/C2) P(C2)] = 0.438
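The whole procedure (estimate priors from counts, estimate class-conditional histograms, then apply Bayes rule bin by bin) can be sketched as follows. The height samples are synthetic stand-ins, since the lecture's histograms are not reproduced here, so the resulting posteriors will not match the 0.438 value above:

```python
import numpy as np

# Hypothetical car-height samples (meters) for the two classes; the real UNR data
# (221 C1 samples, 988 C2 samples) is not reproduced here.
rng = np.random.default_rng(0)
heights_c1 = rng.normal(1.5, 0.2, 221)   # stand-in for the >$50K cars
heights_c2 = rng.normal(1.3, 0.2, 988)   # stand-in for the <$50K cars

p_c1 = len(heights_c1) / (len(heights_c1) + len(heights_c2))   # prior P(C1) = 221/1209
p_c2 = 1.0 - p_c1                                               # prior P(C2)

# Discretize height into bins and use normalized histograms as P(x/C1), P(x/C2).
bins = np.linspace(0.8, 2.2, 15)
p_x_given_c1, _ = np.histogram(heights_c1, bins=bins)
p_x_given_c2, _ = np.histogram(heights_c2, bins=bins)
p_x_given_c1 = p_x_given_c1 / p_x_given_c1.sum()
p_x_given_c2 = p_x_given_c2 / p_x_given_c2.sum()

# Posterior P(C1/x) for each bin via Bayes rule (empty bins are left at 0).
evidence = p_x_given_c1 * p_c1 + p_x_given_c2 * p_c2
with np.errstate(divide="ignore", invalid="ignore"):
    posterior_c1 = np.where(evidence > 0, p_x_given_c1 * p_c1 / evidence, 0.0)
print(posterior_c1)
```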

• Functional structure of a general statistical classifier

assign x to ω_i if g_i(x) > g_j(x) for all j ≠ i

(discriminant functions)


• Minimum error-rate case

g_i(x) = P(ω_i/x)

• Is the choice of g_i unique?

- Replacing g_i(x) with f(g_i(x)), where f(·) is monotonically increasing, does not change the classification results.

g_i(x) = p(x/ω_i) P(ω_i) / p(x)

g_i(x) = p(x/ω_i) P(ω_i)

g_i(x) = ln p(x/ω_i) + ln P(ω_i)

• Decision regions/boundaries

- Decision rules divide the feature space into decision regions R_1, ..., R_c.

- The boundaries of the decision regions are the decision boundaries.


• Discriminant functions for the Gaussian density

- Assume the following discriminant function:

g_i(x) = ln p(x/ω_i) + ln P(ω_i)

- If p(x/ω_i) ~ N(µ_i, Σ_i), then

g_i(x) = −(1/2)(x − µ_i)^t Σ_i^{−1} (x − µ_i) − (d/2) ln 2π − (1/2) ln|Σ_i| + ln P(ω_i)

Case 1: Σ_i = σ² I

(1) the features are uncorrelated
(2) each feature has the same variance

- If we disregard (d/2) ln 2π and (1/2) ln|Σ_i| (constants):

g_i(x) = −||x − µ_i||² / (2σ²) + ln P(ω_i)

where ||x − µ_i||² = (x − µ_i)^t (x − µ_i); the ln P(ω_i) term favors the a-priori more likely category.

- Expanding the above expression:

g_i(x) = −(1/(2σ²)) [x^t x − 2 µ_i^t x + µ_i^t µ_i] + ln P(ω_i)

- Disregarding x^t x (constant), we get a linear discriminant:

g_i(x) = w_i^t x + w_i0

where w_i = (1/σ²) µ_i and w_i0 = −(1/(2σ²)) µ_i^t µ_i + ln P(ω_i)
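A minimal numerical sketch of this linear discriminant is given below; the means, priors, and σ² are made-up values used only to show the form w_i^t x + w_i0 in action:

```python
import numpy as np

def case1_discriminants(x, means, priors, sigma2):
    """Linear discriminants g_i(x) = w_i^t x + w_i0 for Sigma_i = sigma^2 I."""
    scores = []
    for mu, prior in zip(means, priors):
        w = mu / sigma2                                        # w_i = mu_i / sigma^2
        w0 = -mu @ mu / (2.0 * sigma2) + np.log(prior)         # w_i0
        scores.append(w @ x + w0)
    return np.array(scores)

# Hypothetical 2D example: two classes with equal spherical covariance.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
priors = [0.5, 0.5]
g = case1_discriminants(np.array([1.0, 1.2]), means, priors, sigma2=1.0)
print("decide class", int(np.argmax(g)) + 1)
```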


- The decision boundary is determined by hyperplanes; setting g_i(x) = g_j(x):

w^t (x − x_0) = 0

where w = µ_i − µ_j and x_0 = (1/2)(µ_i + µ_j) − [σ² / ||µ_i − µ_j||²] ln[P(ω_i)/P(ω_j)] (µ_i − µ_j)

- Some comments about this hyperplane:

* It passes through x_0.
* It is orthogonal to the line linking the means.
* What happens when P(ω_i) = P(ω_j)?
* If P(ω_i) ≠ P(ω_j), x_0 shifts away from the more likely mean.
* If σ is very small, the position of the boundary is insensitive to P(ω_i) and P(ω_j).


- Minimum distance classifier: when P(ω_i) is the same for all c classes,

g_i(x) = −||x − µ_i||² / (2σ²)

"Case 2:" Σi = Σ

- The clusters have hyperellipsoidal shape and same size (centered at µ).

- If we disregard (d/2) ln 2π and (1/2) ln|Σ| (constants):

g_i(x) = −(1/2)(x − µ_i)^t Σ^{−1} (x − µ_i) + ln P(ω_i)

Minimum distance classifier using the Mahalanobis distance: when P(ω_i) is the same for all c classes,

g_i(x) = −(1/2)(x − µ_i)^t Σ^{−1} (x − µ_i)

- Expanding the above expression and disregarding the quadratic term x^t Σ^{−1} x (the same for all classes):

g_i(x) = w_i^t x + w_i0   (linear discriminant)

where w_i = Σ^{−1} µ_i and w_i0 = −(1/2) µ_i^t Σ^{−1} µ_i + ln P(ω_i)
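A short sketch of this shared-covariance discriminant (written in its Mahalanobis form plus the log-prior) is given below; the covariance, means, and priors are made-up values:

```python
import numpy as np

def case2_discriminants(x, means, priors, cov):
    """Discriminants g_i(x) = -0.5 (x - mu_i)^t Sigma^{-1} (x - mu_i) + ln P(omega_i)
    for a covariance matrix Sigma shared by all classes."""
    cov_inv = np.linalg.inv(cov)
    scores = []
    for mu, prior in zip(means, priors):
        d = x - mu
        scores.append(-0.5 * d @ cov_inv @ d + np.log(prior))   # Mahalanobis term + prior
    return np.array(scores)

# Hypothetical 2D example with a shared, non-diagonal covariance.
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
g = case2_discriminants(np.array([1.0, 0.5]), means, [0.6, 0.4], cov)
print("decide class", int(np.argmax(g)) + 1)
```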


- The decision boundary is again determined by hyperplanes; setting g_i(x) = g_j(x):

w^t (x − x_0) = 0

where w = Σ^{−1} (µ_i − µ_j) and x_0 = (1/2)(µ_i + µ_j) − [ln(P(ω_i)/P(ω_j)) / ((µ_i − µ_j)^t Σ^{−1} (µ_i − µ_j))] (µ_i − µ_j)

- We can make a number of comments about this hyperplane:

* It passes through x_0.
* It is NOT orthogonal to the line linking the means.
* What happens when P(ω_i) = P(ω_j)?
* If P(ω_i) ≠ P(ω_j), x_0 shifts away from the more likely mean.


Case 3: Σ_i arbitrary

- The clusters have different shapes and sizes (each centered at its mean µ_i).

- If we disregard (d/2) ln 2π (constant):

g_i(x) = x^t W_i x + w_i^t x + w_i0   (quadratic discriminant)

where W_i = −(1/2) Σ_i^{−1}, w_i = Σ_i^{−1} µ_i, and w_i0 = −(1/2) µ_i^t Σ_i^{−1} µ_i − (1/2) ln|Σ_i| + ln P(ω_i)

- The decision boundary is determined by hyperquadrics; setting g_i(x) = g_j(x).

- Decision regions can be disconnected.
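A minimal sketch of the quadratic discriminant is given below; the means, covariances, and priors are made-up two-class values:

```python
import numpy as np

def quadratic_discriminant(x, mu, cov, prior):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0 for an arbitrary class covariance Sigma_i."""
    cov_inv = np.linalg.inv(cov)
    W = -0.5 * cov_inv                                   # W_i
    w = cov_inv @ mu                                     # w_i
    w0 = (-0.5 * mu @ cov_inv @ mu
          - 0.5 * np.log(np.linalg.det(cov))
          + np.log(prior))                               # w_i0
    return x @ W @ x + w @ x + w0

# Hypothetical two-class problem with different covariances (values made up).
x = np.array([1.0, -0.5])
g1 = quadratic_discriminant(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
g2 = quadratic_discriminant(x, np.array([2.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5)
print("decide class", 1 if g1 > g2 else 2)
```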


• Practical difficulties

- In practice, we do not know P(ω_i) or p(x/ω_i).

- We must design our classifier from a set of training data.

• Possible solutions

(1) Estimate P(ω i) and p(x/ω i) using the training data.

- Usually, the estimation of P(ω i) is not very difficult.

- Estimating p(x/ω_i) from training data poses serious difficulties:

* insufficient number of samples
* the dimensionality of x is large

(2) Assume that p(x/ω i) has a parametric form (e.g., Gaussian)

- In this case, we just need to estimate some parameters (e.g., µ, Σ)

• Main methods for parameter estimation

Maximum Likelihood: assumes that the parameters are fixed; the best estimate of their value is defined to be the one that maximizes the probability of obtaining the samples actually observed (i.e., the training data).

Bayesian Estimation: assumes that the parameters are random variables having some known a priori distribution; observation of the samples (i.e., the training data) converts this to a posterior density, which is then used to estimate the parameters.


• Maximum Likelihood (ML) Estimation

Assumptions

- The training data are divided into c sets D_1, D_2, ..., D_c (i.e., c classes).

- The data in each set are drawn independently.

- p(x/ω_j) is the class-conditional density of class j, which has a known parametric form with parameters θ_j (e.g., θ_j = (µ_j, Σ_j) for Gaussian data).

Problem

- Given D = {x_1, x_2, ..., x_n}, estimate θ.

- The same procedure is applied to each data set D_j (i.e., we solve c separate problems).

ML approach

- The ML estimate is the value θ̂ that maximizes p(D/θ) (i.e., the θ that best supports the training data by maximizing the probability of the observed data):

p(D/θ) = p(x_1, x_2, ..., x_n/θ)

(Note: p(D/θ) is viewed as a function of θ only; it is not a density in θ since D is fixed.)

- Since the data are drawn independently, the above probability can be written as follows:

p(D/θ) = Π_{k=1}^{n} p(x_k/θ)


Example: Assume we have two coins, one of type I (fair) and one of type II (unfair). Suppose P(h/I) = P(t/I) = 0.5, and P(h/II) = 0.8, P(t/II) = 0.2. We observe a series of flips of a single coin and wish to know which type of coin we are dealing with. Suppose we observe four heads and one tail in sequence:

P(hhhht/I) = P(h/I) P(h/I) P(h/I) P(h/I) P(t/I) = 0.03125

P(hhhht/II) = P(h/II) P(h/II) P(h/II) P(h/II) P(t/II) = 0.08192

Using the ML approach, the coin is of type II (here we assume P(I) = P(II) = 0.5).
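The coin example is easy to reproduce; a minimal sketch:

```python
# ML comparison for the coin example: which coin type best explains the sequence hhhht?
p_heads = {"I": 0.5, "II": 0.8}          # P(h/I), P(h/II) from the example
sequence = "hhhht"

def likelihood(coin, seq):
    """p(D/theta): product of per-flip probabilities, flips assumed independent."""
    result = 1.0
    for flip in seq:
        result *= p_heads[coin] if flip == "h" else 1.0 - p_heads[coin]
    return result

lik = {coin: likelihood(coin, sequence) for coin in p_heads}
print(lik)                                     # {'I': 0.03125, 'II': 0.08192...}
print("ML decision:", max(lik, key=lik.get))   # type II
```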

• How can we find the maximum?

∇_θ p(D/θ) = 0 (i.e., find the solutions and check the sign of the second derivative)

- It is easier to consider ln p(D/θ):

∇_θ ln p(D/θ) = 0, or Σ_{k=1}^{n} ∇_θ ln p(x_k/θ) = 0

- The solution θ̂ maximizes p(D/θ) or ln p(D/θ):

θ̂ = arg max_θ ln p(D/θ)


• ML - Gaussian case: Unknown µ

- Consider ln p(x/µ) where p(x/µ) ~ N(µ, Σ):

ln p(x/µ) = −(1/2)(x − µ)^t Σ^{−1} (x − µ) − (d/2) ln 2π − (1/2) ln|Σ|

- Setting x = x_k:

ln p(x_k/µ) = −(1/2)(x_k − µ)^t Σ^{−1} (x_k − µ) − (d/2) ln 2π − (1/2) ln|Σ|

- Computing the gradient, we have:

∇_µ ln p(x_k/µ) = Σ^{−1} (x_k − µ)

- Setting ∇_µ ln p(D/µ) = Σ_{k=1}^{n} ∇_µ ln p(x_k/µ) = 0, we have:

Σ_{k=1}^{n} Σ^{−1} (x_k − µ) = 0

- The solution µ̂ is given by:

µ̂ = (1/n) Σ_{k=1}^{n} x_k

- The maximum likelihood estimate is simply the sample mean.


• ML - Gaussian case: Unknown µ and Σ

- Consider the 1D Gaussian p(x) ~ N(µ, σ²) (i.e., θ = (θ_1, θ_2) = (µ, σ²)):

ln p(x_k/θ) = −(1/2) ln 2πθ_2 − (1/(2θ_2)) (x_k − θ_1)²

- Computing ∇_θ ln p(x_k/θ), we have:

∂ ln p(x_k/θ) / ∂θ_1 = (1/θ_2)(x_k − θ_1)

∂ ln p(x_k/θ) / ∂θ_2 = −1/(2θ_2) + (x_k − θ_1)² / (2θ_2²)

- Setting ∇_θ ln p(D/θ) = 0, we have:

Σ_{k=1}^{n} (1/θ_2)(x_k − θ_1) = 0

−Σ_{k=1}^{n} 1/(2θ_2) + Σ_{k=1}^{n} (x_k − θ_1)² / (2θ_2²) = 0

- The solutions θ̂_1 = µ̂ and θ̂_2 = σ̂² are:

µ̂ = (1/n) Σ_{k=1}^{n} x_k

σ̂² = (1/n) Σ_{k=1}^{n} (x_k − µ̂)²

- In the general case (multivariate Gaussian), the solutions are:

µ̂ = (1/n) Σ_{k=1}^{n} x_k

Σ̂ = (1/n) Σ_{k=1}^{n} (x_k − µ̂)(x_k − µ̂)^t
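These closed-form estimates (the sample mean and the 1/n-normalized sample covariance) can be checked numerically; the "true" parameters below are arbitrary:

```python
import numpy as np

# ML estimates for a multivariate Gaussian: sample mean and (biased, 1/n) covariance.
rng = np.random.default_rng(1)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=500)   # synthetic training data

n = X.shape[0]
mu_hat = X.mean(axis=0)                 # (1/n) sum_k x_k
diff = X - mu_hat
cov_hat = diff.T @ diff / n             # (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t

print(mu_hat)
print(cov_hat)
```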


• Maximum a posteriori estimators (MAP)

- Maximize p(θ/D) (or, equivalently, p(D/θ) p(θ)):

p(θ/D) = p(D/θ) p(θ) / p(D) = [Π_{k=1}^{n} p(x_k/θ)] p(θ) / p(D)

maximize p(θ/D), or equivalently Π_{k=1}^{n} p(x_k/θ) p(θ)

- MAP is equivalent to maximum likelihood when p(θ) is uniform (i.e., a flat prior over the parameters).

Example: Assuming P(I) = 0.75 and P(II) = 0.25 in the previous coin example, we have the following estimates:

P(hhhht/I) P(I) = 0.0234375

P(hhhht/II) P(II) = 0.02048

Using the MAP approach, the coin is of type I.

Example: θ = µ with prior p(µ) ~ N(µ_0, σ_µ²) and known σ²:

∂/∂µ [ Σ_{k=1}^{n} ln p(x_k/θ) + ln p(θ) ] = 0

Σ_{k=1}^{n} (1/σ²)(x_k − µ) − (1/σ_µ²)(µ − µ_0) = 0

or

µ̂ = [µ_0 + (σ_µ²/σ²) Σ_{k=1}^{n} x_k] / [1 + n σ_µ²/σ²]

- If σ_µ²/σ² >> 1, then µ̂ ≈ (1/n) Σ_{k=1}^{n} x_k (same as ML).

- What if n → ∞?
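As n → ∞ (or as the prior variance σ_µ² grows), the MAP estimate approaches the sample mean, i.e., the ML estimate. A minimal sketch of the closed-form MAP estimate above, with made-up values for σ², µ_0, and σ_µ²:

```python
import numpy as np

def map_mean(x, sigma2, mu0, sigma_mu2):
    """MAP estimate of a Gaussian mean with a Gaussian prior N(mu0, sigma_mu2),
    following the closed-form expression above (sigma2 is the known data variance)."""
    n = len(x)
    ratio = sigma_mu2 / sigma2
    return (mu0 + ratio * x.sum()) / (1.0 + n * ratio)

# Hypothetical data; the prior pulls the estimate toward mu0 when sigma_mu2 is small.
rng = np.random.default_rng(2)
x = rng.normal(5.0, 1.0, size=20)
print("ML  estimate:", x.mean())
print("MAP estimate:", map_mean(x, sigma2=1.0, mu0=0.0, sigma_mu2=0.5))
```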


• Maximum likelihood and model correctness

- If the model chosen for p(x/θ) is the correct one, maximum likelihood will give very good results.

- If the model is wrong, maximum likelihood will give poor results.

Density estimation

- Bayesian inference is based on estimating the density function.

p(y/x) = p(x/y) p(y) / p(x)

- There are three types of models for density estimation:

* parametric
* non-parametric
* semi-parametric

• Parametric models

- This model assumes that the density function has a particular parametric form (e.g., Gaussian).

- This is appropriate if knowledge about the problem domain suggests a specific functional form.

- Maximum likelihood estimation is usually used to estimate the parameters of the model.


• Non-parametric models

- This approach makes as few assumptions about the form of the density as possible.

- Non-parametric methods perform poorly unless huge data sets are available.

Parzen windows

- The density function p(x) is estimated by averaging M kernel functions, each of which is determined by one of the M data points.

- The kernel functions are usually symmetric and unimodal (e.g., Gaussian of fixed variance):

p(x) = (1/M) Σ_{m=1}^{M} [1/(2πσ²)^{N/2}] exp(−||x − x_m||² / (2σ²))

- A disadvantage of this approach is that the number of kernel functions and parameters grows with the size of the data.
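A minimal sketch of this estimator with Gaussian kernels; the data and the smoothing parameter σ are made up:

```python
import numpy as np

def parzen_density(x, samples, sigma):
    """Parzen window estimate of p(x): average of Gaussian kernels of fixed
    variance sigma^2 centered on the M data points."""
    samples = np.atleast_2d(samples)
    M, N = samples.shape
    sq_dist = np.sum((samples - x) ** 2, axis=1)            # ||x - x_m||^2
    norm = (2.0 * np.pi * sigma ** 2) ** (N / 2)
    return np.mean(np.exp(-sq_dist / (2.0 * sigma ** 2)) / norm)

# Hypothetical 1D data; sigma is a smoothing parameter chosen by hand here.
rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, size=200).reshape(-1, 1)
print(parzen_density(np.array([0.0]), data, sigma=0.3))
```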

Histogram

- A histogram quantizes the data feature space into regular bins of equal volume.

- The density function is approximated based on the fraction of data that falls in each bin.

* If the number of bins K is too small, the histogram will not be able to model the distribution.
* If K is too large, a lot of data is needed to populate the histogram.

- Like Parzen windows, histograms scale poorly.


• Semi-parametric models

- Particularly useful for estimating density functions of unknown structure from limited data.

* The number of parameters can be varied depending upon the nature of the true probability density.

* The number of parameters is not determined by the size of the data set.

Mixtures

- A mixture is defined as a weighted sum of K components, where each component is a parametric density function p(x/k):

p(x) = Σ_{k=1}^{K} π_k p(x/k)

- The component densities p(x/k) are usually taken to be of the same parametric form (e.g., Gaussians).


- The weights π_k are the mixing parameters, and they sum to unity:

Σ_{k=1}^{K} π_k = 1

- π_k can be regarded as the prior probability of an observation being generated by the k-th mixture component.

- Assuming Gaussian mixtures, the following parameters need to be estimated:

(1) the means of the Gaussians
(2) the covariance matrices of the Gaussians
(3) the mixing parameters

- Maximum-likelihood estimation is not possible using a closed analytic form as in the case of a single Gaussian.

- There exists an iterative learning algorithm, the Expectation-Maximization (EM) algorithm, which attempts to maximize the likelihood.
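Once its parameters are known (estimating them is what EM is for), a Gaussian mixture density of this form can be evaluated directly; a minimal sketch with made-up parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, weights, means, covs):
    """Evaluate p(x) = sum_k pi_k p(x/k) for a Gaussian mixture."""
    assert np.isclose(sum(weights), 1.0), "mixing parameters must sum to unity"
    return sum(w * multivariate_normal(mean=m, cov=c).pdf(x)
               for w, m, c in zip(weights, means, covs))

# Hypothetical two-component mixture in 2D (all parameters made up).
weights = [0.3, 0.7]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[2.0, 0.4], [0.4, 1.0]])]
print(mixture_density(np.array([1.0, 1.0]), weights, means, covs))
```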