
Gaussian Mixture Models (GMM) and ML Estimation Examples

Mean and Variance of a Gaussian

• Consider the Gaussian PDF.
• Given the observations (a sample), form the log-likelihood function.
• Take the derivatives with respect to μ and σ and set them to zero.
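The formulas on this slide were embedded as images; the following is a standard reconstruction of the steps it names (my sketch, not the slide's own rendering). For n i.i.d. observations x_1, ..., x_n from N(μ, σ²):

\ell(\mu,\sigma) = \sum_{i=1}^{n} \log\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}
= -n\log\sigma - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2

\frac{\partial\ell}{\partial\mu} = 0 \;\Rightarrow\; \hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\frac{\partial\ell}{\partial\sigma} = 0 \;\Rightarrow\; \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat\mu)^2

That is, the ML estimates are the sample mean and the (biased) sample variance, which is the "sample mean and variance" solution noted later in these slides.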

Let us look at the log-likelihood function:

\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log P(X_i \mid \theta)
= 2\left(\log\tfrac{2}{3} + \log\theta\right) + 3\left(\log\tfrac{1}{3} + \log\theta\right) + 3\left(\log\tfrac{2}{3} + \log(1-\theta)\right) + 2\left(\log\tfrac{1}{3} + \log(1-\theta)\right)
= C + 5\log\theta + 5\log(1-\theta),

where C is a constant that does not depend on θ. It can be seen that the log-likelihood function is easier to maximize than the likelihood function.

Setting the derivative of ℓ(θ) with respect to θ to zero,

\frac{d\ell(\theta)}{d\theta} = \frac{5}{\theta} - \frac{5}{1-\theta} = 0,

gives the MLE \hat\theta = 0.5. Recall that the method-of-moments estimate is \hat\theta = 5/12, which differs from the MLE.
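As a quick numerical check (my addition, not part of the original notes), the log-likelihood above can be maximized on a grid; only the expression C + 5 log θ + 5 log(1 − θ) is assumed.

import numpy as np

# Example 1: l(theta) = C + 5*log(theta) + 5*log(1 - theta), with C constant.
theta = np.linspace(1e-6, 1 - 1e-6, 100001)          # grid over (0, 1)
loglik = 5 * np.log(theta) + 5 * np.log(1 - theta)   # drop the constant C
print(theta[np.argmax(loglik)])                       # ~0.5, matching the MLE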

Example 2: Suppose X_1, X_2, ..., X_n are i.i.d. random variables with density function

f(x \mid \sigma) = \frac{1}{2\sigma}\exp\!\left(-\frac{|x|}{\sigma}\right);

find the maximum likelihood estimate of σ.

Solution: The log-likelihood function is

\ell(\sigma) = \sum_{i=1}^{n}\left[-\log 2 - \log\sigma - \frac{|X_i|}{\sigma}\right]

Setting the derivative with respect to σ to zero,

\ell'(\sigma) = \sum_{i=1}^{n}\left[-\frac{1}{\sigma} + \frac{|X_i|}{\sigma^2}\right] = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}|X_i|}{\sigma^2} = 0,

gives the MLE of σ:

\hat\sigma = \frac{\sum_{i=1}^{n}|X_i|}{n}

Again this differs from the method-of-moments estimate, which is

\hat\sigma = \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{2n}}
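The density in Example 2 is the Laplace (double-exponential) density with scale σ, so the two estimators can be compared on simulated data. A minimal sketch (my own, assuming NumPy's Laplace sampler):

import numpy as np

# f(x | sigma) = 1/(2*sigma) * exp(-|x|/sigma) is the Laplace density with scale sigma.
rng = np.random.default_rng(0)
sigma_true = 2.0
x = rng.laplace(loc=0.0, scale=sigma_true, size=10_000)

sigma_mle = np.abs(x).mean()               # MLE: sum |X_i| / n
sigma_mom = np.sqrt((x ** 2).mean() / 2)   # method of moments: sqrt(sum X_i^2 / (2n))
print(sigma_mle, sigma_mom)                # both near 2.0, but generally not identical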

Example 3: Use the maximum likelihood method to estimate the parameters μ and σ for the normal density

f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},

based on a random sample X_1, ..., X_n.

Solution: In this example we have two unknown parameters, μ and σ, so the parameter θ = (μ, σ) is a vector. We first write out the log-likelihood function:

\ell(\mu,\sigma) = \sum_{i=1}^{n}\left[-\log\sigma - \frac{1}{2}\log 2\pi - \frac{1}{2\sigma^2}(X_i-\mu)^2\right]
= -n\log\sigma - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i-\mu)^2

Setting the partial derivatives to zero, we have

\frac{\partial\ell(\mu,\sigma)}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i-\mu) = 0
\qquad
\frac{\partial\ell(\mu,\sigma)}{\partial\sigma} = -\frac{n}{\sigma} + \sigma^{-3}\sum_{i=1}^{n}(X_i-\mu)^2 = 0

Solving these equations gives the MLEs of μ and σ:

\hat\mu = \bar X \quad\text{and}\quad \hat\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)^2}

This time the MLE coincides with the method-of-moments result.

From these examples, we can see that the maximum likelihood estimate may or may not coincide with the method-of-moments estimate.

Example 4: The Pareto distribution has been used in economics as a model for a density function with a slowly decaying tail:

f(x \mid x_0, \theta) = \theta\, x_0^{\theta}\, x^{-\theta-1}, \qquad x \ge x_0,\; \theta > 1

Assume that x_0 > 0 is given and that X_1, X_2, ..., X_n is an i.i.d. sample. Find the MLE of θ.

Solution: The log-likelihood function is

\ell(\theta) = \sum_{i=1}^{n}\log f(X_i \mid \theta) = \sum_{i=1}^{n}\left(\log\theta + \theta\log x_0 - (\theta+1)\log X_i\right)
= n\log\theta + n\theta\log x_0 - (\theta+1)\sum_{i=1}^{n}\log X_i

Setting the derivative with respect to θ to zero,

\frac{d\ell(\theta)}{d\theta} = \frac{n}{\theta} + n\log x_0 - \sum_{i=1}^{n}\log X_i = 0,

gives the MLE

\hat\theta = \frac{n}{\sum_{i=1}^{n}\log X_i - n\log x_0}.
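This closed-form estimate is easy to check numerically. A small sketch (my own; it assumes NumPy's Lomax-form pareto sampler, shifted and scaled to the classical Pareto used above):

import numpy as np

# Classical Pareto with scale x0 and shape theta: numpy's pareto() draws the
# Lomax form, so add 1 and multiply by x0.
rng = np.random.default_rng(1)
x0, theta_true, n = 1.5, 3.0, 100_000
x = (rng.pareto(theta_true, size=n) + 1) * x0

theta_mle = n / (np.log(x).sum() - n * np.log(x0))   # MLE derived above
print(theta_mle)                                      # should be close to 3.0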


Solution

• Sample mean and variance.


Gaussian Mixture Model (GMM)

Gaussian Mixture Model

• Probabilistic story: each cluster is associated with a Gaussian distribution. To generate data, randomly choose a cluster k with probability π_k and sample from its distribution.
• Likelihood:

\Pr(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K}\pi_k = 1,\; 0 \le \pi_k \le 1.

(Slide credit: Sriram Sankararaman, Clustering.)


x is multidimensional. (Ref: https://people.eecs.berkeley.edu/~jordan/courses/294-fall09/lectures/clustering/slides.pdf)

Can we use the ML estimation method to estimate the unknown parameters μ_k, Σ_k, π_k?

• It is not easy; however, it is possible to obtain an iterative solution!

Gaussian Mixture Model

• The loss function is the negative log-likelihood:

-\log \Pr(x \mid \pi, \mu, \Sigma) = -\sum_{i=1}^{n} \log\left\{\sum_{k=1}^{K}\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right\}

• Why is this function difficult to optimize? Notice that the sum over the components appears inside the log, thus coupling all the parameters.
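To see the coupling concretely, here is a small NumPy/SciPy function (my own illustration, not code from the slides) that evaluates this negative log-likelihood; the sum over k sits inside the log, so no single parameter separates out.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_neg_log_likelihood(X, pis, mus, Sigmas):
    """-sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k) for data X of shape (n, d)."""
    # log of each weighted component density, stacked into shape (n, K)
    log_terms = np.column_stack([
        np.log(pi_k) + multivariate_normal.logpdf(X, mean=mu_k, cov=Sigma_k)
        for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
    ])
    # the log of a sum: all components (and hence all parameters) are coupled
    return -logsumexp(log_terms, axis=1).sum()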

We can estimate the parameters using the iterative Expectation-Maximization (EM) algorithm.

The latent variable z_ik represents the contribution of the k-th Gaussian to x_i. Taking the derivative of the log-likelihood with respect to μ_k, Σ_k, and π_k and setting it to zero yields the update equations used in the EM algorithm.

Gaussian Mixture Model

• Given the observations x_i, i = 1, 2, ..., n, each x_i is associated with a latent variable z_i = (z_{i1}, ..., z_{iK}).
• Given the complete data (x, z) = (x_i, z_i), i = 1, ..., n, we can estimate the parameters by maximizing the complete-data log-likelihood:

\log \Pr(x, z \mid \pi, \mu, \Sigma) = \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\left\{\log\pi_k + \log\mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right\}

• Notice that the π_k and (μ_k, Σ_k) decouple, so a trivial closed-form solution exists.
• We need a procedure that optimizes the log-likelihood by working with the (easier) complete-data log-likelihood:
  • "Fill in" the latent variables using the current estimate of the parameters.
  • Adjust the parameters based on the filled-in variables.

Iterate for k = 1, 2, ...

• Initialize with μ^(0), σ^(0) I, and π^(0).
• Apply the update equations at the k-th iteration (given below).

It may not converge to the global optimum!

The Expectation-Maximization (EM) Algorithm

• E-step: Given the parameters, compute the responsibilities

r_{ik} = E(z_{ik}) = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{k'=1}^{K}\pi_{k'}\, \mathcal{N}(x_i \mid \mu_{k'}, \Sigma_{k'})}

• M-step: Maximize the expected complete-data log-likelihood

E[\log \Pr(x, z \mid \pi, \mu, \Sigma)] = \sum_{i=1}^{n}\sum_{k=1}^{K} r_{ik}\left\{\log\pi_k + \log\mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right\}

to update the parameters:

\pi_k = \frac{\sum_i r_{ik}}{n}, \qquad
\mu_k = \frac{\sum_i r_{ik}\, x_i}{\sum_i r_{ik}}, \qquad
\Sigma_k = \frac{\sum_i r_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^T}{\sum_i r_{ik}}

• Iterate until the likelihood converges.
• Converges to a local optimum of the log-likelihood.
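A compact NumPy/SciPy sketch of these E- and M-step updates (my own illustration under assumptions: full covariance matrices, means initialized at random data points, and a small ridge term reg to keep the covariances invertible, in the spirit of the regularization mentioned in the example that follows):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, reg=1e-6, seed=0):
    """Fit a K-component GMM to X (n x d) with the EM updates shown above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(n, K, replace=False)]                      # initial means
    Sigmas = np.array([np.cov(X.T) + reg * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities r_ik proportional to pi_k N(x_i | mu_k, Sigma_k)
        r = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)
        ])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixing weights, means, and covariances
        Nk = r.sum(axis=0)
        pis = Nk / n
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)
    return pis, mus, Sigmas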


Example (Introduction to Speech Processing, Ricardo Gutierrez-Osuna, CSE@TAMU)

• GMM example
  – Training set: n = 900 examples from a uniform pdf inside an annulus
  – Model: GMM with C = 30 Gaussian components
  – Training procedure:
    • Gaussian centers initialized by choosing 30 arbitrary training examples
    • Covariance matrices initialized to be diagonal, with large variance compared to that of the training data
    • To avoid singularities, at every iteration the covariance matrices computed with EM were regularized with a small multiple of the identity matrix
    • Components whose mixing coefficients fell below a threshold were removed
  – Illustrative results are summarized in the figure note below, followed by a code sketch of this setup.

Observations: blue dots; ellipses represent the fitted 2-D Gaussians.

[Figure: snapshots of the GMM fit at iterations 0, 25, 50, 75, 275, and 300; not reproduced here.]
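The setup above can be approximated with scikit-learn's GaussianMixture (a sketch under assumptions: the annulus radii 1.0-1.5 are illustrative, reg_covar stands in for the identity-matrix regularization, and sklearn does not prune low-weight components, so that step is omitted):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n, r_in, r_out = 900, 1.0, 1.5                        # assumed annulus bounds
r = np.sqrt(rng.uniform(r_in**2, r_out**2, n))        # uniform over the annulus area
phi = rng.uniform(0.0, 2 * np.pi, n)
X = np.column_stack([r * np.cos(phi), r * np.sin(phi)])

gmm = GaussianMixture(n_components=30, covariance_type='full',
                      reg_covar=1e-6, init_params='random',
                      max_iter=300, random_state=0).fit(X)
print(gmm.weights_.round(3))                          # mixing coefficients pi_k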

Vector Quantization = K-means Clustering

(Introduction to Speech Processing, Ricardo Gutierrez-Osuna, CSE@TAMU)

• k-means clustering
  – The k-means algorithm is a simple procedure that attempts to group a collection of unlabeled examples X = {x_1, ..., x_n} into one of C clusters.
  – k-means seeks to find compact clusters, as measured by

J_{MSE} = \sum_{c=1}^{C}\sum_{x \in \omega_c}\lVert x - \mu_c\rVert^2, \qquad \mu_c = \frac{1}{n_c}\sum_{x \in \omega_c} x

  – It can be shown that k-means is a special case of the GMM-EM algorithm.
  – Procedure (a code sketch follows the note below):
    1. Define the number of clusters.
    2. Initialize clusters by (a) an arbitrary assignment of examples to clusters, or (b) an arbitrary set of cluster centers (i.e., use some examples as centers).
    3. Compute the sample mean of each cluster.
    4. Reassign each example to the cluster with the nearest mean.
    5. If the classification of all samples has not changed, stop; otherwise go to step 3.

You may initialize the GMM algorithm from the k-means cluster centers!
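A minimal NumPy implementation of the five-step procedure above (my own sketch, not code from the slides); the returned centers could also serve as initial means for the GMM-EM algorithm.

import numpy as np

def kmeans(X, C, n_iter=100, seed=0):
    """Plain k-means: assign each point to the nearest mean, then recompute the means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), C, replace=False)]    # step 2b: some examples as centers
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                # step 4: nearest mean
        if np.array_equal(new_labels, labels):           # step 5: stop if nothing changed
            break
        labels = new_labels
        for c in range(C):                               # step 3: recompute sample means
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels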

Example: Speaker Identification

• Feature extraction from the data using the mel-cepstrum (MFCCs).
• Extract feature vectors for each speaker (e.g., from 90 s of data per speaker).
• Frame length: 10 ms.

Speaker model: GMM

• λ represents a person (speaker).
• GMM model with mixture weights satisfying \sum_i p_i = 1.
• Train the model from the observations using the iterative (EM) algorithm for each speaker.

How do we decide?

(CSC401/2511, Spring 2016)

Classifying speakers

• Similarly, all of the speech produced by one speaker will cluster differently in MFCC space than speech from another speaker.
• We can therefore decide whether a given observation comes from one speaker or another.

[Observation matrix: rows are the 42 MFCC dimensions, columns are frames at times 1, ..., T.]

If P(O | λ_1) > P(O | λ_2), the observation matrix O is attributed to speaker 1: she is the person!
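In code, this decision rule amounts to scoring the MFCC observation matrix under each speaker's GMM and taking the argmax. A hypothetical sketch (the speaker names, the 32-component choice, and the mfcc_* arrays are illustrative placeholders, not from the slides):

import numpy as np
from sklearn.mixture import GaussianMixture

def identify_speaker(O, speaker_gmms):
    """Return the speaker s maximizing log P(O | lambda_s) for an MFCC matrix O (T x 42)."""
    # score_samples gives per-frame log-likelihoods; frames are treated as independent
    scores = {name: gmm.score_samples(O).sum() for name, gmm in speaker_gmms.items()}
    return max(scores, key=scores.get)

# Hypothetical usage: one GMM per enrolled speaker, trained on that speaker's MFCC frames.
# speaker_gmms = {"alice": GaussianMixture(32).fit(mfcc_alice),
#                 "bob":   GaussianMixture(32).fit(mfcc_bob)}
# print(identify_speaker(mfcc_test, speaker_gmms))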

Speaker identification problem solution

• Speaker identification result: 16 speakers.

Exponential Distribution Example


You get a different estimate for the standard deviation than with the method of moments (see Example 2 above).

References

• D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, 1995.
• D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, 2000.
• C. M. Bishop, Pattern Recognition and Machine Learning, 2006.
• https://people.eecs.berkeley.edu/~jordan/courses/294-fall09/lectures/clustering/slides.pdf