Singularity structures and parameter estimation in mixture...


Transcript of Singularity structures and parameter estimation in mixture...

Page 1: Singularity structures and parameter estimation in mixture models (dept.stat.lsa.umich.edu/~xuanlong/Talks/B_singular_NUS.pdf)

Singularity structures and parameter estimation in mixture models

Long Nguyen

Department of Statistics, University of Michigan

Institute for Mathematical Sciences, National University of Singapore

June 2016

Joint work with Nhat Ho (UM)

Long Nguyen (UM) June 2016 1 / 50

Page 2:

machine learning: “model-free optimization-based algorithms”
- isn’t it the spirit of empirical likelihood based methods?
- prediction vs estimation/inference

Bayesian nonparametrics:
- how to be Bayesian, yet more empirical by being nonparametric

empirical likelihood:
- how to take the inference a bit beyond empirical distributions

modern statistics / data science
- data increasingly high-dimensional and complex
- inferential goals increasingly more ambitious
- requiring more sophisticated algorithms and complex statistical modeling

Long Nguyen (UM) June 2016 2 / 50


Page 5:

Mixture modeling

Mixture density

p(x) = ∑_{i=1}^k p_i f(x | η_i),  e.g., f(x | η_i) = Normal(x | μ_i, Σ_i)

Parameter estimation
- mixing probabilities p = (p_1, ..., p_k)?
- mixing components η = (η_1, ..., η_k)?

Long Nguyen (UM) June 2016 3 / 50

Page 6:

Hierarchical models

HM = Mixture of mixture models

Challenge: parameter estimation in latent variable models

[courtesy M. Jordan’s slides]

Long Nguyen (UM) June 2016 4 / 50

Page 7:

Estimation in parametric models

Standard methods, e.g., maximum likelihood estimation (via EM) or Bayesian estimation, yield the root-n parameter estimation rate

if the Fisher information matrix is non-singular

What if we are in a singular situation?

Long Nguyen (UM) June 2016 5 / 50


Page 9:

Fisher singularity

Cox & Hinkley (1974), Lee & Chesher (1986), Rotnitzky, Cox, Bottai and Robins (2000), etc.: first-order singularity

Azzalini & Capitanio (1999); Pewsey (2000); DiCiccio & Monti (2004); Hallin & Ley (2012, 2014), etc.: third-order singularities in skewnormal distributions

Chen (1995); Rousseau & Mengersen (2011); Nguyen (2013): asymptotics for parameter estimation in overfitted mixture models, under strong identifiability conditions

A full picture of singularity structures in mixture models remains largely unknown (e.g., hitherto there is no asymptotic theory for finite mixtures of location-scale Gaussians)

Long Nguyen (UM) June 2016 6 / 50

Page 10:

Singularity is a common occurrence in modern statistics

high-dimensional and sparse settings

“complicated” density classes (e.g., gamma, normal, skewnormal), when more than one parameter is varying

overfitted/infinite mixture models

hierarchical models

Long Nguyen (UM) June 2016 7 / 50

Page 11:

Singularities in e-mixtures

Long Nguyen (UM) June 2016 8 / 50

Page 12:

e-mixtures = exact-fitted mixtures

Mixture model indexed by mixing distribution G = ∑_{i=1}^k p_i δ_{η_i}:

{ p_G(x) = ∑_{i=1}^k p_i f(x | η_i)  |  G has k atoms }

Fisher information matrix

I(G) = E[ (∂ log p_G(X)/∂G) (∂ log p_G(X)/∂G)^T ],

where ∂ log p_G/∂G simply denotes the partial derivative of the score function wrt all parameters p, η

Long Nguyen (UM) June 2016 9 / 50

Page 13:

Exploiting the representation p_G = ∑ p_i f(x | η_i), it is easy to note that I(G) is non-singular iff the collection

{ f(x | η_i), (∂f/∂η)(x | η_i)  |  i = 1, ..., k }

consists of linearly independent functions of x

This condition holds if

f(x | η) = Gaussian(x | μ, σ²)  (location-scale Gaussian)

but not for Gamma or skewnormal kernels (and many others)

Long Nguyen (UM) June 2016 10 / 50

Page 14:

Gamma mixtures

Gamma density

f(x | a, b) = (b^a / Γ(a)) x^{a−1} e^{−bx},  a > 0, b > 0

admits the following pde

(∂f/∂b)(x | a, b) = (a/b) f(x | a, b) − (a/b) f(x | a+1, b).

For the Gamma mixture model

p_G(x) = ∑_{i=1}^k p_i f(x | a_i, b_i),

I(G) is a singular matrix if a_i − a_j = 1 and b_i = b_j for some pair of components i, j = 1, ..., k.
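The pde above is easy to check numerically. A minimal sketch using only the Python standard library (the evaluation point x, a, b is an arbitrary choice, not from the slides): compare a central finite difference in b against the right-hand side of the identity.

```python
import math

def gamma_pdf(x, a, b):
    # Gamma density f(x | a, b) = b^a / Gamma(a) * x^(a-1) * exp(-b x)
    return b ** a / math.gamma(a) * x ** (a - 1) * math.exp(-b * x)

def db_numeric(x, a, b, h=1e-6):
    # central finite difference of f with respect to the rate parameter b
    return (gamma_pdf(x, a, b + h) - gamma_pdf(x, a, b - h)) / (2 * h)

def db_identity(x, a, b):
    # right-hand side of the pde: (a/b) f(x | a, b) - (a/b) f(x | a+1, b)
    return (a / b) * (gamma_pdf(x, a, b) - gamma_pdf(x, a + 1, b))

x, a, b = 1.7, 2.3, 0.9
assert abs(db_numeric(x, a, b) - db_identity(x, a, b)) < 1e-5
```

The identity is exact, so the finite difference agrees with it up to O(h²). It is precisely this linear dependence among {f, ∂f/∂b} across components with a_i − a_j = 1 and b_i = b_j that makes I(G) singular.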

Long Nguyen (UM) June 2016 11 / 50

Page 15:

Theorem (Ho & Nguyen, 2016)
The minimax rate of estimation for the Gamma mixture's parameters is slower than n^{−1/2r} for any r ≥ 1 (n = sample size)

This is because we can't tell very well whether G is singular or not

However, if we know that the true G is non-singular, and that |a_i − a_j| − 1 and |b_i − b_j| are bounded away from 0, then the MLE achieves the root-n rate

Long Nguyen (UM) June 2016 12 / 50

Page 16:

Parameter estimation rate for Gamma mixtures

[Figure: W_1 estimation error versus sample size n (up to 2.5 × 10^4), with envelopes 1.1 (log n / n)^{1/2} (left panel) and 1.9 / (log n)^{1/2} (right panel).]

L: MLE restricted to a compact set of non-singular G: W_1(Ĝ_n, G_0) ≍ n^{−1/2}. R: MLE for a general set of G: W_1 ≈ 1/(log n)^{1/2}.

Long Nguyen (UM) June 2016 13 / 50

Page 17:

Skewnormal mixtures

Skewnormal kernel density

f(x | θ, σ, m) := (2/σ) f((x − θ)/σ) Φ(m(x − θ)/σ),

where f(x) is the standard normal density and Φ(x) = ∫ f(t) 1(t ≤ x) dt.

This generalizes Gaussian densities, which correspond to m = 0.

I(G) is singular iff the parameters are a real solution of a number of polynomial equations:

(i) Type A: P_1(η) = ∏_{j=1}^k m_j.

(ii) Type B: P_2(η) = ∏_{1 ≤ i ≠ j ≤ k} { (θ_i − θ_j)² + [σ_i² (1 + m_j²) − σ_j² (1 + m_i²)]² }.

Long Nguyen (UM) June 2016 14 / 50

Page 18:

Figure: Illustration of type A and type B singularity.

Long Nguyen (UM) June 2016 15 / 50

Page 19:

Singularities of skewnormal mixtures lie in affine varieties

If we know that the parameters are bounded away from these algebraic sets, then methods such as the MLE continue to produce the root-n rate

Otherwise it is worse, especially if the true model is near a singularity
- in practice, we want to test whether the true parameters are a singular point
- in theory, we want to know the actual rate of estimation for singular points, necessitating a look into the deep structure of singularities

We'll show that the singularity structure is very rich and goes beyond the singularities of the Fisher information
- we introduce singularity levels, which provide multi-level partitions of the parameter space

There is a consequence: the “more” singular the parameter values, the worse the MLE and minimax rates of estimation: n^{−1/2} to n^{−1/4} to n^{−1/8}, ad infinitum

Long Nguyen (UM) June 2016 16 / 50


Page 21:

General theory

behavior of likelihood function in the neighborhood of model parameters

Long Nguyen (UM) June 2016 17 / 50


Page 23:

From parameter space to space of mixing measures G

The map (p, η) ↦ G(p, η) = ∑_i p_i δ_{η_i} is many-to-one, e.g.,

(1/2) δ_0 + (1/2) δ_1 = (1/2) δ_0 + (1/3) δ_1 + (1/6) δ_1

Say (p, η) and (p′, η′) are equivalent if the corresponding mixing measures are equal, e.g.,

[p, θ] = [(1/2, 1/2), (0, 1)] ≡ [(1/2, 1/3, 1/6), (0, 1, 1)]
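The equivalence can be checked mechanically by collapsing duplicate atoms. A small illustration in Python (standard library only; exact fractions avoid floating-point issues):

```python
from collections import defaultdict
from fractions import Fraction as F

def mixing_measure(weights, atoms):
    # the map (p, eta) -> G = sum_i p_i * delta_{eta_i} is many-to-one:
    # collapsing repeated atoms gives a canonical representation of G
    G = defaultdict(F)
    for w, eta in zip(weights, atoms):
        G[eta] += w
    return {eta: w for eta, w in G.items() if w != 0}

G1 = mixing_measure([F(1, 2), F(1, 2)], [0, 1])
G2 = mixing_measure([F(1, 2), F(1, 3), F(1, 6)], [0, 1, 1])
assert G1 == G2  # both represent (1/2) delta_0 + (1/2) delta_1
```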

Long Nguyen (UM) June 2016 18 / 50

Page 24: Singularity structures and parameter estimation in mixture ...dept.stat.lsa.umich.edu/~xuanlong/Talks/B_singular_NUS.pdf · Otherwise it is worse, especially if the true model is

G =X

pi

�⌘i

7! pG

(·) =X

pi

f (·|⌘i

)

Long Nguyen (UM) June 2016 19 / 50

Page 25:

Wasserstein space of measures / optimal transport distance

Long Nguyen (UM) June 2016 20 / 50

Page 26:

Optimal transportation problem (Monge-Kantorovich)

how to transport goods from a collection of producers to a collection of consumers located in a common space; how to move the mass from one distribution to another?

squares: locations of producers; circles: locations of consumers

The optimal transport/Wasserstein distance is the optimal cost of transporting mass from the “production distribution” to the “consumption distribution”

Long Nguyen (UM) June 2016 21 / 50

Page 27:

Wasserstein distance

Let G, G′ be two prob. measures on R^d, r ≥ 1. A coupling κ of G, G′ is a joint distribution on R^d × R^d which induces marginals G, G′; it is also called a “transportation plan”.

The r-th order Wasserstein distance, denoted by W_r, is given by

W_r(G, G′) := ( inf_κ ∫ ‖θ − θ′‖^r dκ(θ, θ′) )^{1/r}.
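For two discrete measures with the same number of equally weighted atoms, the infimum over couplings reduces to a minimum over permutation matchings, so W_r can be brute-forced for tiny examples. A sketch under that simplifying assumption (not a general-purpose solver):

```python
import itertools

def wasserstein_uniform(atoms_g, atoms_h, r=1):
    # W_r between uniform measures (1/n) sum delta_{g_i} and (1/n) sum delta_{h_j}:
    # for equal uniform weights the optimal coupling is a permutation matching
    n = len(atoms_g)
    best = min(
        sum(abs(atoms_g[i] - atoms_h[j]) ** r for i, j in enumerate(perm)) / n
        for perm in itertools.permutations(range(n))
    )
    return best ** (1 / r)

# W_1 between (1/2)(delta_0 + delta_1) and (1/2)(delta_0 + delta_2):
# move half a unit of mass a distance of 1, so the distance is 0.5
assert wasserstein_uniform([0.0, 1.0], [0.0, 2.0], r=1) == 0.5
```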

Long Nguyen (UM) June 2016 22 / 50


Page 29:

Behavior of likelihood in a Wasserstein neighborhood

As G → G_0 in the Wasserstein metric, apply a Taylor expansion up to the r-th order:

p_G(x) − p_{G_0}(x) = ∑_{i=1}^{k_0} ∑_{j=1}^{s_i} p_{ij} ∑_{|κ|=1}^{r} ((Δη_{ij})^κ / κ!) (∂^{|κ|} f / ∂η^κ)(x | η_i^0) + ∑_{i=1}^{k_0} Δp_{i·} f(x | η_i^0) + R_r(x),

where R_r(x) is the Taylor remainder and R_r(x)/W_r^r(G, G_0) → 0.

We have seen examples where the partial derivatives up to the first order are not linearly independent (for the gamma and skewnormal kernels)

Long Nguyen (UM) June 2016 24 / 50

Page 30:

For the normal kernel density f(x | θ, v),

∂²f/∂θ² = 2 ∂f/∂v.

For the skewnormal kernel density f(x | θ, v, m),

∂²f(x | η)/∂θ² − 2 ∂f(x | η)/∂v + ((m³ + m)/v) ∂f(x | η)/∂m = 0.
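The skewnormal relation can be verified numerically with finite differences (standard library only; the evaluation point is an arbitrary choice, and v = σ² is the variance parameterization used on the slide):

```python
import math

def skewnormal_pdf(x, theta, v, m):
    # f(x | theta, v, m) = (2/sigma) phi(z) Phi(m z), z = (x - theta)/sigma, sigma = sqrt(v)
    sigma = math.sqrt(v)
    z = (x - theta) / sigma
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    Phi = 0.5 * (1 + math.erf(m * z / math.sqrt(2)))
    return 2 / sigma * phi * Phi

h = 1e-4
x, theta, v, m = 0.7, 0.2, 1.3, 0.8

f = lambda th=theta, vv=v, mm=m: skewnormal_pdf(x, th, vv, mm)
d_theta2 = (f(th=theta + h) - 2 * f() + f(th=theta - h)) / h ** 2  # d^2 f / d theta^2
d_v = (f(vv=v + h) - f(vv=v - h)) / (2 * h)                        # d f / d v
d_m = (f(mm=m + h) - f(mm=m - h)) / (2 * h)                        # d f / d m

# the identity: d^2 f/d theta^2 - 2 df/dv + ((m^3 + m)/v) df/dm = 0
assert abs(d_theta2 - 2 * d_v + (m ** 3 + m) / v * d_m) < 1e-4
```

The relation holds at every parameter value; it is this structural dependence among low-order derivatives that later forces the canonical-form reduction.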

Long Nguyen (UM) June 2016 25 / 50


Page 32:

r-canonical form

Fix r ≥ 1. For some sequence G_n ∈ 𝒢 such that W_r(G_n, G_0) → 0,

(p_{G_n}(x) − p_{G_0}(x)) / W_r^r(G_n, G_0) = ∑_{l=1}^{L_r} ( ξ_l(G_0, ΔG_n) / W_r^r(G_n, G_0) ) H_l(x | G_0) + o(1),

where
- the H_l(x | G_0) are linearly independent functions
- the coefficients ξ_l(G_0, ΔG_n)/W_r^r(G_n, G_0) are ratios of two semi-polynomials of the parameter perturbation of G_n around G_0

Long Nguyen (UM) June 2016 26 / 50

Page 33:

H_l may be obtained by reducing partial derivatives to linearly independent ones. For the skewnormal kernel f (parameters (θ, v, m)),

∂f(x | η_k^0)/∂m = − ∑_{j=1}^k [ (α_{1j}/α_{4k}) f(x | η_j^0) + (α_{2j}/α_{4k}) (∂f/∂θ)(x | η_j^0) + (α_{3j}/α_{4k}) (∂f/∂v)(x | η_j^0) ] − ∑_{j=1}^{k−1} (α_{4j}/α_{4k}) (∂f/∂m)(x | η_j^0)

Long Nguyen (UM) June 2016 27 / 50

Page 34:

For the Gaussian kernel f(x | η) = f(x | θ, v), all partial derivatives wrt both θ and v can be eliminated via the following reduction: for any κ_1, κ_2 ∈ N, for any j = 1, ..., k_0,

∂^{κ_1 + κ_2} f(x | η_j^0) / ∂θ^{κ_1} ∂v^{κ_2} = (1/2^{κ_2}) ∂^{κ_1 + 2κ_2} f(x | η_j^0) / ∂θ^{κ_1 + 2κ_2}.

Thus, this reduction is valid for all parameter values (p, η), and r-canonical forms exist for all orders.
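The κ_1 = κ_2 = 1 case of the reduction, ∂²f/∂θ∂v = (1/2) ∂³f/∂θ³, follows from the heat equation ∂f/∂v = (1/2) ∂²f/∂θ² and can be checked by finite differences (standard library only; the test point is an arbitrary choice):

```python
import math

def normal_pdf(x, theta, v):
    # Gaussian density with mean theta and variance v
    return math.exp(-(x - theta) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

h = 1e-3
x, theta, v = 0.4, 0.0, 1.0

def d_theta(th, vv):
    # first derivative wrt theta, central difference
    return (normal_pdf(x, th + h, vv) - normal_pdf(x, th - h, vv)) / (2 * h)

# mixed derivative d^2 f / (d theta d v)
d_theta_v = (d_theta(theta, v + h) - d_theta(theta, v - h)) / (2 * h)

# third derivative wrt theta, central five-point formula
d_theta3 = (normal_pdf(x, theta + 2 * h, v) - 2 * normal_pdf(x, theta + h, v)
            + 2 * normal_pdf(x, theta - h, v) - normal_pdf(x, theta - 2 * h, v)) / (2 * h ** 3)

assert abs(d_theta_v - 0.5 * d_theta3) < 1e-4
```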

Long Nguyen (UM) June 2016 28 / 50

Page 35:

For the skewnormal kernel f(x | η) = f(x | θ, v, m), for any j = 1, ..., k_0, any η = η_j^0 = (θ_j^0, v_j^0, m_j^0),

∂²f(x | η)/∂θ² = 2 ∂f(x | η)/∂v − ((m³ + m)/v) ∂f(x | η)/∂m.

For higher-order partial derivatives:

∂³f/∂θ³ = 2 ∂²f/∂θ∂v − ((m³ + m)/v) ∂²f/∂θ∂m,

∂³f/∂θ²∂v = 2 ∂²f/∂v² + ((m³ + m)/v²) ∂f/∂m − ((m³ + m)/v) ∂²f/∂v∂m,

∂³f/∂θ²∂m = 2 ∂²f/∂v∂m − ((3m² + 1)/v) ∂f/∂m − ((m³ + m)/v) ∂²f/∂m².

Long Nguyen (UM) June 2016 29 / 50

Page 36:

r-singularity

Let 𝒢 be a space of mixing measures.

Definition
For each finite r ≥ 1, say G_0 is r-singular relative to 𝒢 if G_0 admits an r-canonical form for some sequence of G ∈ 𝒢, according to which W_r(G, G_0) → 0 and the coefficients ξ_l^{(r)}(G)/W_r^r(G, G_0) → 0 for all l = 1, ..., L_r.

Lemma
(a) The notion of r-singularity is independent of the specific r-form. That is, the existence of the sequence G in the definition holds for all r-canonical forms once it holds for at least one of them.
(b) If G_0 is r-singular for some r > 1, then G_0 is (r − 1)-singular.

Long Nguyen (UM) June 2016 30 / 50


Page 39:

Definition
The singularity level of G_0 relative to the ambient space 𝒢, denoted by ℓ(G_0 | 𝒢), is
- 0, if G_0 is not r-singular for any r ≥ 1;
- ∞, if G_0 is r-singular for all r ≥ 1;
- otherwise, the largest natural number r ≥ 1 for which G_0 is r-singular.

Long Nguyen (UM) June 2016 31 / 50

Page 40:

Theorem
If ℓ(G_0 | 𝒢) = r, then for any G ∈ 𝒢 subject to a compactness condition,

V(p_G, p_{G_0}) ≳ W_{r+1}^{r+1}(G, G_0).

So, singularity level r implies that the MLE has rate

n^{−1/(2(r+1))}

Under an additional regularity condition, this is also a local minimax lower bound for estimating G_0
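The rate statement can be tabulated directly: singularity level r maps to the exponent 1/(2(r+1)), which reproduces the sequence n^{−1/2}, n^{−1/4}, n^{−1/8} quoted earlier for levels 0, 1, and 3.

```python
def mle_rate_exponent(r):
    # singularity level r  ->  MLE rate n^(-1/(2(r+1)))
    return 1 / (2 * (r + 1))

# levels 0, 1, 3 give the exponents 1/2, 1/4, 1/8 quoted on the earlier slide
assert [mle_rate_exponent(r) for r in (0, 1, 3)] == [0.5, 0.25, 0.125]
```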

Long Nguyen (UM) June 2016 32 / 50

Page 41:

Role of ambient spaces

If 𝒢_1 ⊂ 𝒢_2 then

ℓ(G_0 | 𝒢_1) ≤ ℓ(G_0 | 𝒢_2)

For location-scale Gaussian e-mixtures,

ℓ(G_0 | E_{k_0}) = 0.

For any o-mixtures: G_0 has k_0 support points, but we consider the space O_k of all measures with at most k > k_0 support points. Then,

ℓ(G_0 | O_k) ≥ 1.

In fact, for location-scale Gaussian o-mixtures, we can show

ℓ(G_0 | O_k) ≥ 3.

Long Nguyen (UM) June 2016 33 / 50


Page 43:

Singularities in o-mixtures

G_0 has k_0 support points; 𝒢 := O_k is the space of mixing measures having at most k > k_0 support points

Long Nguyen (UM) June 2016 34 / 50

Page 44:

When do we have ℓ(G_0 | O_k) = 1?

Definition [Chen, 1995; Nguyen, 2013]
G is non-singular in an o-mixture model with at most k components if

{ f(x | θ_i), (∂f/∂θ)(x | θ_i), (∂²f/∂θ²)(x | θ_i)  |  i = 1, ..., k }

are linearly independent functions of x

Theorem [Nguyen, 2013; Chen, 1995]
Under non-singularity of G_0, and compactness conditions on the parameter space, there holds

V(p_G, p_{G_0}) ≳ W_2²(G, G_0).

This is a corollary of our theory, due to the fact that one can establish

ℓ(G_0 | O_k) = 1.

Long Nguyen (UM) June 2016 35 / 50

Page 45:

Since the MLE or Bayes estimators yield the root-n density estimation rate, it implies that

W_2(G_n, G_0) = O_P(n^{−1/4}),

where G_0 denotes the true mixing distribution and G_n the estimate from an n-iid sample

This result is applicable to location Gaussian o-mixtures and scale Gaussian o-mixtures, but not applicable to
- location-scale Gaussian o-mixtures
- skewnormal o-mixtures

Long Nguyen (UM) June 2016 36 / 50


Page 47:

Location-scale Gaussian o-mixture

- given an n-iid sample from a location-scale Gaussian mixture with mixing measure G_0 (which has k_0 components)
- fit the data with a mixture model with k > k_0 components
- ℓ(G_0 | O_k) is determined by (k − k_0), specifically by the following system of polynomial equations:

∑_{j=1}^{k−k_0+1} ∑_{n_1 + 2n_2 = α} c_j² a_j^{n_1} b_j^{n_2} / (n_1! n_2!) = 0  for each α = 1, ..., r   (1)

- there are r equations for 3(k − k_0 + 1) unknowns (c_j, a_j, b_j)_{j=1}^{k−k_0+1}
- let r̄ ≥ 1 be the minimum value of r ≥ 1 such that the above equations do not have any non-trivial real-valued solution. A solution is considered non-trivial if all the c_j's are non-zero, and at least one of the a_j's is non-zero.

Long Nguyen (UM) June 2016 37 / 50


Page 50:

Theorem [Singularity in location-scale Gaussian o-mixtures]

ℓ(G_0 | O_k) = r̄ − 1.

Corollary [Location-scale Gaussian finite mixtures]

The convergence rate of the mixing measure G by either MLE or Bayesian estimation is (log n / n)^{1/2r̄}, under both W_{r̄} and W_1.

[Ho & Nguyen (2016)]

Long Nguyen (UM) June 2016 38 / 50

Page 51:

More on the system of r polynomial equations (1):

Let us consider the case k = k_0 + 1, and let r = 3; then we have

c_1² a_1 + c_2² a_2 = 0,

(1/2)(c_1² a_1² + c_2² a_2²) + c_1² b_1 + c_2² b_2 = 0,

(1/3!)(c_1² a_1³ + c_2² a_2³) + c_1² a_1 b_1 + c_2² a_2 b_2 = 0.

A non-trivial solution exists, by choosing c_2 = c_1 ≠ 0, b_1 = b_2 = −1/2, and a_1 = 1, a_2 = −1. Hence, r̄ ≥ 4.

For r = 4, the system consists of the three equations above, plus

(1/4!)(c_1² a_1⁴ + c_2² a_2⁴) + (1/2!)(c_1² a_1² b_1 + c_2² a_2² b_2) + (1/2!)(c_1² b_1² + c_2² b_2²) = 0.

This system has no non-trivial solution. So, for k = k_0 + 1, we have r̄ = 4.

Long Nguyen (UM) June 2016 39 / 50


Page 53: Singularity structures and parameter estimation in mixture ...dept.stat.lsa.umich.edu/~xuanlong/Talks/B_singular_NUS.pdf · Otherwise it is worse, especially if the true model is

Using the Gröbner basis method, we can find the zeros of a system of real polynomial equations. So:

(1) [Overfitting by one] if \(k - k_0 = 1\), then \(\ell(G_0 \mid \mathcal{O}_k) = 3\), so the MLE/Bayes estimation rate is \(n^{-1/8}\).

(2) [Overfitting by two] if \(k - k_0 = 2\), then \(\ell(G_0 \mid \mathcal{O}_k) = 5\), so the rate is \(n^{-1/12}\).

(3) [Overfitting by three] if \(k - k_0 = 3\), then \(\ell(G_0 \mid \mathcal{O}_k) \geq 6\), so the rate is no better than \(n^{-1/14}\).

Lessons for Gaussian location-scale mixtures:

do not overfit

if you must, be conservative in allowing extra mixing components
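The three rates quoted above are all consistent with an exponent of \(1/(2(\ell+1))\) in the singularity level \(\ell\). A small sketch tabulating this (the closed-form relation is an assumption inferred from the three cases, not stated on this slide):

```python
from fractions import Fraction

def rate_exponent(level):
    # Exponent e in the convergence rate n^(-e) as a function of the
    # singularity level l, assuming e = 1/(2(l + 1)); this relation
    # reproduces the three cases listed on the slide.
    return Fraction(1, 2 * (level + 1))

for overfit, level in [(1, 3), (2, 5), (3, 6)]:
    print(f"k - k0 = {overfit}: level {level} -> rate n^(-{rate_exponent(level)})")
# k - k0 = 1: level 3 -> rate n^(-1/8)
# k - k0 = 2: level 5 -> rate n^(-1/12)
# k - k0 = 3: level 6 -> rate n^(-1/14)
```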

Long Nguyen (UM) June 2016 40 / 50


Page 56:

[Figure: four panels plotting the Hellinger distance \(h(p_G, p_{G_0})\) against Wasserstein distances between mixing measures. Panel 1: \(W_1(G, G_0)\) on the x-axis, with reference curves \(h \asymp \sqrt{W_1}\) and \(h \asymp W_1\). Panel 2: \(W_4(G, G_0)\), with \(h \asymp W_4^4\) and \(h \asymp W_4\). Panel 3: the ratios \(h / W_2^2\) and \(h / W_4^4\). Panel 4: \(W_6(G, G_0)\), with \(h \asymp W_6^6\) and \(h \asymp W_6\).]

Figure: Location-scale Gaussian mixtures. From left to right: (1) Exact-fitted setting; (2) Over-fitted by one component; (3) Over-fitted by one component; (4) Over-fitted by two components.
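The x-axes in these plots are Wasserstein distances between discrete mixing measures. A minimal sketch of how \(W_r\) can be computed as an optimal-transport linear program (an illustration assuming scipy, not the talk's own code):

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_r(atoms_g, weights_g, atoms_h, weights_h, r=1):
    """W_r between two discrete mixing measures on R^d.

    atoms_*: (k, d) arrays of component parameters; weights_*: mixing
    proportions. Solves the optimal-transport LP with cost ||.||^r and
    returns the r-th root of the optimal value.
    """
    atoms_g, atoms_h = np.atleast_2d(atoms_g), np.atleast_2d(atoms_h)
    k, m = len(weights_g), len(weights_h)
    # cost[i, j] = ||atom_i of G - atom_j of H||^r
    cost = np.linalg.norm(atoms_g[:, None, :] - atoms_h[None, :, :], axis=2) ** r
    # Marginal constraints: rows of the plan sum to weights_g, columns to weights_h.
    A_eq, b_eq = [], []
    for i in range(k):
        row = np.zeros(k * m); row[i * m:(i + 1) * m] = 1
        A_eq.append(row); b_eq.append(weights_g[i])
    for j in range(m):
        col = np.zeros(k * m); col[j::m] = 1
        A_eq.append(col); b_eq.append(weights_h[j])
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun ** (1 / r)

# W_1 between G = 0.5 d_0 + 0.5 d_1 and G_0 = d_0: move mass 0.5 a distance 1.
print(wasserstein_r([[0.0], [1.0]], [0.5, 0.5], [[0.0]], [1.0], r=1))  # 0.5
```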

Long Nguyen (UM) June 2016 41 / 50

Page 57:

[Figure: four panels plotting Wasserstein error against sample size \(n\) (up to \(3 \times 10^4\)), with envelope curves \(2.5(\log n / n)^{1/2}\) for \(W_1\), \(2.05(\log n / n)^{1/8}\) for \(W_4\), and \(3.6(\log n / n)^{1/12}\) for \(W_6\); the last panel overlays all three.]

Figure: MLE rates for location-covariance mixtures of Gaussians. From left to right: (1) Exact-fitted: \(W_1 \asymp n^{-1/2}\). (2) Over-fitted by one: \(W_4 \asymp n^{-1/8}\). (3) Over-fitted by two: \(W_6 \asymp n^{-1/12}\).

Long Nguyen (UM) June 2016 42 / 50

Page 58:

The direct link to algebraic geometry

Recall the \(r\)-canonical form

\[
\frac{p_{G_n}(x) - p_{G_0}(x)}{W_r^r(G_n, G_0)}
= \sum_{l=1}^{L_r} \left( \frac{\xi_l(G_0, \Delta G_n)}{W_r^r(G_n, G_0)} \right) H_l(x) + o(1),
\]

where

the coefficients \(\xi_l(G_0, \Delta G_n)/W_r^r(G_n, G_0)\) are ratios of two semi-polynomials of the parameter perturbation of \(G_n\) around \(G_0\)

as \(G_n \to G_0\), the collection of these ratios tends to a system of real polynomial equations
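For the Gaussian kernel specifically, the algebraic dependence that produces these polynomial systems can be traced to the heat-equation identity (a standard fact added here for context, not stated on this slide):

\[
\frac{\partial f}{\partial v}(x \mid \theta, v)
= \frac{1}{2}\,\frac{\partial^2 f}{\partial \theta^2}(x \mid \theta, v),
\qquad
f(x \mid \theta, v) = \frac{1}{\sqrt{2\pi v}}\, e^{-(x-\theta)^2 / 2v},
\]

so Taylor coefficients in the location \(\theta\) and the scale \(v = \sigma^2\) are not linearly independent, and matching them as \(G_n \to G_0\) yields polynomial constraints of the type solved above.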

Long Nguyen (UM) June 2016 43 / 50

Page 59:

Singularities in e-mixtures of skew-normals

Long Nguyen (UM) June 2016 44 / 50

Page 60:

\[
P_1(\eta) = \prod_{j=1}^{k} m_j.
\]

\[
P_2(\eta) = \prod_{1 \le i \ne j \le k} \left\{ (\theta_i - \theta_j)^2
+ \left[ \sigma_i^2 (1 + m_j^2) - \sigma_j^2 (1 + m_i^2) \right]^2 \right\}.
\]
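A quick symbolic check of when a factor of \(P_2\) vanishes (a sketch with sympy, not from the talk; the parameter roles assumed here are \(\theta\) location, \(\sigma\) scale, \(m\) skewness):

```python
import sympy as sp

# Symbols for one pair of skew-normal components.
th1, th2, m1, m2 = sp.symbols("theta1 theta2 m1 m2", real=True)
s1, s2 = sp.symbols("sigma1 sigma2", positive=True)

# One factor of P2 for the pair (1, 2); P2 is the product of such
# factors over all ordered pairs i != j.
factor = (th1 - th2) ** 2 + (s1**2 * (1 + m2**2) - s2**2 * (1 + m1**2)) ** 2

# The factor vanishes exactly when theta1 = theta2 and
# sigma1^2 (1 + m2^2) = sigma2^2 (1 + m1^2): equal locations plus a
# matched scale/skewness combination.
matched = {th2: th1, s2: s1 * sp.sqrt((1 + m2**2) / (1 + m1**2))}
print(sp.simplify(factor.subs(matched)))  # 0
```

Since both summands in each factor are squares, \(P_2(\eta) = 0\) requires some pair of components to satisfy both conditions simultaneously.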

Long Nguyen (UM) June 2016 45 / 50

Page 61:

Partition of subset NC, without Gaussian components

Long Nguyen (UM) June 2016 46 / 50

Page 62:

Partition of subset NC, with some Gaussian components

Long Nguyen (UM) June 2016 47 / 50

Page 63:

Singularities in o-mixtures of skew-normals

Long Nguyen (UM) June 2016 48 / 50

Page 64:

Summary

Singularities are common in mixture models, including finite mixtures

Beyond the singular points of the Fisher information

They are organized into levels, which subdivide the parameter space into multi-level partitions, each of which allows a different minimax and MLE convergence rate

Now that we know what the singularities are (mostly), how do we go about improving the estimation algorithm, both statistically and computationally?

Long Nguyen (UM) June 2016 49 / 50

Page 65:

For details, see

Convergence of latent mixing measures in finite and infinite mixture models (Ann. Statist., 2013)

Convergence rates of parameter estimation in some weakly identifiable models (with N. Ho, Ann. Statist., 2016)

Singularity structures and parameter estimation in finite mixture models (manuscript to be submitted)

Long Nguyen (UM) June 2016 50 / 50