HAL Id: hal-01511960
https://hal.archives-ouvertes.fr/hal-01511960
Submitted on 21 Apr 2017

Vincent Brault, Christine Keribin, Mahendra Mariadassou. Consistency and Asymptotic Normality of Latent Block Model Estimators. 2017. hal-01511960


Submitted to Bernoulli
arXiv: arXiv:0000.0000

Consistency and Asymptotic Normality of Latent Block Model Estimators

VINCENT BRAULT1,*, CHRISTINE KERIBIN2,** and MAHENDRA MARIADASSOU3,†

1Univ. Grenoble Alpes, LJK, F-38000 Grenoble, France; CNRS, LJK, F-38000 Grenoble, France. E-mail: *[email protected]

2Laboratoire de Mathématiques d'Orsay, CNRS, and INRIA Saclay Île-de-France, Université Paris-Sud, Université Paris-Saclay, F-91405 Orsay, France. E-mail: **[email protected]

3MaIAGE, INRA, Université Paris-Saclay, 78352 Jouy-en-Josas, France. E-mail: †[email protected]

The Latent Block Model (LBM) is a model-based method to simultaneously cluster the d columns and n rows of a data matrix. Parameter estimation in LBM is a difficult and multifaceted problem. Although various estimation strategies have been proposed and are now well understood empirically, theoretical guarantees about their asymptotic behavior are rather sparse. We show here that, under some mild conditions on the parameter space, and in an asymptotic regime where log(d)/n and log(n)/d tend to 0 when n and d tend to +∞, (1) the maximum-likelihood estimate of the complete model (with known labels) is consistent and (2) the log-likelihood ratios are equivalent under the complete and observed (with unknown labels) models. This equivalence allows us to transfer the asymptotic consistency to the maximum likelihood estimate under the observed model. Moreover, the variational estimator is also consistent.

Keywords: Latent Block Model, asymptotic normality, Maximum Likelihood Estimate, Concentration Inequality.

1. Introduction

Coclustering is an unsupervised way to simultaneously cluster the rows and columns of a data matrix; it can be used in numerous applications such as recommendation systems, genomics or text mining. Among coclustering methods, the Latent Block Model (LBM) is based on the definition of a probabilistic model.

We observe a data matrix X = (x_ij) with n rows and d columns, and we suppose that there exist a row-partition with g row-classes and a column-partition with m column-classes. The class of each row (resp. column) is unknown and has to be determined. Once determined, rows and columns can be re-ordered according to this coclustering, revealing blocks that are homogeneous and distinct from one another. This leads to a parsimonious representation of the data.

imsart-bj ver. 2014/10/16 file: bj-BraultKeribinMariadassou.tex date: April 21, 2017


LBM can deal with binary ([6]), Gaussian ([10]), categorical ([9]) or count ([7]) data. Due to the complex dependence structure, neither the likelihood nor the distribution of the assignments conditionally on the observations (the E-step of the EM algorithm) is numerically tractable, and therefore neither is the maximum likelihood estimator (MLE). Estimation can however be performed either with a variational approximation (leading to an approximate value of the MLE), or with a Bayesian approach (VBayes algorithm or Gibbs sampler). Notice that [9] recommend combining a Gibbs sampler with a VBayes algorithm.

Although these estimation methods give satisfactory results, the consistency and asymptotic normality of the MLE are still an open question. Some partial results exist for LBM, and this question has been solved for the SBM (Stochastic Block Model), a special case of LBM where the data is a random graph encoded by its adjacency matrix (rows and columns represent the same units, so that there is only one partition, shared by rows and columns). [4] proved in their Theorem 3 that, under the true parameter value, the distribution of the assignments conditionally on the observations of a binary SBM converges to a Dirac mass at the true assignments. Moreover, this convergence remains valid under the estimated parameter value, assuming that this estimator converges at rate at least n⁻¹, where n is the number of nodes (Proposition 3.8). This assumption is not trivial, and it is not established that such an estimator exists except in some particular cases ([1] for example). [11] presented a unified framework for LBM and SBM in the case of observations coming from an exponential family, and showed the consistency of the assignment conditional distribution under any parameter value in a neighborhood of the true value. [3] and [2] proved the consistency and asymptotic normality of the MLE for the binary SBM. Breaking with the preceding approaches, they first studied the asymptotic behavior of the MLE in the complete model (observations and assignments), which is very simple to handle; they then showed that the complete likelihood and the marginal likelihood have similar asymptotic behavior, using a Bernstein inequality for bounded observations.

We extend these results to the double asymptotic framework of the LBM, following the approach of [2], for observations coming from an exponential family. Moreover, we introduce the concept of model symmetry, which was not pointed out by these authors but is necessary to characterize the asymptotic behavior. The asymptotic normality of the variational estimator is also established, and an application to model selection criteria is presented.

The paper is organized as follows. The model, main assumptions and notations are introduced in Section 2, where model symmetry is also discussed. Section 3 establishes the asymptotic normality of the complete likelihood estimator, and Section 4 characterizes three different types of assignment behavior. Our main result, showing that the observed likelihood behaves like the complete likelihood, is presented in Section 5, and the consistency of the MLE and of the variational estimator is deduced from it. Technical proofs are gathered in the appendices.


2. Model and assumptions

The LBM assumes a block clustering structure on a data matrix X = (x_ij) with n rows and d columns, given by the Cartesian product of a row partition z and a column partition w. More precisely:

• row assignments (or labels) z_i, i = 1, ..., n, are independent from column assignments (or labels) w_j, j = 1, ..., d: p(z, w) = p(z)p(w);

• row labels are independent, with a common multinomial distribution: z_i ∼ M(1, π = (π_1, ..., π_g)); in the same way, column labels are i.i.d. multinomial variables: w_j ∼ M(1, ρ = (ρ_1, ..., ρ_m));

• conditionally on the row and column assignments (z_1, ..., z_n) × (w_1, ..., w_d), the observed data X_ij are independent, and their (conditional) distribution ϕ(·, α) belongs to the same parametric family, whose parameter α depends only on the given block:

X_ij | z_ik w_jℓ = 1 ∼ ϕ(·, α_kℓ)

where z_ik is the indicator variable of whether row i belongs to row-group k and w_jℓ is the indicator variable of whether column j belongs to column-group ℓ.

Hence, the complete parameter set is θ = (π, ρ, α) ∈ Θ, with α = (α_11, ..., α_gm) and Θ the parameter space. Figure 1 summarizes these notations.

[Figure 1: schematic of the data matrix X = (x_ij) with its row and column partitions (left) and of the parameter blocks α_kℓ with proportions π_k and ρ_ℓ (right); the original graphic is not recoverable from the text layer.]

Figure 1. Notations. Left: notations for the elements of the observed data matrix are in black, notations for the block clusters are in blue. Right: notations for the model parameters.
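The generative model described above is straightforward to simulate; the following sketch does so for the Bernoulli case (function name and parameter values are ours, chosen for illustration):

```python
import numpy as np

def simulate_lbm(n, d, pi, rho, alpha, rng):
    """Draw (X, z, w) from a Bernoulli LBM: labels first, then independent
    Bernoulli entries with block-wise parameters alpha[k, l]."""
    z = rng.choice(len(pi), size=n, p=pi)     # latent row labels, z_i ~ M(1, pi)
    w = rng.choice(len(rho), size=d, p=rho)   # latent column labels, w_j ~ M(1, rho)
    X = rng.binomial(1, alpha[np.ix_(z, w)])  # X_ij | z, w ~ Bernoulli(alpha[z_i, w_j])
    return X, z, w

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])
rho = np.array([1 / 3, 2 / 3])
alpha = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
X, z, w = simulate_lbm(200, 300, pi, rho, alpha, rng)
print(X.shape)  # (200, 300)
```

Re-ordering the rows and columns of X according to the drawn labels makes the homogeneous blocks visible, which is exactly the parsimonious representation mentioned in the introduction.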


When performing inference from data, we denote by θ* = (π*, ρ*, α*) the true parameter set, i.e. the parameter values used to generate the data, and by z* and w* the true (and usually unobserved) assignments of rows and columns to their groups. For given matrices of indicator variables z and w, we also note:

• z_{+k} = Σ_i z_ik and w_{+ℓ} = Σ_j w_jℓ;

• z*_{+k} and w*_{+ℓ} their counterparts for z* and w*.

Confusion matrices allow us to compare partitions.

Definition 2.1 (confusion matrices). For given assignments z and z* (resp. w and w*), we define the confusion matrix between z and z* (resp. w and w*), noted IRg(z) (resp. IRm(w)), as follows:

IRg(z)_{kk′} = (1/n) Σ_i z*_{ik} z_{ik′}   and   IRm(w)_{ℓℓ′} = (1/d) Σ_j w*_{jℓ} w_{jℓ′}.   (2.1)
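With one-hot indicator matrices, Equation (2.1) is a single matrix product; a minimal transcription (the matrices below are toy values of ours):

```python
import numpy as np

def confusion(Zstar, Z):
    """Confusion matrix of Eq. (2.1): entry (k, k') is the fraction of rows
    placed in class k by z* and in class k' by z."""
    return Zstar.T @ Z / Zstar.shape[0]

# n = 4 rows, g = 2 classes: z* = (0, 0, 1, 1) and z = (0, 1, 1, 1), one-hot encoded
Zstar = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
Z = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])
C = confusion(Zstar, Z)
print(C)  # each row sums to the class proportion of z*
```

Note that z ∼ z* exactly when C is, up to a permutation of its columns, a diagonal matrix, as used later in Section 2.3.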

2.1. Likelihood

When the labels are known, the complete log-likelihood is given by:

L_c(z, w; θ) = log p(x, z, w; θ)
             = log [ ∏_{i,k} π_k^{z_ik} ∏_{j,ℓ} ρ_ℓ^{w_jℓ} ∏_{i,j,k,ℓ} ϕ(x_ij; α_kℓ)^{z_ik w_jℓ} ]
             = log [ (∏_i π_{z_i}) (∏_j ρ_{w_j}) ∏_{i,j} ϕ(x_ij; α_{z_i w_j}) ].   (2.2)

But the labels are usually unobserved, and the observed log-likelihood is obtained by marginalization over all the label configurations:

L(θ) = log p(x; θ) = log Σ_{z∈Z, w∈W} p(x, z, w; θ).   (2.3)

As the LBM involves a double missing data structure, z for rows and w for columns, neither the observed likelihood nor the E-step of the EM algorithm is tractable, but estimation can be performed either by numerical approximation or by MCMC methods [9], [8].
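To make the contrast concrete, here is a sketch for the Bernoulli case (function names are ours): the complete log-likelihood (2.2) is a cheap O(nd) computation once the labels are known, whereas the observed log-likelihood (2.3) sums over all g^n m^d label configurations and is only computable by brute force on toy matrices.

```python
import itertools
import numpy as np

def complete_loglik(X, z, w, pi, rho, alpha):
    """Complete log-likelihood (2.2) for a Bernoulli LBM with known labels."""
    a = alpha[np.ix_(z, w)]  # block parameter attached to each cell
    ll = np.sum(X * np.log(a) + (1 - X) * np.log(1 - a))
    return ll + np.sum(np.log(pi[z])) + np.sum(np.log(rho[w]))

def observed_loglik_bruteforce(X, pi, rho, alpha):
    """Observed log-likelihood (2.3) by explicit marginalization over the
    g^n * m^d label configurations -- tractable only for tiny n and d."""
    n, d = X.shape
    g, m = alpha.shape
    terms = [complete_loglik(X, np.array(z), np.array(w), pi, rho, alpha)
             for z in itertools.product(range(g), repeat=n)
             for w in itertools.product(range(m), repeat=d)]
    return np.logaddexp.reduce(terms)

pi, rho = np.array([0.5, 0.5]), np.array([0.4, 0.6])
alpha = np.array([[0.9, 0.1], [0.2, 0.8]])
X = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])
print(observed_loglik_bruteforce(X, pi, rho, alpha))  # sums 2^3 * 2^3 = 64 terms
```

Already at n = d = 20 with g = m = 2 the sum has about 10^12 terms, which is why variational or MCMC approximations are needed in practice.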

2.2. Assumptions

We focus here on parametric models where ϕ belongs to a regular one-dimensional exponential family in canonical form:

ϕ(x, α) = b(x) exp(αx − ψ(α)),   (2.4)


where α belongs to the space A, so that ϕ(·, α) is well defined for all α ∈ A. Classical properties of exponential families ensure that ψ is convex and infinitely differentiable on A, and that (ψ′)⁻¹ is well defined on ψ′(A). When X_α ∼ ϕ(·, α), E[X_α] = ψ′(α) and V[X_α] = ψ′′(α).

Moreover, we make the following assumptions on the parameter space:

H1: There exist a positive constant c and a compact set C_α ⊂ A such that Θ ⊂ [c, 1−c]^g × [c, 1−c]^m × C_α^{g×m}.
H2: The true parameter θ* = (π*, ρ*, α*) lies in the relative interior of Θ.
H3: The map α ↦ ϕ(·, α) is injective.
H4: The rows of α* are pairwise distinct, and so are its columns.

The previous assumptions are standard. Assumption H1 ensures that the group proportions are bounded away from 0 and 1, so that no group disappears when n and d go to infinity. It also ensures that α is bounded away from the boundary of A and that there exists a κ > 0 such that [α_kℓ − κ, α_kℓ + κ] ⊂ A for all parameters α_kℓ of θ ∈ Θ. Assumptions H3 and H4 are necessary to ensure that the model is identifiable. If the map α ↦ ϕ(·, α) is not injective, the model is trivially not identifiable. Similarly, if rows k and k′ of α are identical, we can build a more parsimonious model inducing the same distribution of x by merging groups k and k′. In the following, we consider that the numbers g and m of row- and column-classes (or groups) are known.

Moreover, we define a quantity δ(α) that captures the separation between row-groups and between column-groups: a lower value means that two row-classes or two column-classes are very similar.

Definition 2.2 (class distinctness). For θ = (π, ρ, α) ∈ Θ, we define:

δ(α) = min( min_{ℓ≠ℓ′} max_k KL(α_kℓ, α_kℓ′), min_{k≠k′} max_ℓ KL(α_kℓ, α_k′ℓ) )

with KL(α, α′) = E_α[log(ϕ(X, α)/ϕ(X, α′))] = ψ′(α)(α − α′) + ψ(α′) − ψ(α) the Kullback divergence between ϕ(·, α) and ϕ(·, α′), when ϕ comes from an exponential family.

Remark 2.3. Since all α have distinct rows and columns, δ(α) > 0.
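For the Bernoulli family (ψ(α) = log(1 + e^α)), Definition 2.2 can be evaluated by brute force over pairs of classes; the helper names below are ours:

```python
import itertools
import numpy as np

def psi(a):        # log-partition: psi(alpha) = log(1 + e^alpha)
    return np.log1p(np.exp(a))

def psi_prime(a):  # mean parameter: psi'(alpha) = e^alpha / (1 + e^alpha)
    return 1 / (1 + np.exp(-a))

def kl(a, b):      # KL(alpha, alpha') = psi'(alpha)(alpha - alpha') + psi(alpha') - psi(alpha)
    return psi_prime(a) * (a - b) + psi(b) - psi(a)

def delta(alpha):
    """Class distinctness of Definition 2.2: min over pairs of distinct
    column-classes (resp. row-classes) of the best separating row (resp. column)."""
    g, m = alpha.shape
    cols = min(max(kl(alpha[k, l], alpha[k, lp]) for k in range(g))
               for l, lp in itertools.permutations(range(m), 2))
    rows = min(max(kl(alpha[k, l], alpha[kp, l]) for l in range(m))
               for k, kp in itertools.permutations(range(g), 2))
    return min(cols, rows)

p = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha = np.log(p / (1 - p))  # natural parameters
print(delta(alpha) > 0)      # True: rows and columns of alpha are pairwise distinct
```

This matches Remark 2.3: δ(α) vanishes exactly when two rows or two columns of α coincide.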

Remark 2.4. Since we restricted α to a bounded subset of A, there exist two positive values M_α and κ such that C_α + (−κ, κ) ⊂ [−M_α, M_α] ⊂ A. Moreover, the variance of X_α is bounded away from 0 and +∞. We note

sup_{α∈[−M_α,M_α]} V(X_α) = σ̄² < +∞   and   inf_{α∈[−M_α,M_α]} V(X_α) = σ̲² > 0.   (2.5)

Proposition 2.5. With the previous notations, if α ∈ C_α and X_α ∼ ϕ(·, α), then X_α is subexponential with parameters (σ̄², κ⁻¹).


Remark 2.6. These assumptions are satisfied for many distributions, including but not limited to:

• Bernoulli, when the proportion p is bounded away from 0 and 1, or equivalently the natural parameter α = log(p/(1 − p)) is bounded away from ±∞;

• Poisson, when the mean λ is bounded away from 0 and +∞, or equivalently the natural parameter α = log(λ) is bounded away from ±∞;

• Gaussian with known variance, when the mean µ, which is also the natural parameter, is bounded away from ±∞.

In particular, the conditions stating that ψ is twice differentiable and that (ψ′)⁻¹ exists are equivalent to assuming that X_α has positive and finite variance for all values of α in the parameter space.
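For reference, the natural-parameter correspondences in these three examples can be collected in a small lookup table (a sketch of ours; the Gaussian entry assumes unit variance):

```python
import numpy as np

# For each family: log-partition psi, mean map psi', and the natural-parameter map.
families = {
    # Bernoulli(p): alpha = log(p / (1 - p)), psi(alpha) = log(1 + e^alpha)
    "bernoulli": dict(psi=lambda a: np.log1p(np.exp(a)),
                      mean=lambda a: 1 / (1 + np.exp(-a)),
                      natural=lambda p: np.log(p / (1 - p))),
    # Poisson(lambda): alpha = log(lambda), psi(alpha) = e^alpha
    "poisson": dict(psi=np.exp, mean=np.exp, natural=np.log),
    # Gaussian N(mu, 1): alpha = mu, psi(alpha) = alpha^2 / 2
    "gaussian": dict(psi=lambda a: a ** 2 / 2,
                     mean=lambda a: a,
                     natural=lambda mu: mu),
}

# In each case psi' inverts the natural-parameter map, as (psi')^{-1} should
for name, f in families.items():
    x = 0.3
    assert abs(f["mean"](f["natural"](x)) - x) < 1e-12, name
```

Bounding the natural parameter, as in Assumption H1, is what keeps ψ′′ (the variance) between the σ̲² and σ̄² of Remark 2.4.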

2.3. Symmetry

The LBM is a generalized mixture model, and it is well known that it is subject to label switching. [9] showed that the categorical LBM is generically identifiable, and this property is easily extended to the case of observations from a one-dimensional exponential family. Hence, except on a subset of null Lebesgue measure in Θ, the parameter set is identifiable up to a label permutation.

The study of the asymptotic properties of the MLE will require taking symmetry properties of the parameter set into account. We first recall the definition of a permutation, then define equivalence relationships for assignments and parameters, and finally make the notion of symmetry precise.

Definition 2.7 (permutation). Let s be a permutation of {1, ..., g} and t a permutation of {1, ..., m}. If A is a matrix with g columns, we define A^s as the matrix obtained by permuting the columns of A according to s, i.e. for any row i and column k of A, A^s_{ik} = A_{is(k)}. If B is a matrix with m columns and C is a matrix with g rows and m columns, B^t and C^{s,t} are defined similarly:

A^s = (A_{is(k)})_{i,k}   B^t = (B_{jt(ℓ)})_{j,ℓ}   C^{s,t} = (C_{s(k)t(ℓ)})_{k,ℓ}

Definition 2.8 (equivalence). We define the following equivalence relationships:

• Two assignments (z, w) and (z′, w′) are equivalent, noted ∼, if they are equal up to label permutation, i.e. there exist two permutations s and t such that z′ = z^s and w′ = w^t.

• Two parameters θ and θ′ are equivalent, noted ∼, if they are equal up to label permutation, i.e. there exist two permutations s and t such that (π^s, ρ^t, α^{s,t}) = (π′, ρ′, α′). This is label switching.

• (θ, z, w) and (θ′, z′, w′) are equivalent, noted ∼, if they are equal up to label permutation on α, i.e. there exist two permutations s and t such that (α^{s,t}, z^s, w^t) = (α′, z′, w′).


Definition 2.9 (distance). We define the following distance, up to equivalence, between configurations z and z*:

‖z − z*‖_{0,∼} = inf_{z′∼z} ‖z′ − z*‖_0

and similarly for the distance between w and w*, where, for any matrix z, we use the Hamming norm ‖·‖_0 defined by

‖z‖_0 = Σ_{i,k} 1{z_ik ≠ 0}.

The last equivalence relationship is not concerned with π and ρ. It is useful when dealing with the conditional likelihood p(x|z, w; θ), which does not depend on π and ρ: in fact, if (θ, z, w) ∼ (θ′, z′, w′), then for all x we have p(x|z, w; θ) = p(x|z′, w′; θ′). Note also that z ∼ z* (resp. w ∼ w*) if and only if the confusion matrix IRg(z) (resp. IRm(w)) is equivalent to a diagonal matrix.
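For small g, the distance of Definition 2.9 can be computed by brute force over label permutations; a direct transcription on indicator matrices (a toy sketch of ours; note that changing one row's label changes two entries of its indicator matrix):

```python
import itertools
import numpy as np

def dist_upto_equiv(Z, Zstar):
    """Hamming distance ||z - z*||_{0,~}: minimize ||z' - z*||_0 over all
    z' ~ z, i.e. over column permutations of the indicator matrix Z."""
    g = Z.shape[1]
    return min(int(np.sum(Z[:, list(s)] != Zstar))
               for s in itertools.permutations(range(g)))

# z* = (0, 0, 1); z = (1, 1, 1) is z* with the two labels swapped, except row 3
Zstar = np.array([[1, 0], [1, 0], [0, 1]])
Z = np.array([[0, 1], [0, 1], [0, 1]])
print(dist_upto_equiv(Z, Zstar))  # 2: one row disagrees after the best relabeling
```

A pure relabeling of z* has distance 0, consistent with the quotient by ∼.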

Definition 2.10 (symmetry). We say that the parameter θ exhibits symmetry for the pair of permutations (s, t) if

(π^s, ρ^t, α^{s,t}) = (π, ρ, α).

θ exhibits symmetry if it exhibits symmetry for some non-trivial pair of permutations (s, t). Finally, the set of pairs (s, t) for which θ exhibits symmetry is noted Sym(θ).

Remark 2.11. The set of parameters that exhibit symmetry is a manifold of null Lebesgue measure in Θ. The notion of symmetry allows us to deal with a notion of non-identifiability of the class labels that is subtler than, and different from, label switching. To emphasize the difference between equivalence and symmetry, consider the following model: π = (1/2, 1/2), ρ = (1/3, 2/3) and

α = ( α₁ α₂ ; α₂ α₁ )   with α₁ ≠ α₂.

The only permutations of interest here are s = t = [1 2]. Choose any z and w. Because of label switching, we know that p(x, z^s, w^t; θ^{s,t}) = p(x, z, w; θ): (z^s, w^t) and (z, w) have the same likelihood but under different parameters θ and θ^{s,t}. If however ρ = (1/2, 1/2), then (s, t) ∈ Sym(θ) and θ^{s,t} = θ, so that (z, w) and (z^s, w^t) have exactly the same likelihood under the same parameter θ. In particular, if (z, w) is a maximum-likelihood assignment under θ, so is (z^s, w^t). In other words, if θ exhibits symmetry, the maximum-likelihood assignment is not unique under the true model, and there are at least # Sym(θ) of them.

3. Asymptotic properties in the complete data model

As stated in the introduction, we first study the asymptotic properties of the complete data model. Let θ̂_c = (π̂, ρ̂, α̂) be the MLE of θ in the complete data model, where the true assignments z = z* and w = w* are known. We can derive the following general estimates from Equation (2.2):


π̂_k = π̂_k(z) = z_{+k}/n      ρ̂_ℓ = ρ̂_ℓ(w) = w_{+ℓ}/d

x̂_kℓ(z, w) = (Σ_{ij} x_ij z_ik w_jℓ)/(z_{+k} w_{+ℓ})      α̂_kℓ = α̂_kℓ(z, w) = (ψ′)⁻¹(x̂_kℓ(z, w))   (3.1)
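In the Bernoulli case, (ψ′)⁻¹ is the logit function and the estimators (3.1) reduce to empirical proportions and block means; a direct transcription (assuming labels are given as one-hot indicator matrices Z and W; names are ours):

```python
import numpy as np

def complete_mle(X, Z, W):
    """Complete-data MLE (3.1) for a Bernoulli LBM.
    Z (n x g) and W (d x m) are one-hot indicator matrices of the labels."""
    n, d = X.shape
    pi_hat = Z.sum(axis=0) / n    # pi_k = z_{+k} / n
    rho_hat = W.sum(axis=0) / d   # rho_l = w_{+l} / d
    xbar = (Z.T @ X @ W) / np.outer(Z.sum(axis=0), W.sum(axis=0))  # block means
    alpha_hat = np.log(xbar / (1 - xbar))  # (psi')^{-1} = logit for Bernoulli
    return pi_hat, rho_hat, alpha_hat

Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
W = np.array([[1, 0], [0, 1], [0, 1]])
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]])
pi_hat, rho_hat, alpha_hat = complete_mle(X, Z, W)
print(pi_hat, rho_hat)  # class proportions (0.5, 0.5) and (1/3, 2/3)
```

Note that a block mean of exactly 0 or 1 sends the logit to ±∞, which is why Assumption H1 keeps α in a compact set away from the boundary of A.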

Proposition 3.1. The matrices Σ_π* = Diag(π*) − π*(π*)ᵀ and Σ_ρ* = Diag(ρ*) − ρ*(ρ*)ᵀ are semi-definite positive, of rank g − 1 and m − 1, and π̂ and ρ̂ are asymptotically normal:

√n (π̂(z*) − π*) →D N(0, Σ_π*) as n → ∞   and   √d (ρ̂(w*) − ρ*) →D N(0, Σ_ρ*) as d → ∞.   (3.2)

Similarly, let V(α*) be the matrix defined by [V(α*)]_kℓ = 1/ψ′′(α*_kℓ) and Σ_α* = Diag⁻¹(π*) V(α*) Diag⁻¹(ρ*). Then:

√(nd) (α̂_kℓ(z*, w*) − α*_kℓ) →D N(0, Σ_{α*,kℓ}) for all k, ℓ, as n, d → ∞,   (3.3)

where the components are independent.

Proof: Since π̂(z*) = (π̂_1(z*), ..., π̂_g(z*)) (resp. ρ̂(w*)) is the sample mean of n (resp. d) i.i.d. multinomial random variables with parameters 1 and π* (resp. ρ*), a simple application of the central limit theorem (CLT) gives:

Σ_{π*,kk′} = π*_k(1 − π*_k) if k = k′, and −π*_k π*_{k′} if k ≠ k′;   Σ_{ρ*,ℓℓ′} = ρ*_ℓ(1 − ρ*_ℓ) if ℓ = ℓ′, and −ρ*_ℓ ρ*_{ℓ′} if ℓ ≠ ℓ′,

which proves Equation (3.2), where Σ_π* and Σ_ρ* are semi-definite positive of rank g − 1 and m − 1.

Similarly, ψ′(α̂_kℓ(z*, w*)) is the average of z*_{+k} w*_{+ℓ} = nd π̂_k(z*) ρ̂_ℓ(w*) i.i.d. random variables with mean ψ′(α*_kℓ) and variance ψ′′(α*_kℓ). The count nd π̂_k(z*) ρ̂_ℓ(w*) is itself random, but π̂_k(z*) ρ̂_ℓ(w*) → π*_k ρ*_ℓ almost surely as n, d → +∞. Therefore, by Slutsky's lemma and the CLT for random sums of random variables [13], we have:

√(nd π*_k ρ*_ℓ) (ψ′(α̂_kℓ(z*, w*)) − ψ′(α*_kℓ)) = √(nd π*_k ρ*_ℓ) ( Σ_{ij} X_ij z*_{ik} w*_{jℓ} / (nd π̂_k(z*) ρ̂_ℓ(w*)) − ψ′(α*_kℓ) ) →D N(0, ψ′′(α*_kℓ)) as n, d → +∞.

The differentiability of (ψ′)⁻¹ and the delta method then give:

√(nd) (α̂_kℓ(z*, w*) − α*_kℓ) →D N(0, 1/(π*_k ρ*_ℓ ψ′′(α*_kℓ))),

and the independence results from the independence of α̂_kℓ(z*, w*) and α̂_{k′ℓ′}(z*, w*) as soon as k ≠ k′ or ℓ ≠ ℓ′, as they involve different sets of i.i.d. variables.


Proposition 3.2 (Local asymptotic normality). Let L*_c be the function defined on Θ by L*_c(π, ρ, α) = log p(x, z*, w*; θ). For any s, t and u in a compact set, we have:

L*_c(π* + s/√n, ρ* + t/√d, α* + u/√(nd)) = L*_c(θ*) + sᵀY_π* + tᵀY_ρ* + Tr(uᵀY_α*)
    − ( (1/2) sᵀΣ_π* s + (1/2) tᵀΣ_ρ* t + (1/2) Tr((u ⊙ u)ᵀ Σ_α*) ) + o_P(1)

where ⊙ denotes the Hadamard product of two matrices (element-wise product) and Σ_π*, Σ_ρ* and Σ_α* are defined in Proposition 3.1. Y_π* and Y_ρ* are asymptotically Gaussian with zero mean and respective variance matrices Σ_π* and Σ_ρ*, and Y_α* is a matrix of asymptotically independent Gaussian components with zero mean and variance matrix Σ_α*.

Proof. By Taylor expansion,

L*_c(π* + s/√n, ρ* + t/√d, α* + u/√(nd)) = L*_c(θ*) + (1/√n) sᵀ∇L*_cπ(θ*) + (1/√d) tᵀ∇L*_cρ(θ*) + (1/√(nd)) Tr(uᵀ∇L*_cα(θ*)) + (1/n) sᵀH_π(θ*) s + (1/d) tᵀH_ρ(θ*) t + (1/(nd)) Tr((u ⊙ u)ᵀ H_α(θ*)) + o_P(1)

where ∇L*_cπ(θ*), ∇L*_cρ(θ*) and ∇L*_cα(θ*) denote the respective components of the gradient of L*_c evaluated at θ*, and H_π, H_ρ and H_α denote the conditional Hessian of L*_c evaluated at θ*. By inspection, H_π/n, H_ρ/d and H_α/(nd) converge in probability to constant matrices, and the random vectors ∇L*_cπ(θ*)/√n, ∇L*_cρ(θ*)/√d and ∇L*_cα(θ*)/√(nd) converge in distribution by the central limit theorem.

4. Profile Likelihood

To study the behavior of the likelihoods, we shall work conditionally on true configurations (z*, w*) that have enough observations in each row and column group. We therefore define regular configurations, which occur with high probability, then introduce the conditional and profile log-likelihood ratios.


4.1. Regular assignments

Definition 4.1 (c-regular assignments). Let z ∈ Z and w ∈ W. For any c > 0, we say that z and w are c-regular if

min_k z_{+k} ≥ cn   and   min_ℓ w_{+ℓ} ≥ cd.   (4.1)

In regular configurations, each row-group (resp. column-group) has Ω(n) (resp. Ω(d)) members, where u_n = Ω(n) means that there exist two constants a, b > 0 such that an ≤ u_n ≤ bn for n large enough. c/2-regular assignments, with c defined in Assumption H1, have high P_θ*-probability in the space of all assignments, uniformly over all θ* ∈ Θ.

Each z_{+k} is a sum of n i.i.d. Bernoulli r.v. with parameter π_k ≥ π_min ≥ c. A simple Hoeffding bound shows that

P_θ*(z_{+k} ≤ nc/2) ≤ P_θ*(z_{+k} ≤ nπ_k/2) ≤ exp(−2n(π_k/2)²) ≤ exp(−nc²/2).

Taking a union bound over the g values of k and using a similar approach for the w_{+ℓ} leads to Proposition 4.2.

Proposition 4.2. Define Z₁ and W₁ as the subsets of Z and W made of c/2-regular assignments, with c defined in Assumption H1. Note Ω₁ the event {(z*, w*) ∈ Z₁ × W₁}; then:

P_θ*(Ω̄₁) ≤ g exp(−nc²/2) + m exp(−dc²/2).
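The bound of Proposition 4.2 is easy to check numerically: with illustrative values n = 500, c = 0.2 and π_k = 0.25 (ours), a Monte-Carlo estimate of P(z_{+k} ≤ nc/2) indeed stays below the Hoeffding term exp(−nc²/2):

```python
import numpy as np

rng = np.random.default_rng(1)
n, c = 500, 0.2  # c lower-bounds the class proportions (Assumption H1)
pi_k = 0.25      # one class proportion, with pi_k >= c

# z_{+k} is Binomial(n, pi_k); estimate P(z_{+k} <= nc/2) by simulation
draws = rng.binomial(n, pi_k, size=100_000)
empirical = np.mean(draws <= n * c / 2)
hoeffding = np.exp(-n * c ** 2 / 2)
print(empirical, hoeffding)
assert empirical <= hoeffding  # the bound holds (both probabilities are tiny here)
```

Here the bound exp(−10) ≈ 4.5e-5 is far from tight, but it decays exponentially in n and d, which is all the union-bound argument needs.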

We now define balls of configurations, taking equivalence classes of assignments into account.

Definition 4.3 (Set of local assignments). We note S(z*, w*, r) the set of configurations that have a representative (for ∼) within relative radius r of (z*, w*):

S(z*, w*, r) = {(z, w) : ‖z − z*‖_{0,∼} ≤ rn and ‖w − w*‖_{0,∼} ≤ rd}.

4.2. Conditional and profile log-likelihoods

We first introduce a few notations.

Definition 4.4. We define the conditional log-likelihood ratio F_nd and its expectation G as:

F_nd(θ, z, w) = log [ p(x|z, w; θ) / p(x|z*, w*; θ*) ]

G(θ, z, w) = E_θ*[ log ( p(x|z, w; θ) / p(x|z*, w*; θ*) ) | z*, w* ].   (4.2)


We also define the profile log-likelihood ratio Λ and its expected counterpart Λ̄ as:

Λ(z, w) = max_θ F_nd(θ, z, w)
Λ̄(z, w) = max_θ G(θ, z, w).   (4.3)

Remark 4.5. As F_nd and G depend on θ only through α, we will sometimes replace θ with α in the expressions of F_nd and G. Replacing F_nd and G by their profiled versions Λ and Λ̄ allows us to get rid of the continuous argument of F_nd and to effectively work with the discrete contrasts Λ and Λ̄.

The following proposition identifies the values of α that maximize F_nd and G, attaining Λ and Λ̄.

Proposition 4.6 (maxima of F_nd and G in θ). Conditionally on z*, w*, define the following quantities:

S* = (S*_kℓ)_kℓ = (ψ′(α*_kℓ))_kℓ

x̄_kℓ(z, w) = E_θ*[x̂_kℓ(z, w) | z*, w*] = [IRg(z)ᵀ S* IRm(w)]_kℓ / (π̂_k(z) ρ̂_ℓ(w))   (4.4)

with x̄_kℓ(z, w) = 0 for z and w such that π̂_k(z) = 0 or ρ̂_ℓ(w) = 0. Then F_nd(θ, z, w) and G(θ, z, w) are maximized in α at α̂(z, w) and ᾱ(z, w) respectively, defined by:

α̂(z, w)_kℓ = (ψ′)⁻¹(x̂_kℓ(z, w))   and   ᾱ(z, w)_kℓ = (ψ′)⁻¹(x̄_kℓ(z, w)),

so that

Λ(z, w) = F_nd(α̂(z, w), z, w)
Λ̄(z, w) = G(ᾱ(z, w), z, w).

Note that although x̄_kℓ = E_θ*[x̂_kℓ | z*, w*], in general ᾱ_kℓ ≠ E_θ*[α̂_kℓ | z*, w*] by nonlinearity of (ψ′)⁻¹. Nevertheless, (ψ′)⁻¹ is Lipschitz over compact subsets of ψ′(A) and therefore, with high probability, |α̂_kℓ − ᾱ_kℓ| and |x̂_kℓ − x̄_kℓ| are of the same order of magnitude.

The maximum and argmax of G and Λ̄ are characterized by the following propositions.

Proposition 4.7 (maximum of G and Λ̄ in (θ, z, w)). Let KL(α, α′) = ψ′(α)(α − α′) + ψ(α′) − ψ(α) be the Kullback divergence between ϕ(·, α) and ϕ(·, α′); then:

G(θ, z, w) = −nd Σ_{k,k′} Σ_{ℓ,ℓ′} IRg(z)_{k,k′} IRm(w)_{ℓ,ℓ′} KL(α*_kℓ, α_{k′ℓ′}) ≤ 0.   (4.5)

Conditionally on the set Ω₁ of regular assignments and for n, d > 2/c,

(i) G is maximized at (α*, z*, w*) and its equivalence class.


(ii) Λ̄ is maximized at (z*, w*) and its equivalence class; moreover, Λ̄(z*, w*) = 0.
(iii) The maximum of Λ̄ (and hence the maximum of G) is well separated.

Property (iii) of Proposition 4.7 is a direct consequence of the local upper bound for Λ̄ stated as follows:

Proposition 4.8 (Local upper bound for Λ̄). Conditionally upon Ω₁, there exists a positive constant C such that for all (z, w) ∈ S(z*, w*, C):

Λ̄(z, w) ≤ −(c δ(α*)/4) (d‖z − z*‖_{0,∼} + n‖w − w*‖_{0,∼}).   (4.6)

Proofs of Propositions 4.6, 4.7 and 4.8 are reported in Appendix A.

5. Main Result

We are now ready to present our main result stated in Theorem 5.1.

Theorem 5.1 (complete-observed). Assume that Assumptions H1 to H4 hold for the Latent Block Model of known order, with n × d observations coming from a univariate exponential family, and let Sym(θ) be the set of pairs of permutations (s, t) for which θ = (π, ρ, α) exhibits symmetry. Then, for n and d tending to infinity with asymptotic rates log(d)/n → 0 and log(n)/d → 0, the observed likelihood ratio behaves like the complete likelihood ratio, up to a bounded multiplicative factor:

p(x; θ)/p(x; θ*) = (# Sym(θ)/# Sym(θ*)) max_{θ′∼θ} [ p(x, z*, w*; θ′)/p(x, z*, w*; θ*) ] (1 + o_P(1)) + o_P(1)

where the o_P is uniform over all θ ∈ Θ.

The maximum over all θ′ that are equivalent to θ stems from the fact that, because of label switching, θ is only identifiable up to its ∼-equivalence class from the observed likelihood, whereas it is completely identifiable from the complete likelihood. As already pointed out, if θ exhibits symmetry, the maximum likelihood assignment is not unique under the true model, and # Sym(θ) terms contribute with the same weight. This was not taken into account by [2]. The next corollary follows immediately:

Corollary 5.2. If Θ contains only parameters that do not exhibit symmetry, then:

p(x; θ)/p(x; θ*) = max_{θ′∼θ} [ p(x, z*, w*; θ′)/p(x, z*, w*; θ*) ] (1 + o_P(1)) + o_P(1)

where the o_P is uniform over all Θ.


Using the conditional log-likelihood, the observed likelihood can be written as

p(x; θ) = Σ_{(z,w)} p(x, z, w; θ) = p(x|z*, w*; θ*) Σ_{(z,w)} p(z, w; θ) exp(F_nd(θ, z, w)).   (5.1)

The proof proceeds by examining the asymptotic behavior of F_nd on three types of configurations that partition Z × W:

1. global control: for (z, w) such that Λ̄(z, w) = Ω(−nd), Proposition 5.3 proves a large deviation behavior for F_nd = −Ω_P(nd), so that those assignments contribute an o_P of p(x, z*, w*; θ*) to the sum (Proposition 5.4);

2. local control: a small deviation result (Proposition 5.5) is needed to show that the combined contribution of assignments close to, but not equivalent to, (z*, w*) is also an o_P of p(x, z*, w*; θ*) (Proposition 5.6);

3. equivalent assignments: Proposition 5.7 examines which of the remaining assignments, all equivalent to (z*, w*), contribute to the sum.

These results are presented in Section 5.1 and their proofs reported in Appendix A. They are then put together in Section 5.2 to complete the proof of our main result. The remainder of the section is devoted to the asymptotics of the ML and variational estimators, obtained as a consequence of the main result.

5.1. Different asymptotic behaviors

We begin with a large deviations inequality for configurations $(z,w)$ far from $(z^\star,w^\star)$ and leverage it to prove that faraway configurations make a small contribution to $p(x;\theta)$.

5.1.1. Global Control

Proposition 5.3 (large deviations of $F_{nd}$). Let $\mathrm{Diam}(\Theta) = \sup_{\theta,\theta'}\|\theta-\theta'\|_\infty$. For all $\varepsilon_{nd} < \kappa/(2\sqrt{2}\,\mathrm{Diam}(\Theta))$ and $n$, $d$ large enough:
\[
\Delta^1_{nd}(\varepsilon_{nd}) = P\left(\sup_{\theta,z,w}\big[F_{nd}(\theta,z,w)-\Lambda(z,w)\big] \geq 2\sqrt{2}\,\sigma nd\,\mathrm{Diam}(\Theta)\,\varepsilon_{nd}\left[1+\frac{gm}{2\sqrt{2nd}\,\varepsilon_{nd}}\right]\right) \leq g^nm^d\exp\left(-\frac{nd\,\varepsilon_{nd}^2}{2}\right) \tag{5.2}
\]

Proposition 5.4 (contribution of global assignments). Assume $\log(d)/n\to 0$ and $\log(n)/d\to 0$ as $n$ and $d$ tend to infinity, and choose $t_{nd}$ decreasing to $0$ such that $t_{nd} \gg \max\left(\frac{n+d}{nd},\frac{\log(nd)}{\sqrt{nd}}\right)$.


14 V. BRAULT et al.

Then, conditionally on $\Omega_1$ and for $n$, $d$ large enough that $2\sqrt{2nd}\,t_{nd}\geq gm$, we have:
\[
\sup_{\theta\in\Theta}\ \sum_{(z,w)\notin S(z^\star,w^\star,t_{nd})} p(z,w,x;\theta) = o_P\big(p(z^\star,w^\star,x;\theta^\star)\big)
\]

5.1.2. Local Control

Proposition 5.3 gives deviations of order $O_P(\sqrt{nd})$, which are only useful for $(z,w)$ such that $G$ and $\Lambda$ are large compared to $\sqrt{nd}$. For $(z,w)$ close to $(z^\star,w^\star)$, we need tighter concentration inequalities, of order $o_P(-(n+d))$, as follows:

Proposition 5.5 (small deviations of $F_{nd}$). Conditionally upon $\Omega_1$, there exist three positive constants $c_1$, $c_2$ and $C$ such that for all $\varepsilon\leq\kappa\sigma^2$ and all $(z,w)\nsim(z^\star,w^\star)$ such that $(z,w)\in S(z^\star,w^\star,C)$:
\[
\Delta^2_{nd}(\varepsilon) = P_{\theta^\star}\left(\sup_\theta\frac{F_{nd}(\theta,z,w)-\Lambda(z,w)}{d\|z-z^\star\|_{0,\sim}+n\|w-w^\star\|_{0,\sim}}\geq\varepsilon\right) \leq \exp\left(-\frac{nd\,c^2\varepsilon^2}{128(c_1\sigma^2+c_2\kappa^{-1}\varepsilon)}\right) \tag{5.3}
\]

The next proposition builds on Propositions 5.5 and 4.7 to show that the combined contribution of assignments close to $(z^\star,w^\star)$ to the observed likelihood is also a $o_P$ of $p(z^\star,w^\star,x;\theta^\star)$.

Proposition 5.6 (contribution of local assignments). With the previous notations,
\[
\sup_{\theta\in\Theta}\ \sum_{\substack{(z,w)\in S(z^\star,w^\star,C)\\ (z,w)\nsim(z^\star,w^\star)}} p(z,w,x;\theta) = o_P\big(p(z^\star,w^\star,x;\theta^\star)\big)
\]

5.1.3. Equivalent assignments

It remains to study the contribution of equivalent assignments.

Proposition 5.7 (contribution of equivalent assignments). For all $\theta\in\Theta$, we have
\[
\sum_{(z,w)\sim(z^\star,w^\star)}\frac{p(x,z,w;\theta)}{p(x,z^\star,w^\star;\theta^\star)} = \#\mathrm{Sym}(\theta)\,\max_{\theta'\sim\theta}\frac{p(x,z^\star,w^\star;\theta')}{p(x,z^\star,w^\star;\theta^\star)}\,(1+o_P(1))
\]
where the $o_P$ is uniform in $\theta$.

5.2. Proof of the main result

Proof. We work conditionally on $\Omega_1$. Choose $(z^\star,w^\star)\in\mathcal{Z}_1\times\mathcal{W}_1$ and a sequence $t_{nd}$ decreasing


to $0$ but satisfying $t_{nd} \gg \max\left(\frac{n+d}{nd},\frac{\log(nd)}{\sqrt{nd}}\right)$ (this is possible since $\log(d)/n\to 0$ and $\log(n)/d\to 0$). According to Proposition 5.4,

\[
\sup_{\theta\in\Theta}\ \sum_{(z,w)\notin S(z^\star,w^\star,t_{nd})} p(z,w,x;\theta) = o_P\big(p(z^\star,w^\star,x;\theta^\star)\big)
\]
Since $t_{nd}$ decreases to $0$, it gets smaller than $C$ (used in Proposition 5.6) for $n$, $d$ large enough. At this point, Proposition 5.6 ensures that:

\[
\sup_{\theta\in\Theta}\ \sum_{\substack{(z,w)\in S(z^\star,w^\star,t_{nd})\\ (z,w)\nsim(z^\star,w^\star)}} p(z,w,x;\theta) = o_P\big(p(z^\star,w^\star,x;\theta^\star)\big)
\]

The observed likelihood ratio therefore reduces to:
\begin{align*}
\frac{p(x;\theta)}{p(x;\theta^\star)} &= \frac{\sum_{(z,w)\sim(z^\star,w^\star)} p(x,z,w;\theta) + \sum_{(z,w)\nsim(z^\star,w^\star)} p(x,z,w;\theta)}{\sum_{(z,w)\sim(z^\star,w^\star)} p(x,z,w;\theta^\star) + \sum_{(z,w)\nsim(z^\star,w^\star)} p(x,z,w;\theta^\star)}\\
&= \frac{\sum_{(z,w)\sim(z^\star,w^\star)} p(x,z,w;\theta) + p(x,z^\star,w^\star;\theta^\star)\,o_P(1)}{\sum_{(z,w)\sim(z^\star,w^\star)} p(x,z,w;\theta^\star) + p(x,z^\star,w^\star;\theta^\star)\,o_P(1)}
\end{align*}

Proposition 5.7 then allows us to conclude:
\[
\frac{p(x;\theta)}{p(x;\theta^\star)} = \frac{\#\mathrm{Sym}(\theta)}{\#\mathrm{Sym}(\theta^\star)}\,\max_{\theta'\sim\theta}\frac{p(x,z^\star,w^\star;\theta')}{p(x,z^\star,w^\star;\theta^\star)}\,(1+o_P(1)) + o_P(1).
\]

5.3. Asymptotics for the MLE of θ

The asymptotic behavior of the maximum likelihood estimator in the incomplete data model is a direct consequence of Theorem 5.1.

Corollary 5.8 (Asymptotic behavior of $\hat\theta_{MLE}$). Denote by $\hat\theta_{MLE}$ the maximum likelihood estimator and use the notations of Proposition 3.1. If $\#\mathrm{Sym}(\theta) = 1$, there exist permutations $s$ of $\{1,\dots,g\}$ and $t$ of $\{1,\dots,m\}$ such that
\[
\pi(z^\star)-\hat\pi^s_{MLE} = o_P\big(n^{-1/2}\big),\quad \rho(w^\star)-\hat\rho^t_{MLE} = o_P\big(d^{-1/2}\big),\quad \alpha(z^\star,w^\star)-\hat\alpha^{s,t}_{MLE} = o_P\big((nd)^{-1/2}\big).
\]


If $\#\mathrm{Sym}(\theta)\neq 1$, $\hat\theta_{MLE}$ is still consistent: there exist permutations $s$ of $\{1,\dots,g\}$ and $t$ of $\{1,\dots,m\}$ such that
\[
\pi(z^\star)-\hat\pi^s_{MLE} = o_P(1),\quad \rho(w^\star)-\hat\rho^t_{MLE} = o_P(1),\quad \alpha(z^\star,w^\star)-\hat\alpha^{s,t}_{MLE} = o_P(1).
\]
Hence, the maximum likelihood estimator for the LBM is consistent, and asymptotically normal with the same behavior as the maximum likelihood estimator in the complete data model when $\theta$ does not exhibit any symmetry. The proof, given in Appendix A.9, relies on the local asymptotic normality of the MLE in the complete model, as stated in Proposition 3.2, and on our main theorem.

5.4. Consistency of variational estimates

Due to the complex dependence structure of the observations, the maximum likelihood estimator of the LBM is not numerically tractable, even with the Expectation Maximisation algorithm. In practice, a variational approximation can be used (see for example Govaert and Nadif [2003]): for any joint distribution $Q\in\mathcal{Q}$ on $\mathcal{Z}\times\mathcal{W}$, a lower bound of $\mathcal{L}(\theta)$ is given by
\begin{align*}
\mathcal{J}(Q,\theta) &= \mathcal{L}(\theta) - \mathrm{KL}\big(Q,\,p(\cdot,\cdot;\theta,x)\big)\\
&= E_Q\big[\mathcal{L}_c(z,w;\theta)\big] + \mathcal{H}(Q),
\end{align*}

where $\mathcal{H}(Q) = -E_Q[\log Q]$. Choosing $\mathcal{Q}$ to be the set of product distributions, such that for all $(z,w)$
\[
Q(z,w) = Q(z)Q(w) = \prod_{i,k} Q(z_{ik}=1)^{z_{ik}}\prod_{j,\ell} Q(w_{j\ell}=1)^{w_{j\ell}},
\]
allows us to obtain tractable expressions of $\mathcal{J}(Q,\theta)$. The variational estimate $\hat\theta_{var}$ of $\theta$ is defined as
\[
\hat\theta_{var} \in \mathop{\mathrm{argmax}}_{\theta\in\Theta}\ \max_{Q\in\mathcal{Q}}\ \mathcal{J}(Q,\theta).
\]
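As a concrete illustration, $\mathcal{J}(Q,\theta)$ has a closed form once $Q$ factorizes. The sketch below computes it for a Bernoulli LBM (our choice of emission distribution; the function names are ours, not the paper's) and checks on a tiny matrix, by enumerating all assignments, that $\mathcal{J}(Q,\theta)$ is indeed a lower bound of the observed log-likelihood:

```python
import numpy as np
from itertools import product

def lbm_elbo(x, qz, qw, pi, rho, alpha):
    """J(Q, theta) = E_Q[L_c(z, w; theta)] + H(Q) for a Bernoulli LBM,
    with a mean-field Q(z, w) = Q(z) Q(w); qz is n x g, qw is d x m."""
    eps = 1e-12
    e_lc = (qz @ np.log(pi)).sum() + (qw @ np.log(rho)).sum()
    # sum_{i,j,k,l} qz[i,k] qw[j,l] * log p(x_ij | alpha_kl)
    e_lc += np.einsum('ik,ij,jl,kl->', qz, x, qw, np.log(alpha))
    e_lc += np.einsum('ik,ij,jl,kl->', qz, 1.0 - x, qw, np.log(1.0 - alpha))
    # entropy of the product distribution Q
    h_q = -(qz * np.log(qz + eps)).sum() - (qw * np.log(qw + eps)).sum()
    return e_lc + h_q

def lbm_loglik(x, pi, rho, alpha):
    """Exact observed log-likelihood by brute-force enumeration of (z, w)."""
    n, d = x.shape
    g, m = alpha.shape
    total = 0.0
    for z in product(range(g), repeat=n):
        for w in product(range(m), repeat=d):
            p = np.prod(pi[list(z)]) * np.prod(rho[list(w)])
            a = alpha[np.ix_(list(z), list(w))]  # alpha_{z_i, w_j} per cell
            p *= np.prod(np.where(x == 1, a, 1.0 - a))
            total += p
    return np.log(total)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(3, 2)).astype(float)
pi, rho = np.array([0.4, 0.6]), np.array([0.3, 0.7])
alpha = np.array([[0.2, 0.8], [0.5, 0.9]])
qz = rng.dirichlet(np.ones(2), size=3)
qw = rng.dirichlet(np.ones(2), size=2)
# J(Q, theta) <= L(theta) for any factorized Q
assert lbm_elbo(x, qz, qw, pi, rho, alpha) <= lbm_loglik(x, pi, rho, alpha) + 1e-9
```

Maximizing this bound alternately in $Q$ and $\theta$ gives the usual variational EM scheme for the LBM.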

The following corollary states that $\hat\theta_{var}$ has the same asymptotic properties as $\hat\theta_{MLE}$ and $\hat\theta_{MC}$.

Corollary 5.9 (Variational estimate). Under the assumptions of Theorem 5.1 and if $\#\mathrm{Sym}(\theta) = 1$, there exist permutations $s$ of $\{1,\dots,g\}$ and $t$ of $\{1,\dots,m\}$ such that
\[
\pi(z^\star)-\hat\pi^s_{var} = o_P\big(n^{-1/2}\big),\quad \rho(w^\star)-\hat\rho^t_{var} = o_P\big(d^{-1/2}\big),\quad \alpha(z^\star,w^\star)-\hat\alpha^{s,t}_{var} = o_P\big((nd)^{-1/2}\big).
\]

The proof is available in appendix A.10.


Appendix A: Proofs

A.1. Proof of Proposition 4.6

Proof. Define $\nu(x,\alpha) = x\alpha - \psi(\alpha)$. For $x$ fixed, $\nu(x,\alpha)$ is maximized at $\alpha = (\psi')^{-1}(x)$. Manipulations yield
\begin{align*}
F_{nd}(\alpha,z,w) &= \log p(x;z,w,\theta) - \log p(x;z^\star,w^\star,\theta^\star)\\
&= nd\left[\sum_k\sum_\ell \pi_k(z)\rho_\ell(w)\,\nu\big(x_{k\ell}(z,w),\alpha_{k\ell}\big) - \sum_k\sum_\ell \pi_k(z^\star)\rho_\ell(w^\star)\,\nu\big(x_{k\ell}(z^\star,w^\star),\alpha^\star_{k\ell}\big)\right]
\end{align*}
which is maximized at $\alpha_{k\ell} = (\psi')^{-1}(x_{k\ell}(z,w))$. Similarly,
\begin{align*}
G(\alpha,z,w) &= E_{\theta^\star}\big[\log p(x;z,w,\theta) - \log p(x;z^\star,w^\star,\theta^\star)\big]\\
&= nd\left[\sum_k\sum_\ell \pi_k(z)\rho_\ell(w)\,\nu\big(\bar{x}_{k\ell}(z,w),\alpha_{k\ell}\big) - \sum_k\sum_\ell \pi_k(z^\star)\rho_\ell(w^\star)\,\nu\big(\psi'(\alpha^\star_{k\ell}),\alpha^\star_{k\ell}\big)\right]
\end{align*}
is maximized at $\alpha_{k\ell} = (\psi')^{-1}(\bar{x}_{k\ell}(z,w))$.
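The maximization of $\nu(x,\cdot)$ at $(\psi')^{-1}(x)$ is standard exponential-family calculus ($\partial_\alpha\nu = x-\psi'(\alpha) = 0$). A quick grid check, taking $\psi=\exp$ (Poisson case, so that $(\psi')^{-1}=\log$); the numerical values are arbitrary:

```python
import numpy as np

# nu(x, a) = x * a - psi(a), with psi(a) = e^a (Poisson natural parameter)
x = 2.5
a = np.linspace(-2.0, 3.0, 100_001)
nu = x * a - np.exp(a)
a_hat = a[np.argmax(nu)]  # grid argmax of nu(x, .)
assert abs(a_hat - np.log(x)) < 1e-3  # matches (psi')^{-1}(x) = log(x)
```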

A.2. Proof of Proposition 4.7 (maximum of G and Λ)

Proof. We condition on $(z^\star,w^\star)$ and prove Equation (4.5):
\begin{align*}
G(\theta,z,w) &= E_{\theta^\star}\left[\left.\log\frac{p(x;z,w,\theta)}{p(x;z^\star,w^\star,\theta^\star)}\ \right|\ z^\star,w^\star\right]\\
&= \sum_i\sum_j\sum_{k,k'}\sum_{\ell,\ell'} E_{\theta^\star}\big[x_{ij}(\alpha_{k'\ell'}-\alpha^\star_{k\ell}) - (\psi(\alpha_{k'\ell'})-\psi(\alpha^\star_{k\ell}))\big]\,z^\star_{ik}z_{ik'}w^\star_{j\ell}w_{j\ell'}\\
&= nd\sum_{k,k'}\sum_{\ell,\ell'} \mathbb{R}_g(z)_{k,k'}\mathbb{R}_m(w)_{\ell,\ell'}\big[\psi'(\alpha^\star_{k\ell})(\alpha_{k'\ell'}-\alpha^\star_{k\ell}) + \psi(\alpha^\star_{k\ell}) - \psi(\alpha_{k'\ell'})\big]\\
&= -nd\sum_{k,k'}\sum_{\ell,\ell'} \mathbb{R}_g(z)_{k,k'}\mathbb{R}_m(w)_{\ell,\ell'}\,\mathrm{KL}(\alpha^\star_{k\ell},\alpha_{k'\ell'})
\end{align*}
If $(z^\star,w^\star)$ is regular and $n, d > 2/c$, all the rows of $\mathbb{R}_g(z)$ and $\mathbb{R}_m(w)$ have at least one positive element and we can apply Lemma B.4 (an adaptation to the LBM of Lemma 3.2 of [2] for the SBM) to characterize the maximum of $G$.

The maximality of $\Lambda(z^\star,w^\star)$ results from the fact that $\Lambda(z,w) = G(\alpha(z,w),z,w)$, where $\alpha(z,w)$ is a particular value of $\alpha$; $\Lambda$ is immediately maximal at $(z,w)\sim(z^\star,w^\star)$, and for those we have $\alpha(z,w)\sim\alpha^\star$.

The separation and local behavior of $G$ around $(z^\star,w^\star)$ are a direct consequence of Proposition 4.8.
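The last equality above uses the identity $\psi'(\alpha^\star)(\alpha-\alpha^\star)+\psi(\alpha^\star)-\psi(\alpha) = -\mathrm{KL}(\alpha^\star,\alpha)$, i.e. the Bregman form of the KL divergence between two members of a natural exponential family. A quick numerical check in the Poisson case ($\psi=\exp$), with arbitrary parameter values:

```python
import math

def kl_poisson(a_star, a):
    """KL between Poisson(e^{a_star}) and Poisson(e^{a}), natural parameters."""
    return math.exp(a_star) * (a_star - a) - math.exp(a_star) + math.exp(a)

def bregman_form(a_star, a, psi=math.exp, psi_prime=math.exp):
    """-( psi'(a_star)(a - a_star) + psi(a_star) - psi(a) )."""
    return -(psi_prime(a_star) * (a - a_star) + psi(a_star) - psi(a))

for a_star, a in [(0.0, 1.0), (-0.5, 0.3), (1.2, 1.2)]:
    assert abs(kl_poisson(a_star, a) - bregman_form(a_star, a)) < 1e-12
```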


A.3. Proof of Proposition 4.8 (Local upper bound for Λ)

Proof.

We work conditionally on $(z^\star,w^\star)$. The principle of the proof relies on the extension of $\Lambda$ to a continuous subspace of $\mathcal{M}_g([0,1])\times\mathcal{M}_m([0,1])$, in which confusion matrices are naturally embedded. The regularity assumption allows us to work on a subspace that is bounded away from the borders of $\mathcal{M}_g([0,1])\times\mathcal{M}_m([0,1])$. The proof then proceeds by computing the gradient of $\Lambda$ at and around its argmax, and using those gradients to control the local behavior of $\Lambda$ around its argmax. The local behavior allows us in turn to show that $\Lambda$ is well-separated.

Note that $\Lambda$ only depends on $z$ and $w$ through $\mathbb{R}_g(z)$ and $\mathbb{R}_m(w)$. We can therefore extend it to matrices $(U,V)\in\mathcal{U}_c\times\mathcal{V}_c$, where $\mathcal{U}_c$ is the subset of matrices of $\mathcal{M}_g([0,1])$ with each row sum higher than $c/2$ and $\mathcal{V}_c$ is a similar subset of $\mathcal{M}_m([0,1])$.

\[
\Lambda(U,V) = -nd\sum_{k,k'}\sum_{\ell,\ell'} U_{kk'}V_{\ell\ell'}\,\mathrm{KL}(\alpha^\star_{k\ell},\alpha_{k'\ell'})
\]
where
\[
\alpha_{k\ell} = \alpha_{k\ell}(U,V) = (\psi')^{-1}\left(\frac{[U^TS^\star V]_{k\ell}}{[U^T\mathbf{1}V]_{k\ell}}\right)
\]
and $\mathbf{1}$ is the $g\times m$ matrix filled with $1$. The confusion matrices $\mathbb{R}_g(z)$ and $\mathbb{R}_m(w)$ satisfy $\mathbb{R}_g(z)\mathbb{1} = \pi(z^\star)$ and $\mathbb{R}_m(w)\mathbb{1} = \rho(w^\star)$, with $\mathbb{1} = (1,\dots,1)^T$ a vector only containing $1$ values, and are obviously in $\mathcal{U}_c$ and $\mathcal{V}_c$ as soon as $(z^\star,w^\star)$ is $c/2$-regular.

The maps $f_{k,\ell}:(U,V)\mapsto\mathrm{KL}(\alpha^\star_{k\ell},\alpha_{k\ell}(U,V))$ are twice differentiable with second derivatives bounded over $\mathcal{U}_c\times\mathcal{V}_c$, and therefore so is $\Lambda(U,V)$. Tedious but straightforward computations show that the derivative of $\Lambda$ at $(D_\pi,D_\rho) := (\mathrm{Diag}(\pi(z^\star)),\mathrm{Diag}(\rho(w^\star)))$ is given by:
\begin{align*}
A_{kk'}(w^\star) &:= -\frac{1}{nd}\frac{\partial\Lambda}{\partial U_{kk'}}(D_\pi,D_\rho) = \sum_\ell \rho_\ell(w^\star)\,\mathrm{KL}(\alpha^\star_{k\ell},\alpha^\star_{k'\ell})\\
B_{\ell\ell'}(z^\star) &:= -\frac{1}{nd}\frac{\partial\Lambda}{\partial V_{\ell\ell'}}(D_\pi,D_\rho) = \sum_k \pi_k(z^\star)\,\mathrm{KL}(\alpha^\star_{k\ell},\alpha^\star_{k\ell'})
\end{align*}
$A(w^\star)$ and $B(z^\star)$ are the matrix derivatives of $-\Lambda/nd$ at $(D_\pi,D_\rho)$. Since $(z^\star,w^\star)$ is $c/2$-regular and by definition of $\delta(\alpha^\star)$, $A(w^\star)_{kk'}\geq c\delta(\alpha^\star)/2$ (resp. $B(z^\star)_{\ell\ell'}\geq c\delta(\alpha^\star)/2$) if $k\neq k'$ (resp. $\ell\neq\ell'$), and $A(w^\star)_{kk} = 0$ (resp. $B(z^\star)_{\ell\ell} = 0$) for all $k$ (resp. $\ell$). By boundedness of the second derivative, there exists $C > 0$ such that for all $(H,G)\in B((D_\pi,D_\rho),C)$, we have:

\[
-\frac{1}{nd}\frac{\partial\Lambda}{\partial U_{kk'}}(H,G)\ \begin{cases}\geq \dfrac{3c\delta(\alpha^\star)}{8} & \text{if } k\neq k'\\[4pt] \leq \dfrac{c\delta(\alpha^\star)}{8} & \text{if } k=k'\end{cases}
\qquad\text{and}\qquad
-\frac{1}{nd}\frac{\partial\Lambda}{\partial V_{\ell\ell'}}(H,G)\ \begin{cases}\geq \dfrac{3c\delta(\alpha^\star)}{8} & \text{if } \ell\neq\ell'\\[4pt] \leq \dfrac{c\delta(\alpha^\star)}{8} & \text{if } \ell=\ell'\end{cases}
\]


Choose $U$ and $V$ in $(\mathcal{U}_c\times\mathcal{V}_c)\cap B((D_\pi,D_\rho),C)$ satisfying $U\mathbb{1} = \pi(z^\star)$ and $V\mathbb{1} = \rho(w^\star)$. $U-D_\pi$ and $V-D_\rho$ have nonnegative off-diagonal coefficients and nonpositive diagonal coefficients. Furthermore, the coefficients of $U$, $V$, $D_\pi$, $D_\rho$ sum up to $1$ and $\mathrm{Tr}(D_\pi) = \mathrm{Tr}(D_\rho) = 1$. By Taylor expansion, there exists a couple $(H,G)$, also in $(\mathcal{U}_c\times\mathcal{V}_c)\cap B((D_\pi,D_\rho),C)$, such that
\begin{align*}
-\frac{1}{nd}\Lambda(U,V) &= -\frac{1}{nd}\Lambda(D_\pi,D_\rho) - \frac{1}{nd}\left[\mathrm{Tr}\left((U-D_\pi)\frac{\partial\Lambda}{\partial U}(H,G)\right) + \mathrm{Tr}\left((V-D_\rho)\frac{\partial\Lambda}{\partial V}(H,G)\right)\right]\\
&\geq \frac{c\delta(\alpha^\star)}{8}\Big[3\sum_{k\neq k'}(U-D_\pi)_{kk'} + 3\sum_{\ell\neq\ell'}(V-D_\rho)_{\ell\ell'} - \sum_k(U-D_\pi)_{kk} - \sum_\ell(V-D_\rho)_{\ell\ell}\Big]\\
&\geq \frac{c\delta(\alpha^\star)}{4}\big[(1-\mathrm{Tr}(U)) + (1-\mathrm{Tr}(V))\big]
\end{align*}
To conclude the proof, assume without loss of generality that $(z,w)\in S(z^\star,w^\star,C)$ achieves the $\|\cdot\|_{0,\sim}$ norm (i.e. it is the closest to $(z^\star,w^\star)$ in its equivalence class). Then $(U,V) = (\mathbb{R}_g(z),\mathbb{R}_m(w))$ is in $(\mathcal{U}_c\times\mathcal{V}_c)\cap B((D_\pi,D_\rho),C)$ and satisfies $U\mathbb{1} = \pi(z^\star)$ (resp. $V\mathbb{1} = \rho(w^\star)$). We just need to note that $n(1-\mathrm{Tr}(\mathbb{R}_g(z))) = \|z-z^\star\|_{0,\sim}$ (resp. $d(1-\mathrm{Tr}(\mathbb{R}_m(w))) = \|w-w^\star\|_{0,\sim}$) to end the proof.

The maps $f_{k,\ell}: x\mapsto\mathrm{KL}(\alpha^\star_{k\ell},(\psi')^{-1}(x))$ are twice differentiable with a continuous second derivative bounded by $\sigma^{-2}$ on $\psi'(\mathcal{C}_\alpha)$. All terms $[U^TS^\star V]_{k\ell}[U^T\mathbf{1}V]^{-1}_{k\ell}$ are convex combinations of the $\psi'(\alpha^\star_{k\ell})$ and therefore lie in $\psi'(\mathcal{C}_\alpha)$. Furthermore, their first and second order derivatives are also bounded as soon as each row sum of $U$ and $V$ is bounded away from $0$. By composition, all second order partial derivatives of $\Lambda$ are therefore continuous and bounded on $\mathcal{U}_c\times\mathcal{V}_c$.

We now compute the first derivative of $\Lambda$ at $(D_\pi,D_\rho) := (\mathrm{Diag}(\pi(z^\star)),\mathrm{Diag}(\rho(w^\star)))$ by doing a first-order Taylor expansion of $\Lambda(D_\pi+U,D_\rho+V)$ for small $U$ and $V$. Tedious but straightforward manipulations show:
\begin{align*}
\alpha_{k\ell}(D_\pi+U,D_\rho+V) &= \alpha^\star_{k\ell} + \frac{1}{\pi_k(z^\star)}\sum_{k'} U_{kk'}(S_{k'\ell}-1) + \frac{1}{\rho_\ell(w^\star)}\sum_{\ell'} V_{\ell\ell'}(S_{k\ell'}-1) + o(\|U\|_1,\|V\|_1)\\
\mathrm{KL}(\alpha^\star_{k\ell},\alpha_{k'\ell'}) &= \mathrm{KL}(\alpha^\star_{k\ell},\alpha^\star_{k'\ell'}) + \begin{cases} O(\|U\|_1,\|V\|_1) & \text{if } (k',\ell')\neq(k,\ell)\\ o(\|U\|_1,\|V\|_1) & \text{if } (k',\ell')=(k,\ell)\end{cases}
\end{align*}
where the second line comes from the fact that $f'_{k,\ell}(\psi'(\alpha^\star_{k\ell})) = 0$. Keeping only the first order terms in $U$ and $V$ in $\Lambda$ and noting that $\Lambda(D_\pi,D_\rho) = 0$ yields:


\begin{align*}
-\frac{1}{nd}\big[\Lambda(D_\pi+U,D_\rho+V) - \Lambda(D_\pi,D_\rho)\big] &= -\frac{1}{nd}\Lambda(D_\pi+U,D_\rho+V)\\
&= \sum_k D_{\pi,kk}\sum_{\ell,\ell'} V_{\ell\ell'}\,\mathrm{KL}(\alpha^\star_{k\ell},\alpha_{k\ell'}) + \sum_\ell D_{\rho,\ell\ell}\sum_{k,k'} U_{kk'}\,\mathrm{KL}(\alpha^\star_{k\ell},\alpha_{k'\ell}) + o(\|U\|_1,\|V\|_1)\\
&= \sum_k \pi_k(z^\star)\sum_{\ell,\ell'} V_{\ell\ell'}\,\mathrm{KL}(\alpha^\star_{k\ell},\alpha^\star_{k\ell'}) + \sum_\ell \rho_\ell(w^\star)\sum_{k,k'} U_{kk'}\,\mathrm{KL}(\alpha^\star_{k\ell},\alpha^\star_{k'\ell}) + o(\|U\|_1,\|V\|_1)\\
&= \mathrm{Tr}(UA(w^\star)) + \mathrm{Tr}(VB(z^\star)) + o(\|U\|_1,\|V\|_1)
\end{align*}
where $A_{kk'}(w^\star) := \sum_\ell \rho_\ell(w^\star)\,\mathrm{KL}(\alpha^\star_{k\ell},\alpha^\star_{k'\ell})$ and $B_{\ell\ell'}(z^\star) := \sum_k \pi_k(z^\star)\,\mathrm{KL}(\alpha^\star_{k\ell},\alpha^\star_{k\ell'})$. $A$ and $B$ are the matrix derivatives of $-\Lambda/nd$ at $(D_\pi,D_\rho)$. Since $(z^\star,w^\star)$ is $c/2$-regular and by definition of $\delta(\alpha^\star)$, $A_{kk'}\geq c\delta(\alpha^\star)/2$ for $k\neq k'$, $B_{\ell\ell'}\geq c\delta(\alpha^\star)/2$ for $\ell\neq\ell'$, and the diagonal terms of $A$ and $B$ are null. By boundedness of the second derivative of $\Lambda$, there exists a constant $C > 0$ such that for all $(H,G)\in B((D_\pi,D_\rho),C)$, we have:

\[
-\frac{1}{nd}\frac{\partial\Lambda}{\partial U_{kk'}}(H,G)\ \begin{cases}\geq \dfrac{3c\delta(\alpha^\star)}{8} & \text{if } k\neq k'\\[4pt] \leq \dfrac{c\delta(\alpha^\star)}{8} & \text{if } k=k'\end{cases}
\qquad\text{and}\qquad
-\frac{1}{nd}\frac{\partial\Lambda}{\partial V_{\ell\ell'}}(H,G)\ \begin{cases}\geq \dfrac{3c\delta(\alpha^\star)}{8} & \text{if } \ell\neq\ell'\\[4pt] \leq \dfrac{c\delta(\alpha^\star)}{8} & \text{if } \ell=\ell'\end{cases}
\]

In particular, if $U$ and $V$ have nonnegative off-diagonal coefficients and nonpositive diagonal coefficients,
\[
-\frac{1}{nd}\left[\mathrm{Tr}\left(U\frac{\partial\Lambda}{\partial U}(H,G)\right) + \mathrm{Tr}\left(V\frac{\partial\Lambda}{\partial V}(H,G)\right)\right] \geq \frac{c\delta(\alpha^\star)}{4}\left[\sum_{kk'} U_{kk'} + \sum_{\ell\ell'} V_{\ell\ell'} - \mathrm{Tr}(U) - \mathrm{Tr}(V)\right]
\]

Choose $U$ and $V$ in $(\mathcal{U}_c\times\mathcal{V}_c)\cap B((D_\pi,D_\rho),C)$ satisfying $U\mathbb{1} = \pi(z^\star)$ and $V\mathbb{1} = \rho(w^\star)$. Note that $U-D_\pi$ and $V-D_\rho$ have nonnegative off-diagonal coefficients and nonpositive diagonal coefficients, that the coefficients of $U$, $V$, $D_\pi$, $D_\rho$ sum up to $1$, and that $\mathrm{Tr}(D_\pi) = \mathrm{Tr}(D_\rho) = 1$. By Taylor expansion, there exists a couple $(H,G)$, also in $(\mathcal{U}_c\times\mathcal{V}_c)\cap B((D_\pi,D_\rho),C)$, such that
\begin{align*}
-\frac{1}{nd}\Lambda(U,V) &= -\frac{1}{nd}\Lambda\big(D_\pi+(U-D_\pi),\,D_\rho+(V-D_\rho)\big)\\
&= -\frac{1}{nd}\left[\mathrm{Tr}\left((U-D_\pi)\frac{\partial\Lambda}{\partial U}(H,G)\right) + \mathrm{Tr}\left((V-D_\rho)\frac{\partial\Lambda}{\partial V}(H,G)\right)\right]\\
&\geq \frac{c\delta(\alpha^\star)}{4}\Big[\sum_{k,k'}(U-D_\pi)_{kk'} + \sum_{\ell,\ell'}(V-D_\rho)_{\ell\ell'} - \mathrm{Tr}(U-D_\pi) - \mathrm{Tr}(V-D_\rho)\Big]\\
&= \frac{c\delta(\alpha^\star)}{4}\big[(1-\mathrm{Tr}(U)) + (1-\mathrm{Tr}(V))\big]
\end{align*}


To conclude the proof, choose any assignment $(z,w)$ and assume without loss of generality that $(z,w)$ is closest to $(z^\star,w^\star)$ in its equivalence class. Then $\mathbb{R}_g(z)$ is in $\mathcal{U}_c$ and additionally satisfies $\mathbb{R}_g(z)\mathbb{1} = \pi(z^\star)$ and $\|z-z^\star\|_{0,\sim} = n\|\mathbb{R}_g(z)-D_\pi\|_1/2 = n(1-\mathrm{Tr}(\mathbb{R}_g(z)))$. Similar equalities hold for $\mathbb{R}_m(w)$ and $\|w-w^\star\|_{0,\sim}$.

A.4. Proof of Proposition 5.3 (global convergence Fnd)

Proof. Conditionally upon $(z^\star,w^\star)$,

\begin{align*}
F_{nd}(\theta,z,w) - \Lambda(z,w) &\leq F_{nd}(\theta,z,w) - G(\theta,z,w)\\
&= \sum_i\sum_j \big(\alpha_{z_iw_j}-\alpha^\star_{z^\star_iw^\star_j}\big)\big(x_{ij}-\psi'(\alpha^\star_{z^\star_iw^\star_j})\big)\\
&= \sum_{kk'}\sum_{\ell\ell'} (\alpha_{k'\ell'}-\alpha^\star_{k\ell})\,W_{kk'\ell\ell'}\\
&\leq \sup_{\substack{\Gamma\in\mathbb{R}^{g^2\times m^2}\\ \|\Gamma\|_\infty\leq\mathrm{Diam}(\Theta)}}\ \sum_{kk'}\sum_{\ell\ell'} \Gamma_{kk'\ell\ell'}W_{kk'\ell\ell'} := Z
\end{align*}
uniformly in $\theta$, where the $W_{kk'\ell\ell'}$ are independent and defined by
\[
W_{kk'\ell\ell'} = \sum_i\sum_j z^\star_{ik}w^\star_{j\ell}z_{ik'}w_{j\ell'}\big(x_{ij}-\psi'(\alpha^\star_{k\ell})\big).
\]
Each $W_{kk'\ell\ell'}$ is the sum of $nd\,\mathbb{R}_g(z)_{kk'}\mathbb{R}_m(w)_{\ell\ell'}$ sub-exponential variables with parameters $(\sigma^2,1/\kappa)$ and is therefore itself sub-exponential with parameters $(nd\,\mathbb{R}_g(z)_{kk'}\mathbb{R}_m(w)_{\ell\ell'}\sigma^2,\,1/\kappa)$.

According to Proposition B.3, $E_{\theta^\star}[Z\,|\,z^\star,w^\star] \leq gm\,\mathrm{Diam}(\Theta)\sqrt{nd\sigma^2}$ and $Z$ is sub-exponential with parameters $(nd\,\mathrm{Diam}(\Theta)^2(2\sqrt{2})^2\sigma^2,\ 2\sqrt{2}\,\mathrm{Diam}(\Theta)/\kappa)$. In particular, for all $\varepsilon_{n,d} < \kappa/(2\sqrt{2}\,\mathrm{Diam}(\Theta))$,
\begin{align*}
P_{\theta^\star}\left(\left.Z \geq \sigma gm\,\mathrm{Diam}(\Theta)\sqrt{nd}\left[1+\frac{\sqrt{8nd}\,\varepsilon_{n,d}}{gm}\right]\ \right|\ z^\star,w^\star\right) &\leq P_{\theta^\star}\big(Z \geq E_{\theta^\star}[Z\,|\,z^\star,w^\star] + 2\sqrt{2}\,\sigma\,\mathrm{Diam}(\Theta)\,nd\,\varepsilon_{n,d}\ \big|\ z^\star,w^\star\big)\\
&\leq \exp\left(-\frac{nd\,\varepsilon_{n,d}^2}{2}\right)
\end{align*}

We can then remove the conditioning and take a union bound to prove Equation (5.2).


A.5. Proof of Proposition 5.4 (contribution of far away assignments)

Proof. Conditionally on $(z^\star,w^\star)$, we know from Proposition 4.7 that $\Lambda$ is maximal at $(z^\star,w^\star)$ and its equivalence class. Choose $t_{nd} > 0$ decreasing to $0$ but satisfying $t_{nd}\gg\max\left(\frac{n+d}{nd},\frac{\log(nd)}{\sqrt{nd}}\right)$. This is always possible because we assume that $\log(d)/n\to 0$ and $\log(n)/d\to 0$. According to 4.7 (iii), for all $(z,w)\notin S(z^\star,w^\star,t_{nd})$,
\[
\Lambda(z,w) \leq -\frac{c\delta(\alpha^\star)}{4}\big(n\|w-w^\star\|_{0,\sim}+d\|z-z^\star\|_{0,\sim}\big) \leq -\frac{c\delta(\alpha^\star)}{4}\,nd\,t_{nd} \tag{A.1}
\]
since either $\|z-z^\star\|_{0,\sim}\geq nt_{nd}$ or $\|w-w^\star\|_{0,\sim}\geq dt_{nd}$. Set $\varepsilon_{nd} = \inf\big(c\delta(\alpha^\star)t_{nd}/16\sigma,\,\kappa\big)/\mathrm{Diam}(\Theta)$. By Proposition 5.3 and with our choice of $\varepsilon_{nd}$, with

probability higher than $1-\Delta^1_{nd}(\varepsilon_{nd})$,
\begin{align*}
\sum_{(z,w)\notin S(z^\star,w^\star,t_{nd})} p(x,z,w;\theta) &= p(x\,|\,z^\star,w^\star;\theta^\star)\sum_{(z,w)\notin S(z^\star,w^\star,t_{nd})} p(z,w;\theta)\,e^{F_{nd}(\theta,z,w)-\Lambda(z,w)+\Lambda(z,w)}\\
&\leq p(x\,|\,z^\star,w^\star;\theta^\star)\sum_{z,w} p(z,w;\theta)\,e^{F_{nd}(\theta,z,w)-\Lambda(z,w)-ndt_{nd}c\delta(\alpha^\star)/4}\\
&\leq p(x\,|\,z^\star,w^\star;\theta^\star)\sum_{z,w} p(z,w;\theta)\,e^{-ndt_{nd}c\delta(\alpha^\star)/8}\\
&= \frac{p(x,z^\star,w^\star;\theta^\star)}{p(z^\star,w^\star;\theta^\star)}\,e^{-ndt_{nd}c\delta(\alpha^\star)/8}\\
&\leq p(x,z^\star,w^\star;\theta^\star)\exp\left(-ndt_{nd}\frac{c\delta(\alpha^\star)}{8} + (n+d)\log\frac{1-c}{c}\right)\\
&= p(x,z^\star,w^\star;\theta^\star)\,o(1)
\end{align*}
where the second line comes from inequality (A.1), the third from the global control studied in Proposition 5.3 and the definition of $\varepsilon_{nd}$, the fourth from the definition of $p(x,z^\star,w^\star;\theta^\star)$, the fifth from the bounds on $\pi^\star$ and $\rho^\star$, and the last from $t_{nd}\gg(n+d)/nd$.

In addition, we have $\varepsilon_{nd}\gg\log(nd)/\sqrt{nd}$ so that the series $\sum_{n,d}\Delta^1_{nd}(\varepsilon_{nd})$ converges and:
\[
\sum_{(z,w)\notin S(z^\star,w^\star,t_{nd})} p(x,z,w;\theta) = p(x,z^\star,w^\star;\theta^\star)\,o_P(1)
\]


A.6. Proof of Proposition 5.5 (local convergence Fnd)

Proof. We work conditionally on $(z^\star,w^\star)\in\mathcal{Z}_1\times\mathcal{W}_1$. Choose $\varepsilon\leq\kappa\sigma^2$ small. Assignments $(z,w)$ at $\|\cdot\|_{0,\sim}$-distance less than $c/4$ of $(z^\star,w^\star)$ are $c/4$-regular. According to Proposition B.1, $x_{k\ell}$ and $\bar{x}_{k\ell}$ are at distance at most $\varepsilon$ with probability higher than $1-\exp\left(-\frac{ndc^2\varepsilon^2}{128(\sigma^2+\kappa^{-1}\varepsilon)}\right)$. Manipulations of $\widehat\Lambda$ and $\Lambda$ yield
\[
\frac{F_{nd}(\theta,z,w)-\Lambda(z,w)}{nd} \leq \frac{\widehat\Lambda(z,w)-\Lambda(z,w)}{nd} = \sum_{k,k'}\sum_{\ell,\ell'} \mathbb{R}_g(z)_{kk'}\mathbb{R}_m(w)_{\ell\ell'}\big[f_{k\ell}(x_{k'\ell'})-f_{k\ell}(\bar{x}_{k'\ell'})\big]
\]
where $f_{k\ell}(x) = -S^\star_{k\ell}(\psi')^{-1}(x) + \psi\big((\psi')^{-1}(x)\big)$. The functions $f_{k\ell}$ are twice differentiable with bounded first and second derivatives over $I = \psi'([-M_\alpha,M_\alpha])$, so that:
\[
f_{k\ell}(y)-f_{k\ell}(x) = f'_{k\ell}(x)(y-x) + o(y-x)
\]
where the $o$ is uniform over pairs $(x,y)\in I^2$ at distance less than $\varepsilon$ and does not depend on $(z^\star,w^\star)$. $\bar{x}_{k\ell}$ is a convex combination of the $S^\star_{k\ell} = \psi'(\alpha^\star_{k\ell})\in\psi'(\mathcal{C}_\alpha)$. Since $\psi'$ is monotonic, $\bar{x}_{k\ell}\in\psi'(\mathcal{C}_\alpha)\subset I$. Similarly, $|x_{k\ell}-\bar{x}_{k\ell}|\leq\kappa\sigma^2$ and $|\psi''|\geq\sigma^2$ over $I$, therefore $x_{k\ell}\in I$. We now bound $f'_{k\ell}$:

\[
|f'_{k\ell}(\bar{x}_{k'\ell'})| = \left|\frac{\bar{x}_{k'\ell'}-S^\star_{k\ell}}{\psi''\big((\psi')^{-1}(\bar{x}_{k'\ell'})\big)}\right| = \left|\frac{\frac{[\mathbb{R}_g(z)^TS^\star\mathbb{R}_m(w)]_{k'\ell'}}{\pi_{k'}(z)\rho_{\ell'}(w)}-S^\star_{k\ell}}{\psi''\big((\psi')^{-1}(\bar{x}_{k'\ell'})\big)}\right| \leq \left(1-\frac{\mathbb{R}_g(z)_{kk'}\mathbb{R}_m(w)_{\ell\ell'}}{\pi_{k'}(z)\rho_{\ell'}(w)}\right)\frac{S^\star_{\max}-S^\star_{\min}}{\sigma^2}
\]
where $S^\star_{\max} = \max_{k,\ell}S^\star_{k\ell}$ and $S^\star_{\min} = \min_{k,\ell}S^\star_{k\ell}$. In particular,
\[
\mathbb{R}_g(z)_{kk'}\mathbb{R}_m(w)_{\ell\ell'}\,|f'_{k\ell}(\bar{x}_{k'\ell'})| \leq \begin{cases}
\mathbb{R}_g(z)_{kk'}\mathbb{R}_m(w)_{\ell\ell'}\,\dfrac{S^\star_{\max}-S^\star_{\min}}{\sigma^2} & \text{if } (k',\ell')\neq(k,\ell)\\[6pt]
\big[\pi_k(z)\rho_\ell(w)-\mathbb{R}_g(z)_{kk}\mathbb{R}_m(w)_{\ell\ell}\big]\,\dfrac{S^\star_{\max}-S^\star_{\min}}{\sigma^2} & \text{if } (k',\ell')=(k,\ell)
\end{cases}
\]


Wrapping everything up,
\begin{align*}
\frac{|\widehat\Lambda(z,w)-\Lambda(z,w)|}{nd} &= \left|\sum_{k,k'}\sum_{\ell,\ell'}\mathbb{R}_g(z)_{kk'}\mathbb{R}_m(w)_{\ell\ell'}\big[f'_{k\ell}(\bar{x}_{k'\ell'})(x_{k'\ell'}-\bar{x}_{k'\ell'}) + o(x_{k'\ell'}-\bar{x}_{k'\ell'})\big]\right|\\
&\leq \Big[\sum_{(k',\ell')\neq(k,\ell)}\mathbb{R}_g(z)_{kk'}\mathbb{R}_m(w)_{\ell\ell'} + \sum_{k,\ell}\big(\pi_k(z)\rho_\ell(w)-\mathbb{R}_g(z)_{kk}\mathbb{R}_m(w)_{\ell\ell}\big)\Big]\frac{S^\star_{\max}-S^\star_{\min}}{\sigma^2}\max_{k,\ell}|x_{k\ell}-\bar{x}_{k\ell}|\,(1+o(1))\\
&= 2\Big[\sum_{(k',\ell')\neq(k,\ell)}\mathbb{R}_g(z)_{kk'}\mathbb{R}_m(w)_{\ell\ell'}\Big]\frac{S^\star_{\max}-S^\star_{\min}}{\sigma^2}\max_{k,\ell}|x_{k\ell}-\bar{x}_{k\ell}|\,(1+o(1))\\
&= 2\big[1-\mathrm{Tr}(\mathbb{R}_g(z))\,\mathrm{Tr}(\mathbb{R}_m(w))\big]\frac{S^\star_{\max}-S^\star_{\min}}{\sigma^2}\max_{k,\ell}|x_{k\ell}-\bar{x}_{k\ell}|\,(1+o(1))\\
&\leq 2\left(\frac{\|z-z^\star\|}{n}+\frac{\|w-w^\star\|}{d}\right)\frac{S^\star_{\max}-S^\star_{\min}}{\sigma^2}\max_{k,\ell}|x_{k\ell}-\bar{x}_{k\ell}|\,(1+o(1))\\
&\leq 2\left(\frac{\|z-z^\star\|}{n}+\frac{\|w-w^\star\|}{d}\right)\frac{S^\star_{\max}-S^\star_{\min}}{\sigma^2}\,\varepsilon\,(1+o(1))
\end{align*}
We can remove the conditioning on $(z^\star,w^\star)$ to prove Equation (5.3) with $c_2 = 2(S^\star_{\max}-S^\star_{\min})/\sigma^2$ and $c_1 = c_2^2$.

A.7. Proof of Proposition 5.6 (contribution of local assignments)

Proof. By Proposition 4.2, it is enough to prove that the sum is small compared to $p(z^\star,w^\star,x;\theta^\star)$ on $\Omega_1$. We work conditionally on $(z^\star,w^\star)\in\mathcal{Z}_1\times\mathcal{W}_1$. Choose $(z,w)$ in $S(z^\star,w^\star,C)$ with $C$ defined in Proposition 5.4.

\[
\log\left(\frac{p(z,w,x;\theta)}{p(z^\star,w^\star,x;\theta^\star)}\right) = \log\left(\frac{p(z,w;\theta)}{p(z^\star,w^\star;\theta^\star)}\right) + F_{nd}(\theta,z,w)
\]
For $C$ small enough, we can assume without loss of generality that $(z,w)$ is the representative closest to $(z^\star,w^\star)$ and note $r_1 = \|z-z^\star\|_0$ and $r_2 = \|w-w^\star\|_0$. We choose


$\varepsilon_{nd}\leq\min(\kappa\sigma^2,\,c\delta(\alpha^\star)/8)$. Then with probability at least $1-\exp\left(-\frac{ndc^2\varepsilon_{nd}^2}{8(c_1\sigma^2+c_2\kappa^{-1}\varepsilon_{nd})}\right)$:
\begin{align*}
F_{nd}(\theta,z,w) &\leq \widehat\Lambda(z,w)-\Lambda(z,w)+\Lambda(z,w)\\
&\leq \widehat\Lambda(z,w)-\Lambda(z,w) - \frac{c\delta(\alpha^\star)}{4}(dr_1+nr_2)\\
&\leq \varepsilon_{nd}(dr_1+nr_2) - \frac{c\delta(\alpha^\star)}{4}(dr_1+nr_2)\\
&\leq -\frac{c\delta(\alpha^\star)}{8}(dr_1+nr_2)
\end{align*}
where the first line comes from the definition of $\widehat\Lambda$, the second from Proposition 4.7, the third from Proposition 5.5, and the last from $\varepsilon_{nd}\leq c\delta(\alpha^\star)/8$. A union bound shows that

\[
\Delta_{nd}(\varepsilon_{nd}) = P_{\theta^\star}\left(\sup_{\substack{(z,w)\in S(z^\star,w^\star,C)\\ \theta\in\Theta}} F_{nd}(\theta,z,w) \geq -\frac{c\delta(\alpha^\star)}{8}\big(d\|z-z^\star\|_{0,\sim}+n\|w-w^\star\|_{0,\sim}\big)\right) \leq g^nm^d\exp\left(-\frac{ndc^2\varepsilon_{nd}^2}{8(c_1\sigma^2+c_2\kappa^{-1}\varepsilon_{nd})}\right)
\]
Thanks to Corollary B.6, we also know that:

\[
\log\left(\frac{p(z,w;\theta)}{p(z^\star,w^\star;\theta^\star)}\right) \leq O_P(1) + M_{c/4}(r_1+r_2)
\]
There are at most $\binom{n}{r_1}\binom{d}{r_2}g^{r_1}m^{r_2}$ assignments $(z,w)$ at distance $r_1$ and $r_2$ of $(z^\star,w^\star)$, and each of them has at most $g^gm^m$ equivalent configurations. Therefore, with probability $1-\Delta_{nd}(\varepsilon_{nd})$,

\begin{align*}
\sum_{\substack{(z,w)\in S(z^\star,w^\star,C)\\ (z,w)\nsim(z^\star,w^\star)}} \frac{p(z,w,x;\theta)}{p(z^\star,w^\star,x;\theta^\star)} &\leq O_P(1)\sum_{r_1+r_2\geq 1}\binom{n}{r_1}\binom{d}{r_2}g^{g+r_1}m^{m+r_2}\exp\left((r_1+r_2)M_{c/4}-\frac{c\delta(\alpha^\star)}{8}(dr_1+nr_2)\right)\\
&= O_P(1)\left[\left(1+e^{(g+1)\log g+M_{c/4}-d\frac{c\delta(\alpha^\star)}{8}}\right)^n\left(1+e^{(m+1)\log m+M_{c/4}-n\frac{c\delta(\alpha^\star)}{8}}\right)^d - 1\right]\\
&\leq O_P(1)\,a_{nd}\exp(a_{nd})
\end{align*}
where $a_{nd} = n\,e^{(g+1)\log g+M_{c/4}-d\frac{c\delta(\alpha^\star)}{8}} + d\,e^{(m+1)\log m+M_{c/4}-n\frac{c\delta(\alpha^\star)}{8}} = o(1)$ as soon as $n\gg\log d$ and $d\gg\log n$. If we take $\varepsilon_{nd}\gg\log(nd)/\sqrt{nd}$, the series $\sum_{n,d}\Delta_{nd}(\varepsilon_{nd})$ converges, which proves the result.


A.8. Proof of Proposition 5.7 (contribution of equivalent assignments)

Proof. Choose $(s,t)$ permutations of $\{1,\dots,g\}$ and $\{1,\dots,m\}$ and assume that $z = z^{\star,s}$ and $w = w^{\star,t}$. Then $p(x,z,w;\theta) = p(x,z^{\star,s},w^{\star,t};\theta) = p(x,z^\star,w^\star;\theta^{s,t})$. If furthermore $(s,t)\in\mathrm{Sym}(\theta)$, $\theta^{s,t} = \theta$ and immediately $p(x,z,w;\theta) = p(x,z^\star,w^\star;\theta)$. We can therefore partition the sum as

\begin{align*}
\sum_{(z,w)\sim(z^\star,w^\star)} p(x,z,w;\theta) &= \sum_{s,t} p(x,z^{\star,s},w^{\star,t};\theta)\\
&= \sum_{s,t} p(x,z^\star,w^\star;\theta^{s,t})\\
&= \sum_{\theta'\sim\theta} \#\mathrm{Sym}(\theta')\,p(x,z^\star,w^\star;\theta')\\
&= \#\mathrm{Sym}(\theta)\sum_{\theta'\sim\theta} p(x,z^\star,w^\star;\theta')
\end{align*}

$p(x,z^\star,w^\star;\theta)$ is unimodal in $\theta$, with a mode at $\hat\theta_{MC}$. By consistency of $\hat\theta_{MC}$, either $p(x,z^\star,w^\star;\theta) = o_P(p(x,z^\star,w^\star;\theta^\star))$, or $p(x,z^\star,w^\star;\theta) = O_P(p(x,z^\star,w^\star;\theta^\star))$ and $\theta\to\theta^\star$. In the latter case, any $\theta'\sim\theta$ other than $\theta$ is bounded away from $\theta^\star$ and thus $p(x,z^\star,w^\star;\theta') = o_P(p(x,z^\star,w^\star;\theta^\star))$. In summary,

\[
\sum_{\theta'\sim\theta}\frac{p(x,z^\star,w^\star;\theta')}{p(x,z^\star,w^\star;\theta^\star)} = \max_{\theta'\sim\theta}\frac{p(x,z^\star,w^\star;\theta')}{p(x,z^\star,w^\star;\theta^\star)}\,(1+o_P(1))
\]
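The first step of this proof, $p(x,z^{\star,s},w^{\star,t};\theta) = p(x,z^\star,w^\star;\theta^{s,t})$, can be checked numerically. The sketch below does so for a Bernoulli LBM; the choice of emission distribution and the helper names are ours, not the paper's:

```python
import numpy as np

def complete_loglik(x, z, w, pi, rho, alpha):
    """log p(x, z, w; theta) for a Bernoulli LBM, z and w as label vectors."""
    a = alpha[np.ix_(z, w)]  # alpha_{z_i, w_j} for every cell (i, j)
    return (np.log(pi[z]).sum() + np.log(rho[w]).sum()
            + (x * np.log(a) + (1 - x) * np.log(1 - a)).sum())

def relabel(theta, s, t):
    """theta^{s,t}: push the parameters through the permutations (s, t)."""
    pi, rho, alpha = theta
    pi2 = np.empty_like(pi); pi2[s] = pi
    rho2 = np.empty_like(rho); rho2[t] = rho
    alpha2 = np.empty_like(alpha); alpha2[np.ix_(s, t)] = alpha
    return pi2, rho2, alpha2

rng = np.random.default_rng(1)
n, d, g, m = 5, 4, 2, 3
x = rng.integers(0, 2, (n, d)).astype(float)
z, w = rng.integers(0, g, n), rng.integers(0, m, d)
theta = (np.array([0.4, 0.6]), np.array([0.2, 0.3, 0.5]),
         rng.uniform(0.1, 0.9, (g, m)))
s, t = np.array([1, 0]), np.array([2, 0, 1])

# Label switching leaves the complete likelihood invariant, which is what
# makes the #Sym(theta) equivalent terms contribute with the same weight.
lhs = complete_loglik(x, s[z], t[w], *relabel(theta, s, t))
rhs = complete_loglik(x, z, w, *theta)
assert abs(lhs - rhs) < 1e-9
```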

A.9. Proof of Corollary 5.8: Behavior of θMLE

Theorem 5.1 states that:

\[
\frac{p(x;\theta)}{p(x;\theta^\star)} = \frac{\#\mathrm{Sym}(\theta)}{\#\mathrm{Sym}(\theta^\star)}\max_{\theta'\sim\theta}\frac{p(x,z^\star,w^\star;\theta')}{p(x,z^\star,w^\star;\theta^\star)}\,(1+o_P(1)) + o_P(1)
\]
Then,
\begin{align*}
p(x;\theta) &= \#\mathrm{Sym}(\theta)\,\frac{p(x;\theta^\star)}{\#\mathrm{Sym}(\theta^\star)\,p(x,z^\star,w^\star;\theta^\star)}\max_{\theta'\sim\theta} p(x,z^\star,w^\star;\theta')\,(1+o_P(1)) + o_P(1)\\
&= \#\mathrm{Sym}(\theta)\,\frac{1}{\#\mathrm{Sym}(\theta^\star)\,p(z^\star,w^\star|x;\theta^\star)}\max_{\theta'\sim\theta} p(x,z^\star,w^\star;\theta')\,(1+o_P(1)) + o_P(1).
\end{align*}


Now, using Corollary 3, p. 553, of Mariadassou and Matias [11],
\[
p(\cdot,\cdot\,|\,x;\theta^\star) \xrightarrow[n,d\to+\infty]{(D)} \frac{1}{\#\mathrm{Sym}(\theta^\star)}\sum_{(z,w)\sim_{\theta^\star}(z^\star,w^\star)} \delta_{(z,w)}(\cdot,\cdot),
\]
we can deduce that
\begin{align}
p(x;\theta) &= \#\mathrm{Sym}(\theta)\,\frac{1}{\#\mathrm{Sym}(\theta^\star)\,p(z^\star,w^\star|x;\theta^\star)}\max_{\theta'\sim\theta} p(x,z^\star,w^\star;\theta')\,(1+o_P(1)) + o_P(1)\nonumber\\
&= \#\mathrm{Sym}(\theta)\,\frac{1}{1+o_P(1)}\max_{\theta'\sim\theta} p(x,z^\star,w^\star;\theta')\,(1+o_P(1)) + o_P(1)\nonumber\\
&= \#\mathrm{Sym}(\theta)\max_{\theta'\sim\theta} p(x,z^\star,w^\star;\theta')\,(1+o_P(1)) + o_P(1). \tag{A.2}
\end{align}
Finally, we conclude with Proposition 3.2.

A.10. Proof of Corollary 5.9: Behavior of J (Q, θ)

Remark first that for every θ and for every (z,w),

\[
p(x,z,w;\theta) \leq \exp\big[\mathcal{J}(\delta_z\times\delta_w,\theta)\big] \leq \max_{Q\in\mathcal{Q}}\exp\big[\mathcal{J}(Q,\theta)\big] \leq p(x;\theta)
\]
where $\delta_z$ denotes the Dirac mass on $z$. Dividing by $p(x;\theta^\star)$, we obtain
\[
\frac{p(x,z,w;\theta)}{p(x;\theta^\star)} \leq \frac{\max_{Q\in\mathcal{Q}}\exp\big[\mathcal{J}(Q,\theta)\big]}{p(x;\theta^\star)} \leq \frac{p(x;\theta)}{p(x;\theta^\star)}.
\]
As this inequality is true for every couple $(z,w)$, we have:
\[
\max_{(z,w)\in\mathcal{Z}\times\mathcal{W}}\frac{p(x,z,w;\theta)}{p(x;\theta^\star)} \leq \frac{\max_{Q\in\mathcal{Q}}\exp\big[\mathcal{J}(Q,\theta)\big]}{p(x;\theta^\star)}.
\]

Moreover, using Equation (A.2), we get a lower bound:
\begin{align*}
\max_{(z,w)\in\mathcal{Z}\times\mathcal{W}}\frac{p(x,z,w;\theta)}{p(x;\theta^\star)} &= \max_{\theta'\sim\theta}\frac{p(x,z^\star,w^\star;\theta')\,(1+o_P(1))}{p(x;\theta^\star)} + o_P(1)\\
&= \max_{\theta'\sim\theta}\frac{p(x,z^\star,w^\star;\theta')\,(1+o_P(1))}{\#\mathrm{Sym}(\theta^\star)\,p(x,z^\star,w^\star;\theta^\star)\,(1+o_P(1))} + o_P(1)\\
&= \max_{\theta'\sim\theta}\frac{p(x,z^\star,w^\star;\theta')\,(1+o_P(1))}{\#\mathrm{Sym}(\theta^\star)\,p(x,z^\star,w^\star;\theta^\star)} + o_P(1).
\end{align*}


Now, Theorem 5.1 leads to the following upper bound:
\begin{align*}
\frac{\max_{Q\in\mathcal{Q}}\exp\big[\mathcal{J}(Q,\theta)\big]}{p(x;\theta^\star)} &\leq \frac{p(x;\theta)}{p(x;\theta^\star)}\\
&\leq \frac{\#\mathrm{Sym}(\theta)}{\#\mathrm{Sym}(\theta^\star)}\max_{\theta'\sim\theta}\frac{p(x,z^\star,w^\star;\theta')\,(1+o_P(1))}{p(x,z^\star,w^\star;\theta^\star)} + o_P(1)
\end{align*}
so that we have the following control:
\begin{align*}
\max_{\theta'\sim\theta}\frac{p(x,z^\star,w^\star;\theta')\,(1+o_P(1))}{\#\mathrm{Sym}(\theta^\star)\,p(x,z^\star,w^\star;\theta^\star)} + o_P(1) &\leq \frac{\max_{Q\in\mathcal{Q}}\exp\big[\mathcal{J}(Q,\theta)\big]}{p(x;\theta^\star)}\\
&\leq \frac{\#\mathrm{Sym}(\theta)}{\#\mathrm{Sym}(\theta^\star)}\max_{\theta'\sim\theta}\frac{p(x,z^\star,w^\star;\theta')\,(1+o_P(1))}{p(x,z^\star,w^\star;\theta^\star)} + o_P(1).
\end{align*}

In the particular case where $\#\mathrm{Sym}(\theta) = 1$, we have
\[
\frac{\max_{Q\in\mathcal{Q}}\exp\big[\mathcal{J}(Q,\theta)\big]}{p(x;\theta^\star)} = \frac{1}{\#\mathrm{Sym}(\theta^\star)}\max_{\theta'\sim\theta}\frac{p(x,z^\star,w^\star;\theta')\,(1+o_P(1))}{p(x,z^\star,w^\star;\theta^\star)} + o_P(1)
\]
and, following the same reasoning as in Appendix A.9, we obtain the result.

Appendix B: Technical Lemma

B.1. Sub-exponential variables

We now prove two propositions regarding sub-exponential variables. Recall first that a random variable $X$ is sub-exponential with parameters $(\tau^2,b)$ if for all $\lambda$ such that $|\lambda|\leq 1/b$,
\[
E\big[e^{\lambda(X-E(X))}\big] \leq \exp\left(\frac{\lambda^2\tau^2}{2}\right).
\]

In particular, all distributions coming from a natural exponential family are sub-exponential. Sub-exponential variables satisfy a large deviation Bernstein-type inequality:
\[
P(X-E[X]\geq t) \leq \begin{cases}
\exp\left(-\dfrac{t^2}{2\tau^2}\right) & \text{if } 0\leq t\leq\dfrac{\tau^2}{b}\\[8pt]
\exp\left(-\dfrac{t}{2b}\right) & \text{if } t\geq\dfrac{\tau^2}{b}
\end{cases} \tag{B.1}
\]
so that
\[
P(X-E[X]\geq t) \leq \exp\left(-\frac{t^2}{2(\tau^2+bt)}\right).
\]
The sub-exponential property is preserved by summation and multiplication:


• If $X$ is sub-exponential with parameters $(\tau^2,b)$ and $\alpha\in\mathbb{R}$, then so is $\alpha X$, with parameters $(\alpha^2\tau^2,|\alpha|b)$.
• If the $X_i$, $i=1,\dots,n$, are sub-exponential with parameters $(\tau_i^2,b_i)$ and independent, then so is $X = X_1+\dots+X_n$, with parameters $(\sum_i\tau_i^2,\max_ib_i)$.

Proposition B.1 (Maximum in $(z,w)$). Let $(z,w)$ be a configuration and $x_{k,\ell}(z,w)$, resp. $\bar{x}_{k\ell}(z,w)$, be as defined in Equations (3.1) and (4.4). Under the assumptions of Section 2.2, for all $\varepsilon>0$,
\[
P\left(\max_{z,w}\max_{k,\ell}\pi_k(z)\rho_\ell(w)\,|x_{k,\ell}-\bar{x}_{k\ell}| > \varepsilon\right) \leq g^{n+1}m^{d+1}\exp\left(-\frac{nd\varepsilon^2}{2(\sigma^2+\kappa^{-1}\varepsilon)}\right). \tag{B.2}
\]
Additionally, the supremum over all $c/2$-regular assignments satisfies:
\[
P\left(\max_{z\in\mathcal{Z}_1,w\in\mathcal{W}_1}\max_{k,\ell}|x_{k,\ell}-\bar{x}_{k\ell}| > \varepsilon\right) \leq g^{n+1}m^{d+1}\exp\left(-\frac{ndc^2\varepsilon^2}{8(\sigma^2+\kappa^{-1}\varepsilon)}\right). \tag{B.3}
\]
Note that Equations (B.2) and (B.3) remain valid when replacing $c/2$ by any $c' < c/2$.

Proof.

The random variables $X_{ij}$ are sub-exponential with parameters $(\sigma^2,1/\kappa)$. Conditionally on $(z^\star,w^\star)$, $z_{+k}w_{+\ell}(x_{k,\ell}-\bar{x}_{k\ell})$ is a sum of $z_{+k}w_{+\ell}$ centered sub-exponential random variables. By Bernstein's inequality [12], we therefore have for all $t>0$
\[
P\big(z_{+k}w_{+\ell}|x_{k,\ell}-\bar{x}_{k\ell}|\geq t\big) \leq 2\exp\left(-\frac{t^2}{2(z_{+k}w_{+\ell}\sigma^2+\kappa^{-1}t)}\right)
\]
In particular, if $t = ndx$,
\[
P\big(\pi_k(z)\rho_\ell(w)|x_{k,\ell}-\bar{x}_{k\ell}|\geq x\big) \leq 2\exp\left(-\frac{ndx^2}{2(\pi_k(z)\rho_\ell(w)\sigma^2+\kappa^{-1}x)}\right) \leq 2\exp\left(-\frac{ndx^2}{2(\sigma^2+\kappa^{-1}x)}\right)
\]
uniformly over $(z,w)$. Equation (B.2) then results from a union bound. Similarly,
\begin{align*}
P\big(|x_{k,\ell}-\bar{x}_{k\ell}|\geq x\big) &= P\big(\pi_k(z)\rho_\ell(w)|x_{k,\ell}-\bar{x}_{k\ell}|\geq \pi_k(z)\rho_\ell(w)x\big)\\
&\leq 2\exp\left(-\frac{ndx^2\pi_k(z)^2\rho_\ell(w)^2}{2\big(\pi_k(z)\rho_\ell(w)\sigma^2+\kappa^{-1}x\,\pi_k(z)\rho_\ell(w)\big)}\right) \leq 2\exp\left(-\frac{ndc^2x^2}{8(\sigma^2+\kappa^{-1}x)}\right)
\end{align*}
where the last inequality comes from the fact that $c/2$-regular assignments satisfy $\pi_k(z)\rho_\ell(w)\geq c^2/4$. Equation (B.3) then results from a union bound over $\mathcal{Z}_1\times\mathcal{W}_1\subset\mathcal{Z}\times\mathcal{W}$.


Lemma B.2. If $X$ is a zero-mean random variable, sub-exponential with parameters $(\sigma^2,b)$, then $|X|$ is sub-exponential with parameters $(8\sigma^2,\,2\sqrt{2}\,b)$.

Proof. Note $\mu = E|X|$ and consider $Y = |X|-\mu$. Choose $\lambda$ such that $|\lambda| < (2\sqrt{2}\,b)^{-1}$. We need to bound $E[e^{\lambda Y}]$. Note first that $E[e^{\lambda Y}]\leq E[e^{\lambda X}]+E[e^{-\lambda X}] < +\infty$ is properly defined by the sub-exponential property of $X$, and we have
\[
E[e^{\lambda Y}] \leq 1 + \sum_{k=2}^{\infty}\frac{|\lambda|^kE[|Y|^k]}{k!}
\]
where we used the fact that $E[Y] = 0$. We now bound the odd moments of $|\lambda Y|$:
\[
E[|\lambda Y|^{2k+1}] \leq \big(E[|\lambda Y|^{2k}]\,E[|\lambda Y|^{2k+2}]\big)^{1/2} \leq \frac{1}{2}\big(\lambda^{2k}E[Y^{2k}] + \lambda^{2k+2}E[Y^{2k+2}]\big)
\]
where we used first Cauchy-Schwarz and then the arithmetic-geometric mean inequality. The Taylor series expansion can thus be reduced to
\begin{align*}
E[e^{\lambda Y}] &\leq 1 + \left(\frac{1}{2}+\frac{1}{2\cdot 3!}\right)E[Y^2]\lambda^2 + \sum_{k=2}^{+\infty}\left(\frac{1}{(2k)!}+\frac{1}{2}\left[\frac{1}{(2k-1)!}+\frac{1}{(2k+1)!}\right]\right)\lambda^{2k}E[Y^{2k}]\\
&\leq \sum_{k=0}^{+\infty}\frac{2^k\lambda^{2k}E[Y^{2k}]}{(2k)!}\\
&\leq \sum_{k=0}^{+\infty}\frac{2^{3k}\lambda^{2k}E[X^{2k}]}{(2k)!} = E\big[\cosh\big(2\sqrt{2}\,\lambda X\big)\big] = E\left[\frac{e^{2\sqrt{2}\lambda X}+e^{-2\sqrt{2}\lambda X}}{2}\right] \leq e^{\frac{8\lambda^2\sigma^2}{2}}
\end{align*}
where we used the well-known inequality $E[|X-E[X]|^k]\leq 2^kE[|X|^k]$ to substitute $2^{2k}E[X^{2k}]$ for $E[Y^{2k}]$.

Proposition B.3 (concentration for sub-exponential variables). Let $X_1,\dots,X_n$ be independent zero-mean random variables, sub-exponential with parameters $(\sigma_i^2,b_i)$. Note $V_0^2 = \sum_i\sigma_i^2$ and $b = \max_ib_i$. Then the random variable $Z$ defined by
\[
Z = \sup_{\substack{\Gamma\in\mathbb{R}^n\\ \|\Gamma\|_\infty\leq M}}\ \sum_i\Gamma_iX_i
\]


is also sub-exponential with parameters $(8M^2V_0^2,\,2\sqrt{2}\,Mb)$. Moreover $E[Z]\leq MV_0\sqrt{n}$, so that for all $t>0$,
\[
P\big(Z-MV_0\sqrt{n}\geq t\big) \leq \exp\left(-\frac{t^2}{2(8M^2V_0^2+2\sqrt{2}\,Mbt)}\right) \tag{B.4}
\]

Proof. Note first that $Z$ can be simplified to $Z = M\sum_i |X_i|$. We just need to bound $\mathbb{E}[Z]$; the rest of the proposition results from the fact that the $|X_i|$ are subexponential with parameters $(8\sigma_i^2, 2\sqrt{2}b_i)$ by Lemma B.2, together with standard properties of sums of independent rescaled subexponential variables. Now,
$$\mathbb{E}[Z] = \mathbb{E}\left[\sup_{\Gamma \in \mathbb{R}^n,\ \|\Gamma\|_\infty \leq M}\ \sum_i \Gamma_i X_i\right] = \mathbb{E}\left[\sum_i M|X_i|\right] \leq M\sum_i \sqrt{\mathbb{E}[X_i^2]} = M\sum_i \sigma_i \leq M\left(\sum_i 1\right)^{1/2}\left(\sum_i \sigma_i^2\right)^{1/2} = MV_0\sqrt{n}$$
using the Cauchy--Schwarz inequality.
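Both conclusions of Proposition B.3 can be illustrated by a small Monte Carlo sketch (assumed setup, not from the paper): we take $X_i = E_i - 1$ with $E_i \sim \mathrm{Exp}(1)$, for which $(\sigma_i^2, b_i) = (4, 2)$ is a standard choice of subexponential parameters, and compare the empirical mean and tail of $Z = M\sum_i |X_i|$ with $MV_0\sqrt{n}$ and the bound (B.4).

```python
import math
import random

random.seed(0)

n, M = 40, 1.0
SIGMA2, B = 4.0, 2.0             # standard subexponential parameters for Exp(1) - 1
V0 = math.sqrt(n * SIGMA2)       # V_0^2 = sum_i sigma_i^2

def draw_Z():
    """One realization of Z = M * sum_i |X_i| with X_i = Exp(1) - 1."""
    return M * sum(abs(random.expovariate(1.0) - 1.0) for _ in range(n))

def tail_bound(t):
    """Right-hand side of (B.4)."""
    return math.exp(-t * t / (2.0 * (8.0 * M**2 * V0**2
                                     + 2.0 * math.sqrt(2.0) * M * B * t)))

N = 20000
samples = [draw_Z() for _ in range(N)]

# E[Z] <= M * V0 * sqrt(n), with a wide margin here
assert sum(samples) / N <= M * V0 * math.sqrt(n)

# Empirical tail frequencies never exceed the bound (B.4)
for t in (5.0, 10.0, 20.0):
    emp = sum(z - M * V0 * math.sqrt(n) >= t for z in samples) / N
    assert emp <= tail_bound(t)
```

The bound is very loose in this regime (the empirical tail is essentially zero), which is expected: (B.4) trades sharpness for uniformity over all subexponential families.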

The final lemma is the workhorse for proving Proposition 4.7.

Lemma B.4. Let $\eta$ and $\eta'$ be two matrices from $\mathcal{M}_{g \times m}(\Theta)$ and $f : \Theta \times \Theta \to \mathbb{R}^+$ a nonnegative function, $A$ a (square) confusion matrix of size $g$ and $B$ a (square) confusion matrix of size $m$. We denote $D_{k\ell k'\ell'} = f(\eta_{k\ell}, \eta'_{k'\ell'})$. Assume that

• all the rows of $\eta$ are distinct;
• all the columns of $\eta$ are distinct;
• $f(x, y) = 0 \Leftrightarrow x = y$;
• each row of $A$ has a nonzero element;
• each row of $B$ has a nonzero element;

and set
$$\Sigma = \sum_{kk'} \sum_{\ell\ell'} A_{kk'} B_{\ell\ell'} D_{k\ell k'\ell'} \qquad \text{(B.5)}$$
Then
$$\Sigma = 0 \quad \Leftrightarrow \quad \begin{cases} A, B \text{ are permutation matrices associated with permutations } s, t, \\ \eta' = \eta^{s,t}, \text{ i.e. } \forall (k, \ell),\ \eta'_{k\ell} = \eta_{s(k)t(\ell)}. \end{cases}$$


Proof. If $A$ and $B$ are the permutation matrices corresponding to the permutations $s$ and $t$, then $A_{ij} = 0$ if $i \neq s(j)$ and $B_{ij} = 0$ if $i \neq t(j)$. As each row of $A$ (resp. $B$) contains a nonzero element, $A_{s(k)k} > 0$ (resp. $B_{t(\ell)\ell} > 0$) for all $k$ (resp. $\ell$), and the sum $\Sigma$ reduces to
$$\Sigma = \sum_{kk'} \sum_{\ell\ell'} A_{kk'} B_{\ell\ell'} D_{k\ell k'\ell'} = \sum_k \sum_\ell A_{s(k)k} B_{t(\ell)\ell} D_{s(k)t(\ell)k\ell}$$
If $\Sigma$ is null, being a sum of nonnegative terms, each term is null. Since all the $A_{s(k)k}$ and $B_{t(\ell)\ell}$ are nonzero, $D_{s(k)t(\ell)k\ell} = 0$ for all $(k, \ell)$, and hence $\eta'_{k\ell} = \eta_{s(k)t(\ell)}$.

Conversely, suppose that $\Sigma = 0$ but $A$ is not a permutation matrix (the same reasoning holds for $B$, or for both). Then $A$ has a column $k$ that contains two nonzero elements, say $A_{k_1 k}$ and $A_{k_2 k}$. Let $\ell \in \{1, \ldots, m\}$; by assumption, there exists $\ell'$ such that $B_{\ell\ell'} \neq 0$. As $\Sigma = 0$, both products $A_{k_1 k} B_{\ell\ell'} D_{k_1\ell k\ell'}$ and $A_{k_2 k} B_{\ell\ell'} D_{k_2\ell k\ell'}$ are zero, so that
$$\begin{cases} A_{k_1 k} B_{\ell\ell'} D_{k_1\ell k\ell'} = 0 \\ A_{k_2 k} B_{\ell\ell'} D_{k_2\ell k\ell'} = 0 \end{cases} \Leftrightarrow \begin{cases} D_{k_1\ell k\ell'} = 0 \\ D_{k_2\ell k\ell'} = 0 \end{cases} \Leftrightarrow \begin{cases} \eta_{k_1\ell} = \eta'_{k\ell'} \\ \eta_{k_2\ell} = \eta'_{k\ell'} \end{cases} \Rightarrow \eta_{k_1\ell} = \eta_{k_2\ell}$$
The previous equality is true for all $\ell$, thus rows $k_1$ and $k_2$ of $\eta$ are identical, which contradicts the assumptions.
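Both directions of Lemma B.4 are easy to check mechanically on a toy example. The sketch below (hypothetical $\eta$, $s$, $t$ and $f(x,y) = (x-y)^2$, all chosen here for illustration) computes the sum (B.5) and verifies that it vanishes exactly when $A$, $B$ are permutation matrices and $\eta' = \eta^{s,t}$, and stays positive otherwise.

```python
# Toy parameter matrix eta (g=2, m=3) with distinct rows and distinct columns
eta = [[0.1, 0.5, 0.9],
       [0.2, 0.6, 0.7]]
g, m = 2, 3
f = lambda x, y: (x - y) ** 2          # f(x, y) = 0 iff x == y

def perm_matrix(p, size):
    """Permutation matrix with M[p(j)][j] = 1 and zeros elsewhere."""
    M = [[0.0] * size for _ in range(size)]
    for j in range(size):
        M[p[j]][j] = 1.0
    return M

def Sigma(A, B, eta_prime):
    """Sum (B.5) with D_{kl,k'l'} = f(eta_{kl}, eta'_{k'l'})."""
    return sum(A[k][kp] * B[l][lp] * f(eta[k][l], eta_prime[kp][lp])
               for k in range(g) for kp in range(g)
               for l in range(m) for lp in range(m))

s, t = (1, 0), (2, 0, 1)               # row and column permutations
A, B = perm_matrix(s, g), perm_matrix(t, m)
eta_st = [[eta[s[k]][t[l]] for l in range(m)] for k in range(g)]

assert Sigma(A, B, eta_st) == 0.0      # permutation matrices and eta' = eta^{s,t}
assert Sigma(A, B, eta) > 0.0          # eta' != eta^{s,t} gives Sigma > 0

# An A with two nonzero entries in one column cannot give Sigma = 0
A_bad = [[1.0, 0.0], [1.0, 1.0]]
assert Sigma(A_bad, B, eta_st) > 0.0
```

The last assertion is exactly the contradiction used in the proof: the column of `A_bad` with two nonzero entries forces two distinct rows of $\eta$ to coincide, which the choice of $\eta$ rules out.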

B.2. Likelihood ratio of assignments

Lemma B.5. Let $\mathcal{Z}_1$ be the subset of $\mathcal{Z}$ of $c$-regular configurations, as defined in Definition 4.1. Let $\mathcal{S}_g = \{\pi = (\pi_1, \pi_2, \ldots, \pi_g) \in [0,1]^g : \sum_{k=1}^g \pi_k = 1\}$ be the $g$-dimensional simplex and set $\mathcal{S}_g^c = \mathcal{S}_g \cap [c, 1-c]^g$. Then there exists a positive constant $M_c$ such that for all $\mathbf{z}, \mathbf{z}^\star$ in $\mathcal{Z}_1$,
$$\left|\log p(\mathbf{z}; \hat{\pi}(\mathbf{z})) - \log p(\mathbf{z}^\star; \hat{\pi}(\mathbf{z}^\star))\right| \leq M_c \|\mathbf{z} - \mathbf{z}^\star\|_0$$

Proof. Consider the entropy map $H : \mathcal{S}_g \to \mathbb{R}$ defined as $H(\pi) = -\sum_{k=1}^g \pi_k \log(\pi_k)$. The gradient $\nabla H$ is uniformly bounded by $\frac{M_c}{2} = \log\frac{1-c}{c}$ in $\|.\|_\infty$-norm over $\mathcal{S}_g \cap [c, 1-c]^g$. Therefore, for all $\pi, \pi^\star \in \mathcal{S}_g \cap [c, 1-c]^g$, we have
$$|H(\pi) - H(\pi^\star)| \leq \frac{M_c}{2}\|\pi - \pi^\star\|_1$$
To prove the inequality, we remark that $\mathbf{z} \in \mathcal{Z}_1$ translates to $\hat{\pi}(\mathbf{z}) \in \mathcal{S}_g \cap [c, 1-c]^g$, that $\log p(\mathbf{z}; \hat{\pi}(\mathbf{z})) - \log p(\mathbf{z}^\star; \hat{\pi}(\mathbf{z}^\star)) = n\left[H(\hat{\pi}(\mathbf{z})) - H(\hat{\pi}(\mathbf{z}^\star))\right]$ and finally that $\|\hat{\pi}(\mathbf{z}) - \hat{\pi}(\mathbf{z}^\star)\|_1 \leq \frac{2}{n}\|\mathbf{z} - \mathbf{z}^\star\|_0$.
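The Lipschitz step in the proof above is easy to probe numerically. The following sketch (illustrative choices of $g$ and $c$) draws random pairs $\pi, \pi^\star$ in $\mathcal{S}_g \cap [c, 1-c]^g$ by rejection sampling and checks $|H(\pi) - H(\pi^\star)| \leq \frac{M_c}{2}\|\pi - \pi^\star\|_1$ with $\frac{M_c}{2} = \log\frac{1-c}{c}$.

```python
import math
import random

random.seed(1)
g, c = 4, 0.05
Mc_half = math.log((1.0 - c) / c)      # M_c / 2 from the proof of Lemma B.5

def H(p):
    """Entropy of a probability vector."""
    return -sum(x * math.log(x) for x in p)

def random_simplex_point():
    """Random point of S_g with all coordinates in [c, 1-c] (rejection sampling)."""
    while True:
        w = [random.random() for _ in range(g)]
        s = sum(w)
        p = [x / s for x in w]
        if all(c <= x <= 1.0 - c for x in p):
            return p

# |H(p) - H(q)| <= (M_c / 2) * ||p - q||_1 on random pairs
for _ in range(1000):
    p, q = random_simplex_point(), random_simplex_point()
    l1 = sum(abs(a - b) for a, b in zip(p, q))
    assert abs(H(p) - H(q)) <= Mc_half * l1 + 1e-12
```

The check passes with room to spare, since on $[c, 1-c]$ the partial derivatives $|-\log \pi_k - 1|$ are in fact somewhat smaller than $\log\frac{1-c}{c}$ for moderate $c$.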


Corollary B.6. Let $\mathbf{z}^\star$ (resp. $\mathbf{w}^\star$) be $c/2$-regular and $\mathbf{z}$ (resp. $\mathbf{w}$) be at $\|.\|_0$-distance at most $nc/4$ of $\mathbf{z}^\star$ (resp. $\mathbf{w}^\star$). Then, for all $\theta \in \Theta$,
$$\frac{p(\mathbf{z}, \mathbf{w}; \theta)}{p(\mathbf{z}^\star, \mathbf{w}^\star; \theta^\star)} \leq O_P(1) \exp\left\{M_{c/4}\left(\|\mathbf{z} - \mathbf{z}^\star\|_0 + \|\mathbf{w} - \mathbf{w}^\star\|_0\right)\right\}$$

Proof. Note that
$$
\begin{aligned}
\frac{p(\mathbf{z}, \mathbf{w}; \theta)}{p(\mathbf{z}^\star, \mathbf{w}^\star; \theta^\star)} = \frac{p(\mathbf{z}, \mathbf{w}; \pi, \rho)}{p(\mathbf{z}^\star, \mathbf{w}^\star; \pi^\star, \rho^\star)} &= \frac{p(\mathbf{z}, \mathbf{w}; \pi, \rho)}{p(\mathbf{z}^\star, \mathbf{w}^\star; \hat{\pi}(\mathbf{z}^\star), \hat{\rho}(\mathbf{w}^\star))} \cdot \frac{p(\mathbf{z}^\star, \mathbf{w}^\star; \hat{\pi}(\mathbf{z}^\star), \hat{\rho}(\mathbf{w}^\star))}{p(\mathbf{z}^\star, \mathbf{w}^\star; \pi^\star, \rho^\star)} \\
&\leq \frac{p(\mathbf{z}, \mathbf{w}; \hat{\pi}(\mathbf{z}), \hat{\rho}(\mathbf{w}))}{p(\mathbf{z}^\star, \mathbf{w}^\star; \hat{\pi}(\mathbf{z}^\star), \hat{\rho}(\mathbf{w}^\star))} \cdot \frac{p(\mathbf{z}^\star, \mathbf{w}^\star; \hat{\pi}(\mathbf{z}^\star), \hat{\rho}(\mathbf{w}^\star))}{p(\mathbf{z}^\star, \mathbf{w}^\star; \pi^\star, \rho^\star)} \\
&\leq \exp\left\{M_{c/4}\left(\|\mathbf{z} - \mathbf{z}^\star\|_0 + \|\mathbf{w} - \mathbf{w}^\star\|_0\right)\right\} \cdot \frac{p(\mathbf{z}^\star, \mathbf{w}^\star; \hat{\pi}(\mathbf{z}^\star), \hat{\rho}(\mathbf{w}^\star))}{p(\mathbf{z}^\star, \mathbf{w}^\star; \pi^\star, \rho^\star)} \\
&\leq O_P(1) \exp\left\{M_{c/4}\left(\|\mathbf{z} - \mathbf{z}^\star\|_0 + \|\mathbf{w} - \mathbf{w}^\star\|_0\right)\right\}
\end{aligned}
$$
where the first inequality comes from the definition of $\hat{\pi}(\mathbf{z})$ and $\hat{\rho}(\mathbf{w})$ as maximizers of the complete likelihood, and the second from Lemma B.5 and the fact that $\mathbf{z}^\star$ and $\mathbf{z}$ (resp. $\mathbf{w}^\star$ and $\mathbf{w}$) are $c/4$-regular. Finally, local asymptotic normality of the MLE for multinomial proportions ensures that $\frac{p(\mathbf{z}^\star, \mathbf{w}^\star; \hat{\pi}(\mathbf{z}^\star), \hat{\rho}(\mathbf{w}^\star))}{p(\mathbf{z}^\star, \mathbf{w}^\star; \pi^\star, \rho^\star)} = O_P(1)$.
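The final $O_P(1)$ step can be illustrated by Wilks' theorem for a single multinomial (an assumed toy setup, not the paper's data): twice the log-likelihood ratio between the MLE $\hat{\pi}$ and the true $\pi^\star$ is asymptotically $\chi^2_{g-1}$, so the ratio itself is bounded in probability. The Monte Carlo sketch below checks that its mean and upper quantiles stay near the $\chi^2_{g-1}$ values.

```python
import math
import random

random.seed(2)
pi_star = [0.3, 0.3, 0.4]   # hypothetical true mixing proportions
g, n, N = 3, 500, 2000

def wilks_stat():
    """2 * [log p(z; pi_hat(z)) - log p(z; pi_star)] for one multinomial draw,
    where pi_hat is the vector of empirical frequencies."""
    counts = [0] * g
    for k in random.choices(range(g), weights=pi_star, k=n):
        counts[k] += 1
    return 2.0 * sum(c * math.log((c / n) / p)
                     for c, p in zip(counts, pi_star) if c > 0)

stats = [wilks_stat() for _ in range(N)]
mean = sum(stats) / N

# Asymptotically chi^2 with g - 1 = 2 degrees of freedom: mean near 2,
# quantiles bounded, hence the likelihood ratio is O_P(1)
assert 1.5 <= mean <= 2.5
assert sorted(stats)[int(0.95 * N)] <= 10.0
```

The generous tolerances absorb Monte Carlo noise and the finite-$n$ correction; the point is only boundedness in probability, not the exact limit law.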

