Transcript: KL distributed icsa vancouver

Distributed Estimation, Information Loss and Exponential Families
Qiang Liu, Department of Computer Science, Dartmouth College

Page 1

Distributed Estimation, Information Loss and Exponential Families

Qiang Liu
Department of Computer Science
Dartmouth College

Page 2

Statistical Learning / Estimation
• Learning generative models from data
  – Topic models (text), computer vision, bioinformatics, …

2

Page 3

Statistical Learning / Estimation
• Find a model (indexed by a parameter) from a distribution family (or set) that best fits the data

3

Data (i.i.d.): $X = \{x_1, \dots, x_n\}$
Model family: $\mathcal{P} = \{p(x \mid \theta) : \theta \in \Theta\}$; best model $p(x \mid \theta^*)$ from $\mathcal{P}$

• Maximum likelihood (a minimal numerical sketch follows below):
  $$\hat{\theta}_{\mathrm{mle}} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)$$
  – Many other traditional methods …
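As a concrete illustration of the MLE objective above, here is a minimal sketch for a univariate Gaussian, maximizing the log-likelihood numerically. The data, names, and optimizer choice are assumptions for illustration only, not from the slides.

# Minimal sketch (illustrative): MLE for a univariate Gaussian by directly
# maximizing the log-likelihood sum_i log p(x_i | theta).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=1.5, scale=2.0, size=1000)    # i.i.d. data x_1, ..., x_n

def neg_log_lik(params, x):
    mu, log_sigma = params                       # parameterize sigma on the log scale
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]), args=(X,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # close to the closed-form MLE: X.mean(), X.std()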

Page 4

Challenge: Big Data

4

• Big data (very large n)
• Learning on distributed data!
• Can't be stored on a single machine!
• Private data

Page 5

Challenge: Big Data

5

Challenge: communication is the bottleneck

• Big data (very large n)
• Learning on distributed data!
• Can't be stored on a single machine!
• Private data

Page 6

Big Data = Distributed Data

6

X1, X2, …, Xd (local data subsets)

• Assumption: randomly partition the instances evenly into d subsets (n/d instances in each subset):
  $X = \{x_1, \dots, x_n\} = X_1 \cup X_2 \cup \cdots \cup X_d$
• Traditional (centralized) MLE doesn't work in general:
  $\hat{\theta}_{\mathrm{mle}} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)$
• Existing methods: distributed implementations of MLE
  – Stochastic gradient descent (SGD), ADMM, etc. …
• This talk: one-shot combination of local MLEs (a minimal sketch of the overall structure follows below)
  – Good approximation to the global MLE
  – Much less communication
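To make the "one-shot" protocol concrete, here is a minimal structural sketch. All names (local_mle, combine_linear) and the Gaussian shard model are placeholders; the choice of combination rule f is exactly what the rest of the talk is about.

# Sketch of the one-shot scheme (illustrative only): each machine fits a local
# MLE on its own shard; a single communication round ships the d parameter
# vectors to a coordinator, which applies some combination rule f.
import numpy as np

def local_mle(shard):
    # Local MLE for a univariate Gaussian shard: (mean, variance).
    return np.array([shard.mean(), shard.var()])

def combine_linear(thetas):
    # Naive choice f = arithmetic mean (discussed next); other choices of f,
    # e.g. KL-averaging, plug in here instead.
    return np.mean(thetas, axis=0)

rng = np.random.default_rng(1)
X = rng.normal(2.0, 1.0, size=10_000)
shards = np.array_split(X, 10)                    # random, even partition
theta_f = combine_linear([local_mle(s) for s in shards])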

Page 7

Combining Local MLEs

7

Global MLE: compute $\hat{\theta}_{\mathrm{mle}}$ by running the MLE on the full data X.
Combining the local MLEs (this talk): compute local MLEs $\hat{\theta}_1, \hat{\theta}_2, \dots, \hat{\theta}_d$ on $X_1, X_2, \dots, X_d$ and combine them into $\hat{\theta}_f = f(\hat{\theta}_1, \dots, \hat{\theta}_d)$.

• Questions:
  – What is the best combination function f(…)?
  – How well does the combination of local MLEs approximate the global MLE?
  – What is the statistical loss from using only the local MLEs?

Page 8

Linear Averaging (Naive)

$\hat{\theta}_{\mathrm{linear}} = \frac{1}{d} \sum_{k} \hat{\theta}_k$

8

• Take the linear average of the local MLEs (Zhang et al. '13; Scott et al. '13)

Page 9

Linear Averaging (Naive)

$\hat{\theta}_{\mathrm{linear}} = \frac{1}{d} \sum_{k} \hat{\theta}_k$

9

• Take the linear average of the local MLEs (Zhang et al. '13; Scott et al. '13)
• Many disadvantages:
  – Doesn't work for discrete, non-additive parameters
  – Unidentifiable parameters
    • Latent variable models: Gaussian mixture, topic models, …

Page 10

Linear Averaging (Naive)

$\hat{\theta}_{\mathrm{linear}} = \frac{1}{d} \sum_{k} \hat{\theta}_k$

• Take the linear average of the local MLEs (Zhang et al. '13; Scott et al. '13)
• Many disadvantages:
  – Doesn't work for discrete, non-additive parameters
  – Unidentifiable parameters
    • Latent variable models: Gaussian mixture, topic models, …

[Figure: two local Gaussian-mixture fits whose component means µ1 and µ2 are swapped; averaging the parameters mixes up the components — labels mismatched]

Page 11

Linear Averaging (Naive)

$\hat{\theta}_{\mathrm{linear}} = \frac{1}{d} \sum_{k} \hat{\theta}_k$

• Take the linear average of the local MLEs (Zhang et al. '13; Scott et al. '13)
• Many disadvantages:
  – Doesn't work for discrete, non-additive parameters
  – Parameter non-identifiability
    • Latent variable models: Gaussian mixture, topic models, …

[Figure: global MLE fit versus local mixture fits with different numbers of components; averaging the parameters is then ill-defined]

Page 12

Linear Averaging (Naive)

$\hat{\theta}_{\mathrm{linear}} = \frac{1}{d} \sum_{k} \hat{\theta}_k$

• Take the linear average of the local MLEs (Zhang et al. '13; Scott et al. '13)
• Many disadvantages:
  – Doesn't work for discrete, non-additive parameters
  – Parameter non-identifiability
    • Latent variable models: Gaussian mixture, LDA, deep nets
  – Result changes with different parameterizations. For example, for a scale parameter:
      $\theta = \sigma^2$ (variance): linear average is $(\sigma_1^2 + \sigma_2^2)/2$
      $\theta = \sigma$ (std): linear average is $(\sigma_1 + \sigma_2)/2$
      $\theta = 1/\sigma$ (precision): linear average is $(1/\sigma_1 + 1/\sigma_2)/2$
    (a small numerical illustration follows below)
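A tiny numerical illustration of the parameterization issue above; the two local estimates are assumed values chosen for illustration, not taken from the slides.

# Linear averaging is not invariant to reparameterization: averaging
# sigma^2, sigma, and 1/sigma gives three different answers.
import numpy as np

sigma1, sigma2 = 1.0, 3.0          # two local estimates of the same scale parameter
avg_var       = (sigma1**2 + sigma2**2) / 2        # theta = sigma^2 (variance)
avg_std       = (sigma1 + sigma2) / 2              # theta = sigma   (std)
avg_precision = (1/sigma1 + 1/sigma2) / 2          # theta = 1/sigma (precision)

# Convert each average back to a standard deviation for comparison:
print(np.sqrt(avg_var), avg_std, 1/avg_precision)  # 2.24, 2.0, 1.5 -- all differ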

Page 13

Linear Averaging (Naive)

$\hat{\theta}_{\mathrm{linear}} = \frac{1}{d} \sum_{k} \hat{\theta}_k$

• Take the linear average of the local MLEs (Zhang et al. '13; Scott et al. '13)
• Many disadvantages:
  – Doesn't work for discrete, non-additive parameters
  – Parameter non-identifiability
    • Latent variable models: Gaussian mixture, LDA, deep nets
  – Result changes with different parameterizations
  – Asymptotically: suboptimal error rate (explained later)

Page 14

Our Algorithm: KL-Averaging

• We propose KL-averaging (a small numerical sketch follows below):
  $$\hat{\theta}_{\mathrm{KL}} = \arg\min_{\theta \in \Theta} \sum_{k} \mathrm{KL}\big(P_{\hat{\theta}_k} \,\|\, P_\theta\big)$$
  where the KL divergence is
  $$\mathrm{KL}\big(P_{\theta'} \,\|\, P_\theta\big) = \int_x p(x \mid \theta') \log \frac{p(x \mid \theta')}{p(x \mid \theta)} \, dx$$
  – Idea: average the models, not the parameters!
  – Geometric mean in the model space

[Figure: the local models $P_{\hat{\theta}_1}, P_{\hat{\theta}_2}, P_{\hat{\theta}_3}$ and their KL-average $P_{\hat{\theta}_{\mathrm{KL}}}$ in model space]
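To make the definition concrete, here is a small sketch that minimizes the sum of KL divergences numerically. It assumes univariate Gaussian local models (so the KL divergence has a closed form); the local parameter values and names are illustrative.

# Sketch: KL-averaging of d univariate Gaussian local models N(mu_k, var_k),
# found by numerically minimizing sum_k KL(P_{theta_k} || P_theta).
import numpy as np
from scipy.optimize import minimize

local = [(-1.0, 0.5), (0.2, 1.5), (0.8, 1.0)]      # local (mu_k, var_k), illustrative

def kl_gauss(mu0, var0, mu1, var1):
    # Closed-form KL( N(mu0, var0) || N(mu1, var1) ).
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def objective(params):
    mu, log_var = params
    return sum(kl_gauss(mu_k, var_k, mu, np.exp(log_var)) for mu_k, var_k in local)

res = minimize(objective, x0=np.array([0.0, 0.0]))
mu_kl, var_kl = res.x[0], np.exp(res.x[1])
# For this exponential family the minimizer simply matches moments:
# mu_kl = mean(mu_k), var_kl = mean(var_k + mu_k^2) - mu_kl^2.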

Page 15

Our Algorithm: KL-Averaging

• We propose KL-averaging:
  $$\hat{\theta}_{\mathrm{KL}} = \arg\min_{\theta \in \Theta} \sum_{k} \mathrm{KL}\big(P_{\hat{\theta}_k} \,\|\, P_\theta\big)$$
  where the KL divergence is
  $$\mathrm{KL}\big(P_{\theta'} \,\|\, P_\theta\big) = \int_x p(x \mid \theta') \log \frac{p(x \mid \theta')}{p(x \mid \theta)} \, dx$$
  – Addresses all the problems above (non-additivity, non-identifiability, reparameterization invariance, …)

[Figure: combining two Gaussian-mixture local fits whose component labels µ1, µ2 are mismatched; KL-averaging operates on the models themselves]

Page 16

Our Algorithm: KL-Averaging

• We propose KL-averaging:
  $$\hat{\theta}_{\mathrm{KL}} = \arg\min_{\theta \in \Theta} \sum_{k} \mathrm{KL}\big(P_{\hat{\theta}_k} \,\|\, P_\theta\big)$$
  where the KL divergence is
  $$\mathrm{KL}\big(P_{\theta'} \,\|\, P_\theta\big) = \int_x p(x \mid \theta') \log \frac{p(x \mid \theta')}{p(x \mid \theta)} \, dx$$
  – Addresses all the problems above (non-additivity, non-identifiability, reparameterization invariance, …)

[Figure: density contours of the global MLE fit and of the KL-averaged fit on the same example]

Page 17

Parametric Bootstrap Interpretation

17

• A parametric bootstrap procedure (a minimal sketch follows below):
  – Resample from each local model: $\tilde{X}_k \sim p(x \mid \hat{\theta}_k)$
  – Maximum likelihood based on $\tilde{X} = [\tilde{X}_1, \dots, \tilde{X}_d]$:
    $$\hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{k} \log p(\tilde{X}_k \mid \theta)$$
• Reduces to $\hat{\theta}_{\mathrm{KL}}$ as the resample size goes to infinity
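A minimal sketch of the parametric bootstrap procedure above, again assuming univariate Gaussian local models; the local parameters, resample size, and names are assumptions for illustration.

# Sketch of the parametric bootstrap / Monte-Carlo approximation of KL-averaging:
# resample X~_k from each local model p(x | theta_k), then take the MLE on the
# pooled pseudo-data. As the resample size grows, this approaches theta_KL.
import numpy as np

local = [(-1.0, 0.5), (0.2, 1.5), (0.8, 1.0)]      # local (mu_k, var_k), illustrative
m = 100_000                                        # resample size per machine

rng = np.random.default_rng(2)
pseudo = np.concatenate(
    [rng.normal(mu_k, np.sqrt(var_k), size=m) for mu_k, var_k in local]
)
mu_kl, var_kl = pseudo.mean(), pseudo.var()        # Gaussian MLE on the pooled pseudo-data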

Page 18

KL-Averaging on Exponential Families

18

• (Full) exponential families:
  $$p(x \mid \theta) = \frac{1}{Z(\theta)} \exp\!\big[\theta^T \phi(x)\big], \qquad \Theta = \{\theta : Z(\theta) < \infty\}$$
  – $\theta$: natural parameter; $\phi(x)$: sufficient statistics; $Z(\theta)$: normalization constant
• Includes:
  – Gaussian, exponential, Gamma, Dirichlet, Bernoulli, …
  – Most undirected graphical models (e.g., the Ising model), …
• Does not include:
  – Latent variable models (Gaussian mixture, topic models)
  – Hierarchical models, Gaussian processes, (parametric) Bayes nets, …

Page 19

KL-Averaging on Exponential Families

19

• KL-averaging exactly recovers the global MLE under exponential families ($\hat{\theta}_{\mathrm{KL}} = \hat{\theta}_{\mathrm{mle}}$)!
• Linear averaging is not exact in general

[Diagram: combining the local MLEs $\hat{\theta}_1, \dots, \hat{\theta}_d$ via KL-averaging gives the same result as running the MLE on the full data X]

Page 20

KL-Averaging on Exponential Families

20

• KL-averaging exactly recovers the global MLE under exponential families ($\hat{\theta}_{\mathrm{KL}} = \hat{\theta}_{\mathrm{mle}}$)!
  – Because the local MLE $\hat{\theta}_k$ is a sufficient statistic of $X_k$
  – Can be viewed as a nonlinear averaging (a small sketch follows below):
    $$\hat{\theta}_{\mathrm{KL}} = g^{-1}\!\left(\frac{g(\hat{\theta}_1) + \cdots + g(\hat{\theta}_d)}{d}\right)$$
    where $g : \theta \mapsto \mu \equiv \mathbb{E}_\theta[\phi(x)]$ is a one-to-one map and $\mu$ is the moment parameter
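A small sketch of the nonlinear averaging above for a univariate Gaussian, whose sufficient statistics are $\phi(x) = (x, x^2)$ so that the moment parameter is $(\mathbb{E}[x], \mathbb{E}[x^2])$. The data, shard count, and names are assumptions for illustration; with equal-sized shards the result coincides with the global MLE up to floating point.

# Sketch: for a full exponential family, theta_KL = g^{-1}( mean_k g(theta_k) ),
# where g maps a parameter to the moment parameter mu = E_theta[phi(x)].
# For a Gaussian, phi(x) = (x, x^2), so g(mu, var) = (mu, var + mu^2).
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(2.0, 1.5, size=12_000)
shards = np.array_split(X, 10)

def g(theta):                       # (mu, var) -> moment parameter (E[x], E[x^2])
    mu, var = theta
    return np.array([mu, var + mu**2])

def g_inv(m):                       # moment parameter -> (mu, var)
    return np.array([m[0], m[1] - m[0]**2])

local_thetas = [np.array([s.mean(), s.var()]) for s in shards]   # local MLEs
theta_kl = g_inv(np.mean([g(t) for t in local_thetas], axis=0))
theta_mle = np.array([X.mean(), X.var()])                        # global MLE
# theta_kl matches theta_mle (up to floating point) since the shards are equal-sized.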

Page 21

Non-Exponential Families

21

• Non-exponential families:
  – $\hat{\theta}_{\mathrm{KL}}$ doesn't equal $\hat{\theta}_{\mathrm{mle}}$ in general
  – Error rate = how nearly "exponential family" they are
• Statistical curvature (Efron '75): the mathematical measure of "exponential-family-ness"
• Geometric intuition:
  – Exponential families = lines / linear subspaces
  – Non-exponential families = curves / curved surfaces, with radius r and curvature $\gamma = 1/r$

Page 22

Curved Exponential Families

22

• Curved exponential families:
  $$p(x \mid \theta) = \frac{1}{Z(\eta(\theta))} \exp\!\big(\eta(\theta)^T \phi(x)\big)$$
  – Assume: the dimension of $\theta$ is smaller than that of $\eta$ (and $\phi$)
  – $\eta(\theta)$: a lower-dimensional curved surface in $\eta$-space

[Figure: the curve $\eta(\theta)$ in $\eta$-space, with radius of curvature $1/\gamma$]

Page 23

Curved Exponential Families

23

• Curved exponential families: $p(x \mid \theta) = \frac{1}{Z(\eta(\theta))} \exp\!\big(\eta(\theta)^T \phi(x)\big)$
• Statistical curvature of $p(x \mid \theta)$ (Efron '75):
  – The geometric curvature of $\eta(\theta)$,
  – with the inner product given by the Fisher information matrix:
    $\langle \eta, \eta' \rangle \overset{\mathrm{def}}{=} \eta^T \Sigma \eta'$, where $\Sigma = \mathrm{cov}[\phi(x)]$

[Figure: the curve $\eta(\theta)$ in $\eta$-space, with radius of curvature $1/\gamma$]

(A small numerical check of this geometric picture follows below.)
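As a small numerical check of the geometric picture, the sketch below uses the bivariate-normal-on-an-ellipse model that appears later in the experiments. It assumes identity covariance, so $\Sigma = I$ and the Fisher inner product is Euclidean, in which case the statistical curvature reduces to the ordinary geometric curvature of $\eta(\theta) = (a\cos\theta,\, b\sin\theta)$; the parameter values are illustrative.

# Sketch: curvature of the curve eta(theta) = (a cos(theta), b sin(theta)),
# assuming Sigma = cov[phi(x)] = I so the Fisher inner product is Euclidean.
import numpy as np

a, b, theta = 1.0, 5.0, np.pi / 4

def eta(t):
    return np.array([a * np.cos(t), b * np.sin(t)])

# Numerical first/second derivatives of eta along the curve.
h = 1e-4
d1 = (eta(theta + h) - eta(theta - h)) / (2 * h)
d2 = (eta(theta + h) - 2 * eta(theta) + eta(theta - h)) / h**2

gamma_numeric = abs(d1[0] * d2[1] - d1[1] * d2[0]) / np.linalg.norm(d1) ** 3
gamma_closed  = a * b / (a**2 * np.sin(theta)**2 + b**2 * np.cos(theta)**2) ** 1.5
# gamma_numeric and gamma_closed agree; a larger curvature means a larger
# information loss in the distributed setting (see the following slides).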

Page 24

Statistical Curvature & Information Loss

24

• Information loss of the MLE (Fisher, Rao, Efron):
  – Fisher information in a sample of size n: $I_n = n I_1$
  – Fisher information $I_{\hat{\theta}_{\mathrm{mle}}}$ in $\hat{\theta}_{\mathrm{mle}}$, based on a sample of size n
  – Statistical curvature:
    $$\gamma^2 = \lim_{n \to \infty} I_1^{-1}\big(I_n - I_{\hat{\theta}_{\mathrm{mle}}}\big)$$
  – Statistical curvature = loss of information when summarizing the data using the MLE
  – Full exponential family:
    • => Zero curvature ($\gamma = 0$)
    • => No information loss
    • => The MLE is a sufficient statistic

Page 25

Statistical Curvature & Information Loss

25

• Information loss of the MLE (Fisher, Rao, Efron):
  – Fisher information in a sample of size n: $I_n = n I_1$
  – Fisher information $I_{\hat{\theta}_{\mathrm{mle}}}$ in $\hat{\theta}_{\mathrm{mle}}$, based on a sample of size n
  – Statistical curvature:
    $$\gamma^2 = \lim_{n \to \infty} I_1^{-1}\big(I_n - I_{\hat{\theta}_{\mathrm{mle}}}\big)$$
  – Statistical curvature = loss of information when summarizing the data using the MLE
  – Doesn't violate the first-order efficiency of the MLE:
    $\lim_{n \to \infty} I_{\hat{\theta}_{\mathrm{mle}}} / I_n = 1$
  – Second-order efficiency of the MLE: $\gamma^2$ is the lower bound

Page 26

Information Loss in Distributed Estimation

26

Global MLE ($X \to \hat{\theta}_{\mathrm{mle}}$): information loss $= \gamma^2$
Combining local MLEs ($X_1, \dots, X_d \to \hat{\theta}_1, \dots, \hat{\theta}_d \to \hat{\theta}_f = f(\hat{\theta}_1, \dots, \hat{\theta}_d)$): information loss $\geq d \cdot \gamma^2$, since each local MLE loses $\gamma^2$ on its own subset

Relative information loss $\geq (d - 1) \cdot \gamma^2$

Page 27

Information Loss & Distributed Estimation

• Lower bound: for any $\hat{\theta}_f = f(\hat{\theta}_1, \dots, \hat{\theta}_d)$, we have
  $$\mathbb{E}\big[\|\sqrt{I}(\hat{\theta}_f - \hat{\theta}_{\mathrm{mle}})\|^2\big] \geq \frac{1}{n^2}(d - 1)\cdot\gamma^2 + o(n^{-2})$$
  – Intuition: project $\hat{\theta}_{\mathrm{mle}}$ onto the space spanned by $(\hat{\theta}_1, \dots, \hat{\theta}_d)$; the expected squared residual is $\frac{1}{n^2}(d - 1)\cdot\gamma^2$

Page 28

Information Loss & Distributed Estimation

• Lower bound: for any $\hat{\theta}_f = f(\hat{\theta}_1, \dots, \hat{\theta}_d)$, we have
  $$\mathbb{E}\big[\|\sqrt{I}(\hat{\theta}_f - \hat{\theta}_{\mathrm{mle}})\|^2\big] \geq \frac{1}{n^2}(d - 1)\cdot\gamma^2 + o(n^{-2})$$
• KL-averaging achieves the lower bound:
  $$\mathbb{E}\big[\|\sqrt{I}(\hat{\theta}_{\mathrm{KL}} - \hat{\theta}_{\mathrm{mle}})\|^2\big] = \frac{1}{n^2}(d - 1)\cdot\gamma^2 + o(n^{-2})$$
• Linear averaging doesn't achieve the lower bound (even on exponential families):
  $$\mathbb{E}\big[\|\sqrt{I}(\hat{\theta}_{\mathrm{linear}} - \hat{\theta}_{\mathrm{mle}})\|^2\big] = \frac{1}{n^2}(d - 1)\cdot\big(\gamma^2 + (d + 1)\,\beta^2\big) + o(n^{-2}), \quad \text{with } \beta \neq 0 \text{ in general}$$

Page 29

Error to the True Parameters

29

• The error w.r.t. the *true parameters* is related to the error w.r.t. the global MLE:
  $$\mathbb{E}\big[\|\sqrt{I}(\hat{\theta}_{\mathrm{linear}} - \theta^*)\|^2\big] \approx \mathrm{MSE}_{\mathrm{mle}} + \frac{1}{n^2}(d - 1)\cdot\big(\gamma^2 + 2\beta^2\big)$$
  $$\mathbb{E}\big[\|\sqrt{I}(\hat{\theta}_{\mathrm{KL}} - \theta^*)\|^2\big] \approx \mathrm{MSE}_{\mathrm{mle}} + \frac{1}{n^2}(d - 1)\cdot\gamma^2$$
• But note that $\mathrm{MSE}_{\mathrm{mle}} \approx \frac{1}{n} I_1^{-1} \gg \frac{1}{n^2}$
  – The difference is small, theoretically and asymptotically
  – But in practice, KL-averaging is more robust in non-asymptotic or non-regular settings

Page 30

[Figure 1: Bivariate normal distributions on an ellipse. Panels: (a) MSE w.r.t. the global MLE, (b) bias w.r.t. the global MLE, (c) MSE w.r.t. the true parameters, (d) bias w.r.t. the true parameters. Curves: linear averaging, KL-averaging, theoretical limits / global MLE; x-axis: total sample size n.]

4. Prove the model consistency of $\hat{\theta}_{\mathrm{KL}}$ in terms of the KL divergence or the Hellinger distance (assuming parameter identifiability)?

5. What happens when it is allowed to send messages iteratively (as in ADMM)? Can we design procedures that monotonically decrease the error across iterations?

6 Experiments

6.1 Bivariate Normal on Ellipse

We start with a toy example on which we demonstrate that our asymptotic analysis matches the simulation very well. Consider the distribution family
$$p(x \mid \theta) \propto \exp\!\Big(-\tfrac{1}{2}(x_1^2 + x_2^2) + a\cos(\theta)\,x_1 + b\sin(\theta)\,x_2\Big), \quad \theta \in (0, 2\pi), \ x \in \mathbb{R}^2.$$
It is a bivariate normal distribution with identity covariance matrix and mean restricted to an ellipse $\eta(\theta) = [a\cos(\theta),\, b\sin(\theta)]$. We can calculate the theoretical bound exactly:
$$n^2\,\mathbb{E}\big[\|\hat{\theta}_{\mathrm{KL}} - \hat{\theta}_{\mathrm{mle}}\|^2\big] \approx \frac{\gamma_\theta^2}{I_\theta}(d - 1) = \frac{(ab)^2}{\big(a^2\sin^2(\theta) + b^2\cos^2(\theta)\big)^4}\,(d - 1).$$
We assume the data is partitioned into 10 groups (d = 10). The ellipse parameters are a = 1, b = 5, and the true parameter is $\theta^* = \pi/4$. Fig. 1 shows the errors and biases as we vary the total sample size (n). The empirical biases and mean square errors match the theoretical predictions closely when the sample size is large (e.g., n ≥ 250). KL-averaging is consistently better than linear averaging, both in terms of recovering the global MLE and recovering the true parameters.

Experiments

30

• 2D Gaussian distribution on an ellipse:
  $$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} a\cos(\theta) \\ b\sin(\theta) \end{pmatrix},\ \sigma^2 I\right)$$
  [Illustration: ellipse with semi-axes a and b]
• Statistical curvature = geometric curvature

[Panel shown on this slide: MSE w.r.t. the global MLE vs. total sample size (n), from Figure 1 above.]
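A compact simulation sketch of this experiment; the sample size, random seed, and function names are assumptions for illustration, following the setup described above (a = 1, b = 5, d = 10, θ* = π/4).

# Sketch of the bivariate-normal-on-ellipse simulation: compare linear averaging
# and KL-averaging of local MLEs against the global MLE.
import numpy as np
from scipy.optimize import minimize_scalar

a, b, d, theta_star = 1.0, 5.0, 10, np.pi / 4

def eta(t):
    return np.array([a * np.cos(t), b * np.sin(t)])

def mle(x):
    # For N(eta(theta), I), the MLE projects the sample mean onto the ellipse.
    xbar = x.mean(axis=0)
    res = minimize_scalar(lambda t: np.sum((xbar - eta(t)) ** 2),
                          bounds=(0.0, 2 * np.pi), method="bounded")
    return res.x

def kl_average(thetas):
    # With identity covariance, KL(P_theta_k || P_theta) = 0.5*||eta(theta_k)-eta(theta)||^2,
    # so KL-averaging projects the mean of the eta(theta_k) onto the ellipse.
    target = np.mean([eta(t) for t in thetas], axis=0)
    res = minimize_scalar(lambda t: np.sum((target - eta(t)) ** 2),
                          bounds=(0.0, 2 * np.pi), method="bounded")
    return res.x

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2)) + eta(theta_star)
shards = np.array_split(X, d)
local = [mle(s) for s in shards]

theta_mle, theta_lin, theta_kl = mle(X), np.mean(local), kl_average(local)
# Over repeated trials, |theta_kl - theta_mle| is typically smaller than
# |theta_lin - theta_mle|, consistent with the asymptotic analysis above.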

Page 31

Experiments

31

• 2D Gaussian distribution on an ellipse:
  $$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\!\left(\begin{pmatrix} a\cos(\theta) \\ b\sin(\theta) \end{pmatrix},\ \sigma^2 I\right)$$
  [Illustration: ellipse with semi-axes a and b]
• Statistical curvature = geometric curvature

[Panel shown on this slide: MSE w.r.t. the true parameters vs. total sample size (n), from the same figure as on the previous slide.]

Page 32

Gaussian mixture on MNIST data:
• Training data partitioned into 10 groups
• KL divergence computed by Monte Carlo (i.e., the parametric bootstrap procedure)
• Testing log-likelihood shown for the different methods

[Figure: testing log-likelihood vs. training sample size (n), random partition. Methods: Local MLEs, Global MLE, Linear-Avg-Matched, Linear-Avg, KL-Avg.]

Page 33

Gaussian mixture on MNIST data:
• Training data partitioned into 10 groups
• KL divergence computed by Monte Carlo (i.e., the parametric bootstrap procedure)
• Testing log-likelihood shown for the different methods

[Figure, two panels: testing log-likelihood vs. training sample size (n), under a random partition and under a label-wise partition. Methods: Local MLEs, Global MLE, Linear-Avg-Matched, Linear-Avg, KL-Avg.]

Page 34

More datasets

34

• YearPredictionMSD dataset from the UCI repository
• Similar setting as before (random partition)

[Figure: testing log-likelihood vs. training sample size (n). Methods: Local MLEs, Global MLE, Linear-Avg-Matched, Linear-Avg, KL-Avg.]

Page 35

Conclusion

35

• Distributed learning: combine local MLEs $\hat{\theta}_1, \dots, \hat{\theta}_d$ computed on $X_1, \dots, X_d$ into $\hat{\theta}_f = f(\hat{\theta}_1, \dots, \hat{\theta}_d)$, versus the global MLE $\hat{\theta}_{\mathrm{mle}}$ computed on the full data X
• Combination function f(·)? KL-averaging vs. linear averaging
• Statistical loss? Exponential vs. non-exponential families; statistical curvature = information loss

Page 36

36

Thanks!