
Transcript
Page 1: Fundamentals cig 4thdec

Fundamentals of Algorithms and Data-Structures in Information-Geometric Spaces

Frank NIELSEN

École Polytechnique, France

Sony Computer Science Laboratories, Inc.

MEXT-ISM Workshop on Information Geometry for Machine Learning

Brain Science Institute, RIKEN

4th December 2014

© 2014 Frank Nielsen 1/75

Page 2: Fundamentals cig 4thdec

Brief historical review of Computational Geometry (CG)

Three research periods:

1. Geometric algorithms: Voronoi/Delaunay, minimum spanning trees, data-structures for proximity queries

2. Geometric computing: robustness, algebraic degree of predicates, programs that work/scale!

3. Computational topology: simplicial complexes, filtrations, input = distance matrix
→ paradigm of Topological Data Analysis (TDA)

Showcasing libraries for CG software:

CGAL http://www.cgal.org/
Geometry Factory http://geometryfactory.com/
Gudhi https://project.inria.fr/gudhi/
Ayasdi http://www.ayasdi.com/

© 2014 Frank Nielsen 1.CG History 2/75

Page 3: Fundamentals cig 4thdec

Outline

Review of the basic algorithmic toolbox in computational geometry: Voronoi diagrams and dual Delaunay, spanning balls

Generalizations of those concepts and toolbox to information spaces:

Riemannian computational information geometry

Dually affine connections computational information geometry

Applications to clustering, learning mixtures, etc.

What is a good/friendly geometric computing space?

© 2014 Frank Nielsen 1.CG History 3/75

Page 4: Fundamentals cig 4thdec

Basics of Euclidean Computational Geometry: Voronoi diagrams and dual Delaunay complexes

© 2014 Frank Nielsen 2.Ordinary CG 4/75

Page 5: Fundamentals cig 4thdec

Euclidean (ordinary) Voronoi diagrams

P = {P_1, ..., P_n}: n distinct point generators in Euclidean space E^d

V(P_i) = {X : D_E(P_i, X) ≤ D_E(P_j, X), ∀ j ≠ i}

Voronoi diagram = cell complex of the V(P_i)'s with their faces

© 2014 Frank Nielsen 2.Ordinary CG 5/75

Page 6: Fundamentals cig 4thdec

Voronoi diagrams from bisectors and ∩ halfspaces

Bisectors: Bi(P, Q) = {X : D_E(P, X) = D_E(Q, X)}
→ are hyperplanes in Euclidean geometry

Voronoi cells as halfspace intersections:
V(P_i) = {X : D_E(P_i, X) ≤ D_E(P_j, X), ∀ j ≠ i} = ∩_{j≠i} Bi^+(P_i, P_j)

D_E(P, Q) = ‖θ(P) − θ(Q)‖_2 = sqrt( Σ_{i=1}^d (θ_i(P) − θ_i(Q))² )

θ(P) = p: Cartesian coordinate system with θ_j(P_i) = p_i^(j).

⇒ Many applications of Voronoi diagrams: crystal growth, codebook/quantization, molecule interfaces/docking, motion planning, etc.

© 2014 Frank Nielsen 2.Ordinary CG 6/75
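As a concrete baseline for this ordinary Euclidean setting, a minimal sketch (assuming NumPy and SciPy are available; not part of the original slides) that builds the Voronoi diagram and its dual Delaunay triangulation of a planar point set:

```python
# Euclidean Voronoi diagram and dual Delaunay triangulation via SciPy's qhull wrappers.
import numpy as np
from scipy.spatial import Voronoi, Delaunay

rng = np.random.default_rng(0)
P = rng.random((10, 2))        # 10 distinct generators in E^2

vor = Voronoi(P)               # Voronoi cell complex
dt = Delaunay(P)               # dual Delaunay simplicial complex

print(vor.vertices)            # Voronoi vertices (circumcenters of Delaunay triangles)
print(dt.simplices)            # triangles given as indices into the generators
```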

Page 7: Fundamentals cig 4thdec

Voronoi diagrams and dual Delaunay simplicial complex

Empty sphere property, max-min angle triangulation, etc.

Voronoi & dual Delaunay triangulation
→ non-degenerate point set = no (d + 2) points co-spherical

Duality: Voronoi k-face ⇔ Delaunay (d − k)-simplex

Bisector Bi(P, Q) perpendicular ⊥ to segment [PQ]

© 2014 Frank Nielsen 2.Ordinary CG 7/75

Page 8: Fundamentals cig 4thdec

Voronoi & Delaunay: Complexity and algorithms

Combinatorial complexity: Θ(n^⌈d/2⌉) (→ quadratic in 3D)
matched for points on the moment curve: t ↦ (t, t², ..., t^d)

Construction: Θ(n log n + n^⌈d/2⌉), optimal

some output-sensitive algorithms but...
Ω(n log n + f): not yet optimal output-sensitive algorithms.

© 2014 Frank Nielsen 2.Ordinary CG 8/75

Page 9: Fundamentals cig 4thdec

Modeling population spaces in information geometry

Population space {P_θ(x)}_θ interpreted as a smooth manifold equipped with the Fisher Information Matrix (FIM):

Riemannian modeling: metric length space with the FIM as metric tensor (orthogonality), and the Levi-Civita metric connection for length-minimizing geodesics

Dual ±1 affine connection modeling: dual geodesics that describe parallel transport, non-metric dual divergences induced by dual potential Legendre convex functions. Dual ±α connections.

→ Algorithmic considerations of these two approaches

Population space, parameter space, object-oriented geometry, etc.

© 2014 Frank Nielsen 3.Information geometry 9/75

Page 10: Fundamentals cig 4thdec

Riemannian computational information geometry from the viewpoint of computing

© 2014 Frank Nielsen 4.Riemannian CIG 10/75

Page 11: Fundamentals cig 4thdec

Population spaces: Hotelling (1930) [12] & Rao (1945) [33]

Birth of differential-geometric methods in statistics.

Fisher information matrix (non-degenerate, positive definite) can be used as a (smooth) Riemannian metric tensor g.

Distance between two populations indexed by θ_1 and θ_2: Riemannian distance (metric length)

First applications in statistics:

Fisher-Hotelling-Rao (FHR) geodesic distance used in classification: find the closest population to a given set of populations

Used in tests of significance (null versus alternative hypothesis), power of a test: P(reject H_0 | H_0 is false) → defines surfaces in population spaces

© 2014 Frank Nielsen 4.Riemannian CIG 11/75

Page 12: Fundamentals cig 4thdec

Rao's distance (1945, introduced by Hotelling 1930 [12])

Infinitesimal squared length element:

ds² = Σ_{i,j} g_ij(θ) dθ_i dθ_j = dθ^⊤ I(θ) dθ

Geodesics and distances are hard to calculate explicitly:

ρ(p(x; θ_1), p(x; θ_2)) = min over curves θ(s), θ(0) = θ_1, θ(1) = θ_2, of ∫_0^1 sqrt( (dθ/ds)^⊤ I(θ) (dθ/ds) ) ds

Rao's distance not known in closed form for multivariate normals

Advantages: metric property of ρ + many tools of differential geometry [1]: Riemannian Log/Exp tangent/manifold mapping

© 2014 Frank Nielsen 4.Riemannian CIG 12/75

Page 13: Fundamentals cig 4thdec

Extrinsic Computational Geometry on tangent planes

Tensor g = Q(x) ≻ 0 defines a smooth inner product ⟨p, q⟩_x = p^⊤ Q(x) q that induces a normed distance:

d_x(p, q) = ‖p − q‖_x = sqrt( (p − q)^⊤ Q(x) (p − q) )

Mahalanobis metric distance on tangent planes:

Δ_Σ(X_1, X_2) = sqrt( (μ_1 − μ_2)^⊤ Σ^{−1} (μ_1 − μ_2) ) = sqrt( Δμ^⊤ Σ^{−1} Δμ )

Cholesky decomposition Σ = L L^⊤:

Δ(X_1, X_2) = D_E(L^{−1} μ_1, L^{−1} μ_2)

CG on tangent planes = ordinary CG on transformed points x′ ← L^{−1} x.

Extrinsic vs intrinsic means [10]

© 2014 Frank Nielsen 4.Riemannian CIG-1.Mahalanobis 13/75
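A minimal sketch (NumPy assumed, not from the slides) of the reduction above: the Mahalanobis distance computed directly, and again as an ordinary Euclidean distance after Cholesky whitening x′ ← L⁻¹x:

```python
import numpy as np

def mahalanobis(x, y, Sigma):
    d = x - y
    return np.sqrt(d @ np.linalg.solve(Sigma, d))      # sqrt(d^T Sigma^{-1} d)

def mahalanobis_whitened(x, y, Sigma):
    L = np.linalg.cholesky(Sigma)                       # Sigma = L L^T
    xw = np.linalg.solve(L, x)                          # x' = L^{-1} x
    yw = np.linalg.solve(L, y)
    return np.linalg.norm(xw - yw)                      # ordinary Euclidean distance

Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x, y = np.array([1.0, 2.0]), np.array([0.0, -1.0])
assert np.isclose(mahalanobis(x, y, Sigma), mahalanobis_whitened(x, y, Sigma))
```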

Page 14: Fundamentals cig 4thdec

Mahalanobis Voronoi diagrams on tangent planes (extrinsic)

In statistics, the covariance matrix Σ accounts for both correlation and dimension (feature) scaling

Dual structure ≡ anisotropic Delaunay triangulation
⇒ empty circumellipse property (Cholesky decomposition)

© 2014 Frank Nielsen 4.Riemannian CIG-1.Mahalanobis 14/75

Page 15: Fundamentals cig 4thdec

Riemannian Mahalanobis metric tensor (Σ^{−1}, PSD)

ρ(p_1, p_2) = sqrt( (p_1 − p_2)^⊤ Σ^{−1} (p_1 − p_2) ),   g(p) = Σ^{−1} = [ 1  −1 ; −1  2 ]

non-conformal geometry: g(p) ≠ f(p) I

© 2014 Frank Nielsen 4.Riemannian CIG-1.Mahalanobis 15/75

Page 16: Fundamentals cig 4thdec

Riemannian statistical Voronoi diagrams

... for statistical population spaces:

Location-scale 2D families have constant non-positive curvature (Hotelling, 1930): Riemannian statistical Voronoi diagrams amount to hyperbolic Voronoi diagrams or Euclidean diagrams (location families only, like isotropic Gaussians)

Multinomial family has spherical geometry on the positive orthant: spherical Voronoi diagram
(compute via stereographic projection ∝ Euclidean Voronoi diagrams)

But for arbitrary families p(x|θ): geodesics not in closed form → limited computational framework in practice (ray shooting, etc.)

© 2014 Frank Nielsen 4.Riemannian CIG-1.Mahalanobis 16/75

Page 17: Fundamentals cig 4thdec

Normal/Gaussian family and 2D location-scale families

Fisher Information Matrix (FIM):

I(θ) = [ I_{i,j}(θ) = E_θ[ ∂/∂θ_i log p(x|θ) · ∂/∂θ_j log p(x|θ) ] ]

FIM for univariate normal/multivariate spherical distributions:

I(μ, σ) = [ 1/σ²  0 ; 0  2/σ² ] = (1/σ²) [ 1  0 ; 0  2 ],   I(μ, σ) = diag( 1/σ², ..., 1/σ², 2/σ² )

→ amounts to the Poincaré metric (dx² + dy²)/y², hyperbolic geometry in the upper half plane/space.

© 2014 Frank Nielsen 4.Riemannian CIG-2.Hyperbolic geometry 17/75

Page 18: Fundamentals cig 4thdec

Riemannian Poincaré upper plane metric tensor (conformal)

cosh ρ(p_1, p_2) = 1 + ‖p_1 − p_2‖² / (2 y_1 y_2),   g(p) = [ 1/y²  0 ; 0  1/y² ] = (1/y²) I

conformal: g(p) = (1/y²) I

© 2014 Frank Nielsen 4.Riemannian CIG-2.Hyperbolic geometry 18/75
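A minimal sketch (NumPy assumed) of the closed-form hyperbolic distance in the Poincaré upper half-plane quoted above:

```python
import numpy as np

def poincare_upper_half_plane_distance(p1, p2):
    # cosh(rho) = 1 + ||p1 - p2||^2 / (2 y1 y2), with y = last coordinate > 0
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return np.arccosh(1.0 + np.sum((p1 - p2) ** 2) / (2.0 * p1[-1] * p2[-1]))

print(poincare_upper_half_plane_distance([0.0, 1.0], [0.0, 2.0]))  # = log 2 along a vertical geodesic
```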

Page 19: Fundamentals cig 4thdec

Matrix SPD spaces and hyperbolic geometry

Symmetric Positive Definite matrices M: ∀x ≠ 0, x^⊤ M x > 0.

2D SPD(2) matrix space has dimension d = 3: a positive cone.

SPD(2) = { (a, b, c) ∈ R³ : a > 0, ab − c² > 0 }

Can be peeled into sheets of dimension 2, each sheet corresponding to a constant value of the determinant of the elements [8]:

SPD(2) = SSPD(2) × R^+,  where SSPD(2) = { (a, b, c) : a > 0, ab − c² = 1 }

Mapping M(a, b, c) → H²:

( x_0 = (a+b)/2 ≥ 1, x_1 = (a−b)/2, x_2 = c ) in the hyperboloid model [28]

z = (a − b + 2ic)/(2 + a + b) in the Poincaré disk [28].

© 2014 Frank Nielsen 4.Riemannian CIG-2.Hyperbolic geometry 19/75

Page 20: Fundamentals cig 4thdec

Riemannian manifolds: Choice of equivalent models?

Many equivalent models of hyperbolic geometry:

Conformal (good for visualization since we can measure angles) versus non-conformal (computationally friendly for geodesics) models.

Convert equivalently to other models of hyperbolic geometry: Poincaré disk, upper half space, hyperboloid, Beltrami hemisphere, etc.

Two questions:

Given a metric tensor g and its induced metric distance ρ_g(p, q), what are the equivalent metric tensors g′ ∼ g such that ρ_g(p, q) = ρ_{g′}(p′, q′)? Is one metric tensor better as a computing space?

Metrics yielding straight geodesics are fully characterized in 2D, but in higher dimensions?

© 2014 Frank Nielsen 4.Riemannian CIG-2.Hyperbolic geometry 20/75

Page 21: Fundamentals cig 4thdec

Riemannian Poincaré disk metric tensor (conformal)

→ often used in Human-Computer Interfaces, network routing (embedding trees), etc.

© 2014 Frank Nielsen 4.Riemannian CIG-2.Hyperbolic geometry 21/75

Page 22: Fundamentals cig 4thdec

Riemannian Klein disk metric tensor (non-conformal)

recommended as a computing space since geodesics are straight line segments

Klein is also conformal at the origin (so we can perform translation from and back to the origin)

Geodesics passing through O in the Poincaré disk are straight (so we can perform translation from and back to the origin)

© 2014 Frank Nielsen 4.Riemannian CIG-2.Hyperbolic geometry 22/75

Page 23: Fundamentals cig 4thdec

Hyperbolic Voronoi diagrams [25, 29]

In arbitrary dimension, H^d:

In the Klein disk, the hyperbolic Voronoi diagram amounts to a clipped affine Voronoi diagram, or a clipped power diagram with an efficient clipping algorithm [5].

Then convert to other models of hyperbolic geometry: Poincaré disk, upper half space, hyperboloid, Beltrami hemisphere, etc.

Conformal (good for visualization) versus non-conformal (good for computing) models.

© 2014 Frank Nielsen 4.Riemannian CIG-2.Hyperbolic geometry 23/75
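A minimal sketch (NumPy assumed, not from the slides) of the model-conversion step mentioned above, mapping points of the open unit disk between the Poincaré and Klein models:

```python
import numpy as np

def poincare_to_klein(p):
    p = np.asarray(p, dtype=float)
    return 2.0 * p / (1.0 + p @ p)          # k = 2p / (1 + ||p||^2)

def klein_to_poincare(k):
    k = np.asarray(k, dtype=float)
    return k / (1.0 + np.sqrt(1.0 - k @ k)) # p = k / (1 + sqrt(1 - ||k||^2))

p = np.array([0.3, 0.4])
k = poincare_to_klein(p)
assert np.allclose(klein_to_poincare(k), p)  # round-trip consistency
```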

Page 24: Fundamentals cig 4thdec

Hyperbolic Voronoi diagrams [25, 29]

Hyperbolic Voronoi diagram in the Klein disk = clipped power diagram.

Power distance: ‖x − p‖² − w_p

→ additively weighted ordinary Voronoi = ordinary CG

© 2014 Frank Nielsen 4.Riemannian CIG-2.Hyperbolic geometry 24/75

Page 25: Fundamentals cig 4thdec

Hyperbolic Voronoi diagrams [25, 29]

5 common models of the abstract hyperbolic geometry

https://www.youtube.com/watch?v=i9IUzNxeH4o (5 min. video)

ACM Symposium on Computational Geometry (SoCG'14)

© 2014 Frank Nielsen 4.Riemannian CIG-2.Hyperbolic geometry 25/75

Page 26: Fundamentals cig 4thdec

Dually affine connection computational information geometry

© 2014 Frank Nielsen 5.Dually flat CIG 26/75

Page 27: Fundamentals cig 4thdec

Dually flat space construction from convex functions F

A convex and strictly differentiable function F(θ) admits a Legendre-Fenchel convex conjugate F*(η):

F*(η) = sup_θ (θ^⊤ η − F(θ)),   ∇F(θ) = η = (∇F*)^{−1}(θ)

Young's inequality gives rise to the canonical divergence [15]:

F(θ) + F*(η′) ≥ θ^⊤ η′  ⇒  A_{F,F*}(θ, η′) = F(θ) + F*(η′) − θ^⊤ η′

Writing it using a single coordinate system, we get the dual Bregman divergences:

B_F(θ_p : θ_q) = F(θ_p) − F(θ_q) − (θ_p − θ_q)^⊤ ∇F(θ_q)
= B_{F*}(η_q : η_p) = A_{F,F*}(θ_p, η_q) = A_{F*,F}(η_q : θ_p)

dual affine coordinate systems with straight geodesics:
η = ∇F(θ) ⇔ θ = ∇F*(η).  Tensor g(θ) = g*(η)

© 2014 Frank Nielsen 5.Dually flat CIG 27/75
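A minimal sketch (NumPy assumed) of the Bregman divergence B_F induced by a generator F, instantiated for two classic generators; the squared Euclidean and extended Kullback-Leibler cases below are standard examples, not taken from the slides:

```python
import numpy as np

def bregman(F, gradF, p, q):
    # B_F(p : q) = F(p) - F(q) - <p - q, grad F(q)>
    return F(p) - F(q) - (p - q) @ gradF(q)

F_sq   = lambda x: 0.5 * x @ x                  # F(x) = 1/2 ||x||^2
gF_sq  = lambda x: x
F_ent  = lambda x: np.sum(x * np.log(x) - x)    # Shannon negative entropy generator
gF_ent = lambda x: np.log(x)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
print(bregman(F_sq, gF_sq, p, q))    # = 1/2 ||p - q||^2
print(bregman(F_ent, gF_ent, p, q))  # = extended KL(p : q) = sum p log(p/q) + q - p
```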

Page 28: Fundamentals cig 4thdec

Dual divergence/Bregman dual bisectors [6, 24, 26]

Bregman sided (reference) bisectors related by convex duality:

Bi_F(θ_1, θ_2) = { θ ∈ Θ | B_F(θ : θ_1) = B_F(θ : θ_2) }
Bi_{F*}(η_1, η_2) = { η ∈ H | B_{F*}(η : η_1) = B_{F*}(η : η_2) }

Right-sided bisector: → θ-hyperplane, η-hypersurface

H_F(p, q) = { x ∈ X | B_F(x : p) = B_F(x : q) }.
H_F: ⟨∇F(p) − ∇F(q), x⟩ + (F(p) − F(q) + ⟨q, ∇F(q)⟩ − ⟨p, ∇F(p)⟩) = 0

Left-sided bisector: → θ-hypersurface, η-hyperplane

H′_F(p, q) = { x ∈ X | B_F(p : x) = B_F(q : x) }
H′_F: ⟨∇F(x), q − p⟩ + F(p) − F(q) = 0

hyperplane = autoparallel submanifold of dimension d − 1

© 2014 Frank Nielsen 5.Dually flat CIG-1.bisector 28/75

Page 29: Fundamentals cig 4thdec

Visualizing Bregman bisectors

Primal coordinates θ (natural parameters) and dual coordinates η (expectation parameters).

[Figure: Itakura-Saito source space with p(0.52977081, 0.72041688), q(0.85824458, 0.29083834), D(p,q) = 0.66969016, D(q,p) = 0.44835617; Itakura-Saito dual gradient space with p′(−1.88760873, −1.38808518), q′(−1.16516903, −3.43833618), D*(p′,q′) = 0.44835617, D*(q′,p′) = 0.66969016]

Bi(P, Q) and Bi*(P, Q) can be expressed in either the θ or the η coordinate system

© 2014 Frank Nielsen 5.Dually flat CIG-1.bisector 29/75

Page 30: Fundamentals cig 4thdec

Spaces of spheres: 1-to-1 mapping between d-spheres and (d + 1)-hyperplanes using potential functions

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 30/75

Page 31: Fundamentals cig 4thdec

Space of Bregman spheres and Bregman balls [6]

Dual sided Bregman balls (bounding Bregman spheres):

Ball^r_F(c, r) = { x ∈ X | B_F(x : c) ≤ r }
Ball^l_F(c, r) = { x ∈ X | B_F(c : x) ≤ r }

Legendre duality:

Ball^l_F(c, r) = (∇F)^{−1}( Ball^r_{F*}(∇F(c), r) )

Illustration for the Itakura-Saito divergence, F(x) = − log x

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 31/75

Page 32: Fundamentals cig 4thdec

Space of Bregman spheres: Lifting map [6]

F: x ↦ x̂ = (x, F(x)), a hypersurface in R^{d+1}, the potential function graph

H_p: tangent hyperplane at p̂, z = H_p(x) = ⟨x − p, ∇F(p)⟩ + F(p)

Bregman sphere σ ⟶ σ̂ with supporting hyperplane
H_σ: z = ⟨x − c, ∇F(c)⟩ + F(c) + r.
(parallel to H_c and shifted vertically by r)
σ̂ = F ∩ H_σ.

The intersection of any hyperplane H with F projects onto X as a Bregman sphere:

H: z = ⟨x, a⟩ + b  →  σ: Ball_F( c = (∇F)^{−1}(a), r = ⟨a, c⟩ − F(c) + b )

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 32/75

Page 33: Fundamentals cig 4thdec

Lifting/Polarity: Potential function graph F

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 33/75

Page 34: Fundamentals cig 4thdec

Space of Bregman spheres: Algorithmic applications [6]

Union/intersection of Bregman d-spheres from the representational (d + 1)-polytope [6]

Radical axis of two Bregman balls is a hyperplane: applications to nearest neighbor search trees like Bregman ball trees or Bregman vantage point trees [31].

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 34/75

Page 35: Fundamentals cig 4thdec

Bregman proximity data structures [31]

Vantage point trees: partition space according to Bregman balls

Partitioning space with intersections of Kullback-Leibler balls
→ efficient nearest neighbour queries in information spaces

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 35/75

Page 36: Fundamentals cig 4thdec

Application: Minimum Enclosing Ball [23, 32]

To a hyperplane H_σ = H(a, b): z = ⟨a, x⟩ + b in R^{d+1} corresponds a ball σ = Ball(c, r) in R^d with center c = ∇F*(a) and radius:

r = ⟨a, c⟩ − F(c) + b = ⟨a, ∇F*(a)⟩ − F(∇F*(a)) + b = F*(a) + b

since F(∇F*(a)) = ⟨∇F*(a), a⟩ − F*(a) (Young equality)

SEB: find the halfspace H(a, b)^−: z ≤ ⟨a, x⟩ + b that contains all lifted points:

min_{a,b} r = F*(a) + b,
∀i ∈ {1, ..., n}, ⟨a, x_i⟩ + b − F(x_i) ≥ 0

→ Convex Program (CP) with linear inequality constraints

F(θ) = F*(η) = (1/2) x^⊤ x: CP → Quadratic Programming (QP) [11] used in SVM. Smallest enclosing ball used as a primitive in SVM [34]

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 36/75
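A minimal sketch (SciPy assumed; a generic nonlinear solver stands in for the dedicated QP solver cited above) of this convex program for the generator F(x) = ½x⊤x, i.e. minimize F*(a) + b subject to ⟨a, x_i⟩ + b − F(x_i) ≥ 0:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.random((20, 2))                          # input point set in R^2
F = lambda x: 0.5 * np.dot(x, x)                 # generator; here F* = F

def objective(ab):                               # r = F*(a) + b
    a, b = ab[:-1], ab[-1]
    return 0.5 * np.dot(a, a) + b

cons = [{"type": "ineq",
         "fun": lambda ab, xi=xi: ab[:-1] @ xi + ab[-1] - F(xi)}
        for xi in X]                             # <a, x_i> + b - F(x_i) >= 0

res = minimize(objective, x0=np.zeros(3), constraints=cons)
a, b = res.x[:-1], res.x[-1]
center = a                                       # c = grad F*(a) = a
r = 0.5 * a @ a + b                              # Bregman radius r = max_i 1/2 ||x_i - c||^2
print(center, np.sqrt(2 * r))                    # Euclidean center and radius
```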

Page 37: Fundamentals cig 4thdec

Smallest Bregman enclosing balls [32, 22]

Algorithm 1: BBCA(P, l)
c_1 ← choose a point at random in P;
for i = 2 to l − 1 do
  // farthest point from c_i wrt B_F
  s_i ← argmax_{j=1..n} B_F(c_i : p_j);
  // update the center: walk on the η-segment [c_i, p_{s_i}]_η
  c_{i+1} ← ∇F^{−1}( ∇F(c_i) #_{1/(i+1)} ∇F(p_{s_i}) );
end
// Return the SEBB approximation
return Ball(c_l, r_l = B_F(c_l : X));

θ-, η-geodesic segments in dually flat geometry.

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 37/75
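A minimal sketch (NumPy assumed, not the authors' implementation) of the BBCA iteration above, instantiated for the extended Kullback-Leibler divergence, whose generator F(x) = Σ x_i log x_i − x_i gives ∇F = log and (∇F)⁻¹ = exp:

```python
import numpy as np

def bregman_kl(p, q):                          # B_F(p : q) = extended KL divergence
    return np.sum(p * np.log(p / q) - p + q)

def bbca_kl(P, iterations=100):
    c = P[0].copy()                            # c_1: first (or a random) point of P
    for i in range(1, iterations):
        j = np.argmax([bregman_kl(c, p) for p in P])   # farthest point wrt B_F(c_i : .)
        eta_c, eta_j = np.log(c), np.log(P[j])         # switch to eta-coordinates
        eta_c = eta_c + (eta_j - eta_c) / (i + 1)      # walk 1/(i+1) along the eta-segment
        c = np.exp(eta_c)                              # back to primal coordinates
    r = max(bregman_kl(c, p) for p in P)
    return c, r

rng = np.random.default_rng(2)
P = rng.random((50, 3)) + 0.1                  # positive vectors
center, radius = bbca_kl(P)
```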

Page 38: Fundamentals cig 4thdec

Smallest enclosing balls: Core-sets [32]

Core-set C ⊆ S: SOL(S) ≤ SOL(C) ≤ (1 + ε) SOL(S)

extended Kullback-Leibler / Itakura-Saito

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 38/75

Page 39: Fundamentals cig 4thdec

InSphere predicates wrt Bregman divergences [6]

Implicit representation of Bregman spheres/balls: consider d + 1 support points on the boundary

Is x inside the Bregman ball defined by d + 1 support points?

InSphere(x; p_0, ..., p_d) = det [ 1 ... 1 1 ; p_0 ... p_d x ; F(p_0) ... F(p_d) F(x) ]

sign of a (d + 2) × (d + 2) matrix determinant

InSphere(x; p_0, ..., p_d) is negative, null or positive depending on whether x lies inside, on, or outside σ.

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 39/75
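A minimal sketch (NumPy assumed) of the determinant predicate, here with the generator F(x) = ½‖x‖², which recovers the classical Euclidean in-circle test; the sign convention depends on the orientation of the support points:

```python
import numpy as np

def insphere(x, supports, F):
    pts = list(supports) + [x]                        # d+1 support points then the query
    M = np.vstack([np.ones(len(pts)),                 # row of ones
                   np.column_stack(pts),              # one column of coordinates per point
                   [F(p) for p in pts]])              # lifted coordinates F(p)
    return np.sign(np.linalg.det(M))                  # sign of the (d+2) x (d+2) determinant

F = lambda p: 0.5 * np.dot(p, p)
supports = [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([0.0, 2.0])]
print(insphere(np.array([1.0, 1.0]), supports, F))    # negative: inside the circumcircle
print(insphere(np.array([3.0, 3.0]), supports, F))    # positive: outside
```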

Page 40: Fundamentals cig 4thdec

Smallest enclosing ball in Riemannian manifolds [2]

c = a #^M_t b: point γ(t) on the geodesic line segment [ab] wrt M such that ρ_M(a, c) = t × ρ_M(a, b) (with ρ_M the metric distance on manifold M)

Algorithm 2: GeoA
c_1 ← choose a point at random in P;
for i = 2 to l do
  // farthest point from c_i
  s_i ← argmax_{j=1..n} ρ(c_i, p_j);
  // update the center: walk on the geodesic line segment [c_i, p_{s_i}]
  c_{i+1} ← c_i #^M_{1/(i+1)} p_{s_i};
end
// Return the SEB approximation
return Ball(c_l, r_l = ρ(c_l, P));

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 40/75
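A minimal sketch (NumPy assumed, not the authors' applet) instantiating GeoA on the hyperboloid model of H^d, where the geodesic walk a #_t b has the standard closed form used below:

```python
import numpy as np

def minkowski(a, b):                        # Lorentzian inner product <a,b>_L
    return a[0] * b[0] - a[1:] @ b[1:]

def dist(a, b):
    return np.arccosh(max(minkowski(a, b), 1.0))

def geodesic_point(a, b, t):                # a #_t b with rho(a, c) = t * rho(a, b)
    d = dist(a, b)
    if d == 0.0:
        return a
    return (np.sinh((1 - t) * d) * a + np.sinh(t * d) * b) / np.sinh(d)

def geo_a(P, iterations=1000):
    c = P[0]
    for i in range(1, iterations):
        far = max(P, key=lambda p: dist(c, p))        # farthest point from c_i
        c = geodesic_point(c, far, 1.0 / (i + 1))     # walk 1/(i+1) of the way
    return c, max(dist(c, p) for p in P)

# lift planar points x onto the hyperboloid: (sqrt(1 + |x|^2), x)
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
P = [np.concatenate(([np.sqrt(1 + x @ x)], x)) for x in X]
center, radius = geo_a(P)
```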

Page 41: Fundamentals cig 4thdec

Approximating the smallest enclosing ball in hyperbolic space

Initialization, first iteration, second iteration, third iteration, fourth iteration, ..., after 10^4 iterations

http://www.sonycsl.co.jp/person/nielsen/infogeo/RiemannMinimax/

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 41/75

Page 42: Fundamentals cig 4thdec

Bregman dual regular/Delaunay triangulations

Embedded geodesic Delaunay triangulations + empty Bregman balls

Delaunay / Exponential Del. / Hellinger-like Del.

empty Bregman sphere property,
geodesic triangles: embedded Delaunay.

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 42/75

Page 43: Fundamentals cig 4thdec

Dually orthogonal Bregman Voronoi & triangulations

Ordinary Voronoi diagram is perpendicular to the Delaunay triangulation:
Voronoi k-face ⊥ Delaunay (d − k)-face

Bi(P, Q) ⊥ γ*(P, Q)
γ(P, Q) ⊥ Bi*(P, Q)

© 2014 Frank Nielsen 5.Dually flat CIG-2.Space of spheres 43/75

Page 44: Fundamentals cig 4thdec

Synthetic geometry: Exact characterization of the Bayesian error exponent but no closed form known

© 2014 Frank Nielsen 6.Bayesian error exponent 44/75

Page 45: Fundamentals cig 4thdec

Bayesian hypothesis testing, MAP rule and probability of error P_e

Mixture p(x) = Σ_i w_i p_i(x). Task = classify x: which component?

Prior probabilities: w_i = P(X ∼ P_i) > 0 (with Σ_{i=1}^n w_i = 1)
Conditional probabilities: P(X = x | X ∼ P_i).

P(X = x) = Σ_{i=1}^n P(X ∼ P_i) P(X = x | X ∼ P_i) = Σ_{i=1}^n w_i P(X | P_i)

Best rule = Maximum a posteriori probability (MAP) rule:

map(x) = argmax_{i ∈ {1,...,n}} w_i p_i(x)

where p_i(x) = P(X = x | X ∼ P_i) are the conditional probabilities.

For w_1 = w_2 = 1/2, probability of error

P_e = (1/2) ∫ min(p_1(x), p_2(x)) dx ≤ (1/2) ∫ p_1(x)^α p_2(x)^{1−α} dx, for α ∈ (0, 1).

Best exponent α*?

© 2014 Frank Nielsen 6.Bayesian error exponent 45/75

Page 46: Fundamentals cig 4thdec

Error exponent for exponential families

Exponential families have finite-dimensional sufficient statistics: → reduce n data to D statistics.

∀x ∈ X, P(x|θ) = exp(θ^⊤ t(x) − F(θ) + k(x))

F(·): log-normalizer/cumulant/partition function, k(x): auxiliary term for the carrier measure.

Maximum likelihood estimator (MLE): ∇F(θ) = (1/n) Σ_i t(X_i) = η

Bijection between exponential families and Bregman divergences:

log p(x|θ) = −B_{F*}(t(x) : η) + F*(t(x)) + k(x)

Exponential families are log-concave

© 2014 Frank Nielsen 6.Bayesian error exponent 46/75

Page 47: Fundamentals cig 4thdec

Geometry of the best error exponent

On the exponential family manifold, the Chernoff α-coefficient [7]:

c_α(P_{θ_1} : P_{θ_2}) = ∫ p_{θ_1}^α(x) p_{θ_2}^{1−α}(x) dμ(x) = exp( −J_F^{(α)}(θ_1 : θ_2) )

Skew Jensen divergence [20] on the natural parameters:

J_F^{(α)}(θ_1 : θ_2) = α F(θ_1) + (1 − α) F(θ_2) − F(θ_{12}^{(α)})

Chernoff information = Bregman divergence for exponential families:

C(P_{θ_1} : P_{θ_2}) = B(θ_1 : θ_{12}^{(α*)}) = B(θ_2 : θ_{12}^{(α*)})

Finding the best error exponent α*?

© 2014 Frank Nielsen 6.Bayesian error exponent 47/75

Page 48: Fundamentals cig 4thdec

Geometry of the best error exponent: binary hypothesis [17]

Chernoff distribution P*:

P* = P_{θ*_{12}} = G_e(P_1, P_2) ∩ Bi_m(P_1, P_2)

e-geodesic:

G_e(P_1, P_2) = { E_{12}^{(λ)} | θ(E_{12}^{(λ)}) = (1 − λ)θ_1 + λθ_2, λ ∈ [0, 1] },

m-bisector:

Bi_m(P_1, P_2): { P | F(θ_1) − F(θ_2) + η(P)^⊤ Δθ = 0 },

Optimal natural parameter of P*:

θ* = θ_{12}^{(α*)} = argmin_{θ ∈ Θ} B(θ_1 : θ) = argmin_{θ ∈ Θ} B(θ_2 : θ).

→ closed form for order-1 families, or efficient bisection search.

© 2014 Frank Nielsen 6.Bayesian error exponent 48/75
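A minimal sketch (NumPy assumed) of the bisection search mentioned above, locating the optimal point θ* on the e-geodesic for two Poisson distributions (F(θ) = e^θ, θ = log rate), at which the two Bregman "radii" B(θ_1 : θ) and B(θ_2 : θ) coincide:

```python
import numpy as np

F = np.exp
gradF = np.exp

def bregman(t1, t2):                                    # B_F(t1 : t2) on natural parameters
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def chernoff_point(theta1, theta2, tol=1e-12):
    lo, hi = 0.0, 1.0                                   # lambda parametrizes the e-geodesic
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        t = (1 - lam) * theta1 + lam * theta2
        if bregman(theta1, t) < bregman(theta2, t):     # still closer to theta1: move toward theta2
            lo = lam
        else:
            hi = lam
    return 0.5 * (lo + hi)

theta1, theta2 = np.log(1.0), np.log(10.0)              # Poisson rates 1 and 10
lam_star = chernoff_point(theta1, theta2)
t_star = (1 - lam_star) * theta1 + lam_star * theta2
print(lam_star, bregman(theta1, t_star), bregman(theta2, t_star))  # equal Bregman radii at theta*
```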

Page 49: Fundamentals cig 4thdec

Geometry of the best error exponent: binary hypothesis

P* = P_{θ*_{12}} = G_e(P_1, P_2) ∩ Bi_m(P_1, P_2)

[Figure, in the η-coordinate system: the m-bisector Bi_m(P_{θ_1}, P_{θ_2}) and the e-geodesic G_e(P_{θ_1}, P_{θ_2}) intersect at p_{θ*_{12}}, with C(θ_1 : θ_2) = B(θ_1 : θ*_{12})]

Binary hypothesis testing: P_e bounded using the Bregman divergence between the Chernoff distribution and the class-conditional distributions.

© 2014 Frank Nielsen 6.Bayesian error exponent 49/75

Page 50: Fundamentals cig 4thdec

Clustering and learning finite statistical mixtures

© 2014 Frank Nielsen 6.Bayesian error exponent 50/75

Page 51: Fundamentals cig 4thdec

α-divergences

For α ∈ R, α ≠ ±1, the α-divergences [9] on positive arrays [36]:

D_α(p : q) := Σ_{i=1}^d 4/(1 − α²) ( (1 − α)/2 · p_i + (1 + α)/2 · q_i − p_i^{(1−α)/2} q_i^{(1+α)/2} )

with D_α(p : q) = D_{−α}(q : p), and in the limit cases D_{−1}(p : q) = KL(p : q) and D_1(p : q) = KL(q : p), where KL is the extended Kullback-Leibler divergence KL(p : q) := Σ_{i=1}^d ( p_i log(p_i/q_i) + q_i − p_i )

α-divergences belong to the class of Csiszár f-divergences I_f(p : q) := Σ_{i=1}^d q_i f(p_i/q_i), with the following generator:

f(t) = 4/(1 − α²) (1 − t^{(1+α)/2}) if α ≠ ±1;   t ln t if α = 1;   − ln t if α = −1

Information monotonicity

© 2014 Frank Nielsen 6.Bayesian error exponent 51/75
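A minimal sketch (NumPy assumed) of D_α on positive arrays, with the reference duality D_α(p : q) = D_{−α}(q : p) and the KL limit cases checked numerically:

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    if np.isclose(alpha, -1.0):                       # limit case: extended KL(p : q)
        return np.sum(p * np.log(p / q) + q - p)
    if np.isclose(alpha, 1.0):                        # limit case: extended KL(q : p)
        return np.sum(q * np.log(q / p) + p - q)
    c = 4.0 / (1.0 - alpha ** 2)
    return c * np.sum((1 - alpha) / 2 * p + (1 + alpha) / 2 * q
                      - p ** ((1 - alpha) / 2) * q ** ((1 + alpha) / 2))

p = np.array([0.1, 0.6, 0.3])
q = np.array([0.3, 0.3, 0.4])
print(alpha_divergence(p, q, 0.5), alpha_divergence(q, p, -0.5))      # equal by duality
print(alpha_divergence(p, q, -0.999), alpha_divergence(p, q, -1.0))   # approaches KL(p : q)
```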

Page 52: Fundamentals cig 4thdec

Mixed divergences [30]

Defined on three parameters p, q and r:

M_λ(p : q : r) := λ D(p : q) + (1 − λ) D(q : r)

for λ ∈ [0, 1].

Mixed divergences include:

the sided divergences for λ ∈ {0, 1},

the symmetrized (arithmetic mean) divergence for λ = 1/2, or the skew symmetrized divergences for λ ≠ 1/2.

© 2014 Frank Nielsen 7.Mixed divergences 52/75

Page 53: Fundamentals cig 4thdec

Symmetrizing α-divergences

S_α(p, q) = (1/2) (D_α(p : q) + D_α(q : p)) = S_{−α}(p, q)
= M_{1/2}(p : q : p),

For α = ±1, we get half of Jeffreys divergence:

S_{±1}(p, q) = (1/2) Σ_{i=1}^d (p_i − q_i) log(p_i/q_i)

Centroids for the symmetrized α-divergence are usually not in closed form.
How to perform center-based clustering without closed-form centroids?

© 2014 Frank Nielsen 7.Mixed divergences 53/75

Page 54: Fundamentals cig 4thdec

Jeffreys positive centroid [16]

Jeffreys divergence is the symmetrized α = ±1 divergence.

The Jeffreys positive centroid c = (c^1, ..., c^d) of a set h_1, ..., h_n of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:

c^i = a^i / W( (a^i / g^i) e )

where a^i = Σ_{j=1}^n π_j h_j^i denotes the coordinate-wise arithmetic weighted means and g^i = Π_{j=1}^n (h_j^i)^{π_j} the coordinate-wise geometric weighted means.

The Lambert analytic function W [4] (positive branch) is defined by W(x) e^{W(x)} = x for x ≥ 0.

→ Jeffreys k-means clustering. But for α ≠ 1, how to cluster?

© 2014 Frank Nielsen 7.Mixed divergences 54/75
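A minimal sketch (NumPy/SciPy assumed, not the paper's code) of the component-wise Jeffreys positive centroid via the principal branch of the Lambert W function:

```python
import numpy as np
from scipy.special import lambertw

def jeffreys_positive_centroid(H, weights):
    H = np.asarray(H, dtype=float)                     # shape (n, d), positive entries
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    a = w @ H                                          # coordinate-wise arithmetic weighted mean
    g = np.exp(w @ np.log(H))                          # coordinate-wise geometric weighted mean
    return a / np.real(lambertw(a * np.e / g))         # c^i = a^i / W(a^i e / g^i)

H = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 3.0, 1.0]])
c = jeffreys_positive_centroid(H, weights=[0.2, 0.5, 0.3])
print(c)   # lies coordinate-wise between the geometric and arithmetic means
```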

Page 55: Fundamentals cig 4thdec

Mixed α-divergences / α-Jeffreys symmetrized divergence

Mixed α-divergence between a histogram x and two histograms p and q:

M_{λ,α}(p : x : q) = λ D_α(p : x) + (1 − λ) D_α(x : q)
= λ D_{−α}(x : p) + (1 − λ) D_{−α}(q : x)
= M_{1−λ,−α}(q : x : p),

The α-Jeffreys symmetrized divergence is obtained for λ = 1/2:

S_α(p, q) = M_{1/2,α}(q : p : q) = M_{1/2,α}(p : q : p)

The skew symmetrized α-divergence is defined by:

S_{λ,α}(p : q) = λ D_α(p : q) + (1 − λ) D_α(q : p)

© 2014 Frank Nielsen 7.Mixed divergences 55/75

Page 56: Fundamentals cig 4thdec

Mixed divergence-based k-means clustering

k distinct seeds from the dataset with l_i = r_i.

Input: Weighted histogram set H, divergence D(·, ·), integer k > 0, real λ ∈ [0, 1];
Initialize left-sided/right-sided seeds C = {(l_i, r_i)}_{i=1}^k;
repeat
  // Assignment
  for i = 1, 2, ..., k do
    C_i ← { h ∈ H : i = argmin_j M_λ(l_j : h : r_j) };
  end
  // Dual-sided centroid relocation
  for i = 1, 2, ..., k do
    r_i ← argmin_x D(C_i : x) = Σ_{h ∈ C_i} w_j D(h : x);
    l_i ← argmin_x D(x : C_i) = Σ_{h ∈ C_i} w_j D(x : h);
  end
until convergence;

Different from k-means clustering with respect to the symmetrized divergences.

© 2014 Frank Nielsen 7.Mixed divergences 56/75

Page 57: Fundamentals cig 4thdec

Mixed α-hard clustering: MAhC(H, k, λ, α)

Input: Weighted histogram set H, integer k > 0, real λ ∈ [0, 1], real α ∈ R;
Let C = {(l_i, r_i)}_{i=1}^k ← MAS(H, k, λ, α);
repeat
  // Assignment
  for i = 1, 2, ..., k do
    A_i ← { h ∈ H : i = argmin_j M_{λ,α}(l_j : h : r_j) };
  end
  // Centroid relocation
  for i = 1, 2, ..., k do
    r_i ← ( Σ_{h ∈ A_i} w_i h^{(1−α)/2} )^{2/(1−α)};
    l_i ← ( Σ_{h ∈ A_i} w_i h^{(1+α)/2} )^{2/(1+α)};
  end
until convergence;

© 2014 Frank Nielsen 7.Mixed divergences 57/75

Page 58: Fundamentals cig 4thdec

Coupled k-means++ α-seeding

Algorithm 3: Mixed α-seeding; MAS(H, k, λ, α)
Input: Weighted histogram set H, integer k ≥ 1, real λ ∈ [0, 1], real α ∈ R;
Let C ← h_j with uniform probability;
for i = 2, 3, ..., k do
  Pick at random a histogram h ∈ H with probability:

  π_H(h) := w_h M_{λ,α}(c_h : h : c_h) / Σ_{y ∈ H} w_y M_{λ,α}(c_y : y : c_y),   (1)

  // where (c_h, c_h) := argmin_{(z, z) ∈ C} M_{λ,α}(z : h : z);
  C ← C ∪ {(h, h)};
end
Output: Set of initial cluster centers C;

→ Guaranteed probabilistic bound. Just need to initialize! No centroid computations

© 2014 Frank Nielsen 7.Mixed divergences 58/75

Page 59: Fundamentals cig 4thdec

Learning MMs: A geometric hard clustering viewpoint

Learn the parameters of a mixture m(x) = Σ_{i=1}^k w_i p(x|θ_i)

Maximize the complete data likelihood = clustering objective function:

max_{W,Λ} l_c(W, Λ) = Σ_{i=1}^n Σ_{j=1}^k z_{i,j} log( w_j p(x_i|θ_j) )
= max_Λ Σ_{i=1}^n max_{j=1..k} log( w_j p(x_i|θ_j) )
≡ min_{W,Λ} Σ_{i=1}^n min_{j=1..k} D_j(x_i),

where c_j = (w_j, θ_j) (cluster prototype) and D_j(x_i) = − log p(x_i|θ_j) − log w_j are potential distance-like functions.

Further attach to each cluster a different family of probability distributions.

© 2014 Frank Nielsen 7.Mixed divergences 59/75

Page 60: Fundamentals cig 4thdec

Generalized k-MLE for learning statistical mixtures

Model-based clustering: assignment of points to clusters:

D_{w_j,θ_j,F_j}(x) = − log p_{F_j}(x; θ_j) − log w_j

k-GMLE:

1. Initialize weights W ∈ Δ_k and family types (F_1, ..., F_k) for each cluster

2. Solve min_Λ Σ_i min_j D_j(x_i) (center-based clustering for W fixed) with potential functions: D_j(x_i) = − log p_{F_j}(x_i|θ_j) − log w_j

3. Solve the family types maximizing the MLE in each cluster C_j by choosing the parametric family of distributions F_j = F(γ_j) that yields the best likelihood: min_{F_1=F(γ_1),...,F_k=F(γ_k) ∈ F(γ)} Σ_i min_j D_{w_j,θ_j,F_j}(x_i).
∀l, γ_l = max_j F*_j( η_l = (1/n_l) Σ_{x ∈ C_l} t_j(x) ) + (1/n_l) Σ_{x ∈ C_l} k(x).

4. Update the weights W as the cluster point proportions

5. Test for convergence and go to step 2) otherwise.

Drawback = biased, non-consistent estimator due to Voronoi support truncation.

© 2014 Frank Nielsen 8.k-GMLE 60/75

Page 61: Fundamentals cig 4thdec

Computing f-divergences for generic f: beyond stochastic numerical integration

© 2014 Frank Nielsen 9.Computing f-divergences 61/75

Page 62: Fundamentals cig 4thdec

f-divergences

I_f(X_1 : X_2) = ∫ x_1(x) f( x_2(x)/x_1(x) ) dν(x) ≥ 0

Name of the f-divergence, formula I_f(P : Q), and generator f(u) with f(1) = 0:

Total variation (metric): (1/2) ∫ |p(x) − q(x)| dν(x); generator f(u) = (1/2)|u − 1|
Squared Hellinger: ∫ (√p(x) − √q(x))² dν(x); generator f(u) = (√u − 1)²
Pearson χ²_P: ∫ (q(x) − p(x))²/p(x) dν(x); generator f(u) = (u − 1)²
Neyman χ²_N: ∫ (p(x) − q(x))²/q(x) dν(x); generator f(u) = (1 − u)²/u
Pearson-Vajda χ^k_P: ∫ (q(x) − λp(x))^k / p^{k−1}(x) dν(x); generator f(u) = (u − 1)^k
Pearson-Vajda |χ|^k_P: ∫ |q(x) − λp(x)|^k / p^{k−1}(x) dν(x); generator f(u) = |u − 1|^k
Kullback-Leibler: ∫ p(x) log(p(x)/q(x)) dν(x); generator f(u) = − log u
reverse Kullback-Leibler: ∫ q(x) log(q(x)/p(x)) dν(x); generator f(u) = u log u
α-divergence: 4/(1−α²) (1 − ∫ p^{(1−α)/2}(x) q^{(1+α)/2}(x) dν(x)); generator f(u) = 4/(1−α²) (1 − u^{(1+α)/2})
Jensen-Shannon: (1/2) ∫ ( p(x) log(2p(x)/(p(x)+q(x))) + q(x) log(2q(x)/(p(x)+q(x))) ) dν(x); generator f(u) = −(u + 1) log((1+u)/2) + u log u

© 2014 Frank Nielsen 9.Computing f-divergences 62/75

Page 63: Fundamentals cig 4thdec

f-divergences and higher-order Vajda χ^k divergences

I_f(X_1 : X_2) = Σ_{k=0}^∞ (f^{(k)}(1)/k!) χ^k_P(X_1 : X_2)

χ^k_P(X_1 : X_2) = ∫ (x_2(x) − x_1(x))^k / x_1(x)^{k−1} dν(x),
|χ|^k_P(X_1 : X_2) = ∫ |x_2(x) − x_1(x)|^k / x_1(x)^{k−1} dν(x),

are f-divergences for the generators (u − 1)^k and |u − 1|^k.

When k = 1, χ^1_P(X_1 : X_2) = ∫ (x_2(x) − x_1(x)) dν(x) = 0 (never discriminative), and |χ|^1_P(X_1, X_2) is twice the total variation distance.

χ^k_P is a signed distance

© 2014 Frank Nielsen 9.Computing f-divergences 63/75

Page 64: Fundamentals cig 4thdec

Affine exponential families

Canonical decomposition of the probability measure:

p_θ(x) = exp( ⟨t(x), θ⟩ − F(θ) + k(x) ),

consider the natural parameter space Θ affine (like multinomials).

Poi(λ): p(x|λ) = λ^x e^{−λ} / x!,  λ > 0, x ∈ {0, 1, ...}
Nor_I(μ): p(x|μ) = (2π)^{−d/2} e^{−(1/2)(x−μ)^⊤(x−μ)},  μ ∈ R^d, x ∈ R^d

Family | θ | Θ | F(θ) | k(x) | t(x) | ν
Poisson | log λ | R | e^θ | − log x! | x | ν_c
Iso. Gaussian | μ | R^d | (1/2) θ^⊤θ | −(d/2) log 2π − (1/2) x^⊤x | x | ν_L

© 2014 Frank Nielsen 9.Computing f-divergences 64/75

Page 65: Fundamentals cig 4thdec

Higher-order Vajda χ^k divergences

The (signed) χ^k_P distance between members X_1 ∼ EF(θ_1) and X_2 ∼ EF(θ_2) of the same affine exponential family is (for k ∈ N) always bounded and equal to:

χ^k_P(X_1 : X_2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{F((1−j)θ_1 + jθ_2)} / e^{(1−j)F(θ_1) + jF(θ_2)}

For Poisson/Normal distributions, we get closed-form formulas:

χ^k_P(λ_1 : λ_2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{ λ_1^{1−j} λ_2^j − ((1−j)λ_1 + jλ_2) },

χ^k_P(μ_1 : μ_2) = Σ_{j=0}^k (−1)^{k−j} (k choose j) e^{ (1/2) j(j−1) (μ_1 − μ_2)^⊤(μ_1 − μ_2) }.

© 2014 Frank Nielsen 9.Computing f-divergences 65/75
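A minimal sketch (NumPy/SciPy assumed) of the closed-form Poisson formula above, cross-checked against a direct summation over the probability mass functions:

```python
import numpy as np
from scipy.special import comb
from scipy.stats import poisson

def chi_k_poisson(lam1, lam2, k):
    # sum_{j=0}^k (-1)^{k-j} C(k,j) exp(lam1^(1-j) lam2^j - ((1-j) lam1 + j lam2))
    return sum((-1) ** (k - j) * comb(k, j, exact=True)
               * np.exp(lam1 ** (1 - j) * lam2 ** j - ((1 - j) * lam1 + j * lam2))
               for j in range(k + 1))

def chi_k_numeric(lam1, lam2, k, support=50):
    x = np.arange(support)
    p1, p2 = poisson.pmf(x, lam1), poisson.pmf(x, lam2)
    return np.sum((p2 - p1) ** k / p1 ** (k - 1))       # truncated sum over the support

lam1, lam2, k = 2.0, 3.5, 3
print(chi_k_poisson(lam1, lam2, k), chi_k_numeric(lam1, lam2, k))   # should agree closely
```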

Page 66: Fundamentals cig 4thdec

f-divergences: Analytic formula [14]

For λ = 1 ∈ int(dom(f^{(i)})), f-divergence (Theorem 1 of [3]):

| I_f(X_1 : X_2) − Σ_{k=0}^s (f^{(k)}(1)/k!) χ^k_P(X_1 : X_2) | ≤ (1/(s+1)!) ‖f^{(s+1)}‖_∞ (M − m)^s,

where ‖f^{(s+1)}‖_∞ = sup_{t ∈ [m,M]} |f^{(s+1)}(t)| and m ≤ p/q ≤ M.

For λ = 0 (whenever 0 ∈ int(dom(f^{(i)}))) and affine exponential families, a simpler expression:

I_f(X_1 : X_2) = Σ_{i=0}^∞ (f^{(i)}(0)/i!) I_{1−i,i}(θ_1 : θ_2),

I_{1−i,i}(θ_1 : θ_2) = e^{F(iθ_2 + (1−i)θ_1)} / e^{iF(θ_2) + (1−i)F(θ_1)}.

© 2014 Frank Nielsen 9.Computing f-divergences 66/75

Page 67: Fundamentals cig 4thdec

Designing conformal divergences: Finding graphical gaps!

© 2014 Frank Nielsen 10.Conformal divergences 67/75

Page 68: Fundamentals cig 4thdec

Geometrically designed divergences

Plot of the convex generator F.

[Figure: graph F: (x, F(x)) with the points (p, F(p)), (q, F(q)) and the midpoint (p+q)/2, illustrating the gaps that realize B(p : q), J(p, q) and tB(p : q)]

© 2014 Frank Nielsen 10.Conformal divergences 68/75

Page 69: Fundamentals cig 4thdec

Divergences: skew Jensen & Bregman divergences

F: a smooth convex function, the generator.

Skew Jensen divergences:

J′_α(p : q) = αF(p) + (1 − α)F(q) − F(αp + (1 − α)q)
= (F(p)F(q))_α − F((pq)_α),

where (pq)_γ = γp + (1 − γ)q = q + γ(p − q) and (F(p)F(q))_γ = γF(p) + (1 − γ)F(q) = F(q) + γ(F(p) − F(q)).

Bregman divergences:

B(p : q) = F(p) − F(q) − ⟨p − q, ∇F(q)⟩,
lim_{α→0} J_α(p : q) = B(p : q),
lim_{α→1} J_α(p : q) = B(q : p).

Statistical skewed Bhattacharyya divergence:

Bhat(p_1 : p_2) = − log ∫ p_1(x)^α p_2(x)^{1−α} dν(x) = J′_α(θ_1 : θ_2)

for exponential families [21].

© 2014 Frank Nielsen 10.Conformal divergences 69/75

Page 70: Fundamentals cig 4thdec

Total Bregman divergences [13]

Conformal divergence, conformal factor ρ:

D′(p : q) = ρ(p, q) D(p : q)

plays the role of a regularizer [35]

Invariance by rotation of the axes of the design space:

tB(p : q) = B(p : q) / sqrt( 1 + ⟨∇F(q), ∇F(q)⟩ ) = ρ_B(q) B(p : q),
ρ_B(q) = 1 / sqrt( 1 + ⟨∇F(q), ∇F(q)⟩ ).

For example, the total squared Euclidean divergence:

tE(p, q) = (1/2) ⟨p − q, p − q⟩ / sqrt( 1 + ⟨q, q⟩ ).

© 2014 Frank Nielsen 10.Conformal divergences 70/75

Page 71: Fundamentals cig 4thdec

Total skew Jensen divergences [27]

tB(p : q) = ρ_B(q) B(p : q),   ρ_B(q) = 1 / sqrt( 1 + ⟨∇F(q), ∇F(q)⟩ )

tJ_α(p : q) = ρ_J(p, q) J_α(p : q),   ρ_J(p, q) = 1 / sqrt( 1 + (F(p) − F(q))² / ⟨p − q, p − q⟩ )

Jensen-Shannon divergence, whose square root is a metric:

JS(p, q) = (1/2) Σ_{i=1}^d p_i log( 2p_i/(p_i + q_i) ) + (1/2) Σ_{i=1}^d q_i log( 2q_i/(p_i + q_i) )

But the square root of the total Jensen-Shannon divergence is not a metric.

© 2014 Frank Nielsen 10.Conformal divergences 71/75

Page 72: Fundamentals cig 4thdec

Summary: Geometric Computing in Information Spaces

Location-scale families, spherical normals, symmetric positive definite matrices → hyperbolic geometry.

Hyperbolic geometry: CG affine constructions in the Klein disk

Space of spheres in dually affine connection geometry

Synthetic geometry for characterizing the best error exponent in Bayes error

Conformal divergences: total Bregman/total Jensen divergences

Clustering using a pair of centroids per cluster, using mixed divergences for symmetrized α-divergences

Learning statistical mixtures maximizing the complete likelihood as a sequence of geometric clustering problems: k-GMLE

In search of closed-form solutions: Jeffreys centroid using the Lambert W function, f-divergence approximation for affine exponential families.

© 2014 Frank Nielsen 10.Conformal divergences 72/75

Page 73: Fundamentals cig 4thdec

Computational Information Geometry (Edited books)

[19] [18]

http://www.springer.com/engineering/signals/book/978-3-642-30231-2
http://www.sonycsl.co.jp/person/nielsen/infogeo/MIG/MIGBOOKWEB/
http://www.springer.com/engineering/signals/book/978-3-319-05316-5
http://www.sonycsl.co.jp/person/nielsen/infogeo/GTI/GeometricTheoryOfInformation.html

© 2014 Frank Nielsen 11.References 73/75

Page 74: Fundamentals cig 4thdec

Geometric Sciences of Information (GSI) 2015

October 28-30th 2015. Deadline 1st March 2015

http://www.gsi2015.org/

© 2014 Frank Nielsen 11.References 74/75

Page 75: Fundamentals cig 4thdec

Thank you!

© 2014 Frank Nielsen 11.References 75/75

Page 76: Fundamentals cig 4thdec

Marc Arnaudon and Frank Nielsen.
On approximating the Riemannian 1-center.
Comput. Geom. Theory Appl., 46(1):93-104, January 2013.

Marc Arnaudon and Frank Nielsen.
On approximating the Riemannian 1-center.
Computational Geometry, 46(1):93-104, 2013.

N.S. Barnett, P. Cerone, S.S. Dragomir, and A. Sofo.
Approximating Csiszár f-divergence by the use of Taylor's formula with integral remainder.
Mathematical Inequalities & Applications, 5(3):417-434, 2002.

D. A. Barry, P. J. Culligan-Hensley, and S. J. Barry.
Real values of the W-function.
ACM Trans. Math. Softw., 21(2):161-171, June 1995.

Jean-Daniel Boissonnat and Christophe Delage.
Convex hull and Voronoi diagram of additively weighted points.
In Gerth Stølting Brodal and Stefano Leonardi, editors, ESA, volume 3669 of Lecture Notes in Computer Science, pages 367-378. Springer, 2005.

Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock.
Bregman Voronoi diagrams.
Discrete and Computational Geometry, 44(2):281-307, April 2010.

Herman Chernoff.
A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations.
Annals of Mathematical Statistics, 23:493-507, 1952.

Pascal Chossat and Olivier P. Faugeras.
Hyperbolic planforms in relation to visual edges and textures perception.
PLoS Computational Biology, 5(12), 2009.

Andrzej Cichocki, Sergio Cruces, and Shun-ichi Amari.
Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization.

© 2014 Frank Nielsen 11.References 75/75

Page 77: Fundamentals cig 4thdec

Entropy, 13(1):134-170, 2011.

P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang C. Joshi.
Principal geodesic analysis for the study of nonlinear statistics of shape.
IEEE Trans. Med. Imaging, 23(8):995-1005, 2004.

Bernd Gärtner and Sven Schönherr.
An efficient, exact, and generic quadratic programming solver for geometric optimization.
In Proceedings of the sixteenth annual symposium on Computational geometry, pages 110-118. ACM, 2000.

Harold Hotelling.

Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.
Shape retrieval using hierarchical total Bregman soft clustering.
Transactions on Pattern Analysis and Machine Intelligence, 34(12):2407-2419, 2012.

F. Nielsen and R. Nock.
On the chi square and higher-order chi distances for approximating f-divergences.
Signal Processing Letters, IEEE, 21(1):10-13, 2014.

Frank Nielsen.
Legendre transformation and information geometry.
Technical Report CIG-MEMO2, September 2010.

Frank Nielsen.
Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms.
Signal Processing Letters, IEEE, PP(99):1-1, 2013.

Frank Nielsen.
Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means.
Pattern Recognition Letters, 42:25-34, 2014.

Frank Nielsen.
Geometric Theory of Information.

© 2014 Frank Nielsen 11.References 75/75

Page 78: Fundamentals cig 4thdec

Springer, 2014.

Frank Nielsen and Rajendra Bhatia, editors.
Matrix Information Geometry (Revised Invited Papers). Springer, 2012.

Frank Nielsen and Sylvain Boltz.
The Burbea-Rao and Bhattacharyya centroids.
IEEE Transactions on Information Theory, 57(8):5455-5466, 2011.

Frank Nielsen and Sylvain Boltz.
The Burbea-Rao and Bhattacharyya centroids.
IEEE Transactions on Information Theory, 57(8):5455-5466, August 2011.

Frank Nielsen and Richard Nock.
On approximating the smallest enclosing Bregman balls.
In Proceedings of the Twenty-second Annual Symposium on Computational Geometry, SCG '06, pages 485-486, New York, NY, USA, 2006. ACM.

Frank Nielsen and Richard Nock.
On the smallest enclosing information disk.
Information Processing Letters (IPL), 105(3):93-97, 2008.

Frank Nielsen and Richard Nock.
The dual Voronoi diagrams with respect to representational Bregman divergences.
In International Symposium on Voronoi Diagrams (ISVD), pages 71-78, 2009.

Frank Nielsen and Richard Nock.
Hyperbolic Voronoi diagrams made easy.
In 13th International Conference on Computational Science and Its Applications, pages 74-80. IEEE, 2010.

Frank Nielsen and Richard Nock.
Hyperbolic Voronoi diagrams made easy.
In International Conference on Computational Science and its Applications (ICCSA), volume 1, pages 74-80, Los Alamitos, CA, USA, March 2010. IEEE Computer Society.

Frank Nielsen and Richard Nock.

© 2014 Frank Nielsen 11.References 75/75

Page 79: Fundamentals cig 4thdec

Total Jensen divergences: Definition, properties and k-means++ clustering.
CoRR, abs/1309.7109, 2013.

Frank Nielsen and Richard Nock.
Visualizing hyperbolic Voronoi diagrams.
In Proceedings of the Thirtieth Annual Symposium on Computational Geometry, SOCG'14, pages 90:90-90:91, New York, NY, USA, 2014. ACM.

Frank Nielsen and Richard Nock.
Visualizing hyperbolic Voronoi diagrams.
In Symposium on Computational Geometry, page 90, 2014.

Frank Nielsen, Richard Nock, and Shun-ichi Amari.
On clustering histograms with k-means by using mixed α-divergences.
Entropy, 16(6):3273-3301, 2014.

Frank Nielsen, Paolo Piro, and Michel Barlaud.
Bregman vantage point trees for efficient nearest neighbor queries.
In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo (ICME), pages 878-881, 2009.

Richard Nock and Frank Nielsen.
Fitting the smallest enclosing Bregman ball.
In Machine Learning, volume 3720 of Lecture Notes in Computer Science, pages 649-656. Springer Berlin Heidelberg, 2005.

Calyampudi Radhakrishna Rao.
Information and the accuracy attainable in the estimation of statistical parameters.
Bulletin of the Calcutta Mathematical Society, 37:81-89, 1945.

Ivor W. Tsang, Andras Kocsor, and James T. Kwok.
Simpler core vector machines with enclosing balls.
In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 911-918, New York, NY, USA, 2007. ACM.

Baba Vemuri, Meizhu Liu, Shun-ichi Amari, and Frank Nielsen.
Total Bregman divergence and its applications to DTI analysis.

© 2014 Frank Nielsen 11.References 75/75

Page 80: Fundamentals cig 4thdec

IEEE Transactions on Medical Imaging, pages 475-483, 2011.

Huaiyu Zhu and Richard Rohwer.
Measurements of generalisation based on information geometry.
In Stephen W. Ellacott, John C. Mason, and Iain J. Anderson, editors, Mathematics of Neural Networks, volume 8 of Operations Research/Computer Science Interfaces Series, pages 394-398. Springer US, 1997.

© 2014 Frank Nielsen 11.References 75/75