
Continuous Hierarchical Representations with Poincaré Variational Auto-Encoders

Emile Mathieu†, Charline Le Lan†, Chris J. Maddison†∗, Ryota Tomioka‡ and Yee Whye Teh†∗. †Department of Statistics, University of Oxford, United Kingdom; ∗DeepMind, London, United Kingdom; ‡Microsoft Research, Cambridge, United Kingdom.


Overview & Contributions

Many real datasets are hierarchically structured. However, traditional variational auto-encoders (VAEs) [1, 2] map data into a Euclidean latent space, which cannot efficiently embed tree-like structures. Hyperbolic spaces, which have negative curvature, can [3], and their smoothness is well-suited for gradient-based approaches [4].

1. We empirically demonstrate that endowing a VAE with a Poincaré ball latent space can be beneficial in terms of model generalisation and can yield more interpretable representations.

2. We propose efficient and reparametrisable sampling schemes, and calculate the probability density functions, for two canonical Gaussian generalisations defined on the Poincaré ball, namely the maximum-entropy and wrapped normal distributions.

3. We introduce a decoder architecture taking into account the hyperbolic geometry, which we empirically show to be crucial.

The Poincaré ball model of hyperbolic geometry

The $d$-dimensional Poincaré ball with curvature $-c$ is the Riemannian manifold $(\mathbb{B}^d_c, g^c_p)$ [5], where $\mathbb{B}^d_c = \{z \in \mathbb{R}^d \mid \|z\| < 1/\sqrt{c}\}$ and $g^c_p$ is its metric tensor,

$$g^c_p(z) = (\lambda^c_z)^2\, g^e(z) = \left(\frac{2}{1 - c\|z\|^2}\right)^2 g^e(z), \qquad (1)$$

with $g^e$ the Euclidean metric and $\lambda^c_z = 2/(1 - c\|z\|^2)$ the conformal factor.

Figure 1: (a) Isometric embedding of a tree. (b) Geodesics and exponential maps ($\exp_0(v_1)$, $\exp_{z_2}(v_2)$).
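To make these definitions concrete, here is a minimal NumPy sketch of the quantities used throughout the poster: the conformal factor from Eq. (1) and the exponential map pictured in Figure 1(b), implemented via the closed-form Möbius addition of [8]. The helper names are our own illustrative choices, not the authors' code.

```python
import numpy as np

def lambda_c(z, c):
    """Conformal factor lambda_z^c = 2 / (1 - c * ||z||^2), cf. Eq. (1)."""
    return 2.0 / (1.0 - c * np.dot(z, z))

def mobius_add(x, y, c):
    """Moebius addition on the Poincare ball (closed form from [8])."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1.0 + 2.0 * c * xy + c * y2) * x + (1.0 - c * x2) * y
    den = 1.0 + 2.0 * c * xy + c ** 2 * x2 * y2
    return num / den

def exp_map(mu, v, c):
    """Exponential map exp_mu^c(v): sends a tangent vector v at mu into the ball."""
    n = np.linalg.norm(v)
    if n < 1e-15:                       # exp_mu(0) = mu
        return mu.copy()
    t = np.tanh(np.sqrt(c) * lambda_c(mu, c) * n / 2.0)
    return mobius_add(mu, t * v / (np.sqrt(c) * n), c)

# Example mirroring Figure 1(b): tangent vectors mapped at the origin and at z2.
c = 1.0
z2 = np.array([0.3, 0.1])
v1, v2 = np.array([0.4, -0.2]), np.array([-0.1, 0.3])
print(exp_map(np.zeros(2), v1, c), exp_map(z2, v2, c))
```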

We are extremely grateful to Adam Foster, Phillipe Gagnon and Emmanuel Chevallier for their help. EM and YWT's research leading to these results received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071. They acknowledge Microsoft Research and EPSRC for funding EM's studentship, and EPSRC grant agreement no. EP/N509711/1 for funding CL's studentship.

[1] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.

[2] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

[3] Rik Sarkar. Low distortion Delaunay embedding of trees in hyperbolic plane. In Marc van Kreveld and Bettina Speckmann, editors, Graph Drawing, pages 355–366. Springer Berlin Heidelberg, 2012.

[4] Maximilian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6341–6350, 2017.

[5] E. Beltrami. Teoria fondamentale degli spazii di curvatura costante: memoria. F. Zanetti, 1868.

[6] Xavier Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127, Jul 2006.

[7] Salem Said, Lionel Bombrun, and Yannick Berthoumieu. New Riemannian priors on the univariate normal model. Entropy, 16(7):4015–4031, 2014.

[8] Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. In International Conference on Neural Information Processing Systems, pages 5350–5360, 2018.

[9] Michael Figurnov, Shakir Mohamed, and Andriy Mnih. Implicit reparameterization gradients. In International Conference on Neural Information Processing Systems, pages 439–450, 2018.

[10] Thomas N. Kipf and Max Welling. Variational graph auto-encoders. Workshop on Bayesian Deep Learning, NIPS, 2016.

Hyperbolic normal distributions reparametrisation

Invariant measure: $d\mathcal{M}(z) = \sqrt{|G(z)|}\, dz = (\lambda^c_z)^d\, dz$, with $dz$ the Lebesgue measure.

Riemannian normal (maximum entropy [6, 7]): $\mathcal{N}^{R}_{\mathbb{B}^d_c}(z \mid \mu, \sigma^2) \propto \exp\!\big(-d^c_p(\mu, z)^2 / 2\sigma^2\big)$.

Wrapped normal (push-forward): $z = \exp^c_\mu\!\big(v / \lambda^c_\mu\big)$ with $v \sim \mathcal{N}(\cdot \mid 0, \Sigma)$, with density

$$\mathcal{N}^{W}_{\mathbb{B}^d_c}(z \mid \mu, \Sigma) = \mathcal{N}\!\big(\lambda^c_\mu \log^c_\mu(z) \mid 0, \Sigma\big) \left(\frac{\sqrt{c}\, d^c_p(\mu, z)}{\sinh\!\big(\sqrt{c}\, d^c_p(\mu, z)\big)}\right)^{d-1}.$$

Reparametrisation through the exponential map: for $z \sim \mathcal{N}_{\mathbb{B}^d_c}(z \mid \mu, \sigma^2)\, d\mathcal{M}(z)$,

$$z = \exp^c_\mu\!\big(G(\mu)^{-\frac{1}{2}} v\big) = \exp^c_\mu\!\big(v / \lambda^c_\mu\big). \qquad (2)$$

Isotropic case: $v = r\,\alpha$, with direction $\alpha \sim \mathcal{U}(\mathbb{S}^{d-1})$ and hyperbolic radius $r = d^c_p(\mu, z)$, whose density is

$$\rho^{R}(r) \propto \mathbb{1}_{\mathbb{R}^+}(r)\, e^{-\frac{r^2}{2\sigma^2}} \left(\frac{\sinh(\sqrt{c}\, r)}{\sqrt{c}}\right)^{d-1} \;\xrightarrow[c \to 0]{}\; \rho^{W}(r) \propto \mathbb{1}_{\mathbb{R}^+}(r)\, e^{-\frac{r^2}{2\sigma^2}}\, r^{d-1}. \qquad (3)$$

Figure 2: Wrapped normal probability measures for Fréchet means $\mu$, concentrations $\Sigma = \operatorname{diag}(\sigma)$ with $\sigma = (1.0, 1.0)$, $(2.0, 0.5)$ and $(0.5, 2.0)$, and $c = 1$.
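The reparametrised sampler of Eq. (2) takes only a few lines once the exponential map is in place. The self-contained NumPy sketch below (function names are our own, not the authors' code) draws one wrapped normal sample and checks the fact behind the right-hand side of Eq. (3): the hyperbolic radius $d^c_p(\mu, z)$ of a wrapped normal sample equals the Euclidean norm of the tangent sample $v$.

```python
import numpy as np

def lambda_c(z, c):
    return 2.0 / (1.0 - c * np.dot(z, z))

def mobius_add(x, y, c):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1.0 + 2.0 * c * xy + c * y2) * x + (1.0 - c * x2) * y
    return num / (1.0 + 2.0 * c * xy + c ** 2 * x2 * y2)

def exp_map(mu, v, c):
    n = np.linalg.norm(v)
    if n < 1e-15:
        return mu.copy()
    t = np.tanh(np.sqrt(c) * lambda_c(mu, c) * n / 2.0)
    return mobius_add(mu, t * v / (np.sqrt(c) * n), c)

def poincare_dist(x, y, c):
    """Geodesic distance d_p^c(x, y) = (2/sqrt(c)) artanh(sqrt(c) ||(-x) (+) y||)."""
    return 2.0 / np.sqrt(c) * np.arctanh(np.sqrt(c) * np.linalg.norm(mobius_add(-x, y, c)))

def sample_wrapped_normal(mu, sigma, c, rng):
    """Reparametrised sample z = exp_mu^c(v / lambda_mu^c), v ~ N(0, diag(sigma^2)), Eq. (2)."""
    v = rng.normal(0.0, sigma)              # Euclidean sample in the tangent space
    return exp_map(mu, v / lambda_c(mu, c), c), v

rng = np.random.default_rng(0)
c, mu, sigma = 1.0, np.array([0.2, -0.4]), np.array([0.5, 0.5])
z, v = sample_wrapped_normal(mu, sigma, c, rng)
# The hyperbolic radius equals ||v||, which is why the wrapped-normal radius
# density rho^W in Eq. (3) is the familiar Euclidean chi-like density.
assert np.isclose(poincare_dist(mu, z, c), np.linalg.norm(v))
```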

Decoder & encoder architectures

Decoder: compute the geodesic distance to decision hyperplanes (i.e. gyroplanes).

In Euclidean space: $f_{a,p}(z) = \langle a, z - p\rangle = \operatorname{sign}(\langle a, z - p\rangle)\,\|a\|\, d_E(z, H_{a,p})$, with $H_{a,p} = p + \{a\}^{\perp}$.

On the Poincaré ball: $f^c_{a,p}(z) = \operatorname{sign}\big(\langle a, \log^c_p(z)\rangle_p\big)\,\|a\|_p\, d^c_p(z, H^c_{a,p})$, with $H^c_{a,p} = \exp^c_p(\{a\}^{\perp})$ [8].

Figure 3: Orthogonal projection of $z$ onto a hyperplane $H_{a,p}$ in $\mathbb{B}^2_c$ (a) and $\mathbb{R}^2$ (b).
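The gyroplane feature above can be evaluated in closed form using the hyperbolic hyperplane distance of [8]. The sketch below (gyroplane_feature and the helpers are our own illustrative names, not the authors' implementation) computes the signed feature $f^c_{a,p}(z)$ for a single output unit; a full gyroplane layer stacks one such feature per unit before a nonlinearity.

```python
import numpy as np

def lambda_c(z, c):
    return 2.0 / (1.0 - c * np.dot(z, z))

def mobius_add(x, y, c):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1.0 + 2.0 * c * xy + c * y2) * x + (1.0 - c * x2) * y
    return num / (1.0 + 2.0 * c * xy + c ** 2 * x2 * y2)

def gyroplane_feature(z, a, p, c):
    """Signed, scaled geodesic distance f_{a,p}^c(z) to the gyroplane H_{a,p}^c.
    Uses the closed form of [8]:
        d_p^c(z, H) = (1/sqrt(c)) * arcsinh( 2 sqrt(c) <(-p) (+) z, a>
                                             / ((1 - c ||(-p) (+) z||^2) ||a||) ),
    scaled by ||a||_p = lambda_p^c * ||a|| (the sign is carried by arcsinh)."""
    w = mobius_add(-p, z, c)
    num = 2.0 * np.sqrt(c) * np.dot(w, a)
    den = (1.0 - c * np.dot(w, w)) * np.linalg.norm(a)
    signed_dist = np.arcsinh(num / den) / np.sqrt(c)
    return lambda_c(p, c) * np.linalg.norm(a) * signed_dist

rng = np.random.default_rng(0)
c = 1.0
z = np.array([0.3, -0.2])       # latent code on the ball
a = rng.normal(size=2)          # orientation (tangent vector at p)
p = np.array([0.1, 0.4])        # offset point on the ball
print(gyroplane_feature(z, a, p, c))
```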

Encoder: $\mu = \exp^c_{0}\big(\mathrm{enc}^{\mu}_{\phi}(x)\big) \in \mathbb{B}^d_c$ and $\sigma = \operatorname{softplus}\big(\mathrm{enc}^{\sigma}_{\phi}(x)\big) \in \mathbb{R}^{+}_{*}$.
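A minimal sketch of the two encoder heads, assuming linear heads W_mu and W_sigma on top of a feature vector h produced by any backbone enc_phi (the linear heads and all names here are our illustrative assumptions, not the authors' architecture):

```python
import numpy as np

def softplus(t):
    return np.log1p(np.exp(t))

def exp_map_at_origin(v, c):
    """exp_0^c(v) = tanh(sqrt(c) ||v||) * v / (sqrt(c) ||v||)."""
    n = np.linalg.norm(v)
    if n < 1e-15:
        return np.zeros_like(v)
    return np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def encoder_heads(h, W_mu, W_sigma, c):
    """Turn backbone features h into a Frechet mean on the ball and a positive scale."""
    mu = exp_map_at_origin(W_mu @ h, c)
    sigma = softplus(W_sigma @ h)
    return mu, sigma

rng = np.random.default_rng(0)
h = rng.normal(size=8)                              # hypothetical backbone output
mu, sigma = encoder_heads(h, 0.1 * rng.normal(size=(2, 8)),
                          0.1 * rng.normal(size=(1, 8)), c=1.0)
```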

Training

Model: $p_\theta(x \mid z) = p(x \mid \mathrm{dec}_\theta(z))$, $p(z) = \mathcal{N}_{\mathbb{B}^d_c}(z \mid 0, \sigma_0^2)$ and $q_\phi(z \mid x) = \mathcal{N}_{\mathbb{B}^d_c}(z \mid \mu, \sigma^2)$, with $\mu, \sigma$ given by the encoder above.

ELBO:

$$\log p(x) \;\geq\; \mathcal{L}_{\mathcal{M}}(x; \theta, \phi) \;\triangleq\; \int_{\mathcal{M}} \ln\!\left(\frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}\right) q_\phi(z \mid x)\, d\mathcal{M}(z), \quad \text{estimated via Monte Carlo.} \qquad (4)$$

Gradients: $\nabla_\mu z$ via reparametrisation; $\nabla_\sigma z$ via reparametrisation for the wrapped normal, and via implicit reparametrisation [9] of $\rho^{R}$ through its cdf $F^{R}(r; \sigma)$ for the Riemannian normal.
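A hedged sketch of the Monte Carlo estimator of Eq. (4); the callables are placeholders standing in for components defined elsewhere on the poster (encoder, reparametrised sampler, hyperbolic log-densities, decoder likelihood), not a fixed API:

```python
import numpy as np

def elbo_estimate(x, encode, sample_q, log_q, log_prior, log_lik, K=1):
    """K-sample Monte Carlo estimate of the ELBO in Eq. (4).

    Assumed callables:
      encode(x)            -> (mu, sigma), with mu on the Poincare ball
      sample_q(mu, sigma)  -> one reparametrised sample z ~ q_phi(z|x), Eq. (2)
      log_q(z, mu, sigma)  -> log q_phi(z|x) w.r.t. the invariant measure dM(z)
      log_prior(z)         -> log p(z)       w.r.t. dM(z)
      log_lik(x, z)        -> log p_theta(x|z)
    Since log_q and log_prior are both densities w.r.t. dM(z), their difference
    requires no additional volume correction.
    """
    mu, sigma = encode(x)
    terms = []
    for _ in range(K):
        z = sample_q(mu, sigma)
        terms.append(log_lik(x, z) + log_prior(z) - log_q(z, mu, sigma))
    return float(np.mean(terms))
```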

Branching diffusion process

Nodes $(x_1, \dots, x_N) \in \mathbb{R}^n$ are hierarchically sampled as follows:

$$x_i \sim \mathcal{N}\big(\cdot \mid x_{\pi(i)}, \gamma^2\big) \quad \forall i \in \{1, \dots, N\},$$

where $\pi(i)$ denotes the parent of the $i$-th node.
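A small sketch of this sampling procedure; the branching factor, depth, dimension and the root placed at the origin are our own illustrative choices, since the poster does not list them here.

```python
import numpy as np

def branching_diffusion(depth, branching, dim, gamma, rng):
    """Sample nodes level by level: each child is drawn from N(. | x_parent, gamma^2 I)."""
    nodes, parents = [np.zeros(dim)], [-1]      # root at the origin (illustrative choice)
    frontier = [0]
    for _ in range(depth):
        new_frontier = []
        for parent in frontier:
            for _ in range(branching):
                nodes.append(rng.normal(nodes[parent], gamma))
                parents.append(parent)
                new_frontier.append(len(nodes) - 1)
        frontier = new_frontier
    return np.stack(nodes), np.array(parents)

rng = np.random.default_rng(0)
X, pi = branching_diffusion(depth=3, branching=2, dim=50, gamma=1.0, rng=rng)
print(X.shape)   # (15, 50): 1 + 2 + 4 + 8 nodes in R^50
```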

Table 1: Negative test marginal likelihood estimates $\mathcal{L}_{\mathrm{IWAE}}$ (with 5000 samples).

σ0      N-VAE       P0.1-VAE    P0.3-VAE    P0.8-VAE    P1.0-VAE    P1.2-VAE
1       57.1±0.2    57.1±0.2    57.2±0.2    56.9±0.2    56.7±0.2    56.6±0.2
1.7     57.0±0.2    56.8±0.2    56.6±0.2    55.9±0.2    55.7±0.2    55.6±0.2

Figure 4: Latent representations learned by P1-VAE (a) and N-VAE (b). Embeddings are represented by black crosses, and coloured dots are posterior samples. Blue lines represent the true hierarchy.

MNIST digits

One can view the natural clustering of MNIST images as a hierarchy, with each of the 10 classes forming an internal node of the hierarchy.

Figure 5: Decoder ablation study with the wrapped normal P1-VAE. The baseline decoder is an MLP; we additionally compare the gyroplane layer introduced in the Decoder section to an MLP pre-composed with $\log^c_0$. (Plot: Δ marginal log-likelihood (%) against latent space dimension ∈ {2, 5, 10, 20}, for the $\log_0 \circ$ MLP and gyroplane decoders.)

Table 2: Per-digit accuracy of a classifier trained on the 2-d latent embeddings. Results are averaged over 10 sets of embeddings and 5 classifier trainings.

Digits      0    1    2    3    4    5    6    7    8    9    Avg
N-VAE      89   97   81   75   59   43   89   78   68   57   73.6
P1.4-VAE   94   97   82   79   69   47   90   77   68   53   75.6

Graph embeddings

We evaluate the performance of a variational graph auto-encoder (VGAE) [10] for link prediction in networks.

Table 3: Results on network link prediction. 95% confidence intervals are computed over 40 runs.

            Phylogenetic            CS PhDs                 Diseases
            AUC        AP           AUC        AP           AUC        AP
N-VAE       54.2±2.2   54.0±2.1     56.5±1.1   56.4±1.1     89.8±0.7   91.8±0.7
P-VAE       59.0±1.9   55.5±1.6     59.8±1.2   56.7±1.2     92.3±0.7   93.6±0.5