
Integral Geometry, Hamiltonian Dynamics, and

Markov Chain Monte Carlo

by

Oren Mangoubi

B.S., Yale University (2011)

Submitted to the Department of Mathematics in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY


June 2016

© Oren Mangoubi, MMXVI. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document

in whole or in part in any medium now known or hereafter created.

Author: Signature redacted
Department of Mathematics
April 28, 2016

Certified by: Signature redacted
Alan Edelman
Professor
Thesis Supervisor

Accepted by: Signature redacted
Jonathan Kelner
Chairman, Applied Mathematics Committee


Integral Geometry, Hamiltonian Dynamics, and Markov

Chain Monte Carlo

by

Oren Mangoubi

Submitted to the Department of Mathematics on April 28, 2016, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

This thesis presents applications of differential geometry and graph theory to the design and analysis of Markov chain Monte Carlo (MCMC) algorithms. MCMC algorithms are used to generate samples from an arbitrary probability density π in computationally demanding situations, since their mixing times need not grow exponentially with the dimension of π. However, if π has many modes, MCMC algorithms may still have very long mixing times. It is therefore crucial to understand and reduce MCMC mixing times, and there is currently a need for global mixing time bounds as well as algorithms that mix quickly for multi-modal densities.

In the Gibbs sampling MCMC algorithm, the variance in the size of modes intersected by the algorithm's search-subspaces can grow exponentially in the dimension, greatly increasing the mixing time. We use integral geometry, together with the Hessian of π and the Chern-Gauss-Bonnet theorem, to correct these distortions and avoid this exponential increase in the mixing time. Towards this end, we prove a generalization of the classical Crofton's formula in integral geometry that can allow one to greatly reduce the variance of Crofton's formula without introducing a bias.

Hamiltonian Monte Carlo (HMC) algorithms are some of the most widely-used MCMC algorithms. We use the symplectic properties of Hamiltonians to prove global Cheeger-type lower bounds for the mixing times of HMC algorithms, including Riemannian Manifold HMC as well as No-U-Turn HMC, the workhorse of the popular Bayesian software package Stan. One consequence of our work is the impossibility of energy-conserving Hamiltonian Markov chains to search for far-apart sub-Gaussian modes in polynomial time. We then prove another generalization of Crofton's formula that applies to Hamiltonian trajectories, and use our generalized Crofton formula to improve the convergence speed of HMC-based integration on manifolds.

We also present a generalization of the Hopf fibration acting on arbitrary-β ghost-valued random variables. For β = 4, the geometry of the Hopf fibration is encoded by the quaternions; we investigate the extent to which the elegant properties of this encoding are preserved when one replaces quaternions with general β > 0 ghosts.


Thesis Supervisor: Alan Edelman
Title: Professor


Acknowledgments

I am very grateful to my advisor and coauthor Alan Edelman [1] for his guidance and collaboration on this thesis. I am also deeply grateful to my coauthor Natesh Pillai [2] for his collaboration and advice on the Hamiltonian mixing times chapter of this thesis. I could not have finished this thesis without their insights. I am deeply thankful as well for indispensable advice and insights from Aaron Smith [3], Youssef Marzouk [4], Michael Betancourt [5], Jonathan Kelner [1], Michael La Croix [1], Jiahao Chen [1], Laurent Demanet [1], Dennis Amelunxen [6], Ofer Zeitouni [7, 8], Neil Shephard [2], and Nawaf Bou-Rabee [9].

I would also like to thank my mentors and previous coauthors Stephen Morse [10], Yakar Kannai [7], Edwin Marengo [11], and Lucio Frydman [12]. I am very grateful to my other mentors and professors at MIT and Yale, especially Roger Howe [13], Gregory Margulis [13], Victor Chernozhukov [14], Kumpati Narendra [10], Andrew Barron [15], Ivan Marcus [16], Paulo Lozano [4], and Manuel Martinez-Sanchez [4]. For valuable opportunities to learn, teach and conduct research, I would like to thank the MIT Mathematics department and the Theory of Computation group at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), as well as the Yale Mathematics and Electrical Engineering departments, the Weizmann Institute Mathematics and Chemical Physics departments, and the Northeastern Electrical Engineering department.

1. MIT Mathematics Department
2. Harvard Statistics Department
3. University of Ottawa Mathematics and Statistics Department
4. MIT Department of Aeronautics and Astronautics
5. University of Warwick Statistics Department
6. City University of Hong Kong Mathematics Department
7. Weizmann Institute of Science Mathematics Department
8. Courant Institute of Mathematical Sciences at NYU
9. Rutgers Mathematical Sciences Department
10. Yale Electrical Engineering Department
11. Northeastern University Electrical Engineering Department
12. Weizmann Institute of Science Chemical Physics Department
13. Yale Mathematics Department
14. MIT Economics Department
15. Yale Statistics Department
16. Yale History Department


I am very thankful to have been blessed with a kind and loving family for whose encouragement and support I am forever grateful: my mother and father, my brothers Tomer and Daniel, and, most importantly, my grandparents Mémé, Oma and Opa, as well as Pépé (of blessed memory). I am also very thankful to my friends for their

kindness and companionship. My schoolteachers at Schechter and Gann, especially

my Mathematics teacher Mrs. Voolich and my Science teacher Mrs. Schreiber, have

been an inspiration to me as well. I also thank the MITxplore program for giving

me the opportunity to design and teach weekly Mathematics enrichment classes for

children in Cambridge and Boston public schools.

I deeply appreciate the generous support of a National Defense Science and Engi-

neering Graduate (NDSEG) Fellowship, as well as support from the National Science

Foundation (NSF DMS-1312831) and the MIT Mathematics Department.

Thesis Committee:

" Professor Alan Edelman

Thesis Committee Chairman and Thesis Advisor

Professor of Applied Mathematics, MIT

" Professor Natesh Pillai

Associate Professor of Statistics, Harvard

* Professor Youssef Marzouk

Associate Professor of Aeronautics and Astronautics, MIT

" Professor Jonathan Kelner

Associate Professor of Applied Mathematics, MIT


Contents

1 Introduction  9
1.1 Some widely-used MCMC algorithms  10
1.1.1 Random Walk Metropolis  10
1.1.2 Gibbs sampling algorithms  10
1.1.3 Hamiltonian Monte Carlo algorithms  12
1.2 Integral & differential geometry preliminaries  14
1.2.1 Kinematic measure  15
1.2.2 The Crofton formula  17
1.2.3 Concentration of measure  17
1.2.4 The Chern-Gauss-Bonnet theorem  18
1.3 Contributions of this thesis  19

2 Integral Geometry for Gibbs Samplers  21
2.1 Introduction  21
2.2 A first-order reweighting via the Crofton formula  28
2.2.1 The Crofton formula Gibbs sampler  29
2.2.2 Traditional weights vs. integral geometry weights  30
2.3 A generalized Crofton formula  31
2.3.1 The generalized Crofton formula Gibbs sampler  44
2.3.2 The ... densities  46
2.3.3 An ...  47
2.3.5 Higher-order ...  50
2.3.6 Collection-of-spheres example and ...  51
2.3.7 Variance due to ...
2.3.8 Theoretical bounds derived using ...  60
2.4 Random matrix application: Sampling ...  62
2.4.1 Approximate sampling algorithm ...  63
2.5 Conditioning on multiple eigenvalues  65
2.6 Conditioning on a single eigenvalue ...  66

3 Mixing Times of Hamiltonian Monte Carlo  71
3.1 Introduction  71
3.2 Hamiltonian ...  72
3.3 Cheeger bounds ...  78
3.4 Random walk ...  80

4 A Generalization of Crofton's Formula to Hamiltonian Trajectories, with Applications to Hamiltonian Monte Carlo  85
4.2 Crofton formulae for Hamiltonian dynamics  85
4.3 Manifold integration using HMC and the Hamiltonian Crofton formula  88

5 A Hopf Fibration for β-Ghost Gaussians  91
5.1 Introduction  91
5.2 Defining the ...  92
5.3 Hopf Fibration on ...  94


Chapter 1

Introduction

Applications of sampling on probability distributions, defined on Euclidean space

or on other manifolds, arise in many fields, such as Statistics [18, 6, 24], Machine

Learning [4], Statistical Mechanics [39], General Relativity [16], Molecular Biology

[15], Linguistics [5], and Genetics [31]. In many cases these probability distributions

are difficult to sample from with straightforward methods such as rejection sampling

because the events we are conditioning on are very rare, or the probability density

concentrates in some small regions of space. Typically, the complexity of sampling

from these distributions grows exponentially with the dimension of the space. In

such situations, we require alternative sampling methods whose complexity promises

not to grow exponentially with dimension. In Markov chain Monte Carlo (MCMC)

algorithms, one of the most commonly used such methods, we run a Markov chain

that converges to the desired probability distribution [12].

MCMC algorithms are used to generate samples from an arbitrary probability

density π in computationally demanding situations such as high-dimensional Bayesian

statistics [60], machine learning [4], and molecular biology [15], since their mixing

times need not grow exponentially with the dimension of π. However, if π has many

modes, MCMC algorithms may still have very long mixing times [35, 48, 28]. It

is therefore crucial to understand and reduce MCMC mixing times, and there is

currently a need for global mixing time bounds as well as algorithms that mix quickly

for multi-modal densities.


1.1 Some widely-used MCMC algorithms

In this section we review some widely-used MCMC algorithms.

1.1.1 Random Walk Metropolis

The Random Walk Metropolis (RWM) algorithm (Algorithm 1) is the most basic

MCMC algorithm. At each step of the Markov chain, the RWM algorithm proposes

to take the next step x_{i+1} in a random direction and distance from the current position

x_i. The step is accepted with a probability of min{π(x'_{i+1})/π(x_i), 1}. If the step is rejected,

the algorithm stays at its current position until the next time step.

Algorithm 1 Random Walk Metropolis [45]

input: ε > 0, x_0, oracle for π : R^n → [0, ∞)
output: x_1, x_2, ..., with stationary distribution π

1: for i = 1, 2, ... do
2:   Sample independent h ~ N(0, ε)^n
3:   Set x'_{i+1} = x_i + h
4:   Set x_{i+1} = x'_{i+1} with probability min{π(x'_{i+1})/π(x_i), 1}; else, set x_{i+1} = x_i
5: end for
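As a concrete illustration of Algorithm 1, the following minimal Python sketch implements the same propose/accept recipe; the function and variable names, and the Gaussian target used in the example call, are illustrative choices rather than part of the thesis.

```python
import numpy as np

def random_walk_metropolis(log_pi, x0, eps, n_steps, rng=None):
    """Minimal sketch of Algorithm 1 (Random Walk Metropolis).

    log_pi : callable returning log pi(x) up to an additive constant
    x0     : starting point (1-D numpy array)
    eps    : proposal standard deviation
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        proposal = x + eps * rng.standard_normal(x.shape)   # steps 2-3
        # step 4: accept with probability min{pi(proposal)/pi(x), 1}
        if np.log(rng.uniform()) < log_pi(proposal) - log_pi(x):
            x = proposal
        samples.append(x.copy())
    return np.array(samples)

# Illustrative usage: sample a standard Gaussian in 10 dimensions.
chain = random_walk_metropolis(lambda x: -0.5 * x @ x, np.zeros(10), eps=0.5, n_steps=5000)
```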

Although the RWM algorithm is widely-used in practice due to its simplicity [7],

its mixing time slows down quadratically with a decrease in the step size since its

associated "random walk" behavior approximates a diffusion. We will discuss this

slowdown for the RWM algorithm further in Chapter 3.

1.1.2 Gibbs sampling algorithms

Gibbs sampling MCMC algorithms [23] offer one way of taking larger steps to avoid

the quadratic slowdown associated with diffusion-like "random walk" behavior. Gibbs

sampling algorithms work by sampling the next step x_{i+1} in the Markov chain from

the probability density π conditioned on a random search-subspace S. x_{i+1} may be

sampled from S using a subroutine Markov chain or another sampling method. If a

subroutine Markov chain is used, the Gibbs sampler is oftentimes referred to as the


"Metropolis-within-Gibbs" algorithm, where the term "Metropolis" is used loosely

here to denote the subroutine Markov chain. The search subspace may be a line or

a multi-dimensional plane passing through xi. In this thesis we will consider search

subspaces with isotropic random orientation.

Algorithm 2 Gibbs Sampler (with isotropic random search subspaces) [23]

input: x_0, oracle for π : R^n → [0, ∞)
output: x_1, x_2, ..., with stationary distribution π

1: for i = 1, 2, ... do
2:   Sample a k-dimensional isotropic random search-subspace S passing through x_i
3:   Sample x_{i+1} with probability proportional to the restriction of π(x) to S (using a subroutine Markov chain or another sampling method)
4: end for
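A minimal Python sketch of Algorithm 2 with one-dimensional (k = 1) isotropic search-subspaces is given below; the grid-based draw along the line stands in for the subroutine Markov chain of step 3, and all names and defaults are illustrative assumptions rather than the thesis's implementation.

```python
import numpy as np

def gibbs_isotropic_lines(log_pi, x0, n_steps, t_grid=None, rng=None):
    """Sketch of Algorithm 2 with one-dimensional (k = 1) isotropic search-subspaces.

    The exact draw along the line (step 3) is approximated by a dense grid;
    any subroutine Markov chain or sampling method could be used instead.
    """
    rng = np.random.default_rng() if rng is None else rng
    t_grid = np.linspace(-10.0, 10.0, 2001) if t_grid is None else t_grid
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u)                         # isotropic direction through x_i
        logw = np.array([log_pi(x + t * u) for t in t_grid])
        w = np.exp(logw - logw.max())                  # restriction of pi to the line
        x = x + rng.choice(t_grid, p=w / w.sum()) * u  # approximate draw along the line
        samples.append(x.copy())
    return np.array(samples)
```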

Algorithm 3 Gibbs Sampler (for π supported on submanifold)

Algorithm 3 is identical to Algorithm 2, except for the following steps:

3: Sample x_{i+1} from π(x)/‖dS/dM‖ restricted to S ∩ M (using a subroutine Markov chain or another sampling method)

(Here ‖dS/dM‖ denotes the product of the singular values of the projection map from

S onto M⊥, the orthogonal complement of the tangent space of M at x.)

If the manifold can be mapped onto a sphere, it is sometimes simpler to bypass

the primary Markov chain in the Gibbs sampler and sample the manifold directly by

intersecting it with random subspaces moving according to the kinematic measure:

Algorithm 4 Great sphere sampler

Algorithm 4 is identical to Algorithms 2 and 3, except for the following steps:

input: oracle for π supported on a manifold M ⊂ S^n.

2: Sample a search-subspace S_i ⊂ S^n that is an isotropic random great sphere

independent of x_i.

One problem with Gibbs sampling algorithms when sampling from distributions

with multiple modes is that the orientation of the search subspace S can greatly


distort the apparent size of a mode, slowing the algorithm. In Chapter 2 we use

concentration of measure to quantify how much these distortions slow down the Gibbs

sampling algorithm. We also show how one can use integral geometry to eliminate

some of these distortions.

1.1.3 Hamiltonian Monte Carlo algorithms

Like Gibbs sampling algorithms, Hamiltonian Monte Carlo (HMC) algorithms [19]

seek to avoid quadratic slowdowns associated with diffusion-like "random walk" be-

havior. They do so by simulating the trajectory of a Hamiltonian particle for some

amount of time T, and then refreshing the momentum according to the momentum's

Boltzmann distribution from statistical mechanics. Since the particle has momentum,

it will tend to take large steps in the direction of the momentum, avoiding "random

walk" behavior. Since Hamiltonian trajectories conserve energy, there is no need to

reject any proposed steps. For this reason HMC algorithms work especially well in

high dimensions since concentration of the posterior measure π causes most other

MCMC algorithms to either sample steps that are very close (leading to "random

walk" behavior) or to propose steps that have low probability density and are thus

rejected with high probability.

In this section we review three commonly-used HMC algorithms (Figure 1-1). The

first two of these HMC algorithms (Algorithms 5 and 6) form the workhorse of the

popular Bayesian software package Stan [10]. All three algorithms generate a Markov

chain step by integrating a Hamiltonian trajectory for a time T, and refreshing the

momentum. In Chapter 3 we will show global lower bounds on the mixing times

of a large class of HMC algorithms sampling from arbitrary posterior distributions,

including Algorithms 5-7.

Isotropic-Momentum HMC (Algorithm 5) [44] is the most basic HMC algorithm

(Figure 1-1, top, entire solid+dotted trajectory).

The No-U-Turn Sampler is a modification of Algorithm 5 which seeks to take

longer steps by avoiding U-turns in the Hamiltonian trajectories. It does so by stop-

ping the trajectory once any two velocity vectors on the trajectory path form an angle


Figure 1-1: The Isotropic-Momentum HMC trajectory (dashed and solid, top), No-U-Turn HMC trajectory (solid only, top), and Riemannian Manifold HMC trajectory (bottom). The Isotropic-Momentum and Riemannian Manifold trajectories evolve for a fixed time T, while the No-U-Turn trajectory stops once any two momentum vectors on the trajectory are orthogonal. The Isotropic-Momentum HMC and No-U-Turn HMC both have spherical Gaussian random initial trajectories, while Riemannian Manifold HMC has a non-spherical Gaussian random initial trajectory determined by the Hessian of -log(π) at t = 0. In Chapter 3, we will use boundaries such as ∂S to establish lower bounds on the HMC mixing time.


Algorithm 5 Isotropic-Momentum HMC (idealized symplectic integrator) [44]

input: q_0, oracle for π : R^n → [0, ∞)
output: q_1, q_2, ..., with stationary distribution π
define: H(q, p) := -log(π(q)) + ½ pᵀp.

1: for i = 1, 2, ... do
2:   Sample independent p_i ~ N(0, 1)^n
3:   Integrate the Hamiltonian trajectory (q(t), p(t)) with Hamiltonian H over the time interval [0, T] and initial conditions (p(0), q(0)) = (p_i, q_i)
4:   Set q_{i+1} = q(T)
5: end for

of more than 90° (Figure 1-1, top, solid trajectory).

Algorithm 6 Idealized No-U-Turn Sampler HMC (perfect symplectic integrator) [32]

Algorithm 6 is identical to Algorithm 5, except for step 3.

3: Integrate the Hamiltonian trajectory (q(t), p(t)) over the time interval [0, T], with initial conditions (p(0), q(0)) = (p_i, q_i), where T is the minimum time such that the velocity vectors at two points on the trajectory path form an angle greater than 90°.

Riemannian Manifold HMC seeks to take longer steps by choosing initial momenta

from a multivariate Gaussian distribution that agrees with the local geometry of the

posterior density π (Figure 1-1, bottom):

Algorithm 7 Riemannian Manifold HMC (idealized symplectic integrator) [27, 25]

Algorithm 7 is identical to Algorithm 5, except for the following steps:

define: H(q, p) := -log(π(q)) + c_n det(G(q)) + ½ pᵀG(q)p, where G(q) is the non-degenerate Fisher information matrix of π at q, and c_n = ½ log(π) n.

2: Sample p_i ~ N(0, G⁻¹(q_i))
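For intuition, here is a minimal Python sketch of one step of the basic recipe in Algorithm 5, using a leapfrog integrator with a Metropolis correction as a standard practical stand-in for the idealized exact symplectic integrator; the function names, the step size dt, and the correction step are illustrative assumptions, not part of the thesis's algorithms.

```python
import numpy as np

def hmc_step(log_pi, grad_log_pi, q, T, dt, rng):
    """One step of Isotropic-Momentum HMC (cf. Algorithm 5), with leapfrog
    integration standing in for the idealized exact integrator."""
    p = rng.standard_normal(q.shape)              # step 2: p_i ~ N(0, I_n)
    q_new, p_new = q.copy(), p.copy()
    for _ in range(max(1, int(T / dt))):          # step 3: integrate H(q, p) = -log pi(q) + p'p/2
        p_new += 0.5 * dt * grad_log_pi(q_new)
        q_new += dt * p_new
        p_new += 0.5 * dt * grad_log_pi(q_new)
    # Correct for discretization error; the exact integrator would need no correction.
    dH = (log_pi(q_new) - 0.5 * p_new @ p_new) - (log_pi(q) - 0.5 * p @ p)
    return q_new if np.log(rng.uniform()) < dH else q
```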

1.2 Integral & differential geometry preliminaries

In this section we review results from differential geometry, integral geometry, and

concentration of measure that we will use extensively in this thesis.


1.2.1 Kinematic measure

Up until this point we have talked about random search-subspaces informally. This

notion of randomness is formally referred to as the kinematic measure [53, 54]. The

kinematic measure provides the right setting to state the Crofton Formula. The kine-

matic measure, as the name suggests, is invariant under translations and rotations.

The random subspace is said to be "moving according to the kinematic measure".

The kinematic measure is the formal way of discussing the following simple situ-

ation: we would like to take a random point p uniformly on the unit sphere or, say,

inside a cube in R'. First we consider the sphere. After choosing p we then choose

an isotropically random plane of dimension d + 1 through the point p and the center

of the sphere. In the case of the sphere, this is simply an isotropic random plane

through the center of the sphere. On a cube there are some technical issues, but the

basic idea of choosing a random point and an isotropic random orientation using that

point as the origin persists. On the cube we would allow any orientation not only

those through a "center". The technical issues relate to the boundary effects of a

finite cube or the lack of a concept of a uniform probability measure on an infinite

space. In any case the spherical geometry is the natural computational setting be-

cause it is compact (If we insist on artificially compactifying R^n by conditioning on a

compact subset then either the boundary effects cause the different search-subspaces

to vary greatly in volume, slowing the algorithm, or we must restrict ourselves to such

a large subset of R^n that most of the search-subspaces don't pass through much of the

region of interest). However, for the sake of completeness we introduce the kinematic

measure for the Euclidean as well as the spherical constant-curvature space because

it is relevant in more theoretical applications.

In the spherical geometry case, we define the kinematic measure with respect

to a fixed non-random subset S_fixed ⊂ S^n, usually a great subsphere, by the action

of the Haar measure on the special orthogonal group SO(n+1) on S_fixed. When

generalizing to Euclidean geometry, we must be a bit more careful, because there is

no uniform probability distribution on Rn. In the case where S has finite d-volume,


we can circumvent these issues simply by choosing p to be a point in the Poisson point process. To generalize to planes, we may define the kinematic measure as a Poisson-like point process for our search-subspaces with a translationally and rotationally invariant distribution on all of R^n (the "points" here are the search-subspaces):

Definition 1. (Kinematic measure)

Let K^n ∈ {S^n, R^n} be a constant-curvature space. Let S_fixed be a d-dimensional manifold that either has a finite d-volume (in R^n or S^n), or is a plane (in R^n only). Let H be the Haar measure on G. If S_fixed has finite d-volume we take G to be the group I_n of isometries of K^n. If S_fixed is a plane, we instead take G to be the quotient group I_n/I_d of the isometries on K^n with the isometries on S_fixed. Let N be the counting process such that

(i) E[N(A)] = (1/Vol_d(S_fixed)) × H(A)

(ii) N(A) and N(B) are independent

for any disjoint Haar-measurable subsets A, B ⊂ G, where we drop the 1/Vol_d(S_fixed) term if S_fixed is a plane. We define the kinematic measure with respect to S_fixed ⊂ K^n to be the action of the elements of N on S_fixed.

If we wish to actually sample from the kinematic measure for the infinite-measure

space R" in real life, we must restrict ourselves to some (almost surely) finite subset

of the infinite kinematic measure "point" process. For instance, we could condition

on those subspaces that intersect the manifold M that we wish to sample from.
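In practice, the isotropic orientation used by such search-subspaces can be drawn from the QR factorization of a Gaussian matrix. The short Python sketch below (illustrative names, not from the thesis) returns an orthonormal basis of a uniformly random d-dimensional linear subspace of R^n; translating its span to pass through the current point x_i gives a search-subspace of the kind used in Algorithm 2.

```python
import numpy as np

def isotropic_subspace(n, d, rng=None):
    """Orthonormal basis of an isotropically (Haar) distributed d-dimensional
    linear subspace of R^n, obtained from the QR factorization of a Gaussian
    matrix."""
    rng = np.random.default_rng() if rng is None else rng
    Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
    return Q   # columns span a uniformly random d-dimensional subspace
```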

Remark 1. There is in fact a third constant curvature-space, the constant negative-

curvature hyperbolic space H^n (S^n has constant positive curvature and R^n constant

zero-curvature). Since the proof of Theorem 3 in chapter 2 seems to rely only on

the constant-curvature of the space, we suspect that nearly identical versions of this

proof and theorem probably apply to hyperbolic space as well. However, we do not

investigate this further as it is beyond the scope of this thesis.


1.2.2 The Crofton formula

In this section, we state the Crofton formula [17, 53, 54], which says that the volume of

a manifold M is proportional to the average of the volumes of the intersection S ∩ M of

M with a random search-subspace S moving according to the kinematic measure. Our

first-order reweighting of the Gibbs sampler for submanifolds (section 2.2), referred

to as the "angle-independent" reweighting in the introduction of Chapter 2, is based

on this formula. In Section 2.3, we will prove a generalization of this formula that

will allow for higher-order reweightings. In Chapter 4, we will prove a generalization

of the Crofton formula that applies to trajectories in Hamiltonian dynamics including

trajectories in the HMC algorithms. We will then apply our "Hamiltonian Crofton

Formula" to improve the convergence rate of the HMC algorithm when it is used to

compute integrals on manifolds.

Lemma 1. (Crofton Formula) [17, 53, 54]

Let M be a codimension-k submanifold of K^n, where K^n ∈ {S^n, R^n}. Let S be a random d-dimensional manifold in K^n of finite volume (or a random plane), moving according to the kinematic measure. Then there exists a constant c_{d,k,n,K} such that

Vol_{n−k}(M) = c_{d,k,n,K} × E_S[ Vol_{d−k}(S ∩ M) / Vol_d(S) ],        (1.1)

where we set Vol_d(S) to 1 if S is a plane. In the spherical case we have c_{d,k,n,S} = Vol(S^{n−k}) Vol(S^d) / Vol(S^{d−k}); c_{d,k,n,R} is given in [53] and depends on whether Vol_d(S) is finite.
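A quick Monte Carlo illustration of this kind of identity is the classical planar Cauchy-Crofton formula (the d = 1, k = 1, n = 2 case of Lemma 1, in its standard planar normalization rather than with the constant c_{d,k,n,K} above): the length of a curve equals one half of the integral, over all lines, of the number of intersection points. The Python sketch below, with illustrative names, checks this for the unit circle.

```python
import numpy as np

def crofton_length_of_unit_circle(n_lines=200_000, R=2.0, rng=None):
    """Lines are parameterized by an angle in [0, pi) and a signed distance p
    to the origin; for a circle the intersection count does not depend on the
    angle, so only p is sampled."""
    rng = np.random.default_rng() if rng is None else rng
    p = rng.uniform(-R, R, n_lines)             # signed distance of the line to the origin
    hits = np.where(np.abs(p) < 1.0, 2, 0)      # a line with |p| < 1 meets the circle twice
    return 0.5 * hits.mean() * (2 * R) * np.pi  # rescale by the measure of the (p, angle) window

print(crofton_length_of_unit_circle())          # approximately 2*pi = 6.283...
```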

1.2.3 Concentration of measure

The Concentration of Measure phenomenon ([38], [46]), is the idea that volume con-

centrates in certain regions of high-dimensional space. One well-known result says

that all but an exponentially small (in n) volume of an (n−1)-sphere concentrates at

a small distance from any single (n−2)-dimensional equator [38].

In Section 2.3.6 we will briefly go over some of our generalizations [43] of this

concentration result to the kinematic measure, which says that most of the intersec-


tion volume of an n-sphere with kinematic measure-distributed d-dimensional search-

subspaces concentrates in a fraction of these search-subspaces that is exponentially

small in d, causing the variance of these intersection volumes to grow exponentially as

well. We will then use our concentration results for kinematic measure to compare the

convergence rates of the traditional Gibbs sampler to our curvature-reweighted Gibbs

sampler for an example involving the sampling of a manifold M that is a collection

of spheres.

1.2.4 The Chern-Gauss-Bonnet theorem

The Gauss-Bonnet theorem [56] states that the integral of the Gauss curvature C of a 2-dimensional manifold M is proportional to its Euler characteristic χ(M):

∫_M C dA = 2π χ(M).        (1.2)

The Chern-Gauss-Bonnet theorem, a generalization of the Gauss-Bonnet theorem

to arbitrary even-m-dimensional manifolds [13, 57], states that

∫_M Pf(Ω) dVol_m = (2π)^{m/2} χ(M),        (1.3)

where Ω is the curvature form of the Levi-Civita connection and Pf is the Pfaffian.

The curvature form Ω is an intrinsic property of the manifold, i.e., it does not depend

on the embedding. In the special case when M is a hypersurface, the curvature Pf(Ω)

may be computed as the Jacobian determinant of the Gauss map at x [59, 61].

The Chern-Gauss-Bonnet theorem is usually viewed as a way of relating the cur-

vature of the manifold with its Euler characteristic. In Section 2.3 we will instead

interpret the Chern-Gauss Bonnet theorem as a way of relating the volume form

dVolm to the curvature form Q. This will come in useful since the curvature form

does not change very quickly in sufficiently smooth manifolds, allowing us to get a

good estimate for the volume of the manifold from its curvature form at a single

point.
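As a small sanity check of Equation (1.2), the following Python sketch (illustrative, assuming the round 2-sphere of radius R, for which the Gauss curvature is the constant C = 1/R²) integrates C numerically and recovers 2πχ = 4π regardless of R.

```python
import numpy as np

def gauss_bonnet_check(R=1.5, n_theta=2000):
    """Integrate the constant Gauss curvature 1/R^2 over a sphere of radius R."""
    theta = (np.arange(n_theta) + 0.5) * (np.pi / n_theta)              # midpoint rule in the polar angle
    area = 2 * np.pi * R**2 * np.sin(theta).sum() * (np.pi / n_theta)   # surface area, ~ 4*pi*R^2
    return (1.0 / R**2) * area

print(gauss_bonnet_check())   # approximately 4*pi = 12.566...
```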


1.3 Contributions of this thesis

The contributions of this thesis are as follows:

" In Chapter 2 we show how Crofton formulae from integral geometry can be used

to eliminate inefficiencies in Gibbs sampling MCMC algorithms, and prove that

the transition kernels of the primary Gibbs sampling Markov chains remain

unchanged after applying Crofton formulae. In doing so, we also prove a gen-

eralization of Crofton's formula that allows for the use of generalized Gauss

curvature to reduce the variance in Crofton's formula without introducing a

bias. Some of our integral geometry results from Chapter 2 have since been

used and further generalized by Amelunxen and Lotz in [2, 3].

" In Chapter 3, we use the symplectic volume-preserving properties of Hamil-

tonian dynamics and Cheeger's inequality from graph theory to prove upper

bounds for the spectral gap of Hamiltonian Monte Carlo algorithms for general

posterior densities π. Our results apply to the classical HMC algorithm, as

well as Riemannian Manifold HMC and No-U-Turn HMC, the workhorse of the

popular Bayesian software package Stan [10]. One consequence of our work is

the impossibility of energy-conserving Hamiltonian Markov chains to search for

far-apart sub-Gaussian modes in polynomial time.

" In Chapter 4, we prove a generalization of the Crofton formula that applies to

Hamiltonian trajectories, and use our generalized Crofton formula to improve

the convergence speed of HMC-based integration on submanifolds.

* In Chapter 5, we present a generalization of the Hopf fibration acting on

arbitrary-β ghost-valued random variables. For β = 4, the geometry of the Hopf

fibration is encoded by the quaternions; we investigate the extent to which the

elegant properties of this encoding are preserved when one replaces quaternions

with general β > 0 ghosts.


Chapter 2

Integral Geometry for Gibbs

Samplers

2.1 Introduction

In this chapter, we consider Gibbs sampler MCMC algorithms. If the density we wish

to sample from has many modes, or if the density has support on a submanifold M of

R^n, then severe inefficiencies can arise. The purpose of this chapter is to demonstrate

that integral geometry can be used to eliminate many of these inefficiencies. To

illustrate these inefficiencies and our proposed fix, we imagine we would like to sample

uniformly from a manifold M ⊂ R^{n+1} (as illustrated in dark blue in Figure 2-1.) By

uniformly, we can imagine that M has finite volume, and the probability of being

picked in a region is equal to the volume of that region. More generally, we can put

a probability measure on M and sample from that measure.

We consider algorithms that produce a sequence of points {x_1, x_2, ...} (yellow dots

in Figure 2-1) with the property that x_{i+1} will be chosen somehow in an (isotropically

generated) random plane S (red plane in Figure 2-1) centered at x_i. Further, the step

from x_i to x_{i+1} is independent of all the previous steps (Markov chain property.)

This situation is known as a Gibbs sampling Markov chain with isotropic random

search-subspaces.

For our purposes, we find it helpful to pick a sphere (light blue) of radius r that


represents the length of the jump we will take upon stepping from x_i to x_{i+1}. Note

that r is usually random. The sphere will be the natural setting to mathematically

exploit the symmetries associated with isotropically distributed planes. Conditioning

on the sphere, the plane S becomes a great circle S̄ (red), and the manifold M

becomes a submanifold (blue) of the sphere. Assuming we take a step length of r,

then necessarily x_{i+1} must be on the intersection (green dots in Figure 2-1, higher-

dimensional submanifolds in more general situations) of the red great circle and the

blue submanifold.

For definitiveness, suppose our ambient space is R^{n+1} where n = 2, our blue

manifold M has codimension k = 1, and our search-subspaces have dimension k +

1. Our sphere now has dimension n and the great circle dimension k = 1. The

intersections (green dots) of the great circle with M are 0-dimensional points.

We now turn to the specifics of how x_{i+1} may be chosen from the intersection of

the red curve and the blue curve. Every green point is on the intersection of the blue

manifold and the red circle. It is worth pondering the distinction between shallower

angles of intersection, and steeper angles. If we thicken the circle by a small constant

thickness ε, we see that a point with a shallow angle has a larger intersection than a

steep angle. Therefore points with shallow angles should be weighted more. Figure

2-2 illustrates that 1/sin(θ_i) is the proper weighting for an intersection angle of θ_i.

We will argue that the distinction between shallower and steeper angles takes

on a false sense of importance and traditional algorithms may become unnecessarily

inefficient accordingly. A traditional algorithm focuses on the specific red circle that

happens to be generated by the algorithm and then gives more weight to intersection

points with shallower angles. We propose that knowledge of the isotropic distribution

of the red circle indicates that all angles may be given the same weight. Therefore,

any algorithmic work that goes into weighting points unequally based on the angle of

intersection is wasted work.

Specifically, as we will see in Section 2.2.2, 1/sin(θ_i) has infinite variance, due in part

to the fact that 1/sin(θ_i) can become arbitrarily large for small enough θ_i. The algorithm

must therefore search through a large fraction of the (green) intersection points before


converging because any one point could contain a significant portion of the conditional

probability density, provided that its intersection angle is small enough. This causes

the algorithm to sample the intersection points very slowly in situations where the

dimension is large and there are typically exponentially many possible intersection

points to sample from.

This chapter justifies the validity of the angle-independent approach through the

mathematics of integral geometry [53, 54, 22, 30, 1], and the Crofton formula in

particular in Section 2.2. We should note that sampling all the intersection points

with equal probability cannot work for just any choice of random search-subspace

S. For instance, if the search-subspaces are chosen to be random longitudes on the

2-sphere, parts of M that have a nearly east-west orientation would be sampled

frequently but parts of M that have nearly north-south orientation would be almost

never sampled, introducing a statistical bias to the samples in favor of the east-west

oriented samples. However, if S is chosen to be isotropically random, the random

orientation of S does not favor either the north-south nor the east-west parts of

M, suggesting that we can sample the intersection points with equal probability

in this situation without introducing a bias. Effectively, by sampling with equal

probability weights and isotropic search-subspaces we will use integral geometry to

compute an analytical average of the weights, an average that we would otherwise

compute numerically, thereby freeing up computational resources and speeding up

the algorithm.

In Part II of this chapter, we perform a numerical implementation of an approxi-

mate version of the above algorithm in order to sample the eigenvalues of a random

matrix conditioned on certain rare events involving other eigenvalues of this matrix.

We obtain different histograms from these samples weighted according to both the

traditional weights as well as integral geometry weights (Figure 2-3; Figures 2-10 and

2-11 in part II). We find that using integral geometry greatly reduces the variance of

the weights. For instance, the integral geometry weights normalized by the median

weight had a sample variance of 3.6 × 10^5, 578, and 1879 times smaller than the tra-

ditional weights, respectively, for the top, middle, and bottom simulations of Figure


Figure 2-1: In this example we wish to generate random samples on a codimension-k manifold M ⊂ R^{n+1} (dark blue) with a Gibbs sampling Markov chain {x_1, x_2, ...} that uses isotropic random search-subspaces S (light red) centered at the most recent point x_i (k = 1, n = 3 in figure). We will consider the sphere rS^n of an arbitrary radius r centered at x_i (light blue), allowing us to make use of the spherical symmetry in the distribution of the random search-subspace to improve the algorithm's convergence speed. S now becomes an isotropically distributed random great k-sphere S̄ = S ∩ rS^n (dark red), that intersects a codimension-k submanifold M̄ = M ∩ rS^n of the sphere.


2-3. This reduction in variance allows us to get faster-converging (i.e., smoother for

the same number of data points) and more accurate histograms in Figure 2-3. In

fact, as we show in Section 2.2.2, the traditional weights have infinite variance due to

their second-order heavy tailed probability density, so the sample variance tends to

increase greatly as more samples are taken. Because of the second-order heavy-tailed

behavior in the weights, the smoother we desire the histogram to be, the greater the

speed up in the convergence time obtained by using the integral geometry weights in

place of the traditional weights.

Figure 2-2: Conditional on the next sample point x_{i+1} lying a distance r from x_i, the algorithm must randomly choose x_{i+1} from a probability distribution on the intersection points (middle, green) of the manifold M with the isotropic random great circle S̄ (red). If traditional Gibbs sampling is used, intersection points with a very small angle of intersection θ_i must be sampled with a much greater (unnormalized) probability 1/sin(θ_i) (right, top) than intersection points with a large angle (right, bottom). This greatly increases the variance in the sampling probabilities for different points and slows down the convergence of the method used to generate the next sample x_{i+1}. However, since S̄ is isotropically distributed on rS^n, the symmetry of the isotropic distribution of S̄ allows us to use the Crofton formula from integral geometry to analytically average out these unequal probability weights so that every intersection point now has the same weight, freeing the algorithm from the time-consuming task of effectively computing this average numerically.

Remark 2. Since we are using an approximate truncated version of the full algorithm

that is not completely asymptotically accurate, the integral geometry weights also

cause an increase in asymptotic accuracy. The full MCMC algorithm should have

perfect asymptotic accuracy, so we expect this increase in accuracy to become an


increase in convergence speed if we allow the Markov chain to mix for a longer amount

of time.

For situations where the intersections are higher-dimensional submanifolds rather

than individual points, we show in Section 2.3 that the angle-independent approach

generalizes to a curvature-dependent approach. We stress that traditional algorithms

condition only on the plane that was actually generated while ignoring its isotropic

distribution. By taking the isotropy into account, our algorithm can use the curvature

information of the manifold to compute an analytical average of the local intersection

volumes (local in a second-order sense) with all possible isotropically distributed

search-subspaces, greatly reducing the variance of the volumes.

Higher-dimensional intersections occur in many (perhaps most) situations, such

as applications with events that are rare for reasons other than that their associated

submanifold has high codimension. In these situations, the probability of a low-

dimensional search-subspace intersecting M can be very small, so one may wish to

use a search-subspace S of dimension d that is greater than the codimension k of M

in order to increase the probability of intersecting M.

As we will see in Section 2.3.6, the traditional approach can lead to a huge vari-

ance in the intersection volumes that increases exponentially with the difference in

dimension d - k (Figure 2-4, right). This exponentially large variance leads to the

same type of algorithmic slowdowns of the traditional algorithm as the variance in

the traditional angle weights discussed above. Using the curvature-aware approach

can oftentimes reduce or eliminate this exponential slowdown.

This chapter justifies the validity of the curvature-aware approach by proving a

generalization of the Crofton formula (Section 2.3). We then motivate the use of the

curvature-aware approach over the traditional curvature-oblivious approach using the

mathematics of concentration-of-measure [43, 38, 46] (Section 2.3.6) and differential

geometry [56, 57], specifically the Chern-Gauss-Bonnet Theorem [13] whose curvature

form we use to re-weight the intersection volumes (Section 2.3.4).


Figure 2-3: Histograms from 3 random matrix simulations (see Sections ... and ...) where we seek the distribution of an eigenvalue given conditions on one or more other eigenvalues. In all three figures, the blue curve uses the integral geometry weights proposed in this chapter, the red curve uses traditional weights, and the black curve (only in the top two figures) is obtained by the accurate but very slow rejection sampling method. Two things worth noticing are that the integral geometry weight curve is more accurate than the traditional weight curve (at least when we have a rejection sampling curve to compare), and that the integral geometry weight curve is smoother than the traditional weight curve. The integral geometry algorithm achieves these benefits in part because of the much smaller variance in the weights. (In these three simulations the integral geometry sample variance was smaller by a factor of 10^5, 600, and 2000 respectively.)


[Figure 2-4, right panel: log-scale plot of the variance of Vol(S ∩ M_i), normalized by its mean, as a function of the search-subspace dimension d.]

Figure 2-4: In this example a collection M = M_1 ∪ M_2 ∪ ... of (n−1)-dimensional spheres M_i (blue, left) is intersected (intersection depicted as green circles) by a random search-subspace S (red). The spheres that S intersects farther from their center will have a much smaller intersection volume than the spheres that S intersects closer to their center, with the variance in the intersection volumes increasing exponentially in the dimension d of S (logarithmic plot, right). This curse of dimensionality for the intersection volume can lead to an exponential slowdown when using a traditional algorithm to sample from S ∩ M. In Section 2.3.6 we will see that this slowdown can be avoided if we use the curvature information to reweight the intersection volumes, reducing the variance in the intersection volumes.
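The effect behind Figure 2-4 can be reproduced with a few lines of Python under two simplifying assumptions that are not spelled out in the figure itself: (a) the distance r from a hit unit sphere's center to the d-plane S has density proportional to r^(n−d−1) on [0, 1] (the kinematic measure conditioned on intersection), and (b) the intersection volume is proportional to (1 − r²)^((d−1)/2). The sketch below, with illustrative names, shows the variance-to-squared-mean ratio of the intersection volume growing rapidly with d.

```python
import numpy as np

def intersection_volume_spread(n=500, dims=(5, 50, 100, 200, 400), n_samples=200_000, rng=None):
    """Var/E^2 of the intersection volume of a random d-plane with a unit sphere."""
    rng = np.random.default_rng() if rng is None else rng
    spread = {}
    for d in dims:
        r = rng.uniform(size=n_samples) ** (1.0 / (n - d))   # inverse-CDF sample of density ~ r^(n-d-1)
        logv = 0.5 * (d - 1) * np.log1p(-r**2)               # log intersection volume, up to a constant
        v = np.exp(logv - logv.max())                        # rescaling does not change Var/E^2
        spread[d] = v.var() / v.mean() ** 2
    return spread

print(intersection_volume_spread())
```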

Part I

Theoretical results and discussion

2.2 A first-order reweighting via the Crofton for-

mula

As discussed in the introduction to this chapter, we can use Crofton's formula directly

to eliminate the weight 1/‖dS/dM‖ in step 3 of Algorithm 4:

Algorithm 8 Great sphere sampler

Algorithm 8 is identical to Algorithm 4, except for the following step:

3: Sample x_{i+1} from π(x) restricted to S ∩ M (using a subroutine Markov chain

or another sampling method)

Before we apply Crofton's formula to the Gibbs sampler (Algorithm 3), which

uses isotropic random linear search subspaces (as opposed to great spheres), we need


the following modification of Crofton's formula:

Theorem 1. (Crofton's formula for isotropic random linear subspaces)

Let S be an isotropic random d-dimensional linear subspace centered at the origin. Let π be a function on a codimension-k manifold M. Then

∫_M π(x) dx = c × E_S[ ∫_{S∩M} π(x) ‖x‖^{n−d} / ‖dS̄/dM‖ dx ],

where c = c_{d−1,k,n−1,S} is a constant and ‖dS̄/dM‖ is the sine of the angle between M and the line passing through both the origin and x.

Proof.

∫_M π(x) dx = ∫_0^∞ ∫_{M∩rS^{n−1}} π(x) / ‖d(M∩rS^{n−1})/dM‖ dx dr

(Crofton formula on rS^{n−1})
= ∫_0^∞ c_{d−1,k,n−1,S} E_S[ ∫_{S∩M∩rS^{n−1}} π(x) ‖x‖^{n−d} / ‖d(M∩rS^{n−1})/dM‖ dx ] dr

(Fubini)
= c_{d−1,k,n−1,S} E_S[ ∫_0^∞ ∫_{S∩M∩rS^{n−1}} π(x) ‖x‖^{n−d} / ‖d(M∩rS^{n−1})/dM‖ dx dr ]

= c_{d−1,k,n−1,S} E_S[ ∫_{S∩M} π(x) ‖x‖^{n−d} / ‖dS̄/dM‖ dx ].  □

2.2.1 The Crofton formula Gibbs sampler

As discussed in the introduction, we can apply the first-order reweighting of Theo-

rem 1 to the Gibbs sampler algorithm with d-dimensional isotropic random search-

subspaces to get a more efficient MCMC algorithm (Algorithm 9):

Algorithm 9 Crofton Formula Gibbs Sampler (for π supported on submanifold)

Algorithm 9 is identical to Algorithm 3, except for the following steps:

3: Sample x_{i+1} from the (unnormalized) density π(x)/‖dS̄/dM‖ restricted to

S ∩ M (using a subroutine Markov chain or another sampling method)


Theorem 2. The primary Markov chains in Algorithms 9 and 3 (denoted by x_1, x_2, ...

in both algorithms) have identical probability transition kernels.

Proof. For convenience, in this proof we will denote the primary Markov chains in Algorithms 9 and 3 by x̂_1, x̂_2, ... and x̃_1, x̃_2, ..., respectively.

Let k̂(U, V) := P(x̂_{i+1} ∈ V | x̂_i ∈ U) and k̃(U, V) := P(x̃_{i+1} ∈ V | x̃_i ∈ U) denote the probability transition kernels for the primary Markov chains in Algorithms 9 and 3, respectively (here U, V ⊂ R^n).

Then

k̃(U, V) = P(x̃_{i+1} ∈ V | x̃_i ∈ U) = ∫_U E_S[ ∫_{S∩V∩M} π(y) / ‖dS/dM‖ dy ] dx

= ∫_U ∫_{V∩M} π(y) · c^{−1} dy dx        (by Theorem 1)

= ∫_U E_S[ ∫_{S∩V∩M} π(y) / ‖dS̄/dM‖ dy ] dx        (by Theorem 1)

= P(x̂_{i+1} ∈ V | x̂_i ∈ U)

= k̂(U, V).  □

2.2.2 Traditional weights vs. integral geometry weights

In this section we find the theoretical distribution of the traditional weights and

compare them to the integral geometry weights of Theorem 1. We will see that

while both weights incorporate the factor 1/‖dS̄/dM‖, the traditional weights have an additional component not present in the integral geometry weights that has infinite variance, greatly slowing the traditional MCMC algorithm. Indeed, ‖dS/dM‖ = ‖d(M∩S, x)‖ · ‖dS̄/dM‖, where dS̄ is the projection of dS onto the tangent space at x of the sphere of radius ‖x − x_i‖ centered at x_i. Since both weights share the component ‖dS̄/dM‖, for the remainder of this section we will focus our analysis on the component ‖d(M∩S, x)‖ that is unique to the traditional algorithm.


In the codimension-k = 1 case, we can find the distribution of the weights by

observing that the symmetry of the Haar measure means that the distribution of

the weights is a local property that does not depend on the choice of manifold

M. Moreover, since the kinematic measure is locally the same for both constant-

curvature spaces S^n and R^n, the distribution is the same regardless of the choice of

constant-curvature space. Hence, without loss of generality, we may choose M to

be a cylinder of unit radius in R^n. We observe that projecting the cylinder down to

the unit circle in R^2, together with the dimension k = 1 search-subspace, does not increase the weights. Because of the rotational symmetry of both the kinematic measure and the circle, without loss of generality we may condition on only the vertical lines {(x, t) : t ∈ R}, in which case x is distributed uniformly on [−1, 1]. The weights are then given by w = w(X) = √(1 + X²/(1 − X²)), with exactly two intersections at almost every x. Hence, E[w] = 2 ∫_{−1}^{1} √(1 + x²/(1 − x²)) dx = 2π, the circumference of the circle, as expected. However, E[w²] = 2 ∫_{−1}^{1} (1 + x²/(1 − x²)) dx = ∞. Hence, w has infinite variance. Since projecting down to R² did not increase the weights, the original

weights must have infinite variance as well, greatly slowing the convergence of the

sampling algorithm even in the codimension-k = 1 case! On the other hand, the

integral geometry weights, being identically 1, have variance zero, so the weights

do not slow down the convergence at all. (A related computation, which we do not

give here, shows that the theoretical weights for general k are given by the Wishart

matrix determinant 1/√(det(GᵀG)), where G is a (k + 1) × k matrix of i.i.d. standard

normals, which also has infinite variance.)
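The infinite variance of the traditional weights in this example is easy to see numerically. The short Python sketch below (illustrative, using the unit-circle weights derived above) shows the sample mean converging to the circumference 2π while the sample variance keeps drifting upward as more samples are drawn, since the second moment is infinite.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1_000_000)
w = np.sqrt(1.0 + x**2 / (1.0 - x**2))   # traditional weight at a vertical line hitting the circle

print(2 * w.mean())   # converges to the circumference 2*pi = 6.283...
print(w.var())        # keeps growing with the sample size: E[w^2] is infinite
# The integral geometry weights are identically 1, so their sample variance is exactly 0.
```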

2.3 A generalized Crofton formula

Oftentimes, it is necessary to use a random search-subspace of dimension d larger

than the codimension k of the constraint manifold M (the manifold we wish to

sample from). For instance, the manifold might represent a rare event, so we might

use a higher dimension than the codimension to increase the probability of finding an

intersection with the manifold. However, the intersections will no longer be points


but submanifolds of dimension d - k. How should one assign weights to the points on

this submanifold? The first-order factor in this weight is simple: it is the same as the

Jacobian weight of Theorem 1. However, the size of the intersection still depends on

the orientation of the search-subspace with respect to the constraint manifold. For

instance, we will see in Section 2.3.6 that if we intersect a spherical manifold with a

plane near the sphere's center, then we will get a much larger intersection than if we

intersect the sphere with a plane far from its center.

This example suggests that we should weight the points on the intersection using

the local curvature form: If we intersect in a direction where the curvature is greater

(with the plane not passing near the center in the example) then we should use a larger

weight than in directions where the curvature is smaller (when the plane passes near

the center) (Figure 2-5).


Figure 2-5: Both d-dimensional slices, S_1 and S_2, pass through the green point x, but the slice passing through the center of the (n−1)-sphere M has a much bigger intersection volume than the slice passing far from the center. The smaller slice also has larger curvature at any given point x. If we reweight the density of S_i ∩ M at x by the Chern-Gauss-Bonnet curvature of S_i ∩ M at x, then both slices will have exactly the same total reweighted volume (exact in this case since the sphere has constant curvature form), since the Chern-Gauss-Bonnet theorem relates this curvature to the volume measure.

Consider the simple case where M is a collection of spheres. If we were just

applying an algorithm based on the Classical Crofton formula, such as Algorithm 8,

we would sample uniformly from the volume on the intersection S ∩ M. However,

the intersected volume depends heavily on the orientation of the search-subspace S

with respect to each intersected sphere (Figure 2-4), meaning that the algorithm will


in practice have to search through exponentially many spheres before converging to

the uniform distribution on S ∩ M (see Section 2.3.6). To avoid this problem, we

would like to sample from a density ŵ that is proportional to the absolute value of the Chern-Gauss-Bonnet curvature of S ∩ M at each point x in the intersection: ŵ = ŵ(x; S) = |Pf(Ω_x(S ∩ M))| (the motivation for using the Chern-Gauss-Bonnet curvature Pf(Ω_x(S ∩ M)) will be discussed in Section 2.3.4).

However, sampling from the density ŵ(x; S) does not in general produce unbiased

samples uniformly distributed on M even when S is chosen at random according

to the kinematic measure. We will see in Theorem 3 that in order to guarantee

an unbiased uniform sampling of M we can instead sample from the normalized

curvature density

w(x; S) = ŵ(x; S) / ( c_{d,k,n,K} · E_Q[ ŵ(x; S_Q) × det(Proj_{M⊥} Q) ] ).        (2.1)

The normalization term E_Q[ŵ(x; S_Q) × det(Proj_{M⊥} Q)] is the average curvature at

x over all the random orientations at which S could have passed through x. Here

SQ = Q(S - x) + x is a random isotropically distributed rotation of S about x,

with Q the corresponding isotropic random orthogonal matrix. The determinant

inside the expectation is there because while S is originally isotropically distributed,

the conditioning of S to intersect M (at x) modifies the probability density of its

orientation by a factor of det(Proj_{M⊥} Q). Proj_{M⊥} is the projection onto the orthogonal

complement of the tangent space of M at x. In this collection of spheres example,

the denominator is a constant for each sphere of a radius R. For instance, in the

Euclidean case it can be computed analytically, using the Gauss-Bonnet theorem, as an explicit constant depending only on d, n, and R (a ratio of Gamma functions multiplied by a power of R).

From this fact, together with the fact that the total curvature is always the same

for any intersection by the Chern-Gauss-Bonnet theorem, we see that when sampling

under the probability density w the probability that we will sample from any given

sphere is always the same regardless of the volume of the intersection of S with that


sphere. Since each sphere (of the same radius) has an equal probability of being

sampled, when sampling from M the algorithm has to search for far fewer spheres

before converging to a uniformly random point on S ∩ M than when sampling from

the uniform distribution on S ∩ M.

The need to guarantee that w will still allow us to sample uniformly without bias

from M motivates introducing the following generalization of the classical Crofton

formula (Theorem 3), which, as far as we know, is new to the literature. Since the

proof does not rely on the fact that ŵ is derived from a curvature form, we state

the theorem in a more general form that allows for arbitrary ŵ (see Section 2.3.5

for a discussion of higher-order choices of ŵ beyond just the Chern-Gauss-Bonnet

curvature).

Theorem 3. (Generalized Crofton formula)

Let & be a weight function, M a manifold, and S a random search subspace

moving according to the kinematic measure, satisfying smoothness conditions Al and

A2 (defined below). Then

Vol(M) X E (x; S) dx]. (2.2)Cd,k,n,K snM EQ[w(x;SQ) x det(Projmj Q)]

Q is a matrix formed by the first d columns of a random matrix sampled from the

Haar measure on SO(n). SQ := Q(S - x) + x. Proju is the projection onto the

orthogonal complement of the tangent space of M at x.

(As in Lemma 1, if S is a plane, we set the "Vol(S)" term to 1.)

Remark 3. We note that Amelunxen and Lotz [2, 3] recently managed to provide a

more elegant proof of our Theorem 3, by modifying our proof using an algebraic ap-

proach similar to group-theoretic double-fibration arguments of [22, 1, 30]. Although

our proof of Theorem 3 relies on smoothness conditions Al and A2 (defined below),

their proof does not seem to rely on these two assumptions.

For MCMC applications, M is taken to be a component of a level set of 7F and tC

the magnitude of the Chern-Gauss-Bonnet curvature of SnM, since the Chern-Gauss-

34

Page 35: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Bonnet theorem states that the integral of the curvature form over the intersection

SnM is invariant under rotations of S as long as the topology of Sn M is unchanged.

Definition 2. (Smoothness conditions)

Al: A manifold (such as M or S in Theorem 3) satisfies condition Al if its cur-

vature form is uniformly bounded above.

A2: The pre-normalized weight Cv(x; S) is said to satisfy condition A2 if it is any

function such that a < wi(x; S) < b for some 0 < a < b, and is Lipschitz in the

variable x E M for some Lipschitz constant 0 < c / oo (when using a translation of

S to keep x in S n M when we vary x).

Proof. (Of Theorem 3)

We first observe that it suffices to prove Theorem 3 for the case where K' = R' is

Euclidean, S is a random plane, and w(x; S) = w(x; L) depends only on the orien-

tation d = d I of the tangent spaces of S and M at x. This is because constant-

curvature kinematic measure spaces are locally Euclidean (and converge uniformly to

a Euclidean geometry if we restrict ourselves to increasingly small neighborhoods of

any point in the space because the curvature is the same). We may use any geodesic

d-cube in place of the plane as a search-subspace S, since S can be decomposed as

a collection of cubes, and Equation 2.2 treats each subset of S in an identical way

(since so far we have assumed that w(x; S) depends only on the orientation of the

tangent spaces of S and M at x). We can then approximate any search-subspace S

of bounded curvature, and Lipschitz function w(x; S) that depends on the location

on S where S intersects M (in addition to L), by approximating S with very small

squares, each with a different "w(x; L)" that depends only on d.

The remainder of the proof consists of two parts. In Part I we prove the theorem

for the special case of very small codimension-k balls (in place of M). In Part II we

extend this result to the entire manifold by tiling the manifold with randomly placed

balls.

35

Page 36: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Part I: Special case for small codimension-k balls

Let BE = BE(x) be any k-ball of radius c that is tangent to M C R' at the ball's

center x. Let S and 5 be independent random d-planes distributed according to the

kinematic measure in RI. Let r be the distance in the k-plane containing BE (the

shortest line contained in this plane) from S to the ball's center x. Let 0 be the

orthogonal matrix denoting the orientation of S. Then we may write S = S,,O

Then almost surely (i.e., with probability 1; abbreviated "a.s.") Vol(Sr,o n BE)

does not depend on 0 (this is because BE is a codimension-k ball and S is a d-plane,

so the volume of S n BE, itself a d - k-ball, depends a.s. only on r and not on 9).

We also note that w(x; d) obviously does not depend on r as well. Define events

E := {S,,O n B #0} and := {5n B, # 0}. Then

Er,O [W (x; d ) 'X VOldk-(Sr,o Be)] (2.3)

= E,,O W (X; d x) X VOldk(Sr,o n BE) E x P(E) (2.4)

= Eo [ (X; dB E x Er[Vold-k(s,o nBE)IE] x P(E) (2.5)

= E0 x ' dB, E[ x E[VolVdk((S,o nBlB)EE] x P(E) (2.6)Cd,k,n,R E [ (x; d) ]V S f

1 Eo[W^ (x; dsx,)IE]= x dB x E[VOldk-(Sr,O n BE)IE] x P(E) (2.7)

Cd,k,n,R Eg[((X; d)|5

36

Page 37: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Cd,k,n,R

1

Cd,k,n,R

1

Cd,k,n,Rl

1

Cd,k,n,R

X 1 X Er[VOld-k(Sr,o n B,)|E] x IP(E)

X Er,O[VOld-(Sr,o n BI)E] x P(E)

X Er,O[VOld-k(Sr,o f Be)]

X Cd,k,n,R X Vold-k(BE)

= Vold-k(BE).

(2.8)

(2.9)

(2.10)

(2.11)

(2.12)

* Equation 2.5 is due to the fact that r and 0 are independent random vari-

ables even when conditioning on the event E. This is true because they are

independent in the unconditioned kinematic measure on S, and remain inde-

pendent once we condition on S intersecting BE (i.e., the event E) because of

the symmetry of the codimension-k ball BE.

* Equation 2.6 is due to the fact that, by the change of variables formula,

1' 1Vol(TQ + RQLy n BE)dVolfld(y) x d = Vol(BE) (2.13)

fRn-d det Proi j;L

for every orthogonal matrix Q, where the coordinates of the integral are con-

veniently chosen with the origin at the center of BE. RQI is rotation matrix

rotating the vector y so that it is orthogonal to TQ, the subspace spanned by

the rows of Q.

37

Page 38: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Multiplying by tzbx; Q) and rearranging terms gives

(x; Q) x det(ProjB=Q)

fR-d Vol(TQ + RQLyVol(B

n BE)dVolfd(y)

6) (2.14)

Taking the expectation with respect to Q (where Q is the first d columns of a

Haar(SO(n)) random matrix) on both sides of the equation gives

EQ[tb(x;Q) x det(ProjBIQ)]

~EK[WXnQ fR--dVol(TQ + RQIy nVol(BE)

BE)dVol_ (y)I

Recognizing the right hand side as an expectation with respect to the kinematic

measure on TQ+RQ y conditioned to intersect BE (since the fraction on the RHS

is exactly the density of the probability of intersection for a given orientation

of Q), we have:

EQ[w(x; Q) x det(Proj5 Q)] = E X; dM (2.16)

Equation 2.8 is due to the fact that ds' 6 = r because BE hasdBtge dB

tangent space, and hence

dSxo'dBe

a constant

E] = ErO [ -(x; '6) E]

= Er,O [ X; d~ o) E] = E I(X;

9 Equation 2.11 is by the Crofton formula.

Writing Es in place of E,,O in Equation 2.3 (LHS)/ 2.12 (RHS) (we may do this

since S = S,,O is determined by r and 0), and observing that _ = g = d, We an osevig ha B, =j B - dS

38

b(X; Q) x

(2.15)

dSEdBE ,

E]. (2.17)

Page 39: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

have shown that

Es wj dM) x Vold-k(Sn B) =Vold-k(BE). (2.18)1(dM

Part II: Extension to all of M

All that remains to be done is to extend this result over all of M. To do so, we

consider the Poisson point process {x } on M, with density equal to 1 We wish

to approximate the volume-measure on M using the collection balls {BE(x )} (think

of making a papier-mich6 mold of M using the balls BE(x ) as tiny bits of paper).

Let A C M be any measurable subset of M. Since M and S have uniformly

bounded curvature forms, because of the symmetry of the balls and the symmetry

of the poisson distribution, the total volume of the balls intersected by S and A

converges a.s. to Vol(S n m n A) on any compact sumbanifold M C M:

Vol(B,(x ) n A) a.! VO(Vol(S n Be(D) x -- > Vol(S ( xn M n A), (2.19)

Vol(BE(x)) o

and similarly,

Vol(BE(xe) n A) -+ Vol(M n A). (2.20)C40{i:xiEM}

But, by assumption, w is Lipschitz in x on M (since WJ, which appears in both

the numerator and denominator of w, is Lipschitz, and the denominator is bounded

below by a > 0), so we can cut up M into a countable union of disjoint compact

submanifolds UL1 Mj such that Iw(t; ) - w(x; )I < 6 on all of x, t E M3 , and

hence, by Equation 2.19,

Vol(Bjxzi) n A) w x;dSlim Vol(S n BE(Zx) n A) x x ( E -4 0 Vol(BE(x )) ' dMj

- (x; d ) dVol(x) < 6 x Vol(S nM n A) (2.21)snasnf dyj

a. s. for every j.

39

Page 40: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Summing over all j in equation 2.21 implies that

Vol(S n BE(x,) n A) x Vol(V

snMnA w X

BE(xe) n A)l(BE(x ))

dS)dVol(

x w (xf; - )" dM

X) <6x voi(s n MnA)

almost surely. Since Equation 2.22 is true for every 6 > 0, we must have that

Vol(S nB(z) nA)Vol(BE(x,) n A)

Vol(B,,(x ))

dSdM )dVol(x).dM

Hence, taking the expectation Es on both sides of Equation 2.23, we get

ES V ol(S n B, (z) n A) x A)

-+ Es[ w (X;dS) dVol(x)]dMI

a.s. as c 4 0 (we may exchange the limit and the expectation by the dominated

convergence theorem, since I E Vol(S n BE(xz) n A) x w(x4; )I is dominated by

2 x Vol(S n M) x k) for sufficiently small e.

Since the sum on the LHS of Equation 2.24 is of nonnegative terms we may

exchange the sum and expectation, by the monotone convergence theorem:

Es [Vol(S nB(x) nA)

= ZEs Vol(S

Vol(B(xs) n A)x Vol(B(x))

n B,(xf)) x

w ; )

Vol(Be(xi) n A)

Vol(Bjxzi))dS)]

.

40

lim4O

(2.22)

dS )dM

xw X'

I

0 snMnA

w (x; (2.23)

x1w(XdS

)91

(2.24)

(2.25)

Page 41: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

But by Equation 2.18, Es[Vol(S n B,(x )) x w(x ; A)] = Vol(BE(x )), so

Vo(S nB(xE)) xVOl(BE(xj) n A)

Vol(Bejxi))

Vol(B,(x ) nA)= Vol(B(xf)) x -+ Vol(M n A)

almost surely as e 4 0 by Equation 2.20.

Combining Equations 2.24 and 2.26 gives

Es [fflm nA w X; dS dVol(x) =dMJ

Vol(M n A).

We now prove Theorem 4, a version of Theorem 3 with somewhat more general

analytical assumptions.

Theorem 4.

Suppose that ?b(x; S) is c(t)-Lipschitz on M n {x : (x; S) < t}, and that

limt-*oo

EQ [(w^ (x; SQ) - A T-(x; SQ) V t) x det(Projm Q)]EQ [I A ?-(x; SQ) V t x det(ProjM Q)]

= 0

and

lim Esb-+oo [ Snm

IIA (X) x [w(x; S)1

- - V w(x; S)t

where we define the "A" and "V " operators to be r A s := min{r, s} and r V s

max{r, s}, respectively, for all r, s E R.

Then Theorem 3 holds even for a = 0 and b = c = oo.

Proof. (Of Theorem 4)

Define

EQ [(zC(x; SQ) - a A W^ (x; SQ) V b) x det(Projm LQ)]

EQ [a A w-(x; SQ) V b x det(Projm Q)]

41

ES[Z f dS )]dMJ

(2.26)

(2.27)

A t) d Vol] = 0,

Page 42: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Let A be any Lebesgue-measurable subset. Then

1A(X) x w(x; S) dVol]

= lim Es[

= t m Esj

1A(X) x w(x;S) dVoll

IA(X) x V w(X; S)t

1ILA(X) X [W(X; S) - A

t+ lim Es[

t-+ . snm

= lim Esj

= lim Esj

= lim Esjt-+oo . snm

1A(X) x 1t

IA(X) x

V w(x; S) A t]

V w(x; S) A t dVoll +0

V (x; S) A tEQ [C(x; SQ) x det(ProjM Q)] dVoll

RA(X)

EQ R V (x;V w(x; S) A t

SQ) A t x det(Projm Q)] x (1+ k(t))

42

Es [j (2.28)

(2.29)

(2.30)A t dVoll

dVoll

(2.31)

(2.32)

(2.33)

dVolJ

Page 43: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

= lim Est-+0 [ L snm

E A v(x; S) V t 1

EQ[} A (x; SQ) V t x det(Projm Q)] dVOIJ X 1 + 0(t)

1= lim Vol(MlnA)x

t-+oo +

= Vol(M n A) x 1

= Vol(M n A).

* Equation 2.31 is true because

(2.35)

(2.36)

(2.37)

O Es 1A(x) x I A w(x;S) V tdVoll

5 EsE[ I A w(x;S) V t dVol -+0.[JsnM t t-0

* Equation 2.35 follows from Theorem 3 using .1 A ii(x; S) V t as our pre-weight.

Indeed, 1A (x; S) Vt obviously satisfies the boundedness conditions of Theorem

3. Moreover, since z(x; S) is c(t)-Lipschitz everywhere on M where zZ (x; S) < t,

the pre-weight .1A v(x; S) V t must be c(t)-Lipschitz on all x E M.

t

43

IA (X) (2.34)

Page 44: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

2.3.1 The generalized Crofton formula Gibbs sampler

We now state a Generalization of Theorem 1, that can be proved in much the same

way as Theorem 1 by applying our Generalized Crofton's formula (Theorem 3) in

place of the classical Crofton formula:

Theorem 5. Let S be an isotropic random subspace centered at the origin. Let 7r be

a function on a manifold M then

j r(q) zi(q;SfnSq)7r(q)dq = c xE[ d-||jqj n- dq]

sn I I EQ [foi (q; SQ n Sq) x det(Proj(Mqfns,)!Q)]

where c = Cdl,k,n_1,S is a constant and ||gLr|| is the sine of the angle between the

line passing through both the origin and q. (Sq is the sphere of radius ||q|| centered at

the origin.)

Proof. The proof is identical to our proof of Theorem 1, with Crofon's formula on the

sphere replaced by the Generalized Crofton formula on the sphere (Theorem 3). 0

Applying Theorem 5 gives the following improvement to Algorithm 9:

Algorithm 10 Generalized Crofton Formula Gibbs Sampler (for 7r supported onsubmanifold)

Algorithm 10 is identical to Algorithms 9 and 3, except for the following step:

3: Sample xj 1 from the (unnormalized) density

wi(X) = 7r(x) bi (x; Si n Sx)- IEQ[f(x; SQ n Sx) x det(Proj(M.fns.)-Q)

restricted to S n M (using a subroutine Markov chain or another sampling

method). (Sx is the sphere of radius jix - xi4l centered at xi.)

As discussed in Section 2.3, we would usually set (i to fvi(x, S) = IPf(2(S nM))I.

However, as discussed in Section 2.3.5 in some cases it may be advantageous to use

other functions for di.

44

Page 45: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Theorem 6. The primary Markov chains in Algorithms 10 and 3 (denoted by x 1 , x 2 , ...

in both algorithms) have identical probability transition kernels.

Proof. For convenience, in this proof we will denote the primary

Algorithms 3 and 10 by x 1, x 2 , ... and 1, ^ 2,. , respectively.

Let k(U, V) := P(i+1 E V1zi E U) and k(U, V) := P(i+1 Ethe probability transition kernel for the primary Markov chain in

10 respectively (here U, V C Rn).

Then

k(U, V) = P(zi+1 E Vjzi E U) = j Es[j

Markov chains in

Vj. E U) denote

Algorithms 3 and

dS740y/1 dSdy]dxdM

= 4 r(y) 1 _1 dydx

Theorem5 j j d(yx) i(y; S n S,)SESsnvy dM 1 EQ ['Ji (y; SQ n S,) x det(Proj(mdnsY) 0 ]dx

= P(i+ VjIj E U)

=K(U, V)]

Remark 4. The curvature form 0x(Si+1 n M n sx) of the intersected manifold can

be computed in terms of the curvature form Qx(M) of the original manifold by

applying the implicit function theorem twice in a row. Also, if M is a hyper-

surface then IPf(Qx(Si+1 n M n Sx)) is the determinant of the product of a ran-

dom Haar-measure orthogonal matrix with known deterministic matrices, and hence

EQ[IPf(Qx(Q n M n Sx))I x det(ProjM Q)] is also the expectation of a determinant

of a random matrix of this type. If the Hessian is positive-definite, then we can

obtain an analytical solution in terms of zonal polynomials. Even in the case when

the curvature form is not a positive-definite matrix (it is a matrix with entries in the

algebra of differential forms), the fact that the curvature form is the Pfaffian of a

random curvature form (in particular, a determinant of a real-valued random matrix

45

Page 46: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

in the codimension-1 case) should make it very easy to compute numerically, perhaps

by a Monte Carlo method.

This fact also means that it should be easy to bound the expectation, which allows

us to use Theorem 3 to get bounds for the volumes of algebraic manifolds (Section

2.3.8).

Remark 5. While the Chern-Gauss-Bonnet theorem only holds for even-dimensional

manifolds, if M has odd dimension we can always include a dummy variable to

increase both the dimensions n and d by 1.

2.3.2 The generalized Crofton formula Gibbs sampler for

full-dimensional densities

In many cases one might wish to sample from a full-dimensional set of nonzero prob-

ability measure. One could still reweight in this situation to achieve faster conver-

gence by decomposing the probability density into its level sets, and applying the

weights of Theorem 5 separately to each of the (infinitely many) level sets. We ex-

pect this reweighting to speed convergence in cases where the probability density is

concentrated in certain regions, since when d is large, intersecting these regions with

a random search-subspace S typically causes large variations in the integral of the

probability density over the different regions intersected by S, unless we reweight

using Theorem 5.

Algorithm 11 Generalized Crofton Formula Gibbs Sampler (for full-dimensional 7r)

Algorithm 11 is identical to Algorithms 9 and 10, except for the following step:

3: Sample xj+1 (using a subroutine Markov chain or another sampling method)

from the (unnormalized) density

wi(x) = ir(X) -Q (x S nEQ [ i (x; S n Sr)]'

where Sx is the sphere of radius |x - xi I centered at xi.

46

Page 47: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

As discussed at the beginning of Section 2.3, we would usually set ?i to tbi(x, S) =

IPf(Qx(S n Lx))I, where Cx is the level set of 7r passing through x. If we instead set

qii(x, S) = 1, we get the traditional Gibbs sampler (Algorithm 2).

Theorem 7. The primary Markov chains in Algorithms 11 and 2 (denoted by x1 , x 2, ...

in both algorithms) have identical probability transition kernels.

Proof. For convenience, in this proof we will denote the primary Markov chains in

Algorithms 2 and 11 by i1 , z2, ... and i1, 2 ,..., respectively.

Let K(U, V) := P(zi+ 1 E V~j. E U) and k(U, V) :=QPs 1 E V|%i E U) denote

the probability transition kernel for the primary Markov chain in Algorithm 2 and 11

respectively (here U, V C Rn).

Then

f(U, V) = P(i+1 E VIzi E U) = Es[j ir(y)dy]dx

=r r(Y) _,dydx//\|y - X11|-

Theorem5 7r(Y i(y; S n yES r(y) S) dy]dJu fsnv ( EQ [i(y; SQ n Sy)]

= EGsi+1 E V|Si E U)

=K(U, V),

where we set M = R" when applying Theorem 5.

0

2.3.3 An MCMC volume estimator based on the Chern-Gauss-

Bonnet theorem

In this section we briefly go over a new MCMC method (which we plan to discuss in

much greater detail in a future paper) of estimating the volume of a manifold that is

47

Page 48: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

based on the Chern-Gauss-Bonnet curvature. While this method is interesting in its

own right, we choose to introduce it here since it will serve as a good introduction

to our motivation (Section 2.3.4) for using the Chern-Gauss-Bonnet curvature as a

pre-weight for Theorem 3.

Suppose we somehow knew or had an estimate for the Euler characteristic X(M) /

0 of a closed manifold M of even-dimension m. We could then use a Markov chain

Monte Carlo algorithm to estimate the average Gauss curvature form Em[(Pf(Q))]

on M.

The Chern-Gauss-Bonnet theorem says that

IM Pf(Q)dVlm = (27r)iX(M). (2.38)

We may rewerite this as

fM Pf(Q)dVolm (27r) X(M)

fM dVolm fM dVOlm

By definition, the left hand side is Em[(Pf(Q))], and fM dVolm = Volm(M), so

(2-x) mX (M)EM[(Pf(Q))] = ,2)_ M (2.40)

Volm(M)

from which we may derive an equation for the volume in terms of the known quantities

Em[(Pf(Q))] and X(M)_(2wr) x(M)

Volm(M) = .( (M) (2.41)Em [(Pf(Q))]'

2.3.4 Motivation for reweighting with respect to Chern-Gauss-

Bonnet curvature

While Theorem 3 tells us that any pre-weight fv generates an unbiased weight w, it

does not tell us what pre-weights reduce the variance of the intersection volumes. We

argue here that the Chern-Gauss-Bonnet theorem in many cases provides us with an

ideal pre-weight if one only has access to local second-order information at a point x.

48

Page 49: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Equation 2.41 of Section 2.3.3 gives an estimate for the volume

Voldk (S n M) = (27r)dy(s n m) (2.42)EsnM[(Pf(Q(S nfM)))]'

where Q(S n M) is the curvature form of the submanifold S n M.

If we had access to all the quantities in Equation 2.42 our pre-weight would then

be 1 EsnM[P(Q(SnM)))]. However, as we shall see we cannot actually im-vold-k(snM) (27r)-7- x(SnM)

plement this pre-weight since some of these quantities represent higher-order informa-

tion. To make use of this weight to the best of our ability given only the second-order

information, we must separate the higher-order components of the weight from the

second-order components by dividing out the higher-order components.

The Euler characteristic is essentially a higher-order property, so it is not reason-

able in general to try to estimate the Euler characteristic x(S n M) using the second

derivatives of M at x because the local second-order information gives us little if any

information about X(S n M) (although it may in theory be possible to say a bit more

about the Euler characteristic if one has some prior knowledge of the manifold). The

best we can do at this point is to assume the Euler characteristic is a constant with

respect to S, or more generally, statistically independent of S.

All that remains to be done is to estimate EsnMPf(Q(S n M)). We observe that

EsnMPf(Q(S fl M))EsnMPf(Q (S n M)) = EsnM IPf(Q(S n M))I x EsnmPf(.(s n m)) (2.43)

EsE(Pf(S(s n m))

But the ratio ESnM PfsnMD is also a higher-order property since all it does is de-

scribe how much the second-order Chern-Gauss-Bonnet curvature form changes glob-

ally over the manifold, so in general we can say nothing about it using only the local

second-order information. The best we can do at this point is to assume that this

ratio is statistically independent of S as well.

49

Page 50: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Hence, we have:

1 M) = EsnMlPf(Q(S nf M))lxVoldki(S n M)44

Vd-k- Esn nf( m) )((27r)'2 'x(snA)E mfQ(S n M)), (2.44)

Esnm IPf(Q (S n M))|

where we lose nothing by dividing out the unknown quantity

(27r)m X(M) EsnsPf (snm) since we have no information about it and it is indepen-

dent of S.

We would therefore like to use Esnm Pf(Q(SnM)) as a pre-weight. Since we only

know the curvature form Q(Sn M) locally at x, our best estimate for Esnm Pf(Q(Sn

M)) is the absolute value IPf(Qx(SnM)| of the Chern-Gauss-Bonnet curvature at x.

Hence, our best local second-order choice for the pre-weight is W^ = |Pf(Q (S n M)|.

2.3.5 Higher-order Chern-Gauss-Bonnet reweightings

One may consider higher-order reweightings which attempt to guess not only the

second-order local intersection volume, but also make a better guess for both the

Euler characteristic of the intersection SQ fM and how the curvature would vary over

SQfnM. Nevertheless, higher-order approximations are probably harder to implement

for the same reason that most nonlinear solvers, such as Newton's method, do not

use higher-order derivatives. Moreover, it may not even be desirable to implement

higher-order reweightings. Indeed, if the local intersection region whose volume we

are aiming to estimate is so large that the second derivatives vary widely over this

region, then the statistic we wish to compute with our algorithm will most likely

also vary widely over this region, ensuring that different samples over this region will

contain different information about this statistic. Hence, we probably only need to

consider volume approximations that are local in a second-order sense.

50

Page 51: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

2.3.6 Collection-of-spheres example and concentration-of-measure

In this section we argue that the traditional algorithms can suffer from an exponential

slowdown (exponential in the search-subspace dimension) unless we reweight the in-

tersection volumes using Theorem 3 with the Chern-Gauss-Bonnet curvature weights.

We do so by applying two results (Theorems 8 and 9) related to the concentration-

of-measure phenomenon, to an example involving a collection of hyperspheres.

Consider a collection of very many hyperspheres in R'. We wish to sample uni-

formly from these hyperspheres. To do so, we imagine running a Markov chain with

isotropically random search-subspaces. We imagine that there are so many hyper-

spheres that a random search-subspace typically intersects exponentially many hy-

perspheres. As a first step we would use Theorem 1 which allows us to sample the

intersected hypersphere from the uniform distribution on their intersection volumes.

While using Theorem 1 should speed convergence somewhat (as discussed in Section

2.2.2), concentration-of-measure causes the intersections with the different hyper-

spheres to have very different volumes (Figure 2-6). In fact we shall see that the

variance of these volumes increases exponentially in d, causing an exponential slow-

down if only Theorem 1 is used, since the subroutine Markov chain would need to

find exponentially many subspheres before converging.

Reweighting intersection volumes using Theorem 3 causes each random intersec-

tion S n Mi (where Mi is a subsphere) to have exactly the same reweighted inter-

section volume, regardless of the location where S intersects Mi, and regardless of d.

Hence, in this example, Theorem 3 allows us to avoid the exponential slowdown in

convergence speed that would arise from the variance of the intersection volumes.

The first result deals with the variance of the intersection volumes of a sphere

in Euclidean space. It says that the variance of the intersection volume, normalized

by it's mean, increases exponentially with the dimension d (as long as d is not too

close to n). Although isotropically random search-subspaces are (conditional on the

radial direction) distributed according to the Haar measure in spherical space, the

Euclidean case is still of interest to us since it represents the limiting case when the

51

Page 52: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Figure 2-6: The random search-subspace S intersects a collection of spheres M4. Eventhough the spheres in this example all have the same n - I-volume, the d - I-volumeof the intersection of S with each individual sphere (green circles) varies greatlydepending on where S intersects the sphere if d is large. In fact, the variance of theintersection volume of each intersected sphere increases exponentially with d. This"curse of dimensionality" for the intersection volume variance leads to an exponentialslowdown if we wish to sample from sn M with a Markov chain sampler (and S n M4consists of exponentially many intersected spheres). However, if we use the Chern-Gauss-Bonnet curvature to reweight the intersection volumes, then all spheres in this

example will have exactly the same reweighed intersection volume, greatly increasingthe convergence speed of the Markov chain sampler.

hyperspheres are small, since spherical space is locally Euclidean.

T heorem 8. (Variance resulting from concentration of Euclidean kinematic measure)

Let S C R' be a random d-dimensional plane distributed according to the kinematic

measure on R". Let M4 = S" C R" be the unit sphere in R n. Defining a :=1, we

have

k (a, d)eCd x a") I < Var( VlSnA ) < K (a, d)eCdxpd) -- 1, (2.45)E [Vol (S n M4)]-

where

o(a) = log(2) + ()log(-) - ( + -)log(- + 1) - ( )log(- 1)a a 2ce 2 a 2a 2 a

(27T 2 (TI -- 1)(" d ) -2 1k (a, d) = 4 )(n - d)2 4-

e- (n - 1)(n _ d -2

K (a, d) = 0 3 ( -d)2 T" - C)( -n)-1r47r2 (d - 1)(n + d - 2)

52

Page 53: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Proof. Consider the unit sphere M = S- 1 centered at the origin. By symmetry of

the sphere, the intersection M n S of the unit sphere with a d-dimensional plane S

is entirely determined (up to a rotation) by the plane's orthogonal complement S'

that passes through the origin, and the intersection point x = S n S'. By symmetry,

we may assume S' = Rn-d is aligned with the first n - d coordinate axes. If S is

distributed according to the kinematic measure, we must have that x is distributed

uniformly on the ball Bn-d c Rn-d. Hence, IP(IlxJl 5 R) = vld(RBnd)= Rn-d,VOnd(Bn-d)

0 < R < 1, where B-d is the unit n - d ball.

The radius of SnM is just /1 - Ilxi 12, and hence Vol(SnM) = Cd x (1- lxi 2)d/2,

where Cd drd/2

Denoting by S, a d-plane whose associated x-value has | xii = r, and by Cd =

the constant in the volume formula for the d-dimensional unit ball, we have:

E[Vol(SnS"- 1)1] = )dVol(Srn-)'dP(r) = (cd(1-r2)d )txr -d-dr

(C)_ 2 d-1 __ (cd)t F(td 2 + 1)F(!2 + 1)(1 - r2) 2 x r dr = x (2.46)

(cd ___ - _d___-_) __t_-_-+ ____1)

where the last equality is by Gauss's theorem for the Gauss hypergeometric function.

In particular,

Cd p(di +1),(n-+2)E[Vol(S n S f-1)] = d x 2 2 (2.47)n - d (n - d)]F(n - 1 + 1)

and

Var[Vol(S n S"-1)] = E[Vol(S n Sn- 1) 2] - E[Vol(S n S"-1)]2

_ (Cd) 2 1(d - 1 + 1)F(n-d + 1) Cd F(!. 1 + 1)F("~a2 ) 2 (2.48)

n - d (n - d)(d - 1 + "+ 1) n - d (n - d)(zi - 1 + 1)

53

Page 54: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Combining Equations 2.47 and 2.48 gives

Var o(s n -1) ~I ]VarE[voi(s n sn-1)_

r(d - 1 +1)r(ngd + 1)

F(d - I1+ nid + 1)

E[(S n n-1)2]

E[(S n Sn-1)]2 - 1

(n - d)( - j + 1) - 1r(d + 1)]p(n -d +)

F(d - 1+ 1) (n - d) 2 F( - j+1)2 1

r(nd - 1 + 1) F(9 1 + 1) 2 (n-d + 1)

= k x (n - d)2 X(d - 1)d--le-d+1+d - -i- +1

k x (n-d)2 X - 1 x(n-JI)n+'

(n+d-2 n~d x

where the second-to-last equality is a consequence of Stirling's formula and k = k(d, n)(27r)~ _3

is some number that satisfies 4 k < e. To shorten our equations, we set A

log(k)+2log(n-d) - log(d 1) and B := log(2-j) -1 log(2+1- 2)+ log(a -1).

Hence,

log(Var + 1)

= dlog(2) + (n + ) log( )2 2 -(n + d

n- d + ) log( n -d) + A2 2 2

=dlog(2) + (n + ) log(n )2 d -(n + d

- - ( 1 ) - d2-( )log )log(2d) [(n +2

-(n + d

- ( + )log(n d ) + A2 2 d

= dlog(2) + (n + 1) log( - )2 d d

-(n d + 1)

n+ d-(2

)log(n + d2 2

2)~n d~2

(n + d

- 1)log(n + d- 2

- 1) log(

log( - 1)+ A

= d[log(2) + (n) log(n -d 2d

-)log(-2 d -d - 1)log(-2d 2 d

- 1)]+ A + B

54

x -1

- 1, (2.49)

1 n - d+ )] +A

((-1)n,-a+i)2

(( -1 d1)2(n-d)n-+),

vol(s n sn-1)

E[voi(s n sn-1)

=d log(2) + (n + 1) log(n )

+1 -2

+ -2)

Page 55: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

But, log(x) - a < log(x - a) log(x) - whenever a > 0 and x > 0. Therefore,

log(Var [ Vo1(S n -1)] + 1)>

d x [log(2) + (,)[log(,) -A]n/d

- + 1)[log(n + 1) -

2d - 1) log(-2d2 d- 1)] + A + B

= d x log(2)

- (2

= d x log(2)

1 1+ (-)[log(-)-

1log(- -

a1 1

+ (-)[log(-)]a a

- ( + )[log( + 1) -2a 2 a

1)] + A + B

1 1 )[log(-a + ) 1 log(

log( e4 )+2

1

log( - )

+ -log(2 d

1 ) - )2a 2

and, in the reverse direction,

log(Var[ Vol(S n n-1) + 1) <o E[Voi(S n S 1)] -1

= d x [log(2) + (n)[log(n)1

d - 3(7n 1 7

'd + -)[log(-T2d1) log(- - 1)] + A

2 d

= d x [log(2) + (n)[log(n)]

+ [log( e 2) + 2 log(n

(n+[log(n

- d) - Ilog(d - 1)2

- ) log( - 1)]

1 ni+ -log(- -)

2 d d

- 1 log( + 1 - 2) + 1 log(n - 1) - _ +1].n2-1I

Combining the above two inequalities (Equations 2.50 and 2.51), we get the double-

55

2d 2

2]

1e - 1)]

21~+ 1 - (2.50)

2

+ 1)- d

(2.51)

-1 )

log(n - d) - Ilog(d - )+ 1

log( n+ 1 - )2

n

+ n) -(

Page 56: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

sided inequality

dlog(2) + (n + ) log( - 1) - ( +d- -)log(n+ 1 )- (nd2 d 2 2 d 2

+ [log(k) + 2 log(n - d) - I log(d - 1)]2

< log(Var vo(i(sn -) + 1)- E[vol(s n sn-1 )]]

< d log(2) + (n + 1) log(n) - (n + d I-) log(-)2 d

n - d+ ) log ( - 1)

+ [log(k) + 2log(n - d) - I log(d - 1)].2

Substituting a e into Equation 2.52, we getn

d x [log(2) + ( I + 1) log( - 1)-a (- + ) log( + 1)

2a 2 a

+ [log( ) - log(- + 1) + 2log(d(a - 1))

[ Vol(S nl "-1)< log(Var LEVol(S n Sn-1)] +)

-)log( 3-) S 2 )log

- log(d2 - 1)]

1)]

+ [log( e)1 1

+ log( - -1) + 2 log(ad - d) - - log(d - 1)],2 a 2

where the second set of brackets contain only lower-order terms. This completes the

proof of Theorem 8. 1

56

+ I) log( - 1)

(2.52)

(2.53)

< d x [log(2) + ( 1

Page 57: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

2.3.7 Variance due to spherical-geometry kinematic measure

concentration

The next result (Theorem 9 and Figure 2-8) deals with the spherical geometry case.

As in the Euclidean case, the concentration of spherical-geometry kinematic measure

causes the variance of the intersection volume to increase exponentially with the

dimension d as well. (While we were able to derive the analytical expression for the

variance of the intersection volumes (Theorem 9), which we used to generate the plot

in Figure 2-8 showing an exponential increase in variance, we have not yet finished

deriving an inequality analogous to Theorem 8 for the spherical geometry case. We

hope to make the analogous result available soon in [43])

In this example M (Figure 2-7, dark blue) is the boundary of a spherical cap of

the unit n-sphere S" (Figure 2-7, light blue), contained in a hyperplane M a distance

h from the center of S . We want to calculate the volume of the intersection Sn M to

show the exponentially large variance which results from the concentration of measure

of these intersection volumes. S n M is a dimension d - 1 subsphere (Figure 2-7,

top left, green), which we project down to a dimension-0 sphere consisting of two

green points in the figures in order to save our precious 3 dimensions to illustrate

other features. To compute Vol(S n.M), we must find the radius r of the intersection

S n M. As a first step we will consider the el component y of the maximum point

(Figure 2-7, yellow) of S (Figure 2-7, top left, red) in the el direction, where el

is defined to be the direction orthogonal to the hyperplane M containing M. By

congruence of the two triangles in the bottom-left diagram, we know that y is just

||Pge1 |l, the length of the projection of el onto S (Figure 2-7, red), the smallest

Euclidean subspace containing S. By congruence of the two triangles in the bottom-

right diagram we have that the length a of the dotted diagonal line is a = L. DrawingY

S" from a different (3-dimensional) projection (Figure 2-7, top right) that contains

a diameter of S n M, we see by the pythagorean theorem that S n M has radius

r =VI/ -1a2

57

Page 58: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

1h

Figure 2-7: Diagrams used to obtain the volume of the intersection of a great d-sphere S c S' with the boundary of a spherical cap M c S" (diagrams arrangedcounterclockwise from top-left)

P can be generated as the submatrix consisting of the first d + 1 rows of a

random matrix Q sampled from the Haar measure on the special orthogonal group

SO(n+1). Since the distribution of Q is invariant under action by orthogonal matrices

in SO(n + 1), each row of Q corresponds to a random vector on Sn, and is therefore

distributed according to .. N+i)T where N1 , ..., N+ 1 are independent standard

normals. Hence, y = ||Pseilj is distributed according to KNi Nd~l)TH XI(Ni - x2-I-Y 2

Tj

where X~ Xd+1 and Y ~ nX-d are independent. Hence, = 1 + , where ZdIx is F-distributed with parameters (n - d, d+1), conditional on Z < (i.e.

conditional on y > h). Now we can use the fact that we know the distribution of y to

obtain the variance of the intersection volume (Theorem and Figure ). The plot

(Figure ) shows that the variance of the intersection volume grows exponentially

with the dimension d of S when d is not too large, and grows exponentially with the

codimension n - d of S when the codimension n - d is not too large.

58

Page 59: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Theorem 9. (Variance for spherical-geometry kinematic measure)

Let M be a n - 1-dimensional subsphere of Sn C Rn+1, such that the hyperplane

containing M lies at a distance h from the center of Sn. Let S be an isotropic random

d-dimensional great subsphere of Sn. Then

Vol(S n M) h7+01 _ nn-d _n--i 1Var z2 (1+ z) 2 dz

kiE[Vol(S n M)] J d + 1d+ ( 11)

_

n- -)(1- h 2 (1 + n z))d-1z- -1(1 + z)- n+1 dz (2.54)

x L(1 - 9/h nd 1 n n d + 2x1 Z2 t )

=h (I - h 2(1 + n-dZ)) d Z - ( )--d

Proof. From the discussion above (illustrated in Figure 2-7), we have

dP(Z < z) _ [F(T) n - d n-d n-d _ 1 n - d _n+fz(z) : z )(d) d +1) 2 xZ 2 (1+ 1 z) 2 (2.55)

dE(Vol(S n~ M)t x 1{Z < z})=V(S M)f ()-~ (c2~~1 2

dE vo~s n my x td< z}z vol(s n m ) (z) x fzz< +i ( i-1) (z)

= Cd- lr(Zjd-1 t X d - Z)

' f-d h fz(z)dz (2.56)

2(1 n -d 2z- - + 1 2-c x 1-h 2 (1+ z) xd-+f

+ 1Zn)-l dz

Hence,

Vol(S n M) 7(-- n-a_ n - d _+1Var zz2 (1+ z) 2 dz

E[Vol(S n M)]J[ d0+d 1(1 - -Z Z21 di)2d

n_.ghT 21 + -dZ -d d n+1(2.57)fn (1 h1 + n))d-Z- -1(1 + gz)-4 2dz(

x + )

_h7 (I - 42(1 + E-_z))d zn-d-1(1 + 2 z)" Z

59

Page 60: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

U

We can now numerically evaluate the integrals in Equation to obtain a plot

(Figure ) of Var( VEs")) for different values of d:

Variance of Vol(SnM.) normalized by its mean12510 --- --

1020

S10 15_

10

10 0-

10-50 50 100 150 200 250 300 350 400d

Figure 2-8: This log-scale plot shows the variance of Vol(S n Mj) normalized byits mean, when S is an isotropic random d-dimensional great subsphere of S', for

different values of d where n = 400. Mi is taken to be the boundary of a spherical

cap of the unit sphere S' with geodesic radius r(d) such that S has a 10% probability

of intersecting Mi. The variance increases exponentially with the dimension d of the

search-subspace (as long as d is not too close to n), leading to an exponential slow-

down in the convergence for the traditional Gibbs sampling algorithm applied to the

collection-of-spheres example of Section . Reweighting the intersection volumes

with the Chern-Gauss-Bonnet curvature using Theorem in this example (where

M = UM is a collection of equal-radius subspheres M 2 ) causes each (nonempty)

random intersection S n Mi to have exactly the same reweighted intersection volume

regardless of d, allowing us to avoid the exponential slowdown in the convergence

speed that would otherwise arise from the variance in the intersection volumes.

2.3.8 Theoretical bounds derived using Theorem and alge-

braic geometry

Generalizing on bounds for lower-dimensional algebraic manifolds based on the Crofton

formula (such as the bounds for tubular neighborhoods in [42] and [29]), it is also

possible to use Theorem to get a bound for the volume of an algebraic manifold

M of given degree s, as long as one can also use analytical arguments to bound

the second-order Chern-Gauss-Bonnet curvature reweighting factor on M for some

convenient search-subspace dimension d:

60

Page 61: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Theorem 10. Let M C R' be an algebraic manifold of degree s and codimension

1, such that EQ[IPf(Qx(SQ n M))| x det(Proj pIQ)] > b for every x E M, and the

conditions of Theorem 3 are satisfied if we set Wi(x; s) = |Pf (G(S n M))1. Then

1 1 s x (s - 1)dVol(M) < x - x vl(s) (2.58)cd,k,n,R b 2

Proof. If we have an algebraic manifold of degree s in R", by Bezout's theorem the

intersection with an arbitrary plane is also degree s. Hence (at least in the case

where M has codimension 1), we can use Risler's bound to bound the integral of the

absolute value of the Gauss curvature over S n M by a := sx(-1)d Vol(Sn) [51, 49].2

By Theorem 3,

Vol(M) = Es[VOld(S) xPf(Q(snM)) dVold k]cd,k,n,K IEQ[IPf(Qx(SQ n M))| x det(ProjpI Q)]

11 1 1< x Es[If(Qx(S n M))] x - < x a x

cd,k,n,R b- cd,k,n,R b

El

Unlike a bound derived using only the Crofton formula for point intersections,

the bound in Theorem 10 allows us to incorporate additional information about the

curvature, so we suspect that this bound will be much stronger in situations where

the curvature does not vary too much in most directions over the manifold. We hope

to investigate examples of such manifolds in the future where we suspect Theorem

10 will provide stronger bounds, but do not pursue such examples here because it is

beyond the scope of this chapter.

61

Page 62: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Part II

Numerical simulations

2.4 Random matrix application: Sampling the stochas-

tic Airy operator

Oftentimes, one would like to know the distribution of the largest eigenvalues of a ran-

dom matrix in the large-n limit, for instance when performing principal component

analysis [34]. For a large class of random matrices that includes the Gaussian or-

thogonal/unitary/symplectic ensembles, and more generally the beta-ensemble point

processes, the joint distribution of the largest eigenvalues converges in the large-

n limit, after rescaling, to the so-called hard-edge limiting distribution (the single-

largest eigenvalue's limiting distribution is the well-known Tracy-Widom distribution)

[34, 58, 20, 21]. One way to learn about these distributions is to generate samples

from certain large matrix models. One such matrix model that converges particularly

fast to the large-n limit is the tridiagonal matrix discretization of the Stochastic Airy

operator of Edelman and Sutton [58, 20],

d2 2- x + -- dW, (2.59)

where dW is the white noise process. We wish to study the distributions of eigenvalues

of the hard edge conditioned on other eigenvalue(s).

To obtain samples from these conditional distributions, we can use Algorithm 8,

which is straightforward to apply in this case since dW is already discretized as i.i.d.

Gaussians.

The stochastic Airy operator (2.59) can be discretized as the tridiagonal matrix

[20, 58]1 2

A3= A - h x diag(1, 2, ... , k) + N, (2.60)h2 h(

62

Page 63: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

-2 1

1 -2 1

1 -2 1where A = is the k x k discretized Laplacian, N

1 -2 1

1 -2

diag(K(0, 1)') is a vector of independent standard normals, and the cutoff k is cho-

sen (as in [20, 58]) to be k = 1OnA (the O(10n- ) cutoff is due to the decay of

the eigenvectors corresponding to the largest eigenvalues, which decay like the Airy

function, causing only the first O(10n-1) entries to be computationally significant).

2.4.1 Approximate sampling algorithm implementation

The discretized stochastic operator A, is a function of spherical (i.i.d.) Gaussians

h;7N (Equation 2.60). Since, conditional on their magnitude, these Gaussians are

uniformly distributed on a sphere, we can use the following modification of Algo-

rithm 8 to sample AO conditional on our eigenvalue constraints of interest after first

independently sampling their X,-distributed magnitude (Figure 2-10):

Algorithm 12 Great sphere sampler (with weights)Algorithm 12 is identical to Algorithms 4 and 8, except for the following steps:

output: x 1, x2 ,..., with associated weights w 1, w2 , ... , having (weighted)

distribution 7r

3: Sample xj+ 1 uniformly from S n M (using a subroutine Markov chain or

another sampling method). Set the weight wj+j = ir(Xi+1).

To simplify the algorithm, in our simulations we will use a deterministic nonlin-

ear solver with random starting points in place of the nonlinear solver-based MCMC

"Metropolis" subroutine of Algorithm 12 to get an approximate sampling. This is

somewhat analogous to setting both the hot and cold baths in a simulated annealing-

based (see, for instance, [28]) "Metropolis step" in a Metropolis-within-Gibbs algo-

63

Page 64: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

rithm to zero temperature, since we are setting the randomness of the Metropolis

subroutine to zero while fully retaining the randomness of the search-subspaces.

3F,

Figure 2-9: In Algorithm the random great circle (red) intersects the constraintmanifold (the blue ribbon which represents the level set {g : A = 3} in this exam-ple) at different points, generating samples (green dots). The constraint manifoldhas different (differential) thickness at different points, given by 1 . Instead of

weighting the green dots by the (differential) intersection length of the great circleand the constraint manifold at the green dot, Crofton's formula allows Algorithmto instead weight it by the local differential thickness, greatly reducing the variationin the weights (see Sections , . and ).

Remark 6. Using a deterministic solver with random starting point in place of the

more random nonlinear solver-based "Metropolis" Markov chain subroutine of Algo-

rithm introduces some bias in the samples, since the nonlinear solver probably will

not find each point in the intersection Si 1 n M n rSi with equal probability. There

is nothing preventing us from using a more random Markov chain in place of the

deterministic solver, which one would normally do. However, since we only wanted

to compare weighting schemes, we can afford to use a more deterministic solver in

order to simplify numerical implementation for the time being, as the implementation

of the "Metropolis" step would be beyond the scope of this chapter. It is important

to note that this bias is not a failure of the reweighting scheme, but rather just a

consequence of using a purely deterministic solver in place of the "Metropolis" step.

On the contrary, we will see in Sections and , that this bias is in fact much

smaller than the bias present when the traditional weighting scheme is used together

with the same deterministic solver. In the future, we plan to also perform numerical

64

Page 65: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

simulations with a random "Metropolis" step in place of the deterministic solver, as

described in Algorithm 12.

2.5 Conditioning on multiple eigenvalues

In the first simulation (Figure 2-10), we sampled the fourth-largest eigenvalue con-

ditioned on the remaining 1st- through 7th- largest eigenvalues. We begin with this

example since in this particular situation, when conditioned only on the 3rd and 5th

eigenvalues, the 4th eigenvalue is not too strongly dependent on the other eigenvalues

(the intuition for this reasoning comes from the fact that the eigenvalues behave as

a system of repelling particles with only week repulsion, so the majority of the inter-

action involves the immediate neighbors of A4 ). Hence, in this situation, we are able

to test the accuracy of the local solver approximation by comparison to brute force

rejection sampling. Of course, in a more general situation where we do not have these

relatively week conditional dependencies, rejection sampling would be prohibitively

slow (e.g., even if we allow a 10% probability interval for each of the six eigenvalues,

conditioning on all six eigenvalues gives a box that would be rejection-sampled with

probability 10-6).

Despite the fact that the integral geometry algorithm is solving for 6 different

eigenvalues simultaneously, the conditional probability density histogram obtained

using Algorithm 12 with the integral geometry weights (Figure 2-10, blue) agrees

closely with the conditional probability density histogram obtained using rejection

sampling (Figure 2-10, black). Weighting the exact same data points obtained with

Algorithm 12 with the traditional weights instead yields a probability density his-

togram (Figure 2-10, red) that is much more skewed to the right than either the

black or blue curves. This is probably because, while theoretically unbiased, the

traditional weights greatly amplify a small bias in the nonlinear solver's selection of

intersection points.

65

Page 66: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

A 4 {(A1 ,A 2'A 3 A 5'A6 'A 7)=(-2,-3.5,-4.65,-7.9,-9,-1 0.8)}

2.5 - Rejection Sampling of A41 (A3 A5)-Integral Geometry Weights

a 2- Traditional Weights0>11.5-

-00o 0.5-

-8 -7.5 -7 -6.5 -6 -5.5 -5 -4.5

Figure 2-10: In this simulation we used Algorithm together with both the tradi-tional weights (red) and the integral geometry weights (blue) to plot the histogram

of A 4 (A1,A2, A3 , A5, A6 , A7 ) = (-2, -3.5, -4.65, -7.9, -9, -10.8) . We also provideda histogram obtained using rejection sampling of the approximated conditioning

A4 1(A 3,A 5 ) - [-4.65 0.001] x [-7.9 0.001] (black) for comparison (conditioningon all six eigenvalues would have caused rejection sampling to be much too slow).

Since we used a deterministic solver in place of the Metropolis subroutine in Algo-

rithm 2, some bias is expected for both reweighting schemes. Despite this, we seethat the integral geometry histogram agrees closely with the approximated rejec-

tion sampling histogram, but the traditional weights lead to an extremely skewed

histogram. This is probably because, while theoretically unbiased, the traditional

weights greatly amplify a small bias in the nonlinear solver's selection of intersection

points. The skewness is especially large (in comparison to Figure ) because we

are conditioning on 6 eigenvalues simultaneously.

2.6 Conditioning on a single-eigenvalue rare event

In this set of simulations (Figure ), we sampled the second-largest eigenvalue

conditioned on the largest eigenvalue being equal to -2, 0, 2, and 5. Since A = 5

is a very rare event, we do not have any reasonable chance of finding a point in

the intersection of the codimension 1 constraint manifold A = {A = 5} with the

search-subspace unless we use a search-subspace of dimension d >> 1. Indeed, the

analytical solution for A, tells us that P(Al > 2) = 1 x 10-4, P(Al > 4) = 5 x 10 8 and

P(Al > 5) < 8 x 10-10 [,), -17]. For this same reason, rejection sampling for A = 2

is very slow (58 sec./sanmple vs. 0.25 sec./sample for ) and we cannot hope to

perform rejection sampling for A = 5 (It would have taken about 84 days to get a

single sample!). To allow us to make a histogram in a reasonable amount of time, we

66

Page 67: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

will use 12 with search-subspaces of dimension d = 23 >> 1, vastly increasing the

probability of the random search-subspace intersecting M.

In (Figure 2-11, top), we see that while the rejection sampling (black) and integral

geometry weight (blue) histograms of the density of A2 1 A, = 2 are fairly close to each

other, the plot obtained with the exact same data as the blue plot but weighted in

the traditional way (red) is much more skewed to the right and less smooth than both

the black and blue curves, implying that using the integral geometry weights from

Theorem 1 greatly reduces bias and increases the convergence speed (Although the

red curve is not as skewed as in Figure 2-11 of Section 2.5. This is probably because

in this situation the codimension of M is 1, while in Section 2.5 the codimension was

6.)

In (Figure 2-11, middle), where we conditioned instead on A1 = 5, we see that

solving from a random starting point but not restricting oneself to a random search-

subspace (purple plot) causes huge errors in the histogram of A 2 IA1. We also see that,

as in the case of A1 = 2, the plot of A 2 obtained with the traditional weights is much

more skewed to the right and less smooth than the plot obtained using the integral

geometry weights.

In (Figure 2-11, bottom), we apply our Algorithm 12 to study the behavior of

A2 IA1 for values of A1 at which it would be difficult to obtain accurate curves with

traditional weights or rejection sampling. We see that as we move A 1 to the right,

the variance of A 2IA1 increases and the mean shifts to the right. One explanation

for this is that the largest and third-largest eigenvalues normally repel the second-

largest eigenvalue, squeezing it between the largest and third-largest eigenvalues,

which reduces the variance of A 2 IA1. Hence, moving the largest eigenvalue to the right

effectively "decompresses" the probability density of the second-largest eigenvalue,

increasing it's variance. Moving the largest eigenvalue to the right also allows the

second-largest eigenvalue's mean to move to the right by reducing the repulsion from

the right caused by the largest eigenvalue.

Remark 7. As discussed in Remark 6 of Section 2.4.1, if we wanted to get a perfectly

accurate plot, we would still need to use a randomized solver, such as a subroutine

67

Page 68: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

0.5 A 1 =2 .-.A1 =2(Integral geometry weights) >1 A, 2(traditional weights)

0.4 - .... _2(rejection sampling)a)

C 0.3-

TD .2 --

00.10

-26~~~~~ -5 - 3A2-

1 -7- A21 1,=0 -Integral Geometry Weights_ Unconstrained Solver

Traditional Weights

- -"

-6 -5 -4 -3 -2 -1 0 1 2 3 4"2

A 21A1={-2,O,2,5}, Integral geometry weights

-6 -5 -4 -3 2 -2

A =0

.A =5

=2

-1 0 1

Figure 2-11: Histograms of A2 =A -2, 0, 2, and 5, generated using Algorithm .A search-subspace of dimension d 23 was used, allowing us to sample the rareevent A = 5. In the first plot (top) we see that the rejection sampling histogram ofA2A = 2 is much closer to the histogram obtained using the integral geometry weights(blue) than the histogram obtained with the traditional weights (red) because the redplot is much more skewed to the right and less smooth (it takes longer to converge)than either the blue or black plots. If we do not constrain the solver to a randomsearch-subspace, the histogram we get for A21 = 5 (purple) is very skewed to theright (middle plot), implying that using a random search-subspace (as a opposed tojust a random starting point) greatly helps in the mixing of our samples towards thecorrect distribution. As an application of our algorithm, in the last plot (bottom), theprobability densities of A2 JA obtained with the integral geometry weights show thatmoving the largest eigenvalue to the right has the effect of increasing the variance ofthe probability density of A2LJA and moving its mean to the right, probably becausethe second eigenvalue feels less repulsion from the largest eigenvalue as A, -- oc.

68

0.4U)

00.3

0.2

-00.10

0~

0.6 -

0.4 -

M 0.2 -00

CL-

SEEM_

Page 69: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Markov Chain, to randomize over the intersection points. Since d = 23, the volumes of

the exponentially many connected submanifolds in the intersection S i1 n M would

be concentrated in just a few of these submanifolds, with the concentration being

exponential in d, causing the algorithm to be prohibitively slow for d = 23 unless

we use Algorithm 11, which uses the Chern-Gauss-Bonnet curvature reweighting of

Theorem 3 (see Section 2.3.6). Hence, if we were to implement the randomized solver

of Algorithm 12, the red curve would converge extremely slowly unless we reweighted

according to Theorem 3 (in addition to Theorem 1). Hence, the situation for the

traditional weights is in fact much worse in comparison to the integral geometry

weights of Theorems 1 and 3 than even (Figure 2-11, middle) would suggest.

Acknowledgements

We gratefully acknowledge support from NSF DMS-1312831. Oren Mangoubi was

supported by the Department of Defense (DoD) through the National Defense Science

& Engineering Graduate Fellowship (NDSEG) Program.

69

Page 70: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

70

Page 71: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

Chapter 3

Mixing Times of Hamiltonian

Monte Carlo

3.1 Introduction

Hamiltonian Monte Carlo (also called Hybrid Monte Carlo or HMC) algorithms are

some the most widely-used [26, 14, 44] MCMC algorithms. In this chapter we derive

lower bounds for the mixing times of a large class of Hamiltonian Monte Carlo al-

gorithms sampling from an arbitrary probability density 7r, including the traditional

Isotropic-Momentum HMC algorithm [19], Riemannian Manifold HMC [27, 25] and

the No-U-Turn Sampler [32], the workhorse of the popular Bayesian software package

Stan [10] (Section 3.2). We do so by applying the continuity equation, a generaliza-

tion of the divergence theorem used extensively in fluid mechanics. For comparison,

we also prove lower bounds for the mixing times of the Random Walk Metropolis

MCMC algorithm (Section 3.4).

Since true mixing times, in the narrow sense of the word, usually do not exist for

continuous state-space Markov Chains, we use the term "mixing time" here in the

broader sense to refer to the relaxation time tre := 1, defined as the inverse of the

spectral gap p; the term "mixing times" is oftentimes used more loosely to include a

variety of measures of convergence times, including relaxation times [40]. Denoting

the Markov transition Kernel by P(., .), the spectral gap p is the smallest number

71

Page 72: Integral Geometry, Hamiltonian Dynamics, and Markov Chain ...

such that

l|P(., .)I|L2 (7r) PI|I|PL2 (7r)

for any signed measure p [52]. If P has a second-largest eigenvalue A2 (for example,

if P is a matrix of a finite state space Markov Chain), then p = 1 - A2. Geometric

ergodicity of HMC algorithms was proved under very general conditions in [41, 9],

implying existence of a non-zero spectral gap under those conditions [52].

Cheeger's inequality [11, 37, 55] provides bounds for the spectral gap in terms of

the bottleneck ratio 4D(S) of a subset S of the state space, a quantity proportional to

the probability that the Markov chain at stationary distribution transitions between

S and Sc:

*<5 p < 2(* (3.1)2

where 4* := minsCRA(S). In Section 3.2, we derive bounds for the spectral gap by

using the symplectic volume-conservation properties of the Hamiltonian in the phase

space to obtain an equation for the bottleneck ratio [40] and then applying Cheeger's

inequality to bound p.

3.2 Hamiltonian Monte Carlo mixing times

In this section we derive equations for the bottleneck ratio. First, we define the

following terms:

" Nas and Ng,) are the number of times a random trajectory intersects aS and

an E-ball of (q, p), respectively.

* Pq is the component of p in the direction orthogonal to oS at q.

" Ps+(q) is the half-space of momentum vectors pointing away from S at q E aS.

Definition 3. (of $\mathbb{Q}$ and $\mathbb{Q}_{(q,p)}$)

Let $\mathbb{P}$ be the probability measure on the random trajectories $\gamma$ at stationary distribution conditioned to intersect $\partial S$ at least once. Define $\mathbb{Q}$ to be the probability measure whose density is proportional to $d\mathbb{P}(\gamma)\cdot N_{\partial S}(\gamma)$.


Similarly, let $\mathbb{P}_{(q,p)}$ be the probability measure on the random trajectories $\gamma$ at stationary distribution conditioned to intersect $B_\epsilon(q,p)\cap(\partial S\times\mathbb{R}^d)$ at least once. Define $\mathbb{Q}_{(q,p)}$ to be the probability measure whose density is proportional to $d\mathbb{P}_{(q,p)}(\gamma)\cdot N_{B_\epsilon(q,p)}(\gamma)$.

Theorem 11. (Isotropic-Momentum and Riemannian Manifold HMC)

Let $\pi(q)$ be any probability density on $\mathbb{R}^d$. Let $S \subset \mathbb{R}^d$ be any subset of the position space. Then the bottleneck ratio for an HMC Markov chain with fixed trajectory evolution time $T$ and any smooth phase-space stationary distribution $\pi(q,p)$ satisfies

$$\Phi(S) = \Phi^+\cdot\mathbb{E}_{\mathbb{Q}}\!\left[\frac{1}{N_{\partial S}}\,\mathbb{1}\{N_{\partial S}\ \mathrm{odd}\}\right]\Big/\ \pi(S), \tag{3.2}$$

where the total positive flux $\Phi^+$ is

$$\Phi^+ = T\int_{\partial S}\int_{P_{S^+}(q)}\left|\frac{\partial \log \pi(q,p)}{\partial p_q}\right|\,\pi(q,p)\,dp\,dq.$$

In the case of Isotropic-Momentum HMC, $\Phi^+$ reduces to

$$\Phi^+ = \frac{T}{\sqrt{2\pi}}\int_{\partial S}\pi(q)\,dq.$$

The term $\mathbb{E}_{\mathbb{Q}}\big[\frac{1}{N_{\partial S}}\mathbb{1}\{N_{\partial S}\ \mathrm{odd}\}\big]$ can be interpreted as the average periodicity of the Hamiltonian trajectories comprising each iteration of the algorithm. Observing that $\mathbb{E}_{\mathbb{Q}}\big[\frac{1}{N_{\partial S}}\mathbb{1}\{N_{\partial S}\ \mathrm{odd}\}\big] \leq 1$ gives an upper bound for the bottleneck ratio. Numerical simulations for various two-mode densities that approximate Gaussian mixture models (Figure 3-1) suggest that this bound is nearly tight in many cases where $T$ is not too large. To extend Theorem 11 to algorithms with non-fixed trajectory time, such as No-U-Turn HMC, we express $T$ as a function $T(q,p)$:

Theorem 12. (General HMC, including No-U-Turn HMC)

Let $S \subset \mathbb{R}^d$ be any subset of the position space. Then the bottleneck ratio $\Phi(S)$ of


an HMC Markov chain with trajectory evolution time T(q, p) is

$$\Phi(S) = \int_{\partial S}\int_{P_{S^+}(q)}\pi(q,p)\cdot\left|\frac{\partial\log\pi(q,p)}{\partial p_q}\right|\cdot T(q,p)\cdot\lim_{\epsilon\downarrow 0}\mathbb{E}_{\mathbb{Q}_{(q,p)}}\!\left[\frac{1}{N_{\partial S}}\,\mathbb{1}\{N_{\partial S}\ \mathrm{odd}\}\right]dp\,dq\ \Big/\ \pi(S). \tag{3.3}$$

The proofs of Theorems 11 and 12 make use of the continuity equation, a generalization of the divergence theorem that says that the change in probability measure of a subset $S$ of the position space at any given time is equal to the flux of the probability measure flowing through $\partial S$ [50, 33]. Also essential is the fact that the symplectic phase-space volume-preserving properties of Hamiltonian mechanics imply that the stationary distribution of HMC is equal to the invariant measure of Hamilton's equations at every point in time of the trajectory's evolution.

Proof. (of Theorem 11)

Define $\Phi^+$ to be the total flux flowing into (but not out of) $S^c$ during one step of the algorithm (or equivalently, by reversibility, the flux into $S$).

First, note that by reversibility, $\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S) = \mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S^c)$. Hence, $\mathbb{P}(N_{\partial S}=n) = \mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S) + \mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S^c) = 2\,\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S)$, i.e.,

$$\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S) = \mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S^c) = \tfrac{1}{2}\,\mathbb{P}(N_{\partial S}=n). \tag{3.4}$$

Define the measure $\hat{\mathbb{Q}}$ by $d\hat{\mathbb{Q}}(\gamma) := N_{\partial S}(\gamma)\cdot d\mathbb{P}(\gamma)$ at every trajectory $\gamma$.

By the continuity equation,

$$\Phi^+ = \sum_{n=1,3,\ldots}\frac{n+1}{2}\,\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S) + \sum_{n=1,3,\ldots}\frac{n-1}{2}\,\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S^c)$$
$$\qquad + \sum_{n=2,4,\ldots}\frac{n}{2}\,\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S) + \sum_{n=2,4,\ldots}\frac{n}{2}\,\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S^c)$$
$$= \sum_{n=1,2,3,\ldots}\frac{n}{2}\,\mathbb{P}(N_{\partial S}=n).$$

Hence, $\sum_{n=1,2,3,\ldots}\frac{n}{2\Phi^+}\,\mathbb{P}(N_{\partial S}=n) = 1$, so the measure $\frac{1}{2\Phi^+}\hat{\mathbb{Q}}$ is a probability measure, and hence $\mathbb{Q} = \frac{1}{2\Phi^+}\hat{\mathbb{Q}}$.

Therefore,

$$\mathbb{P}\big(\gamma(T)\in S^c,\ \gamma(0)\in S\big) = \sum_{n=1,3,\ldots}\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S) = \frac{1}{2}\sum_{n=1,3,\ldots}\mathbb{P}(N_{\partial S}=n)$$
$$= \sum_{n=1,3,\ldots}\frac{1}{n}\cdot\frac{n}{2}\,\mathbb{P}(N_{\partial S}=n) = \Phi^+\sum_{n=1,3,\ldots}\frac{1}{n}\,\mathbb{Q}(N_{\partial S}=n) = \Phi^+\cdot\mathbb{E}_{\mathbb{Q}}\!\left[\frac{1}{N_{\partial S}}\,\mathbb{1}\{N_{\partial S}\ \mathrm{odd}\}\right].$$

All that remains to be done is to compute $\Phi^+$. Towards this end, let $v^+(q,p)$

be the velocity of the Hamiltonian trajectory if it is flowing away from S, and zero

otherwise.

At any time $t$ the time derivative of the total flux into $S^c$ is:

$$\frac{d\Phi^+}{dt} = \int_{\partial S}\int_{\mathbb{R}^d}\pi(q,p)\cdot v^+(q,p)^T\eta(q)\,dp\,dq, \tag{3.5}$$

where $\eta(q)$ denotes the unit normal to $\partial S$ at $q$,

and hence

$$\Phi^+ = \int_0^T\int_{\partial S}\int_{\mathbb{R}^d}\pi(q,p)\cdot v^+(q,p)^T\eta(q)\,dp\,dq\,dt = T\int_{\partial S}\int_{\mathbb{R}^d}\pi(q,p)\cdot v^+(q,p)^T\eta(q)\,dp\,dq.$$

Applying Hamilton's equations gives

$$v^+(q,p)^T\eta(q) = \left|\frac{\partial\log\pi(q,p)}{\partial p_q}\right|\cdot\mathbb{1}_{P_{S^+}(q)}(p), \tag{3.6}$$


so

$$\Phi^+ = T\int_{\partial S}\int_{P_{S^+}(q)}\pi(q,p)\cdot\left|\frac{\partial\log\pi(q,p)}{\partial p_q}\right|\,dp\,dq. \tag{3.7}$$

In the case of Isotropic-Momentum HMC, Equation 3.7 simplifies to

$$\Phi^+ = T\int_{\partial S}\int_{P_{S^+}(q)}\pi(q)\,f_{\mathcal{N}(0,1)^d}(p)\,|p_q|\,dp\,dq = T\int_{\partial S}\pi(q)\,dq\cdot\int_0^\infty y\,\frac{e^{-y^2/2}}{\sqrt{2\pi}}\,dy = \frac{T}{\sqrt{2\pi}}\int_{\partial S}\pi(q)\,dq. \tag{3.8} \qquad\Box$$
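As a sanity check on the constant $\frac{1}{\sqrt{2\pi}}$ in Equation 3.8, the sketch below estimates by Monte Carlo the expected outgoing normal momentum $\mathbb{E}[\max(p_q, 0)]$ for $p \sim \mathcal{N}(0, I_d)$, which is the flux per unit time and per unit of boundary density for Isotropic-Momentum HMC; the dimension and sample size are arbitrary illustrative choices that we add here.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 25                     # arbitrary dimension for this check
n = 200_000                # number of Monte Carlo samples

# Fix a unit normal direction eta at a hypothetical boundary point of dS;
# by rotational invariance of N(0, I_d) the particular direction is irrelevant.
eta = np.zeros(d)
eta[0] = 1.0

p = rng.standard_normal((n, d))      # momenta at stationarity
p_q = p @ eta                        # normal component of the momentum
flux_per_unit_density = np.mean(np.maximum(p_q, 0.0))

print("Monte Carlo estimate :", flux_per_unit_density)
print("1 / sqrt(2*pi)       :", 1.0 / np.sqrt(2.0 * np.pi))
```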

Proof. (of Theorem 12)

Define $\Phi^+_n$ to be the total flux going into $S^c$ during one step of the algorithm (or equivalently, by reversibility, the flux into $S$) due to only those trajectories for which $N_{\partial S} = n$. Define $\Phi^+_{(q,p),\epsilon}$ to be the total flux flowing into $S^c$ during one step of the algorithm through $B_\epsilon(q,p)\cap(\partial S\times\mathbb{R}^d)$, and let $\Phi^+_{(q,p)} := \lim_{\epsilon\downarrow 0}\frac{\Phi^+_{(q,p),\epsilon}}{\mathrm{Vol}(B_\epsilon(q,p))}$.

$$\mathbb{P}\big(\gamma(T)\in S^c,\ \gamma(0)\in S\big) = \sum_{n=1,3,\ldots}\mathbb{P}\big(N_{\partial S}=n,\ \gamma(0)\in S\big) = \sum_{n=1,3,\ldots}\frac{1}{n}\,\Phi^+_n \qquad\text{(by the continuity equation)}$$

$$= \sum_{n=1,3,\ldots}\frac{1}{n}\int_{\partial S}\int_{\mathbb{R}^d}\Phi^+_{(q,p)}\,\lim_{\epsilon\downarrow 0}\mathbb{Q}_{(q,p)}\big(N_{\partial S}=n\big)\,dp\,dq \qquad\text{(by the law of total probability)}$$

$$= \int_{\partial S}\int_{\mathbb{R}^d}\Phi^+_{(q,p)}\,\lim_{\epsilon\downarrow 0}\sum_{n=1,3,\ldots}\frac{1}{n}\,\mathbb{Q}_{(q,p)}\big(N_{\partial S}=n\big)\,dp\,dq \qquad\text{(by the Monotone Convergence Theorem)}$$

$$= \int_{\partial S}\int_{\mathbb{R}^d}\Phi^+_{(q,p)}\,\mathbb{E}_{\mathbb{Q}_{(q,p)}}\!\left[\frac{1}{N_{\partial S}}\,\mathbb{1}\{N_{\partial S}\ \mathrm{odd}\}\right]dp\,dq.$$

All that remains to be done is to compute $\Phi^+_{(q,p)}$. Towards this end, let $v^+(q,p)$

be the velocity of the Hamiltonian trajectory if it is flowing away from S, and zero

otherwise.

At any time $t$ the time derivative of the total flux into $S^c$ is:

$$\frac{d\Phi^+_{(q,p)}}{dt} = \pi(q,p)\cdot\mathbb{1}\{t \leq T(q,p)\}\cdot v^+(q,p)^T\eta(q), \tag{3.9}$$

and hence

$$\Phi^+_{(q,p)} = \int_0^\infty\pi(q,p)\cdot\mathbb{1}\{t\leq T(q,p)\}\cdot v^+(q,p)^T\eta(q)\,dt = T(q,p)\cdot\pi(q,p)\cdot v^+(q,p)^T\eta(q).$$

Applying Hamilton's equations gives

$$v^+(q,p)^T\eta(q) = \left|\frac{\partial\log\pi(q,p)}{\partial p_q}\right|\cdot\mathbb{1}_{P_{S^+}(q)}(p), \tag{3.10}$$

so

$$\Phi^+_{(q,p)} = T(q,p)\cdot\pi(q,p)\cdot\left|\frac{\partial\log\pi(q,p)}{\partial p_q}\right|\cdot\mathbb{1}_{P_{S^+}(q)}(p). \tag{3.11} \qquad\Box$$

One consequence of this bound is that the time it takes energy-conserving Hamil-

tonian Markov chains to search for far-apart sub-Gaussian modes grows exponentially

with both the dimension and the distance between modes, resolving an open question

posed by Prof. Neil Shephard at Harvard.


Equation 3.3 also suggests that an optimal HMC algorithm should minimize a

particular function of the periodicity of each Hamiltonian trajectory. In the future,

we plan to investigate to what extent the No-U-Turn algorithm, the default algorithm

in the widely-used Stan software [10], approaches this optimality, and whether one

can design a better algorithm closer to optimality.

In many applications, the following bound for the bottleneck ratio of HMC may

also be helpful:

Theorem 13. Let $S \subset \mathbb{R}^d$ be any subset of the state space. Then the bottleneck ratio $\Phi(S)$ of an HMC Markov chain (including Isotropic-Momentum, Riemannian Manifold and No-U-Turn HMC) with any stationary distribution $\pi$ satisfies

$$\Phi(S) \leq \int_S\big[1 - F_{\chi^2_d}\big(E - U(q)\big)\big]\,\pi(q)\,dq, \tag{3.12}$$

where $U(q) := -\log(\pi(q))$, $E := \min_{q\in\partial S} U(q)$, and $F_{\chi^2_d}$ is the CDF of the $\chi^2_d$ random variable.

Proof. No trajectory starting at $q_0 \in S$ can exit $S$ if it does not start with energy at least $E := \min_{q\in\partial S} U(q)$. Hence, the probability of exiting $S$ starting at the stationary distribution $\pi$ must be at most $\mathbb{P}_\pi\big(\{H(q_0,p_0) \geq E\}\cap\{q_0\in S\}\big) = \int_S\big(1 - F_{\chi^2_d}(E - U(q))\big)\,\pi(q)\,dq$. $\Box$
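To make the bound (3.12) concrete, the sketch below evaluates its right-hand side for a hypothetical one-dimensional two-mode density and the cut $S = (-\infty, 0]$; the density, the grid, and the use of scipy's $\chi^2_d$ CDF for $F_{\chi^2_d}$ are our own illustrative choices, not taken from the thesis.

```python
import numpy as np
from scipy.stats import chi2, norm

d = 1                                  # dimension of the position space
a = 3.0                                # half-distance between the two modes

# Hypothetical two-mode density: an equal mixture of N(-a, 1) and N(+a, 1)
def pi_q(q):
    return 0.5 * norm.pdf(q, loc=-a) + 0.5 * norm.pdf(q, loc=+a)

def U(q):                              # potential U(q) = -log(pi(q))
    return -np.log(pi_q(q))

# Take S = (-infty, 0], so dS = {0} and E = min_{q in dS} U(q) = U(0)
E = U(0.0)

# Right-hand side of (3.12): integral over S of [1 - F_{chi^2_d}(E - U(q))] pi(q) dq
q = np.linspace(-12.0, 0.0, 20001)
integrand = (1.0 - chi2.cdf(E - U(q), df=d)) * pi_q(q)
bound = np.sum(integrand) * (q[1] - q[0])    # simple Riemann sum

print("energy barrier E                        :", E)
print("Theorem 13 bound on the bottleneck ratio:", bound)
```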

Finally, we provide a simulation (Figure 3-1) of isotropic HMC sampling of a two-

mode density. The results of the simulation agree closely with Theorems 11 and 12,

and illustrate the various components of Equations 3.3 and 3.2.

3.3 Cheeger bounds in extreme cases

If the mean step size $T = \epsilon$ is small, the spectral gap is close to the lower bound $\frac{\Phi_*^2}{2}$ in Equation 3.1. Why is this true? First of all, as $\epsilon \downarrow 0$ the expectation in the integrand of Theorem 12 approaches 1 (i.e., there are no "U-turns" as $\epsilon \downarrow 0$, since all paths are nearly


Figure 3-1: In this simulation we computed the spectral gap for the Isotropic-Momentum HMC algorithm with the stationary distribution $\pi(q) \propto \max\big(f_{\mathcal{N}(0,1)}(q-a),\, f_{\mathcal{N}(0,1)}(q+a)\big)$, for different distances $2a$ between the two modes and different Hamiltonian trajectory times $T$. As predicted in Theorem 11, the spectral gap is bounded above by a linear function of $T$, and in fact increases linearly with $T$ when $a = 0$ for small $T$. The approximate periodicity in $T$ is due to the fact that the trajectories here have a nearly common period, meaning that the conditional expectation term in Equation 3.2 varies (approximately) periodically with $T$. The exponential decay in $a^2$ is explained by the fact that $\int_{\partial S}\pi(q)\,dq \propto f_{\mathcal{N}(0,1)}(a)$ (if we choose $\partial S = \{0\}$ to be the halfway point between the two modes), so the corresponding term in Equation 3.2 decreases exponentially in $a^2$. Note that $\pi(q)$ approximates the Gaussian mixture model $\phi(q) = \frac{1}{2}f_{\mathcal{N}(0,1)}(q+a) + \frac{1}{2}f_{\mathcal{N}(0,1)}(q-a)$; indeed the difference tends to $0$ as $a \to \infty$. The plots were generated by numerically diagonalizing an analytical solution for the transition matrix of the HMC Markov chain.


straight and $\partial S$ is nearly straight on the scale of $\epsilon$). Hence, ignoring constant factors, by Theorem 12, $\Phi_* \sim \epsilon$ for small enough $\epsilon$.

If we multiply the integration time $T = \epsilon$ by $n$, the algorithm (which approximates a diffusion for small $T = \epsilon$) travels on average $n$ times the distance in one step. However, if we instead take $n$ independent steps of size $\epsilon$ in succession, the mean distance traveled is only $\sqrt{n}\,\epsilon$, so we need to make $n^2$ steps of size $\epsilon$ to achieve the same average displacement as a single step of size $n\epsilon$. Indeed, if $\Phi_* = \epsilon$ then the bound is $\Phi_*^2 = \epsilon^2$, but if $\Phi_* = n\epsilon$ then the bound is $\Phi_*^2 = n^2\epsilon^2$. So, to get a spectral gap of roughly $n^2\epsilon^2$ using steps of size $\epsilon$, we would need to apply the transition matrix $n^2$ times (i.e., take $n^2$ steps): $\lambda_2^{n^2} \approx (1 - \epsilon^2)^{n^2} = 1 - n^2\epsilon^2 + \text{l.o.t.} \approx 1 - n^2\epsilon^2$.
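A quick numerical check of this heuristic, with arbitrary illustrative values of $\epsilon$ and $n$ (our own choices):

```python
import numpy as np

eps, n = 1e-3, 50                        # arbitrary small step size and multiplier

lam2 = 1.0 - eps**2                      # second eigenvalue of the small-step chain
gap_after_n2_steps = 1.0 - lam2**(n**2)  # spectral gap of the n^2-step chain

print("n^2 * eps^2          :", n**2 * eps**2)
print("1 - (1 - eps^2)^(n^2):", gap_after_n2_steps)
```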

On the other extreme, if there are two sets $S_*$ and $S_*^c$ for which $\Phi(S_*)$ is much smaller than the inverse of the relaxation time of the chain restricted to either $S_*$ or $S_*^c$ by itself, then we would expect the spectral gap for the entire space ($S_* \cup S_*^c$) to be close to $\Phi(S_*)$. Indeed, this behavior has been proved in the case of a lazy random walk on two discrete tori glued together at a single vertex [40]. We conjecture that this behavior will occur for HMC as well, for instance, in the case of a two-mode density with a deep valley, with one mode in the region $S_*$ and the other mode in $S_*^c$ (assuming $T$ is not too small).

3.4 Random Walk Metropolis mixing times

In this section we show two bounds for the RWM algorithm. Together, the bounds

suggest that the relaxation time of RWM grows exponentially with both the dimension

$d$ and the distance between modes, even if one uses a mean step size $\epsilon$ that is optimal.

Theorem 14 shows that the relaxation time grows with both the dimension d and

with the distance between modes for small $\epsilon$. Theorem 15 shows that the relaxation

time grows with the dimension $d$ for all other (i.e., non-small) $\epsilon$.
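For reference, here is a minimal sketch of the Random Walk Metropolis chain analyzed in this section, with $\epsilon$ playing the role of the per-coordinate step size; the target density, step size, starting point, and all other numerical values are our own illustrative choices.

```python
import numpy as np

def rwm_chain(log_pi, q0, eps, n_steps, rng):
    """Random Walk Metropolis with proposal q' = q + eps * N(0, I_d)."""
    q = np.array(q0, dtype=float)
    d = q.size
    chain = np.empty((n_steps, d))
    for i in range(n_steps):
        prop = q + eps * rng.standard_normal(d)
        # Metropolis acceptance probability min{1, pi(prop) / pi(q)}
        if np.log(rng.uniform()) < log_pi(prop) - log_pi(q):
            q = prop
        chain[i] = q
    return chain

# Illustrative target: a two-mode Gaussian mixture in d dimensions
rng = np.random.default_rng(1)
d, a = 5, 3.0
mu = np.zeros(d)
mu[0] = a

def log_pi(q):
    # log of an (unnormalized) equal mixture of N(+mu, I) and N(-mu, I)
    return np.logaddexp(-0.5 * np.sum((q - mu) ** 2),
                        -0.5 * np.sum((q + mu) ** 2))

chain = rwm_chain(log_pi, q0=mu, eps=0.5, n_steps=20_000, rng=rng)
# With well-separated modes the chain rarely crosses between them:
print("fraction of samples with q_1 > 0:", np.mean(chain[:, 0] > 0))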

Theorem 14. Consider a RWM Markov chain $q_1, q_2, \ldots$ with proposal distribution $q_{i+1} - q_i \sim \mathcal{N}(0, \epsilon)^d$ for every $i \in \mathbb{N}$. Let $S$ be a subset of the state space. Then


$$\Phi(S) \leq 2\int_0^\infty \pi(\partial S + B_r)\,f_{\chi_d}(r/\epsilon)\,\frac{dr}{\epsilon}\ \Big/\ \pi(S), \tag{3.13}$$

where $\partial S + B_r := \{x + y : x\in\partial S,\ y\in B_r\}$ is the Minkowski sum, and $B_r$ is an $r$-ball centered at the origin.

In particular, for every $r > 0$, we have

$$\Phi(S) \leq \big[2\,\pi(\partial S + B_r)\cdot F_{\chi_d}(r/\epsilon) + \big(1 - F_{\chi_d}(r/\epsilon)\big)\big]\ \Big/\ \pi(S). \tag{3.14}$$

Proof. Let $q_1, q_2, \ldots$ be a RWM Markov chain. We wish to bound $\Phi(S) = \frac{1}{\pi(S)}\,\mathbb{P}_\pi(q_{i+1}\in S^c,\ q_i\in S)$.

Suppose that the step size is $\|q_i - q_{i+1}\| \leq r$. Then whenever both $q_{i+1}\in S^c$ and $q_i\in S$, it must be true that either $q_i\in\partial S + B_r$ or $q_{i+1}\in\partial S + B_r$. Hence,

$$\mathbb{P}_\pi\big(q_{i+1}\in S^c,\ q_i\in S\ \big|\ \|q_{i+1}-q_i\|\leq r\big) \leq \mathbb{P}_\pi(q_i\in\partial S+B_r) + \mathbb{P}_\pi(q_{i+1}\in\partial S+B_r) = 2\,\pi(\partial S+B_r). \tag{3.15}$$

Equations 3.13 and 3.14 now follow directly from the law of total probability.

Theorem 15. Let $\pi(q) = \sum_{k=1}^{k_{\max}} c_k\,\pi_k(q)$, $\sum_k c_k = 1$, be a Gaussian mixture model, where $\pi_k$ has covariance matrix $\Sigma_k$ and mean $a_k$. Let the RWM proposal have covariance matrix $\Delta\Sigma$.

Then, for every $\eta, \delta > 0$,

$$P_{\mathrm{accept}} \;\leq\; \frac{\delta}{\eta} \;+\; \min_k\sum_{j=1}^{k_{\max}} c_j\,\phi_{j,k}(\eta) \;+\; \sum_{k=1}^{k_{\max}}\sum_{j=1}^{k_{\max}} c_j\,\psi_{j,k}(\delta),$$

where

$$\phi_{j,k}(\eta) := \exp\left(-\frac{\Big[\max\Big\{-2|\Sigma_j|d^{3/2} + \sqrt{4|\Sigma_j|^2d^3 - 8|\Sigma_j|\big(|\Sigma_j|d - (\pi_k^{-1}(\eta/c_k))^2\big)}\,,\ 0\Big\}\Big]^2}{4|\Sigma_j|}\right),$$

$$\psi_{j,k}(\delta) := \exp\left(-\frac{\Big[\max\Big\{|\Sigma_j+\Delta\Sigma|\,d - \Big(\pi_k^{-1}\big(\tfrac{\delta}{c_k k_{\max}}\big)\Big)^2,\ 0\Big\}\Big]^2}{2d\,|\Sigma_j+\Delta\Sigma|}\right),$$

and

$$\big(\pi_k^{-1}(t)\big)^2 := -2|\Sigma_k|\log\big((2\pi)^{d/2}|\Sigma_k|^{1/2}\,t\big).$$

Proof. Let $X$ be distributed according to the stationary distribution (that is, an index $j$ is sampled at random with probability $c_j$, and then a Gaussian random variable $X = X_j$ is sampled from $X_j \sim \pi_j$), and let $Y$ be the independent random jump proposal (so that the next proposed move is the point $X + Y$).

Then

$$P_{\mathrm{accept}} = \mathbb{E}\left[\min\left\{1, \frac{\pi(X+Y)}{\pi(X)}\right\}\right] \leq \frac{\delta}{\eta}\,\mathbb{P}\big(\pi(X)\geq\eta,\ \pi(X+Y)\leq\delta\big) + \mathbb{P}\big(\{\pi(X)<\eta\}\cup\{\pi(X+Y)>\delta\}\big)$$
$$\leq \frac{\delta}{\eta} + \mathbb{P}\big(\pi(X)<\eta\big) + \mathbb{P}\big(\pi(X+Y)>\delta\big).$$

Now, $\{\sum_k c_k\pi_k(X) < \eta\} \subset \{c_k\pi_k(X) < \eta\}$ for every $k$, so

$$\mathbb{P}\big(\pi(X)<\eta\big) \leq \min_k\mathbb{P}\big(c_k\pi_k(X) < \eta\big) = \min_k\sum_{j=1}^{k_{\max}} c_j\,\mathbb{P}\big(c_k\pi_k(X_j) < \eta\big).$$

By Equation 4.3 of [36], if $Z^2 \sim \chi^2_d$, then

$$\mathbb{P}\big(|\Sigma_j|Z^2 > t\big) \leq \exp\left(-\frac{\Big[\max\Big\{-2|\Sigma_j|d^{3/2} + \sqrt{4|\Sigma_j|^2d^3 - 8|\Sigma_j|\big(|\Sigma_j|d - t\big)}\,,\ 0\Big\}\Big]^2}{4|\Sigma_j|}\right),$$

so

$$\mathbb{P}\big(c_k\pi_k(X_j) < \eta\big) = \mathbb{P}\Big(|\Sigma_j|Z^2 > \big(\pi_k^{-1}(\eta/c_k)\big)^2\Big) \leq \phi_{j,k}(\eta).$$

All that remains to be done is to bound $\mathbb{P}\big(\pi(X+Y) > \delta\big)$. To do so we observe that $\{\sum_k c_k\pi_k(X+Y) > \delta\} \subset \bigcup_k\{c_k\pi_k(X+Y) > \frac{\delta}{k_{\max}}\}$, so

$$\mathbb{P}\big(\pi(X+Y)>\delta\big) \leq \sum_k\mathbb{P}\left(\pi_k(X+Y) > \frac{\delta}{c_k k_{\max}}\right) = \sum_k\sum_j c_j\,\mathbb{P}\left(\pi_k(X_j+Y) > \frac{\delta}{c_k k_{\max}}\right),$$

but (for fixed covariance determinants $|\Sigma_k|$ and $|\Sigma_j+\Delta\Sigma|$), $\mathbb{P}\big(\pi_k(X_j+Y) > \frac{\delta}{c_k k_{\max}}\big)$ is maximized if the means of $X_j+Y$ and $\pi_k$ are the same and the covariance matrices are both multiples of the identity matrix. Hence, we may replace $\|X_j+Y\|$ with $|\Sigma_j+\Delta\Sigma|\,Z$ without decreasing $\mathbb{P}\big(\pi_k(X_j+Y) > \frac{\delta}{c_k k_{\max}}\big)$. By Equation 4.4 of [36], $\mathbb{P}\big(|\Sigma_j+\Delta\Sigma|Z^2 < t\big) \leq \exp\Big(-\frac{[\max\{|\Sigma_j+\Delta\Sigma|d - t,\,0\}]^2}{2d\,|\Sigma_j+\Delta\Sigma|}\Big)$, so

$$\mathbb{P}\left(\pi_k(X_j+Y) > \frac{\delta}{c_k k_{\max}}\right) = \mathbb{P}\left(|\Sigma_j+\Delta\Sigma|Z^2 < \Big(\pi_k^{-1}\big(\tfrac{\delta}{c_k k_{\max}}\big)\Big)^2\right) \leq \psi_{j,k}(\delta). \qquad\Box$$
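The proof relies on the two $\chi^2_d$ tail inequalities of [36] (Equations 4.3 and 4.4 there). In the standard Laurent-Massart form these read $\mathbb{P}(Z \geq d + 2\sqrt{dx} + 2x) \leq e^{-x}$ and $\mathbb{P}(Z \leq d - 2\sqrt{dx}) \leq e^{-x}$ for $Z \sim \chi^2_d$; the sketch below (with an arbitrary dimension and $x$ grid of our choosing) compares these bounds with the exact tails.

```python
import numpy as np
from scipy.stats import chi2

d = 20                                   # arbitrary dimension for this check
x = np.array([0.5, 1.0, 2.0, 5.0, 10.0])

upper_thresh = d + 2.0 * np.sqrt(d * x) + 2.0 * x   # upper-tail threshold
lower_thresh = d - 2.0 * np.sqrt(d * x)             # lower-tail threshold

print(" x    exact upper tail   bound e^-x   exact lower tail")
for xi, ut, lt in zip(x, upper_thresh, lower_thresh):
    print(f"{xi:4.1f}  {chi2.sf(ut, d):17.3e}  {np.exp(-xi):10.3e}"
          f"  {chi2.cdf(lt, d):16.3e}")
```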

Acknowledgement

We gratefully acknowledge support from ONR 14-0001 and NSF DMS-1312831.


Chapter 4

A Generalization of Crofton's

Formula to Hamiltonian Dynamics,

with Applications to Hamiltonian

Monte Carlo

4.1 Introduction

In this chapter we prove generalizations of Crofton's formula (Theorems 16 and 17)

that apply to particle trajectories in Hamiltonian Dynamics. We then use one of

these new formulae (Theorem 16) to increase the efficiency of computing integrals

over codimension-1 manifolds with Hamiltonian Monte Carlo algorithms.

4.2 Crofton formulae for Hamiltonian dynamics

Theorem 16. Let $M$ be a codimension-1 submanifold in the position space $\mathbb{R}^n$. Let $\gamma$ be a random Hamiltonian trajectory with Hamiltonian energy functional $H(q,p) = U(q) + \frac{1}{2}p^Tp$. Then

$$\int_M \pi(q)\,dq = \frac{c}{T}\,\mathbb{E}[N_M], \tag{4.1}$$


where $N_M$ is the number of times $\gamma$ intersects $M$ (counted with multiplicity) and $c := C_{1,n-1,n,\mathbb{R}}$ is the same constant used in the classical Crofton's formula (Lemma 1).

Remark 8. If $M = \partial S$, then Theorem 16 gives the expected number of times that $\gamma$ crosses from $S$ to $S^c$ or vice versa. If the stationary density $\pi$ and integration time $T$ are such that $\gamma$ never crosses $\partial S$ more than once, then Theorem 16 and Theorem 11 imply each other in this special case.

Remark 9. The classical Crofton formula for lines (Lemma 1 with $k = 1$) can be viewed as a corollary to Theorem 16: if we choose the Hamiltonian potential to be uniform over a compact set $\Omega$, the Hamiltonian trajectories will be composed of lines that move with the kinematic measure, conditioned to intersect $\Omega$. Since the potential is uniform, we have $\int_M \pi(q)\,dq\cdot\mathrm{Vol}(\Omega) = \mathrm{Vol}(M)$, the value on the LHS of the classical Crofton formula.

Proof. (of Theorem 16)

Let $\gamma$ be a random Hamiltonian trajectory over time $[0,T]$ with position and momentum at time 0 sampled from the stationary density $\pi(q,p) = \pi(q)\cdot f_{\mathcal{N}(0,1)^n}(p)$. Let $N_M$ be the number of intersection points of $\gamma$ with $M$, and let $(q_i, p_i)$ be the phase space coordinates of $\gamma$ at its $i$'th intersection point with $M$.

Since Hamiltonian flow preserves the stationary distribution, $\gamma$ is at the stationary distribution at any time $t \in [0,T]$, so for any function $h$ we have

$$\int_M\int_{\mathbb{R}^n} h(q,p)\,\pi(q,p)\,dp\,dq = \frac{1}{T}\,\mathbb{E}\left[\sum_{i=1}^{N_M}\frac{h(q_i,p_i)}{\|\mathrm{proj}_{M_{q_i}^\perp} p_i\|}\right], \tag{4.2}$$

where $\mathrm{proj}_{M_q^\perp}$ denotes the projection onto the normal vector to $M$ at $q$.

Also, since the marginal stationary density of the momentum $f_{\mathcal{N}(0,1)^n}(p)$ is independent of the position $q$ and is rotationally invariant with respect to $p$, for every $q \in M$ we have

$$\int_{\mathbb{R}^n}\|\mathrm{proj}_{M_q^\perp}\,p\|\cdot f_{\mathcal{N}(0,1)^n}(p)\,dp = \int_{\mathbb{R}^n}\|\mathrm{proj}_{e_1}\,p\|\cdot f_{\mathcal{N}(0,1)^n}(p)\,dp = \frac{1}{c}, \tag{4.3}$$


where $e_1 := (1, 0, \ldots, 0)^T$ is a coordinate vector.

Hence, setting $h(q,p) := \|\mathrm{proj}_{M_q^\perp} p\|$, we get

$$\int_M \pi(q)\,dq = \int_M \pi(q)\cdot 1\,dq = \int_M \pi(q)\cdot c\int_{\mathbb{R}^n}\|\mathrm{proj}_{M_q^\perp} p\|\,f_{\mathcal{N}(0,1)^n}(p)\,dp\,dq \qquad\text{(by Eq. 4.3)}$$

$$= c\int_M\int_{\mathbb{R}^n}\|\mathrm{proj}_{M_q^\perp} p\|\,\pi(q,p)\,dp\,dq = \frac{c}{T}\,\mathbb{E}\left[\sum_{i=1}^{N_M}\frac{\|\mathrm{proj}_{M_{q_i}^\perp} p_i\|}{\|\mathrm{proj}_{M_{q_i}^\perp} p_i\|}\right] \qquad\text{(by Eq. 4.2)}$$

$$= \frac{c}{T}\,\mathbb{E}\left[\sum_{i=1}^{N_M} 1\right] = \frac{c}{T}\,\mathbb{E}[N_M]. \qquad\Box$$

More generally, for arbitrary Hamiltonians we have

Theorem 17. Let $M$ be a codimension-1 submanifold of the position space $\mathbb{R}^n$. Let $\gamma$ be a random Hamiltonian trajectory with arbitrary Hamiltonian energy functional $H(q,p)$. Then

$$\int_M \pi(q)\,dq = \frac{1}{T}\,\mathbb{E}\left[\sum_{i=1}^{N_M}\frac{\int_{\mathbb{R}^n}\pi(q_i,p)\,dp}{\int_{\mathbb{R}^n}\|v_{M^\perp}(q_i,p)\|\,\pi(q_i,p)\,dp}\right], \tag{4.4}$$

where $q_i$ is the position where $\gamma$ intersects $M$ for the $i$'th time, and $N_M$ is the number of intersections (counted with multiplicity). $v(q,p) := \frac{dq}{dt} = \frac{\partial H}{\partial p}$ is the velocity given by Hamilton's equations at position $q$ and momentum $p$, and $\|v_{M^\perp}(q,p)\|$ is the magnitude of the component of $v(q,p)$ in the direction orthogonal to the tangent space of $M$ at $q$.

Proof. The proof of Theorem 17 follows the same steps as the proof of Theorem 16:

Let $\gamma$ be a random Hamiltonian trajectory over time $[0,T]$ with position and momentum at time 0 sampled from the stationary density $\pi(q,p)$. Let $N_M$ be the number


of intersection points of $\gamma$ with $M$, and let $(q_i, p_i)$ be the phase space coordinates of $\gamma$ at its $i$'th intersection point with $M$.

Since Hamiltonian flow preserves the stationary distribution, $\gamma$ is at the stationary distribution at any time $t\in[0,T]$, so for any function $h$ we have

$$\int_M\int_{\mathbb{R}^n}h(q,p)\,\pi(q,p)\,dp\,dq = \frac{1}{T}\,\mathbb{E}\left[\sum_{i=1}^{N_M}\frac{h(q_i,p_i)}{\|v_{M^\perp}(q_i,p_i)\|}\right]. \tag{4.5}$$

Hence, setting $h(q,p) := \|v_{M^\perp}(q,p)\|\cdot\dfrac{\int_{\mathbb{R}^n}\pi(q,p')\,dp'}{\int_{\mathbb{R}^n}\|v_{M^\perp}(q,p')\|\,\pi(q,p')\,dp'}$ (so that $\int_{\mathbb{R}^n}h(q,p)\,\pi(q,p)\,dp = \int_{\mathbb{R}^n}\pi(q,p)\,dp = \pi(q)$), we get

$$\int_M\pi(q)\,dq = \int_M\int_{\mathbb{R}^n}h(q,p)\,\pi(q,p)\,dp\,dq = \frac{1}{T}\,\mathbb{E}\left[\sum_{i=1}^{N_M}\frac{h(q_i,p_i)}{\|v_{M^\perp}(q_i,p_i)\|}\right] \qquad\text{(by Eq. 4.5)}$$

$$= \frac{1}{T}\,\mathbb{E}\left[\sum_{i=1}^{N_M}\frac{\int_{\mathbb{R}^n}\pi(q_i,p)\,dp}{\int_{\mathbb{R}^n}\|v_{M^\perp}(q_i,p)\|\,\pi(q_i,p)\,dp}\right]. \qquad\Box$$

4.3 Manifold integration using HMC and the Hamiltonian Crofton formula

We now state a conventional method of using Hamiltonian Monte Carlo to compute

integrals over a submanifold (Algorithm 13).


Algorithm 13 integration on a submanifold with HMC

input: $q_0$, oracle for $\pi : \mathbb{R}^d \to [0,\infty)$, oracle for intersections with $M$
output: Estimator $A$ for $\int_M \pi(q)\,dq$.
define: $H(q,p) := -\log(\pi(q)) + \frac{1}{2}p^Tp$.
1: for $i = 1, 2, \ldots$ do
2:   Sample independent $p_i \sim \mathcal{N}(0, I)^d$
3:   Integrate the Hamiltonian trajectory $(q(t), p(t))$ with Hamiltonian $H$ over the time interval $[0, T]$ and initial conditions $(q(0), p(0)) = (q_i, p_i)$
4:   set $q_{i+1} = q(T)$
5:   compute the sequence of intersection angles $\theta_1, \theta_2, \ldots$ of the trajectory with $M$
6: end for
7: compute $A := \frac{1}{iT}\sum_j \frac{1}{\sin(\theta_j)}$

Algorithm 13 requires the use of the weights $\frac{1}{\sin(\theta_j)}$. Unfortunately, these weights have infinite variance, greatly slowing the algorithm ($\mathrm{Var}\big(\frac{1}{\sin(\theta)}\big) = \infty$, as shown in Section 2.2.2 when analyzing the classical Crofton formula). To eliminate these weights we can apply our Hamiltonian Crofton formula (Theorem 16) to obtain an algorithm with a much faster-converging estimator for the integral (Algorithm 14):

Algorithm 14 Crofton formula integration on a submanifold with HMC

Algorithm 14 is identical to Algorithm 13, except for the following steps:
output: Estimator $\frac{c}{iT}N_M$ for $\int_M \pi(q)\,dq$.
5:   compute the number (with multiplicity) of intersections $N_i$ of the trajectory with $M$
7:   compute $N_M := \sum_j N_j$

Acknowledgements

We gratefully acknowledge support from NSF DMS-1312831.


Chapter 5

A Hopf Fibration for β-Ghost

Gaussians

5.1 Introduction

There is a wonderful geometrical construction known as the Hopf fibration. The

Hopf map is a continuous map from the 3-sphere to the 2-sphere where every point

on the 2-sphere is the image of a circle on the 3-sphere. This Hopf map has the

nice property that if a point is uniformly distributed on the 3-sphere, its image is

uniformly distributed on the 2-sphere.

The entire Hopf story can be expressed quickly and elegantly with quaternions. In quaternion language, consider the map that takes $z$ to $zi\bar z$, where $z$ is a unit quaternion ($|z| = 1$). By taking its conjugate, it is easy to see that $zi\bar z$ is a unit quaternion with zero real part (these are known as versors). Identifying unit quaternions with the 3-sphere, and those with zero real part with the 2-sphere, we have our Hopf map.

Notice that for a fixed unit quaternion $q$, the map from $z$ with zero real part to $qz\bar q$ is a linear function of $z$ which preserves $|z|$. The matrix representation is a 3x3

rotation matrix. This association is widely used in practice in computer graphics and

other fields. Given any 3x3 rotation matrix, there is a well-known construction to go the other way, based on computing the eigenvector (the axis of the rotation) and the eigenvalues (which encode the angle of rotation).


We offer a quick proof based on orthogonal invariance that if $z$ is uniformly distributed on the 3-sphere, then $zi\bar z$ is uniformly distributed on the 2-sphere. As just discussed, any orthogonal rotation of the non-zero coordinates of $zi\bar z$ may be written as $q\,zi\bar z\,\bar q = (qz)\,i\,\overline{(qz)}$. Now if $z$ is uniform on the 3-sphere, so is $qz$, and thus $q\,zi\bar z\,\bar q$ has the same distribution as $zi\bar z$, implying that it is uniformly distributed on the 2-sphere.

This proof is very elegant and worth reading a few times. The very fact that the geometry of Hopf is so nicely encoded in quaternions inspired us to ask what happens if we replace quaternions with general $\beta$-ghosts.
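Before generalizing, here is a small numerical check of the $\beta = 4$ statement above, which we add for illustration: for $z$ uniform on the 3-sphere (a normalized standard Gaussian quaternion), $zi\bar z$ has zero real part and, by Archimedes' theorem for the uniform distribution on the 2-sphere, its $i$-component should be uniform on $[-1,1]$ (mean $0$, variance $1/3$). The quaternion arithmetic is written out by hand.

```python
import numpy as np

def quat_mult(x, y):
    """Hamilton product of quaternions stored as (a, b, c, d) = a + bi + cj + dk."""
    a1, b1, c1, d1 = x
    a2, b2, c2, d2 = y
    return np.array([
        a1*a2 - b1*b2 - c1*c2 - d1*d2,
        a1*b2 + b1*a2 + c1*d2 - d1*c2,
        a1*c2 - b1*d2 + c1*a2 + d1*b2,
        a1*d2 + b1*c2 - c1*b2 + d1*a2,
    ])

def hopf(z):
    """Hopf map z -> z i conj(z) for a unit quaternion z."""
    i = np.array([0.0, 1.0, 0.0, 0.0])
    z_conj = z * np.array([1.0, -1.0, -1.0, -1.0])
    return quat_mult(quat_mult(z, i), z_conj)

rng = np.random.default_rng(3)
samples = []
for _ in range(50_000):
    z = rng.standard_normal(4)
    z /= np.linalg.norm(z)               # uniform on the 3-sphere
    samples.append(hopf(z))
samples = np.array(samples)

print("max |real part|         :", np.abs(samples[:, 0]).max())   # ~ 0
print("mean of i-component     :", samples[:, 1].mean())          # ~ 0
print("variance of i-component :", samples[:, 1].var())           # ~ 1/3
```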

5.2 Defining the β-dimensional algebra

Let $\beta$ be of even dimension. Write $\mathbb{A} = \mathbb{R}^\beta = A\times B\times C\times D$, where $A = \mathrm{span}\{1\}$, $B = \mathrm{span}\{i\}$, $C = \mathrm{span}\{j_1,\ldots,j_{\frac{\beta}{2}-1}\} \cong \mathbb{R}^{\frac{\beta}{2}-1}$, and $D = \mathrm{span}\{k_1,\ldots,k_{\frac{\beta}{2}-1}\} \cong \mathbb{R}^{\frac{\beta}{2}-1}$.

Definition 4. (multiplication) The only multiplication allowed in this algebra is multiplication of an element in $\mathbb{A}$ by elements of $\mathrm{span}\{1, i\}$, as well as any multiplication where the output only has elements of $\mathbb{A} = A\times B\times C\times D$ (e.g., $r^2$ and $rir$ are allowed for any $r \in \mathbb{A}$). Left multiplication is defined for all $t, s \in \{1,\ldots,\frac{\beta}{2}-1\}$ as:

$$i^2 = j_t^2 = k_t^2 = -1 \tag{5.1}$$
$$i\,j_t = k_t \tag{5.2}$$
$$i\,k_t = -j_t \tag{5.3}$$

and right multiplication is defined by

$$j_t\,i = -k_t \tag{5.4}$$
$$k_t\,i = j_t \tag{5.5}$$

Finally, we assume that all orthogonal pure imaginary components are anticommutative, although we only allow such operations if the end results cancel so that the output is in $A\times B\times C\times D$:

$$j_s j_t = -j_t j_s, \qquad k_s k_t = -k_t k_s, \qquad j_s k_t = -k_t j_s. \tag{5.6}$$

Associativity of multiplication is assumed as well (but NOT commutativity). Addition is done as a vector space.

Finally, we define the conjugation operation on an element $Z = a + bi + cr$, where $a, b, c \in \mathbb{R}$ and $r \in S^{\beta-3} \subset C\times D$, by $\bar Z := a - bi - cr$.

The following theorems (Theorems 18-20) give algebraic properties of the $\beta$-dimensional algebra that generalize properties of the quaternion algebra. These properties will come in handy when generalizing the Hopf fibration to $\mathbb{R}^\beta$.

Theorem 18. $r^2 = -1$ and $(ir)^2 = -1$

Proof. $1 = |r|^2 = r\bar r = r\cdot(-r) = -r^2$; hence $r^2 = -1$.

Since $ir$ is pure imaginary (no real part) (this is because we are assuming that the pure imaginary $\beta-2$ sphere is closed under orthogonal multiplication), $1 = |ir|^2 = (ir)\cdot\overline{(ir)} = (ir)(-ir) = -(ir)^2$. Hence $(ir)^2 = -1$. $\Box$

Theorem 19. $ir = -ri$

Proof. $(i+r)\overline{(i+r)} = |i+r|^2 \in \mathbb{R}$,

but $(i+r)\overline{(i+r)} = (i+r)(-i-r) = -i^2 - ir - ri - r^2 = 2 - ir - ri$.

Hence, since $ir$ and $ri$ are non-real (since Equations 5.2-5.4 imply that the $\beta-2$ pure imaginary sphere is closed under orthogonal multiplication), and $|i+r|^2$ is real, it must be true that $-ir - ri = 0$. Hence $ir = -ri$. $\Box$

Theorem 20. $|xr + yir| = \sqrt{x^2 + y^2}$ for all $x, y \in \mathbb{R}$

Proof. $|xr + yir|^2 = (xr + yir)\overline{(xr+yir)} = (xr+yir)(-xr-yir) = -x^2r^2 - xy\,r(ir) - xy\,(ir)r - y^2(ir)^2 = x^2 - xy\,i + xy\,i + y^2 = x^2 + y^2$. $\Box$

5.3 Hopf Fibration on $\mathbb{R}^\beta$

We begin by defining a version of the Hopf fibration $\mathcal{H} : \mathbb{R}^\beta \to \mathbb{R}^{\beta-1}$ for even-integer $\beta$ by generalizing the quaternion representation of the $\beta = 4$ Hopf fibration.

Let $Z \sim \mathcal{N}(0,1)^\beta$ (where "$Z \sim \mathcal{N}(0,1)^\beta$" means that $Z$ is a random variable sampled from the distribution $\mathcal{N}(0,1)^\beta$). Then

$$Z = a + bi + cr,$$

where $a, b \sim \mathcal{N}(0,1)$, $c \sim \chi_{\beta-2}$, and $r \sim \mathrm{Uniform}(S^{\beta-3}) \subset C\times D$ are independently distributed.

We can now define the Hopf map $\mathcal{H} : \mathbb{R}^\beta \to \mathbb{R}^{\beta-1}$ by $\mathcal{H}(Z) := Zi\bar Z$.

Theorem 21.

$$\mathcal{H}(Z) := Zi\bar Z = (a^2 + b^2 - c^2)\,i + 2bc\,r - 2ac\,ir \;\sim\; (a^2+b^2-c^2)\,i + 2c\sqrt{a^2+b^2}\;r$$

Proof.

$$Zi\bar Z = (a+bi+cr)\,i\,(a-bi-cr) = (a+bi+cr)(ai+b-cir)$$
$$= a^2 i + ab - ac\,ir - ab + b^2 i - bc\,i^2 r + ac\,ri + bc\,r - c^2\,rir$$
$$= (a^2+b^2)\,i - ac\,ir + bc\,r - ac\,ir + bc\,r + c^2 r^2 i = (a^2+b^2-c^2)\,i + 2bc\,r - 2ac\,ir$$
$$\sim (a^2+b^2-c^2)\,i + 2c\sqrt{a^2+b^2}\;r. \qquad\Box$$

In particular, Theorem 21 shows that the Hopf map eliminates the real component, just like it does in the quaternion ($\beta = 4$) case. More generally, Theorem 21 allows us to compare the distribution of $Zi\bar Z$ for different $\beta$. Towards this end, let $W$ be the $i$-component of $\frac{\mathcal{H}(Z)}{|\mathcal{H}(Z)|}$. Then

$$W = \frac{\mathrm{imag}_i(Zi\bar Z)}{|Zi\bar Z|} = \frac{X - Y}{X + Y},$$

where $X := a^2 + b^2 \sim \chi^2_2$ and $Y := c^2 \sim \chi^2_{\beta-2}$.

A quick multivariable integral computation gives the densities $f_W$ and $f_{|W|}$ of $W$ and $|W|$, respectively:

$$f_W(t) = \frac{\beta-2}{4}\left(\frac{1-t}{2}\right)^{\frac{\beta}{2}-2},\qquad t\in[-1,1],$$

and

$$f_{|W|}(t) = \frac{\beta-2}{4}\left[\left(\frac{1+t}{2}\right)^{\frac{\beta}{2}-2} + \left(\frac{1-t}{2}\right)^{\frac{\beta}{2}-2}\right],\qquad t\in[0,1].$$

In particular, $f_W(t)$ is constant for $\beta = 4$, has negative second derivative for $4 < \beta < 6$, is linear for $\beta = 6$, and has positive second derivative for $\beta > 6$. $f_{|W|}$ is uniform for both $\beta = 4$ and $\beta = 6$, and also has negative second derivative for $4 < \beta < 6$ and positive second derivative for $\beta > 6$.

The $(\beta-2)$-dimensional surface area density $\rho(t)$ of $\frac{\mathcal{H}(Z)}{|\mathcal{H}(Z)|}$ on the $\beta-2$ sphere at height $t$ on the $i$-axis is:

$$\rho(t) = \frac{f_W(t)}{\mathrm{Vol}(P_t\cap S^{\beta-2})} = \frac{\beta-2}{4\,s_{\beta-3}\,2^{\frac{\beta}{2}-2}\,(1+t)^{\frac{\beta}{2}-2}},$$


where $P_t$ is a $(\beta-2)$-plane a distance $t$ from the origin and $s_{\beta-3} := \mathrm{Vol}_{\beta-3}(S^{\beta-3}) = \frac{2\pi^{\frac{\beta-2}{2}}}{\Gamma(\frac{\beta-2}{2})}$.

In particular, $0 < \rho(t) < \infty$ everywhere except at $t = -1$, where $\rho(-1) = \infty$. This suggests that $\mathcal{H}$ has a dimension reduction of 1 everywhere except at the "south pole" of the $\beta-2$ sphere ($-i$), where the dimension is reduced by $\beta-2$ ($r$ gets mapped to $-i$ for any $(\beta-2)$-dimensional "phase" $r$. However, for instance, only the circle $\{a+bi : a^2+b^2 = 1\}$ gets mapped to $+i$, so the map is well-behaved at $+i$ because the dimension goes down by 1.)

Moreover, this fact, together with the fact that $\mathcal{H}(a+bi+cr) = (a^2+b^2-c^2)i + 2bc\,r - 2ac\,ir$, suggests that $\mathcal{H}$ is analytic everywhere except at $-i \in S^{\beta-2}$.
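As a quick check of the densities derived above, the sketch below samples $W = \frac{X-Y}{X+Y}$ with $X \sim \chi^2_2$ and $Y \sim \chi^2_{\beta-2}$ for an arbitrary even $\beta$ of our choosing and compares a histogram against $f_W$; for $\beta = 4$ and $\beta = 6$ the same comparison for $|W|$ should come out flat.

```python
import numpy as np

rng = np.random.default_rng(4)
beta = 8                                   # arbitrary even beta > 4
n = 200_000

X = rng.chisquare(2, size=n)
Y = rng.chisquare(beta - 2, size=n)
W = (X - Y) / (X + Y)

def f_W(t, beta):
    # density of W derived in the text
    return (beta - 2) / 4.0 * ((1.0 - t) / 2.0) ** (beta / 2.0 - 2.0)

edges = np.linspace(-1.0, 1.0, 21)
hist, _ = np.histogram(W, bins=edges, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Should be small (sampling noise plus binning error only)
print("max |empirical - f_W| over the bins:",
      np.max(np.abs(hist - f_W(centers, beta))))
```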

Acknowledgements

We gratefully acknowledge support from NSF DMS-1312831.


Bibliography

[1] J. C. Alvarez Paiva and E. Fernandes. Gelfand transforms and Crofton formulas.Selecta Math. (N.S.), 13(3):369-390, 2007.

[2] Dennis Amelunxen and Martin Lotz. Computational kinematics. manuscript inpreparation.

[3] Dennis Amelunxen and Martin Lotz. A comment on "Integral geometry forMCMC" . private correspondence.

[4] C. Andrieu, N. de Freitas, A. Doucet, and M.I. Jordan. An introduction toMCMC for machine learning. Machine Learning, 50:5-43, 2003.

[5] R.H. Baayen, D.J. Davidson, and D.M. Bates. Mixed-effects modeling withcrossed random effects for subjects and items. Journal of Memory and Language,59:390-412, 2008.

[6] Julian Besag. Markov chain Monte Carlo for statistical inference. Technicalreport, University of Washington, Department of Statistics, 04 2001.

[7] Louis J Billera and Persi Diaconis. A geometric interpretation of the Metropolis-Hastings algorithm. Statistical Science, 16(4):335-339, 2001.

[8] F. Bornemann. On the numerical evaluation of distributions in random matrixtheory: a review. Markov Process. Related Fields, 16(4):803-866, 2010.

[9] Nawaf Bou-Rabee and Jesús María Sanz-Serna. Randomized Hamiltonian Monte Carlo. arXiv preprint arXiv:1511.09382v1, 2015.

[10] Bob Carpenter, A Gelman, M Hoffman, D Lee, B Goodrich, M Betancourt,M Brubaker, J Guo, P Li, and A Riddell. Stan: a probabilistic programminglanguage. Journal of Statistical Software, in press, 2015.

[11] Jeff Cheeger. A lower bound for the smallest eigenvalue of the Laplacian. InProblems in analysis (Papers dedicated to Salomon Bochner, 1969), pages 195-199. Princeton Univ. Press, Princeton, N. J., 1970.

[12] Ming-Hui Chen, Qi-Man Shao, and Joseph G. Ibrahim. Monte Carlo methods inBayesian computation. Springer Series in Statistics. Springer-Verlag, New York,2000.


[13] Shiing-shen Chern. On the curvatura integra in a Riemannian manifold. Ann.of Math. (2), 46:674-684, 1945.

[14] Sai Hung Cheung and James L Beck. Bayesian model updating using hybridMonte Carlo simulation with application to structural dynamic models withmany uncertain parameters. Journal of engineering mechanics, 135(4):243-255,2009.

[15] Anders S. Christensen, Troels E. Linnet, Mikael Borg, Kresten Lindorff-Larsen, Wouter Boomsma, Thomas Hamelryck, and Jan H. Jensen. Protein structure validation and refinement using amide proton chemical shifts derived from quantum mechanics. PLoS ONE, 8(12):1-10, 2013.

[16] Neil J. Cornish and Edward K. Porter. MCMC exploration of supermassive blackhole binary inspirals. Classical Quantum Gravity, 23(19):761-767, 2006.

[17] Morgan W. Crofton. On the theory of local probability, applied to straight linesdrawn at random in a plane; the methods used being also extended to the proofof certain new theorems in the integral calculus. Philosophical Transactions ofthe Royal Society of London, 158:181-199, 1868.

[18] Persi Diaconis, Susan Holmes, and Mehrdad Shahshahani. Sampling from a man-ifold. In Advances in Modern Statistical Theory and Applications: A Festschriftin Honor of Morris L. Eaton, pages 102-125. Institute of Mathematical Statis-tics, 2013.

[19] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth.Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.

[20] Alan Edelman and Brian D. Sutton. From random matrices to stochastic oper-ators. J. Stat. Phys., 127(6):1121-1165, 2007.

[21] Alan Edelman and Brian D. Sutton. From random matrices to stochastic oper-ators. J. Stat. Phys., 127(6):1121-1165, 2007.

[22] Israel M. Gelfand and Mikhail M. Smirnov. Lagrangians satisfying Crofton for-mulas, Radon transforms, and nonlocal differentials. Adv. Math., 109(2):188-227,1994.

[23] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions,and the Bayesian restoration of images. Pattern Analysis and Machine Intelli-gence, IEEE Transactions on, (6):721-741, 1984.

[24] Charles J. Geyer. Markov chain Monte Carlo maximum likelihood. In ComputingScience and Statistics: Proceedings of the 23rd Symposium on the Interface, pages156-163, 1991.


[25] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamilto-nian Monte Carlo methods. Journal of the Royal Statistical Society: Series B(Statistical Methodology), 73(2):123-214, 2011.

[26] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamilto-nian Monte Carlo methods. Journal of the Royal Statistical Society: Series B(Statistical Methodology), 73(2):123-214, 2011.

[27] Mark Girolami, Ben Calderhead, and Siu A Chin. Riemannian Manifold Hamil-tonian Monte Carlo. Arxiv preprint, 2009.

[28] Yongtao Guan and Stephen M. Krone. Small world MCMC and convergence tomulti-modal distributions: from slow mixing to fast mixing. Ann. Appl. Probab.,17(1):284-304, 2007.

[29] Larry Guth. Degree reduction and graininess for Kakeya-type sets in R3 . preprinton arXiv:1402.0518, 2014.

[30] Sigurdur Helgason. Integral geometry and Radon transforms. Springer, NewYork, 2011.

[31] Jody Hey and Rasmus Neilsen. Integration within the Felsenstein equation forimproved Markov chain Monte Carlo methods in population genetics. Proceedingsof the national academy of sciences of the United States of America, 104(8):2785-2790, 2006.

[32] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research, 15(1):1593-1623, 2014.

[33] JH Irving and John G Kirkwood. The statistical mechanical theory of transportprocesses. iv. the equations of hydrodynamics. The Journal of chemical physics,18(6):817-829, 1950.

[34] Iain M. Johnstone. On the distribution of the largest eigenvalue in principalcomponents analysis. Ann. Statist., 29(2):295-327, 2001.

[35] Daphne Koller and Nir Friedman. Probabilistic graphical models. Adaptive Com-putation and Machine Learning. MIT Press, Cambridge, MA, 2009. Principlesand techniques.

[36] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic func-tional by model selection. Annals of Statistics, 28(5):1302-1338, 2000.

[37] Gregory F Lawler and Alan D Sokal. Bounds on the L2 spectrum for Markovchains and Markov processes: a generalization of Cheeger's inequality. Transac-tions of the American mathematical society, 309(2):557-580, 1988.


[38] Michel Ledoux. The concentration of measure phenomenon, volume 89 of Math-ematical Surveys and Monographs. American Mathematical Society, Providence,RI, 2001.

[39] Tony Lelivre, Mathias Rousset, and Gabriel Stoltz. Free energy computations:A Mathematical Perspective. Imperial College Press, 2010.

[40] David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov chains andmixing times. American Mathematical Soc., 2009.

[41] Samuel Livingstone, Michael Betancourt, Simon Byrne, and Mark Girolami.On the geometric ergodicity of Hamiltonian Monte Carlo. arXiv preprintarXiv:1601.08057, 2016.

[42] Martin Lotz. On the volume of tubular neighborhoods of real algebraic varieties.Proc. Amer. Math. Soc., 143(5):1875-1889, 2015.

[43] Oren Mangoubi. Concentration of kinematic measure. manuscript in preparation.

[44] B Mehlig, DW Heermann, and BM Forrest. Hybrid Monte Carlo method forcondensed-matter systems. Physical Review B, 45(2):679, 1992.

[45] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta HTeller, and Edward Teller. Equation of state calculations by fast computingmachines. The journal of chemical physics, 21(6):1087-1092, 1953.

[46] V. D. Milman. A new proof of A. Dvoretzky's theorem on cross-sections of convex bodies. Funkcional. Anal. i Priložen., 5(4):28-37, 1971.

[47] Boaz Nadler. On the distribution of the ratio of the largest eigenvalue to thetrace of a Wishart matrix. J. Multivariate Anal., 102(2):363-371, 2011.

[48] Radford M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markovchain Monte Carlo, Chapman & Hall/CRC Handb. Mod. Stat. Methods, pages113-162. CRC Press, Boca Raton, FL, 2011.

[49] S. Yu. Orevkov. Sharpness of Risler's upper bound for the total curvature of anaffine real algebraic hypersurface. Uspekhi Mat. Nauk, 62(2):169-170, 2007.

[50] Hannes Risken. Fokker-planck equation. Springer, 1984.

[51] Jean-Jacques Risler. On the curvature of the real Milnor fiber. Bull. LondonMath. Soc., 35(4):445-454, 2003.

[52] Gareth O. Roberts and Jeffrey S. Rosenthal. Geometric ergodicity and hybrid Markov chains. Electron. Comm. Probab., 2(2):13-25, 1997.

[53] Luis A. Santaló. Integral geometry and geometric probability. Cambridge Mathematical Library. Cambridge University Press, Cambridge, second edition, 2004. With a foreword by Mark Kac.


[54] Rolf Schneider and Wolfgang Weil. Stochastic and integral geometry. Probabilityand its Applications (New York). Springer-Verlag, Berlin, 2008.

[55] Alistair Sinclair and Mark Jerrum. Approximate counting, uniform generationand rapidly mixing Markov chains. Information and Computation, 82(1):93-133,1989.

[56] Michael Spivak. A comprehensive introduction to differential geometry. Vol. III.Publish or Perish, Inc., Wilmington, Del., second edition, 1979.

[57] Michael Spivak. A comprehensive introduction to differential geometry. Vol. V.Publish or Perish, Inc., Wilmington, Del., second edition, 1979.

[58] Brian D. Sutton. The stochastic operator approach to random matrix theory.ProQuest LLC, Ann Arbor, MI, 2005. Thesis (Ph.D.)-Massachusetts Instituteof Technology.

[59] Mihai Tibar and Dirk Siersma. Curvature and Gauss-Bonnet defect of globalaffine hypersurfaces. Bulletin des Sciences Mathematiques, 130(2):110-122, 2006.

[60] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevindynamics. In Proceedings of the 28th International Conference on MachineLearning (ICML-11), pages 681-688, 2011.

[61] Chenchang Zhu. The Gauss-Bonnet theorem and its applications. http://math.berkeley.edu/~alanw/240papers00/zhu.pdf, 2004.
