Simulation of rare events and optimisation with the cross-entropy method


Using cross-entropy techniques for rare event simulation and optimization

Arthur Breitman

NYC Machine learning meetup

August 18, 2011

Arthur Breitman crossentropy for rare event simulation and optimization



Outline

What is cross-entropy?
    Entropy
    Kullback-Leibler divergence

From Riemann to Monte-Carlo
    Riemann integration
    Monte-Carlo integration
    Importance sampling

Cross-Entropy techniques
    Analytical expressions
    Simulation of rare events
    Optimization
    Fitting parameters

Cross-Entropy tricks
    Multiple maxima
    Slow convergence

Questions





Information entropy

definition of information entropy

Entropy measures disorder of a physical system

Entropy measures information (Shannon)

Entropy measures ignorance (E.T. Jaynes)

Formally:

$$H = -\sum_{x \in \Omega} p(x) \ln p(x)$$
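As a quick illustration of the formula, here is a minimal Python sketch that evaluates H in nats for a discrete distribution. The function name `entropy` and the two example coins are illustrative choices, not part of the talk.

```python
import math

def entropy(p):
    """Shannon entropy, in nats, of a discrete distribution given as a list of probabilities."""
    # Outcomes with p(x) = 0 contribute nothing, by the convention 0 * ln(0) = 0.
    return -sum(px * math.log(px) for px in p if px > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
print(entropy(fair_coin))    # equals ln(2), the maximum for two outcomes
print(entropy(biased_coin))  # smaller: we are less ignorant about the outcome
```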




The continuous case

In the continuous case, for a random variable X with p.d.f. p(x), entropy is defined as

$$H(X) = -\int_{\Omega} p(x) \ln p(x) \, dx$$

Simple, right?






The entropy of a probability distribution is meaningless

Wrong!

Not invariant under a change of variable

Can even be negative!

Not an extension of Shannon’s entropy.
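A small numerical check of the first two objections, as a sketch: for a uniform density on [0, width] the integral -∫ p ln p dx equals ln(width), so it is negative for width < 1 and shifts when the same variable is expressed in other units. The function name and the metre/millimetre example are assumptions made for illustration.

```python
import math

def uniform_differential_entropy(width, steps=10_000):
    """Riemann-sum estimate of -∫ p ln p dx for the uniform density on [0, width]."""
    density = 1.0 / width
    dx = width / steps
    # The integrand is constant here, so every cell contributes the same amount.
    return -sum(density * math.log(density) * dx for _ in range(steps))

h_metres = uniform_differential_entropy(0.5)         # X uniform on [0, 0.5] metres
h_millimetres = uniform_differential_entropy(500.0)  # the same X expressed in millimetres
print(h_metres)                  # negative
print(h_millimetres - h_metres)  # differs by ln(1000): not invariant under rescaling
```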




E.T. Jaynes to the rescue

E.T. Jaynes adjusted the definition. Consider a sequence of discrete points that becomes dense in Ω; their limiting density must approach a distribution m. Set

$$H(X) = -\int_{\Omega} p(x) \ln\left(\frac{p(x)}{m(x)}\right) dx$$

N.B. m is not necessarily a probability distribution, just a density, so improper priors are O.K.






Definition of KL divergence

Kullback-Leibler divergence: entropy of a probability distribution P relative to a probability distribution Q

$$D_{KL}(P \| Q) = \int_{\Omega} p(x) \ln\left(\frac{p(x)}{q(x)}\right) dx$$

Similar to, but distinct from, entropy: with this sign convention it is non-negative, and zero only when P = Q.

Expected number of extra nats (or bits) needed to encode data drawn from P with a code optimized for Q.

Not symmetric!
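The asymmetry is easy to see numerically. The sketch below computes the discrete analogue of the divergence, taken as the non-negative quantity Σ p ln(p/q); the function name `kl_divergence` and the two example distributions are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions over the same outcomes."""
    # Assumes q(x) > 0 wherever p(x) > 0; outcomes with p(x) = 0 contribute nothing.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(d_pq, d_qp)  # two different positive numbers
```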




Why code length matters

All ML problems ⇔ fitting a probability distribution

KL divergence measures how concise your description is

Relates to MDL and Solomonoff induction

PAC-learning patches against a lack of epistemology




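Likelihood of parameters and Cross-Entropy

Given a sample q_1, ..., q_N of Q, and a family P_θ with θ ∈ Θ,

$$-\frac{1}{N} LL(\theta \mid q_1, \dots, q_N) = -\frac{1}{N} \sum_{i=1}^{N} \ln p_\theta(q_i) = H\left(\frac{1}{N} \sum_{i} \delta_{q_i}\right) + D_{KL}\left( \frac{1}{N} \sum_{i} \delta_{q_i} \,\Big\|\, P_\theta \right)$$

The right-hand side is the cross-entropy of the Dirac comb with respect to P_θ, with D_KL taken as non-negative. The entropy of the comb does not depend on θ, so (formally, in the continuous case) maximizing the likelihood of θ is minimizing the KL divergence of the Dirac comb w.r.t. P_θ.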
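In the discrete case the link between likelihood and cross-entropy can be checked exactly. The sketch below assumes a small categorical model: it compares the average log-likelihood of a sample with the entropy of the empirical distribution plus its (non-negative) KL divergence to the model. All names and the toy data are illustrative.

```python
import math
from collections import Counter

def average_log_likelihood(sample, model):
    """(1/N) * sum of ln p_theta(q_i), for a model given as {outcome: probability}."""
    return sum(math.log(model[x]) for x in sample) / len(sample)

def cross_entropy_to_model(sample, model):
    """H(Q_hat) + D_KL(Q_hat || P_theta), with Q_hat the empirical distribution of the sample."""
    n = len(sample)
    q_hat = {x: count / n for x, count in Counter(sample).items()}
    entropy = -sum(q * math.log(q) for q in q_hat.values())
    kl = sum(q * math.log(q / model[x]) for x, q in q_hat.items())
    return entropy + kl

sample = ["a", "b", "a", "c", "a", "b"]
model = {"a": 0.6, "b": 0.3, "c": 0.1}
# The two quantities are opposites of each other, up to rounding.
print(average_log_likelihood(sample, model), -cross_entropy_to_model(sample, model))
```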






Riemann integration

How does one compute the integral of a function? Rectangle method:

$$\int_a^b f(x)\,dx \approx \frac{b-a}{N} \sum_{i=0}^{N-1} f\left(a + \frac{(b-a)\,i}{N}\right)$$

Linear convergence: the error shrinks as O(1/N).
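A minimal sketch of the rectangle method, with the interval width (b - a) included in the prefactor. The test integrand sin(x) on [0, 1] is an arbitrary choice; multiplying N by 10 should divide the error by roughly 10.

```python
import math

def rectangle_rule(f, a, b, n):
    """Left-rectangle approximation of the integral of f over [a, b] with n cells."""
    width = (b - a) / n
    return width * sum(f(a + i * width) for i in range(n))

exact = 1.0 - math.cos(1.0)  # integral of sin(x) over [0, 1]
errors = {n: abs(rectangle_rule(math.sin, 0.0, 1.0, n) - exact) for n in (10, 100, 1000)}
for n, err in errors.items():
    print(n, err)
```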




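The curse of dimensionality

Multiple dimensions?

$$\int_{a_1}^{b_1} \cdots \int_{a_m}^{b_m} f(x)\,dx \approx \frac{\prod_{k=1}^{m} (b_k - a_k)}{N^m} \sum_{i_1=0}^{N-1} \cdots \sum_{i_m=0}^{N-1} f\left(a_1 + \frac{i_1}{N}(b_1 - a_1), \dots, a_m + \frac{i_m}{N}(b_m - a_m)\right)$$

Computation is exponential in m: the grid has N^m points.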
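To make the cost concrete, here is a sketch of the same rectangle grid in m dimensions that also counts function evaluations. The name `grid_integral` and the constant test integrand are assumptions for illustration.

```python
import itertools
import math

def grid_integral(f, lower, upper, n):
    """Rectangle rule on a regular grid with n cells per dimension; returns (value, evaluations)."""
    widths = [(b - a) / n for a, b in zip(lower, upper)]
    cell_volume = math.prod(widths)
    total = 0.0
    evaluations = 0
    for index in itertools.product(range(n), repeat=len(lower)):
        point = [a + i * w for a, i, w in zip(lower, index, widths)]
        total += f(point)
        evaluations += 1
    return total * cell_volume, evaluations

# Volume of the cube [0, 2]^3, with only 10 cells per dimension.
value, evaluations = grid_integral(lambda x: 1.0, [0.0] * 3, [2.0] * 3, 10)
print(value, evaluations)  # the evaluation count is 10**3
```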






Monte-Carlo integration

If P is a probability distribution over Ω with density p, draw x_i from P:

$$\int_{\Omega} f(x)\,dx \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{p(x_i)}$$

Very simple to implement; often p is uniform (p ∝ 1).
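A minimal sketch of the estimator, assuming a uniform sampling density on [0, 1] and the arbitrary integrand sin(x). The helper name `monte_carlo_integral` and its arguments are illustrative.

```python
import math
import random

def monte_carlo_integral(f, sample, density, n, rng):
    """Average of f(x_i) / p(x_i) over n points x_i drawn by `sample`, whose density is `density`."""
    total = 0.0
    for _ in range(n):
        x = sample(rng)
        total += f(x) / density(x)
    return total / n

rng = random.Random(0)
estimate = monte_carlo_integral(
    f=math.sin,
    sample=lambda r: r.uniform(0.0, 1.0),  # P is uniform on [0, 1] ...
    density=lambda x: 1.0,                 # ... so p(x) = 1 there
    n=100_000,
    rng=rng,
)
exact = 1.0 - math.cos(1.0)
print(estimate, exact)
```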




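Monte-Carlo convergence

Let the random variable X_p = f(x)/p(x), with x drawn from P.

If var(X_p) < ∞, the error is O(N^{-1/2}) by the central limit theorem, whatever the dimension!

If m > 2, Monte-Carlo becomes attractive: a rectangle grid with N points in m dimensions only achieves O(N^{-1/m}).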




Problems with MC

If the mass of f is concentrated in a small region, convergence can be very slow.

This is also a problem with Riemann integration...






Importance sampling

Preferentially sample the regions of interest by picking p to minimize the variance of f/p.

In Riemann world, this is equivalent to an irregular grid.

The ideal sampling distribution (if f > 0) is f / ∫f, but we don't know ∫f!

Best convergence when the χ² divergence of f w.r.t. p is minimized.
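As an illustration, the sketch below integrates a function whose mass sits far in a tail: the standard normal density restricted to x > 4. It compares sampling from the standard normal itself with sampling from a unit-variance normal centred at 4. The integrand, the shifted proposal and all names are assumptions chosen for the example.

```python
import math
import random

def normal_pdf(x, mean=0.0):
    """Density of a unit-variance normal distribution centred at `mean`."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def f(x):
    """Integrand: the standard normal density restricted to the tail x > 4."""
    return normal_pdf(x) if x > 4.0 else 0.0

def importance_estimate(n, proposal_mean, rng):
    """Estimate the integral of f by sampling x from N(proposal_mean, 1) and averaging f / p."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(proposal_mean, 1.0)
        total += f(x) / normal_pdf(x, proposal_mean)
    return total / n

rng = random.Random(1)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
naive = importance_estimate(50_000, 0.0, rng)    # almost no sample lands in the tail
shifted = importance_estimate(50_000, 4.0, rng)  # samples concentrate where f has its mass
print(exact, naive, shifted)
```

With the same number of samples, the shifted proposal should land much closer to the exact value than the naive one, whose estimate rests on a handful of hits.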






Adaptive importance sampling

What if we don’t know the shape of f ?

Learn it adaptively from the sampling.

Iteratively improve the importance sampling function.




Vegas algorithm and cross-entropy

Vegas algorithm: use histograms and treat the variables as separable.

Cross-entropy algorithm: pick p from a family of distributions to minimize cross-entropy to the sample.






Why cross-entropy?

In many cases, the expression is analytical and computationally cheap to derive, e.g.

the uniform distribution

the categorical distribution (finite, discrete)

the whole natural exponential family




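The natural exponential family?

$$f_X(x \mid \theta) = h(x) \exp\left(\theta^{\top} x - A(\theta)\right)$$

θ is the natural parameter; x is the sufficient statistic.

Maximum (relative) entropy distribution w.r.t. the base measure dH = h(x)dx for a given mean of x, which θ determines.

Examples (some with a parameter held fixed): normal, multivariate normal, gamma, binomial, multinomial, negative binomial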




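Beta distribution

Not analytical! To fit, start with approximate values from the method of moments, where X̄ is the sample mean and S² the sample variance:

$$\alpha = \bar{X} \left( \frac{\bar{X}(1 - \bar{X})}{S^2} - 1 \right), \quad \beta = (1 - \bar{X}) \left( \frac{\bar{X}(1 - \bar{X})}{S^2} - 1 \right)$$

The log-likelihood is given by

$$n \left( \ln\Gamma(\alpha + \beta) - \ln\Gamma(\alpha) - \ln\Gamma(\beta) \right) + (\alpha - 1) \sum_{i=1}^{n} \ln(X_i) + (\beta - 1) \sum_{i=1}^{n} \ln(1 - X_i)$$

The first and second derivatives of ln Γ are the digamma and trigamma functions, available in the GSL. Newton's method using the Jacobian converges in a couple of iterations. Very useful to model bounded variables.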
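The fitting recipe can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the talk's implementation: `digamma` and `trigamma` are hand-written approximations (recurrence plus asymptotic series) standing in for the GSL routines, and the sample is synthetic data drawn from Beta(2, 5).

```python
import math
import random

def digamma(x):
    """Approximate psi(x) for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return shift + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def trigamma(x):
    """Approximate psi'(x) for x > 0, by the same recurrence-plus-series approach."""
    shift = 0.0
    while x < 6.0:
        shift += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return shift + 1.0 / x + 0.5 * inv2 + (inv2 / x) * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))

def moment_estimates(xs):
    """Method-of-moments starting values for (alpha, beta)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

def log_likelihood(alpha, beta, xs):
    n = len(xs)
    return (n * (math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta))
            + (alpha - 1.0) * sum(math.log(x) for x in xs)
            + (beta - 1.0) * sum(math.log(1.0 - x) for x in xs))

def fit_beta(xs, iterations=5):
    """Newton's method on the log-likelihood, started from the moment estimates."""
    n = len(xs)
    sum_log_x = sum(math.log(x) for x in xs)
    sum_log_1mx = sum(math.log(1.0 - x) for x in xs)
    alpha, beta = moment_estimates(xs)
    for _ in range(iterations):
        # Gradient of the log-likelihood with respect to (alpha, beta).
        g_a = n * (digamma(alpha + beta) - digamma(alpha)) + sum_log_x
        g_b = n * (digamma(alpha + beta) - digamma(beta)) + sum_log_1mx
        # Second derivatives (the Jacobian of the gradient).
        h_ab = n * trigamma(alpha + beta)
        h_aa = h_ab - n * trigamma(alpha)
        h_bb = h_ab - n * trigamma(beta)
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: subtract the inverse second-derivative matrix times the gradient.
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        alpha, beta = alpha - step_a, beta - step_b
    return alpha, beta

rng = random.Random(0)
xs = [rng.betavariate(2.0, 5.0) for _ in range(5000)]
alpha0, beta0 = moment_estimates(xs)
alpha, beta = fit_beta(xs)
print(alpha0, beta0, alpha, beta)
```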






Surviving the zombie hordes

Figure: Electric fences, the horde and you






Simulating zombie breakouts

Each fence (Ui , λi ) delivers u ∼ max(Ui − Exp(λi ), 0) volts.

Crossing a fence deals u damage to a zombie

Zombies come from everywhere and can take 5 damage hits each.

Zombie outbreaks are very rare!




Mere integration fails!

We can estimate this probability by sampling the random voltages and finding a shortest path.

The variance of each Monte-Carlo sample is p_outbreak(1 − p_outbreak), so the relative error after N samples is about 1/sqrt(N p_outbreak): too slow!




Cross-Entropy to the rescue

We want to approximate the multivariate power distribution conditional on an outbreak occurring!

Approximate the shape by changing the parameters U_i and λ_i for each fence.

Generate samples, then fit U_i and λ_i on the samples inducing an outbreak.




The elite sample

What if the probability is so low that we don't observe any outbreak in our sample?

Generate n samples using the sampling distribution.

If more than e samples are outbreaks, fit to those samples and stop iterating.

Otherwise, fit on the e best samples, the elite sample.

Iterate.

Finally, generate a sample, weight each point by its importance-sampling weight, and estimate the probability.
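The steps above can be sketched on a much simpler rare event than the zombie model: estimating P(X > γ) for X exponential with mean 1, using exponentials with an adjustable mean as the sampling family. This toy problem, the likelihood-ratio weighting inside the fit (the usual cross-entropy update) and every name below are assumptions made for illustration.

```python
import math
import random

def ce_rare_event(gamma, n=2000, elite=200, final_n=20_000, seed=0):
    """Estimate P(X > gamma) for X ~ Exp(mean 1) with the elite-sample scheme."""
    rng = random.Random(seed)
    mean = 1.0  # current sampling distribution: exponential with this mean

    def weight(x, m):
        # Importance weight: density of Exp(mean 1) divided by density of Exp(mean m).
        return m * math.exp(-x * (1.0 - 1.0 / m))

    for _ in range(50):
        xs = sorted((rng.expovariate(1.0 / mean) for _ in range(n)), reverse=True)
        hits = [x for x in xs if x > gamma]
        done = len(hits) >= elite
        chosen = hits if done else xs[:elite]  # rare events if enough, else the elite sample
        # Weighted maximum-likelihood fit of the exponential mean on the chosen samples.
        ws = [weight(x, mean) for x in chosen]
        mean = sum(w * x for w, x in zip(ws, chosen)) / sum(ws)
        if done:
            break

    # Final importance-sampling estimate with the fitted sampling distribution.
    total = 0.0
    for _ in range(final_n):
        x = rng.expovariate(1.0 / mean)
        if x > gamma:
            total += weight(x, mean)
    return total / final_n, mean

estimate, fitted_mean = ce_rare_event(20.0)
print(estimate, math.exp(-20.0), fitted_mean)
```

Plain Monte-Carlo would need on the order of a billion samples to see even one such event; here a few thousand samples per iteration suffice.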




Other examples

Modeling rare events for any complex probability distribution, e.g. Bayesian networks.

Estimating tails for the sum of fat-tailed distributions






From integration to optimization

Using an elite sample to help convergence is a trick that does a form of hill climbing on a smooth function approximating the indicator function of the rare event.

Interesting even if we are not interested in integrating f.

Keep iterating based on an elite sample to converge towards one global maximum.

The variance of the sampling distribution follows the curvature of f.

E.g. using a multivariate normal allows the covariance to reflect the differential.
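A minimal sketch of the optimization loop, assuming independent normal sampling distributions per coordinate (a diagonal covariance rather than a full multivariate normal) and a toy quadratic objective. The function `ce_maximize` and its parameters are illustrative.

```python
import math
import random

def ce_maximize(f, mean, std, n=200, elite=20, iterations=25, seed=0):
    """Cross-entropy maximization of f with an independent normal per coordinate."""
    rng = random.Random(seed)
    mean, std = list(mean), list(std)
    for _ in range(iterations):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(n)]
        samples.sort(key=f, reverse=True)
        best = samples[:elite]  # the elite sample: the points with the highest f
        # Refit each normal on the elite sample (maximum-likelihood mean and spread).
        for k in range(len(mean)):
            column = [x[k] for x in best]
            mean[k] = sum(column) / elite
            std[k] = math.sqrt(sum((v - mean[k]) ** 2 for v in column) / elite)
    return mean, std

def objective(x):
    # Toy function with its maximum at (1, -2), ten times more curved along the second axis.
    return -(x[0] - 1.0) ** 2 - 10.0 * (x[1] + 2.0) ** 2

best_mean, best_std = ce_maximize(objective, mean=[0.0, 0.0], std=[5.0, 5.0])
print(best_mean)
```

During the run one would expect the spread along the more curved axis to be the narrower of the two, which is the sense in which the variance follows the curvature.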




Combinatorial optimization

One classical example is combinatorial optimization. To solve a TSP with Cross-Entropy:

Assume the travel is a Markov chain on the graph nodes.

Generate travels by coercing them to be permutations.

Update transition probabilities from the elite sample.
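The three steps can be sketched as follows on a toy instance (six cities on a regular hexagon, where the shortest tour is the perimeter). The smoothed update of the transition matrix, the parameter values and all names are assumptions for illustration, not the talk's code.

```python
import math
import random

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def sample_tour(trans, rng):
    """Walk the Markov chain from node 0, excluding visited nodes so the walk is a permutation."""
    tour = [0]
    unvisited = set(range(1, len(trans)))
    while unvisited:
        candidates = sorted(unvisited)
        weights = [trans[tour[-1]][j] for j in candidates]
        nxt = rng.choices(candidates, weights=weights)[0]
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def ce_tsp(dist, n_samples=300, elite=30, iterations=30, smoothing=0.7, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    # Start from uniform transition probabilities between distinct nodes.
    trans = [[0.0 if i == j else 1.0 / (n - 1) for j in range(n)] for i in range(n)]
    best = None
    for _ in range(iterations):
        tours = sorted((sample_tour(trans, rng) for _ in range(n_samples)),
                       key=lambda t: tour_length(t, dist))
        if best is None or tour_length(tours[0], dist) < tour_length(best, dist):
            best = tours[0]
        # Count the transitions used by the elite (shortest) tours.
        counts = [[0.0] * n for _ in range(n)]
        for tour in tours[:elite]:
            for i in range(n):
                counts[tour[i]][tour[(i + 1) % n]] += 1.0
        # Move the transition probabilities part of the way towards the elite frequencies.
        for i in range(n):
            for j in range(n):
                trans[i][j] = smoothing * counts[i][j] / elite + (1.0 - smoothing) * trans[i][j]
    return best

points = [(math.cos(2 * math.pi * k / 6), math.sin(2 * math.pi * k / 6)) for k in range(6)]
dist = [[math.dist(p, q) for q in points] for p in points]
best = ce_tsp(dist)
print(best, tour_length(best, dist))
```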




Clustering

CE does clustering too!

Assign probabilities of membership to classes for each point (the sampling distribution).

Sample random membership assignments.

Use average distance to centroids to find an elite sample.

Slower than K-means but much less sensitive to initial choice of centroids.




A form of global optimization

Is it global optimization?

If the sampling distribution is bounded below by a distribution that covers the global maximum, yes, with probability 1!

In practice we may never see one maximum and converge to another local maximum.






Fitting model parameters with CE

Cross-Entropy techniques generally work very well for finding the maximum-likelihood (ML) parameters of a model. Why?

Models often have different sensitivities to different parameters; CE reflects that.

With a covariance structure, it does a form of gradient ascent.

But it can deal with discrete parameters at the same time!

It does not tend to get trapped in local maxima.

Well suited for high-dimensional parameter spaces.






Forgetting maxima

Some maxima can be "forgotten"

Smooth changes in the sampling function.

Expand the sampling function (equivalent to applying a prior or "shrinkage").

Keep the entire sample
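The first two remedies could look like the following sketch. Both helpers, their names and their default values (`alpha`, `floor`, `factor`) are hypothetical; they only illustrate one way to smooth an update and to keep the sampling function from shrinking too fast.

```python
def smoothed_update(old_params, fitted_params, alpha=0.7):
    """Move only part of the way from the old parameters to the newly fitted ones."""
    return [alpha * new + (1.0 - alpha) * old for old, new in zip(old_params, fitted_params)]

def inflate_std(std, floor=0.05, factor=1.2):
    """Widen a fitted standard deviation and never let it drop below a floor."""
    return max(floor, factor * std)

print(smoothed_update([0.0, 1.0], [1.0, 3.0], alpha=0.5))  # halfway between old and new
print(inflate_std(0.01), inflate_std(1.0))
```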




Not converging to a maximum

Multiple maxima may prevent the variance of the sampling distribution from decreasing.

Mixtures of multivariate normals can deal with this.

They can be introduced dynamically.

Fit with EM.






Independent variables

If the sampling distribution is separable, convergence can be sped up by sampling over one dimension at a time.





Questions?

