ADVANCED MACHINE LEARNING
FALL 2017
VERSION: OCTOBER 12, 2017

Transcript of ADVANCED MACHINE LEARNING, FALL 2017

COURSE ADMIN

TERM TIMELINE

Instructors: Peter Orbanz, John Cunningham

Dates
First class: Sep 5/6
Python tutorial: 12/13 September
Midterm exam: 19/23 October
TensorFlow tutorial: 24/25 October
Final project due: 11 December
Final class: Dec 7/11

Advanced Machine Learning 3 / 212

ASSISTANTS AND GRADING

Teaching Assistants

Ian Kinsella and Wenda Zhou

Office Hours Mon/Tue 5:30-7:30pm, Room 1025, Dept of Statistics, 10th floor SSW

Class Homepage

https://wendazhou.com/teaching/AdvancedMLFall17/

Homework
• Some homework problems and the final project require coding.
• Coding: Python
• Homework due: Tue/Wed at 4pm – no late submissions
• You can drop two homeworks from your final score.

Grade
Homework (20%) + Midterm Exam (40%) + Final Project (40%)

Advanced Machine Learning 4 / 212

HOUSE RULES

Email: All email to the TAs, please.

The instructors will not read your email unless it is forwarded by a TA.

Problems with exam/project: If you cannot take the exam or finish the project, you must let us know at least one week before the midterm exam/project due date.

Advanced Machine Learning 5 / 212

READING

The relevant course material is the slides.

Books (optional)

See class homepage for references.

Advanced Machine Learning 6 / 212

OUTLINE (VERY TENTATIVE)

Part I (Orbanz)

• Neural networks (basic definitions and training)
• Graphical models (ditto)
• Sampling algorithms
• Variational inference
• Optimization for GMs and NNs

Part II (Cunningham)

• NN software
• Convolutional NNs and computer vision
• Recurrent NNs
• Reinforcement learning
• Dimension reduction and autoencoders

Advanced Machine Learning 7 / 212

INTRODUCTION

AGAIN: MACHINE LEARNING

Historical origins: Artificial intelligence and engineering
Machines need to...
• recognize patterns (e.g. vision, language)
• make decisions based on experience (= data)
• predict
• cope with uncertainty

Modern applications: (A few) Examples
• medical diagnosis
• face detection/recognition
• speech and handwriting recognition
• web search
• recommender systems
• bioinformatics
• natural language processing
• computer vision

Today

Machine learning and statistics have become hard to tell apart.

Advanced Machine Learning 9 / 212

LEARNING AND STATISTICS

Task
Balance the pendulum upright by moving the sled left and right.
• The computer can control only the motion of the sled.
• Available data: Current state of the system (measured 25 times/second).

Formalization
State = 4 variables (sled location, sled velocity, angle, angular velocity)

Actions = sled movements

The system can be described by a function

f : S × A → S
(state, action) ↦ state

Advanced Machine Learning 10 / 212


LEARNING AND STATISTICS

Advanced Machine Learning 11 / 212

LEARNING AND STATISTICS

After each run
Fit a function

f : S × A → S
(state, action) ↦ state

to the data obtained in previous runs.

Running the system involves:
1. The function f, which tells the system “how the world works”.
2. An optimization method that uses f to determine how to move towards the optimal state.

Note well
Learning how the world works is a regression problem.

Advanced Machine Learning 12 / 212

OUR MAIN TOPICS

Neural networks
• Define functions
• Represented by directed graph
• Each vertex represents a function
• Incoming edges: Function arguments
• Outgoing edges: Function values
• Learning: Differentiation/optimization

Graphical models
• Define distributions
• Represented by directed graph
• Each vertex represents a distribution
• Incoming edges: Conditions
• Outgoing edges: Draws from distribution
• Learning: Estimation/inference

Advanced Machine Learning 13 / 212

RECALL PREVIOUS TERM

(Figure: a linear classifier, with the regions sgn(⟨v_H, x⟩ − c) < 0 and sgn(⟨v_H, x⟩ − c) > 0 on either side of the hyperplane H.)

              Supervised learning            Unsupervised learning
Problems      Classification, Regression     Clustering (mixture models), HMMs, Dimension reduction (PCA)
Solutions     Functions                      Distributions

Advanced Machine Learning 14 / 212

Neural networks | Graphical models

(Figure: a layered neural network with inputs x1, x2, x3, weights v1, v2, v3 and units computing y = φ(v^T x), feeding into further layers ψ; shown next to a directed graphical model with the same layered structure.)

Advanced Machine Learning 15 / 212

NNS AND GRAPHICAL MODELS

Neural networks
• Representation of a function using a graph
• Layers:

    x → g → f

• Symbolizes: f(g(x))

“f depends on x only through g”

Graphical models
• Representation of a distribution using a graph
• Layers:

    X → Y → Z

• Symbolizes: p(x, z, y) = p(z|y)p(y|x)p(x)

“Z is conditionally independent of X given Y”

(Figure: one network layer with inputs x1, x2, x3, weight vectors v1, v2, v3 (components v_jk), and outputs y_j = φ(v_j^T x).)

Advanced Machine Learning 16 / 212

Grouping dependent variables into layers is a good thing.

Advanced Machine Learning 17 / 212

HISTORICAL PERSPECTIVE: MCCULLOCH-PITTS NEURON MODEL (1943)

A neuron is modeled as a “thresholding device” that combines input signals:

(Figure: input signals x1, x2, x3 feeding into a single output unit y.)

McCulloch-Pitts model
• Collect the input signals x1, x2, x3 into a vector x = (x1, x2, x3) ∈ R^3.
• Choose a fixed vector v ∈ R^3 and a constant c ∈ R.
• Compute:

    y = I{⟨v, x⟩ > c} .
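Not part of the slides: a minimal Python sketch of this thresholding unit; the particular values of v, c and the input x below are arbitrary examples.

import numpy as np

def mcculloch_pitts(x, v, c):
    """McCulloch-Pitts unit: y = I{<v, x> > c}."""
    return int(np.dot(v, x) > c)

# arbitrary example values, not taken from the slides
v = np.array([0.5, -1.0, 2.0])
c = 0.25
x = np.array([1.0, 0.0, 1.0])
print(mcculloch_pitts(x, v, c))  # 1, since <v, x> = 2.5 > 0.25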

Advanced Machine Learning 18 / 212

READING THE DIAGRAM

(Figure: inputs x1, x2, x3 with weights v1, v2, v3 and an additional constant input −1 with weight c, feeding into a unit that applies I{• > 0} to produce y.)

y = I{ v1 x1 + v2 x2 + v3 x3 + (−1) · c > 0 } = I{⟨v, x⟩ > c}

Recall: Linear classifier

f(x) = sgn(⟨v, x⟩ − c)

Advanced Machine Learning 19 / 212

LINEAR CLASSIFICATION

(Figure: a hyperplane with normal vector v and a point x; the projection of x onto v has length ⟨x, v⟩/‖v‖.)

f(x) = sgn(⟨v, x⟩ − c)

Advanced Machine Learning 20 / 212

REMARKS

(Figure: the same one-layer network with inputs x1, x2, x3 and weights v1, v2, v3, computing y = I{v^T x > c}.)

• The “neural network” represents a linear two-class classifier (on R^3).
• It does not specify the training method.
• To train the classifier, we need a cost function and an optimization method.

Advanced Machine Learning 21 / 212

TRAINING

• For parameter estimation by optimization, we need an optimization target.
• Idea: Choose 0-1 loss as the simplest loss for classification.
• Minimize empirical risk (on training data) under this loss.

(Figure 5.11 from Duda, Hart & Stork: four learning criteria as a function of the weights of a linear classifier — the number of misclassified patterns (piecewise constant, unsuitable for gradient descent), the Perceptron criterion (piecewise linear, suitable for gradient descent), squared error, and squared error with margin.)

Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001

Advanced Machine Learning 22 / 212

THE PERCEPTRON CRITERION

• A piecewise constant function is not suitable for numerical optimization.
• Approximate it by a piecewise linear function → perceptron cost function.

(Figure 5.11 from Duda, Hart & Stork, as above: of the four learning criteria, the Perceptron criterion is the piecewise linear one that is suitable for gradient descent.)

Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001

Advanced Machine Learning 23 / 212

PERCEPTRON

“Train” the McCulloch-Pitts model (that is: estimate (c, v)) by applying gradient descent to the function

C_p(c, v) := ∑_{i=1}^n I{sgn(⟨v, x_i⟩ − c) ≠ y_i} · |⟨(c, v), (1, x_i)⟩| ,

called the Perceptron cost function.
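Not part of the slides: a minimal batch-perceptron sketch in Python. It uses the augmented representation w = (v, −c), x̃ = (x, 1) introduced later for logistic regression, labels y_i ∈ {−1, +1}, and an explicit learning rate and epoch limit; these choices are assumptions of the sketch, not prescriptions from the course.

import numpy as np

def perceptron_train(X, y, lr=1.0, epochs=100):
    """Gradient descent on the perceptron criterion.

    X: (n, d) data matrix; y: labels in {-1, +1}.
    Augmented representation: rows x~ = (x, 1), weights w = (v, -c),
    so that <w, x~> = <v, x> - c.
    """
    n, d = X.shape
    Xa = np.hstack([X, np.ones((n, 1))])       # append the constant coordinate
    w = np.zeros(d + 1)
    for _ in range(epochs):
        margins = y * (Xa @ w)
        mis = margins <= 0                     # misclassified (or on the boundary)
        if not mis.any():
            break                              # data linearly separated: done
        # batch step: add the (signed) sum of the misclassified samples
        w += lr * (y[mis, None] * Xa[mis]).sum(axis=0)
    return w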

Advanced Machine Learning 24 / 212

OUTLINE (VERY TENTATIVE)

Part I (Orbanz)

• Neural networks (basic definitions and training)
• Graphical models (ditto)
• Sampling algorithms
• Variational inference
• Optimization for GMs and NNs

Part II (Cunningham)

• NN software
• Convolutional NNs and computer vision
• Recurrent NNs
• Reinforcement learning
• Dimension reduction and autoencoders

Advanced Machine Learning 25 / 212

TOOLS: LOGISTIC REGRESSION

SIGMOIDS

Sigmoid function

σ(x) = 1 / (1 + e^{−x})

(Figure: plot of the sigmoid on [−10, 10], rising from 0 to 1.)

Note

1 − σ(x) = (1 + e^{−x} − 1) / (1 + e^{−x}) = 1 / (e^x + 1) = σ(−x)

Derivative

dσ/dx (x) = e^{−x} / (1 + e^{−x})^2 = σ(x)(1 − σ(x))

(Figure: sigmoid (blue) and its derivative (red).)
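Not on the slides: a short numerical sanity check of the two identities above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10.0, 10.0, 41)
print(np.allclose(1 - sigmoid(x), sigmoid(-x)))                 # 1 - sigma(x) = sigma(-x)
h = 1e-6
num_deriv = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)         # finite-difference derivative
print(np.allclose(num_deriv, sigmoid(x) * (1 - sigmoid(x))))    # sigma' = sigma (1 - sigma)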

Advanced Machine Learning 27 / 212

APPROXIMATING DECISION BOUNDARIES

• In linear classification: The decision boundary is a discontinuity.
• The boundary is represented either by the indicator function I{• > c} or the sign function sign(• − c).
• These representations are equivalent: Note sign(• − c) = 2 · I{• > c} − 1.

(Figure: a sigmoid plotted as a smooth approximation of the indicator step.)

The most important use of the sigmoid function in machine learning is as a smooth approximation to the indicator function.

Given a sigmoid σ and a data point x, we decide which side of the approximated boundary we are on by thresholding

σ(x) ≥ 1/2

Advanced Machine Learning 28 / 212

SCALING

We can add a scale parameter by defining

σ_θ(x) := σ(θx) = 1 / (1 + e^{−θx})   for θ ∈ R

(Figure: sigmoids σ_θ for increasing θ, getting steeper around 0.)

Influence of θ
• As θ increases, σ_θ approximates I more closely.
• For θ → ∞, the sigmoid converges to I pointwise, that is: For every x ≠ 0, we have

    σ_θ(x) → I{x > 0} as θ → +∞ .

• Note σ_θ(0) = 1/2 always, regardless of θ.

Advanced Machine Learning 29 / 212

APPROXIMATING A LINEAR CLASSIFIER

So far, we have considered R, but linear classifiers usually live in R^d.

The decision boundary of a linear classifier in R^2 is a discontinuous ridge:
• This is a linear classifier of the form I{⟨v, x⟩ − c > 0}.
• Here: v = (1, 1) and c = 0.

We can “stretch” σ into a ridge function on R^2:
• This is the function x = (x1, x2) ↦ σ(x1).
• The ridge runs parallel to the x2-axis.
• If we use σ(x2) instead, we rotate by 90 degrees (still axis-parallel).

Advanced Machine Learning 30 / 212

STEERING A SIGMOID

Just as for a linear classifier, we use a normal vector v ∈ Rd .

• The function σ(⟨v, x⟩ − c) is a sigmoid ridge, where the ridge is orthogonal to the normal vector v, and c is an offset that shifts the ridge “out of the origin”.
• The plot on the right shows the normal vector (here: v = (1, 1)) in black.
• The parameters v and c have the same meaning for I and σ, that is, σ(⟨v, x⟩ − c) approximates I{⟨v, x⟩ ≥ c}.

Advanced Machine Learning 31 / 212

LOGISTIC REGRESSION

Logistic regression is a classification method that approximates decision boundaries by sigmoids.

Setup
• Two-class classification problem
• Observations x1, . . . , xn ∈ R^d, class labels yi ∈ {0, 1}.

The logistic regression model
We model the conditional distribution of the class label given the data as

P(y|x) := Bernoulli( σ(⟨v, x⟩ − c) ) .

• Recall σ(⟨v, x⟩ − c) takes values in [0, 1] for every x, and the value 1/2 on the class boundary.
• The logistic regression model interprets this value as the probability of being in class y.

Advanced Machine Learning 32 / 212

LEARNING LOGISTIC REGRESSION

Since the model is defined by a parametric distribution, we can apply maximum likelihood.

Notation
Recall from Statistical Machine Learning: We collect the parameters in a vector w by writing

w := (v, −c)   and   x := (x, 1)   so that ⟨w, x⟩ = ⟨v, x⟩ − c .

Likelihood function of the logistic regression model

∏_{i=1}^n σ(⟨w, x_i⟩)^{y_i} (1 − σ(⟨w, x_i⟩))^{1−y_i}

Negative log-likelihood

L(w) := − ∑_{i=1}^n ( y_i log σ(⟨w, x_i⟩) + (1 − y_i) log(1 − σ(⟨w, x_i⟩)) )
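Not part of the slides: a small NumPy sketch of L(w) and its gradient (the gradient formula is stated on the next slide); the epsilon inside the logarithms is a numerical safeguard, not part of the model.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X, y, eps=1e-12):
    """L(w) = -sum_i [ y_i log s_i + (1 - y_i) log(1 - s_i) ] with s_i = sigmoid(<w, x_i>).

    X is assumed to already contain the constant column, i.e. its rows are (x_i, 1)."""
    s = sigmoid(X @ w)
    return -np.sum(y * np.log(s + eps) + (1 - y) * np.log(1 - s + eps))

def gradient(w, X, y):
    """grad L(w) = sum_i (sigmoid(<w, x_i>) - y_i) x_i."""
    return X.T @ (sigmoid(X @ w) - y)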

Advanced Machine Learning 33 / 212

MAXIMUM LIKELIHOOD

∇L(w) = ∑_{i=1}^n ( σ(w^T x_i) − y_i ) x_i

Note
• Each training data point x_i contributes to the sum proportionally to the approximation error σ(w^T x_i) − y_i incurred at x_i by approximating the linear classifier by a sigmoid.

Maximum likelihood
• The ML estimator ŵ for w is the solution of

    ∇L(w) = 0 .

• For logistic regression, this equation has no solution in closed form.
• To find ŵ, we use numerical optimization.
• The function L is convex (= ∪-shaped).

Advanced Machine Learning 34 / 212

RECALL FROM STATISTICAL MACHINE LEARNING

• If f is differentiable, we can apply gradient descent:

    x^(k+1) := x^(k) − ∇f(x^(k))

  where x^(k) is the candidate solution in step k of the algorithm.
• If the Hessian matrix H_f of partial second derivatives exists and is invertible, we can apply Newton’s method, which converges faster:

    x^(k+1) := x^(k) − H_f^{−1}(x^(k)) · ∇f(x^(k))

• Recall that the Hessian matrix of a (twice continuously differentiable) function f : R^d → R is

    H_f(x) := ( ∂²f / ∂x_i ∂x_j )_{i,j ≤ d}

  Since f is twice differentiable, each ∂²f/∂x_i∂x_j exists; since it is twice continuously differentiable, ∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i, so H_f is symmetric.
• The inverse of H_f(x) exists if and only if the matrix is positive definite (semidefinite does not suffice), which in turn is true if and only if f is strictly convex.
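Not on the slides: a minimal sketch of the two update rules. The slide's gradient-descent rule has no explicit step size; the learning rate lr below is a standard practical addition.

import numpy as np

def gradient_descent_step(x, grad, lr=0.1):
    """x^(k+1) = x^(k) - lr * grad f(x^(k))."""
    return x - lr * grad(x)

def newton_step(x, grad, hess):
    """x^(k+1) = x^(k) - H_f(x^(k))^{-1} grad f(x^(k))."""
    return x - np.linalg.solve(hess(x), grad(x))

# Toy example: f(x) = ||x||^2 / 2 has gradient x and Hessian I.
grad = lambda x: x
hess = lambda x: np.eye(len(x))
print(newton_step(np.array([3.0, -2.0]), grad, hess))  # jumps straight to the minimum [0, 0]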

Advanced Machine Learning 35 / 212

NEWTON’S METHOD FOR LOGISTIC REGRESSION

Applying Newton

w^(k+1) := w^(k) − H_L^{−1}(w^(k)) · ∇L(w^(k))

Matrix notation

X := the n × (d + 1) matrix whose i-th row is ( 1, (x_i)_1, . . . , (x_i)_d )

D_σ := diag( σ(w^T x_1)(1 − σ(w^T x_1)), . . . , σ(w^T x_n)(1 − σ(w^T x_n)) )

X is the data matrix (or design matrix) you know from linear regression. X has size n × (d + 1) and D_σ is n × n.

Newton step

w^(k+1) = (X^T D_σ X)^{−1} X^T D_σ ( X w^(k) − D_σ^{−1} ( (σ(⟨w^(k), x_1⟩), . . . , σ(⟨w^(k), x_n⟩))^T − (y_1, . . . , y_n)^T ) )

Advanced Machine Learning 36 / 212

NEWTON’S METHOD FOR LOGISTIC REGRESSION

w^(k+1) = (X^T D_σ X)^{−1} X^T D_σ ( X w^(k) − D_σ^{−1}( σ(Xw^(k)) − y ) )
        = (X^T D_σ X)^{−1} X^T D_σ u^(k) ,

where u^(k) := X w^(k) − D_σ^{−1}( σ(Xw^(k)) − y ), with σ applied componentwise.

Compare this to the least squares solution of a linear regression problem:

β̂ = (X^T X)^{−1} X^T y          w^(k+1) = (X^T D_σ X)^{−1} X^T D_σ u^(k)

Differences:
• The vector y of regression responses is substituted by the vector u^(k) above.
• The matrix X^T X is substituted by the matrix X^T D_σ X.
• Note that matrices of product form X^T X are positive semidefinite; since D_σ is diagonal with non-negative entries, so is X^T D_σ X.

Iteratively Reweighted Least Squares
• At each step, the algorithm solves a least-squares problem “reweighted” by the matrix D_σ.
• Since this happens at each step of an iterative algorithm, Newton’s method applied to the logistic regression log-likelihood is also known as Iteratively Reweighted Least Squares.
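Not part of the slides: a compact IRLS sketch in NumPy. The small jitter added to X^T D_σ X and the clipping of the diagonal of D_σ are numerical safeguards of this sketch, not part of the algorithm as stated above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls(X, y, steps=25):
    """Newton's method / IRLS for logistic regression.

    X: (n, d+1) design matrix including the constant column; y: labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        s = sigmoid(X @ w)
        d = np.maximum(s * (1 - s), 1e-12)        # diagonal of D_sigma (clipped)
        u = X @ w - (s - y) / d                   # u^(k) = X w - D^{-1}(sigma - y)
        A = X.T @ (d[:, None] * X) + 1e-8 * np.eye(X.shape[1])
        w = np.linalg.solve(A, X.T @ (d * u))     # reweighted least-squares solve
    return w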

Advanced Machine Learning 37 / 212

OTHER OPTIMIZATION METHODS

Newton: Cost
• The size of the Hessian is (d + 1) × (d + 1).
• In high-dimensional problems, inverting H_L can become problematic.

Other methods
Maximum likelihood only requires that we minimize the negative log-likelihood; we can choose any numerical method, not just Newton. Alternatives include:
• Pseudo-Newton methods (only invert H_L once, for w^(1), but do not guarantee quadratic convergence).
• Gradient methods.
• Approximate gradient methods, like stochastic gradient.

Advanced Machine Learning 38 / 212

OVERFITTING

Recall from Statistical Machine Learning

(Figure: a hyperplane H with normal vector v and a point x.)

• If we increase the length of v without changing its direction, the sign of ⟨v, x⟩ does not change, but the value changes.
• That means: If v is the normal vector of a classifier, and we scale v by some θ > 0, the decision boundary does not move, but ⟨θv, x⟩ = θ⟨v, x⟩.

Effect inside a sigmoid

σ(⟨θv, x⟩) = σ(θ⟨v, x⟩) = σ_θ(⟨v, x⟩)

As the length of v increases, σ(⟨v, x⟩) becomes more similar to I{⟨v, x⟩ > 0}.

(Figure: sigmoids that become steeper as v gets longer.)

Advanced Machine Learning 39 / 212

EFFECT ON ML FOR LOGISTIC REGRESSION

(Figure: a sigmoid with a training point x_i marked, and the error σ(w^T x_i) − y_i at that point.)

• Recall each training data point x_i contributes an error term σ(w^T x_i) − y_i to the log-likelihood.
• By increasing the length of w, we can make σ(w^T x_i) − y_i arbitrarily small without moving the decision boundary.

Advanced Machine Learning 40 / 212

OVERFITTING

Consequence for linearly separable data
• Once the decision boundary is correctly located between the two classes, the maximization algorithm can increase the log-likelihood arbitrarily by increasing the length of w.
• That does not move the decision boundary, but the logistic function looks more and more like the indicator I.
• That may fit the training data more tightly, but can lead to bad generalization (e.g. for similar reasons as for the perceptron, where the decision boundary may end up very close to a training data point).

That is a form of overfitting.

Data that is not linearly separable
• If the data is not separable, sufficiently many points on the “wrong” side of the decision boundary prevent overfitting (since making w larger increases the error contributions of these points).
• For large data sets, overfitting can still occur if the fraction of such points is small.

Solutions
• Overfitting can be addressed by including an additive penalty of the form L(w) + λ‖w‖.

Advanced Machine Learning 41 / 212

LOGISTIC REGRESSION FOR MULTIPLE CLASSES

Bernoulli and multinomial distributions
• The multinomial distribution of N draws from K categories with parameter vector (θ_1, . . . , θ_K) (where ∑_{k≤K} θ_k = 1) has probability mass function

    P(m_1, . . . , m_K | θ_1, . . . , θ_K) = N! / (m_1! · · · m_K!) · ∏_{k=1}^K θ_k^{m_k}   where m_k = # draws in category k

• Note that Bernoulli(p) = Multinomial(p, 1 − p; N = 1).

Logistic regression
• Recall two-class logistic regression is defined by P(Y|x) = Bernoulli(σ(w^T x)).
• Idea: To generalize logistic regression to K classes, choose a separate weight vector w_k for each class k, and define P(Y|x) by

    Multinomial( σ̄(w_1^T x), . . . , σ̄(w_K^T x) )   where σ̄(w_k^T x) := σ(w_k^T x) / ∑_{l≤K} σ(w_l^T x) .

Advanced Machine Learning 42 / 212

LOGISTIC REGRESSION FOR MULTIPLE CLASSES

Logistic regression for K classes
The label y now takes values in {1, . . . , K}.

P(y|x) = ∏_{k=1}^K σ̄(w_k^T x)^{I{y=k}}

The negative log-likelihood becomes

L(w_1, . . . , w_K) = − ∑_{i≤n, k≤K} I{y_i = k} log σ̄(w_k^T x_i)

This can again be optimized numerically.
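Not part of the slides: a NumPy sketch of the K-class model, using the normalized class probabilities σ̄ from the previous slide; the weight matrix W below stacks the w_k as rows.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def class_probs(W, x):
    """P(Y = k | x) proportional to sigmoid(w_k^T x), normalized over the K classes."""
    s = sigmoid(W @ x)
    return s / s.sum()

def neg_log_likelihood(W, X, y):
    """Negative log-likelihood: minus the summed log-probability of the observed classes y_i."""
    P = sigmoid(X @ W.T)                      # (n, K) unnormalized class scores
    P = P / P.sum(axis=1, keepdims=True)      # normalize each row
    return -np.sum(np.log(P[np.arange(X.shape[0]), y]))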

Comparison to two-class case
• Recall that 1 − σ(x) = σ(−x).
• That means

    Bernoulli( σ(⟨v, x⟩ − c) ) ≡ Multinomial( σ(w^T x), σ(−w^T x) )

• That is: Two-class logistic regression as above is equivalent to multiclass logistic regression with K = 2, provided we choose w_2 = −w_1.

Advanced Machine Learning 43 / 212

GRAPHICAL MODELS

GRAPHICAL MODELS

A graphical model represents the dependence structure within a set of random variablesas a graph.

Overview
Roughly speaking:
• Each random variable is represented by a vertex.
• If Y depends on X, we draw an edge X → Y.
• For example:

    (Diagram: Z with edges Z → X and Z → Y.)

  This says: “X depends on Z, and Y depends on Z”.
• We have to be careful: The above does not imply that X and Y are independent. We have to make more precise what “depends on” means.

Advanced Machine Learning 45 / 212

We will use the notation:

L(X) = distribution of the random variable X

L(X|Y) = conditional distribution of X given Y

(L means “law”.)

Reason
• If X is discrete, L(X) is usually given by a mass function P(x).
• If it is continuous, L(X) is usually given by a density p(x).
• With the notation above, we do not have to distinguish between discrete and continuous variables.

Advanced Machine Learning 46 / 212

DEPENDENCE AND INDEPENDENCE

Dependence between random variables X_1, . . . , X_n is a property of their joint distribution L(X_1, . . . , X_n).

Recall
Two random variables are stochastically independent, or independent for short, if their joint distribution factorizes:

L(X, Y) = L(X)L(Y)

For densities/mass functions:

P(x, y) = P(x)P(y)   or   p(x, y) = p(x)p(y)

Dependent means not independent.

Intuitively
X and Y are dependent if knowing the outcome of X provides any information about the outcome of Y.

More precisely:
• If someone draws (X, Y) simultaneously, and only discloses X = x to you, does that change your mind about the distribution of Y? (If so: Dependence.)
• Once X is given, the distribution of Y is the conditional L(Y|X = x).
• If that is still L(Y), as before X was drawn, the two are independent. If L(Y|X = x) ≠ L(Y), they are dependent.

Advanced Machine Learning 47 / 212

CONDITIONAL INDEPENDENCE

Definition
Given random variables X, Y, Z, we say that X is conditionally independent of Y given Z if

L(X, Y|Z = z) = L(X|Z = z)L(Y|Z = z) .

That is equivalent to

L(X|Y = y, Z = z) = L(X|Z = z) .

Notation

X ⊥⊥_Z Y

Intuitively
X and Y are dependent given Z = z if, although Z is known, knowing the outcome of X provides additional information about the outcome of Y.

Advanced Machine Learning 48 / 212

GRAPHICAL MODEL NOTATION

Factorizing a joint distribution
The joint probability of random variables X_1, . . . , X_n can always be factorized as

L(X_1, . . . , X_n) = L(X_n|X_1, . . . , X_{n−1}) L(X_{n−1}|X_1, . . . , X_{n−2}) · · · L(X_1) .

Note that we can re-arrange the variables in any order.

If there are conditional independencies, we can remove some variables from the conditionals:

L(X_1, . . . , X_n) = L(X_n|𝒳_n) L(X_{n−1}|𝒳_{n−1}) · · · L(X_1) ,

where 𝒳_i is the subset of X_1, . . . , X_n on which X_i depends.

Definition
Let X_1, . . . , X_n be random variables. A (directed) graphical model represents a factorization of the joint distribution L(X_1, . . . , X_n) as follows:
• Factorize L(X_1, . . . , X_n).
• Add one vertex for each variable X_i.
• For each variable X_i, add an edge from each variable X_j ∈ 𝒳_i to X_i.

That is: An edge X_j → X_i is added if X_j appears in the conditional of the factor L(X_i|𝒳_i).

Advanced Machine Learning 49 / 212

GRAPHICAL MODEL NOTATION

Lack of uniqueness
The factorization is usually not unique, since e.g.

L(X, Y) = L(X|Y)L(Y) = L(Y|X)L(X) .

That means the direction of edges is not generally determined.

Remark
• If we use a graphical model to define or visualize a model, we decide on the direction of the edges.
• Estimating the direction of edges from data is a very difficult (and very important) problem. This is one of the main subjects of a research field called causal inference or causality.

Advanced Machine Learning 50 / 212

A simple example

(Diagram: Z with edges Z → X and Z → Y; this encodes X ⊥⊥_Z Y.)

An example with layers

(Diagram: two layers of vertices, with edges from the vertices of layer 1 into the vertices of layer 2.)

All variables in the (k + 1)st layer are conditionally independent given the variables in the kth layer.

Advanced Machine Learning 51 / 212

WORDS OF CAUTION I

(Diagram: Z with edges Z → X and Z → Y, encoding X ⊥⊥_Z Y.)

Important
• X and Y are not independent; independence holds only conditionally on Z.
• In other words: If we do not observe Z, X and Y are dependent, and we have to change the graph to X → Y or X ← Y.

Advanced Machine Learning 52 / 212

WORDS OF CAUTION II

(Diagram: X and Y with edges X → Z and Y → Z.)

Conditioning on Z makes X and Y dependent.

Example
• Suppose we start with two independent normal variables X and Y.
• Z = X + Y.

If we know Z, and someone reveals the value of Y to us, we know everything about X.

This effect is known as explaining away. We will revisit it later.

Advanced Machine Learning 53 / 212

MACHINE LEARNING EXAMPLES I

Hidden Markov models

(Diagram: a chain Z_1 → Z_2 → · · · → Z_{n−1} → Z_n, with an edge Z_i → X_i from each hidden state to its observation.)

Advanced Machine Learning 54 / 212

MACHINE LEARNING EXAMPLES II

Sigmoid belief networks
A graphical model in which each node is a binary random variable and each conditional probability is a logistic regression model is called a sigmoid belief network.

Terminology: “Belief network” or “Bayes net” are alternative names for graphical models.

Deep belief networks
A deep belief network is a layered directed graphical model (see the diagram below). Two tasks for deep belief nets are:
• Sampling: Draw X_{1:d} from L(X_{1:d}|Θ_{1:d}). Note: Variables in each layer are conditionally independent given the previous layer.
• Inference: Estimate L(Θ_{1:d}|X_{1:d} = x_{1:d}) when data x_{1:d} is observed. Problem: Conditioning a layer on the following one makes variables dependent.

(More details later.)

(Diagram: a stack of layers ending in an input layer Θ_1, . . . , Θ_d, which feeds into the data layer X_1, . . . , X_d.)

Advanced Machine Learning 55 / 212

MARKOV RANDOM FIELDS

UNDIRECTED GRAPHICAL MODEL

• A graphical model is undirected when its dependency graph is undirected; equivalently, if each edge in the graph is either absent, or present in both directions.

(Diagram: the same three-vertex graph shown with directed edges and with undirected edges.)

• An undirected graphical model is more commonly known as a Markov random field.
• Markov random fields are special cases of (directed) graphical models, but have distinct properties. We treat them separately.
• We will consider the undirected case first.

Advanced Machine Learning 57 / 212

OVERVIEW

We start with an undirected graph:

(Figure: a two-dimensional grid of vertices Θ_i with edge weights such as w_{i−1,i} and w_{i+1,j+1} between neighboring vertices.)

A random variable Θ_i is associated with each vertex. Two random variables interact if they are neighbors in the graph.

Advanced Machine Learning 58 / 212

NEIGHBORHOOD GRAPH

• We define a neighborhood graph, which is a weighted, undirected graph

    N = (V_N, W_N)

  with vertex set V_N and set of edge weights W_N. The vertices v_i ∈ V_N are often referred to as sites.
• The edge weights are scalars w_ij ∈ R. Since the graph is undirected, the weights are symmetric (w_ij = w_ji).
• An edge weight w_ij = 0 means “no edge between v_i and v_j”.

Neighborhoods
The set of all neighbors of v_i in the graph,

    ∂(i) := { j | w_ij ≠ 0 } ,

is called the neighborhood of v_i.

(Figure: a vertex v_i with its neighborhood ∂(i) highlighted in purple.)

Advanced Machine Learning 59 / 212

MARKOV RANDOM FIELDS

Given a neighborhood graph N, associate with each site v_i ∈ V_N a random variable Θ_i.

The Markov property
We say that the joint distribution P of (Θ_1, . . . , Θ_n) satisfies the Markov property with respect to N if

L(Θ_i | Θ_j, j ≠ i) = L(Θ_i | Θ_j, j ∈ ∂(i)) .

The set {Θ_j, j ∈ ∂(i)} of random variables indexed by neighbors of v_i is called the Markov blanket of Θ_i.

In words
The Markov property says that each Θ_i is conditionally independent of the remaining variables given its Markov blanket.

Definition
A distribution L(Θ_1, . . . , Θ_n) which satisfies the Markov property for a given graph N is called a Markov random field.

(Figure: a grid vertex Θ_i with its Markov blanket, the neighboring vertices.)

Advanced Machine Learning 60 / 212

ENERGY FUNCTIONS

Probabilities and energies
A (strictly positive) density p(x) can always be written in the form

p(x) = (1/Z) exp(−H(x))   where H : X → R_+

and Z is a normalization constant.

The function H is called an energy function, or cost function, or a potential.

MRF energy
In particular, we can write an MRF density for random variables Θ_{1:n} as

p(θ_1, . . . , θ_n) = (1/Z) exp(−H(θ_1, . . . , θ_n))

Advanced Machine Learning 61 / 212

CLIQUES

Graphical models factorize over the graph. How does that work for MRFs?

A clique in a graph is a fully connected subgraph. In undirected graphs:

(Figure: a clique and a non-clique; an example graph on vertices 1–6 containing the triangles (1, 2, 3) and (1, 3, 4).)

The cliques in this graph are:
i) The triangles (1, 2, 3) and (1, 3, 4).
ii) Each pair of vertices connected by an edge (e.g. (2, 6)).

Advanced Machine Learning 62 / 212

THE HAMMERSLEY-CLIFFORD THEOREM

Theorem
Let N be a neighborhood graph with vertex set V_N. Suppose the random variables {Θ_i, i ∈ V_N} take values in T, and their joint distribution has probability mass function P, so there is an energy function H such that

P(θ_1, . . . , θ_n) = e^{−H(θ_1,...,θ_n)} / ∑_{θ'_1,...,θ'_n ∈ T} e^{−H(θ'_1,...,θ'_n)} .

The joint distribution has the Markov property if and only if

H(θ_1, θ_2, . . .) = ∑_{C ∈ 𝒞} H_C(θ_i, i ∈ C) ,

where 𝒞 is the set of cliques in N, and each H_C is a non-negative function with |C| arguments. Hence,

P(θ_1, . . . , θ_n) = ∏_{C ∈ 𝒞} e^{−H_C(θ_i, i ∈ C)} / ∑_{θ'_i ∈ T, i ∈ C} e^{−H_C(θ'_i, i ∈ C)}

Markov random fields factorize over cliques.

Advanced Machine Learning 63 / 212

USE OF MRFS

In general
• Modeling systems of dependent RVs is one of the hardest problems in probability.
• MRFs model dependence, but break it down to a limited number of interactions to make the model tractable.

MRFs on grids
• By far the most widely used neighborhood graphs are 2-dimensional grids.
• MRFs on grids are used in spatial statistics to model spatial interactions between RVs.
• Hammersley-Clifford for grids: The only cliques are the edges!

(Figure: a 2-dimensional grid graph with 4-neighborhoods.)

MRFs on grids factorize over edges.

Advanced Machine Learning 64 / 212

THE POTTS MODEL

Definition
Suppose N = (V_N, W_N) is a neighborhood graph with n vertices and β > 0 a constant. Then

p(θ_{1:n}) := (1 / Z(β, W_N)) exp( β ∑_{i,j} w_ij I{θ_i = θ_j} )

defines a joint distribution of n random variables Θ_1, . . . , Θ_n. This distribution is called the Potts model.

Interpretation
• If w_ij > 0: The overall probability increases if Θ_i = Θ_j.
• If w_ij < 0: The overall probability decreases if Θ_i = Θ_j.
• If w_ij = 0: No interaction between Θ_i and Θ_j.

Positive weights encourage smoothness.
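Not on the slides: a sketch that evaluates the Potts density up to the normalization constant Z(β, W_N). The sum is taken over unordered pairs {i, j}; whether the slide's ∑_{i,j} means ordered or unordered pairs only changes β by a factor of two.

import numpy as np

def potts_unnormalized(theta, weights, beta):
    """exp( beta * sum_{i<j} w_ij I{theta_i = theta_j} ), i.e. the Potts density up to Z.

    theta:   (n,) array of discrete labels
    weights: (n, n) symmetric weight matrix W_N; w_ij = 0 means "no edge"."""
    i, j = np.triu_indices(len(theta), k=1)       # unordered pairs {i, j}
    agree = (theta[i] == theta[j]).astype(float)
    return np.exp(beta * np.sum(weights[i, j] * agree))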

Advanced Machine Learning 65 / 212

EXAMPLE

Ising model
The simplest choice is w_ij = 1 if (i, j) is an edge.

p(θ_{1:n}) = (1 / Z(β)) exp( ∑_{(i,j) is an edge} β I{θ_i = θ_j} )

If N is a d-dimensional grid, this model is called the Ising model.

(Figure: a 2-dimensional grid graph.)

Example
Samples from an Ising model on a 56 × 56 grid graph, for increasing β.

Advanced Machine Learning 66 / 212

MRFS AS SMOOTHNESS PRIORS

We consider a spatial problem with observations Xi. Each i is a location on a grid.

Spatial model
Suppose we model each X_i by a distribution L(X|Θ_i), i.e. each location i has its own parameter variable Θ_i. This model is Bayesian (the parameter is a random variable). We use an MRF as prior distribution.

(Figure: a grid of unobserved parameter variables Θ_i, each connected to an observed variable X_i via an emission distribution p( . |θ_i).)

We can think of L(X|Θ_i) as an emission probability, similar to an HMM.

Spatial smoothing
• We can define the joint distribution of (Θ_1, . . . , Θ_n) as an MRF on the grid graph.
• For positive weights, the MRF will encourage the model to explain neighbors X_i and X_j by the same parameter value. → Spatial smoothing.

Advanced Machine Learning 67 / 212

EXAMPLE: SEGMENTATION OF NOISY IMAGES

Mixture model
• A BMM can be used for image segmentation.
• The BMM prior on the component parameters is a natural conjugate prior q(θ).
• In the spatial setting, we index the parameter of each X_i separately as θ_i. For K mixture components, θ_{1:n} contains only K different values.
• The joint BMM prior on θ_{1:n} is

    q_BMM(θ_{1:n}) = ∏_{i=1}^n q(θ_i) .

Smoothing term
We multiply the BMM prior q_BMM(θ) with an MRF prior

q_MRF(θ_{1:n}) = (1 / Z(β)) exp( β ∑_{w_ij ≠ 0} I{θ_i = θ_j} )

This encourages spatial smoothness of the segmentation.

(Figures: input image; segmentation without smoothing; segmentation with MRF smoothing. Illustration from a paper in Int. J. Comput. Vis.)

Advanced Machine Learning 68 / 212

SAMPLING AND INFERENCE

MRFs pose two main computational problems.

Problem 1: Sampling

Generate samples from the joint distribution of (Θ1, . . . ,Θn).

Problem 2: Inference
If the MRF is used as a prior, we have to compute or approximate the posterior distribution.

Solution
• MRF distributions on grids are not analytically tractable. The only known exception is the Ising model in 1 dimension.
• Both sampling and inference are based on Markov chain sampling algorithms.

Advanced Machine Learning 69 / 212

SAMPLING ALGORITHMS

SAMPLING ALGORITHMS

In general
• A sampling algorithm is an algorithm that outputs samples X_1, X_2, . . . from a given distribution P or density p.
• Sampling algorithms can for example be used to approximate expectations:

    E_p[f(X)] ≈ (1/n) ∑_{i=1}^n f(X_i)

Inference in Bayesian models
Suppose we work with a Bayesian model whose posterior Q_n := L(Θ|X_{1:n}) cannot be computed analytically.
• We will see that it can still be possible to sample from Q_n.
• Doing so, we obtain samples Θ_1, Θ_2, . . . distributed according to Q_n.
• This reduces posterior estimation to a density estimation problem (i.e. estimate Q_n from Θ_1, Θ_2, . . .).

Advanced Machine Learning 71 / 212

PREDICTIVE DISTRIBUTIONS

Posterior expectations
If we are only interested in some statistic of the posterior of the form E_{Q_n}[f(Θ)] (e.g. the posterior mean), we can again approximate by

E_{Q_n}[f(Θ)] ≈ (1/m) ∑_{i=1}^m f(Θ_i) .

Example: Predictive distribution
The posterior predictive distribution is our best guess of what the next data point x_{n+1} looks like, given the posterior under previous observations. In terms of densities:

p(x_{n+1}|x_{1:n}) := ∫_T p(x_{n+1}|θ) Q_n(dθ|X_{1:n} = x_{1:n}) .

This is one of the key quantities of interest in Bayesian statistics.

Computation from samples
The predictive is a posterior expectation, and can be approximated as a sample average:

p(x_{n+1}|x_{1:n}) = E_{Q_n}[p(x_{n+1}|Θ)] ≈ (1/m) ∑_{i=1}^m p(x_{n+1}|Θ_i)

Advanced Machine Learning 72 / 212

BASIC SAMPLING: AREA UNDER CURVE

Say we are interested in a probability density p on the interval [a, b].

(Figure: a density p on [a, b], with the area A under the curve shaded and a point (X_i, Y_i) inside it.)

Key observation
Suppose we can define a uniform distribution U_A on the blue area A under the curve. If we sample

(X_1, Y_1), (X_2, Y_2), . . . ∼ iid U_A

and discard the vertical coordinates Y_i, the X_i are distributed according to p,

X_1, X_2, . . . ∼ iid p .

Problem: Defining a uniform distribution is easy on a rectangular area, but difficult on an arbitrarily shaped one.

Advanced Machine Learning 73 / 212

REJECTION SAMPLING ON THE INTERVAL

Solution: Rejection sampling
We can enclose p in a box B = [a, b] × [0, c] and sample uniformly from the box.

(Figure: the density p on [a, b] enclosed in the box B of height c.)

• We can sample (X_i, Y_i) uniformly on B by sampling

    X_i ∼ Uniform[a, b]   and   Y_i ∼ Uniform[0, c] .

• If (X_i, Y_i) ∈ A, keep the sample. That is: If Y_i ≤ p(X_i).

• Otherwise: Discard it ("reject" it).

Result: The remaining (non-rejected) samples are uniformly distributed on A.

Advanced Machine Learning 74 / 212

SCALING

This strategy still works if we scale the curve vertically by some constant k > 0.

(Figure: the same density enclosed in a box of height c, and the scaled curve enclosed in a box of height k · c.)

We simply draw Y_i ∼ Uniform[0, kc] instead of Y_i ∼ Uniform[0, c].

Consequence

For sampling, it is sufficient if p is known only up to normalization (only the shape of p is known).

Advanced Machine Learning 75 / 212

DISTRIBUTIONS KNOWN UP TO SCALING

Sampling methods usually assume that we can evaluate the target distribution p up to a constant. That is:

p(x) = (1/Z) p̃(x) ,

and we can compute p̃(x) for any given x, but we do not know Z.

We have to pause for a moment and convince ourselves that there are useful examples where this assumption holds.

Example 1: Simple posterior
For an arbitrary posterior computed with Bayes’ theorem, we could write

Π(θ|x_{1:n}) = ( ∏_{i=1}^n p(x_i|θ) q(θ) ) / Z   with   Z = ∫_T ∏_{i=1}^n p(x_i|θ) q(θ) dθ .

Provided that we can compute the numerator, we can sample without computing the normalization integral Z.

Advanced Machine Learning 76 / 212

DISTRIBUTIONS KNOWN UP TO SCALING

Example 2: Bayesian Mixture Model
Recall that the posterior of the BMM is (up to normalization):

q_n(c_{1:K}, θ_{1:K}|x_{1:n}) ∝ ∏_{i=1}^n ( ∑_{k=1}^K c_k p(x_i|θ_k) ) · ( ∏_{k=1}^K q(θ_k|λ, y) ) · q_Dirichlet(c_{1:K})

We already know that we can discard the normalization constant, but can we evaluate the non-normalized posterior q_n?
• The problem with computing q_n (as a function of unknowns) is that the term ∏_{i=1}^n ( ∑_{k=1}^K . . . ) blows up into K^n individual terms.
• If we evaluate q_n for specific values of c, x and θ, each ∑_{k=1}^K c_k p(x_i|θ_k) collapses to a single number for each x_i, and we just have to multiply those n numbers.

So: Computing q_n as a formula in terms of unknowns is difficult; evaluating it for specific values of the arguments is easy.

Advanced Machine Learning 77 / 212

DISTRIBUTIONS KNOWN UP TO SCALING

Example 3: Markov random field
In an MRF, the normalization function is the real problem.

For example, recall the Ising model:

p(θ_{1:n}) = (1 / Z(β)) exp( ∑_{(i,j) is an edge} β I{θ_i = θ_j} )

The normalization function is

Z(β) = ∑_{θ_{1:n} ∈ {0,1}^n} exp( ∑_{(i,j) is an edge} β I{θ_i = θ_j} )

and hence a sum over 2^n terms. The general Potts model is even more difficult.

On the other hand, evaluating

p̃(θ_{1:n}) = exp( ∑_{(i,j) is an edge} β I{θ_i = θ_j} )

for a given configuration θ_{1:n} is straightforward.

Advanced Machine Learning 78 / 212

REJECTION SAMPLING ON Rd

If we are not on the interval, sampling uniformly from an enclosing box is not possible (since there is no uniform distribution on all of R or R^d).

Solution: Proposal density
Instead of a box, we use another distribution r to enclose p:

(Figure: the target density p enclosed under the graph of a proposal density r; the region B under r plays the role of the box.)

To generate B under r, we apply similar logic as before, backwards:
• Sample X_i ∼ r.
• Sample Y_i|X_i ∼ Uniform[0, r(X_i)].

r is always a simple distribution which we can sample and evaluate.

Advanced Machine Learning 79 / 212

REJECTION SAMPLING ON Rd

(Figure: the target density p enclosed under the proposal density r, with region B under r.)

• Choose a simple distribution r from which we know how to sample.
• Scale p̃ such that p̃(x) < r(x) everywhere.
• Sampling: For i = 1, 2, . . .:
  1. Sample X_i ∼ r.
  2. Sample Y_i|X_i ∼ Uniform[0, r(X_i)].
  3. If Y_i < p̃(X_i), keep X_i.
  4. Else, discard X_i and start again at (1).
• The surviving samples X_1, X_2, . . . are distributed according to p.
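Not part of the slides: a minimal rejection sampler. Instead of rescaling p̃ to lie below r, the sketch keeps an explicit envelope constant m with p̃(x) ≤ m · r(x); the concrete target and proposal in the example (standard normal and Cauchy) are illustrative choices.

import numpy as np

def rejection_sample(p_tilde, r_sample, r_density, m, n_samples):
    """Rejection sampling with proposal density r, assuming p_tilde(x) <= m * r_density(x)."""
    samples = []
    while len(samples) < n_samples:
        x = r_sample()
        y = np.random.uniform(0.0, m * r_density(x))
        if y < p_tilde(x):            # keep the point if it falls under the target curve
            samples.append(x)
    return np.array(samples)

# Example: standard normal target (known only up to scaling) with a Cauchy proposal.
p_tilde = lambda x: np.exp(-0.5 * x**2)
r_density = lambda x: 1.0 / (np.pi * (1.0 + x**2))
r_sample = np.random.standard_cauchy
# max_x pi (1 + x^2) exp(-x^2/2) is about 3.81, so m = 4 is a valid envelope constant
draws = rejection_sample(p_tilde, r_sample, r_density, m=4.0, n_samples=1000)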

Advanced Machine Learning 80 / 212

FACTORIZATION PERSPECTIVE

The rejection step can be interpreted in terms of probabilities and densities.

Factorization
We factorize the target distribution or density p as

p(x) = r(x) · A(x) ,

where r is a distribution from which we know how to sample, and A is a probability function we can evaluate once a specific value of x is given.

Sampling from the factorization

X = X′ · Z   where X′ ∼ r and Z|X′ ∼ Bernoulli(A(X′))

Sampling Bernoulli variables with uniform variables

Z|X′ ∼ Bernoulli(A(X′))   ⇔   Z = I{U < A(X′)} where U ∼ Uniform[0, 1] .

Advanced Machine Learning 81 / 212

INDEPENDENCE

If we draw proposal samples X_i i.i.d. from r, the resulting sequence of accepted samples produced by rejection sampling is again i.i.d. with distribution p. Hence:

Rejection samplers produce i.i.d. sequences of samples.

Important consequence
If samples X_1, X_2, . . . are drawn by a rejection sampler, the sample average

(1/m) ∑_{i=1}^m f(X_i)

(for some function f) is an unbiased estimate of the expectation E_p[f(X)].

Advanced Machine Learning 82 / 212

EFFICIENCY

The fraction of accepted samples is the ratio |A|/|B| of the areas under the curves p and r.

(Figure: the target density p under the proposal r; accepted samples fall under p.)

If r is not a reasonably close approximation of p, we will end up rejecting a lot of proposal samples.

Advanced Machine Learning 83 / 212

AN IMPORTANT BIT OF IMPRECISE INTUITION

(Figure captions: “Example figures for sampling methods tend to look like this.” — “A high-dimensional distribution of correlated RVs will look rather more like this.”)

Sampling is usually used in multiple dimensions. Reason, roughly speaking:
• Intractable posterior distributions arise when there are several interacting random variables. The interactions make the joint distribution complicated.
• In one-dimensional problems (1 RV), we can usually compute the posterior analytically.
• Independent multi-dimensional distributions factorize and reduce to the one-dimensional case.

Warning: Avoid sampling if you can solve analytically.

Advanced Machine Learning 84 / 212

WHY IS NOT EVERY SAMPLER A REJECTION SAMPLER?

We can easily end up in situations where we accept only one in 10^6 (or 10^10, or 10^20, . . .) proposal samples. Especially in higher dimensions, we have to expect this to be not the exception but the rule.

Advanced Machine Learning 85 / 212

IMPORTANCE SAMPLING

The rejection problem can be fixed easily if we are only interested in approximating an expectation E_p[f(X)].

Simple case: We can evaluate p
Suppose p is the target density and q a proposal density. An expectation under p can be rewritten as

E_p[f(X)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx = E_q[ f(X) p(X)/q(X) ]

Importance sampling
We can sample X_1, X_2, . . . from q and approximate E_p[f(X)] as

E_p[f(X)] ≈ (1/m) ∑_{i=1}^m f(X_i) p(X_i)/q(X_i)

There is no rejection step; all samples are used.

This method is called importance sampling. The coefficients p(X_i)/q(X_i) are called importance weights.

Advanced Machine Learning 86 / 212

IMPORTANCE SAMPLING

General case: We can only evaluate p̃
In the general case,

p = (1/Z_p) p̃   and   q = (1/Z_q) q̃ ,

and Z_p (and possibly Z_q) are unknown. We can write Z_p/Z_q as

Z_p/Z_q = ∫ p̃(x) dx / Z_q = ∫ (p̃(x)/q̃(x)) · (q̃(x)/Z_q) dx = ∫ (p̃(x)/q̃(x)) q(x) dx = E_q[ p̃(X)/q̃(X) ]

Approximating the constants
The fraction Z_p/Z_q can be approximated using samples X_{1:m} from q:

Z_p/Z_q = E_q[ p̃(X)/q̃(X) ] ≈ (1/m) ∑_{i=1}^m p̃(X_i)/q̃(X_i)

Approximating E_p[f(X)]

E_p[f(X)] ≈ (1/m) ∑_{i=1}^m f(X_i) p(X_i)/q(X_i) = (1/m) ∑_{i=1}^m f(X_i) (Z_q/Z_p) p̃(X_i)/q̃(X_i) = ∑_{i=1}^m f(X_i) p̃(X_i)/q̃(X_i) / ∑_{j=1}^m p̃(X_j)/q̃(X_j)

Advanced Machine Learning 87 / 212

IMPORTANCE SAMPLING IN GENERAL

Conditions
• Given are a target distribution p and a proposal distribution q.
• p = (1/Z_p) p̃ and q = (1/Z_q) q̃.
• We can evaluate p̃ and q̃, and we can sample q.
• The objective is to compute E_p[f(X)] for a given function f.

Algorithm
1. Sample X_1, . . . , X_m from q.
2. Approximate E_p[f(X)] as

    E_p[f(X)] ≈ ∑_{i=1}^m f(X_i) p̃(X_i)/q̃(X_i) / ∑_{j=1}^m p̃(X_j)/q̃(X_j)
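Not part of the slides: a sketch of the self-normalized estimator above; the Gaussian target and wide Gaussian proposal in the example are arbitrary illustrative choices.

import numpy as np

def importance_sampling(f, p_tilde, q_tilde, q_sample, m):
    """Self-normalized importance sampling estimate of E_p[f(X)]."""
    x = q_sample(m)
    w = p_tilde(x) / q_tilde(x)            # importance weights (up to constants)
    return np.sum(f(x) * w) / np.sum(w)

# Example: E_p[X^2] = 1 for a standard normal target, using a N(0, 3^2) proposal.
p_tilde = lambda x: np.exp(-0.5 * x**2)
q_tilde = lambda x: np.exp(-0.5 * (x / 3.0)**2)
q_sample = lambda m: 3.0 * np.random.randn(m)
print(importance_sampling(lambda x: x**2, p_tilde, q_tilde, q_sample, m=100000))  # close to 1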

Advanced Machine Learning 88 / 212

MARKOV CHAIN MONTE CARLO

MOTIVATION

Suppose we rejection-sample a distribution like this:

(Figure: a density that concentrates in a narrow region of interest.)

Once we have drawn a sample in the narrow region of interest, we would like to continue drawing samples within the same region. That is only possible if each sample depends on the location of the previous sample.

Proposals in rejection sampling are i.i.d. Hence, once we have found the region where p concentrates, we forget about it for the next sample.

Advanced Machine Learning 90 / 212

MCMC: IDEA

Recall: Markov chain
• A sufficiently nice Markov chain (MC) has an invariant distribution P_inv.
• Once the MC has converged to P_inv, each sample X_i from the chain has marginal distribution P_inv.

Markov chain Monte Carlo
We want to sample from a distribution with density p. Suppose we can define a MC with invariant distribution P_inv ≡ p. If we sample X_1, X_2, . . . from the chain, then once it has converged, we obtain samples

X_i ∼ p .

This sampling technique is called Markov chain Monte Carlo (MCMC).

Note: For a Markov chain, X_{i+1} can depend on X_i, so at least in principle, it is possible for an MCMC sampler to “remember” the previous step and remain in a high-probability location.

Advanced Machine Learning 91 / 212

CONTINUOUS MARKOV CHAIN

The Markov chains we discussed so far had a finite state space X. For MCMC, the state space now has to be the domain of p, so we often need to work with continuous state spaces.

Continuous Markov chain
A continuous Markov chain is defined by an initial distribution P_init and a conditional probability t(y|x), the transition probability or transition kernel.

In the discrete case, t(y = i|x = j) is the entry p_ij of the transition matrix p.

Example: A Markov chain on R^2
We can define a very simple Markov chain by sampling

X_{i+1}|X_i = x_i ∼ g( . |x_i, σ^2)

where g(x|µ, σ^2) is a spherical Gaussian with fixed variance. In other words, the transition distribution is

t(x_{i+1}|x_i) := g(x_{i+1}|x_i, σ^2) .

(Figure: a Gaussian (gray contours) placed around the current point x_i to sample X_{i+1}.)

Advanced Machine Learning 92 / 212

INVARIANT DISTRIBUTION

Recall: Finite case
• The invariant distribution P_inv is a distribution on the finite state space X of the MC (i.e. a vector of length |X|).
• “Invariant” means that, if X_i is distributed according to P_inv, and we execute a step X_{i+1} ∼ t( . |x_i) of the chain, then X_{i+1} again has distribution P_inv.
• In terms of the transition matrix p:

    p · P_inv = P_inv

Continuous case
• X is now uncountable (e.g. X = R^d).
• The transition matrix p is substituted by the conditional probability t.
• A distribution P_inv with density p_inv is invariant if

    ∫_X t(y|x) p_inv(x) dx = p_inv(y)

  This is simply the continuous analogue of the equation ∑_i p_ij (P_inv)_i = (P_inv)_j.

Advanced Machine Learning 93 / 212

MARKOV CHAIN SAMPLING

(Figure sequence: We run the Markov chain for n steps; each step moves from the current location x_i to a new x_{i+1}. We “forget” the order and regard the locations x_{1:n} as a random set of points. If p (red contours) is both the invariant and initial distribution, each X_i is distributed as X_i ∼ p.)

Problems we need to solve
1. We have to construct a MC with invariant distribution p.
2. We cannot actually start sampling with X_1 ∼ p; if we knew how to sample from p, all of this would be pointless.
3. Each point X_i is marginally distributed as X_i ∼ p, but the points are not i.i.d.

Advanced Machine Learning 94 / 212

CONSTRUCTING THE MARKOV CHAIN

Given is a continuous target distribution with density p.

Metropolis-Hastings (MH) kernel
1. We start by defining a conditional probability q(y|x) on X.
   q has nothing to do with p. We could e.g. choose q(y|x) = g(y|x, σ^2), as in the previous example.
2. We define a rejection kernel A as

    A(x_{i+1}|x_i) := min{ 1, [ q(x_i|x_{i+1}) p(x_{i+1}) ] / [ q(x_{i+1}|x_i) p(x_i) ] }

   The normalization of p cancels in the quotient, so knowing p up to normalization is again enough.
3. We define the transition probability of the chain as

    t(x_{i+1}|x_i) := q(x_{i+1}|x_i) A(x_{i+1}|x_i) + δ_{x_i}(x_{i+1}) c(x_i)   where   c(x_i) := ∫ q(y|x_i) (1 − A(y|x_i)) dy

   is the total probability that a proposal is sampled and then rejected.

Sampling from the MH chain
At each step i + 1, generate a proposal X* ∼ q( . |x_i) and U_i ∼ Uniform[0, 1].
• If U_i ≤ A(x*|x_i), accept the proposal: Set x_{i+1} := x*.
• If U_i > A(x*|x_i), reject the proposal: Set x_{i+1} := x_i.
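Not part of the slides: a random-walk Metropolis-Hastings sketch. Because the Gaussian proposal is symmetric, the ratio q(x_i|x*)/q(x*|x_i) equals 1 and the acceptance probability reduces to min{1, p̃(x*)/p̃(x_i)}; the target in the example is an arbitrary illustrative choice.

import numpy as np

def metropolis_hastings(p_tilde, x0, n_steps, sigma=1.0):
    """Random-walk MH with proposal q(y|x) = g(y | x, sigma^2 I); p_tilde is the unnormalized target."""
    x = np.array(x0, dtype=float)
    chain = [x.copy()]
    for _ in range(n_steps):
        proposal = x + sigma * np.random.randn(*x.shape)
        accept_prob = min(1.0, p_tilde(proposal) / p_tilde(x))
        if np.random.rand() <= accept_prob:
            x = proposal                    # accept the proposal
        chain.append(x.copy())              # on rejection, the current point is repeated
    return np.array(chain)

# Example: sample a 2-d standard Gaussian known only up to normalization.
p_tilde = lambda z: np.exp(-0.5 * np.sum(z**2))
samples = metropolis_hastings(p_tilde, x0=np.zeros(2), n_steps=5000, sigma=0.8)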

Advanced Machine Learning 95 / 212

PROBLEM 1: INITIAL DISTRIBUTION

Recall: Fundamental theorem on Markov chains
Suppose we sample X_1 ∼ P_init and X_{i+1} ∼ t( . |x_i). This defines a distribution P_i of X_i, which can change from step to step. If the MC is nice (recall: recurrent and aperiodic), then

P_i → P_inv for i → ∞ .

Note: Making precise what aperiodic means in a continuous state space is a bit more technical than in the finite case, but the theorem still holds. We will not worry about the details here.

Implication
• If we can show that P_inv ≡ p, we do not have to know how to sample from p.
• Instead, we can start with any P_init, and will get arbitrarily close to p for sufficiently large i.

Advanced Machine Learning 96 / 212

BURN-IN AND MIXING TIME

The number m of steps required until Pm ≈ Pinv ≡ p is called the mixing time of the Markov chain. (In probability theory, there is a range of definitions for what exactly Pm ≈ Pinv means.)

In MCMC samplers, the first m samples are also called the burn-in phase. The first m samples of each run of the sampler are discarded:

X1, . . . , Xm−1 (burn-in; discard),   Xm, Xm+1, . . . (samples from (approximately) p; keep)

Convergence diagnostics
In practice, we do not know how large m is. There are a number of methods for assessing whether the sampler has mixed. Such heuristics are often referred to as convergence diagnostics.

Advanced Machine Learning 97 / 212

PROBLEM 2: SEQUENTIAL DEPENDENCE

Even after burn-in, the samples from a MC are not i.i.d.

Strategy
• Estimate empirically how many steps L are needed for xi and xi+L to be approximately independent. The number L is called the lag.
• After burn-in, keep only every Lth sample; discard the samples in between.

Estimating the lag
The most common method uses the autocorrelation function:

Auto(xi, xj) := E[(xi − µi)(xj − µj)] / (σi σj)

We compute Auto(xi, xi+L) empirically from the sample for different values of L, and find the smallest L for which the autocorrelation is close to zero.

Autocorrelation plots
We can get autocorrelation plots using the autocorr.plot() function (here in R):

> autocorr.plot(mh.draws)

[Figure: two autocorrelation plots showing Auto(xi, xi+L) against the lag L (0 to 25).]
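A sketch of how the lag could be estimated numerically from a 1-dimensional chain (illustrative only; the threshold and names are assumptions, not a prescribed procedure):

import numpy as np

def empirical_autocorr(x, max_lag):
    # Empirical Auto(x_i, x_{i+L}) for L = 1, ..., max_lag.
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = x.var()
    return np.array([np.mean(x[:len(x) - L] * x[L:]) / var
                     for L in range(1, max_lag + 1)])

def estimate_lag(x, max_lag=100, threshold=0.05):
    # Smallest L at which the autocorrelation is close to zero.
    acf = empirical_autocorr(x, max_lag)
    below = np.nonzero(np.abs(acf) < threshold)[0]
    return int(below[0]) + 1 if len(below) else max_lag

# e.g. lag = estimate_lag(samples[:, 0]) for the MH samples drawn earlier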

Advanced Machine Learning 98 / 212

CONVERGENCE DIAGNOSTICS

There are about half a dozen popular convergence criteria; the one below is an example.

Gelman-Rubin criterion
• Start several chains at random. For each chain k, the sample X_i^k has a marginal distribution P_i^k.
• The distributions P_i^k will differ between chains in early stages.
• Once the chains have converged, all P_i^k = Pinv are identical.
• Criterion: Use a hypothesis test to compare the P_i^k for different k (e.g. compare P_i^2 against the null hypothesis P_i^1). Once the test does not reject anymore, assume that the chains are past burn-in.

Reference: A. Gelman and D. B. Rubin: "Inference from Iterative Simulation Using Multiple Sequences", Statistical Science, Vol. 7 (1992) 457-511.

Advanced Machine Learning 99 / 212

STOCHASTIC HILL-CLIMBING

The Metropolis-Hastings rejection kernel was defined as:

A(xi+1|xi) = min{ 1, [q(xi|xi+1) p(xi+1)] / [q(xi+1|xi) p(xi)] } .

Hence, we certainly accept if the second term is larger than 1, i.e. if

q(xi|xi+1) p(xi+1) > q(xi+1|xi) p(xi) .

That means:
• We always accept the proposal value xi+1 if it increases the probability under p.
• If it decreases the probability, we still accept with a probability which depends on the difference to the current probability.

Hill-climbing interpretation
• The MH sampler somewhat resembles a gradient ascent algorithm on p, which tends to move in the direction of increasing probability p.
• However:
  • The actual steps are chosen at random.
  • The sampler can move "downhill" with a certain probability.
  • When it reaches a local maximum, it does not get stuck there.

Advanced Machine Learning 100 / 212

SELECTING A PROPOSAL DISTRIBUTION

Everyone’s favorite example: Two Gaussians

red = target distribution p
gray = proposal distribution q

• Var[q] too large: Will overstep p; many rejections.
• Var[q] too small: Many steps needed to achieve good coverage of the domain.

If p is unimodal and can be roughly approximated by a Gaussian, Var[q] should be chosen as the smallest covariance component of p.

More generally
For complicated posteriors (recall: small regions of concentration, large low-probability regions in between) choosing q is much more difficult. To choose q with good performance, we already need to know something about the posterior.

There are many strategies, e.g. mixture proposals (with one component for large steps and one for small steps).

Advanced Machine Learning 101 / 212

SUMMARY: MH SAMPLER

• MCMC samplers construct a MC with invariant distribution p.
• The MH kernel is one generic way to construct such a chain from p and a proposal distribution q.
• Formally, q does not depend on p (but an arbitrary choice of q usually means bad performance).
• We have to discard an initial number m of samples as burn-in to obtain samples (approximately) distributed according to p.
• After burn-in, we keep only every Lth sample (where L = lag) to make sure the xi are (approximately) independent.

X1, . . . , Xm−1 (burn-in; discard),   Xm (keep),   Xm+1, . . . , Xm+L−1 (correlated with Xm; discard),   Xm+L (keep),   Xm+L+1, . . . , Xm+2L−1 (correlated with Xm+L; discard),   Xm+2L (keep),   . . .

Advanced Machine Learning 102 / 212

THE GIBBS SAMPLER

GIBBS SAMPLING

By far the most widely used MCMC algorithm is the Gibbs sampler.

Full conditionals
Suppose L(X) is a distribution on RD, so X = (X1, . . . , XD). The conditional distribution of the entry Xd given all other entries,

L(Xd|X1, . . . , Xd−1, Xd+1, . . . , XD)

is called the full conditional distribution of Xd.

On RD, that means we are interested in a density

p(xd|x1, . . . , xd−1, xd+1, . . . , xD)

Gibbs sampling
The Gibbs sampler is the special case of the Metropolis-Hastings algorithm defined by

proposal distribution for Xd = full conditional of Xd .

• Gibbs sampling is only applicable if we can compute the full conditionals for each dimension d.
• If so, it provides us with a generic way to derive a proposal distribution.

Advanced Machine Learning 104 / 212

THE GIBBS SAMPLER

Proposal distribution
Suppose p is a distribution on RD, so each sample is of the form Xi = (Xi,1, . . . , Xi,D). We generate a proposal Xi+1 coordinate-by-coordinate as follows:

Xi+1,1 ∼ p( . |xi,2, . . . , xi,D)
...
Xi+1,d ∼ p( . |xi+1,1, . . . , xi+1,d−1, xi,d+1, . . . , xi,D)
...
Xi+1,D ∼ p( . |xi+1,1, . . . , xi+1,D−1)

Note: Each new Xi+1,d is immediately used in the update of the next dimension d + 1.

A Metropolis-Hastings algorithm with proposals generated as above is called a Gibbs sampler.

No rejections
It is straightforward to show that the Metropolis-Hastings acceptance probability for each coordinate update xi+1,d is 1, so proposals in Gibbs sampling are always accepted.

Advanced Machine Learning 105 / 212

EXAMPLE: MRF

In a MRF with D nodes, each dimension d corresponds to one vertex.

Full conditionals
In a grid with 4-neighborhoods, for instance, the Markov property implies that

p(θd|θ1, . . . , θd−1, θd+1, . . . , θD) = p(θd|θleft, θright, θup, θdown)

[Figure: a grid vertex Θd with its four neighbors Θleft, Θright, Θup, Θdown.]

Specifically: Potts model with binary weights
Recall that, for sampling, knowing p only up to normalization is sufficient:

p(θd|θ1, . . . , θd−1, θd+1, . . . , θD) ∝ exp( β ( I{θd = θleft} + I{θd = θright} + I{θd = θup} + I{θd = θdown} ) )

This is clearly very efficiently computable.

Advanced Machine Learning 106 / 212

EXAMPLE: MRF

Sampling the Potts model
Each step of the sampler generates a sample

θi = (θi,1, . . . , θi,D) ,

where D is the number of vertices in the grid.

Gibbs sampler
Each step of the Gibbs sampler generates D coordinate updates according to

θi+1,d ∼ p( . |θi+1,1, . . . , θi+1,d−1, θi,d+1, . . . , θi,D)
        ∝ exp( β ( I{θi+1,d = θleft} + I{θi+1,d = θright} + I{θi+1,d = θup} + I{θi+1,d = θdown} ) )
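A minimal sketch (assumptions: a square grid, 5 states, binary weights as above) of one Gibbs sweep over the Potts model; names and parameter values are illustrative:

import numpy as np

def potts_gibbs_sweep(theta, beta, n_states, rng):
    n, m = theta.shape
    for r in range(n):
        for c in range(m):
            # Collect the 4-neighborhood (boundary vertices have fewer neighbors).
            neighbors = []
            if r > 0:     neighbors.append(theta[r - 1, c])
            if r < n - 1: neighbors.append(theta[r + 1, c])
            if c > 0:     neighbors.append(theta[r, c - 1])
            if c < m - 1: neighbors.append(theta[r, c + 1])
            # Full conditional: p(theta_d = s | rest) is proportional to
            # exp(beta * #{neighbors that agree with s}).
            agree = np.array([sum(nb == s for nb in neighbors)
                              for s in range(n_states)])
            probs = np.exp(beta * agree)
            probs /= probs.sum()
            theta[r, c] = rng.choice(n_states, p=probs)
    return theta

rng = np.random.default_rng(0)
theta = rng.integers(0, 5, size=(64, 64))   # random initialization, 5 states
for _ in range(200):                        # e.g. 200 sweeps of burn-in
    theta = potts_gibbs_sweep(theta, beta=1.0, n_states=5, rng=rng)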

Advanced Machine Learning 107 / 212

BURN-IN MATTERS

This example is due to Erik Sudderth (UC Irvine).

MRFs as "segmentation" priors

[Figure: a sample drawn from the model after 200 Gibbs iterations; see the bullets below.]

• MRFs were introduced as tools for image smoothing and segmentation by D. and S. Geman in 1984.
• They sampled from a Potts model with a Gibbs sampler, discarding 200 iterations as burn-in.
• Such a sample (after 200 steps) is shown above, for a Potts model in which each variable can take one out of 5 possible values.
• These patterns led computer vision researchers to conclude that MRFs are "natural" priors for image segmentation, since samples from the MRF resemble a segmented image.

Advanced Machine Learning 108 / 212

EXAMPLE: BURN-IN MATTERS

E. Sudderth ran a Gibbs sampler on the same model and sampled after 200 iterations (as the Geman brothers did), and again after 10000 iterations:

[Figure: samples from several chains after 200 iterations and after 10000 iterations (Chain 1 and Chain 5 shown).]

• The "segmentation" patterns are not sampled from the MRF distribution p ≡ Pinv, but rather from P200 ≠ Pinv.
• The patterns occur not because MRFs are "natural" priors for segmentations, but because the sampler's Markov chain has not mixed.
• MRFs are smoothness priors, not segmentation priors.

Advanced Machine Learning 109 / 212

VARIATIONAL INFERENCE

VARIATIONAL INFERENCE: IDEA

Problem
We have to solve an inference problem where the correct solution is an "intractable" distribution with density p∗ (e.g. a complicated posterior in a Bayesian inference problem).

Variational approach
Approximate p∗ by

q∗ := arg min_{q∈Q} φ(q, p∗)

where Q is a class of simple distributions and φ is a cost function (small φ means good fit). That turns the inference problem into a constrained optimization problem

min φ(q, p∗)   s.t. q ∈ Q

Variational inference approximates a complicated distribution by minimizing the distance (or discrepancy) to a class of tractable distributions.

Advanced Machine Learning 111 / 212

BACKGROUND: VARIATIONAL METHODS

Recall: Optimization approach to problems
Formulate your problem such that the solution x∗ ∈ Rd is the minimum of some function f, and solve

x∗ := arg min_{x∈Rd} f(x)

possibly under constraints.

Examples: Support vector machines, linear regression, logistic regression, . . .

Inference problem as above

q∗ := arg min_{q∈Q} φ(q, p∗)

• q is now a function (a density), not a point in Rd.
• We have to optimize over a space of functions. Such spaces are in general infinite-dimensional.
• Often: Q is a parametric model, with parameter space T ⊂ Rd → the problem reduces to optimization over Rd.
• However: Optimization over infinite-dimensional spaces is in principle possible.

Advanced Machine Learning 112 / 212

OPTIMIZATION OF FUNCTIONALS

• Let F be a space of functions (e.g. all continuous functions on R).
• A function φ : F → R (a function whose arguments are functions) is called a functional.
• Examples: (1) The integral of a function. (2) The differential entropy of a density.

Recall: Derivatives (on Rd)
The differential of f : Rd → R at a point x is

δf(x) = lim_{ε↘0} [f(x + ε) − f(x)] / ε   if d = 1,   or   δf(x) = lim_{‖h‖↘0} [f(x + h) − f(x)] / ‖h‖   in general.

The d-dimensional case works by reducing to the 1-dimensional case using a norm.

Derivatives of functionals
If F is a function space and ‖ • ‖ a norm on F, we can apply the same idea to φ : F → R:

δφ(f) := lim_{‖g‖↘0} [φ(f + g) − φ(f)] / ‖g‖

δφ(f) is called the Fréchet derivative of φ at f. A function f is a minimum of a Fréchet-differentiable functional φ only if δφ(f) = 0.

Not examinable.Advanced Machine Learning 113 / 212

OPTIMIZATION OF FUNCTIONALS

Optimization
We can in principle find a minimum of φ by gradient descent: Add increment functions ∆fk "in the direction of" δφ(fk) to the current solution candidate fk. The maximum entropy problem is often cited as an example.

Problems
• We have to represent the infinite-dimensional quantities fk and ∆fk in some way.
• Many interesting functionals φ are not Fréchet-differentiable as functionals on F. They only become differentiable when constrained to a much smaller subspace.

One solution is "variational calculus", an analytic technique that addresses both problems. (We will not need the details.)

Recall: Maximum entropy principle
• The maximum entropy principle chooses a distribution within some set P of candidates by selecting the one with the largest entropy.
• That is: It solves the optimization problem

max H(p)   s.t. p ∈ P

• For example, if P consists of all distributions under which some given statistic S takes a given expected value, we obtain exponential family distributions with sufficient statistic S.

Not examinable.Advanced Machine Learning 114 / 212

OPTIMIZATION OF FUNCTIONALS

Maximum entropy as functional optimization
• The entropy H assigns a scalar to a distribution → it is a functional!
• Problem: The entropy as a functional, e.g. on all distributions on R, is concave, but it is not differentiable; it is not even continuous.
• The solution for exponential families can be determined using variational calculus.

Functional optimization in machine learning
• We will be interested in problems of the form

min_q φ(q)   s.t. q ∈ Q

where Q is a parametric family.
• That means each element of Q is of the form q( • |θ), for θ ∈ T ⊂ Rd.
• The problem then reduces back to optimization in Rd:

min_θ φ(q( • |θ))   s.t. θ ∈ T

• We can apply gradient descent, Newton, etc.

Not examinable.Advanced Machine Learning 115 / 212

KULLBACK-LEIBLER DIVERGENCE

Recall
The information in observing X = x under a probability mass function P is

J_P(x) := log(1/P(x)) = − log P(x) .

Its expectation H(P) := E_P[J_P(X)] is the entropy of P.

The Kullback-Leibler divergence of P and Q is

D_KL(P‖Q) := E_P[J_Q(X)] − H(P) = ∑_x P(x) log[P(x)/Q(x)]

Entropy and KL divergence for densities
If p and q are probability densities, then

H(p) := −∫ p(x) log p(x) dx   and   D_KL(p‖q) := ∫ p(x) log[p(x)/q(x)] dx

are the differential entropy of p and the Kullback-Leibler divergence of p and q.

Be careful
• The differential entropy does not behave like the entropy (e.g. it can be negative).
• The KL divergence for densities has properties analogous to the mass function case.
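A quick numerical illustration for probability mass functions (the example distributions are arbitrary):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(entropy(p), kl_divergence(p, q), kl_divergence(q, p))
# Note the asymmetry: D_KL(p||q) != D_KL(q||p) in general.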

Advanced Machine Learning 116 / 212

VARIATIONAL INFERENCE WITH KL

Recall: VI optimization problem

q∗ := arg min_{q∈Q} φ(q, p∗)

We have to choose a cost function.

The term "variational inference" in machine learning typically implies that φ is a KL divergence,

q∗ := arg min_{q∈Q} D_KL(q, p∗)

Order of the arguments
Recall that D_KL is not symmetric, so

D_KL(q, p∗) ≠ D_KL(p∗, q)

Which order should we use?
• Recall that D_KL(p‖q) is an expectation with respect to p.
• D_KL(p∗‖q) emphasizes regions where the "true" model p∗ has high probability. That is what we should use if possible.
• We use VI because p∗ is intractable, so we can usually not compute expectations under it.
• We use the expectation D_KL(q, p∗) under the approximating simpler model instead.

We have to understand the implications of this choice.

Advanced Machine Learning 117 / 212

EXAMPLE

Approximating a Gaussian by a spherical Gaussian

What VI does / What VI would do if possible

[Figure (Bishop, 2006): contours of the target Gaussian p∗ (green) and of the spherical approximating Gaussian q (red) in the (z1, z2)-plane, one panel for each ordering of the KL arguments.]

D_KL(q‖p∗) = D_KL(red‖green)      D_KL(p∗‖q) = D_KL(green‖red)

Illustration: Bishop (2006)Advanced Machine Learning 118 / 212

EXAMPLE

Approximating a Gaussian mixture by a single Gaussian

What VI does / What VI would do if possible

[Figure (Bishop, 2006): the Gaussian mixture p∗ (blue contours) and single-Gaussian approximations q (red contours), one panel for each ordering of the KL arguments.]

D_KL(q‖p∗) = D_KL(red‖blue)      D_KL(p∗‖q) = D_KL(blue‖red)

Illustration: Bishop (2006)Advanced Machine Learning 119 / 212

VI FOR POSTERIOR DISTRIBUTIONS

Often: Target distribution is a posterior of a parameter or latent variable Z, given data x.

Basic approximation problem
If the posterior density is p∗(z) = p(z|x) = p(x|z) p(z) / p(x), then

q∗( • ) = arg min_{q∈Q} D_KL( q( • ) ‖ p( • |x) ) .

Transforming the objective function

D_KL( q( • ) ‖ p( • |x) ) = E[ log( q(Z) / p(Z|x) ) ] = E[log q(Z)] − E[log p(Z|x)]
                          = E[log q(Z)] − E[log p(Z, x)] + log p(x)

• The evidence p(x) is hard to compute (it is the integral of p(x|z)p(z) over z).
• It depends only on x, so it is an additive constant w.r.t. the optimization problem.
• Dropping it from the objective function does not change the location of the minimum.

F(q) := E[log q(Z)] − E[log p(Z, x)]

Advanced Machine Learning 120 / 212

VI FOR POSTERIOR DISTRIBUTIONS

Summary: VI approximation

min F(q)   s.t. q ∈ Q      where F(q) = E[log q(Z)] − E[log p(Z, x)]

Terminology
• The function F is called a free energy in statistical physics.
• Since there are different forms of free energies, various authors attach different adjectives (variational free energy, Helmholtz free energy, etc).
• Parts of the machine learning literature have renamed F, by maximizing the objective function −F and calling it an evidence lower bound (ELBO), since

−F(q) + D_KL( q ‖ p( • |x) ) = log p(x)   hence   e^{−F(q)} ≤ p(x) ,

and p(x) is the "evidence" in the Bayes equation.

Advanced Machine Learning 121 / 212

MEAN FIELD APPROXIMATION

Definition
A variational approximation

q∗ := arg min_{q∈Q} D_KL(q, p∗)

of a probability distribution p∗ on a d-dimensional space is called a mean field approximation if each element of Q is factorial,

q(z) = q1(z1) · · · qd(zd) .

In the previous example

[Figure: the spherical Gaussian approximation of the previous example, labeled "Mean field (spherical Gaussian)", next to an approximation that is "Not a mean field".]

Advanced Machine Learning 122 / 212

EXAMPLE: MEAN FIELD FOR THE POTTS MODEL

Model
We consider an MRF distribution for X1, . . . , Xn with values in {−1, +1}, given by

P(X1, . . . , Xn) = (1/Z(β)) exp(−β H(X1, . . . , Xn))   where   H = − ∑_{i,j=1}^n wij Xi Xj − ∑_{i=1}^n hi Xi

The second sum is the "external field"; wij and hi are constant weights.

Physicists call this a "Potts model with an external magnetic field".

Variational approximation
We choose Q as the family

Q := { ∏_{i=1}^n Q_{mi} | mi ∈ [−1, 1] }   where   Q_m(X) := (1 + m)/2 if X = 1 and (1 − m)/2 if X = −1.

Each factor is a Bernoulli((1 + m)/2), except that the range is {−1, 1} rather than {0, 1}.

Advanced Machine Learning 123 / 212

EXAMPLE: MEAN FIELD FOR THE POTTS MODEL

Optimization problem

min D_KL( ∏_{i=1}^n Q_{mi} ‖ P )   s.t. mi ∈ [−1, 1] for i = 1, . . . , n

Mean field solution
The mean field approximation is given by the parameter values mi satisfying the equations

mi = tanh( β ( ∑_{j=1}^n wij mj + hi ) ) .

That is: For given values of wij and hi, we have to solve for the values mi.
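A minimal sketch (not from the slides) that solves these equations by fixed-point iteration; the example weights and field are arbitrary:

import numpy as np

def mean_field(w, h, beta, n_iters=500, tol=1e-8):
    m = np.zeros(len(h))                        # initial guess m_i = 0
    for _ in range(n_iters):
        m_new = np.tanh(beta * (w @ m + h))     # apply the update equations
        if np.max(np.abs(m_new - m)) < tol:
            break
        m = m_new
    return m

# Example: a small chain-shaped MRF with a constant external field.
n = 10
w = np.zeros((n, n))
for i in range(n - 1):
    w[i, i + 1] = w[i + 1, i] = 1.0
m = mean_field(w, h=0.1 * np.ones(n), beta=0.5)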

Advanced Machine Learning 124 / 212

EXAMPLE: MEAN FIELD FOR THE POTTS MODEL

We have approximated the MRF P(X1, . . . , Xn) by

∏_{i=1}^n Bernoulli(mi)   satisfying   mi = tanh( β ( ∑_{j=1}^n wij mj + hi ) )

(where we interpret a 0 generated by the Bernoulli as a −1).

Interpretation
• In the MRF P, the random variables Xi interact.
• There is no interaction in the approximation.
• Instead, the effect of interactions is approximated by encoding them in the parameters.
• This is somewhat like a single effect ("field") acting on all variables simultaneously ("mean field").

How accurate is the approximation?
• In physics, P is used to model a ferromagnet with an external magnetic field.
• In this case, β is the inverse temperature.
• These systems exhibit a phenomenon called spontaneous magnetization at certain temperatures. The mean field solution predicts spontaneous magnetization, but at the wrong temperature values.

Advanced Machine Learning 125 / 212

DIRECTED GRAPHICAL MODELS:MIXTURES AND ADMIXTURES

NEXT: BAYESIAN MIXTURES AND ADMIXTURES

We will consider two variations on finite mixtures:
• Bayesian mixture models (mixtures with priors).
• Admixtures, in which the generation of each observation (e.g. document) can be influenced by several components (e.g. topics).
• One particular admixture model, called latent Dirichlet allocation, is one of the most successful machine learning models of the past ten years.

Advanced Machine Learning 127 / 212

FINITE MIXTURE AS A GRAPHICAL MODEL

π(x) = ∑_{k≤K} ck p(x|θk)

Sampling from this model
1. Fix ck and θk for k = 1, . . . , K.
2. Generate Zi ∼ Multinomial(c1, . . . , cK).
3. Generate Xi|Zi ∼ p( • |θZi).

As a graphical model

[Graphical model: c and θ (drawn in boxes, indicating they are not random) point to Z1, . . . , Zn, which point to X1, . . . , Xn.]
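A minimal sketch (not from the slides) of steps 1-3 for a one-dimensional Gaussian mixture; all parameter values are illustrative:

import numpy as np

def sample_finite_mixture(c, means, sigma, n, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    z = rng.choice(len(c), size=n, p=c)                     # step 2: Z_i ~ Multinomial(c)
    x = rng.normal(loc=np.asarray(means)[z], scale=sigma)   # step 3: X_i | Z_i ~ p(.|theta_{Z_i})
    return x, z

x, z = sample_finite_mixture(c=[0.5, 0.3, 0.2], means=[-3.0, 0.0, 4.0],
                             sigma=1.0, n=1000)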

Advanced Machine Learning 128 / 212

PLATE NOTATION

If variables are sampled repeatedly in a graphical model, we enclose these variables in a "plate".

[Graphical model: a node X inside a plate labeled n.]

This means: Draw n (conditionally) independent samples of X.

Finite mixture with plate notation

[Graphical model: the explicit model with c, θ, Z1, . . . , Zn, X1, . . . , Xn is equivalent to a plate version in which Z and X sit inside a plate labeled n, with c and θ outside.]

Advanced Machine Learning 129 / 212

BAYESIAN MIXTURE MODEL

Recall: Mixing distribution of an FMM

π(x) = ∑_{k=1}^K ck p(x|θk) = ∫_T p(x|θ) m(θ) dθ   with   m := ∑_{k=1}^K ck δ_{θk}

All parameters are summarized in the mixing distribution m.

Bayesian mixture model: Idea
In a Bayesian model, parameters are random variables. Here, that means a random mixing distribution:

M( . ) = ∑_{k=1}^K Ck δ_{Θk}( . )

Advanced Machine Learning 130 / 212

RANDOM MIXING DISTRIBUTION

How can we define a random distribution?
Since M is discrete with finitely many terms, we only have to generate the random variables Ck and Θk:

M( . ) = ∑_{k=1}^K Ck δ_{Θk}( . )

More precisely
Specifically, the term BMM implies that all priors are natural conjugate priors. That is:
• The mixture components p(x|θ) are an exponential family model.
• The prior on each Θk is a natural conjugate prior of p.
• The prior of the vector (C1, . . . , CK) is a Dirichlet distribution.

Explanation: Dirichlet distribution
• When we sample from a finite mixture, we choose a component k from a multinomial distribution with parameter vector (c1, . . . , cK).
• The conjugate prior of the multinomial is the Dirichlet distribution.

Advanced Machine Learning 131 / 212

BAYESIAN MIXTURE MODELS

Definition
A model of the form

π(x) = ∑_{k=1}^K Ck p(x|Θk) = ∫_T p(x|θ) M(θ) dθ

is called a Bayesian mixture model if p(x|θ) is an exponential family model and M a random mixing distribution, where:
• Θ1, . . . , ΘK ∼iid q( . |λ, y), where q is a natural conjugate prior for p.
• (C1, . . . , CK) is sampled from a K-dimensional Dirichlet distribution.

Advanced Machine Learning 132 / 212

BAYESIAN MIXTURE AS A GRAPHICAL MODEL

Sampling from a Bayesian mixture
1. Draw C = (C1, . . . , CK) from a Dirichlet prior.
2. Draw Θ1, . . . , ΘK ∼iid q, where q is the conjugate prior of p.
3. Draw Zi|C ∼ Multinomial(C).
4. Draw Xi|Zi, Θ ∼ p( • |ΘZi).

As a graphical model

[Graphical model: as for the finite mixture, but with C and Θ now random nodes feeding into Z and X inside a plate labeled n.]

Advanced Machine Learning 133 / 212

BAYESIAN MIXTURE: INFERENCE

Posterior distribution
The posterior density of a BMM under observations x1, . . . , xn is (up to normalization):

Π(c1:K, θ1:K | x1:n) ∝ [ ∏_{i=1}^n ( ∑_{k=1}^K ck p(xi|θk) ) ] · [ ∏_{k=1}^K q(θk|λ, y) ] · qDirichlet(c1:K)

The posterior is analytically intractable
• Thanks to conjugacy, we can evaluate each term of the posterior.
• However: Multiplying out the ∏_{i=1}^n ( ∑_{k=1}^K . . . ) factor shows that the posterior has K^n terms!
• Even for 10 clusters and 100 observations, that is impossible to compute.

Advanced Machine Learning 134 / 212

GIBBS SAMPLER FOR THE BMM

This Gibbs sampler is a bit harder to derive, so we skip the derivation and only look at the algorithm.

Recall: Bayesian mixture model
• Exponential family likelihood p(x|θk) for each cluster k = 1, . . . , K.
• Natural conjugate prior q for all θk.
• Dirichlet prior Dirichlet(α, g) on the mixture weights c1:K.

Assignment probabilities
Each step of the Gibbs sampler computes an assignment matrix

a = ( aik )_{i=1..n, k=1..K}   with   aik = Pr{ xi in cluster k } .

The entries are computed as they are in the EM algorithm:

aik = Ck p(xi|Θk) / ∑_{l=1}^K Cl p(xi|Θl)

In contrast to EM, the values Ck and Θk are random.

Advanced Machine Learning 135 / 212

GIBBS FOR BMM: ALGORITHM

In each iteration j, the algorithm cycles through these steps:

1. For each xi, sample an assignment (exactly as in EM)

   Z_i^j ∼ Multinomial(a_{i1}^j, . . . , a_{iK}^j)   where   a_{ik}^j = C_k^{j−1} p(xi|Θ_k^{j−1}) / ∑_{l=1}^K C_l^{j−1} p(xi|Θ_l^{j−1})

2. For each cluster k, sample a new value for Θ_k^j from the conjugate posterior Π(Θk) under the observations currently assigned to k:

   Θ_k^j ∼ Π( Θ | λ + ∑_{i=1}^n I{Z_i^j = k},  y + ∑_{i=1}^n I{Z_i^j = k} S(xi) )

   The first sum is the number of points currently assigned to k; the second aggregates the sufficient statistic S(xi) over cluster k.

3. Sample new cluster proportions C_{1:K}^j from the Dirichlet posterior (under all xi):

   C_{1:K}^j ∼ Dirichlet(α + n, g_{1:K}^j)   where   g_k^j = ( α · gk + ∑_{i=1}^n I{Z_i^j = k} ) / (α + n)

   Here α is the prior concentration, gk the prior expectation of Ck, ∑_i I{Z_i^j = k} the number of points currently assigned to k, and (α + n) is the normalization.
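A simplified sketch of this sampler for one concrete choice of exponential family: 1-d Gaussian clusters with known unit variance, a Normal(0, τ²) conjugate prior on the cluster means, and a symmetric Dirichlet prior on the weights. This is an illustration of the scheme above, not the slides' derivation; all names and hyperparameters are assumptions:

import numpy as np

def bmm_gibbs(x, K, n_iters=100, alpha=1.0, tau2=10.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    C = np.full(K, 1.0 / K)                  # mixture weights
    theta = rng.normal(0.0, 1.0, size=K)     # cluster means
    for _ in range(n_iters):
        # Step 1: sample assignments Z_i from the assignment probabilities a_ik.
        logits = -0.5 * (x[:, None] - theta[None, :]) ** 2 + np.log(C)[None, :]
        a = np.exp(logits - logits.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)
        Z = np.array([rng.choice(K, p=a[i]) for i in range(n)])
        # Step 2: sample each cluster mean from its conjugate Normal posterior.
        for k in range(K):
            nk = np.sum(Z == k)
            sk = np.sum(x[Z == k])
            var_post = 1.0 / (nk + 1.0 / tau2)
            theta[k] = rng.normal(var_post * sk, np.sqrt(var_post))
        # Step 3: sample the weights from the Dirichlet posterior.
        counts = np.bincount(Z, minlength=K)
        C = rng.dirichlet(alpha + counts)
    return C, theta, Z

rng_data = np.random.default_rng(1)
x = np.concatenate([rng_data.normal(-4, 1, 150), rng_data.normal(3, 1, 100)])
C, theta, Z = bmm_gibbs(x, K=2, n_iters=200)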

Advanced Machine Learning 136 / 212

COMPARISON: EM AND K-MEANS

The BMM Gibbs sampler looks very similar to the EM algorithm, with maximization steps (in EM) substituted by posterior sampling steps:

                 Representation of assignments        Parameters
EM               Assignment probabilities ai,1:K      aik-weighted MLE
K-means          mi = arg max_k(ai,1:K)               MLE for each cluster
Gibbs for BMM    mi ∼ Multinomial(ai,1:K)             Sample posterior for each cluster

Advanced Machine Learning 137 / 212

TOOLS: THE DIRICHLET DISTRIBUTION

THE DIRICHLET DISTRIBUTION

Recall: Probability simplex
The set of all probability distributions on K events is the simplex

∆K := { (c1, . . . , cK) ∈ R^K | ck ≥ 0 and ∑_k ck = 1 } .

[Figure: the simplex spanned by the unit vectors e1, e2, e3, with coordinates c1, c2, c3.]

Dirichlet distribution
The Dirichlet distribution is the distribution on ∆K with density

qDirichlet(c1:K | α, g1:K) := (1 / Z(α, g1:K)) exp( ∑_{k=1}^K (α gk − 1) log(ck) )

Parameters:
• g1:K ∈ ∆K: Mean parameter, i.e. E[c1:K] = g1:K. Note that g1:K is a probability distribution on K categories.
• α ∈ R+: Concentration. Larger α → sharper concentration around g1:K.

Advanced Machine Learning 139 / 212

THE DIRICHLET DISTRIBUTION

In all plots, g1:K = (1/3, 1/3, 1/3). Light colors = large density values.

[Figure: density plots (α = 1.8 and α = 10) and heat maps of the Dirichlet on the simplex:
α = 0.8 — large density values at the extreme points;
α = 1 — uniform distribution on ∆K;
α = 1.8 — the density peaks around its mean;
α = 10 — the peak sharpens with increasing α.]

Advanced Machine Learning 140 / 212

MULTINOMIAL-DIRICHLET MODEL

Model
The Dirichlet is the natural conjugate prior on the multinomial parameters. If we observe hk counts in category k, the posterior is

Π(c1:K | h1, . . . , hK) = qDirichlet( c1:K | α + n, (α g1 + h1, . . . , α gK + hK)/(α + n) )

where n = ∑_k hk is the total number of observations.

Illustration: One observation
Suppose K = 3 and we obtain a single observation in category 3.

[Figure: prior with its mean at the center of the simplex; posterior with the mean shifted towards the extreme point corresponding to k = 3, and increased concentration.]
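A small sketch of this conjugate update in the (concentration, mean) parameterization used here (the numbers reproduce the one-observation illustration; function names are assumptions):

import numpy as np

def dirichlet_posterior(alpha, g, counts):
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    alpha_post = alpha + n
    g_post = (alpha * np.asarray(g) + counts) / alpha_post
    return alpha_post, g_post

# Prior: mean at the center of the simplex; one observation in category 3.
alpha_post, g_post = dirichlet_posterior(alpha=3.0, g=[1/3, 1/3, 1/3],
                                          counts=[0, 0, 1])
print(alpha_post, g_post)   # 4.0, mean shifted towards the third corner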

Advanced Machine Learning 141 / 212

THE DIRICHLET IS SPECIAL

• Rule of thumb: In a Bayesian model, stochastic dependence between dimensions in the prior amplifies in the posterior.
• To keep inference feasible: Keep variables in the prior "as independent as possible".

Defining random probability distributions on K categories
• If (Θ1, . . . , ΘK) is a random probability distribution, the Θk cannot be independent.
• How do we define Θ1:K so that the components are as independent as possible?
• Idea: Start with independent variables X1, . . . , XK in (0, ∞). If we define

X̄k := Xk / ∑_{j=1}^K Xj

then (X̄1, . . . , X̄K) ∈ ∆K.

Dirichlet and gamma distributions
Suppose X1, . . . , XK are independent random variables. If

Xk ∼ Gamma(γk, 1)   then   (X̄1, . . . , X̄K) ∼ Dirichlet( ∑_k γk ; γ1/∑_j γj, . . . , γK/∑_j γj )
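A quick empirical check of this statement (shape values and sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
gamma_shapes = np.array([2.0, 3.0, 5.0])

# Normalize independent gamma variables to obtain points on the simplex.
x = rng.gamma(shape=gamma_shapes, scale=1.0, size=(100000, 3))
x_bar = x / x.sum(axis=1, keepdims=True)

# The empirical mean should match the Dirichlet mean gamma_k / sum_j gamma_j.
print(x_bar.mean(axis=0), gamma_shapes / gamma_shapes.sum())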

Advanced Machine Learning 142 / 212

THE DIRICHLET IS SPECIAL

Dependence in the prior
In general: Even if X1, . . . , XK are independent,

Xk / ∑_j Xj   and   ∑_j Xj   are stochastically dependent.

So: Components of the prior couple (1) through normalization and (2) through the latent variable ∑_j Xj.

Lukacs' theorem: The gamma distribution is special
If X and Y are independent random variables in (0, ∞) (and not constant), then

X/(X + Y) ⊥⊥ X + Y   if and only if   X and Y are gamma distributed with the same scale parameter.

Consequence: The Dirichlet is special
• In the Dirichlet, components couple only through the normalization constraint.
• Any other random probability defined by normalizing independent variables introduces more dependence.

Advanced Machine Learning 143 / 212

TEXT MODELS

Recall: Multinomial text clustering
We assume the corpus is generated by a multinomial mixture model of the form

π(H) = ∑_{k=1}^K ck P(H|θk) ,

where P(H|θk) is multinomial.
• A document is represented by a histogram H.
• Topics θ1, . . . , θK.
• θkj = Pr{ word j in topic k }.

Problem
Each document is generated by a single topic; that is a very restrictive assumption.

Advanced Machine Learning 144 / 212

SAMPLING DOCUMENTS

Parameters
Suppose we consider a corpus with K topics and a vocabulary of d words.
• φ ∈ ∆K: topic proportions (φk = Pr{ topic k }).
• θ1, . . . , θK ∈ ∆d: topic parameter vectors (θkj = Pr{ word j in topic k }).

Note: For random generation of documents, we assume that φ and the topic parameters θk are given (they are properties of the corpus). To train the model, they have to be learned from data.

Model 1: Multinomial mixture
To sample a document containing M words:
1. Sample a topic Z ∼ Multinomial(φ).
2. For i = 1, . . . , M: Sample word_i | Z ∼ Multinomial(θZ).

The entire document is sampled from the single topic Z.

Advanced Machine Learning 145 / 212

LATENT DIRICHLET ALLOCATION

Mixtures of topics
Whether we sample words or entire documents makes a big difference.
• When we sample from the multinomial mixture, we choose a topic at random, then sample the entire document from that topic.
• For several topics to be represented in the document, we have to sample each word individually (i.e. choose a new topic for each word).
• Problem: If we do that in the mixture above, every document has the same topic proportions.

Model 2: Admixture model
Each document is explained as a mixture of topics, with mixture weights C1:K.

Fix a matrix θ of size #topics × #words, where

θkj := probability that word j occurs under topic k

1. Sample topic proportions c1:K ∼ Dirichlet(φ).
2. For i = 1, . . . , M:
   2.1 Sample the topic of word i as Zi | C1:K ∼ Multinomial(C1:K).
   2.2 Sample word_i | Zi ∼ Multinomial(θZi).

This model is known as Latent Dirichlet Allocation (LDA).
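A minimal sketch (not from the slides) of this generative process for a single document; the toy vocabulary, topics and parameter names are assumptions:

import numpy as np

def sample_lda_document(phi, theta, M, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    c = rng.dirichlet(phi)                          # step 1: topic proportions
    words, topics = [], []
    for _ in range(M):
        z = rng.choice(len(c), p=c)                 # step 2.1: topic of this word
        w = rng.choice(theta.shape[1], p=theta[z])  # step 2.2: the word itself
        topics.append(z)
        words.append(w)
    return words, topics, c

# Toy corpus parameters: 2 topics over a 4-word vocabulary.
theta = np.array([[0.70, 0.20, 0.05, 0.05],
                  [0.05, 0.05, 0.30, 0.60]])
words, topics, c = sample_lda_document(phi=np.array([2.0, 2.0]), theta=theta, M=50)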

Advanced Machine Learning 146 / 212

COMPARISON: LDA AND BMM

Observation
LDA is almost a Bayesian mixture model: Both use multinomial components and a Dirichlet prior on the mixture weights. However, they are not identical.

Comparison

Bayesian MM                                   Admixture (LDA)
Sample c1:K ∼ Dirichlet(φ).                   Sample c1:K ∼ Dirichlet(φ).
Sample topic k ∼ Multinomial(c1:K).           For i = 1, . . . , M:
For i = 1, . . . , M:                            Sample topic ki ∼ Multinomial(c1:K).
  Sample word_i ∼ Multinomial(θk).               Sample word_i ∼ Multinomial(θki).

In admixtures:
• c1:K is generated at random, once for each document.
• Each word is sampled from its own topic.

What do we learn in LDA?
LDA explains each document by a separate parameter c1:K ∈ ∆K. That is, LDA models documents as topic proportions.

Advanced Machine Learning 147 / 212

LDA AS A GRAPHICAL MODEL

[Graphical models. Left, the Bayesian mixture (for M documents of N words each): α → C → Z → word, with Θ (one vector of length |vocabulary| per topic) also feeding into word; Z and word sit in a plate over the N words, inside a plate over the M documents. Right, LDA: the topic proportions C ∼ Dirichlet(α) are drawn once per document, the topic of each word Z ∼ Multinomial(C) once per word, and the observed word ∼ Multinomial(row Z of Θ), where Θ is a #topics × |vocabulary| matrix whose rows each sum to 1.]

Advanced Machine Learning 148 / 212

PRIOR ON TOPIC PROBABILITIES

• The parameter matrix Θ is of size #topics × |vocabulary|.
• Meaning: θki = probability that term i is observed under topic k.
• Note that the entries of Θ are non-negative and each row sums to 1.
• To learn Θ along with the other parameters, we add a Dirichlet prior with parameter η. The rows of Θ are drawn i.i.d. from this prior.

[Graphical model: as before, but Θ is now random, with the Dirichlet parameter η as its parent.]

Advanced Machine Learning 149 / 212

VARIATIONAL INFERENCE FOR LDA

Target distribution

Posterior L(Z, C, Θ | words, α, η)

Recall: Z|C ∼ Multinomial(C) and C, Θ1, . . . , ΘK ∼ Dirichlet.

Variational approximation

Q = { q(z, c, θ | λ, φ, γ) | λ, φ, γ }   where

q(z, c, θ | λ, φ, γ) := ∏_{k=1}^K p(θk|λ) · ∏_{m=1}^M ( q(cm|γm) ∏_{n=1}^N p(zmn|φmn) )

The θ- and c-factors are Dirichlet, the z-factors multinomial. We have introduced a new set of "variational" parameters λ, γ, φ.

Variational inference problem

min D_KL( q(z, c, θ | λ, φ, γ) ‖ p(z, c, θ | α, η) )   s.t. q ∈ Q

[Graphical model: LDA with the Dirichlet prior η on Θ, as on the previous slide.]

Advanced Machine Learning 150 / 212

VARIATIONAL INFERENCE FOR LDA

q(z, c, θ | λ, φ, γ) := ∏_{k=1}^K p(θk|λ) ∏_{m=1}^M ( q(cm|γm) ∏_{n=1}^N p(zmn|φmn) )

[Figure: the LDA graphical model next to the factorized variational approximation, in which Θ depends only on the "global" parameter λ, each Cm on its "local" parameter γm, and each Zmn on its local parameter φmn.]

Advanced Machine Learning 151 / 212

ITERATIVE SOLUTION

Algorithmic solution
We solve the minimization problem

min D_KL( q(z, c, θ | λ, φ, γ) ‖ p(z, c, θ | α, η) )   s.t. q ∈ Q

by applying gradient descent to the resulting free energy.

VI algorithm
It can be shown that gradient descent amounts to the following algorithm:

repeat (using the update equations on the next page)
• update the local parameters
• update the global parameters
until the free energy has converged

Advanced Machine Learning 152 / 212

UPDATE EQUATIONS

In iteration t of the algorithm:

Local updates

φ_mn^(t+1) := φ̃_mn / ∑_n φ̃_mn   where
φ̃_mn = exp( E_q[ log(Cm1), . . . , log(Cmd) | γ^(t) ] + E_q[ log(Θ_{1,wn}), . . . , log(Θ_{d,wn}) | λ^(t) ] )

γ^(t+1) := α + ∑_{n=1}^N φ_n^(t+1)

Global updates

λ_k^(t+1) = η + ∑_m ∑_n word_mn φ_mn^(t+1)

Advanced Machine Learning 153 / 212

EXAMPLE: MIXTURE OF TOPICS

LATENT DIRICHLET ALLOCATION

The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. "Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants an act every bit as important as our traditional areas of support in health, medical research, education and the social services," Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants. Lincoln Center's share will be $200,000 for its new building, which will house young artists and provide new public facilities. The Metropolitan Opera Co. and New York Philharmonic will receive $400,000 each. The Juilliard School, where music and the performing arts are taught, will get $250,000. The Hearst Foundation, a leading supporter of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000 donation, too.

Figure 8 (from the paper): An example article from the AP corpus. Each color codes a different factor from which the word is putatively generated.

From Blei, Ng, Jordan, "Latent Dirichlet Allocation", 2003Advanced Machine Learning 154 / 212

RESTRICTED BOLTZMANN MACHINES

BOLTZMANN MACHINE

Definition
A Markov random field distribution of variables X1, . . . , Xn with values in {0, 1} is called a Boltzmann machine if its joint law is

P(x1, . . . , xn) = (1/Z) exp( − ∑_{i<j≤n} wij xi xj − ∑_{i≤n} ci xi ) ,

where the wij are the edge weights of the MRF neighborhood graph, and c1, . . . , cn are scalar parameters.

Remarks
• The Markov blanket of Xi consists of those Xj with wij ≠ 0.
• For x ∈ {−1, 1}^n instead: a Potts model with an external magnetic field.
• This is an exponential family with sufficient statistics XiXj and Xi.
• As an exponential family, it is also a maximum entropy model.

Advanced Machine Learning 156 / 212

WEIGHT MATRIX

P(x1, . . . , xn) = (1/Z) exp( − ∑_{i<j≤n} wij xi xj − ∑_{i≤n} ci xi )

Matrix representation
We collect the parameters in a matrix W := (wij) and a vector c = (c1, . . . , cn), and write equivalently:

P(x1, . . . , xn) = (1/Z(W, c)) e^{−x^T W x − c^T x}

• The matrix W is the (weighted) adjacency matrix of the MRF neighborhood graph.
• Because the MRF is undirected, the matrix is symmetric.

Advanced Machine Learning 157 / 212

RESTRICTED BOLTZMANN MACHINE

With observations
If some vertices represent observation variables Yi:

P(x1, . . . , xn, y1, . . . , ym) = e^{−(x,y)^T W (x,y) − c^T x − c̃^T y} / Z(W, c, c̃)

Recall our hierarchical design approach
• Only permit a layered structure.
• Obvious grouping: One layer for X, one for Y.
• As before: No connections within layers.
• Since the graph is undirected, that makes it bipartite.

[Figure: a general Boltzmann machine over X1, . . . , X4 and Y1, . . . , Y4, next to the layered, bipartite version with an X-layer (X1, X2, X3) and a Y-layer (Y1, Y2, Y3).]

Advanced Machine Learning 158 / 212

RESTRICTED BOLTZMANN MACHINE

Bipartite graphs
A bipartite graph is a graph whose vertex set V can be subdivided into two sets A and B = V \ A such that every edge has one end in A and one end in B.

[Figure: a bipartite graph with layers X1, X2, X3 and Y1, Y2, Y3.]

Definition
A restricted Boltzmann machine (RBM) is a Boltzmann machine whose neighborhood graph is bipartite.

That defines two layers. We usually think of one of these layers as observed, and one as unobserved.

Advanced Machine Learning 159 / 212

BOLTZMANN MACHINES AND SIGMOIDS

Consider a probability of the form

P(x) = (1/Z) e^{−θx}   for x ∈ {0, 1} and a fixed θ ∈ R .

This is what the probability of a single variable in an RBM looks like, if the distribution factorizes.

P(x = 1) = e^{−θ} / (e^{−θ} + e^{0}) = 1 / (1 + e^{θ}) = σ(−θ)

where σ again denotes the sigmoid function.

Consequence
For random variables in {0, 1}:

P(x) = (1/Z) e^{−θx}   ⇔   X ∼ Bernoulli(σ(−θ))

Advanced Machine Learning 160 / 212

GIBBS SAMPLING BOLTZMANN MACHINES

Full conditionals: General case

P(X = x) = e^{x^T W x + c^T x} / Z(W, c)

P(Xi = 1 | x^{(i)}) = σ(W_i^T x + ci),   where x^{(i)} denotes all variables other than Xi.

Full conditionals: RBM
• Variables in the X-layer are conditionally independent given the Y-layer, and vice versa.
• Two groups of conditionals: X|Y and Y|X.
• This permits a blocked Gibbs sampler:

P(X = x | Y = y) = σ(W^T y + c)   and   P(Y = y | X = x) = σ(W^T x + c)   (componentwise)

[Figure: the bipartite RBM graph with layers X1, X2, X3 and Y1, Y2, Y3.]
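A minimal sketch of blocked Gibbs sampling in a binary RBM, following the componentwise sigmoid conditionals above; weight scale, sizes and names are illustrative assumptions:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rbm_blocked_gibbs(W, c_x, c_y, n_steps, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    nx, ny = W.shape
    x = rng.integers(0, 2, size=nx)
    y = rng.integers(0, 2, size=ny)
    for _ in range(n_steps):
        # Resample the whole X-layer given Y (conditionally independent units).
        x = (rng.uniform(size=nx) < sigmoid(W @ y + c_x)).astype(int)
        # Resample the whole Y-layer given X.
        y = (rng.uniform(size=ny) < sigmoid(W.T @ x + c_y)).astype(int)
    return x, y

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 4))
x, y = rbm_blocked_gibbs(W, c_x=np.zeros(6), c_y=np.zeros(4), n_steps=1000)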

Advanced Machine Learning 161 / 212

DEEP BELIEF NETWORKS

DIRECTED GRAPHICAL MODEL

[Figure: a deep directed graphical model. Several hidden layers feed into a layer Θ1, Θ2, . . . , ΘN, which generates the input layer X1, X2, . . . , XN (the data).]

Not examinable.Advanced Machine Learning 163 / 212

BAYESIAN VIEW

[Figure: the same network read as a Bayesian model: the layers down to and including Θ1, . . . , ΘN define the prior L(Θ1:N); L(X1:N | Θ1:N) is the likelihood.]

Not examinable.Advanced Machine Learning 164 / 212

INFERENCE PROBLEM

• Task: Given X1, . . . , XN, find L(Θ1:N | X1:N).
• Problem: Recall "explaining away".

[Figure: a collider X → Z ← Y.] Conditioning on a common outcome makes variables dependent.

• That means: Although each layer is conditionally independent given the previous one, conditioning on the subsequent one creates dependence within the layer.

[Figure: the deep directed network with layers Θ1, . . . , ΘN and X1, . . . , XN as before.]

Not examinable.Advanced Machine Learning 165 / 212

COMPLEMENTARY PRIOR: IDEA

The prior L(Θ1:N) can itself be represented as a directed (layered) graphical model.

Combining this prior with the likelihood stacks the two networks on top of each other:

[Figure: the layered prior network ending in Θ1, . . . , ΘN, stacked on top of the likelihood network that generates X1, . . . , XN.]

Not examinable.Advanced Machine Learning 166 / 212

COMPLEMENTARY PRIOR: IDEA

[Figure: the stacked network from the previous slide.]

Idea
• Invent a prior such that the dependencies in the prior and the likelihood "cancel out".
• Such a prior is called a complementary prior.
• We will see how to construct a complementary prior below.

Not examinable.Advanced Machine Learning 167 / 212

COMPLEMENTARY PRIOR

Consider a layered directed graphical model. Denote the vector of variables in the kth layer by X^(k), so

X^(k) = (X^(k)_1, . . . , X^(k)_N)

Observation
X^(1), X^(2), . . . , X^(K) is a Markov chain.

Complementary prior idea
• Suppose the Markov chain is reversible.
• Then all arrows can be reversed.
• Now: Inference is easy.

[Figure: the layered network with layers X^(1), X^(2), . . . , X^(K).]

Not examinable. See: G.E. Hinton, S. Osindero and Y.W. Teh. Neural Computation 18(7):1527-1554, 2006Advanced Machine Learning 168 / 212


BUILDING A COMPLEMENTARY PRIOR

• Find a reversible Markov chain with P^(1) = P∞ (i.e. start the chain in its invariant distribution P∞).
• Let pT be its transition kernel.
• Choose

P^(k+1)( • | X^(k) = x) = pT( • | x)

• Then P^(2) = . . . = P^(K) = P∞.
• Since the chain is reversible,

P^(k)( • | X^(k+1) = x) = pT( • | x)

and the edges flip.

[Figure: the layered network with layers X^(1), X^(2), . . . , X^(K), before and after reversing the edges.]

Not examinable. See: G.E. Hinton, S. Osindero and Y.W. Teh. Neural Computation 18(7):1527-1554, 2006Advanced Machine Learning 169 / 212


WHERE DO WE GET THE MARKOV CHAIN?

Start with an RBM with layers X and Y.

Blocked Gibbs sampling alternates between X and Y:

X^(1) → Y^(1) → X^(2) → Y^(2) → . . . → X^(K) → Y^(K)

"Roll off" this chain into a graphical model:

[Figure: a directed network whose successive layers are X^(1), Y^(1), . . . , X^(K), Y^(K).]

The Gibbs sampler for the RBM becomes the model for the directed network.

Not examinable. See: G.E. Hinton, S. Osindero and Y.W. Teh. Neural Computation 18(7):1527-1554, 2006Advanced Machine Learning 170 / 212

WHERE DO WE GET THE MARKOV CHAIN?

[Figure: the RBM with layers X and Y, and the rolled-off directed network with layers X^(1), Y^(1), . . . , X^(K), Y^(K).]

For K → ∞, we have Y^(∞) =d Y (equality in distribution). Now suppose we use L(Y^(∞)) as our prior distribution. Then:

Using an "infinitely deep" graphical model given by the Gibbs sampler for the RBM as a prior is equivalent to using the Y-layer of the RBM as a prior.

Not examinable.Advanced Machine Learning 171 / 212

DEEP BELIEF NETWORKS

[Figure: a deep belief network. The first two layers form an RBM (undirected); the remaining layers, ending in Θ1, . . . , ΘN and X1, . . . , XN, are directed.]

Not examinable.Advanced Machine Learning 172 / 212

DEEP BELIEF NETWORKS

Summary
• The RBM consisting of the first two layers is equivalent to an infinitely deep directed network representing its Gibbs sampler (the rolled-off Gibbs sampler on the previous slide).
• That network is infinitely deep because a draw from the actual RBM distribution corresponds to a Gibbs sampler that has reached its invariant distribution.
• When we draw from the RBM, the second layer is distributed according to the invariant distribution of the Markov chain given by the RBM Gibbs sampler.
• If the transition from each layer to the next in the directed part is given by the Markov chain's transition kernel pT, the RBM is a complementary prior.
• We can then reverse all edges between the Θ-layer and the X-layer.

[Figure: the deep belief network from the previous slide.]

Not examinable.Advanced Machine Learning 173 / 212

NEURAL NETWORKS: BASICS

THE MOST IMPORTANT BIT

A neural network represents a function f : Rd1 → Rd2 .

Advanced Machine Learning 175 / 212

READING NEURAL NETWORKS

f : R3 → R3 with input x = (x1, x2, x3)

[Figure: a network with input units x1, x2, x3, weights w11, . . . , w33, and output units φ1, φ2, φ3 computing f1(x) = φ1(w1^T x), f2(x) = φ2(w2^T x), f3(x) = φ3(w3^T x).]

f(x) = (f1(x), f2(x), f3(x))   with   fi(x) = φi( ∑_{j=1}^3 wij xj )

Advanced Machine Learning 176 / 212

[Figure: a layered network with input units x1, x2, . . . , xd and successive layers computing f^(1), f^(2), . . . , f^(K).]

f(x) = f^(K)( · · · f^(2)( f^(1)(x) ) ) = f^(K) ◦ . . . ◦ f^(1)(x)

Advanced Machine Learning 177 / 212

NEURAL NETWORKS

Layers
• Each function f^(k) is of the form f^(k) : R^{dk} → R^{dk+1}.
• dk is the number of nodes in the kth layer. It is also called the width of the layer.
• We mostly assume for simplicity: d1 = . . . = dK =: d.

Layered feed-forward network
A layered feed-forward network is one organized into successive layers as on the previous slide,

f(x) = f^(K) ◦ . . . ◦ f^(1)(x)

Each layer represents a function f^(k). These functions are of the form

f^(k)( • ) = ( φ^(k)_1(⟨w^(k)_1, •⟩), . . . , φ^(k)_d(⟨w^(k)_d, •⟩) )

where each φ is typically one of: σ(x) (sigmoid), I{±x > τ} (threshold), c (constant), or x (linear).
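A minimal sketch of such a layered feed-forward network with sigmoid units (illustrative only; the weights are random and the names are assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def feed_forward(x, weight_matrices, phi=sigmoid):
    # f(x) = f^(K) o ... o f^(1)(x); layer k computes phi(W^(k) x) componentwise.
    for W in weight_matrices:
        x = phi(W @ x)
    return x

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)) for _ in range(4)]   # K = 4 layers, width d = 3
y = feed_forward(np.array([1.0, -0.5, 2.0]), weights)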

Advanced Machine Learning 178 / 212

HIDDEN LAYERS AND NONLINEAR FUNCTIONS

Hidden units
• Any nodes (or "units") in the network that are neither input nor output nodes are called hidden.
• Every network has an input layer and an output layer.
• If there are any additional layers (which hence consist of hidden units), they are called hidden layers.

Linear and nonlinear networks
• If a network has no hidden units, then

fi(x) = φi(⟨wi, x⟩)

That means: f is a linear function, except perhaps for the final application of φ.
• For example: In a classification problem, a two-layer network can only represent linear decision boundaries.
• Networks with at least one hidden layer can represent nonlinear decision surfaces.

Advanced Machine Learning 179 / 212

TWO VS THREE LAYERS

[Excerpt (shown on the slide) from R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Ch. 6:]

While we can be confident that a complete set of functions, such as all polynomials, can represent any function, it is nevertheless a fact that a single functional form also suffices, so long as each component has appropriate variable parameters. In the absence of information suggesting otherwise, we generally use a single functional form for the transfer functions.

While these latter constructions show that any desired function can be implemented by a three-layer network, they are not particularly practical, because for most problems we know ahead of time neither the number of hidden units required, nor the proper weight values. Even if there were a constructive proof, it would be of little use in pattern recognition since we do not know the desired function anyway — it is related to the training patterns in a very complicated way. All in all, then, these results on the expressive power of networks give us confidence we are on the right track, but shed little practical light on the problems of designing and training neural networks — their main benefit for pattern recognition (Fig. 6.3).

[Figure 6.3: Whereas a two-layer network classifier can only implement a linear decision boundary, given an adequate number of hidden units, three-, four- and higher-layer networks can implement arbitrary decision boundaries. The decision regions need not be convex, nor simply connected. The panels contrast a two-layer network (linear boundary between regions R1 and R2 in the (x1, x2)-plane) with a three-layer network (nonlinear, possibly disconnected regions).]

Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001Advanced Machine Learning 180 / 212

THE XOR PROBLEM

[Figure 6.1 (Duda, Hart & Stork): The two-bit parity or exclusive-OR problem can be solved by a three-layer network. At the bottom is the two-dimensional feature space x1 − x2, and the four patterns to be classified. The three-layer network is shown in the middle. The input units are linear and merely distribute their (feature) values through multiplicative weights to the hidden units. The hidden and output units here are linear threshold units, each of which forms the linear sum of its inputs times their associated weights, and emits a +1 if this sum is greater than or equal to 0, and −1 otherwise. Positive ("excitatory") weights are denoted by solid lines, negative ("inhibitory") weights by dashed lines; the weight magnitude is indicated by the relative thickness, and is labeled. The single output unit sums the weighted signals from the hidden units (and bias) and emits a +1 if that sum is greater than or equal to 0 and a −1 otherwise. Within each unit we show a graph of its input-output or transfer function — f(net) vs. net. This function is linear for the input units, a constant for the bias, and a step or sign function elsewhere. We say that this network has a 2-2-1 fully connected topology, describing the number of units (other than the bias) in successive layers.]

Solution regions we would like to represent / Neural network representation

• Two ridges at different locations are subtracted from each other.
• That generates a region bounded on both sides.
• A linear classifier cannot represent this decision region.
• Note that this requires at least one hidden layer.
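A small illustration in Python: a 2-2-1 threshold network that solves XOR. The weights below are assumptions chosen to reproduce the XOR truth table, loosely following the 2-2-1 topology of the figure, not a transcription of its exact values:

def step(a):
    # Linear threshold unit: emits +1 if the net input is >= 0, else -1.
    return 1 if a >= 0 else -1

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)                   # OR-like hidden unit
    h2 = step(x1 + x2 - 1.5)                   # AND-like hidden unit
    return step(0.7 * h1 - 0.4 * h2 - 1.0)     # output: +1 exactly for XOR inputs

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))             # -1, +1, +1, -1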

Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001Advanced Machine Learning 181 / 212

6.2. FEEDFORWARD OPERATION AND CLASSIFICATION 9

input feature xi. Each hidden unit emits a nonlinear function Ξ of its total input; theoutput unit merely emits the sum of the contributions of the hidden units.

Unfortunately, the relationship of Kolmogorov’s theorem to practical neural net-works is a bit tenuous, for several reasons. In particular, the functions Ξj and ψij

are not the simple weighted sums passed through nonlinearities favored in neural net-works. In fact those functions can be extremely complex; they are not smooth, andindeed for subtle mathematical reasons they cannot be smooth. As we shall soonsee, smoothness is important for gradient descent learning. Most importantly, Kol-mogorov’s Theorem tells us very little about how to find the nonlinear functions basedon data — the central problem in network based pattern recognition.

A more intuitive proof of the universal expressive power of three-layer nets is in-spired by Fourier’s Theorem that any continuous function g(x) can be approximatedarbitrarily closely by a (possibly infinite) sum of harmonic functions (Problem 2). Onecan imagine networks whose hidden units implement such harmonic functions. Properhidden-to-output weights related to the coefficients in a Fourier synthesis would thenenable the full network to implement the desired function. Informally speaking, weneed not build up harmonic functions for Fourier-like synthesis of a desired function.Instead a sufficiently large number of “bumps” at different input locations, of differentamplitude and sign, can be put together to give our desired function. Such localizedbumps might be implemented in a number of ways, for instance by sigmoidal transferfunctions grouped appropriately (Fig. 6.2). The Fourier analogy and bump construc-tions are conceptual tools, they do not explain the way networks in fact function. Inshort, this is not how neural networks “work” — we never find that through train-ing (Sect. 6.3) simple networks build a Fourier-like representation, or learn to groupsigmoids to get component bumps.


Figure 6.2: A 2-4-1 network (with bias) along with the response functions at different units; each hidden and output unit has sigmoidal transfer function f(·). In the case shown, the hidden unit outputs are paired in opposition, thereby producing a “bump” at the output unit. Given a sufficiently large number of hidden units, any continuous function from input to output can be approximated arbitrarily well by such a network.

Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001
Advanced Machine Learning 182 / 212

FEATURE EXTRACTION

Features
• Raw measurement data is typically not used directly as input for a learning algorithm. Some form of preprocessing is applied first.
• We can think of this preprocessing as a function, e.g.

F : raw data space −→ R^d

(R^d is only an example, but a very common one.)
• If the raw measurements are m1, . . . , mN, the data points which are fed into the learning algorithm are the images xn := F(mn).

Terminology
• F is called a feature map.
• Its dimensions (the dimensions of its range space) are called features.
• The preprocessing step (= application of F to the raw data) is called feature extraction.
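To make this concrete, here is a small, hypothetical sketch (not from the course) of a feature map F in Python; the choice of features and the image size are made up for illustration.

import numpy as np

def feature_map(raw_image):
    """Hypothetical feature map F : raw data space -> R^3.

    Maps a 2-D grayscale image (raw measurement) to three simple features:
    mean intensity, intensity variance, and a crude edge-strength measure.
    """
    img = np.asarray(raw_image, dtype=float)
    mean_intensity = img.mean()
    variance = img.var()
    # crude edge strength: mean absolute difference between neighboring pixels
    edge_strength = np.abs(np.diff(img, axis=1)).mean()
    return np.array([mean_intensity, variance, edge_strength])

# raw measurements m_1, ..., m_N  ->  data points x_n := F(m_n)
raw_measurements = [np.random.rand(8, 8) for _ in range(5)]
X = np.stack([feature_map(m) for m in raw_measurements])
print(X.shape)  # (5, 3): five data points in R^3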

Advanced Machine Learning 183 / 212

EXAMPLE PROCESSING PIPELINE

This is what a typical processing pipeline for a supervised learning problem might look like.

Raw data (measurements)
→ Feature extraction (preprocessing)
→ Working data
→ Mark patterns
→ Split into training data (patterns marked) and test data (patterns marked)
→ Training (calibration) on the training data, yielding a trained model
→ Apply the trained model on the test data
→ Error estimate

Advanced Machine Learning 184 / 212

FEATURE EXTRACTION VS LEARNING

Where does learning start?
• It is often a matter of definition where feature extraction stops and learning starts.
• If we have a perfect feature extractor, learning is trivial.
• For example:
  • Consider a classification problem with two classes.
  • Suppose the feature extractor maps the raw data measurements of class 1 to a single point, and all data points in class 2 to a single, distinct point.
  • Then classification is trivial.
  • That is of course what the classifier is supposed to do in the end (e.g. map to the points 0 and 1).

Multi-layer networks and feature extraction
• An interesting aspect of multi-layer neural networks is that their early layers can be interpreted as feature extraction.
• For certain types of problems (e.g. computer vision), features were long “hand-tuned” by humans.
• Features extracted by neural networks give much better results.
• Several important problems, such as object recognition and face recognition, have basically been solved in this way.

Advanced Machine Learning 185 / 212

DEEP NETWORKS AS FEATURE EXTRACTORS

• The network on the right is a classifier f : R^d → {0, 1}.
• Suppose we subdivide the network into the first K − 1 layers and the final layer, by defining

F(x) := f (K−1) ◦ . . . ◦ f (1)(x)

• The entire network is then

f (x) = f (K) ◦ F(x)

• The function f (K) is a two-class logistic regression classifier,

f (K)( • ) = σ(⟨w(K), •⟩)

• We can hence think of f as a feature extraction F followed by linear classification f (K).

[Figure: a feed-forward network with inputs x1, . . . , xd; the lower layers implement f (1), . . . , f (K−1), and the final layer implements f (K).]
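As a hedged illustration (not code from the course), the split into a feature extractor F and a logistic-regression output layer f (K) could look like this in Python; the layer sizes, activations, and weights are arbitrary assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical weights of a trained 3-layer network with d = 4 inputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # layer f^(1)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)   # layer f^(2)
wK, bK = rng.normal(size=3), 0.0                # final layer f^(K)

def F(x):
    """Feature extraction: all layers except the last."""
    z = np.tanh(W1 @ x + b1)      # f^(1)
    z = np.tanh(W2 @ z + b2)      # f^(2)
    return z

def f(x):
    """Entire network: logistic regression f^(K) applied to the features F(x)."""
    return sigmoid(wK @ F(x) + bK)

x = rng.normal(size=4)
print(F(x))   # extracted features
print(f(x))   # class-1 probability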

Advanced Machine Learning 186 / 212

A SIMPLE EXAMPLE

[Embedded illustration (Duda, Hart & Stork, Figure 6.14): the top images show sample training patterns used to train a 64-2-3 sigmoidal network for classifying three characters; the bottom images show the input-to-hidden weights (represented as patterns) at the two hidden units after training. These learned weights describe feature groupings useful for the classification task; in large networks, such patterns of learned weights may be difficult to interpret in this way. The accompanying network diagram shows inputs x1, . . . , x64, hidden units h1, h2 (with weights w11, . . . , w64,2), and outputs f1, f2, f3.]

• Problem: Classify characters into three classes (E, F and L).
• Each character is given as an 8 × 8 = 64 pixel image.
• Neural network: 64 input units (= pixels), 2 hidden units, and 3 binary output units, where fi(x) = 1 means the image is in class i.
• Each hidden unit has 64 input weights, one per pixel. The weight values can be plotted as 8 × 8 images.

Advanced Machine Learning 187 / 212

A SIMPLE EXAMPLE

[Same embedded illustration (Duda, Hart & Stork, Figure 6.14) as on the previous slide.]

[Figure panels: training data (with random noise), and the weight values of h1 and h2 plotted as images.]

• Dark regions = large weight values.
• Note the weights emphasize regions that distinguish the characters.
• We can think of each weight (= each pixel) as a feature.
• The features with large weights for h1 distinguish {E, F} from L.
• The features for h2 distinguish {E, L} from F.

Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley 2001
Advanced Machine Learning 188 / 212

WIDTH VS DEPTH

A neural network represents a (typically) complicated function f by simple functions φ_i^(k).

What functions can be represented?
A well-known result in approximation theory says: Every continuous function f : [0, 1]^d → R can be represented in the form

f (x) = ∑_{j=1}^{2d+1} ξ_j ( ∑_{i=1}^{d} τ_ij(x_i) )

where ξ_j and τ_ij are functions R → R. A similar result shows one can approximate f to arbitrary precision using specifically sigmoids, as

f (x) ≈ ∑_{j=1}^{M} w_j^(2) σ ( ∑_{i=1}^{d} w_ij^(1) x_i + c_j )

for some finite M and constants c_j.
Note the representations above can both be written as neural networks with three layers (i.e. with one hidden layer).
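To make the second representation concrete, here is a small, self-contained sketch (not from the slides) that fits a one-hidden-layer sum of sigmoids to a 1-D target function by least squares; the target function, the number of hidden units M, and the random placement of the sigmoids are arbitrary choices.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Target: a continuous function on [0, 1].
x = np.linspace(0.0, 1.0, 200)
f_target = np.sin(2 * np.pi * x) + 0.5 * x

# One hidden layer with M sigmoid units: f(x) ≈ sum_j w2_j * sigmoid(w1_j * x + c_j)
M = 20
rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 1.0, size=M)
w1 = rng.normal(scale=30.0, size=M)     # input-to-hidden weights
c = -w1 * centers                       # biases: place each sigmoid's transition near a random location
H = sigmoid(np.outer(x, w1) + c)        # hidden-unit activations, shape (200, M)

# Fit the hidden-to-output weights w2 by least squares.
w2, *_ = np.linalg.lstsq(H, f_target, rcond=None)
approx = H @ w2
print("max abs error:", np.max(np.abs(approx - f_target)))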

Advanced Machine Learning 189 / 212

WIDTH VS DEPTH

Depth rather than width
• The representations above can achieve arbitrary precision with a single hidden layer (roughly: a three-layer neural network can represent any continuous function).
• In the first representation, ξ_j and τ_ij are “simpler” than f because they map R → R.
• In the second representation, the functions are more specific (sigmoids), and we typically need more of them (M is large).
• That means: The price of precision is many hidden units, i.e. the network grows wide.
• Recent years have shown: We can obtain very good results by limiting layer width, and instead increasing depth (= number of layers).
• There is no theory that properly explains this yet.

Limiting width
• Limiting layer width means we limit the degrees of freedom of each function f (k).
• That is a notion of parsimony.
• Again: There seem to be a lot of interesting questions to study here, but so far, we have no answers.

Advanced Machine Learning 190 / 212

EXAMPLE: AUTOENCODERS

An example of the effect of layer width are autoencoders.
• An autoencoder is a neural network that is trained on its own input: If the network has weights W and represents a function f_W, training solves the optimization problem

min_W ‖x − f_W(x)‖²

or something similar for a different norm.
• That seems pointless at first glance: The network tries to approximate the identity function using its (possibly nonlinear) component functions.
• However: If the layers in the middle have much fewer nodes than those at the top and bottom, the network learns to compress the input.
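A minimal sketch of this idea (not course code): an autoencoder with a narrow middle layer, trained by gradient descent on the reconstruction error ‖x − f_W(x)‖². The layer sizes, learning rate, and data are made up, and the layers are kept linear for brevity.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                   # toy data: 500 points in R^20

d_in, d_mid = 20, 3                              # narrow middle layer -> compression
W1 = rng.normal(scale=0.1, size=(d_mid, d_in))   # encoder weights
W2 = rng.normal(scale=0.1, size=(d_in, d_mid))   # decoder weights

lr = 1e-3
for step in range(2000):
    Z = X @ W1.T                 # code layer (here: a linear encoder)
    X_hat = Z @ W2.T             # reconstruction f_W(x)
    R = X_hat - X                # residual
    # gradients of the squared reconstruction error
    grad_W2 = R.T @ Z / len(X)
    grad_W1 = (R @ W2).T @ X / len(X)
    W1 -= lr * grad_W1
    W2 -= lr * grad_W2

print("mean reconstruction error:",
      np.mean(np.sum((X @ W1.T @ W2.T - X) ** 2, axis=1)))

With purely linear layers and a squared error this essentially recovers the leading principal subspace; in practice the layers would be nonlinear.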

Advanced Machine Learning 191 / 212

AUTOENCODERS

x → f (1) → f (2) → f (3) → f (x) ≈ x      (layers have the same width: no effect)
x → f (1) → f (2) → f (3) → f (x) ≈ x      (narrow middle layers: compression effect)

• Train the network on many images.
• Once trained: Input an image x.
• Store x′ := f (2)(x). Note x′ has fewer dimensions than x → compression.
• To decompress x′: Input it into f (3) and apply the remaining layers of the network → reconstruction f (x) ≈ x of x.

Advanced Machine Learning 192 / 212

AUTOENCODERS

[Figure: a deep autoencoder built from layers of size 2000–1000–500–30. Pretraining: each pair of adjacent layers is trained as an RBM, giving weight matrices W1, . . . , W4. Unrolling: the stack is unrolled into an encoder (2000 → 1000 → 500 → 30 code layer) and a mirrored decoder (30 → 500 → 1000 → 2000), whose weights are initialized as the transposes W4^T, . . . , W1^T. Fine-tuning: the whole encoder–decoder network is then trained further, adjusting each weight matrix by a small correction (W + ε).]

Illustration: K. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press 2012
Advanced Machine Learning 193 / 212

TRAINING NEURAL NETWORKS

Error measure
• We assume each training data point x comes with a label y.
• We specify an error measure D that compares y to a prediction ŷ.

Typical error measures
• Classification problem (ŷ a predicted class probability):

D(ŷ, y) := −y log ŷ

• Regression problem:

D(ŷ, y) := ‖y − ŷ‖²

Training as an optimization problem
• Given: Training data (x1, y1), . . . , (xN, yN) with labels yi.
• We specify an error measure D that compares y to a prediction, and training minimizes the total error

J(w) := ∑_{i=1}^{N} D(f_w(xi), yi)
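As a small illustration (assuming a squared-error D and a generic model f_w; none of these names come from the course code), the objective is just a sum of per-example errors:

import numpy as np

def D(y_hat, y):
    """Regression error measure: squared Euclidean distance."""
    return np.sum((y - y_hat) ** 2)

def J(w, f, data):
    """Training objective: sum of D(f_w(x_i), y_i) over the data set."""
    return sum(D(f(w, x), y) for x, y in data)

# Toy example: a linear "network" f_w(x) = w @ x on random data.
rng = np.random.default_rng(0)
data = [(rng.normal(size=3), rng.normal()) for _ in range(10)]
w = np.zeros(3)
print(J(w, lambda w, x: w @ x, data))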

Advanced Machine Learning 194 / 212

BACKPROPAGATION

Neural network training optimization problem

min_w J(w)

The application of gradient descent to this problem is called backpropagation.

Backpropagation is gradient descent applied to J(w) in a feed-forward network.

Deriving backpropagation
• We have to evaluate the derivative ∇w J(w).
• Since J is additive over training points, J(w) = ∑_n Jn(w), it suffices to derive ∇w Jn(w).

Advanced Machine Learning 195 / 212

Recall from calculus: Chain rule
Consider a composition of functions f ◦ g(x) = f(g(x)). Then

d(f ◦ g)/dx = (df/dg) · (dg/dx)

If the derivatives of f and g are f′ and g′, that means: d(f ◦ g)/dx (x) = f′(g(x)) · g′(x)

Neural network
Let w(k) denote the weights in layer k. The function represented by the network is

f_w(x) = f (K)_w ◦ · · · ◦ f (1)_w (x) = f (K)_w(K) ◦ · · · ◦ f (1)_w(1) (x)

To solve the optimization problem, we have to compute derivatives of the form

(d/dw) D(f_w(xn), yn) = (dD( • , yn)/df_w) · (df_w/dw)

Advanced Machine Learning 196 / 212

CONSIDERING THE DERIVATIVES

• We will compute the derivatives layer by layer.
• Suppose we are only interested in the weights of layer k, and keep all other weights fixed. The function f represented by the network is then

f_w(k)(x) = f (K) ◦ · · · ◦ f (k+1) ◦ f (k)_w(k) ◦ f (k−1) ◦ · · · ◦ f (1)(x)

• The first k − 1 layers enter only as the function value of x, so we define

z(k) := f (k−1) ◦ · · · ◦ f (1)(x)

and get

f_w(k)(x) = f (K) ◦ · · · ◦ f (k+1) ◦ f (k)_w(k)(z(k))

• If we differentiate with respect to w(k), the chain rule gives

(d/dw(k)) f_w(k)(x) = (df (K)/df (K−1)) · · · (df (k+1)/df (k)) · (df (k)/dw(k))

Not examinable.
Advanced Machine Learning 197 / 212

WITHIN A SINGLE LAYER

• Each f (k) is a vector-valued function f (k) : R^(d_k) → R^(d_(k+1)).
• It is parametrized by the weights w(k) of the kth layer and takes an input vector z ∈ R^(d_k).
• We write f (k)(z, w(k)).

Layer-wise derivative
Since f (k+1) and f (k) are vector-valued, we get a Jacobian matrix

df (k+1)/df (k) = ( ∂f (k+1)_i / ∂f (k)_j ) for i = 1, . . . , d_(k+1) and j = 1, . . . , d_k   =: ∆(k)(z, w(k+1))

• ∆(k) is a matrix of size d_(k+1) × d_k.
• The derivatives in the matrix quantify how f (k+1) reacts to changes in the argument of f (k) if the weights w(k+1) and w(k) of both functions are fixed.

Not examinable.
Advanced Machine Learning 198 / 212

BACKPROPAGATION ALGORITHM

Let w(1), . . . , w(K) be the current settings of the layer weights. These have either been computed in the previous iteration, or (in the first iteration) are initialized at random.

Step 1: Forward pass
We start with an input vector x and compute

z(k) := f (k) ◦ · · · ◦ f (1)(x)

for all layers k.

Step 2: Backward pass
• Start with the last layer. Update the weights w(K) by performing a gradient step on

D( f (K)(z(K−1), w(K)), y )

regarded as a function of w(K) (so z(K−1) and y are fixed). Denote the updated weights w̃(K).
• Move backwards one layer at a time. At layer k, we have already computed the updates w̃(K), . . . , w̃(k+1). Update w(k) by a gradient step, where the derivative is computed as

∆(K−1)(z(K−1), w̃(K)) · . . . · ∆(k)(z(k), w̃(k+1)) · (df (k)/dw(k))(z(k−1), w(k))

On reaching layer 1, go back to Step 1 and recompute the z(k) using the updated weights.
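For concreteness, here is a compact sketch (not the course's code) of one forward and one backward pass for a small network with tanh layers and a squared-error D; the layer sizes, activations, and learning rate are arbitrary, and the top layers are updated before moving down, as in Step 2.

import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]                       # units per layer (input, two hidden, output)
W = [rng.normal(scale=0.5, size=(sizes[k + 1], sizes[k])) for k in range(3)]

x = rng.normal(size=sizes[0])              # one training point
y = np.array([0.7])                        # its label
lr = 0.1

# Step 1: forward pass -- store z(k) = f(k) ∘ ... ∘ f(1)(x) for all layers
z = [x]
for k in range(3):
    z.append(np.tanh(W[k] @ z[-1]))

# Step 2: backward pass on D(f_w(x), y) = ||y - f_w(x)||^2
delta = 2 * (z[-1] - y) * (1 - z[-1] ** 2)     # error signal at the output layer
for k in reversed(range(3)):
    grad_Wk = np.outer(delta, z[k])            # derivative with respect to W[k]
    W[k] -= lr * grad_Wk                       # gradient step on layer k
    if k > 0:
        # propagate the error signal through the (already updated) layer k
        delta = (W[k].T @ delta) * (1 - z[k] ** 2)

new_out = np.tanh(W[2] @ np.tanh(W[1] @ np.tanh(W[0] @ x)))
print("loss after one backward pass:", float(np.sum((y - new_out) ** 2)))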

Not examinable.
Advanced Machine Learning 199 / 212

SUMMARY: BACKPROPAGATION

• Backpropagation is a gradient descent method for the optimization problem

min_w J(w) = ∑_{i=1}^{N} D(f_w(xi), yi)

D must be chosen such that J is additive over data points.

• It alternates between forward passes that update the layer-wise function values z(k) given the current weights, and backward passes that update the weights using the current z(k).
• The layered architecture means that (1) we can compute each z(k) from z(k−1), and (2) we can use the weight updates computed in layers K, . . . , k + 1 to update the weights in layer k.

Advanced Machine Learning 200 / 212

MACHINE ARITHMETIC AND PSEUDORANDOM NUMBERS

BINARY REPRESENTATION OF NUMBERS

Binary numbers
Our usual (decimal) representation of integers represents a number x ∈ N0 by digits d_0, d_1, . . . ∈ {0, . . . , 9} as

[x]_10 = d_i · 10^i + d_(i−1) · 10^(i−1) + . . . + d_0 · 10^0

where i is the largest integer with 10^i ≤ x.
The binary representation of x is similarly

[x]_2 = b_j · 2^j + b_(j−1) · 2^(j−1) + . . . + b_0 · 2^0

where b_0, b_1, . . . ∈ {0, 1} and j is the largest integer with 2^j ≤ x.

Non-integer numbers
Binary numbers can have fractional digits just as decimal numbers. The post-radix digits correspond to inverse powers of two:

[10.125]_10 = [1010.001]_2 = 1 · 2^3 + 0 · 2^2 + 1 · 2^1 + 0 · 2^0 + 0 · (1/2) + 0 · (1/2^2) + 1 · (1/2^3)

Numbers that look “simple” may become more complicated in binary representation:

[0.1]_10 = [0.000110011001100110011 . . .]_2

where the block 0011 repeats forever.
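This is easy to check in Python (a small illustration, not part of the slides):

# 0.1 has no finite binary expansion, so it is stored as a rounded value:
print(f"{0.1:.20f}")               # 0.10000000000000000555...
print(0.1 + 0.2 == 0.3)            # False: both operands are already rounded
# The exact fraction actually stored for 0.1:
print((0.1).as_integer_ratio())    # (3602879701896397, 36028797018963968)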

Not examinable.
Advanced Machine Learning 202 / 212

FLOATING POINT ARITHMETIC

Basic FP representation

x = (−1)^s · 2^e · m

where:
• s, the sign, is a single bit, s ∈ {0, 1}.
• e, the exponent, is a (binary) integer e ∈ {−(2^(n−1) + 1), . . . , 2^(n−1)}.
• m, the mantissa, is a (binary) integer m ∈ {0, . . . , 2^k − 1}.

There are 2^n possible values for e, and 2^k possible values for m.

The IEEE 754 standard
• The standard floating point system is specified by an industry standard called IEEE 754.
• This standard is implemented by all (well-written) modern software.
• The standard was first published in 1985. There were no proper, widely implemented rules for numerical computation before then.

Not examinable.
Advanced Machine Learning 203 / 212

SINGLE AND DOUBLE PRECISION

Single precision numbers (in IEEE 754)

s | e1 · · · e8 | m1 · · · m23      (bit positions 0, 1–8, 9–31)

• A single precision number is represented by 32 bits.
• It is customary to enumerate the bits starting with 0, as 0, . . . , 31.
• 8 bits are invested in the exponent, 1 bit is needed for the sign; the remaining 23 bits form the mantissa.

Double precision numbers (in IEEE 754)

s | e1 · · · e11 | m1 · · · m52      (bit positions 0, 1–11, 12–63)

• A double precision number is represented by 64 bits.
• 1 bit for the sign, 11 for the exponent, 52 for the mantissa.
• This is the FP number format used in most numerical computations.
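As an aside (not from the slides), the bit layout of a double precision number can be inspected directly; the helper function below is hypothetical:

import struct

def double_bits(x):
    """Return the 64-bit IEEE 754 representation of a Python float as bit strings."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = f"{as_int:064b}"
    return bits[0], bits[1:12], bits[12:]   # sign, exponent, mantissa

sign, exponent, mantissa = double_bits(-6.25)
print(sign)       # '1' (negative)
print(exponent)   # the 11 exponent bits
print(mantissa)   # the 52 mantissa bits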

Not examinable.
Advanced Machine Learning 204 / 212

ROUNDING ERRORS

Rounding
• Choose a set of FP numbers M (say, double precision numbers).
• Start with a number x ∈ R. Unless x happens to be in the finite set M, we cannot represent it. It gets rounded to the closest FP number,

x̃ := arg min_{y∈M} |x − y|

• Suppose we multiply two machine numbers x, y ∈ M. The product z = x · y has twice as many digits, and is typically not in M. It gets rounded to z̃.

Rounding errors
We distinguish two types of errors:
• The absolute rounding error z̃ − z.
• The relative rounding error (z̃ − z)/z.

FP arithmetic is based on the premise that the relative rounding error is more important.

Not examinable.
Advanced Machine Learning 205 / 212

ROUNDING ERRORS

Distribution of FP numbers
The distance between two neighboring FP numbers depends on the value of the exponent. That means:

An interval of length 1 (or any fixed length) contains more FP numbers if it is close to 0 than if it is far away from 0.

In other words, the relative rounding error does not depend on the size of a number, but the absolute rounding error is larger for large numbers.

Normalized FP numbers
FP numbers are not unique: Note that

2 = (−1)^0 · 2^1 · 1 = (−1)^0 · 2^0 · 2 = . . .

A machine number is normalized if the mantissa represents a number with exactly one digit before the radix point, and this digit is 1:

(−1)^s · 2^e · [1.m1 . . . mk]_2

IEEE 754 only permits normalized numbers. That means every number has a unique representation.

Not examinable.
Advanced Machine Learning 206 / 212

MACHINE PRECISION

Error guarantees
IEEE 754 guarantees that the machine version of each arithmetic operation satisfies

x ⊕̃ y = (x ⊕ y)(1 + r)   with   |r| < eps

where
• r is the relative rounding error,
• eps is the machine precision,
• ⊕ is one of the operations +, −, × or /, and ⊕̃ is its machine version (the exact result rounded to a machine number).

Machine precision
The machine precision, denoted eps, is the smallest machine number for which 1 + eps is distinguishable from 1:

eps = min{ e ∈ M : 1 + e > 1 }

This is not the same as the smallest representable number. In double precision:
• Machine precision is eps = 2^(−52).
• The smallest (normalized) FP number is 2^(−1022).
That is so because the intervals between FP numbers around 1 are much larger than around 2^(−1022): If we evaluate 1 + 2^(−1022), the result is 1.
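A quick check in Python/NumPy (illustrative, not from the slides):

import numpy as np

print(np.finfo(float).eps)        # 2.220446049250313e-16 == 2**-52
print(1.0 + 2.0**-53 == 1.0)      # True: too small to change 1.0
print(1.0 + 2.0**-52 == 1.0)      # False: eps is distinguishable from 0 when added to 1
print(np.finfo(float).tiny)       # 2.2250738585072014e-308 == 2**-1022, smallest normalized number
print(1.0 + 2.0**-1022 == 1.0)    # True: the gap between FP numbers near 1 is much larger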

Not examinable.
Advanced Machine Learning 207 / 212

CANCELLATION ERRORS

Cancellation
• Suppose two machine numbers x = 1.2345e0 and y = 1.2346e0 are obtained by some numerical computation.
• That means they have already been rounded, perhaps several times.
• Previous rounding means the smallest digits in the mantissa carry errors.
• If we subtract the numbers, x − y = −0.0001e0, only those small digits remain.
• Effectively, we have deleted the reliable information and only kept the errors.

The example illustrates a fundamental problem: The FP principle (keep the relative error under control) does not work well with subtraction.
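A small numerical illustration (not from the slides): computing 1 − cos(x) for tiny x subtracts two nearly equal numbers and loses almost all significant digits, while an algebraically equivalent form avoids the subtraction.

import math

x = 1e-8
naive = 1.0 - math.cos(x)                  # catastrophic cancellation: cos(x) rounds to 1.0
stable = 2.0 * math.sin(x / 2.0) ** 2      # mathematically identical, but no subtraction of close numbers
print(naive)    # 0.0 -- all information lost
print(stable)   # ~5e-17, close to the true value x**2 / 2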

Example: Linear equations
• One application where cancellation errors are very problematic is linear equation systems.
• If a matrix is inverted by Gaussian elimination, a lot of terms have to be scaled and subtracted.
• If the input matrices are not well-conditioned, Gaussian elimination can produce results that consist only of “noise”.
• The Householder and Givens algorithms that most computer packages use have been developed to avoid this problem.

Not examinable.
Advanced Machine Learning 208 / 212

RANDOM NUMBERS

Pseudorandom numbers
• All implementations of sampling algorithms, random variables, randomized algorithms, etc. on a computer require the generation of random numbers.
• Computers are deterministic machines and have no access to “real” randomness.
• The “random” numbers we use on computers are not random; rather, they are generated in a way that makes them look like random numbers, in the sense that they do not contain any obvious deterministic “patterns”. Such numbers are called pseudorandom numbers.

Reproducibility
• Scientific experiments are required to be reproducible.
• Using actual random numbers in a computer experiment or simulation would make it non-reproducible.
• In this sense, pseudorandomness is an advantage rather than a disadvantage.

The so-called random numbers generated by a computer are not random.

If you restart the generation process of a pseudorandom sequence with the same initial conditions, you reproduce exactly the same sequence.

Not examinable.
Advanced Machine Learning 209 / 212

RANDOM NUMBER GENERATION

Recursive generation
Most PRN generators use a form of recursion: A sequence x1, x2, . . . of PRNs is generated as

x_(n+1) = f(x_n)

Modulo operation
The mod operation outputs the “remainder after division”:

m mod n = m − kn   for the largest k ∈ N0 such that kn ≤ m

For example: 13 mod 5 = 3.

Linear congruential generator
Common examples are generators of the form

x_(n+1) = C · x_n mod D   for fixed constants C, D ∈ N.

On a 32-bit machine, a simple generator would be

x_(n+1) = 16807 · x_n mod (2^31 − 1)
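A minimal Python sketch of such a generator (illustrative; the constants are the classic choice shown above):

def lcg(seed, C=16807, D=2**31 - 1):
    """Generate pseudorandom integers via x_{n+1} = C * x_n mod D."""
    x = seed
    while True:
        x = (C * x) % D
        yield x

gen = lcg(seed=1)
print([next(gen) for _ in range(5)])
# To get (approximately) uniform numbers in [0, 1), divide each value by D.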

Not examinable.
Advanced Machine Learning 210 / 212

PERIODICITY

Period length
• For a recursive generator x_(n+1) = f(x_n), each x_(n+1) depends only on x_n.
• Once x_n takes the same value as x_1, the sequence (x_1, . . . , x_(n−1)) repeats:

(x_1, . . . , x_(n−1)) = (x_n, . . . , x_(2n−1)) = . . .

• If so, n − 1 is called the period length.
• Note a period need not start at 1: Once a value reoccurs for the first time, the generator has become periodic.
• Since there are finitely many possible numbers, that must happen sooner or later.

Mersenne twisters
• Almost all software tools we use (Python, Matlab etc.) use a generator called a Mersenne twister.
• This algorithm also uses a recursion, but the recursion is matrix-valued. That makes the algorithm a little more complicated.
• Mersenne twisters have very long period lengths (e.g. 2^19937 − 1 for the standard implementation for 32-bit machines).

Not examinable.
Advanced Machine Learning 211 / 212

RANDOM SEEDS

Random seed of a generator
• A recursive generator x_(n+1) = f(x_n) can be started at any initial value x_0 (so the first generated number is x_1 = f(x_0)).
• This initial value x_0 is called the random seed.

In practice
• Every time a generator is started or reset (e.g. when you start Matlab), it starts with a default seed.
• For example, Matlab's default generator is a Mersenne twister with seed 0.
• The user can usually pass a different seed to the generator.
• When you run experiments or simulations, always re-run them several times with different seeds.
• To ensure reproducibility, record the seeds you used along with the code.
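In Python/NumPy, seeding and reproducing a sequence looks like this (illustrative):

import numpy as np

np.random.seed(0)                 # seed the (Mersenne twister) generator
a = np.random.randn(3)
np.random.seed(0)                 # reset to the same seed
b = np.random.randn(3)
print(np.array_equal(a, b))       # True: the same seed reproduces the same sequence

np.random.seed(1)                 # a different seed gives a different sequence
print(np.random.randn(3))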

Not examinable.
Advanced Machine Learning 212 / 212