Advanced Probabilistic Machine Learning
Lecture 1 – Introduction, probabilistic modeling

Niklas Wahlström
Division of Systems and Control
Department of Information Technology
Uppsala University


What is the course about?

Previous course - Statistical machine learning

What was that course about? Supervised machine learning

Learning a model from labeled data.

Labels, e.g. mat, mirror, boat, ...

Training data



Predicting the output of new data based on this model.

Unseen data


We learned multiple methods for finding such models:
• Linear/logistic regression, discriminant analysis, trees, k-NN, neural networks, ensemble methods, ...
• ... and strategies for improving them (cross-validation)

What is this course about? (I/II)

This course extends the SML course in two aspects:
1. Probabilistic machine learning: We will have a probabilistic (aka Bayesian) viewpoint on machine learning problems.
2. Beyond supervised machine learning: We will consider other ML problems than just supervised ML.

1. Probabilistic machine learning

Probabilistic? You talked about noise, random variables and stuff already in the SML course!?

• Previously we treated the output data y as random variables.
• We now treat the model itself as a random variable.
• Advantage: Probabilistic models express the uncertainty of

What is this course about? (II/II)

2. Beyond supervised machine learning

We consider problems where we for example want to ...
• ... rank objects based on data (miniproject)
• ... generate more data similar to the training data
• ... compress or summarize the data

We will also learn about universal models/methods that are useful in probabilistic machine learning, but also elsewhere:
• Graphical models
• Monte Carlo methods
• Variational inference

In this sense this course is broader, deeper and more research-oriented than the SML course.

Example - Building magnetic field maps

From my own research: Build a map of the indoor magnetic field using Gaussian processes.

More about Gaussian processes in lectures 7 and 8.

[1] Niklas Wahlström, Manon Kok, Thomas B. Schön and Fredrik Gustafsson. Modeling magnetic fields using Gaussian processes. The 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, 2013.
[2] Arno Solin, Manon Kok, Niklas Wahlström, Thomas B. Schön and Simo Särkkä. Modeling and interpolation of the ambient magnetic field by Gaussian processes. ArXiv e-prints, September 2015. arXiv:1509.04634.

Example – indoor localization

MSc thesis project: Compute the position of a person moving around indoors using sensors (inertial, magnetometer and radio) and a map.

Show movie

Johan Kihlberg, Simon Tegelid, Manon Kok and Thomas B. Schön. Map aided indoor positioning using particle filters. Reglermöte (Swedish Control Conference), Linköping, Sweden, June 2014.

Example - Probabilistic ranking

Aim: Estimate the skill of chess players throughout history.

Pierre Dangauthier, Ralf Herbrich, Tom Minka, Thore Graepel. TrueSkill Through Time: Revisiting the History of Chess. NIPS, 2007.


You will work with this ranking model in the mini-project (but not necessarily applied to chess).

Course information

Course elements

• 11 lectures
• 10 problem solving sessions
• 1 mini project (3-4 students, written report)
• 1 computer lab (4h, no report)
• Complete course information (including lecture slides) is available from the course home page:


Teachers involved in the course (in approximate order of appearance):

Niklas Wahlström, Room: 2319
Andreas Lindholm, Room: 2340
Riccardo Risuleo, Room: 2237
Thomas Schön, Room: 2209

All room numbers are at ITC Polacksbacken.

Lecture outline

1. Introduction, probabilistic modeling
2. Bayesian linear regression
3. Bayesian graphical models
4. Monte Carlo methods
5. Factor graphs
6. Variational inference
7. Gaussian processes I
8. Gaussian processes II
9. Unsupervised learning
10. Variational autoencoders
11. Summary and guest lecture by James Hensman

Problem solving sessions

10 problem solving sessions:
• Solve problems, discuss and ask questions! ("räknestuga")
• 5 pen-and-paper sessions
• 5 computer-based sessions (using Python)
• Feel free to use your own laptops – Python is freely available
• Exercises available via homepage or the student portal

The computer-based sessions are scheduled in 1 computer room + 1 normal classroom. The latter is intended for students who choose to work on their own laptops.

A great opportunity to discuss and ask questions!

Examination (I/II)

Mini project:

• Solved in groups of 3 or 4 students (no later than September 10)
• Written report (deadline: October 3)
• Peer-review: read and review another group's report (anonymously)
• Material most relevant for the mini project is presented at lectures 3–6, but you can start working on the solution after lecture 2
• Graded U/G

Computer lab:

• 4 h computer lab, solved in groups of 2 students, graded U/G
• 4 sessions available – sign up for one of these
• Solve the preparatory exercises before the lab session!

Oral Examination (II/II)

Instead of a written exam, we have an oral examination at the end of the course.

• The exam is individual.
• 25 minutes of discussion with the teacher(s) about the course.
• You start with a 7-minute presentation about the course.
• After the presentation the teacher(s) will lead the discussion.
• The exam will be graded as U, 3, 4, or 5.
• Time slots for the oral exam will be in weeks 43 and 44.

For more information about the oral exam, see the course homepage.

Course literature

We recommend two books:
• David Barber. Bayesian Reasoning and Machine Learning, 2012.
• Christopher M. Bishop. Pattern Recognition and Machine Learning, Springer, 2006.

Both of them are freely available online, linked from the course homepage.

For some lecture(s) we will use additional resources, which will be available from the homepage.

Probability fundamentals

Medical inference (Hamburgers) I/III

Ex 1.2 (Barber)
• 90% of people with Kreuzfeld-Jacob disease ate hamburgers.
• One in 100,000 has this disease.
• Assume half of the population eat hamburgers.

What is the probability that a hamburger eater will have Kreuzfeld-Jacob disease?

Define the following events:

KJ = Having Kreuzfeld-Jacob disease
H = Eating hamburger

We know that

p(H = Yes|KJ = Yes) = 90%

p(KJ = Yes) = 0.001%

p(H = Yes) = 50%

Q: What is p(KJ = Yes|H = Yes)?

niklas.wahlstrom@it.uu.se

Medical inference (Hamburgers) II/III

Consider a population of 1 000 000.
• p(KJ = Yes) = 0.001% have KJ disease, i.e. 10 people.
  • p(H = Yes|KJ = Yes) = 90%, i.e. nine of them eat hamburgers.
  • One of them doesn't.
• p(H = Yes) = 50% eat hamburgers, i.e. 500 000 people.
  • 499 991 of them do not have KJ disease.

            H = Yes    H = No
KJ = Yes          9         1
KJ = No     499 991   499 999

Medical inference (Hamburgers) III/III

            H = Yes    H = No
KJ = Yes          9         1
KJ = No     499 991   499 999

p(KJ = Yes|H = Yes) is the proportion of all hamburger eaters who have KJ disease.

\[
p(\mathrm{KJ} = \mathrm{Yes} \mid \mathrm{H} = \mathrm{Yes}) = \frac{9}{9 + 499\,991} = 0.0018\%
\]

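This can also be written as

\[
p(\mathrm{KJ} = \mathrm{Y} \mid \mathrm{H} = \mathrm{Y})
= \frac{p(\mathrm{KJ} = \mathrm{Y}, \mathrm{H} = \mathrm{Y})}{p(\mathrm{KJ} = \mathrm{Y}, \mathrm{H} = \mathrm{Y}) + p(\mathrm{KJ} = \mathrm{N}, \mathrm{H} = \mathrm{Y})}
= \frac{p(\mathrm{KJ} = \mathrm{Y}, \mathrm{H} = \mathrm{Y})}{p(\mathrm{H} = \mathrm{Y})}
\]

This is an example of conditional probability and marginalization.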

Conditioning, marginalization (discrete)

Conditional probability is defined as

\[
p(x \mid y) = \frac{p(x, y)}{p(y)}, \quad \text{where } p(y) \neq 0
\]

Marginalization is defined as

\[
p(x) = \sum_{y} p(x, y)
\]

Much of probability theory can be derived from these two rules.

Bayes' theorem is derived by using the definition of conditional probability twice:

\[
p(x \mid y)\,p(y) = p(x, y) = p(y \mid x)\,p(x)
\quad \Rightarrow \quad
p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)}
\]
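As an illustration of the two rules, the following Python sketch applies them to a small joint table and then checks Bayes' theorem numerically. The table values and the function names `marginal` and `conditional` are made up for this example; they are not taken from the lecture.

```python
from math import isclose

# Hypothetical joint distribution p(x, y) with x in {0, 1} and y in {0, 1, 2}.
# Keys are (x, y) pairs; the values are illustrative and sum to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.30, (1, 1): 0.10, (1, 2): 0.20,
}


def marginal(joint, axis):
    """Marginalization: sum the joint over the other variable.

    axis=0 returns p(x), axis=1 returns p(y).
    """
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def conditional(joint, given_axis, given_value):
    """Conditioning: p(other | given) = p(other, given) / p(given)."""
    p_given = marginal(joint, given_axis)[given_value]
    if p_given == 0:
        raise ValueError("conditioning on an event with probability zero")
    other = 1 - given_axis
    return {
        key[other]: p / p_given
        for key, p in joint.items()
        if key[given_axis] == given_value
    }


p_x = marginal(joint, 0)
p_y = marginal(joint, 1)

# Left-hand side of Bayes' theorem: p(x = 1 | y = 0), computed directly.
direct = conditional(joint, given_axis=1, given_value=0)[1]

# Right-hand side: p(y = 0 | x = 1) p(x = 1) / p(y = 0).
via_bayes = conditional(joint, given_axis=0, given_value=1)[0] * p_x[1] / p_y[0]

print(direct, via_bayes, isclose(direct, via_bayes))
```

Both routes give the same number because they are two factorizations of the same joint entry p(x = 1, y = 0).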


Medical inference (Hamburgers), revisited

Consider again the hamburger/Kreuzfeld-Jacob disease problem

KJ = Having Kreuzfeld-Jacob disease
H = Eating hamburger

We know that

p(H = Yes|KJ = Yes) = 90%

p(KJ = Yes) = 0.001%

p(H = Yes) = 50%

By applying Bayes’ theorem we get

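\[
p(\mathrm{KJ} = \mathrm{Y} \mid \mathrm{H} = \mathrm{Y})
= \frac{p(\mathrm{H} = \mathrm{Y} \mid \mathrm{KJ} = \mathrm{Y})\,p(\mathrm{KJ} = \mathrm{Y})}{p(\mathrm{H} = \mathrm{Y})}
= \frac{\frac{9}{10} \times \frac{1}{100\,000}}{\frac{1}{2}}
= 1.8 \cdot 10^{-5}
= 0.0018\%
\]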
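The same number can be checked in a few lines of Python. This is a minimal sketch using exact fractions; the variable names are chosen for this example only. It computes the answer once with Bayes' theorem and once by counting people in a population of 1 000 000.

```python
from fractions import Fraction

# Quantities given in the problem statement.
p_h_given_kj = Fraction(9, 10)      # p(H = Yes | KJ = Yes)
p_kj = Fraction(1, 100_000)         # p(KJ = Yes)
p_h = Fraction(1, 2)                # p(H = Yes)

# Bayes' theorem: p(KJ | H) = p(H | KJ) p(KJ) / p(H).
posterior = p_h_given_kj * p_kj / p_h

# Counting argument on a population of one million people.
population = 1_000_000
kj_total = population * p_kj          # people with KJ disease
kj_and_h = kj_total * p_h_given_kj    # of those, the hamburger eaters
h_total = population * p_h            # all hamburger eaters
by_counting = kj_and_h / h_total

print(posterior, by_counting, float(posterior))
```

Using `Fraction` avoids rounding, so the two routes can be compared with exact equality.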

Continuous random variables
Probability distribution

The probability distribution p(x) describes the probability for a continuous random variable falling into a given interval

\[
p(a < x < b) = \int_{a}^{b} p(x)\,dx
\]

[Figure: plot of p(x) with the interval endpoints a and b marked on the x-axis]

p(x) is also called the probability density.
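To make the interval formula concrete, here is a small Python sketch that integrates a density numerically and compares the result with the difference of two CDF values. The standard normal density and the interval (-1, 1) are assumptions made for the example, and `interval_probability` is a hypothetical helper name.

```python
from statistics import NormalDist


def interval_probability(pdf, a, b, steps=10_000):
    """Approximate the integral of pdf over [a, b] with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * h)
    return total * h


# Example density: a standard normal distribution (an assumption for this sketch).
dist = NormalDist(mu=0.0, sigma=1.0)
a, b = -1.0, 1.0

numeric = interval_probability(dist.pdf, a, b)   # integral of the density
exact = dist.cdf(b) - dist.cdf(a)                # same probability via the CDF

print(numeric, exact)
```

Note that the density value p(x) itself is not a probability; only its integral over an interval is.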

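Continuous random variables
Conditioning and marginalization

Consider the joint distribution p(x, y)

[Figure: the joint distribution p(x, y) plotted over x and y]

Conditional probability

\[
p(x \mid y) = \frac{p(x, y)}{p(y)}, \quad p(y) \neq 0
\]

Marginalization

\[
p(x) = \int p(x, y)\,dy
\]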

Probabilistic/Bayesian inference

In this course most of the solutions to the problems can be stated as

\[
p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)}
\]

• D: observed data
• θ: parameters of some model explaining the data
• p(θ): prior belief of parameters before we collected any data
• p(θ|D): posterior belief of parameters after observing the data
• p(D|θ): likelihood of the data in view of the parameters
• p(D): the marginal likelihood

Probabilistic/Bayesian inference

In this course most of the solutions to the problems can be stated as

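\[
p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)}
\]

• If we view the quantities as functions of θ, we can write

\[
\underbrace{p(\theta \mid D)}_{\text{posterior}} \propto \underbrace{p(D \mid \theta)}_{\text{likelihood}}\,\underbrace{p(\theta)}_{\text{prior}}
\]

∝ means "proportional to, with respect to the parameters θ". Hence, p(D) can be viewed as a normalization constant.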

• Using marginalization, we can express p(D) in terms of the likelihood and the prior

\[
p(D) = \int p(D, \theta)\,d\theta = \int p(D \mid \theta)\,p(\theta)\,d\theta
\]


Example: Flipping of a coin

Consider a binary random variable x ∈ {0, 1} representing the outcome of flipping a coin
• x = 1 represents "head" and x = 0 "tail".
• The probability of x = 1 is denoted by the parameter µ

\[
p(x = 1 \mid \mu) = \mu, \quad 0 \leq \mu \leq 1
\]

(assume a damaged coin, so not necessarily µ = 0.5)

Question: Given a dataset D = {x1, ..., xN}, what is p(µ|D)?

Solution: Bayes' theorem states that

\[
\underbrace{p(\mu \mid D)}_{\text{posterior}} \propto \underbrace{p(D \mid \mu)}_{\text{likelihood}}\,\underbrace{p(\mu)}_{\text{prior}}
\]


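Example: Flipping of a coin
Solution - The likelihood (I/II)

We know that

\[
p(x = 1 \mid \mu) = \mu, \quad \text{and consequently} \quad p(x = 0 \mid \mu) = 1 - \mu.
\]

The distribution for one observation x can be written as

\[
p(x \mid \mu) = \mathrm{Bern}(x; \mu) = \mu^{x}(1 - \mu)^{1 - x}
\]

This is the Bernoulli distribution.

[Figure: the Bernoulli distribution over x ∈ {0, 1}]

\[
\mathrm{E}[x] = \mu, \qquad \mathrm{Var}[x] = \mu(1 - \mu)
\]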
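A short Python sketch of the Bernoulli distribution, checking the mean and variance by summing over the two outcomes. The function name `bern` and the value µ = 0.3 are chosen for illustration only.

```python
from math import isclose


def bern(x, mu):
    """Bernoulli probability mass: mu**x * (1 - mu)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return mu**x * (1 - mu) ** (1 - x)


mu = 0.3  # illustrative value for the probability of heads

# Expectation and variance computed directly from the definition.
mean = sum(x * bern(x, mu) for x in (0, 1))
var = sum((x - mean) ** 2 * bern(x, mu) for x in (0, 1))

print(mean, var, isclose(mean, mu), isclose(var, mu * (1 - mu)))
```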

Example: Flipping of a coin
Solution - The likelihood (II/II)

The N observations are drawn independently. This gives the likelihood

\[
p(D \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu)
= \prod_{n=1}^{N} \mu^{x_n}(1 - \mu)^{1 - x_n}
= \mu^{\sum_{n=1}^{N} x_n}(1 - \mu)^{N - \sum_{n=1}^{N} x_n}
= \mu^{m}(1 - \mu)^{N - m}
\]

where \(m = \sum_{n=1}^{N} x_n\), i.e. the number of heads.

Note: The likelihood only depends on the data D via m.

The likelihood of m is proportional to p(D|µ)

\[
p(m \mid \mu) = \mathrm{Bin}(m; N, \mu) = \binom{N}{m}\mu^{m}(1 - \mu)^{N - m},
\]

where \(\binom{N}{m} = \frac{N!}{(N - m)!\,m!}\) is the number of sequences giving m heads. This is the binomial distribution.
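The claim that the likelihood depends on the data only through m can be checked numerically. The sketch below is a minimal illustration: the function names and the value µ = 0.6 are assumptions, and the dataset is the five-flip sequence {1, 0, 1, 1, 1} used later in the lecture.

```python
from math import comb, isclose, prod


def bern(x, mu):
    """Bernoulli probability mass for a single observation x in {0, 1}."""
    return mu**x * (1 - mu) ** (1 - x)


def likelihood(data, mu):
    """p(D | mu) for independent coin flips: product of Bernoulli terms."""
    return prod(bern(x, mu) for x in data)


def binomial_pmf(m, n, mu):
    """p(m | mu) = C(n, m) * mu**m * (1 - mu)**(n - m)."""
    return comb(n, m) * mu**m * (1 - mu) ** (n - m)


mu = 0.6                      # illustrative parameter value
data = [1, 0, 1, 1, 1]        # N = 5 flips, m = 4 heads
reordered = [1, 1, 1, 1, 0]   # same m, different order

n, m = len(data), sum(data)

# Same m gives the same likelihood, regardless of the order of the flips.
print(likelihood(data, mu), likelihood(reordered, mu))

# The binomial pmf differs from p(D | mu) only by the factor C(N, m).
print(binomial_pmf(m, n, mu), comb(n, m) * likelihood(data, mu))
```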

Binomial distribution

\[
m \sim \mathrm{Bin}(m; N, \mu) = \binom{N}{m}\mu^{m}(1 - \mu)^{N - m}
\]

[Figure: the binomial distribution for N = 10 and µ = 0.25, plotted over m = 0, ..., 10]

\[
\mathrm{E}[m] = N\mu, \qquad \mathrm{Var}[m] = N\mu(1 - \mu)
\]
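For the plotted case N = 10 and µ = 0.25, the moments can be verified by direct summation over the probability mass function. This is a small sketch; `binomial_pmf` is a helper written for this example.

```python
from math import comb, isclose


def binomial_pmf(m, n, mu):
    """Probability of m heads in n independent flips with head probability mu."""
    return comb(n, m) * mu**m * (1 - mu) ** (n - m)


n, mu = 10, 0.25
pmf = [binomial_pmf(m, n, mu) for m in range(n + 1)]

total = sum(pmf)                                        # should be 1
mean = sum(m * p for m, p in enumerate(pmf))            # should be N * mu
var = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))  # should be N * mu * (1 - mu)

print(total, mean, var)
```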

Example: Flipping of a coin
Solution - The prior (I/II)

Remember Bayes' theorem

\[
\underbrace{p(\mu \mid m)}_{\text{posterior}} \propto \underbrace{p(m \mid \mu)}_{\text{likelihood}}\,\underbrace{p(\mu)}_{\text{prior}}
\]

• Multiple possible prior distributions p(µ) exist.
• We opt for a prior which has attractive analytical properties.

We choose a prior such that the posterior will be of the same functional form as the prior. We call this a conjugate prior.

The conjugate prior of the Binomial distribution is the Beta distribution

\[
\mathrm{Beta}(\mu; a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\,\mu^{a - 1}(1 - \mu)^{b - 1}
\]

where Γ(a) is the Gamma function.

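Example: Flipping of a coin
Solution - The posterior

The posterior can now be computed

\[
\begin{aligned}
p(\mu \mid m) &\propto p(m \mid \mu)\,p(\mu) \\
&= \mathrm{Bin}(m; N, \mu)\,\mathrm{Beta}(\mu; a, b) \\
&\propto \mu^{m}(1 - \mu)^{N - m}\,\mu^{a - 1}(1 - \mu)^{b - 1} \\
&= \mu^{m + a - 1}(1 - \mu)^{N - m + b - 1}.
\end{aligned}
\]

Hence, the posterior is also a Beta distribution

\[
p(\mu \mid m) = \mathrm{Beta}(\mu; a^{*}, b^{*}),
\]

where

\[
a^{*} = m + a, \qquad b^{*} = N - m + b.
\]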
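The conjugate update can be sanity-checked against a brute-force computation on a grid. The sketch below multiplies likelihood and prior pointwise, normalizes by a numerical approximation of the integral (which plays the role of p(D) up to constants), and compares the resulting posterior mean with a∗/(a∗ + b∗), the mean of a Beta distribution. The function names, the grid size and the values a = 2, b = 3, m = 4, N = 5 are assumptions made for this example.

```python
def posterior_params(a, b, m, n):
    """Conjugate update: Beta(a, b) prior and m heads in n flips."""
    return a + m, b + n - m


def grid_posterior_mean(a, b, m, n, steps=2000):
    """Posterior mean of mu from an unnormalized likelihood-times-prior grid."""
    h = 1.0 / steps
    grid = [(i + 0.5) * h for i in range(steps)]  # midpoints of [0, 1]

    # Likelihood times prior, with all constants that do not depend on mu dropped.
    unnorm = [
        mu**m * (1 - mu) ** (n - m) * mu ** (a - 1) * (1 - mu) ** (b - 1)
        for mu in grid
    ]

    # Midpoint-rule approximation of the normalization constant.
    evidence = sum(unnorm) * h
    posterior = [u / evidence for u in unnorm]

    return sum(mu * p for mu, p in zip(grid, posterior)) * h


a, b, m, n = 2, 3, 4, 5
a_star, b_star = posterior_params(a, b, m, n)

closed_form_mean = a_star / (a_star + b_star)
numeric_mean = grid_posterior_mean(a, b, m, n)

print((a_star, b_star), closed_form_mean, numeric_mean)
```

The dropped constants cancel in the normalization step, which is why working with proportionality is enough.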

The Beta distribution

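\[
\mathrm{Beta}(\mu; a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\,\mu^{a - 1}(1 - \mu)^{b - 1},
\quad \text{where } \Gamma(a) = \int_{0}^{\infty} t^{a - 1} e^{-t}\,dt
\]

[Figure: the Beta density over µ ∈ [0, 1] for (a, b) = (1, 1), (0.1, 0.1) and (2, 3)]

\[
\mathrm{E}[\mu] = \frac{a}{a + b}, \qquad \mathrm{Var}[\mu] = \frac{ab}{(a + b)^{2}(a + b + 1)}
\]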
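A minimal Python sketch of the Beta density, using `math.gamma` for the Gamma function and a midpoint rule to check the normalization, mean and variance for (a, b) = (2, 3). The helper names and grid size are assumptions; the singular case (a, b) = (0.1, 0.1) is left out because a simple grid does not handle the spikes at 0 and 1 well.

```python
from math import gamma


def beta_pdf(mu, a, b):
    """Beta density with shape parameters a and b, evaluated at mu in (0, 1)."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * mu ** (a - 1) * (1 - mu) ** (b - 1)


def integrate(f, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / steps
    return sum(f((i + 0.5) * h) for i in range(steps)) * h


a, b = 2.0, 3.0

total = integrate(lambda mu: beta_pdf(mu, a, b))
mean = integrate(lambda mu: mu * beta_pdf(mu, a, b))
var = integrate(lambda mu: (mu - mean) ** 2 * beta_pdf(mu, a, b))

# Closed-form values for comparison.
mean_exact = a / (a + b)
var_exact = a * b / ((a + b) ** 2 * (a + b + 1))

print(total, mean, mean_exact, var, var_exact)
```

With a = b = 1 the density is constant and equal to 1 on [0, 1], which is the uninformative prior used in the coin example.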


Example: Flipping of a coin
Bayesian inference

Prior            Likelihood function   Posterior
p(µ)             p(m|µ)                p(µ|m)
Beta (µ; a, b)   Bin (m; N, µ)         Beta (µ; a∗, b∗)
a = 1, b = 1     m = 1, N = 1          a∗ = 2, b∗ = 1

[Figure: three panels over µ ∈ [0, 1] showing the prior, the likelihood function and the posterior]

• If you don't know anything about the coin, start with an uninformative prior, Beta (µ; 1, 1) = 1.
• Assume we get one data point x1 = 1.
• Posterior ∝ likelihood × prior

Example: Flipping of a coin
Bayesian inference

Prior            Likelihood function   Posterior
p(µ)             p(m|µ)                p(µ|m)
Beta (µ; a, b)   Bin (m; N, µ)         Beta (µ; a∗, b∗)
a = 1, b = 1     m = 4, N = 5          a∗ = 5, b∗ = 2

[Figure: three panels over µ ∈ [0, 1] showing the prior, the likelihood function and the posterior]

Assume you get N = 5 data points, of which m = 4 are heads, D = {1, 0, 1, 1, 1}.

Example: Flipping of a coin
Sequential Bayesian inference

• Bayesian inference admits sequential inference.
• After observing new data we use the posterior as the new prior.

Prior            Likelihood function   Posterior
p(µ)             p(m|µ)                p(µ|m)
Beta (µ; a, b)   Bin (m; N, µ)         Beta (µ; a∗, b∗)
a = 1, b = 1     m = 1, N = 1          a∗ = 2, b∗ = 1

[Figure: three panels over µ ∈ [0, 1] showing the prior, the likelihood function and the posterior]

Data: D1 = {1}, D2 = {0}, D3 = {1}, D4 = {1}, D5 = {1} (this step uses D1)

Example: Flipping of a coin
Sequential Bayesian inference (continued)

Prior            Likelihood function   Posterior
p(µ)             p(m|µ)                p(µ|m)
Beta (µ; a, b)   Bin (m; N, µ)         Beta (µ; a∗, b∗)
a = 2, b = 1     m = 0, N = 1          a∗ = 2, b∗ = 2

[Figure: three panels over µ ∈ [0, 1] showing the prior, the likelihood function and the posterior]

Data: D1 = {1}, D2 = {0}, D3 = {1}, D4 = {1}, D5 = {1} (this step uses D2)

Example: Flipping of a coin
Sequential Bayesian inference (continued)

Prior            Likelihood function   Posterior
p(µ)             p(m|µ)                p(µ|m)
Beta (µ; a, b)   Bin (m; N, µ)         Beta (µ; a∗, b∗)
a = 2, b = 2     m = 1, N = 1          a∗ = 3, b∗ = 2

[Figure: three panels over µ ∈ [0, 1] showing the prior, the likelihood function and the posterior]

Data: D1 = {1}, D2 = {0}, D3 = {1}, D4 = {1}, D5 = {1} (this step uses D3)

Example: Flipping of a coin
Sequential Bayesian inference (continued)

Prior            Likelihood function   Posterior
p(µ)             p(m|µ)                p(µ|m)
Beta (µ; a, b)   Bin (m; N, µ)         Beta (µ; a∗, b∗)
a = 3, b = 2     m = 1, N = 1          a∗ = 4, b∗ = 2

[Figure: three panels over µ ∈ [0, 1] showing the prior, the likelihood function and the posterior]

Data: D1 = {1}, D2 = {0}, D3 = {1}, D4 = {1}, D5 = {1} (this step uses D4)

Example: Flipping of a coin
Sequential Bayesian inference (continued)

Prior            Likelihood function   Posterior
p(µ)             p(m|µ)                p(µ|m)
Beta (µ; a, b)   Bin (m; N, µ)         Beta (µ; a∗, b∗)
a = 4, b = 2     m = 1, N = 1          a∗ = 5, b∗ = 2

[Figure: three panels over µ ∈ [0, 1] showing the prior, the likelihood function and the posterior]
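The sequence of updates on these slides can be reproduced with a few lines of Python. This sketch (with a hypothetical helper `update`) processes the flips D = {1, 0, 1, 1, 1} one at a time, reusing each posterior as the next prior, and compares the end result with a single batch update.

```python
def update(a, b, data):
    """Conjugate Beta update: add the heads to a and the tails to b."""
    m = sum(data)      # number of heads in this batch
    n = len(data)      # number of flips in this batch
    return a + m, b + n - m


data = [1, 0, 1, 1, 1]

# Sequential inference: start from the uninformative prior Beta(1, 1).
a, b = 1, 1
history = [(a, b)]
for x in data:
    a, b = update(a, b, [x])   # yesterday's posterior is today's prior
    history.append((a, b))

# Batch inference: use all five observations at once.
batch = update(1, 1, data)

print(history)
print(batch)
```

The final sequential posterior matches the batch posterior because the update only adds counts, so the order and grouping of the observations do not matter.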





Concluding remarks

Probabilistic/Bayesian inference is a flexible way of dealing with machine learning problems.

Properties:
• Treat not only the data, but also the model and its parameters (if parametric) as random variables.
• After learning, you not only get a single model, you get a distribution of likely models.
• You can encode prior knowledge you might have about the model and its parameters.

A few concepts to summarize lecture 1

Probability distribution: Function that describes the likelihood of obtaining the possible values that a random variable can assume.
Conditioning and marginalization: Two basic rules for manipulating probability distributions.
Bayes' theorem: p(x|y) = p(y|x)p(x)/p(y)
Prior: Belief of parameter before we have seen any data.
Likelihood: Belief of data in view of the parameters.
Posterior: Belief of parameter after observing data.
Bernoulli distribution: Distribution for a binary random variable.
Binomial distribution: Distribution for the sum of multiple binary random variables.
Beta distribution: Conjugate prior for the Binomial distribution.
Conjugate prior: A prior ensuring that the posterior and the prior belong to the same probability distribution family.
