Bayesian Machine Learning - Lecture 1
Guido Sanguinetti
Institute for Adaptive and Neural Computation, School of Informatics
University of Edinburgh
gsanguin@inf.ed.ac.uk
February 23, 2015
Welcome
Broad introduction to statistical machine learning concepts within the Bayesian probabilistic framework
Focus on using statistics as a modelling tool, and on algorithms for efficient inference
Key objective: theoretical and practical familiarity with some fundamental ML methods
Structure: four two-hour lectures and one two-hour lab each week
Assessment: coinciding with the labs. NB: I believe PhD students have already demonstrated their ability to pass exams.
Main refs
D. Barber, Bayesian Reasoning and Machine Learning, CUP 2010. Some slides are also taken from the teaching material attached to the book (thanks David!)
Other good books: C.M. Bishop, Pattern Recognition and Machine Learning (Springer 2006); K. Murphy, Machine Learning - a Probabilistic Perspective (MIT Press 2012)
Rasmussen and Williams, Gaussian Processes for Machine Learning (MIT Press 2007) for Lecture 3
Lecture 4 (Active Learning and Bayesian Optimisation): B. Settles, Active Learning Literature Survey, sections 2 and 3, and Brochu et al., http://arxiv.org/abs/1012.2599
Wikipedia also has good pages for most of the material
IMPORTANT FACT: these slides are not a book!
Today’s lecture
1 Some facts
2 Philosophy and road map
3 Basics of probability theory
4 Some probability distributions
5 Fitting distributions
6 Basics of learning
A few things worth considering
Mobile traffic in 2013 = 18 × total internet traffic in 2000
The UK National Health Service plans to sequence the genomes of 750,000 cancer patients in the next ten years
Google purchased DeepMind (after 1 year of operation) for 450M GBP
The number of job postings for data scientists increased globally by 15,000% between 2011 and 2012
The problem
Vast amounts of quantitative data arising from every aspect of life
Advanced informatics tools necessary just to handle the data
Widespread belief that data is valuable, yet worthless without analytic tools
Converting data to knowledge is the challenge
A memorable quote
If you ignore philosophy, it comes back and bites your bottom (Dr R. Shillcock, Informatics, Edinburgh)
What is a model? Discuss for 5 minutes and provide 3 examples
My own answer
A model is a hypothesis that certain features of a system of interest are well replicated in another, simpler system.
A mathematical model is a model where the simpler system consists of a set of mathematical relations between objects (equations, inequalities, etc.).
A stochastic model is a mathematical model where the objects are probability distributions.
All modelling usually starts by defining a family of models indexed by some parameters, which are tweaked to reflect how well the feature of interest is captured.
Machine learning deals with algorithms for automatic selection of a model from observations of the system.
Course content
References to Barber 2010 book.
Lecture 1: Statistical basics. Probability refresher, probability distributions, entropy and KL divergence (Ch 1, Ch 8.2, 8.3). Multivariate Gaussian (8.4). Estimators and maximum likelihood (8.6 and 8.7.3). Supervised and unsupervised learning (13.1)
Lecture 2: Linear models. Regression with additive noise and logistic regression (probabilistic perspective): maximum likelihood and least squares (18.1 and 17.4.1). Duality and kernels (17.3).
Lecture 3: Bayesian regression models and Gaussian Processes. Bayesian models and hyperparameters (18.1.1, 18.1.2). Gaussian Process regression (19.1-19.4).
Lecture 4: Active learning and Bayesian optimisation. Active learning, basic concepts and types of active learning. Bayesian optimisation and the GP-UCB algorithm.
Lab 1: GP regression and Bayesian Optimisation.
Course content cont’d
Lecture 5: Latent variables and mixture models. Latent variables and the EM algorithm (11.1 and 11.2.1). Gaussian mixture models and mixture of experts (20.3, 20.4).
Lecture 6: Graphical models. Belief networks and Markov networks (3.3 and 4.2). Factor graphs (4.4).
Lecture 7: Exact inference in trees. Message passing and belief propagation (5.1 and 28.7.1).
Lecture 8: Approximate inference in graphical models. Variational inference: Gaussian and mean field approximations (28.3, 28.4). Sampling methods and Gibbs sampling (27.4 and 27.3).
Lab 2: Bayesian Gaussian Mixture Models
Definitions
Random variables: results of not exactly reproducible experiments
Either intrinsically random (e.g. quantum mechanics), or the system is incompletely known and cannot be controlled precisely
The probability p_i of an experiment taking a certain value i is the frequency with which that value is taken in the limit of infinite experimental trials
Alternatively, we can take probability to be our belief that a certain value will be taken
More definitions
Let x be a random variable; the set of possible values of x is the sample space Ω
Let x and y be two random variables; p(x = i, y = j) is the joint probability of x taking value i and y taking value j (with i and j in the respective sample spaces). Often just written p(x, y) to indicate the function (as opposed to its evaluation over the outcomes i and j)
p(x|y) is the conditional probability, i.e. the probability of x if you know y has a certain value
Rules
Normalisation: the sum of the probabilities of all possible experimental outcomes must be 1,

\sum_{x \in \Omega} p(x) = 1

Sum rule: the marginal probability p(x) is given by summing the joint p(x, y) over all possible values of y,

p(x) = \sum_{y \in \Omega} p(x, y)

Product rule: the joint is the product of the conditional and the marginal, p(x, y) = p(x|y) p(y)

Bayes rule: the posterior is the ratio of the joint and the marginal,

p(y|x) = \frac{p(x|y) p(y)}{p(x)}
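A minimal numerical sketch of these rules, using a small made-up joint table over two binary variables:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over two binary variables,
# stored as a table with rows indexed by x and columns indexed by y.
p_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

# Normalisation: the whole table sums to 1.
assert np.isclose(p_xy.sum(), 1.0)

# Sum rule: marginals are obtained by summing out the other variable.
p_x = p_xy.sum(axis=1)   # p(x)
p_y = p_xy.sum(axis=0)   # p(y)

# Product rule: p(x, y) = p(x | y) p(y).
p_x_given_y = p_xy / p_y                 # column j holds p(x | y = j)
assert np.allclose(p_x_given_y * p_y, p_xy)

# Bayes rule: p(y | x) = p(x | y) p(y) / p(x).
p_y_given_x = (p_x_given_y * p_y) / p_x[:, None]
print(p_y_given_x.sum(axis=1))           # each row sums to 1
```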
Distributions and expectations
A probability distribution is a rule associating a number 0 ≤ p(x) ≤ 1 to each state x ∈ Ω, such that

\sum_{x \in \Omega} p(x) = 1

For a finite state space it can be given by a table; in general it is given by a functional form
Probability distributions (over numerical objects) are useful to compute expectations of functions

\langle f \rangle = \sum_{x \in \Omega} f(x) p(x)

Important expectations are the mean \langle x \rangle and the variance var(x) = \langle (x - \langle x \rangle)^2 \rangle. For more variables, also the covariance cov(x, y) = \langle (x - \langle x \rangle)(y - \langle y \rangle) \rangle, or its scaled relative the correlation corr(x, y) = cov(x, y) / \sqrt{var(x) var(y)}
Computing expectations
If you know analytically the probability distribution and can compute the sums (integrals), no problem
If you know the distribution but cannot compute the sums (integrals), enter the magical realm of approximate inference (fun but out of scope)
If you know nothing but have N_S samples, then use a sample approximation
Approximate the probability of an outcome with the frequency in the sample

\langle f(x) \rangle \simeq \sum_x \frac{n_x}{N_S} f(x) = \frac{1}{N_S} \sum_{i=1}^{N_S} f(x_i)

(prove the last equality)
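A quick sketch of the sample approximation, assuming a fair die as the toy distribution (any distribution you can sample from works the same way):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: estimate <f(x)> with f(x) = x**2 for a fair six-sided die,
# where the exact value is (1 + 4 + 9 + 16 + 25 + 36) / 6.
exact = np.mean(np.arange(1, 7) ** 2)

N_S = 10_000
samples = rng.integers(1, 7, size=N_S)            # N_S draws from the die
estimate = np.mean(samples.astype(float) ** 2)    # (1 / N_S) * sum_i f(x_i)

print(exact, estimate)   # the estimate approaches the exact value as N_S grows
```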
Independence
Two random variables x and y are independent if their joint probability factorises in terms of marginals

p(x, y) = p(x) p(y)

Using the product rule, this is equivalent to the conditional being equal to the marginal

p(x, y) = p(x) p(y) \Leftrightarrow p(x|y) = p(x)

Exercise: if two variables are independent, then their correlation is zero. The converse is NOT TRUE (no correlation does not imply independence)
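A standard counterexample for the last point, sketched numerically (the specific choice y = x² is just one convenient illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# y is a deterministic function of x, so the two are clearly dependent,
# yet their correlation is (close to) zero because x is symmetric around 0.
x = rng.normal(size=100_000)
y = x ** 2

print(np.corrcoef(x, y)[0, 1])   # approximately 0 despite full dependence
```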
Continuous states
If the state space Ω is continuous, some of the previous definitions must be modified
The general case is mathematically difficult; we restrict ourselves to Ω = R^n and to distributions which admit a density, a function

p : \Omega \to \mathbb{R} \quad \text{s.t.} \quad p(x) \geq 0 \;\, \forall x \quad \text{and} \quad \int_\Omega p(x)\, dx = 1

It can be shown that the rules of probability distributions hold also for probability densities
Notice that p(x) is NOT the probability of the random variable being in state x (that is always zero for bounded densities); probabilities are only defined as integrals over subsets of Ω
Entropy and divergence
Probability theory is the basis of information theory (interesting, but not the topic of this course). An important quantity is the entropy of a distribution

H[p] = -\sum_i p_i \log_2 p_i

Entropy measures the level of disorder of a distribution; for discrete distributions, it is always ≥ 0, and 0 only for deterministic distributions
The relative entropy or Kullback-Leibler (KL) divergence between two distributions is

KL[q \| p] = \sum_i q_i \log \frac{q_i}{p_i}

Fact: KL is convex and ≥ 0
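A small sketch of both quantities on made-up discrete distributions (entropy in bits, KL in nats; the example distributions are arbitrary):

```python
import numpy as np

def entropy(p):
    # H[p] = -sum_i p_i log2 p_i, skipping zero entries (0 log 0 = 0).
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl(q, p):
    # KL[q || p] = sum_i q_i log(q_i / p_i), skipping entries where q_i = 0.
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    nz = q > 0
    return np.sum(q[nz] * np.log(q[nz] / p[nz]))

uniform = np.array([0.25, 0.25, 0.25, 0.25])
peaked = np.array([0.97, 0.01, 0.01, 0.01])
deterministic = np.array([1.0, 0.0, 0.0, 0.0])

print(entropy(uniform), entropy(peaked), entropy(deterministic))  # 2.0 > ... > 0.0
print(kl(peaked, uniform), kl(uniform, uniform))                  # > 0, and 0 when q = p
```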
Basic distributions
Discrete distribution: a random variable can take N distinct values with probability p_i, i = 1, ..., N. Formally

p(x = i) = \prod_j p_j^{\delta_{ij}}

δ_{ij} is the Kronecker delta and the p_i's are parameters.
Poisson distribution: a distribution over non-negative integers

p(n | \mu) = \frac{\mu^n}{n!} \exp[-\mu]

The parameter μ is often called the rate of the distribution. The Poisson distribution is often used for rare events, e.g. decay of particles or binding of DNA fragments to a probe (more later!)
Exercise: compute the mean and variance of a Poisson distribution
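A simulation you can check your derivation against (the value of μ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

mu = 3.5
samples = rng.poisson(lam=mu, size=200_000)

# The sample mean and variance should both come out close to the value
# the exercise asks you to derive.
print(samples.mean(), samples.var())
```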
Basic distributions
Multivariate normal: distribution over vectors x, density

p(\mathbf{x} | \boldsymbol{\mu}, \Sigma) = \frac{1}{\sqrt{|2\pi\Sigma|}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]

μ is the mean and Σ is the covariance matrix. Often useful to parametrise it in terms of the precision matrix Σ^{-1}. How many parameters does a multivariate normal have?
Gamma distribution: distribution over positive real numbers, density

p(x | k, \theta) = \frac{x^{k-1} \exp(-x/\theta)}{\theta^k \Gamma(k)}

with shape parameter k and scale parameter θ
Interesting exercise
This exercise illustrates the pitfalls of working in high dimensions.
1 Curse of dimensionality: Suppose you want to explore uniformly a region by gridding it. How many grid points do you need?
2 Even worse: Suppose you sample from a spherical Gaussian distribution. Where do the points lie as the dimension increases?
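A sketch that probes the second point empirically by looking at the distances of standard-Gaussian samples from the origin (the dimensions and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Draw from a standard spherical Gaussian in increasing dimension d and
# look at the distribution of distances from the origin.
for d in (2, 10, 100, 1000):
    x = rng.normal(size=(5_000, d))
    r = np.linalg.norm(x, axis=1)
    print(d, r.mean(), r.std())   # mass concentrates on a thin shell of radius ~ sqrt(d)
```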
Mixtures: how to build more distributions
More general distributions can be built via mixtures, e.g.

p(x | \mu_{1,\dots,n}, \sigma^2_{1,\dots,n}) = \sum_i \pi_i \mathcal{N}(\mu_i, \sigma_i^2)

where the mixing coefficients π_i are discretely distributed
You can interpret this as a two-stage hierarchical process: choose one component out of a discrete distribution, then sample from the distribution of that component
IMPORTANT CONCEPT: this is an example of a latent variable model, with a latent class variable and an observed continuous value. The mixture is the marginal distribution for the observations
The probability of the latent variables given the observations can be obtained using Bayes' theorem: see next week
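A sketch of the two-stage sampling process for a one-dimensional Gaussian mixture (all parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Mixture parameters (illustrative values only).
pi = np.array([0.3, 0.5, 0.2])        # mixing coefficients
mu = np.array([-2.0, 0.0, 3.0])       # component means
sigma = np.array([0.5, 1.0, 0.8])     # component standard deviations

# Two-stage hierarchical sampling: first the latent class, then the observation.
N = 10_000
z = rng.choice(len(pi), size=N, p=pi)          # latent class variable
x = rng.normal(loc=mu[z], scale=sigma[z])      # observed continuous value

print(x.mean(), x.std())   # samples from the marginal (mixture) distribution of x
```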
Continuous mixtures: some cool distributions
No need for the mixing distribution (latent variable) to be discrete
Suppose you are interested in the means of normally distributed samples (possibly with different variances/precisions)
Marginalising the precision in a Gaussian using a Gamma mixing distribution yields a Student t-distribution
Suppose you have multiple rare event processes happening with slightly different rates
Marginalising the rate in a Poisson distribution using a Gamma mixing distribution yields a negative binomial distribution
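A sampling sketch of the second construction (the Gamma shape and scale are arbitrary): drawing a rate from a Gamma and then a count from a Poisson with that rate gives marginal counts that are overdispersed, as expected of a negative binomial.

```python
import numpy as np

rng = np.random.default_rng(5)

# Gamma-Poisson mixture: draw a rate from a Gamma distribution, then a count
# from a Poisson with that rate. Marginally the counts are negative binomial.
N = 200_000
rates = rng.gamma(shape=2.0, scale=1.5, size=N)
counts = rng.poisson(lam=rates)

print(counts.mean(), counts.var())   # variance exceeds the mean, unlike a plain Poisson
```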
Parameters?
Many distributions are written as conditional probabilities given the parameters
Often the values of the parameters are not known
Given independent and identically distributed (i.i.d.) observations, we can estimate them; e.g., we pick θ by maximum likelihood

\hat{\theta} = \arg\max_\theta \left[ \prod_i p(x_i | \theta) \right]

Alternatively, you could have a prior over the parameters p(θ) and take the maximum a posteriori (MAP) estimate

\theta_{MAP} = \arg\max_\theta \left[ p(\theta) \prod_i p(x_i | \theta) \right]
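A toy comparison of the two estimates for a Bernoulli parameter (not an example from the slides; the Beta prior and its hyperparameters are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Small i.i.d. sample of coin flips with unknown success probability theta.
data = rng.binomial(1, p=0.7, size=20)
k, N = data.sum(), data.size

theta_ml = k / N                            # argmax of the likelihood alone

a, b = 2.0, 2.0                             # assumed Beta(a, b) prior hyperparameters
theta_map = (k + a - 1) / (N + a + b - 2)   # argmax of prior times likelihood

print(theta_ml, theta_map)   # MAP is pulled towards the prior mean of 0.5
```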
Justification for maximum likelihood
Given a data set x_i, i = 1, ..., N, let the empirical distribution be

p_{emp}(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(x_i)

with \mathbb{I} the indicator function of a set
To find a suitable distribution q to model the data, one may wish to minimize the Kullback-Leibler divergence

KL[p_{emp} \| q] = -H[p_{emp}] - \langle \log q(x) \rangle_{p_{emp}} = -H[p_{emp}] - \frac{1}{N} \sum_i \log q(x_i)

Since the entropy term does not depend on q, maximum likelihood is equivalent to minimizing a KL divergence with the empirical distribution
Exercise: fitting a discrete distribution
We have independent observations x_1, ..., x_N, each taking one of D possible values, giving a likelihood

L = \prod_{i=1}^{N} p(x_i | \mathbf{p})

Compute the Maximum Likelihood estimate of p. What is the intuitive meaning of the result? What happens if one of the D values is not represented in your sample?
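A numerical illustration to compare against your derivation (the true probabilities and the sample size are made up; the analytic answer is still the exercise):

```python
import numpy as np

rng = np.random.default_rng(7)

# Draw a sample over D = 4 values, with the last value having probability zero,
# and look at the per-value counts divided by N.
true_p = np.array([0.5, 0.3, 0.2, 0.0])
x = rng.choice(4, size=50, p=true_p)

counts = np.bincount(x, minlength=4)
print(counts / counts.sum())   # note the hard zero for the value never seen in the sample
```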
Exercise II: fitting a Gaussian distribution
We have independent, real valued observations x1, . . . , xN
Find the parameters of the optimal Gaussian fit by maximum likelihood
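A sketch that solves the exercise numerically rather than analytically, by directly maximising the log-likelihood with a generic optimiser (the synthetic data and starting point are arbitrary); compare its output against your closed-form answer.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(8)
x = rng.normal(loc=1.5, scale=2.0, size=500)   # synthetic observations

# Negative Gaussian log-likelihood in (mu, log sigma); log sigma keeps sigma positive.
def neg_log_lik(params):
    mu, log_sigma = params
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # compare against your analytic ML estimates
```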
Bayesian estimation
The Bayesian approach quantifies uncertainty at every step
The parameters are treated as additional random variables with their own prior distribution p(θ)
The observation likelihood is combined with the prior to obtain a posterior distribution via Bayes' theorem

p(\theta | x_I) = \frac{p(x_I | \theta) p(\theta)}{p(x_I)}

where I is the set indexing the observations
The distribution of the observable x (predictive distribution) is obtained as

p(x | x_I) = \int d\theta \, p(x | \theta) p(\theta | x_I)
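A minimal sketch of this machinery for a single scalar parameter, computing the posterior and the predictive on a grid (the Gaussian model, prior, and grid ranges are all illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
x_obs = rng.normal(loc=1.0, scale=1.0, size=10)   # the observations x_I

# Posterior over a single parameter theta (here: a Gaussian mean with known
# sigma = 1), computed on a grid straight from Bayes' theorem.
theta = np.linspace(-4.0, 4.0, 2001)
dtheta = theta[1] - theta[0]
prior = norm.pdf(theta, loc=0.0, scale=2.0)                       # p(theta)
log_lik = norm.logpdf(x_obs[:, None], loc=theta, scale=1.0).sum(axis=0)
unnorm = prior * np.exp(log_lik - log_lik.max())                  # prior times likelihood, rescaled
posterior = unnorm / (unnorm.sum() * dtheta)                      # divide by a grid estimate of p(x_I)

# Predictive distribution: integrate p(x | theta) against the posterior.
x_new = np.linspace(-5.0, 6.0, 500)
predictive = (norm.pdf(x_new[:, None], loc=theta, scale=1.0) * posterior).sum(axis=1) * dtheta
print((predictive * (x_new[1] - x_new[0])).sum())   # integrates to roughly 1
```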
Exercise: Bayesian fitting of Gaussians
Let data x_i, i = 1, ..., N be distributed according to a Gaussian with mean μ and variance σ²
Let the prior distribution over the mean μ be a Gaussian with mean m and variance v²
Compute the posterior and predictive distribution
Estimators
A procedure to calculate an estimate of a quantity (e.g. an expectation) from data is called an estimator
e.g., fitting a Gaussian to data by maximum likelihood provides the ML estimators for the mean and variance; the Bayesian posterior mean is another estimator
An estimator will be a noisy estimate of the true value, due to finite sample effects
An estimator f is unbiased if its expectation (under the joint distribution of the data set) coincides with the true value
Exercise: show that the ML estimator of variance is biased
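A simulation to compare against your derivation (the sample size and number of repeats are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(10)

# Repeatedly draw small samples from a Gaussian with true variance 1 and
# average the ML variance estimate (which divides by N) across repeats.
N, repeats = 5, 100_000
samples = rng.normal(size=(repeats, N))

var_ml = samples.var(axis=1, ddof=0)   # ML estimator of the variance
print(var_ml.mean())                   # systematically below the true value of 1
```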
Learning as fitting distributions
The world-view of this course is that a model consists of a set of random variables and probabilistic relationships describing their interactions
Learning refers to computing the conditional distributions of subsets of the model given some observations
Predictions are then carried out using the Bayesian predictivedistribution
Supervised and unsupervised learning
Slightly meaningless terms, but still heavily used
Focus on models involving more than one random variable
In particular, supervised learning applies when data is in the form of input-output pairs
Supervised learning aims at learning the (probabilistic) functional relationship between the output and the input
Unsupervised learning refers to purely learning the structure of the probability distribution underlying the data
Generative and discriminative models
Supervised learning can have two flavours
Two different types of question can be asked:
what is the joint probability of input/output pairs?
given a new input, what will be the output?
The first question requires a model of the population structure of the inputs, and of the conditional probability of the output given the input → generative modelling
The second question is more parsimonious but less explanatory → discriminative learning
Notice that the difference between generative supervised learning and unsupervised learning is moot