Simple Bayesian Supervised Models
Saskia Klein & Steffen Bollmann
Content
• Recap from last week
• Bayesian Linear Regression
  – What is linear regression?
  – Application of Bayesian theory to linear regression
  – Example
  – Comparison to conventional linear regression
• Bayesian Logistic Regression
• Naive Bayes classifier
Source: Bishop (ch. 3,4); Barber (ch. 10)
Maximum a posteriori estimation
• The Bayesian approach to estimating the parameters of a distribution given a set of observations is to maximize the posterior distribution.
• It makes it possible to account for prior information:

posterior = likelihood × prior / evidence
p(θ|X) = p(X|θ) p(θ) / p(X)
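As a concrete illustration (not from the original slides; all names and numeric values below are assumptions), here is a minimal MATLAB sketch of MAP estimation for the mean of a Gaussian with known noise precision and a conjugate Gaussian prior, where the posterior mode has a closed form:

% MAP estimate of a Gaussian mean -- illustrative sketch, assumed values
rng(0);
mu_true = 2.0; beta = 4.0;                % noise precision (variance 1/4)
x = mu_true + sqrt(1/beta)*randn(10,1);   % N = 10 observations
mu0 = 0; alpha = 1.0;                     % prior: p(mu) = N(mu0, 1/alpha)
N = numel(x);
mu_map = (alpha*mu0 + beta*sum(x)) / (alpha + beta*N);  % posterior mode
mu_ml  = mean(x);                         % maximum likelihood, for comparison
fprintf('ML: %.3f  MAP: %.3f\n', mu_ml, mu_map);

With few observations the MAP estimate is pulled towards the prior mean; as N grows, the likelihood dominates and the two estimates coincide.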
Conjugate prior
• In general, for a given probability distribution p(x|η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior.
• For any member of the exponential family, there exists a conjugate prior that can be written in the form (cf. Bishop, ch. 2)
  p(η|χ, ν) = f(χ, ν) g(η)^ν exp(ν η^T χ)
• Important conjugate pairs include:
  – Binomial – Beta
  – Multinomial – Dirichlet
  – Gaussian – Gaussian (for the mean)
  – Gaussian – Gamma (for the precision)
  – Exponential – Gamma
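A minimal MATLAB sketch of the Binomial–Beta pair (hypothetical counts, not from the slides): the posterior stays in the Beta family, and only the hyperparameters are updated by the observed counts.

% Conjugate Beta prior for a Binomial likelihood -- illustrative sketch
a0 = 2; b0 = 2;                  % prior Beta(a0, b0), assumed values
N = 20; m = 14;                  % assumed data: 14 successes in 20 trials
aN = a0 + m; bN = b0 + (N - m);  % posterior Beta(aN, bN): same family
theta = linspace(0, 1, 200);
prior = theta.^(a0-1).*(1-theta).^(b0-1) / beta(a0, b0);
post  = theta.^(aN-1).*(1-theta).^(bN-1) / beta(aN, bN);
plot(theta, prior, '--', theta, post, '-');
legend('prior', 'posterior');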
Linear Regression
goal: predict the value of a target variable t given the value of a D-dimensional vector x of input variables

linear regression models: linear functions of the adjustable parameters w

for example:
y(x, w) = w_0 + w_1 x_1 + … + w_D x_D
Linear Regression
Training
• training data set comprising N observations {x_n}, n = 1, …, N, with corresponding target values {t_n}
• compute the weights w

Prediction
• goal: predict the value of t for a new value of x
• = model the predictive distribution p(t|x) and make predictions of t in such a way as to minimize the expected value of a loss function
Examples of linear regression models
simplest linear regression model: linear function of the weights/parameters and the data:
y(x, w) = w_0 + w_1 x_1 + … + w_D x_D

linear regression models using basis functions φ_j(x):
y(x, w) = Σ_j w_j φ_j(x) = w^T φ(x), with the dummy basis function φ_0(x) = 1
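A short MATLAB sketch (names and values are illustrative assumptions) of how such a model is built in practice: a design matrix whose columns are polynomial basis functions evaluated at the inputs, so the model stays linear in the weights even though it is nonlinear in x.

% Design matrix with polynomial basis functions -- illustrative sketch
x = linspace(-1, 1, 50)';       % N = 50 scalar inputs
M = 4;                          % number of basis functions
Phi = zeros(numel(x), M);
for j = 0:M-1
    Phi(:, j+1) = x.^j;         % column j+1 holds phi_j(x) = x^j
end
w = [0.5; -1; 2; 0.3];          % arbitrary example weights
y = Phi*w;                      % linear in w, nonlinear in x
plot(x, y);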
Bayesian Linear Regression
model: t = y(x, w) + ε
• t … target variable
• y(x, w) … model
• x … data
• w … weights/parameters
• ε … additive Gaussian noise with zero mean and precision (inverse variance) β
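A minimal sketch of drawing synthetic data from this model, assuming a straight line y(x, w) = -0.3 + 0.5x and a noise precision of 25 (both are arbitrary choices, not from the slides; this data set is reused in the sketches below):

% Synthetic data from t = y(x, w) + eps -- illustrative sketch
rng(0);
beta = 25;                                       % noise precision (variance 1/25)
x = 2*rand(20,1) - 1;                            % inputs drawn from [-1, 1]
t = -0.3 + 0.5*x + sqrt(1/beta)*randn(size(x));  % noisy targets
plot(x, t, 'o');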
Bayesian Linear Regression - Likelihood
likelihood function:
p(t|X, w, β) = ∏_n N(t_n | w^T φ(x_n), 1/β)

observation of N training inputs X = {x_1, …, x_N} with target values t = (t_1, …, t_N)^T, drawn independently from the distribution
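Continuing the synthetic-data sketch above (x, t and beta as defined there), the log of this likelihood can be evaluated for any candidate weight vector:

% Log-likelihood of the weights under Gaussian noise -- sketch
Phi = [ones(size(x)) x];      % design matrix for the basis [1, x]
loglik = @(w) sum(0.5*log(beta/(2*pi)) - 0.5*beta*(t - Phi*w).^2);
loglik([-0.3; 0.5])           % evaluated at the data-generating weights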
Bayesian Linear Regression - Prior
prior probability distribution over the model parameters w

conjugate prior: Gaussian distribution
p(w) = N(w | m_0, S_0) with mean m_0 and covariance S_0
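A small sketch of what this prior expresses, with assumed hyperparameters m_0 = 0 and S_0 = (1/α)I for the straight-line basis [1, x]: each sample from the prior is a plausible regression line before any data are seen.

% Sampling regression lines from the Gaussian prior -- sketch
rng(0);
alpha = 2.0;                               % assumed prior precision
m0 = zeros(2,1); S0 = (1/alpha)*eye(2);    % p(w) = N(m0, S0)
ws = m0 + chol(S0, 'lower')*randn(2, 5);   % five samples w ~ N(m0, S0)
xs = linspace(-1, 1, 50)';
plot(xs, [ones(size(xs)) xs]*ws);          % one line y(x, w) per sample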
Bayesian Linear Regression – Posterior Distribution
due to the conjugate prior, the posterior will also be Gaussian:
p(w|t) = N(w | m_N, S_N)
with m_N = S_N (S_0^-1 m_0 + β Φ^T t) and S_N^-1 = S_0^-1 + β Φ^T Φ
(derivation: Bishop p. 112)
Example Linear Regression
(live MATLAB demonstration)
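The original demo script is not reproduced here; the following is a small stand-in sketch that fits the synthetic straight-line data generated above (x, t, beta), with an assumed prior precision alpha (cf. Bishop, fig. 3.7):

% Bayesian fit of a straight line -- illustrative stand-in for the demo
alpha = 2.0; m0 = zeros(2,1); S0 = (1/alpha)*eye(2);
Phi = [ones(size(x)) x];                 % design matrix for the data above
SN = inv(inv(S0) + beta*(Phi'*Phi));     % posterior covariance S_N
mN = SN*(S0\m0 + beta*(Phi'*t));         % posterior mean m_N
xs = linspace(-1, 1, 100)'; Phis = [ones(size(xs)) xs];
plot(x, t, 'o', xs, Phis*mN, '-');       % data and posterior-mean fit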
Predictive Distribution
making predictions of t for new values of x

predictive distribution:
p(t | x, data) = ∫ p(t | x, w, β) p(w | data) dw = N(t | m_N^T φ(x), σ_N^2(x))

variance of the distribution:
σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)
• the first term represents the noise in the data
• the second term reflects the uncertainty associated with the parameters w

the optimal prediction for a new value of x would be the conditional mean of the target variable:
E[t|x] = m_N^T φ(x)
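Continuing the sketch above (mN, SN and beta as computed there), the two variance terms can be evaluated directly for a grid of new inputs:

% Predictive mean and variance -- sketch
xs = linspace(-1, 1, 100)';
Phis = [ones(size(xs)) xs];                   % phi(x) for the new inputs
mu_pred  = Phis*mN;                           % predictive mean m_N' phi(x)
var_pred = 1/beta + sum((Phis*SN).*Phis, 2);  % 1/beta + phi(x)' S_N phi(x)
errorbar(xs, mu_pred, sqrt(var_pred));        % mean +/- one std dev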
Common Problem in Linear Regression: Overfitting/Model Complexity
• Least squares approach (maximizing the likelihood): only a point estimate of the weights
• Regularization: the regularization term and its weighting value need to be chosen
• Cross-validation: requires large datasets and high computational power
• Bayesian approach: yields a distribution over the weights and needs a good prior; model comparison is computationally demanding, but validation data are not required
From Regression to Classification
for regression problems: the target variable t was a vector of real numbers whose values we wish to predict

in case of classification:
• target values represent class labels
• two-class problem: t ∈ {0, 1}, where t = 1 represents class 1 and t = 0 represents class 2
• K > 2 classes: 1-of-K coding, e.g. t = (0, 1, 0, 0, 0)^T for class 2
Classification
goal: take an input vector x and assign it to one of K discrete classes C_k, k = 1, …, K

the input space is divided into decision regions whose boundaries are called decision boundaries
Bayesian Logistic Regression
model the class-conditional densities p(x|C_k) and the prior class probabilities p(C_k), then apply Bayes' theorem to obtain the posterior class probabilities:
p(C_k|x) = p(x|C_k) p(C_k) / p(x)
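A minimal numeric sketch of this generative view (two hypothetical 1-D Gaussian class-conditional densities with equal priors; all values are assumptions):

% Posterior class probabilities via Bayes' theorem -- sketch
mu1 = -1; mu2 = 1; s = 1;               % class-conditionals N(mu_k, s^2)
prior = [0.5 0.5];                      % p(C_1), p(C_2)
gauss = @(x, mu) exp(-(x - mu).^2/(2*s^2)) / sqrt(2*pi*s^2);
xnew = 0.3;                             % a new input to classify
lik  = [gauss(xnew, mu1) gauss(xnew, mu2)];   % p(x|C_k)
post = lik.*prior / sum(lik.*prior)           % p(C_k|x), sums to one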
Bayesian Logistic Regression
exact Bayesian inference for logistic regression is intractable

Laplace approximation: aims to find a Gaussian approximation to a probability density defined over a set of continuous variables

the posterior distribution is approximated by a Gaussian centered at its mode, w_MAP
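A sketch of the standard construction (synthetic data; the prior precision alpha and all names are assumptions): Newton iterations locate the posterior mode w_MAP, and the inverse of the negative log-posterior Hessian at the mode gives the covariance of the approximating Gaussian.

% Laplace approximation for Bayesian logistic regression -- sketch
rng(0);
N = 100;
X = [randn(N/2,2) - 1; randn(N/2,2) + 1];   % two Gaussian blobs
t = [zeros(N/2,1); ones(N/2,1)];            % class labels 0/1
Phi = [ones(N,1) X];                        % bias + linear basis
alpha = 1.0;                                % prior p(w) = N(0, (1/alpha) I)
sigm = @(a) 1./(1 + exp(-a));
w = zeros(3,1);
for it = 1:20                               % Newton's method for w_MAP
    y = sigm(Phi*w);
    g = Phi'*(t - y) - alpha*w;             % gradient of the log posterior
    H = Phi'*diag(y.*(1-y))*Phi + alpha*eye(3);   % negative Hessian
    w = w + H\g;
end
y = sigm(Phi*w);
SN = inv(Phi'*diag(y.*(1-y))*Phi + alpha*eye(3));
% Laplace approximation: q(w) = N(w | w_MAP, SN), with w_MAP = w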
Example
Barber: DemosExercises\demoBayesLogRegression.m
Naive Bayes classifier
Why naive? strong independence assumptions:
• assumes that the presence/absence of a feature of a class is unrelated to the presence/absence of any other feature, given the class variable
• ignores relations between features and assumes that all features contribute independently to a class
[http://en.wikipedia.org/wiki/Naive_Bayes_classifier]
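A minimal Gaussian naive Bayes sketch (synthetic two-class data; every name and value is an illustrative assumption): each feature is modelled with an independent per-class Gaussian, and class scores are sums of per-feature log densities.

% Gaussian naive Bayes classifier -- illustrative sketch
rng(0);
X1 = randn(50,2) + 1; X2 = randn(50,2) - 1;  % training data, two classes
mu = [mean(X1); mean(X2)];                    % per-class, per-feature means
sd = [std(X1); std(X2)];                      % per-class, per-feature std devs
prior = [0.5 0.5];
logn = @(x, m, s) -0.5*log(2*pi*s.^2) - (x - m).^2./(2*s.^2);
xnew = [0.2 0.8];                             % a new point to classify
score = zeros(1,2);
for k = 1:2    % log p(C_k) + sum over features of log p(x_j|C_k)
    score(k) = log(prior(k)) + sum(logn(xnew, mu(k,:), sd(k,:)));
end
[~, khat] = max(score)                        % predicted class index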
Thank you for your attention

Saskia Klein & Steffen Bollmann