CS 59000 Statistical Machine Learning, Lecture 15
Transcript of CS 59000 Statistical Machine Learning, Lecture 15
CS 59000 Statistical Machine Learning, Lecture 15
Yuan (Alan) Qi, Purdue CS
Oct. 21, 2008
Outline
• Review of Gaussian Processes (GPs)
• From linear regression to GPs
• GP for regression
• Learning hyperparameters
• Automatic Relevance Determination
• GP for classification
Gaussian Processes
How do kernels arise naturally in a Bayesian setting?
Instead of assigning a prior to the parameters w, we assign a prior to the function values y.
In theory this prior lives in an infinite-dimensional space; in practice the space is finite, since we only need the function values at the finite number of training and test points.
Linear Regression Revisited
Let
We have
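The equations on this slide were lost in transcription. Assuming the standard linear-model setup that the rest of the lecture builds on (the notation below is a reconstruction, not the slide's literal content):

```latex
y(\mathbf{x}) = \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}),
\qquad
p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0},\, \alpha^{-1}\mathbf{I})
```

Stacking the training values y_n = y(x_n) into a vector y = Φw, a Gaussian w induces a Gaussian y with E[y] = 0 and cov[y] = α⁻¹ΦΦᵀ.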
From Prior on Parameter to Prior on Function
The prior on function value:
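The equation itself did not survive extraction; a hedged reconstruction, following directly from the linear model with Gaussian weight prior:

```latex
p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0},\, K),
\qquad
K_{nm} = \frac{1}{\alpha}\,\boldsymbol{\phi}(\mathbf{x}_n)^{\top}\boldsymbol{\phi}(\mathbf{x}_m)
       = k(\mathbf{x}_n, \mathbf{x}_m)
```

so the kernel arises as the covariance of the induced prior on function values.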
Stochastic Process
A stochastic process is specified by giving the joint distribution over any finite set of values, in a consistent manner. (Loosely speaking, consistency means that marginalizing a larger joint distribution down to a subset of the variables gives the same distribution as the one defined directly on that subset.)
Gaussian Processes
The joint distribution of any finite set of function values is a multivariate Gaussian distribution.
Without any prior knowledge, we often set the mean to zero. The GP is then specified entirely by its covariance:
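The covariance specification that presumably appeared here, in the standard form:

```latex
\mathbb{E}\!\left[\,y(\mathbf{x}_n)\, y(\mathbf{x}_m)\,\right] = k(\mathbf{x}_n, \mathbf{x}_m)
```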
Impact of the Kernel Function
The covariance matrix is determined by the choice of kernel function.
Applications: economics & finance
Gaussian Process for Regression
Likelihood:
Prior:
Marginal distribution:
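The three distributions named above were equation images that did not survive extraction. Reconstructed under the usual Gaussian-noise assumption (β denotes the noise precision; a sketch of the standard treatment):

```latex
p(\mathbf{t} \mid \mathbf{y}) = \mathcal{N}(\mathbf{t} \mid \mathbf{y},\, \beta^{-1}\mathbf{I}_N) \\
p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0},\, K) \\
p(\mathbf{t}) = \int p(\mathbf{t} \mid \mathbf{y})\, p(\mathbf{y})\, d\mathbf{y}
             = \mathcal{N}(\mathbf{t} \mid \mathbf{0},\, C),
\qquad C = K + \beta^{-1}\mathbf{I}_N
```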
Samples of Data Points
Predictive Distribution
The predictive distribution p(t_{N+1} | t_N) is a Gaussian distribution with mean and variance:
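A hedged reconstruction of the standard result, where k is the vector with elements k(x_n, x_{N+1}) and c = k(x_{N+1}, x_{N+1}) + β⁻¹:

```latex
m(\mathbf{x}_{N+1}) = \mathbf{k}^{\top} C_N^{-1}\,\mathbf{t},
\qquad
\sigma^{2}(\mathbf{x}_{N+1}) = c - \mathbf{k}^{\top} C_N^{-1}\,\mathbf{k}
```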
Predictive Mean
The predictive mean can be written as m(x_{N+1}) = Σ_n a_n k(x_n, x_{N+1}), where a_n is the nth component of C_N⁻¹ t. We see the same form as in kernel ridge regression and kernel PCA.
GP Regression
Discussion: what is the difference between GP regression and Bayesian regression with Gaussian basis functions?
Computational Complexity
GP prediction for a new data point:
• GP: O(N³), where N is the number of data points
• Basis-function model: O(M³), where M is the dimension of the feature expansion
When N is large, GP prediction is computationally expensive.
Sparsification: make predictions based on only a few data points (essentially making N small).
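The GP regression prediction described above can be sketched in code. A minimal NumPy implementation for illustration only; the RBF kernel, its length scale, and the noise precision β = 25 are assumptions, not values from the lecture:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=0.3, variance=1.0):
    """Squared-exponential (RBF) kernel matrix between two sets of 1-D inputs."""
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / length_scale ** 2)

def gp_predict(X_train, t_train, X_test, beta=25.0):
    """GP regression: mean k^T C^-1 t and variance c - k^T C^-1 k per test point."""
    K = rbf_kernel(X_train, X_train)
    C = K + np.eye(len(X_train)) / beta        # C_N = K + beta^-1 I; the O(N^3) cost is here
    k = rbf_kernel(X_train, X_test)            # cross-covariances, shape (N, M)
    c = rbf_kernel(X_test, X_test).diagonal() + 1.0 / beta
    mean = k.T @ np.linalg.solve(C, t_train)   # predictive mean
    var = c - np.einsum('ij,ij->j', k, np.linalg.solve(C, k))  # predictive variance
    return mean, var

# Toy data: noisy samples of a sine function
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 20)
t = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(20)
X_star = np.linspace(0.0, 1.0, 5)
mean, var = gp_predict(X, t, X_star)
```

Note that the N×N system is solved with `np.linalg.solve` rather than an explicit inverse; the cubic cost of that solve is exactly the O(N³) bottleneck that sparsification targets.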
Learning Hyperparameters
Empirical Bayes Methods
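This slide's formulas are missing; presumably it showed the log marginal likelihood being maximized over the kernel hyperparameters θ. In the standard treatment, the objective and its gradient are:

```latex
\ln p(\mathbf{t} \mid \boldsymbol{\theta})
  = -\tfrac{1}{2}\ln |C_N|
    - \tfrac{1}{2}\,\mathbf{t}^{\top} C_N^{-1}\,\mathbf{t}
    - \tfrac{N}{2}\ln(2\pi) \\
\frac{\partial}{\partial \theta_i} \ln p(\mathbf{t} \mid \boldsymbol{\theta})
  = -\tfrac{1}{2}\,\mathrm{Tr}\!\left(C_N^{-1}\,\frac{\partial C_N}{\partial \theta_i}\right)
    + \tfrac{1}{2}\,\mathbf{t}^{\top} C_N^{-1}\,
      \frac{\partial C_N}{\partial \theta_i}\, C_N^{-1}\,\mathbf{t}
```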
Automatic Relevance Determination
Consider a two-dimensional input problem.
Maximizing the marginal likelihood will make certain hyperparameters small, reducing the corresponding input's relevance to the prediction.
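A common ARD kernel, with one precision-like parameter η_i per input dimension (a sketch of what the slide likely showed):

```latex
k(\mathbf{x}, \mathbf{x}')
  = \theta_0 \exp\!\left\{ -\tfrac{1}{2} \sum_{i=1}^{D} \eta_i\, (x_i - x_i')^2 \right\}
```

A small η_i makes the kernel nearly insensitive to input dimension i, so that dimension drops out of the prediction.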
Example
t = sin(2π x1)
x2 = x1 + n (a noisy copy of x1)
x3 = ε (noise unrelated to the target)
Gaussian Processes for Classification
Likelihood:
GP Prior:
Covariance function:
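The three items above were equation images. Reconstructed under the standard binary-classification setup with logistic sigmoid σ and latent function values a (a hedged sketch; ν is a small jitter term keeping the covariance positive definite):

```latex
p(t \mid a) = \sigma(a)^{t}\,\bigl(1 - \sigma(a)\bigr)^{1-t} \\
p(\mathbf{a}_{N+1}) = \mathcal{N}(\mathbf{a}_{N+1} \mid \mathbf{0},\, C_{N+1}) \\
C(\mathbf{x}_n, \mathbf{x}_m) = k(\mathbf{x}_n, \mathbf{x}_m) + \nu\,\delta_{nm}
```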
Sample from GP Prior
Predictive Distribution
No analytical solution. Approximate the integration with one of:
• Laplace's method
• Variational Bayes
• Expectation propagation
Laplace’s method for GP Classification (1)
Laplace’s method for GP Classification (2)
Taylor expansion:
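The quantity being expanded is presumably the log of the unnormalized posterior over the latent values; in the standard derivation (with σ_N the vector of σ(a_n) and W_N diagonal with entries σ(a_n)(1 − σ(a_n))):

```latex
\Psi(\mathbf{a}_N) = \ln p(\mathbf{a}_N) + \ln p(\mathbf{t}_N \mid \mathbf{a}_N) \\
\nabla \Psi(\mathbf{a}_N) = \mathbf{t}_N - \boldsymbol{\sigma}_N - C_N^{-1}\mathbf{a}_N \\
\nabla\nabla \Psi(\mathbf{a}_N) = -\,W_N - C_N^{-1}
```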
Laplace’s method for GP Classification (3)
Newton-Raphson update:
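The update formula itself is missing; in the standard form (iterated to convergence at the posterior mode a*):

```latex
\mathbf{a}_N^{\mathrm{new}}
  = C_N \left(\mathbf{I} + W_N C_N\right)^{-1}
    \bigl\{ \mathbf{t}_N - \boldsymbol{\sigma}_N + W_N \mathbf{a}_N \bigr\}
```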
Laplace’s method for GP Classification (4)
Gaussian approximation:
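Evaluated at the mode a*, the Laplace approximation presumably shown here is:

```latex
q(\mathbf{a}_N) = \mathcal{N}\!\left(\mathbf{a}_N \mid \mathbf{a}^{\star},\, H^{-1}\right),
\qquad
H = -\,\nabla\nabla\Psi = W_N + C_N^{-1}
```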
Laplace’s method for GP Classification (5)
Question: How to get the mean and the variance above?
Predictive Distribution
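Under the Laplace approximation, the posterior over the latent value a_{N+1} at a test point is Gaussian; the standard expressions for its moments are (a hedged reconstruction):

```latex
\mathbb{E}[a_{N+1} \mid \mathbf{t}_N] = \mathbf{k}^{\top}(\mathbf{t}_N - \boldsymbol{\sigma}_N) \\
\mathrm{var}[a_{N+1} \mid \mathbf{t}_N]
  = c - \mathbf{k}^{\top}\left(W_N^{-1} + C_N\right)^{-1}\mathbf{k}
```

The class probability then follows by averaging σ(a_{N+1}) over this Gaussian; there is no closed form, and a probit-style approximation to the sigmoid is typically used.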
Example
![Page 28: CS 59000 Statistical Machine learning Lecture 15](https://reader036.fdocuments.net/reader036/viewer/2022062314/56813c6e550346895da5ff8b/html5/thumbnails/28.jpg)