Introduction to Machine Learning
Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth
Lecture 3: Bayesian Learning, MAP & ML Estimation
Classification: Naïve Bayes
Many figures courtesy Kevin Murphy’s textbook, Machine Learning: A Probabilistic Perspective
Bayes Rule (Bayes Theorem)

θ: unknown parameters (many possible models)
D: observed data available for learning
p(θ): prior distribution (domain knowledge)
p(D | θ): likelihood function (measurement model)
p(θ | D): posterior distribution (learned information)

p(θ, D) = p(θ) p(D | θ) = p(D) p(θ | D)

p(θ | D) = p(θ, D) / p(D) = p(D | θ) p(θ) / ∑_{θ′ ∈ Θ} p(D | θ′) p(θ′) ∝ p(D | θ) p(θ)
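As a concrete illustration of the rule over a finite hypothesis space, here is a minimal Python sketch; the coin-bias hypotheses, prior weights, and data are made up for illustration and are not from the lecture.

```python
# Minimal sketch of Bayes rule over a discrete hypothesis space.
# The hypotheses, prior, and likelihood below are illustrative placeholders.

def posterior(prior, likelihood, data):
    """prior: dict theta -> p(theta); likelihood: fn(data, theta) -> p(data | theta)."""
    joint = {theta: prior[theta] * likelihood(data, theta) for theta in prior}
    evidence = sum(joint.values())        # p(D) = sum over theta' of p(D | theta') p(theta')
    return {theta: j / evidence for theta, j in joint.items()}

# Toy example: three candidate coin biases theta, data = observed flips (1 = heads).
prior = {0.3: 1/3, 0.5: 1/3, 0.7: 1/3}

def coin_likelihood(flips, theta):
    heads = sum(flips)
    return theta ** heads * (1 - theta) ** (len(flips) - heads)

print(posterior(prior, coin_likelihood, [1, 1, 0, 1]))   # posterior mass shifts toward theta = 0.7
```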
The Number Game
• I am thinking of some arithmetical concept, such as:
  • Prime numbers
  • Numbers between 1 and 10
  • Even numbers
  • ...
• I give you a series of randomly chosen positive examples from the chosen class
• Question: Are other test digits also in the class?
Tenenbaum 1999
Predictions of 20 Humans
[Figure: human generalization judgments for D = {16}, D = {16, 8, 2, 64}, and D = {16, 23, 19, 20}]
A Bayesian Model

Likelihood:
• Assume examples are sampled uniformly at random from all numbers that are consistent with the hypothesis
• Size principle: favors the smallest consistent hypotheses (see the sketch below)

Prior:
• Based on prior experience, some hypotheses are more probable ("natural") than others
  • Powers of 2
  • Powers of 2, except 32, plus 37
• Subjectivity: may depend on the observer's experience

D = {5, 34, 2, 89, 1, 13}?
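A minimal sketch of this model for the number game, assuming examples are drawn uniformly with replacement from the extension of each hypothesis over {1, ..., 100}; the particular hypotheses and prior weights below are illustrative, not the ones used in the lecture.

```python
# Sketch of the number-game likelihood under the size principle.
# Hypotheses are represented by their extensions; the hypothesis set and prior are illustrative.

hypotheses = {
    "even":            {n for n in range(1, 101) if n % 2 == 0},
    "powers_of_2":     {2 ** k for k in range(1, 7)},        # 2, 4, 8, 16, 32, 64
    "multiples_of_10": set(range(10, 101, 10)),
}
prior = {"even": 0.5, "powers_of_2": 0.4, "multiples_of_10": 0.1}

def likelihood(D, extension):
    """Size principle: p(D | h) = (1 / |h|)^N if every example lies in h, else 0."""
    if not set(D) <= extension:
        return 0.0
    return (1.0 / len(extension)) ** len(D)

D = [16, 8, 2, 64]
scores = {h: prior[h] * likelihood(D, ext) for h, ext in hypotheses.items()}
Z = sum(scores.values())
posterior = {h: s / Z for h, s in scores.items()}
print(posterior)   # the small "powers of 2" hypothesis dominates after a few examples
```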
Posterior Distributions
[Figure: posterior over hypotheses for D = {16} and for D = {16, 8, 2, 64}]
Posterior Estimation
• As the amount of data becomes large, weak conditions on the hypothesis space and measurement process imply that the posterior concentrates on a single hypothesis θ̂
• This is a maximum a posteriori (MAP) estimate:
  θ̂_MAP = argmax_θ p(θ | D) = argmax_θ p(D | θ) p(θ)
• With a large amount of data, and/or an (almost) uniform prior, we approach the maximum likelihood (ML) estimate:
  θ̂_ML = argmax_θ p(D | θ)
  (see the sketch below)
• More theory to come later!
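As a small illustration of the difference, here is a sketch comparing MAP and ML estimates over a discrete hypothesis space; the prior and likelihood values are placeholder numbers chosen only to show that the two estimates can disagree.

```python
# MAP vs. ML estimates over a discrete hypothesis space; all values are illustrative.
prior      = {"h1": 0.70, "h2": 0.20, "h3": 0.10}   # p(h)
likelihood = {"h1": 0.01, "h2": 0.20, "h3": 0.30}   # p(D | h) for one fixed data set D

h_ml  = max(likelihood, key=likelihood.get)                     # argmax_h p(D | h)
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])      # argmax_h p(D | h) p(h)

print("ML:", h_ml, " MAP:", h_map)   # ML picks h3; the strong prior pulls MAP toward h2
```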
Posterior Predictions
• Suppose we want to predict the next number x ∈ {1, 2, 3, ...} that will be revealed to us. One option is to use the MAP estimate ĥ:
  p(x | D) ≈ p(x | ĥ)
• But if we correctly apply Bayes rule to the specified model, we instead obtain the posterior predictive distribution:
  p(x | D) = ∑_{h ∈ H} p(x | h) p(h | D)
• This is sometimes called Bayesian model averaging.
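A minimal sketch of the contrast, using an illustrative two-hypothesis posterior (not the lecture's hypothesis space): the plug-in MAP prediction keeps only the single best hypothesis, while the posterior predictive averages over all of them.

```python
# Sketch: MAP plug-in prediction vs. the posterior predictive (Bayesian model averaging).
# The hypotheses and posterior weights are illustrative placeholders.

hypotheses = {
    "powers_of_2": {2, 4, 8, 16, 32, 64},
    "even":        set(range(2, 101, 2)),
}
posterior = {"powers_of_2": 0.8, "even": 0.2}       # p(h | D) for some data D

def p_x_given_h(x, extension):
    """Uniform predictive over the hypothesis extension."""
    return 1.0 / len(extension) if x in extension else 0.0

def predictive(x):
    """p(x | D) = sum_h p(x | h) p(h | D)  -- Bayesian model averaging."""
    return sum(p_x_given_h(x, hypotheses[h]) * posterior[h] for h in hypotheses)

h_map = max(posterior, key=posterior.get)
for x in (32, 10):
    print(x, predictive(x), p_x_given_h(x, hypotheses[h_map]))
# x = 10 gets nonzero probability under model averaging,
# but zero under the MAP plug-in approximation.
```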
Posterior Predictive Distribution
Experiment: Human Judgments
Experiment: Bayesian Predictions
Machine Learning Problems

             Supervised Learning                Unsupervised Learning
Discrete     classification or categorization   clustering
Continuous   regression                         dimensionality reduction
Generative Classifiers

y: class label in {1, ..., C}, observed in training
x ∈ X: observed features to be used for classification
θ: parameters indexing a family of models

p(y, x | θ) = p(y | θ) p(x | y, θ)
  p(y | θ): prior distribution
  p(x | y, θ): likelihood function

• Compute the class posterior distribution via Bayes rule (see the sketch below):
  p(y = c | x, θ) = p(y = c | θ) p(x | y = c, θ) / ∑_{c′=1}^{C} p(y = c′ | θ) p(x | y = c′, θ)
• Inference: find the label distribution for some input example
• Classification: make a decision based on the inferred distribution
• Learning: estimate parameters θ from labeled training data
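A hedged sketch of that class-posterior computation; the class prior and class-conditional likelihood values below are illustrative placeholders for a single observed x.

```python
# Sketch: class posterior p(y = c | x, theta) for a generative classifier.
# The class prior and class-conditional likelihood values are illustrative.

import numpy as np

prior = np.array([0.6, 0.3, 0.1])          # p(y = c | theta), C = 3 classes
lik   = np.array([0.02, 0.05, 0.01])       # p(x | y = c, theta) for one observed x

joint = prior * lik                        # numerator: p(y = c | theta) p(x | y = c, theta)
class_posterior = joint / joint.sum()      # divide by the sum over c' (Bayes rule)
print(class_posterior)                     # sums to 1; class 1 wins here despite a smaller prior
```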
Specifying a Generative Model

p(y, x | θ) = p(y | θ) p(x | y, θ)
  p(y | θ): prior distribution
  p(x | y, θ): likelihood function

• For generative classification, we take the prior to be some categorical (multinoulli) distribution:
  p(y | θ) = Cat(y | θ)
• The likelihood must be matched to the domain of the data
• Suppose x is a vector of D different features
• The simplest generative model assumes that these features are conditionally independent, given the class label:
  p(x | y = c, θ) = ∏_{j=1}^{D} p(x_j | y = c, θ_jc)
• This is the so-called naïve Bayes model for classification
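A minimal sketch of this factorization for binary features, assuming each p(x_j | y = c) is a Bernoulli distribution with parameter θ_jc; all parameter values below are illustrative placeholders.

```python
# Sketch: naive Bayes class-conditional likelihood with D binary features.
# p(x | y = c, theta) = prod_j p(x_j | y = c, theta_jc), with Bernoulli feature terms.
# Parameter values are illustrative placeholders.

import numpy as np

theta = np.array([[0.9, 0.1, 0.4],     # theta[c, j] = p(x_j = 1 | y = c), C = 2, D = 3
                  [0.2, 0.8, 0.5]])
class_prior = np.array([0.5, 0.5])

x = np.array([1, 0, 1])                # one observed feature vector

# Conditional independence: multiply per-feature Bernoulli terms within each class.
lik = np.prod(theta ** x * (1 - theta) ** (1 - x), axis=1)   # shape (C,)
post = class_prior * lik
post /= post.sum()
print(post)                            # class posterior p(y = c | x, theta)
```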
Learning a Generative Model

• Assume we have N training examples independently sampled from an unknown naïve Bayes model:
  y_i: observed class label for training example i
  x_ij: value of feature j for training example i

p(θ | y, x) ∝ p(θ, y, x) = p(θ) p(y | θ) p(x | y, θ)
  p(θ): model,  p(y | θ): class,  p(x | y, θ): features

p(θ | y, x) ∝ p(θ) ∏_{i=1}^{N} p(y_i | θ) p(x_i | y_i, θ)
            ∝ p(θ) ∏_{i=1}^{N} p(y_i | θ) ∏_{j=1}^{D} p(x_ij | y_i, θ)

• Learning: ML estimate, MAP estimate, or posterior prediction
Naïve Bayes: ML Estimation

N_c: number of training examples of class c

• Even if we are doing discrete categorization based on discrete features, this is a continuous optimization problem!
• Bayesian reasoning about models also requires continuous probability distributions, even in this simple case
• Next week we will show that the ML class estimates are the empirical class frequencies:
  π̂_c = N_c / N
• For binary features, we have:
  θ̂_jc = N_jc / N_c, where N_jc is the number of class-c examples with feature x_j = 1
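A hedged sketch of these count-based ML estimates for a Bernoulli naïve Bayes model; the tiny labeled data set is made up for illustration.

```python
# Sketch: ML estimation for a Bernoulli naive Bayes model from labeled data.
# pi_c = N_c / N and theta_jc = N_jc / N_c, where N_jc counts class-c examples with x_j = 1.
# The toy data set is made up for illustration.

import numpy as np

X = np.array([[1, 0, 1],               # N = 5 examples, D = 3 binary features
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])
y = np.array([0, 0, 0, 1, 1])          # class labels in {0, 1}

classes = np.unique(y)
N = len(y)
pi = np.array([(y == c).sum() / N for c in classes])           # class frequencies N_c / N
theta = np.array([X[y == c].mean(axis=0) for c in classes])    # per-class feature frequencies N_jc / N_c

print(pi)      # [0.6 0.4]
print(theta)   # theta[c, j] = empirical frequency of feature j within class c
```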