Statistical Learning (From data to distributions)
Reminders
• HW5 deadline extended to Friday
Agenda
• Learning a probability distribution from data
• Maximum likelihood estimation (MLE)
• Maximum a posteriori (MAP) estimation
• Expectation Maximization (EM)
Motivation
• Agent has made observations (data)
• Now must make sense of it (hypotheses)
– Hypotheses alone may be important (e.g., in basic science)
– For inference (e.g., forecasting)
– To take sensible actions (decision making)
• A basic component of economics, social and hard sciences, engineering, …
Candy Example
• Candy comes in 2 flavors, cherry and lime, with identical wrappers
• Manufacturer makes 5 (indistinguishable) bags
• Suppose we draw some candies from the bag
• What bag are we holding? What flavor will we draw next?
H1: C 100%, L 0%
H2: C 75%, L 25%
H3: C 50%, L 50%
H4: C 25%, L 75%
H5: C 0%, L 100%
Machine Learning vs. Statistics
• Machine Learning ≈ automated statistics
• This lecture
– Bayesian learning, the more “traditional” statistics (R&N 20.1-3)
– Learning Bayes Nets
Bayesian Learning
• Main idea: Consider the probability of each hypothesis, given the data
• Data d: the observed sequence of draws
• Hypotheses: P(hi|d)
h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%
Using Bayes’ Rule
• P(hi|d) = α P(d|hi) P(hi) is the posterior
– (Recall, 1/α = Σi P(d|hi) P(hi))
• P(d|hi) is the likelihood
• P(hi) is the hypothesis prior
h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%
Computing the Posterior
• Assume draws are independent
• Let P(h1),…,P(h5) = (0.1,0.2,0.4,0.2,0.1)
• d = { 10 × lime }
P(d|h1) = 0
P(d|h2) = 0.25^10
P(d|h3) = 0.5^10
P(d|h4) = 0.75^10
P(d|h5) = 1^10
P(d|h1)P(h1) = 0; P(d|h2)P(h2) ≈ 1.9e-7; P(d|h3)P(h3) ≈ 4e-4; P(d|h4)P(h4) ≈ 0.011; P(d|h5)P(h5) = 0.1
P(h1|d) = 0; P(h2|d) ≈ 0.00; P(h3|d) ≈ 0.00; P(h4|d) ≈ 0.10; P(h5|d) ≈ 0.90
Sum = 1/α ≈ 0.1117
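The computation above can be reproduced in a few lines of Python (priors and likelihoods as on the slide; ten lime draws assumed):

```python
# Posterior over the five bag hypotheses after observing 10 limes.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]           # P(h1)..P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]         # P(lime | hi)

joint = [p**10 * w for p, w in zip(p_lime, priors)]   # P(d|hi) P(hi)
evidence = sum(joint)                        # 1/alpha, about 0.1117
posterior = [j / evidence for j in joint]    # P(hi | d)
```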
Posterior Hypotheses
![Page 11: Statistical Learning (From data to distributions)](https://reader034.fdocuments.net/reader034/viewer/2022051620/56813499550346895d9b8cd8/html5/thumbnails/11.jpg)
Predicting the Next Draw
• P(X|d) = Σi P(X|hi,d) P(hi|d)
= Σi P(X|hi) P(hi|d)
P(h1|d) = 0; P(h2|d) ≈ 0.00; P(h3|d) ≈ 0.00; P(h4|d) ≈ 0.10; P(h5|d) ≈ 0.90
[Diagram: Bayes net H → D, H → X]
P(X|h1) = 0; P(X|h2) = 0.25; P(X|h3) = 0.5; P(X|h4) = 0.75; P(X|h5) = 1
Probability that next candy drawn is a lime
P(X|d) = 0.975
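The prediction step can be sketched in Python, recomputing the posterior from the slide's priors rather than using the rounded values:

```python
# Predict the next draw: P(X=lime | d) = sum_i P(lime | hi) P(hi | d).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

joint = [p**10 * w for p, w in zip(p_lime, priors)]   # 10 limes observed
posterior = [j / sum(joint) for j in joint]           # P(hi | d)

p_next_lime = sum(p * w for p, w in zip(p_lime, posterior))
# about 0.973 with unrounded posteriors; the slide's 0.975 uses the rounded 0.10, 0.90
```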
![Page 12: Statistical Learning (From data to distributions)](https://reader034.fdocuments.net/reader034/viewer/2022051620/56813499550346895d9b8cd8/html5/thumbnails/12.jpg)
P(Next Candy is Lime | d)
![Page 13: Statistical Learning (From data to distributions)](https://reader034.fdocuments.net/reader034/viewer/2022051620/56813499550346895d9b8cd8/html5/thumbnails/13.jpg)
Other properties of Bayesian Estimation
• Any learning technique trades off between good fit and hypothesis complexity
• Prior can penalize complex hypotheses
– Many more complex hypotheses than simple ones
– Ockham’s razor
![Page 14: Statistical Learning (From data to distributions)](https://reader034.fdocuments.net/reader034/viewer/2022051620/56813499550346895d9b8cd8/html5/thumbnails/14.jpg)
Hypothesis Spaces often Intractable
• A hypothesis is a joint probability table over state variables
– 2^n entries => hypothesis space is [0,1]^(2^n)
– 2^(2^n) deterministic hypotheses
• 6 boolean variables => 2^64 ≈ 1.8×10^19 deterministic hypotheses
• Summing over hypotheses is expensive!
![Page 15: Statistical Learning (From data to distributions)](https://reader034.fdocuments.net/reader034/viewer/2022051620/56813499550346895d9b8cd8/html5/thumbnails/15.jpg)
Some Common Simplifications
• Maximum a posteriori estimation (MAP)
– hMAP = argmax_hi P(hi|d)
– P(X|d) ≈ P(X|hMAP)
• Maximum likelihood estimation (ML)
– hML = argmax_hi P(d|hi)
– P(X|d) ≈ P(X|hML)
• Both approximate the true Bayesian predictions as the # of data grows large
![Page 16: Statistical Learning (From data to distributions)](https://reader034.fdocuments.net/reader034/viewer/2022051620/56813499550346895d9b8cd8/html5/thumbnails/16.jpg)
Maximum a Posteriori
• hMAP = argmax_hi P(hi|d)
• P(X|d) ≈ P(X|hMAP)
[Plot: as data accumulate, hMAP shifts from h3 to h4 to h5; P(X|hMAP) tracks P(X|d)]
![Page 17: Statistical Learning (From data to distributions)](https://reader034.fdocuments.net/reader034/viewer/2022051620/56813499550346895d9b8cd8/html5/thumbnails/17.jpg)
Maximum a Posteriori
• For large amounts of data, P(incorrect hypothesis | d) → 0
• For small sample sizes, MAP predictions are “overconfident”
[Plot: P(X|hMAP) vs. P(X|d)]
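A small numeric illustration of this overconfidence, under the hypothetical assumption that only a single lime has been drawn so far:

```python
# With only one lime observed, compare the MAP prediction to the
# full Bayesian prediction (small-sample illustration).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

joint = [p * w for p, w in zip(p_lime, priors)]       # one lime observed
posterior = [j / sum(joint) for j in joint]

h_map = max(range(5), key=lambda i: posterior[i])     # argmax_i P(hi | d)
map_pred = p_lime[h_map]                              # P(lime | hMAP) = 0.5 (commits to h3)
bayes_pred = sum(p * w for p, w in zip(p_lime, posterior))  # 0.65 (hedges across hi)
```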
![Page 18: Statistical Learning (From data to distributions)](https://reader034.fdocuments.net/reader034/viewer/2022051620/56813499550346895d9b8cd8/html5/thumbnails/18.jpg)
Maximum Likelihood
• hML = argmax_hi P(d|hi)
• P(X|d) ≈ P(X|hML)
[Plot: hML is undefined at first, then jumps to h5; P(X|hML) vs. P(X|d)]
![Page 19: Statistical Learning (From data to distributions)](https://reader034.fdocuments.net/reader034/viewer/2022051620/56813499550346895d9b8cd8/html5/thumbnails/19.jpg)
Maximum Likelihood
• hML= hMAP with uniform prior
• Relevance of prior diminishes with more data
• Preferred by some statisticians
– Are priors “cheating”?
– What is a prior, anyway?
![Page 20: Statistical Learning (From data to distributions)](https://reader034.fdocuments.net/reader034/viewer/2022051620/56813499550346895d9b8cd8/html5/thumbnails/20.jpg)
Advantages of MAP and MLE over Bayesian estimation
• Involves an optimization rather than a large summation
– Local search techniques
• For some types of distributions, there are closed-form solutions that are easily computed
![Page 21: Statistical Learning (From data to distributions)](https://reader034.fdocuments.net/reader034/viewer/2022051620/56813499550346895d9b8cd8/html5/thumbnails/21.jpg)
Learning Coin Flips (Bernoulli distribution)
• Let the unknown fraction of cherries be θ
• Suppose draws are independent and identically distributed (i.i.d.)
• Observe that c out of N draws are cherries
Maximum Likelihood
• Likelihood of data d = {d1,…,dN} given θ
– P(d|θ) = Πj P(dj|θ) = θ^c (1−θ)^(N−c)
(i.i.d. assumption; gather the c cherries together, then the N−c limes)
Maximum Likelihood
• Same as maximizing log likelihood
• L(d|θ) = log P(d|θ) = c log θ + (N−c) log(1−θ)
• maxθ L(d|θ) => dL/dθ = 0
=> 0 = c/θ − (N−c)/(1−θ)
=> θ = c/N
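The closed form θ = c/N can be checked numerically with a simple grid search (c = 7, N = 10 are arbitrary example values):

```python
import math

# Check that theta = c/N maximizes the Bernoulli log likelihood
# L(theta) = c*log(theta) + (N - c)*log(1 - theta).
c, N = 7, 10

def log_lik(theta):
    return c * math.log(theta) + (N - c) * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]    # theta in (0, 1)
best = max(grid, key=log_lik)
assert abs(best - c / N) < 1e-3              # grid argmax agrees with c/N
```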
Maximum Likelihood for BN
• For any BN, the ML parameters of any CPT can be estimated as the fraction of the corresponding observed values in the data
[Diagram: Earthquake → Alarm ← Burglar]
Counts (N = 1000): E: 500, B: 200
P(E) = 0.5, P(B) = 0.2
A|E,B: 19/20; A|¬E,B: 188/200; A|E,¬B: 170/500; A|¬E,¬B: 1/380
E B P(A|E,B)
T T 0.95
F T 0.94
T F 0.34
F F 0.003
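This count-and-divide procedure is a few lines of Python; the records below are hypothetical complete-data tuples (e, b, a) chosen to reproduce two of the slide's fractions:

```python
from collections import Counter

# ML estimate of a CPT entry from complete data: count and divide.
# Toy records (e, b, a); hypothetical data.
data = [(True, True, True)] * 19 + [(True, True, False)] * 1 \
     + [(True, False, True)] * 170 + [(True, False, False)] * 330

counts = Counter(data)

def p_a_given(e, b):
    num = counts[(e, b, True)]
    den = counts[(e, b, True)] + counts[(e, b, False)]
    return num / den

print(p_a_given(True, True))    # 19/20 = 0.95
print(p_a_given(True, False))   # 170/500 = 0.34
```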
Maximum Likelihood for Gaussian Models
• Observe a continuous variable x1,…,xN
• Fit a Gaussian with mean μ, std σ
– Standard procedure: write the log likelihood
L = N(C − log σ) − Σj (xj − μ)² / (2σ²)
– Set derivatives to zero
• Observe a continuous variable x1,…,xN
• Results:
μ = 1/N Σj xj (sample mean)
σ² = 1/N Σj (xj − μ)² (sample variance)
Maximum Likelihood for Gaussian Models
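The two formulas amount to a few lines of code (toy observations assumed; note the 1/N normalizer, not the unbiased 1/(N−1)):

```python
import math

# ML fit of a Gaussian: sample mean and (biased, 1/N) sample variance.
xs = [2.1, 1.9, 2.4, 1.6, 2.0]               # hypothetical observations

N = len(xs)
mu = sum(xs) / N
var = sum((x - mu) ** 2 for x in xs) / N      # 1/N, not 1/(N-1)
sigma = math.sqrt(var)
```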
• Y is a child of X
• Data (xj,yj)
• X is Gaussian, Y is a linear Gaussian function of X
– Y(x) ~ N(ax + b, σ)
• ML estimate of a, b is given by least-squares regression (minimizing the sum of squared errors)
Maximum Likelihood for Conditional Linear Gaussians
[Diagram: X → Y]
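The closed-form least-squares estimate of a and b (from the normal equations) can be sketched as follows, on toy data:

```python
# ML estimate of a, b in Y ~ N(a*x + b, sigma) via least squares
# (closed form from the normal equations; toy data assumed).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                    # exactly y = 2x + 1

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
# slope = sample covariance / sample variance of x; intercept from the means
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
```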
Back to Coin Flips
• What about Bayesian or MAP learning?
• Motivation
– I pick a coin out of my pocket
– 1 flip turns up heads
– What’s the MLE?
Back to Coin Flips
• Need some prior distribution P()
• P(θ|d) = α P(d|θ) P(θ) = α θ^c (1−θ)^(N−c) P(θ)
• Define, for all θ in [0,1], the probability P(θ) that I believe in θ
[Plot: a prior density P(θ) over θ from 0 to 1]
MAP estimate
• Could maximize θ^c (1−θ)^(N−c) P(θ) using some optimization
• Turns out that for some families of P(θ), the MAP estimate is easy to compute
[Plot: Beta distributions over θ from 0 to 1 (conjugate prior)]
Beta Distribution
• Beta_a,b(θ) = α θ^(a−1) (1−θ)^(b−1)
– a, b: hyperparameters
– α is a normalization constant
– Mean at a/(a+b)
Posterior with Beta Prior
• Posterior ∝ θ^c (1−θ)^(N−c) P(θ) = α θ^(c+a−1) (1−θ)^(N−c+b−1)
• MAP estimate (posterior mode): θ = (c+a−1)/(N+a+b−2); the posterior mean is (c+a)/(N+a+b)
• Posterior is also a Beta distribution!
– See heads, increment a
– See tails, increment b
– Prior specifies a “virtual count” of a heads and b tails
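The virtual-count view makes the update trivial to code; the prior Beta(2, 2) and the flip sequence below are hypothetical:

```python
# Beta-Bernoulli updating: the prior Beta(a, b) acts as a virtual
# count of a heads and b tails; each observation increments one count.
a, b = 2, 2                                   # hypothetical prior pseudo-counts

for flip in ["H", "H", "T", "H"]:             # observed data: c=3 heads, N=4 flips
    if flip == "H":
        a += 1
    else:
        b += 1

posterior_mean = a / (a + b)                  # (c + a0) / (N + a0 + b0)
map_estimate = (a - 1) / (a + b - 2)          # mode of Beta(a, b)
```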
Does this work in general?
• Only specific distributions have the right type of prior
– Bernoulli, Poisson, geometric, Gaussian, exponential, …
• Otherwise, MAP needs an (often expensive) numerical optimization
How to deal with missing observations?
• Very difficult statistical problem in general
• E.g., surveys
– Did the person leave political affiliation blank at random?
– Or do independents do this more often than someone with a strong affiliation?
• Better if a variable is completely hidden
Expectation Maximization for Gaussian Mixture models
• Each data point has a label saying which Gaussian it belongs to, but the label is a hidden variable
• Clustering: N Gaussian distributions
• E step: compute the probability that each datapoint belongs to each Gaussian
• M step: compute ML estimates of each Gaussian, weighted by the probability that each sample belongs to it
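The E/M steps above can be sketched for a two-component 1-D mixture (synthetic data, a minimal sketch rather than production code):

```python
import math
import random

# Minimal EM for a two-component 1-D Gaussian mixture:
# alternate soft assignments (E) and weighted ML fits (M).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(5.0, 1.0) for _ in range(200)]

def pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mus, sigmas, weights = [min(data), max(data)], [1.0, 1.0], [0.5, 0.5]
for _ in range(30):
    # E step: responsibility of each component for each point
    resp = []
    for x in data:
        p = [w * pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
        z = sum(p)
        resp.append([pi / z for pi in p])
    # M step: weighted ML estimates of each Gaussian
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        sigmas[k] = math.sqrt(sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk)
        weights[k] = nk / len(data)
```

With the means initialized at the data extremes, the fit converges to means near 0 and 5 with roughly equal mixture weights.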
Learning HMMs
• Want to find the transition and observation probabilities
• Data: many sequences {O_1:t^(j), 1 ≤ j ≤ N}
• Problem: we don’t observe the X’s!
[Diagram: X0 → X1 → X2 → X3, with observations O1, O2, O3]
Learning HMMs
[Diagram: X0 → X1 → X2 → X3, with observations O1, O2, O3]
• Assume a stationary Markov chain with discrete states x1,…,xm
• Transition parameters: θij = P(Xt+1 = xj | Xt = xi)
• Observation parameters: φi = P(O | Xt = xi)
• Assume a stationary Markov chain with discrete states x1,…,xm
• Transition parameters: θij = P(Xt+1 = xj | Xt = xi)
• Observation parameters: φi = P(O | Xt = xi)
• Initial state parameters: πi = P(X0 = xi)
Learning HMMs
[Diagram: 3-state HMM with states x1, x2, x3, an observation O, and transition parameters such as θ13, θ31, θ32]
Expectation Maximization
• Initialize parameters randomly
• E-step: infer expected probabilities of hidden variables over time, given current parameters
• M-step: maximize likelihood of data over parameters
[Diagram: 3-state HMM; parameter vector θ = (πi, θij, φi): P(initial state), P(transition i→j), P(emission)]
Expectation Maximization
[Diagram: 3-state HMM with observation O]
• Initialize θ^(0)
• E: compute E[P(Z = z | θ^(0), O)], where Z ranges over all combinations of hidden state sequences (e.g., x1 x2 x3 x2 x2 x1; x1 x2 x2 x1 x3 x2; …)
Result: a probability distribution over the hidden state at each time t
• M: compute θ^(1) = ML estimate of the transition / observation distributions
Expectation Maximization
• The E-step, computing the expectations over hidden state sequences, is the hard part…
E-Step on HMMs
• Computing expectations can be done by:
– Sampling
– Using the forward/backward algorithm on the unrolled HMM (R&N p. 546)
• The latter gives the classic Baum-Welch algorithm
• Note that EM can still get stuck in local optima or even saddle points
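The forward/backward computation behind the E-step can be sketched for a tiny discrete HMM; all parameters and observations below are hypothetical:

```python
# Forward-backward smoothing for a 2-state HMM: the E-step
# quantity P(X_t | O_{1:T}) (hypothetical parameters, binary observations).
T_mat = [[0.7, 0.3], [0.3, 0.7]]             # transition P(X_{t+1}=j | X_t=i)
E_mat = [[0.9, 0.1], [0.2, 0.8]]             # emission P(O_t=o | X_t=i)
pi = [0.5, 0.5]                              # initial distribution
obs = [0, 0, 1, 1, 1]                        # observation sequence

n, T = 2, len(obs)
# forward pass: alpha_t(i) = P(O_{1:t}, X_t=i)
alpha = [[pi[i] * E_mat[i][obs[0]] for i in range(n)]]
for t in range(1, T):
    alpha.append([E_mat[j][obs[t]] * sum(alpha[-1][i] * T_mat[i][j] for i in range(n))
                  for j in range(n)])
# backward pass: beta_t(i) = P(O_{t+1:T} | X_t=i)
beta = [[1.0, 1.0]]
for t in range(T - 2, -1, -1):
    beta.insert(0, [sum(T_mat[i][j] * E_mat[j][obs[t + 1]] * beta[0][j] for j in range(n))
                    for i in range(n)])
# smoothed posteriors gamma_t(i) = P(X_t=i | O_{1:T})
gamma = []
for a, b in zip(alpha, beta):
    z = sum(x * y for x, y in zip(a, b))
    gamma.append([x * y / z for x, y in zip(a, b)])
```

Each row of `gamma` sums to 1 and gives the per-time-step state distribution the M-step then uses; Baum-Welch adds the analogous pairwise quantities for re-estimating the transition matrix.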
Next Time
• Machine learning