Statistical Learning (From Data to Distributions)
Reminders
• HW5 deadline extended to Friday
Agenda
• Learning a probability distribution from data
• Maximum likelihood estimation (MLE)
• Maximum a posteriori (MAP) estimation
• Expectation Maximization (EM)
Motivation
• Agent has made observations (data)
• Now must make sense of it (hypotheses)
– Hypotheses alone may be important (e.g., in basic science)
– For inference (e.g., forecasting)
– To take sensible actions (decision making)
• A basic component of economics, social and hard sciences, engineering, …
Candy Example
• Candy comes in 2 flavors, cherry and lime, with identical wrappers
• Manufacturer makes 5 (indistinguishable) bags
• Suppose we draw some candies from the bag
• What bag are we holding? What flavor will we draw next?
H1: C 100%, L 0%
H2: C 75%, L 25%
H3: C 50%, L 50%
H4: C 25%, L 75%
H5: C 0%, L 100%
Machine Learning vs. Statistics
• Machine Learning ≈ automated statistics
• This lecture
– Bayesian learning, the more “traditional” statistics (R&N 20.1-3)
– Learning Bayes Nets
Bayesian Learning
• Main idea: Consider the probability of each hypothesis, given the data
• Data d: the sequence of candies drawn so far
• Hypotheses h1,…,h5 as before; compute P(hi|d)
Using Bayes’ Rule
• P(hi|d) = α P(d|hi) P(hi) is the posterior
– (Recall, 1/α = Σi P(d|hi) P(hi))
• P(d|hi) is the likelihood
• P(hi) is the hypothesis prior
Computing the Posterior
• Assume draws are independent
• Let P(h1),…,P(h5) = (0.1,0.2,0.4,0.2,0.1)
• d = {10 limes drawn in a row}

Likelihoods:
P(d|h1) = 0^10 = 0
P(d|h2) = 0.25^10
P(d|h3) = 0.5^10
P(d|h4) = 0.75^10
P(d|h5) = 1^10 = 1

Unnormalized posteriors:
P(d|h1)P(h1) = 0; P(d|h2)P(h2) ≈ 2e-7; P(d|h3)P(h3) ≈ 4e-4; P(d|h4)P(h4) ≈ 0.011; P(d|h5)P(h5) = 0.1
Sum = 1/α ≈ 0.112

Posteriors:
P(h1|d) = 0; P(h2|d) ≈ 0.00; P(h3|d) ≈ 0.00; P(h4|d) ≈ 0.10; P(h5|d) ≈ 0.90
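A minimal sketch of this posterior computation, using the priors and per-bag lime probabilities from the slide (pure Python, no libraries):

```python
# Posterior over the five candy-bag hypotheses after observing 10 limes.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1)..P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | hi)
N = 10                                  # d = {10 limes}

likelihoods = [p ** N for p in p_lime]                  # P(d|hi), i.i.d. draws
joint = [l * pr for l, pr in zip(likelihoods, priors)]  # P(d|hi) P(hi)
alpha = 1.0 / sum(joint)                                # normalizer
posterior = [alpha * j for j in joint]                  # P(hi|d)
```

Note that h5 dominates (≈0.90) even though its prior was only 0.1: ten consecutive limes are overwhelming evidence.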
Posterior Hypotheses
Predicting the Next Draw
• P(X|d) = Σi P(X|hi,d) P(hi|d) = Σi P(X|hi) P(hi|d)
[Graphical model: H is a parent of both D and X, so X is independent of D given H]
P(hi|d): 0, 0.00, 0.00, 0.10, 0.90
P(X|hi): 0, 0.25, 0.50, 0.75, 1
• Probability that the next candy drawn is a lime:
P(X|d) ≈ 0.975
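The predictive probability above can be sketched end to end (self-contained, so the posterior is recomputed from the slide's numbers):

```python
# P(next candy is lime | d) = sum_i P(lime | hi) P(hi | d), after 10 lime draws.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

joint = [p ** 10 * pr for p, pr in zip(p_lime, priors)]    # P(d|hi) P(hi)
posterior = [j / sum(joint) for j in joint]                # P(hi|d)
p_next_lime = sum(p * q for p, q in zip(p_lime, posterior))
```

The exact value is ≈0.973; the slide's 0.975 comes from using the rounded posteriors 0.10 and 0.90.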
Other properties of Bayesian Estimation
• Any learning technique trades off between good fit and hypothesis complexity
• Prior can penalize complex hypotheses
– Many more complex hypotheses than simple ones
– Ockham’s razor
Hypothesis Spaces often Intractable
• A hypothesis is a joint probability table over state variables
– 2^n entries => hypothesis space is [0,1]^(2^n)
– 2^(2^n) deterministic hypotheses
• 6 boolean variables => 2^(2^6) = 2^64, i.e., over 10^19 deterministic hypotheses
• Summing over hypotheses is expensive!
Some Common Simplifications
• Maximum a posteriori estimation (MAP)
– hMAP = argmax_hi P(hi|d)
– P(X|d) ≈ P(X|hMAP)
• Maximum likelihood estimation (ML)
– hML = argmax_hi P(d|hi)
– P(X|d) ≈ P(X|hML)
• Both approximate the true Bayesian predictions as the # of data grows large
Maximum a Posteriori
• hMAP = argmax_hi P(hi|d)
• P(X|d) ≈ P(X|hMAP)
[Plot: as data accumulates, hMAP shifts h3 → h4 → h5; P(X|hMAP) tracks the full Bayesian P(X|d)]
Maximum a Posteriori
• For large amounts of data, P(incorrect hypothesis|d) → 0
• For small sample sizes, MAP predictions are “overconfident”
[Plot: P(X|hMAP) vs. P(X|d)]
Maximum Likelihood
• hML = argmax_hi P(d|hi)
• P(X|d) ≈ P(X|hML)
[Plot: with no data hML is undefined; after the lime draws, hML = h5; P(X|hML) vs. P(X|d)]
Maximum Likelihood
• hML = hMAP with a uniform prior
• Relevance of prior diminishes with more data
• Preferred by some statisticians
– Are priors “cheating”?
– What is a prior, anyway?
Advantages of MAP and MLE over Bayesian estimation
• Involves an optimization rather than a large summation
– Local search techniques apply
• For some types of distributions, there are closed-form solutions that are easily computed
Learning Coin Flips (Bernoulli distribution)
• Let the unknown fraction of cherries be θ
• Suppose draws are independent and identically distributed (i.i.d.)
• Observe that c out of N draws are cherries
Maximum Likelihood
• Likelihood of data d = {d1,…,dN} given θ
– P(d|θ) = Πj P(dj|θ) = θ^c (1-θ)^(N-c)
(the product form is the i.i.d. assumption; gather the c cherry terms together, then the N-c lime terms)
Maximum Likelihood
• Same as maximizing the log likelihood
• L(d|θ) = log P(d|θ) = c log θ + (N-c) log(1-θ)
• maxθ L(d|θ) => dL/dθ = 0
=> 0 = c/θ – (N-c)/(1-θ)
=> θ = c/N
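A quick sanity check of the closed-form result θ = c/N against a brute-force grid search over the log likelihood (the counts c, N here are made up for illustration):

```python
import math

# Verify the Bernoulli MLE theta = c/N by maximizing
# L(theta) = c log(theta) + (N-c) log(1-theta) over a fine grid.
c, N = 7, 10   # hypothetical data: 7 cherries in 10 draws

def log_lik(theta):
    return c * math.log(theta) + (N - c) * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]   # avoid theta = 0, 1
theta_grid = max(grid, key=log_lik)         # grid maximizer
theta_closed = c / N                        # closed-form MLE
```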
Maximum Likelihood for BN
• For any BN, the ML parameters of any CPT are given by the corresponding observed frequencies in the data
[BN: Earthquake → Alarm ← Burglar]
N = 1000 samples; E: 500, B: 200
P(E) = 0.5, P(B) = 0.2
Counts for A: A|E,B: 19/20; A|¬E,B: 188/200; A|E,¬B: 170/500; A|¬E,¬B: 1/380

E B P(A|E,B)
T T 0.95
F T 0.94
T F 0.34
F F 0.003
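The counting procedure behind the table can be sketched directly: for each parent setting (e, b), the CPT entry is the count of samples with A true divided by the count of samples with that parent setting. The tiny dataset below is made up and merely stands in for the slide's 1000 samples:

```python
from collections import Counter

# ML estimation of P(A | E, B) by counting (toy (E, B, A) triples).
data = [
    (True, True, True), (True, True, True), (True, True, False),
    (True, False, True), (True, False, False), (True, False, False),
    (False, True, True), (False, True, True), (False, True, False),
    (False, False, False), (False, False, False), (False, False, True),
]

pair_counts = Counter((e, b) for e, b, _ in data)          # #(e, b)
a_counts = Counter((e, b) for e, b, a in data if a)        # #(A, e, b)
cpt = {eb: a_counts[eb] / n for eb, n in pair_counts.items()}
```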
Maximum Likelihood for Gaussian Models
• Observe a continuous variable x1,…,xN
• Fit a Gaussian with mean μ, std σ
– Standard procedure: write the log likelihood
L = N(C – log σ) – Σj (xj-μ)²/(2σ²), with C a constant
– Set derivatives to zero
• Results:
μ = (1/N) Σj xj (sample mean)
σ² = (1/N) Σj (xj-μ)² (sample variance)
Maximum Likelihood for Conditional Linear Gaussians
• Y is a child of X
• Data (xj,yj)
• X is Gaussian; Y is a linear Gaussian function of X
– Y(x) ~ N(ax+b, σ)
• ML estimates of a, b are given by least-squares regression; σ by the standard error of the residuals
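A sketch of the least-squares fit for Y ~ N(ax+b, σ): a and b come from ordinary least squares, and σ from the ML (1/n) residual variance. All the data points here are made up:

```python
import math

# Fit the conditional linear Gaussian Y ~ N(a*x + b, sigma) by least squares.
pts = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1)]
n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
a = (sum((x - mx) * (y - my) for x, y in pts)
     / sum((x - mx) ** 2 for x, _ in pts))       # OLS slope
b = my - a * mx                                  # OLS intercept
sigma = math.sqrt(sum((y - (a * x + b)) ** 2
                      for x, y in pts) / n)      # ML residual std (1/n)
```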
Back to Coin Flips
• What about Bayesian or MAP learning?
• Motivation
– I pick a coin out of my pocket
– 1 flip turns up heads
– What’s the MLE? (θ = c/N = 1/1: the coin always lands heads!)
Back to Coin Flips
• Need some prior distribution P(θ)
• P(θ|d) ∝ P(d|θ) P(θ) = θ^c (1-θ)^(N-c) P(θ)
• Define, for each θ, the probability that I believe in θ
[Plot: a prior density P(θ) over θ ∈ [0,1]]
MAP estimate
• Could maximize θ^c (1-θ)^(N-c) P(θ) using some optimization
• Turns out that for some families of P(θ), the MAP estimate is easy to compute
Beta distributions
(Conjugate prior)
Beta Distribution
• Beta_{a,b}(θ) = α θ^(a-1) (1-θ)^(b-1)
– a, b hyperparameters
– α is a normalization constant
– Mean at a/(a+b)
Posterior with Beta Prior
• Posterior ∝ θ^c (1-θ)^(N-c) P(θ) = α θ^(c+a-1) (1-θ)^(N-c+b-1)
• MAP estimate: θ = (c+a-1)/(N+a+b-2); the posterior mean is (c+a)/(N+a+b)
• The posterior is also a Beta distribution!
– See heads, increment a; see tails, increment b
– The prior specifies a “virtual count” of a heads, b tails
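A sketch of the conjugate update (the prior hyperparameters and counts here are made up; the mode formula assumes a, b > 1):

```python
# Beta-Bernoulli conjugate update: Beta(a, b) prior + c heads in N flips
# gives a Beta(a + c, b + N - c) posterior.
a, b = 2, 2          # prior "virtual counts": 2 heads, 2 tails
c, N = 7, 10         # observed: 7 heads in 10 flips

a_post, b_post = a + c, b + (N - c)
map_est = (a_post - 1) / (a_post + b_post - 2)   # posterior mode (MAP)
mean_est = a_post / (a_post + b_post)            # posterior mean
```

With more data, both estimates approach the MLE c/N; the virtual counts matter less and less.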
Does this work in general?
• Only specific distributions have a conjugate prior of the right form
– Bernoulli, Poisson, geometric, Gaussian, exponential, …
• Otherwise, MAP needs an (often expensive) numerical optimization
How to deal with missing observations?
• Very difficult statistical problem in general
• E.g., surveys
– Did the person leave political affiliation blank at random?
– Or do independents do this more often than someone with a strong affiliation?
• Easier if a variable is completely hidden (never observed)
Expectation Maximization for Gaussian Mixture models
• Data have labels indicating which Gaussian they belong to, but the label is a hidden variable
• Clustering with k Gaussian distributions:
– E step: compute the probability that each datapoint belongs to each Gaussian
– M step: compute the ML estimate of each Gaussian, weighting each sample by the probability that it belongs to that Gaussian
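The E/M steps above can be sketched for a two-component 1-D mixture. This is a minimal illustration, not a robust implementation: mixing weights are fixed and equal, and the data and initialization are made up:

```python
import math
import random

# Minimal EM for a two-component 1-D Gaussian mixture.
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(5.0, 1.0) for _ in range(200)])

def pdf(x, mu, sigma):
    # Gaussian density N(x; mu, sigma)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mus, sigmas = [min(data), max(data)], [1.0, 1.0]
for _ in range(30):
    # E-step: responsibility of each Gaussian for each datapoint
    resp = []
    for x in data:
        ps = [pdf(x, m, s) for m, s in zip(mus, sigmas)]
        z = sum(ps)
        resp.append([p / z for p in ps])
    # M-step: weighted ML estimates (weighted mean / variance) per Gaussian
    for k in range(2):
        w = sum(r[k] for r in resp)
        mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / w
        var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / w
        sigmas[k] = max(math.sqrt(var), 1e-6)
```

Initializing the means at the data extremes is a cheap way to break symmetry; EM then pulls them toward the true cluster centers (here, roughly 0 and 5).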
Learning HMMs
• Want to find the transition and observation probabilities
• Data: many observation sequences {O1:t(j) for 1 ≤ j ≤ N}
• Problem: we don’t observe the X’s!
[HMM: X0 → X1 → X2 → X3, with emissions Xt → Ot]
Learning HMMs
• Assume a stationary Markov chain with discrete states x1,…,xm
• Transition parameters: θij = P(Xt+1=xj | Xt=xi)
• Observation parameters: φi = P(O | Xt=xi)
• Initial state parameters: πi = P(X0=xi)
Expectation Maximization
• Initialize parameters λ(0) randomly
• E-step: infer expected probabilities of the hidden variables over time, given the current parameters
• M-step: maximize the likelihood of the data over the parameters
λ = (π, θ, φ): P(initial state), P(transition i→j), P(emission)
Expectation Maximization
• Initialize λ(0)
• E: Compute E[P(Z=z | λ(0), O)]
– Z ranges over all combinations of hidden state sequences (e.g., x1 x2 x3 x2 x2 x1; x1 x2 x2 x1 x3 x2; …)
– Result: a probability distribution over the hidden state at each time t
• M: Compute λ(1) = ML estimate of the transition / observation distributions
Expectation Maximization
• The E-step, an expectation over all combinations of hidden sequences, is the hard part…
E-Step on HMMs
• Computing the expectations can be done by:
– Sampling
– Using the forward/backward algorithm on the unrolled HMM (R&N pp. 546)
• The latter gives the classic Baum-Welch algorithm
• Note that EM can still get stuck in local optima or even saddle points
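The forward/backward pass at the core of the Baum-Welch E-step can be sketched on a tiny two-state HMM. All numbers here are illustrative, and per-step normalization stands in for proper scaling:

```python
# Forward/backward smoothing on a 2-state HMM with binary observations.
T = [[0.7, 0.3], [0.3, 0.7]]   # transition P(X_{t+1}=j | X_t=i)
E = [[0.9, 0.1], [0.2, 0.8]]   # emission P(O=o | X=i), o in {0, 1}
pi = [0.5, 0.5]                # initial state distribution
obs = [0, 0, 1, 1, 1]

# Forward pass: alpha[t][i] proportional to P(o_1..o_t, X_t=i)
alpha = []
for t, o in enumerate(obs):
    if t == 0:
        a = [pi[i] * E[i][o] for i in range(2)]
    else:
        a = [sum(alpha[-1][i] * T[i][j] for i in range(2)) * E[j][o]
             for j in range(2)]
    z = sum(a)
    alpha.append([x / z for x in a])

# Backward pass: beta[t][i] proportional to P(o_{t+1}..o_T | X_t=i)
beta = [[1.0, 1.0] for _ in obs]
for t in range(len(obs) - 2, -1, -1):
    b = [sum(T[i][j] * E[j][obs[t + 1]] * beta[t + 1][j] for j in range(2))
         for i in range(2)]
    z = sum(b)
    beta[t] = [x / z for x in b]

# Smoothed posteriors gamma[t][i] = P(X_t=i | o_1..o_T): the E-step output
gamma = []
for a, b in zip(alpha, beta):
    g = [x * y for x, y in zip(a, b)]
    z = sum(g)
    gamma.append([x / z for x in g])
```

The M-step would then re-estimate θ, φ, and π from these smoothed posteriors (plus pairwise terms for transitions), which is exactly the Baum-Welch update.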
Next Time
• Machine learning