INTRODUCTION TO MACHINE LEARNING, 3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e

Lecture Slides for
CHAPTER 4: PARAMETRIC METHODS
Parametric Estimation

X = {x^t}_t where x^t ~ p(x)
Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = {μ, σ²}
Maximum Likelihood Estimation

Likelihood of θ given the sample X:
l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)

Log likelihood:
L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)

Maximum likelihood estimator (MLE):
θ* = argmax_θ L(θ|X)
Examples: Bernoulli/Multinomial

Bernoulli: two states, failure/success, x ∈ {0, 1}
P(x) = p_o^x (1 − p_o)^(1−x)
L(p_o|X) = log ∏_t p_o^(x^t) (1 − p_o)^(1−x^t)
MLE: p_o = ∑_t x^t / N

Multinomial: K > 2 states, x_i ∈ {0, 1}
P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i)
L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)
MLE: p_i = ∑_t x_i^t / N
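Both closed-form MLEs are one-line computations; a minimal NumPy sketch (the sample values are invented for the example):

    import numpy as np

    # Bernoulli MLE: p_o is the fraction of successes in the sample
    x = np.array([1, 0, 1, 1, 0, 1])           # hypothetical 0/1 sample
    p_o = x.sum() / len(x)                     # p_o = sum_t x^t / N

    # Multinomial MLE: p_i is the fraction of draws landing in state i;
    # each row of X is a one-hot indicator vector x^t
    X = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1]])
    p = X.sum(axis=0) / X.shape[0]             # p_i = sum_t x_i^t / N
    print(p_o, p)                              # 0.666..., [0.5 0.25 0.25]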
Gaussian (Normal) Distribution

p(x) = N(μ, σ²):
p(x) = 1/(√(2π) σ) · exp[−(x − μ)² / (2σ²)]

MLE for μ and σ²:
m = ∑_t x^t / N
s² = ∑_t (x^t − m)² / N
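The same estimates in NumPy (sample drawn from an invented distribution); note the MLE of the variance divides by N, not N − 1:

    import numpy as np

    x = np.random.default_rng(0).normal(5.0, 2.0, size=1000)  # hypothetical sample
    m = x.sum() / len(x)                    # m = sum_t x^t / N
    s2 = ((x - m) ** 2).sum() / len(x)      # s^2 = sum_t (x^t - m)^2 / N
    print(m, s2)                            # close to 5.0 and 4.0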
Bias and Variance

Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i
Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]

Mean square error:
r(d, θ) = E[(d − θ)²]
        = (E[d] − θ)² + E[(d − E[d])²]
        = Bias² + Variance
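To make the decomposition concrete, a small simulation (not from the slides) estimating bias, variance, and MSE of the ML variance estimator s², which is biased by the factor (N − 1)/N:

    import numpy as np

    rng = np.random.default_rng(0)
    N, sigma2 = 10, 4.0                        # true variance sigma^2 = 4
    # np.var divides by N, i.e. it is the ML estimator s^2
    d = np.array([np.var(rng.normal(0.0, 2.0, N)) for _ in range(100_000)])

    bias = d.mean() - sigma2                   # E[d] - theta, about -sigma2/N = -0.4
    variance = ((d - d.mean()) ** 2).mean()    # E[(d - E[d])^2]
    mse = ((d - sigma2) ** 2).mean()           # equals bias^2 + variance
    print(bias, variance, mse)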
Bayes’ Estimator

Treat θ as a random variable with prior p(θ)
Bayes’ rule: p(θ|X) = p(X|θ) p(θ) / p(X)
Full (Bayes’) estimate: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
Maximum a Posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
Maximum Likelihood (ML): θ_ML = argmax_θ p(X|θ)
Bayes’ estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
Bayes’ Estimator: Example

x^t ~ N(θ, σ_o²) and θ ~ N(μ, σ²)
θ_ML = m (the sample average)
θ_MAP = θ_Bayes = E[θ|X]
      = (N/σ_o²) / (N/σ_o² + 1/σ²) · m + (1/σ²) / (N/σ_o² + 1/σ²) · μ

i.e., a weighted average of the sample mean m and the prior mean μ, weighted by their inverse variances.
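A numeric sketch of this posterior mean (all values invented), showing the Bayes’ estimate shrinking the sample mean toward the prior mean:

    import numpy as np

    x = np.array([3.1, 2.7, 3.4, 2.9, 3.2])  # hypothetical sample, x^t ~ N(theta, sigma_o^2)
    sigma_o2, mu, sigma2 = 0.25, 0.0, 1.0    # known noise variance; prior N(mu, sigma^2)
    N, m = len(x), x.mean()

    w = (N / sigma_o2) / (N / sigma_o2 + 1 / sigma2)  # weight on the data
    theta_bayes = w * m + (1 - w) * mu                # posterior mean E[theta|X]
    print(m, theta_bayes)                             # estimate shrunk from m toward mu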
Parametric Classification

g_i(x) = p(x|C_i) P(C_i)
or
g_i(x) = log p(x|C_i) + log P(C_i)

With Gaussian class densities:
p(x|C_i) = 1/(√(2π) σ_i) · exp[−(x − μ_i)² / (2σ_i²)]
g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)
Given the sample X = {x^t, r^t}_{t=1}^N where

r_i^t = 1 if x^t ∈ C_i
r_i^t = 0 if x^t ∈ C_j, j ≠ i

ML estimates are

P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t x^t r_i^t / ∑_t r_i^t
s_i² = ∑_t (x^t − m_i)² r_i^t / ∑_t r_i^t

Discriminant:

g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2s_i²) + log P̂(C_i)
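A compact NumPy sketch of these plug-in estimates and the resulting discriminant (the 1-D data and labels are invented):

    import numpy as np

    x = np.array([1.0, 1.2, 0.8, 3.9, 4.2, 4.0])                    # x^t
    r = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])  # r[t, i]
    N, K = r.shape

    P = r.sum(axis=0) / N                                           # P^(C_i)
    m = (x[:, None] * r).sum(axis=0) / r.sum(axis=0)                # m_i
    s2 = (((x[:, None] - m) ** 2) * r).sum(axis=0) / r.sum(axis=0)  # s_i^2

    def g(x_new):
        # g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i), constants dropped
        return -0.5 * np.log(s2) - (x_new - m) ** 2 / (2 * s2) + np.log(P)

    print(np.argmax(g(1.1)), np.argmax(g(3.5)))   # predicted class indices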
Equal variances: a single boundary halfway between the means

Variances are different: two boundaries
Regression

r = f(x) + ε, where ε ~ N(0, σ²)
p(r|x) ~ N(g(x|θ), σ²)
Estimator: g(x|θ)

L(θ|X) = log ∏_{t=1}^N p(x^t, r^t)
       = log ∏_t p(r^t|x^t) + log ∏_t p(x^t)
Regression: From LogL to Error

L(θ|X) = log ∏_{t=1}^N 1/(√(2π) σ) · exp[−(r^t − g(x^t|θ))² / (2σ²)]
       = −N log(√(2π) σ) − 1/(2σ²) · ∑_t [r^t − g(x^t|θ)]²

Since the first term and the factor 1/σ² do not depend on g, maximizing L is equivalent to minimizing the sum of squared errors:

E(θ|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|θ)]²
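A small numeric check (data invented) that the log likelihood and the squared error rank candidate parameters identically, with opposite sign:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 50)
    r = 2.0 * x + 0.5 + rng.normal(0, 0.1, 50)   # hypothetical r = f(x) + noise

    def logL(w1, w0, sigma=0.1):
        res = r - (w1 * x + w0)
        return (-len(x) * np.log(np.sqrt(2 * np.pi) * sigma)
                - (res ** 2).sum() / (2 * sigma ** 2))

    def E(w1, w0):
        return 0.5 * ((r - (w1 * x + w0)) ** 2).sum()

    # the better parameter pair has the higher logL and the lower E
    print(logL(2.0, 0.5), E(2.0, 0.5))
    print(logL(1.0, 0.0), E(1.0, 0.0))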
Linear Regression

g(x^t|w_1, w_0) = w_1 x^t + w_0

Setting the derivatives of E(w_1, w_0|X) to zero gives the normal equations:
∑_t r^t = N w_0 + w_1 ∑_t x^t
∑_t r^t x^t = w_0 ∑_t x^t + w_1 ∑_t (x^t)²

In matrix form, A w = y:

A = [ N         ∑_t x^t    ]
    [ ∑_t x^t   ∑_t (x^t)² ]

w = [w_0, w_1]^T,  y = [∑_t r^t, ∑_t r^t x^t]^T

w = A^{-1} y
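Solving the normal equations in NumPy (data invented); np.linalg.solve is used rather than an explicit inverse:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 5, 30)
    r = 1.5 * x + 0.7 + rng.normal(0, 0.3, 30)   # hypothetical data

    A = np.array([[len(x), x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    y = np.array([r.sum(), (r * x).sum()])
    w0, w1 = np.linalg.solve(A, y)               # w = A^{-1} y
    print(w0, w1)                                # close to 0.7 and 1.5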
Polynomial Regression

g(x^t|w_k, ..., w_2, w_1, w_0) = w_k (x^t)^k + ... + w_2 (x^t)² + w_1 x^t + w_0

D = [ 1   x^1   (x^1)²  ...  (x^1)^k ]
    [ 1   x^2   (x^2)²  ...  (x^2)^k ]
    [ ...                            ]
    [ 1   x^N   (x^N)²  ...  (x^N)^k ]

r = [r^1, r^2, ..., r^N]^T

w = (D^T D)^{-1} D^T r
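The same least-squares solution in NumPy (data invented); np.vander builds the matrix D:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(-1, 1, 40)
    r = 1 - 2 * x + 0.5 * x ** 2 + rng.normal(0, 0.1, 40)  # hypothetical data

    k = 2
    D = np.vander(x, k + 1, increasing=True)   # columns 1, x, x^2, ..., x^k
    w = np.linalg.solve(D.T @ D, D.T @ r)      # w = (D^T D)^{-1} D^T r
    print(w)                                   # close to [1, -2, 0.5]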
Other Error Measures

Square error:
E(θ|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|θ)]²

Relative square error:
E(θ|X) = ∑_{t=1}^N [r^t − g(x^t|θ)]² / ∑_{t=1}^N (r^t − r̄)²

Absolute error:
E(θ|X) = ∑_t |r^t − g(x^t|θ)|

ε-sensitive error:
E(θ|X) = ∑_t 1(|r^t − g(x^t|θ)| > ε) · (|r^t − g(x^t|θ)| − ε)
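The four measures as short NumPy functions, taking arrays of targets r and predictions g:

    import numpy as np

    def square_error(r, g):
        return 0.5 * ((r - g) ** 2).sum()

    def relative_square_error(r, g):
        return ((r - g) ** 2).sum() / ((r - r.mean()) ** 2).sum()

    def absolute_error(r, g):
        return np.abs(r - g).sum()

    def eps_sensitive_error(r, g, eps):
        # 1(|r - g| > eps) * (|r - g| - eps) == max(|r - g| - eps, 0)
        return np.maximum(np.abs(r - g) - eps, 0.0).sum()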
Bias and Variance

Expected squared error at x:
E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
                   = noise + squared error

Averaging the squared error over samples X:
E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
                          = bias² + variance
Estimating Bias and Variance

M samples X_i = {x^t_i, r^t_i}, i = 1, ..., M,
are used to fit g_i(x), i = 1, ..., M

ḡ(x) = (1/M) ∑_i g_i(x)

Bias²(g) = (1/N) ∑_t [ḡ(x^t) − f(x^t)]²
Variance(g) = (1/(N M)) ∑_t ∑_i [g_i(x^t) − ḡ(x^t)]²
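A simulation sketch (the target f and noise level are invented): draw M samples, fit a line to each, and evaluate the two estimates on a grid of points:

    import numpy as np

    rng = np.random.default_rng(4)
    f = lambda x: np.sin(3 * x)        # hypothetical true function
    xs = np.linspace(0, 1, 25)         # evaluation points x^t
    M, N = 50, 20

    G = np.empty((M, len(xs)))
    for i in range(M):
        x = rng.uniform(0, 1, N)
        r = f(x) + rng.normal(0, 0.2, N)
        w1, w0 = np.polyfit(x, r, 1)   # fit g_i(x) = w1 x + w0
        G[i] = w1 * xs + w0

    g_bar = G.mean(axis=0)                     # (1/M) sum_i g_i(x)
    bias2 = ((g_bar - f(xs)) ** 2).mean()      # (1/N) sum_t [g_bar - f]^2
    variance = ((G - g_bar) ** 2).mean()       # (1/NM) sum_t sum_i [g_i - g_bar]^2
    print(bias2, variance)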
Bias/Variance Dilemma

Example: g_i(x) = 2 has no variance and high bias
g_i(x) = ∑_t r_i^t / N has lower bias but nonzero variance

As we increase complexity,
bias decreases (a better fit to data) and
variance increases (fit varies more with data)

Bias/variance dilemma (Geman et al., 1992)
[Figure: the target f, the fitted functions g_i, and their average ḡ, illustrating bias and variance]
Polynomial Regression

[Figure] Best fit: “min error”
[Figure] Best fit: “elbow”
Model Selection

Cross-validation: measure generalization accuracy by testing on data unused during training
Regularization: penalize complex models
E′ = error on data + λ · model complexity
Akaike’s information criterion (AIC), Bayesian information criterion (BIC)
Minimum description length (MDL): Kolmogorov complexity, shortest description of data
Structural risk minimization (SRM)
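A minimal cross-validation sketch (data invented) for choosing the polynomial order k by validation error:

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(-1, 1, 60)
    r = np.sin(2 * x) + rng.normal(0, 0.1, 60)   # hypothetical data

    x_tr, r_tr = x[:40], r[:40]                  # training split
    x_va, r_va = x[40:], r[40:]                  # validation split

    for k in range(1, 8):
        w = np.polyfit(x_tr, r_tr, k)            # fit an order-k polynomial
        err = ((r_va - np.polyval(w, x_va)) ** 2).mean()
        print(k, err)                            # pick k with lowest validation error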
Bayesian Model Selection

p(model|data) = p(data|model) p(model) / p(data)

Prior on models, p(model)
Regularization, when the prior favors simpler models
Bayes: MAP of the posterior, p(model|data)
Average over a number of models with high posterior (voting, ensembles: Chapter 17)
Regression example

Coefficients increase in magnitude as the order increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]

Regularization (L2):
E(w|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|w)]² + λ ∑_i w_i²
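For a polynomial model this L2 penalty has a closed form analogous to the earlier least-squares solution; a sketch (data invented, λ chosen arbitrarily), where the penalty shrinks the coefficients:

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.uniform(-1, 1, 40)
    r = np.sin(2 * x) + rng.normal(0, 0.1, 40)   # hypothetical data

    k, lam = 6, 1e-2
    D = np.vander(x, k + 1, increasing=True)     # columns 1, x, ..., x^k
    # ridge solution: w = (D^T D + lambda I)^{-1} D^T r
    w = np.linalg.solve(D.T @ D + lam * np.eye(k + 1), D.T @ r)
    print(w)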