INTRODUCTION TO MACHINE LEARNING, 3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e

Lecture Slides for
CHAPTER 4: PARAMETRIC METHODS
Parametric Estimation

X = {x^t}_t where x^t ~ p(x)
Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²) where θ = {μ, σ²}
Maximum Likelihood Estimation

Likelihood of θ given the sample X:
l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)

Log likelihood:
L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)

Maximum likelihood estimator (MLE):
θ* = argmax_θ L(θ|X)
Examples: Bernoulli/Multinomial

Bernoulli: two states, failure/success, x ∈ {0, 1}
P(x) = p_o^x (1 − p_o)^(1−x)
L(p_o|X) = log ∏_t p_o^(x^t) (1 − p_o)^(1−x^t)
MLE: p_o = ∑_t x^t / N

Multinomial: K > 2 states, x_i ∈ {0, 1}
P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i)
L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)
MLE: p_i = ∑_t x_i^t / N
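Both closed-form MLEs are one-line computations; a minimal NumPy sketch (the sample values are invented for the example):

    import numpy as np

    # Bernoulli MLE: p_o is the fraction of successes in the sample
    x = np.array([1, 0, 1, 1, 0, 1])           # hypothetical 0/1 sample
    p_o = x.sum() / len(x)                     # p_o = sum_t x^t / N

    # Multinomial MLE: p_i is the fraction of draws landing in state i;
    # each row of X is a one-hot indicator vector x^t
    X = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1]])
    p = X.sum(axis=0) / X.shape[0]             # p_i = sum_t x_i^t / N
    print(p_o, p)                              # 0.666..., [0.5 0.25 0.25]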
Gaussian (Normal) Distribution

p(x) = N(μ, σ²):
p(x) = 1/(√(2π) σ) · exp[−(x − μ)² / (2σ²)]

MLE for μ and σ²:
m = ∑_t x^t / N
s² = ∑_t (x^t − m)² / N
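The same estimates in NumPy (sample drawn from an invented distribution); note the MLE of the variance divides by N, not N − 1:

    import numpy as np

    x = np.random.default_rng(0).normal(5.0, 2.0, size=1000)  # hypothetical sample
    m = x.sum() / len(x)                    # m = sum_t x^t / N
    s2 = ((x - m) ** 2).sum() / len(x)      # s^2 = sum_t (x^t - m)^2 / N
    print(m, s2)                            # close to 5.0 and 4.0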
Bias and Variance

Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i
Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]

Mean square error:
r(d, θ) = E[(d − θ)²]
        = (E[d] − θ)² + E[(d − E[d])²]
        = Bias² + Variance
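To make the decomposition concrete, a small simulation (not from the slides) estimating bias, variance, and MSE of the ML variance estimator s², which is biased by the factor (N − 1)/N:

    import numpy as np

    rng = np.random.default_rng(0)
    N, sigma2 = 10, 4.0                        # true variance sigma^2 = 4
    # np.var divides by N, i.e. it is the ML estimator s^2
    d = np.array([np.var(rng.normal(0.0, 2.0, N)) for _ in range(100_000)])

    bias = d.mean() - sigma2                   # E[d] - theta, about -sigma2/N = -0.4
    variance = ((d - d.mean()) ** 2).mean()    # E[(d - E[d])^2]
    mse = ((d - sigma2) ** 2).mean()           # equals bias^2 + variance
    print(bias, variance, mse)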
Bayes’ Estimator

Treat θ as a random variable with prior p(θ)
Bayes’ rule: p(θ|X) = p(X|θ) p(θ) / p(X)
Full (Bayes’) estimate: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
Maximum a Posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
Maximum Likelihood (ML): θ_ML = argmax_θ p(X|θ)
Bayes’ estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
Bayes’ Estimator: Example

x^t ~ N(θ, σ_o²) and θ ~ N(μ, σ²)
θ_ML = m (the sample average)
θ_MAP = θ_Bayes = E[θ|X]
      = (N/σ_o²) / (N/σ_o² + 1/σ²) · m + (1/σ²) / (N/σ_o² + 1/σ²) · μ

i.e., a weighted average of the sample mean m and the prior mean μ, weighted by their inverse variances.
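A numeric sketch of this posterior mean (all values invented), showing the Bayes’ estimate shrinking the sample mean toward the prior mean:

    import numpy as np

    x = np.array([3.1, 2.7, 3.4, 2.9, 3.2])  # hypothetical sample, x^t ~ N(theta, sigma_o^2)
    sigma_o2, mu, sigma2 = 0.25, 0.0, 1.0    # known noise variance; prior N(mu, sigma^2)
    N, m = len(x), x.mean()

    w = (N / sigma_o2) / (N / sigma_o2 + 1 / sigma2)  # weight on the data
    theta_bayes = w * m + (1 - w) * mu                # posterior mean E[theta|X]
    print(m, theta_bayes)                             # estimate shrunk from m toward mu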
Parametric Classification

g_i(x) = p(x|C_i) P(C_i)
or
g_i(x) = log p(x|C_i) + log P(C_i)

With Gaussian class densities:
p(x|C_i) = 1/(√(2π) σ_i) · exp[−(x − μ_i)² / (2σ_i²)]
g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)
Given the sample X = {x^t, r^t}_{t=1}^N where

r_i^t = 1 if x^t ∈ C_i
r_i^t = 0 if x^t ∈ C_j, j ≠ i

ML estimates are

P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t x^t r_i^t / ∑_t r_i^t
s_i² = ∑_t (x^t − m_i)² r_i^t / ∑_t r_i^t

Discriminant:

g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2s_i²) + log P̂(C_i)
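A compact NumPy sketch of these plug-in estimates and the resulting discriminant (the 1-D data and labels are invented):

    import numpy as np

    x = np.array([1.0, 1.2, 0.8, 3.9, 4.2, 4.0])                    # x^t
    r = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])  # r[t, i]
    N, K = r.shape

    P = r.sum(axis=0) / N                                           # P^(C_i)
    m = (x[:, None] * r).sum(axis=0) / r.sum(axis=0)                # m_i
    s2 = (((x[:, None] - m) ** 2) * r).sum(axis=0) / r.sum(axis=0)  # s_i^2

    def g(x_new):
        # g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i), constants dropped
        return -0.5 * np.log(s2) - (x_new - m) ** 2 / (2 * s2) + np.log(P)

    print(np.argmax(g(1.1)), np.argmax(g(3.5)))   # predicted class indices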
Equal variances: a single boundary halfway between the means

Variances are different: two boundaries
Regression

r = f(x) + ε, where ε ~ N(0, σ²)
p(r|x) ~ N(g(x|θ), σ²)
Estimator: g(x|θ)

L(θ|X) = log ∏_{t=1}^N p(x^t, r^t)
       = log ∏_t p(r^t|x^t) + log ∏_t p(x^t)
Regression: From LogL to Error

L(θ|X) = log ∏_{t=1}^N 1/(√(2π) σ) · exp[−(r^t − g(x^t|θ))² / (2σ²)]
       = −N log(√(2π) σ) − 1/(2σ²) · ∑_t [r^t − g(x^t|θ)]²

Since the first term and the factor 1/σ² do not depend on g, maximizing L is equivalent to minimizing the sum of squared errors:

E(θ|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|θ)]²
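A small numeric check (data invented) that the log likelihood and the squared error rank candidate parameters identically, with opposite sign:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 50)
    r = 2.0 * x + 0.5 + rng.normal(0, 0.1, 50)   # hypothetical r = f(x) + noise

    def logL(w1, w0, sigma=0.1):
        res = r - (w1 * x + w0)
        return (-len(x) * np.log(np.sqrt(2 * np.pi) * sigma)
                - (res ** 2).sum() / (2 * sigma ** 2))

    def E(w1, w0):
        return 0.5 * ((r - (w1 * x + w0)) ** 2).sum()

    # the better parameter pair has the higher logL and the lower E
    print(logL(2.0, 0.5), E(2.0, 0.5))
    print(logL(1.0, 0.0), E(1.0, 0.0))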
Linear Regression

g(x^t|w_1, w_0) = w_1 x^t + w_0

Setting the derivatives of E(w_1, w_0|X) to zero gives the normal equations:
∑_t r^t = N w_0 + w_1 ∑_t x^t
∑_t r^t x^t = w_0 ∑_t x^t + w_1 ∑_t (x^t)²

In matrix form, A w = y:

A = [ N         ∑_t x^t    ]
    [ ∑_t x^t   ∑_t (x^t)² ]

w = [w_0, w_1]^T,  y = [∑_t r^t, ∑_t r^t x^t]^T

w = A^{-1} y
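Solving the normal equations in NumPy (data invented); np.linalg.solve is used rather than an explicit inverse:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 5, 30)
    r = 1.5 * x + 0.7 + rng.normal(0, 0.3, 30)   # hypothetical data

    A = np.array([[len(x), x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    y = np.array([r.sum(), (r * x).sum()])
    w0, w1 = np.linalg.solve(A, y)               # w = A^{-1} y
    print(w0, w1)                                # close to 0.7 and 1.5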
Polynomial Regression

g(x^t|w_k, ..., w_2, w_1, w_0) = w_k (x^t)^k + ... + w_2 (x^t)² + w_1 x^t + w_0

D = [ 1   x^1   (x^1)²  ...  (x^1)^k ]
    [ 1   x^2   (x^2)²  ...  (x^2)^k ]
    [ ...                            ]
    [ 1   x^N   (x^N)²  ...  (x^N)^k ]

r = [r^1, r^2, ..., r^N]^T

w = (D^T D)^{-1} D^T r
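The same least-squares solution in NumPy (data invented); np.vander builds the matrix D:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(-1, 1, 40)
    r = 1 - 2 * x + 0.5 * x ** 2 + rng.normal(0, 0.1, 40)  # hypothetical data

    k = 2
    D = np.vander(x, k + 1, increasing=True)   # columns 1, x, x^2, ..., x^k
    w = np.linalg.solve(D.T @ D, D.T @ r)      # w = (D^T D)^{-1} D^T r
    print(w)                                   # close to [1, -2, 0.5]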
Other Error Measures

Square error:
E(θ|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|θ)]²

Relative square error:
E(θ|X) = ∑_{t=1}^N [r^t − g(x^t|θ)]² / ∑_{t=1}^N (r^t − r̄)²

Absolute error:
E(θ|X) = ∑_t |r^t − g(x^t|θ)|

ε-sensitive error:
E(θ|X) = ∑_t 1(|r^t − g(x^t|θ)| > ε) · (|r^t − g(x^t|θ)| − ε)
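The four measures as short NumPy functions, taking arrays of targets r and predictions g:

    import numpy as np

    def square_error(r, g):
        return 0.5 * ((r - g) ** 2).sum()

    def relative_square_error(r, g):
        return ((r - g) ** 2).sum() / ((r - r.mean()) ** 2).sum()

    def absolute_error(r, g):
        return np.abs(r - g).sum()

    def eps_sensitive_error(r, g, eps):
        # 1(|r - g| > eps) * (|r - g| - eps) == max(|r - g| - eps, 0)
        return np.maximum(np.abs(r - g) - eps, 0.0).sum()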
Bias and Variance

Expected squared error at x:
E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
                   = noise + squared error

Averaging the squared error over samples X:
E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
                          = bias² + variance
Estimating Bias and Variance

M samples X_i = {x^t_i, r^t_i}, i = 1, ..., M,
are used to fit g_i(x), i = 1, ..., M

ḡ(x) = (1/M) ∑_i g_i(x)

Bias²(g) = (1/N) ∑_t [ḡ(x^t) − f(x^t)]²
Variance(g) = (1/(N M)) ∑_t ∑_i [g_i(x^t) − ḡ(x^t)]²
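A simulation sketch (the target f and noise level are invented): draw M samples, fit a line to each, and evaluate the two estimates on a grid of points:

    import numpy as np

    rng = np.random.default_rng(4)
    f = lambda x: np.sin(3 * x)        # hypothetical true function
    xs = np.linspace(0, 1, 25)         # evaluation points x^t
    M, N = 50, 20

    G = np.empty((M, len(xs)))
    for i in range(M):
        x = rng.uniform(0, 1, N)
        r = f(x) + rng.normal(0, 0.2, N)
        w1, w0 = np.polyfit(x, r, 1)   # fit g_i(x) = w1 x + w0
        G[i] = w1 * xs + w0

    g_bar = G.mean(axis=0)                     # (1/M) sum_i g_i(x)
    bias2 = ((g_bar - f(xs)) ** 2).mean()      # (1/N) sum_t [g_bar - f]^2
    variance = ((G - g_bar) ** 2).mean()       # (1/NM) sum_t sum_i [g_i - g_bar]^2
    print(bias2, variance)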
Bias/Variance Dilemma

Example: g_i(x) = 2 has no variance and high bias
g_i(x) = ∑_t r_i^t / N has lower bias but nonzero variance

As we increase complexity,
bias decreases (a better fit to data) and
variance increases (fit varies more with data)

Bias/variance dilemma (Geman et al., 1992)
[Figure: the target f, the fitted functions g_i, and their average ḡ, illustrating bias and variance]
Polynomial Regression

[Figure] Best fit: “min error”
[Figure] Best fit: “elbow”
Model Selection

Cross-validation: measure generalization accuracy by testing on data unused during training
Regularization: penalize complex models
E′ = error on data + λ · model complexity
Akaike’s information criterion (AIC), Bayesian information criterion (BIC)
Minimum description length (MDL): Kolmogorov complexity, shortest description of data
Structural risk minimization (SRM)
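A minimal cross-validation sketch (data invented) for choosing the polynomial order k by validation error:

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(-1, 1, 60)
    r = np.sin(2 * x) + rng.normal(0, 0.1, 60)   # hypothetical data

    x_tr, r_tr = x[:40], r[:40]                  # training split
    x_va, r_va = x[40:], r[40:]                  # validation split

    for k in range(1, 8):
        w = np.polyfit(x_tr, r_tr, k)            # fit an order-k polynomial
        err = ((r_va - np.polyval(w, x_va)) ** 2).mean()
        print(k, err)                            # pick k with lowest validation error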
Bayesian Model Selection

p(model|data) = p(data|model) p(model) / p(data)

Prior on models, p(model)
Regularization, when the prior favors simpler models
Bayes: MAP of the posterior, p(model|data)
Average over a number of models with high posterior (voting, ensembles: Chapter 17)
Regression example

Coefficients increase in magnitude as the order increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002]
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]

Regularization (L2):
E(w|X) = (1/2) ∑_{t=1}^N [r^t − g(x^t|w)]² + λ ∑_i w_i²
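For a polynomial model this L2 penalty has a closed form analogous to the earlier least-squares solution; a sketch (data invented, λ chosen arbitrarily), where the penalty shrinks the coefficients:

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.uniform(-1, 1, 40)
    r = np.sin(2 * x) + rng.normal(0, 0.1, 40)   # hypothetical data

    k, lam = 6, 1e-2
    D = np.vander(x, k + 1, increasing=True)     # columns 1, x, ..., x^k
    # ridge solution: w = (D^T D + lambda I)^{-1} D^T r
    w = np.linalg.solve(D.T @ D + lam * np.eye(k + 1), D.T @ r)
    print(w)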