
  • INTRODUCTION TO

    Machine Learning

    ETHEM ALPAYDIN © The MIT Press, 2004

    [email protected]
    http://www.cmpe.boun.edu.tr/~ethem/i2ml

    Lecture Slides for

  • CHAPTER 12:

    Local Models

  • Introduction

    Divide the input space into local regions and learn simple (constant/linear) models in each patch

    Unsupervised: competitive, online clustering
    Supervised: radial-basis functions, mixture of experts

  • Competitive Learning

    $$E(\{\mathbf{m}_i\}_{i=1}^{k} \mid \mathcal{X}) = \sum_t \sum_i b_i^t\, \|\mathbf{x}^t - \mathbf{m}_i\|^2$$

    $$b_i^t = \begin{cases} 1 & \text{if } \|\mathbf{x}^t - \mathbf{m}_i\| = \min_l \|\mathbf{x}^t - \mathbf{m}_l\| \\ 0 & \text{otherwise} \end{cases}$$

    Batch $k$-means: $\quad \mathbf{m}_i = \dfrac{\sum_t b_i^t\, \mathbf{x}^t}{\sum_t b_i^t}$

    Online $k$-means: $\quad \Delta m_{ij} = -\eta\, \dfrac{\partial E^t}{\partial m_{ij}} = \eta\, b_i^t \left(x_j^t - m_{ij}\right)$
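    A minimal NumPy sketch of the online update above: only the winning center ($b_i^t = 1$) is moved toward $\mathbf{x}^t$. The learning rate, number of centers, epoch count, and random initialization below are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def online_kmeans(X, k, eta=0.1, epochs=10, seed=0):
    """Online (competitive) k-means: move only the winning center toward x."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), k, replace=False)].astype(float)  # initial centers
    for _ in range(epochs):
        for x in rng.permutation(X):
            i = np.argmin(np.linalg.norm(x - m, axis=1))  # winner: b_i = 1
            m[i] += eta * (x - m[i])                      # Δm_i = η (x - m_i)
    return m
```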

  • Winner-take-all network

  • Adaptive Resonance Theory

    Incremental; add a new cluster if not covered; defined by vigilance, ρ

    (Carpenter and Grossberg, 1988)

    $$b_i^t = \|\mathbf{x}^t - \mathbf{m}_i\| = \min_{l=1}^{k} \|\mathbf{x}^t - \mathbf{m}_l\|$$

    $$\begin{cases} \mathbf{m}_{k+1} \leftarrow \mathbf{x}^t & \text{if } b_i^t > \rho \\ \Delta\mathbf{m}_i = \eta\,(\mathbf{x}^t - \mathbf{m}_i) & \text{otherwise} \end{cases}$$
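    A sketch of the vigilance mechanism in the same spirit: if the best-matching center is farther than ρ, a new cluster is created; otherwise the winner is updated. Euclidean distance and the parameter values are assumptions for illustration, and full ART uses a more elaborate match/reset scheme than this simplification.

```python
import numpy as np

def art_like_clustering(X, rho=1.0, eta=0.5):
    """Incremental clustering with a vigilance test: if the nearest center
    is farther than rho, add a new cluster; otherwise update the winner."""
    X = np.asarray(X, dtype=float)
    centers = [X[0].copy()]
    for x in X[1:]:
        dists = [np.linalg.norm(x - m) for m in centers]
        i = int(np.argmin(dists))
        if dists[i] > rho:                       # not covered: add a new cluster
            centers.append(x.copy())
        else:                                    # covered: Δm_i = η (x - m_i)
            centers[i] += eta * (x - centers[i])
    return np.array(centers)
```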

  • Self-Organizing Maps

    Units have a defined neighborhood; m_i is "between" m_{i-1} and m_{i+1}, and they are all updated together

    One-dim map:

    $$\Delta \mathbf{m}_l = \eta\, e(l, i)\,(\mathbf{x}^t - \mathbf{m}_l)$$

    $$e(l, i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(l - i)^2}{2\sigma^2}\right]$$

    (Kohonen, 1990)
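    A sketch of one-dimensional SOM training: the winner and its neighbors on the map are all pulled toward $\mathbf{x}^t$, weighted by the Gaussian window $e(l, i)$. The normalizing constant of $e(l, i)$ is absorbed into the learning rate here, and the shrinking schedule for σ is an assumption for illustration.

```python
import numpy as np

def som_1d(X, k=10, eta=0.2, sigma=2.0, epochs=20, seed=0):
    """One-dimensional SOM: the winner and its map neighbors are all pulled
    toward x, weighted by a Gaussian neighborhood e(l, i) on unit indices."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), k, replace=False)].astype(float)
    idx = np.arange(k)
    for _ in range(epochs):
        for x in rng.permutation(X):
            i = np.argmin(np.linalg.norm(x - m, axis=1))      # winning unit
            e = np.exp(-((idx - i) ** 2) / (2 * sigma ** 2))  # neighborhood weights
            m += eta * e[:, None] * (x - m)                   # Δm_l = η e(l,i)(x - m_l)
        sigma *= 0.9  # optionally shrink the neighborhood over time (assumption)
    return m
```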

  • Radial-Basis Functions

    Locally-tuned units:

    $$p_h^t = \exp\!\left[-\frac{\|\mathbf{x}^t - \mathbf{m}_h\|^2}{2 s_h^2}\right]$$

    $$y^t = \sum_{h=1}^{H} w_h\, p_h^t + w_0$$
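    A sketch of the forward computation for a single-output RBF network with the Gaussian units defined above; array shapes and names are illustrative.

```python
import numpy as np

def rbf_forward(X, m, s, w, w0):
    """Locally tuned units: p_h = exp(-||x - m_h||^2 / (2 s_h^2)),
    output y = sum_h w_h p_h + w_0 (single-output regression)."""
    # X: (N, d) inputs, m: (H, d) centers, s: (H,) spreads, w: (H,), w0: scalar
    d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)   # (N, H) squared distances
    p = np.exp(-d2 / (2 * s ** 2))                            # unit activations
    return p @ w + w0, p
```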

  • Local vs Distributed Representation

  • Training RBF

    Hybrid learning:
    First layer centers and spreads: unsupervised k-means
    Second layer weights: supervised gradient descent

    Fully supervised (all parameters trained together by gradient descent)

    (Broomhead and Lowe, 1988; Moody and Darken, 1989)
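    A sketch of the hybrid scheme under simple assumptions: batch k-means for the centers, a per-cluster mean-distance heuristic for the spreads, and, in place of the slides' gradient descent, the closed-form least-squares solution of the same squared-error problem for the second-layer weights.

```python
import numpy as np

def train_rbf_hybrid(X, r, H=5, seed=0):
    """Hybrid RBF training sketch:
    1) unsupervised k-means for centers m_h and spreads s_h,
    2) second-layer weights minimizing squared error (closed-form least squares
       here, where the slides use supervised gradient descent on the same loss)."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), H, replace=False)].astype(float)
    for _ in range(50):                                    # plain batch k-means
        b = np.argmin(((X[:, None] - m[None]) ** 2).sum(-1), axis=1)
        for h in range(H):
            if np.any(b == h):
                m[h] = X[b == h].mean(axis=0)
    # spread heuristic: mean distance of a cluster's members to its center
    s = np.array([np.linalg.norm(X[b == h] - m[h], axis=1).mean() + 1e-8
                  if np.any(b == h) else 1.0 for h in range(H)])
    d2 = ((X[:, None] - m[None]) ** 2).sum(-1)
    P = np.exp(-d2 / (2 * s ** 2))                         # unit activations
    P1 = np.hstack([P, np.ones((len(X), 1))])              # append bias column
    w = np.linalg.lstsq(P1, r, rcond=None)[0]              # [w_1 .. w_H, w_0]
    return m, s, w
```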

  • Regression

    $$E(\{\mathbf{m}_h, s_h, w_{ih}\}_{i,h} \mid \mathcal{X}) = \frac{1}{2} \sum_t \sum_i \left[r_i^t - y_i^t\right]^2 \quad\text{where}\quad y_i^t = \sum_{h=1}^{H} w_{ih}\, p_h^t + w_{i0}$$

    $$\Delta w_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) p_h^t$$

    $$\Delta m_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) w_{ih}\right] p_h^t\, \frac{x_j^t - m_{hj}}{s_h^2}$$

    $$\Delta s_h = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) w_{ih}\right] p_h^t\, \frac{\|\mathbf{x}^t - \mathbf{m}_h\|^2}{s_h^3}$$
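    One epoch of the fully supervised updates above, sketched for a single output to keep the array shapes readable; the learning rate is an illustrative choice.

```python
import numpy as np

def rbf_regression_epoch(X, r, m, s, w, w0, eta=0.01):
    """One batch pass of the fully supervised RBF updates (single output):
    Δw_h  = η Σ_t (r - y) p_h
    Δm_hj = η Σ_t (r - y) w_h p_h (x_j - m_hj) / s_h^2
    Δs_h  = η Σ_t (r - y) w_h p_h ||x - m_h||^2 / s_h^3"""
    d2 = ((X[:, None] - m[None]) ** 2).sum(-1)           # (N, H)
    p = np.exp(-d2 / (2 * s ** 2))                       # unit activations
    y = p @ w + w0
    err = r - y                                          # (N,)
    w = w + eta * (p.T @ err)
    w0 = w0 + eta * err.sum()
    g = (err * p.T * w[:, None]).T                       # (N, H): (r-y) w_h p_h
    m = m + eta * (g.T[..., None] * (X[None] - m[:, None])).sum(1) / (s ** 2)[:, None]
    s = s + eta * (g * d2).sum(0) / s ** 3
    return m, s, w, w0
```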

  • Classification

    $$E(\{\mathbf{m}_h, s_h, w_{ih}\}_{i,h} \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t$$

    $$y_i^t = \frac{\exp\!\left[\sum_h w_{ih}\, p_h^t + w_{i0}\right]}{\sum_k \exp\!\left[\sum_h w_{kh}\, p_h^t + w_{k0}\right]}$$

  • Rules and Exceptions

    $$y^t = \underbrace{\sum_{h=1}^{H} w_h\, p_h^t}_{\text{exceptions}} + \underbrace{\mathbf{v}^T \mathbf{x}^t + v_0}_{\text{default rule}}$$

  • Rule-Based Knowledge

    Incorporation of prior knowledge (before training)
    Rule extraction (after training) (Tresp et al., 1997)
    Fuzzy membership functions and fuzzy rules

    IF $(x_1 \approx a$ AND $x_2 \approx b)$ OR $x_3 \approx c$ THEN $y = 0.1$

    $$p_1 = \exp\!\left[-\frac{(x_1 - a)^2}{2 s_1^2}\right] \cdot \exp\!\left[-\frac{(x_2 - b)^2}{2 s_2^2}\right] \quad\text{with } w_1 = 0.1$$

    $$p_2 = \exp\!\left[-\frac{(x_3 - c)^2}{2 s_3^2}\right] \quad\text{with } w_2 = 0.1$$

  • Normalized Basis Functions

    $$g_h^t = \frac{p_h^t}{\sum_l p_l^t} = \frac{\exp\!\left[-\|\mathbf{x}^t - \mathbf{m}_h\|^2 / (2 s_h^2)\right]}{\sum_l \exp\!\left[-\|\mathbf{x}^t - \mathbf{m}_l\|^2 / (2 s_l^2)\right]}$$

    $$y_i^t = \sum_{h=1}^{H} w_{ih}\, g_h^t$$

    $$\Delta w_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) g_h^t$$

    $$\Delta m_{hj} = \eta \sum_t \sum_i \left(r_i^t - y_i^t\right)\left(w_{ih} - y_i^t\right) g_h^t\, \frac{x_j^t - m_{hj}}{s_h^2}$$

  • Competitive Basis Functions

    Mixture model:

    $$p(\mathbf{r}^t \mid \mathbf{x}^t) = \sum_{h=1}^{H} p(h \mid \mathbf{x}^t)\, p(\mathbf{r}^t \mid h, \mathbf{x}^t)$$

    $$g_h^t = p(h \mid \mathbf{x}^t) = \frac{a_h \exp\!\left[-\|\mathbf{x}^t - \mathbf{m}_h\|^2 / (2 s_h^2)\right]}{\sum_l a_l \exp\!\left[-\|\mathbf{x}^t - \mathbf{m}_l\|^2 / (2 s_l^2)\right]}$$

  • Regression

    $$p(\mathbf{r}^t \mid \mathbf{x}^t) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(r_i^t - y_i^t)^2}{2\sigma^2}\right]$$

    $$\mathcal{L}(\{\mathbf{m}_h, s_h, w_{ih}\}_{i,h} \mid \mathcal{X}) = \sum_t \log \sum_h g_h^t \exp\!\left[-\frac{1}{2} \sum_i \left(r_i^t - y_{ih}^t\right)^2\right]$$

    where $y_{ih}^t = w_{ih}$ is the constant fit

    $$\Delta w_{ih} = \eta \sum_t \left(r_i^t - y_{ih}^t\right) f_h^t \qquad \Delta m_{hj} = \eta \sum_t \left(f_h^t - g_h^t\right) \frac{x_j^t - m_{hj}}{s_h^2}$$

    $$f_h^t = \frac{g_h^t \exp\!\left[-\frac{1}{2}\sum_i (r_i^t - y_{ih}^t)^2\right]}{\sum_l g_l^t \exp\!\left[-\frac{1}{2}\sum_i (r_i^t - y_{il}^t)^2\right]} = p(h \mid \mathbf{x}^t, \mathbf{r}^t) = \frac{p(h \mid \mathbf{x}^t)\, p(\mathbf{r}^t \mid h, \mathbf{x}^t)}{\sum_l p(l \mid \mathbf{x}^t)\, p(\mathbf{r}^t \mid l, \mathbf{x}^t)}$$

  • Classification

    $$\mathcal{L}(\{\mathbf{m}_h, s_h, w_{ih}\}_{i,h} \mid \mathcal{X}) = \sum_t \log \sum_h g_h^t \prod_i \left(y_{ih}^t\right)^{r_i^t} = \sum_t \log \sum_h g_h^t \exp\!\left[\sum_i r_i^t \log y_{ih}^t\right]$$

    $$y_{ih}^t = \frac{\exp w_{ih}}{\sum_k \exp w_{kh}}$$

    $$f_h^t = \frac{g_h^t \exp\!\left[\sum_i r_i^t \log y_{ih}^t\right]}{\sum_l g_l^t \exp\!\left[\sum_i r_i^t \log y_{il}^t\right]}$$

  • EM for RBF (Supervised EM)

    E-step:

    $$f_h^t \equiv p(h \mid \mathbf{r}^t, \mathbf{x}^t)$$

    M-step:

    $$\mathbf{m}_h = \frac{\sum_t f_h^t\, \mathbf{x}^t}{\sum_t f_h^t} \qquad s_h^2 = \frac{\sum_t f_h^t\, (\mathbf{x}^t - \mathbf{m}_h)^T (\mathbf{x}^t - \mathbf{m}_h)}{\sum_t f_h^t} \qquad w_{ih} = \frac{\sum_t f_h^t\, r_i^t}{\sum_t f_h^t}$$
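    A sketch of the M-step given posteriors $f_h^t$ from the E-step; names and array shapes are illustrative.

```python
import numpy as np

def rbf_em_mstep(X, R, F):
    """M-step of supervised EM for an RBF model, given posteriors F (N, H):
    weighted means for centers, weighted mean squared distance for spreads,
    weighted mean targets for the second-layer weights."""
    # X: (N, d) inputs, R: (N, K) targets, F: (N, H) posteriors f_h^t
    denom = F.sum(axis=0)                               # (H,)
    m = (F.T @ X) / denom[:, None]                      # centers m_h
    d2 = ((X[:, None] - m[None]) ** 2).sum(-1)          # (N, H) squared distances
    s2 = (F * d2).sum(axis=0) / denom                   # spreads s_h^2
    w = (F.T @ R) / denom[:, None]                      # weights w_ih, shape (H, K)
    return m, np.sqrt(s2), w
```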

  • Learning Vector Quantization

    H units per class prelabeled (Kohonen, 1990)

    Given $\mathbf{x}$, $\mathbf{m}_i$ is the closest:

    $$\Delta\mathbf{m}_i = \begin{cases} \eta\,(\mathbf{x}^t - \mathbf{m}_i) & \text{if } \mathbf{x}^t \text{ and } \mathbf{m}_i \text{ have the same class label} \\ -\eta\,(\mathbf{x}^t - \mathbf{m}_i) & \text{otherwise} \end{cases}$$
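    A sketch of one LVQ pass over prelabeled prototypes, following the attract/repel rule above; the learning rate is illustrative.

```python
import numpy as np

def lvq_epoch(X, labels, m, m_labels, eta=0.05):
    """LVQ1 sketch: pull the closest prototype toward x if the class labels
    match, push it away otherwise (prototypes m are prelabeled)."""
    for x, c in zip(X, labels):
        i = np.argmin(np.linalg.norm(x - m, axis=1))   # closest prototype
        if m_labels[i] == c:
            m[i] += eta * (x - m[i])                   # Δm_i =  η (x - m_i)
        else:
            m[i] -= eta * (x - m[i])                   # Δm_i = -η (x - m_i)
    return m
```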

  • Mixture of Experts

    In RBF, each local fit is a constant, w_ih, the second-layer weight

    In MoE, each local fit is a linear function of x, a local expert:

    (Jacobs et al., 1991)

    $$w_{ih}^t = \mathbf{v}_{ih}^T \mathbf{x}^t$$

  • MoE as Models Combined

    Radial gating:

    $$g_h^t = \frac{\exp\!\left[-\|\mathbf{x}^t - \mathbf{m}_h\|^2 / (2 s_h^2)\right]}{\sum_l \exp\!\left[-\|\mathbf{x}^t - \mathbf{m}_l\|^2 / (2 s_l^2)\right]}$$

    Softmax gating:

    $$g_h^t = \frac{\exp\!\left[\mathbf{m}_h^T \mathbf{x}^t\right]}{\sum_l \exp\!\left[\mathbf{m}_l^T \mathbf{x}^t\right]}$$
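    A sketch of a mixture-of-experts prediction with softmax gating and linear experts, for a single output; parameter names and shapes are illustrative.

```python
import numpy as np

def moe_forward(x, V, M):
    """Mixture of experts with softmax gating (single output):
    experts w_h = v_h^T x, gating g_h = softmax(m_h^T x), y = Σ_h g_h w_h.
    V: (H, d) expert weights, M: (H, d) gating weights, x: (d,)."""
    w = V @ x                          # local linear experts
    a = M @ x
    g = np.exp(a - a.max())            # softmax gating (shifted for stability)
    g /= g.sum()
    return g @ w, g, w
```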

  • Cooperative MoE

    Regression:

    $$E(\{\mathbf{m}_h, \mathbf{v}_{ih}\}_{i,h} \mid \mathcal{X}) = \frac{1}{2} \sum_t \sum_i \left(r_i^t - y_i^t\right)^2$$

    $$\Delta\mathbf{v}_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) g_h^t\, \mathbf{x}^t$$

    $$\Delta m_{hj} = \eta \sum_t \sum_i \left(r_i^t - y_i^t\right)\left(w_{ih}^t - y_i^t\right) g_h^t\, x_j^t$$

  • Competitive MoE: Regression

    $$\mathcal{L}(\{\mathbf{m}_h, \mathbf{v}_{ih}\}_{i,h} \mid \mathcal{X}) = \sum_t \log \sum_h g_h^t \exp\!\left[-\frac{1}{2} \sum_i \left(r_i^t - y_{ih}^t\right)^2\right]$$

    $$y_{ih}^t = \mathbf{v}_{ih}^T \mathbf{x}^t$$

    $$\Delta\mathbf{v}_{ih} = \eta \sum_t \left(r_i^t - y_{ih}^t\right) f_h^t\, \mathbf{x}^t \qquad \Delta\mathbf{m}_h = \eta \sum_t \left(f_h^t - g_h^t\right) \mathbf{x}^t$$

  • Competitive MoE: Classification

    $$\mathcal{L}(\{\mathbf{m}_h, \mathbf{v}_{ih}\}_{i,h} \mid \mathcal{X}) = \sum_t \log \sum_h g_h^t \prod_i \left(y_{ih}^t\right)^{r_i^t} = \sum_t \log \sum_h g_h^t \exp\!\left[\sum_i r_i^t \log y_{ih}^t\right]$$

    $$y_{ih}^t = \frac{\exp w_{ih}^t}{\sum_k \exp w_{kh}^t}, \qquad w_{ih}^t = \mathbf{v}_{ih}^T \mathbf{x}^t$$

  • Hierarchical Mixture of Experts

    Tree of MoE where each MoE is an expert in a higher-level MoE

    Soft decision tree: Takes a weighted (gating) average of all leaves (experts), as opposed to using a single path and a single leaf

    Can be trained using EM (Jordan and Jacobs, 1994)