
  • INTRODUCTION TO

    Machine Learning

    ETHEM ALPAYDIN © The MIT Press, 2004

    [email protected]
    http://www.cmpe.boun.edu.tr/~ethem/i2ml

    Lecture Slides for

  • CHAPTER 12:

    Local Models

  • Introduction

    Divide the input space into local regions and learn simple (constant/linear) models in each patch

    Unsupervised: competitive, online clustering
    Supervised: radial-basis functions, mixture of experts

  • Competitive Learning

    $$E(\{\mathbf{m}_i\}_{i=1}^{k} \mid \mathcal{X}) = \sum_t \sum_i b_i^t\, \|\mathbf{x}^t - \mathbf{m}_i\|^2$$

    $$b_i^t = \begin{cases} 1 & \text{if } \|\mathbf{x}^t - \mathbf{m}_i\| = \min_l \|\mathbf{x}^t - \mathbf{m}_l\| \\ 0 & \text{otherwise} \end{cases}$$

    Batch $k$-means: $\quad \mathbf{m}_i = \dfrac{\sum_t b_i^t\, \mathbf{x}^t}{\sum_t b_i^t}$

    Online $k$-means: $\quad \Delta m_{ij} = -\eta\, \dfrac{\partial E^t}{\partial m_{ij}} = \eta\, b_i^t \left(x_j^t - m_{ij}\right)$
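    A minimal NumPy sketch of the online update above: only the winning center ($b_i^t = 1$) is moved toward $\mathbf{x}^t$. The learning rate, number of centers, epoch count, and random initialization below are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def online_kmeans(X, k, eta=0.1, epochs=10, seed=0):
    """Online (competitive) k-means: move only the winning center toward x."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), k, replace=False)].astype(float)  # initial centers
    for _ in range(epochs):
        for x in rng.permutation(X):
            i = np.argmin(np.linalg.norm(x - m, axis=1))  # winner: b_i = 1
            m[i] += eta * (x - m[i])                      # Δm_i = η (x - m_i)
    return m
```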

  • Winner-take-all network

  • Adaptive Resonance Theory

    Incremental; add a new cluster if not covered; defined by vigilance, ρ

    (Carpenter and Grossberg, 1988)

    $$b_i^t = \|\mathbf{x}^t - \mathbf{m}_i\| = \min_{l=1}^{k} \|\mathbf{x}^t - \mathbf{m}_l\|$$

    $$\begin{cases} \mathbf{m}_{k+1} \leftarrow \mathbf{x}^t & \text{if } b_i^t > \rho \\ \Delta\mathbf{m}_i = \eta\,(\mathbf{x}^t - \mathbf{m}_i) & \text{otherwise} \end{cases}$$
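    A sketch of the vigilance mechanism in the same spirit: if the best-matching center is farther than ρ, a new cluster is created; otherwise the winner is updated. Euclidean distance and the parameter values are assumptions for illustration, and full ART uses a more elaborate match/reset scheme than this simplification.

```python
import numpy as np

def art_like_clustering(X, rho=1.0, eta=0.5):
    """Incremental clustering with a vigilance test: if the nearest center
    is farther than rho, add a new cluster; otherwise update the winner."""
    X = np.asarray(X, dtype=float)
    centers = [X[0].copy()]
    for x in X[1:]:
        dists = [np.linalg.norm(x - m) for m in centers]
        i = int(np.argmin(dists))
        if dists[i] > rho:                       # not covered: add a new cluster
            centers.append(x.copy())
        else:                                    # covered: Δm_i = η (x - m_i)
            centers[i] += eta * (x - centers[i])
    return np.array(centers)
```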

  • Self-Organizing Maps

    Units have a defined neighborhood; m_i is "between" m_{i-1} and m_{i+1}, and they are all updated together

    One-dim map:

    $$\Delta \mathbf{m}_l = \eta\, e(l, i)\,(\mathbf{x}^t - \mathbf{m}_l)$$

    $$e(l, i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(l - i)^2}{2\sigma^2}\right]$$

    (Kohonen, 1990)
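    A sketch of one-dimensional SOM training: the winner and its neighbors on the map are all pulled toward $\mathbf{x}^t$, weighted by the Gaussian window $e(l, i)$. The normalizing constant of $e(l, i)$ is absorbed into the learning rate here, and the shrinking schedule for σ is an assumption for illustration.

```python
import numpy as np

def som_1d(X, k=10, eta=0.2, sigma=2.0, epochs=20, seed=0):
    """One-dimensional SOM: the winner and its map neighbors are all pulled
    toward x, weighted by a Gaussian neighborhood e(l, i) on unit indices."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), k, replace=False)].astype(float)
    idx = np.arange(k)
    for _ in range(epochs):
        for x in rng.permutation(X):
            i = np.argmin(np.linalg.norm(x - m, axis=1))      # winning unit
            e = np.exp(-((idx - i) ** 2) / (2 * sigma ** 2))  # neighborhood weights
            m += eta * e[:, None] * (x - m)                   # Δm_l = η e(l,i)(x - m_l)
        sigma *= 0.9  # optionally shrink the neighborhood over time (assumption)
    return m
```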

  • Radial-Basis Functions

    Locally-tuned units:

    $$p_h^t = \exp\!\left[-\frac{\|\mathbf{x}^t - \mathbf{m}_h\|^2}{2 s_h^2}\right]$$

    $$y^t = \sum_{h=1}^{H} w_h\, p_h^t + w_0$$
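    A sketch of the forward computation for a single-output RBF network with the Gaussian units defined above; array shapes and names are illustrative.

```python
import numpy as np

def rbf_forward(X, m, s, w, w0):
    """Locally tuned units: p_h = exp(-||x - m_h||^2 / (2 s_h^2)),
    output y = sum_h w_h p_h + w_0 (single-output regression)."""
    # X: (N, d) inputs, m: (H, d) centers, s: (H,) spreads, w: (H,), w0: scalar
    d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)   # (N, H) squared distances
    p = np.exp(-d2 / (2 * s ** 2))                            # unit activations
    return p @ w + w0, p
```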

  • Local vs Distributed Representation

  • Training RBF

    Hybrid learning:
    First layer centers and spreads: unsupervised k-means
    Second layer weights: supervised gradient descent

    Fully supervised (all parameters trained together by gradient descent)

    (Broomhead and Lowe, 1988; Moody and Darken, 1989)
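    A sketch of the hybrid scheme under simple assumptions: batch k-means for the centers, a per-cluster mean-distance heuristic for the spreads, and, in place of the slides' gradient descent, the closed-form least-squares solution of the same squared-error problem for the second-layer weights.

```python
import numpy as np

def train_rbf_hybrid(X, r, H=5, seed=0):
    """Hybrid RBF training sketch:
    1) unsupervised k-means for centers m_h and spreads s_h,
    2) second-layer weights minimizing squared error (closed-form least squares
       here, where the slides use supervised gradient descent on the same loss)."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), H, replace=False)].astype(float)
    for _ in range(50):                                    # plain batch k-means
        b = np.argmin(((X[:, None] - m[None]) ** 2).sum(-1), axis=1)
        for h in range(H):
            if np.any(b == h):
                m[h] = X[b == h].mean(axis=0)
    # spread heuristic: mean distance of a cluster's members to its center
    s = np.array([np.linalg.norm(X[b == h] - m[h], axis=1).mean() + 1e-8
                  if np.any(b == h) else 1.0 for h in range(H)])
    d2 = ((X[:, None] - m[None]) ** 2).sum(-1)
    P = np.exp(-d2 / (2 * s ** 2))                         # unit activations
    P1 = np.hstack([P, np.ones((len(X), 1))])              # append bias column
    w = np.linalg.lstsq(P1, r, rcond=None)[0]              # [w_1 .. w_H, w_0]
    return m, s, w
```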

  • Regression

    $$E(\{\mathbf{m}_h, s_h, w_{ih}\}_{i,h} \mid \mathcal{X}) = \frac{1}{2} \sum_t \sum_i \left[r_i^t - y_i^t\right]^2 \quad\text{where}\quad y_i^t = \sum_{h=1}^{H} w_{ih}\, p_h^t + w_{i0}$$

    $$\Delta w_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) p_h^t$$

    $$\Delta m_{hj} = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) w_{ih}\right] p_h^t\, \frac{x_j^t - m_{hj}}{s_h^2}$$

    $$\Delta s_h = \eta \sum_t \left[\sum_i \left(r_i^t - y_i^t\right) w_{ih}\right] p_h^t\, \frac{\|\mathbf{x}^t - \mathbf{m}_h\|^2}{s_h^3}$$
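    One epoch of the fully supervised updates above, sketched for a single output to keep the array shapes readable; the learning rate is an illustrative choice.

```python
import numpy as np

def rbf_regression_epoch(X, r, m, s, w, w0, eta=0.01):
    """One batch pass of the fully supervised RBF updates (single output):
    Δw_h  = η Σ_t (r - y) p_h
    Δm_hj = η Σ_t (r - y) w_h p_h (x_j - m_hj) / s_h^2
    Δs_h  = η Σ_t (r - y) w_h p_h ||x - m_h||^2 / s_h^3"""
    d2 = ((X[:, None] - m[None]) ** 2).sum(-1)           # (N, H)
    p = np.exp(-d2 / (2 * s ** 2))                       # unit activations
    y = p @ w + w0
    err = r - y                                          # (N,)
    w = w + eta * (p.T @ err)
    w0 = w0 + eta * err.sum()
    g = (err * p.T * w[:, None]).T                       # (N, H): (r-y) w_h p_h
    m = m + eta * (g.T[..., None] * (X[None] - m[:, None])).sum(1) / (s ** 2)[:, None]
    s = s + eta * (g * d2).sum(0) / s ** 3
    return m, s, w, w0
```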

  • Classification

    $$E(\{\mathbf{m}_h, s_h, w_{ih}\}_{i,h} \mid \mathcal{X}) = -\sum_t \sum_i r_i^t \log y_i^t$$

    $$y_i^t = \frac{\exp\!\left[\sum_h w_{ih}\, p_h^t + w_{i0}\right]}{\sum_k \exp\!\left[\sum_h w_{kh}\, p_h^t + w_{k0}\right]}$$

  • Rules and Exceptions

    $$y^t = \underbrace{\sum_{h=1}^{H} w_h\, p_h^t}_{\text{exceptions}} + \underbrace{\mathbf{v}^T \mathbf{x}^t + v_0}_{\text{default rule}}$$

  • Rule-Based Knowledge

    Incorporation of prior knowledge (before training)
    Rule extraction (after training) (Tresp et al., 1997)
    Fuzzy membership functions and fuzzy rules

    IF $(x_1 \approx a$ AND $x_2 \approx b)$ OR $x_3 \approx c$ THEN $y = 0.1$

    $$p_1 = \exp\!\left[-\frac{(x_1 - a)^2}{2 s_1^2}\right] \cdot \exp\!\left[-\frac{(x_2 - b)^2}{2 s_2^2}\right] \quad\text{with } w_1 = 0.1$$

    $$p_2 = \exp\!\left[-\frac{(x_3 - c)^2}{2 s_3^2}\right] \quad\text{with } w_2 = 0.1$$

  • Normalized Basis Functions

    $$g_h^t = \frac{p_h^t}{\sum_l p_l^t} = \frac{\exp\!\left[-\|\mathbf{x}^t - \mathbf{m}_h\|^2 / (2 s_h^2)\right]}{\sum_l \exp\!\left[-\|\mathbf{x}^t - \mathbf{m}_l\|^2 / (2 s_l^2)\right]}$$

    $$y_i^t = \sum_{h=1}^{H} w_{ih}\, g_h^t$$

    $$\Delta w_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) g_h^t$$

    $$\Delta m_{hj} = \eta \sum_t \sum_i \left(r_i^t - y_i^t\right)\left(w_{ih} - y_i^t\right) g_h^t\, \frac{x_j^t - m_{hj}}{s_h^2}$$

  • Competitive Basis Functions

    Mixture model:

    $$p(\mathbf{r}^t \mid \mathbf{x}^t) = \sum_{h=1}^{H} p(h \mid \mathbf{x}^t)\, p(\mathbf{r}^t \mid h, \mathbf{x}^t)$$

    $$g_h^t = p(h \mid \mathbf{x}^t) = \frac{a_h \exp\!\left[-\|\mathbf{x}^t - \mathbf{m}_h\|^2 / (2 s_h^2)\right]}{\sum_l a_l \exp\!\left[-\|\mathbf{x}^t - \mathbf{m}_l\|^2 / (2 s_l^2)\right]}$$

  • Regression

    $$p(\mathbf{r}^t \mid \mathbf{x}^t) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(r_i^t - y_i^t)^2}{2\sigma^2}\right]$$

    $$\mathcal{L}(\{\mathbf{m}_h, s_h, w_{ih}\}_{i,h} \mid \mathcal{X}) = \sum_t \log \sum_h g_h^t \exp\!\left[-\frac{1}{2} \sum_i \left(r_i^t - y_{ih}^t\right)^2\right]$$

    where $y_{ih}^t = w_{ih}$ is the constant fit

    $$\Delta w_{ih} = \eta \sum_t \left(r_i^t - y_{ih}^t\right) f_h^t \qquad \Delta m_{hj} = \eta \sum_t \left(f_h^t - g_h^t\right) \frac{x_j^t - m_{hj}}{s_h^2}$$

    $$f_h^t = \frac{g_h^t \exp\!\left[-\frac{1}{2}\sum_i (r_i^t - y_{ih}^t)^2\right]}{\sum_l g_l^t \exp\!\left[-\frac{1}{2}\sum_i (r_i^t - y_{il}^t)^2\right]} = p(h \mid \mathbf{x}^t, \mathbf{r}^t) = \frac{p(h \mid \mathbf{x}^t)\, p(\mathbf{r}^t \mid h, \mathbf{x}^t)}{\sum_l p(l \mid \mathbf{x}^t)\, p(\mathbf{r}^t \mid l, \mathbf{x}^t)}$$

  • Classification

    $$\mathcal{L}(\{\mathbf{m}_h, s_h, w_{ih}\}_{i,h} \mid \mathcal{X}) = \sum_t \log \sum_h g_h^t \prod_i \left(y_{ih}^t\right)^{r_i^t} = \sum_t \log \sum_h g_h^t \exp\!\left[\sum_i r_i^t \log y_{ih}^t\right]$$

    $$y_{ih}^t = \frac{\exp w_{ih}}{\sum_k \exp w_{kh}}$$

    $$f_h^t = \frac{g_h^t \exp\!\left[\sum_i r_i^t \log y_{ih}^t\right]}{\sum_l g_l^t \exp\!\left[\sum_i r_i^t \log y_{il}^t\right]}$$

  • EM for RBF (Supervised EM)

    E-step:

    $$f_h^t \equiv p(h \mid \mathbf{r}^t, \mathbf{x}^t)$$

    M-step:

    $$\mathbf{m}_h = \frac{\sum_t f_h^t\, \mathbf{x}^t}{\sum_t f_h^t} \qquad s_h^2 = \frac{\sum_t f_h^t\, (\mathbf{x}^t - \mathbf{m}_h)^T (\mathbf{x}^t - \mathbf{m}_h)}{\sum_t f_h^t} \qquad w_{ih} = \frac{\sum_t f_h^t\, r_i^t}{\sum_t f_h^t}$$
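    A sketch of the M-step given posteriors $f_h^t$ from the E-step; names and array shapes are illustrative.

```python
import numpy as np

def rbf_em_mstep(X, R, F):
    """M-step of supervised EM for an RBF model, given posteriors F (N, H):
    weighted means for centers, weighted mean squared distance for spreads,
    weighted mean targets for the second-layer weights."""
    # X: (N, d) inputs, R: (N, K) targets, F: (N, H) posteriors f_h^t
    denom = F.sum(axis=0)                               # (H,)
    m = (F.T @ X) / denom[:, None]                      # centers m_h
    d2 = ((X[:, None] - m[None]) ** 2).sum(-1)          # (N, H) squared distances
    s2 = (F * d2).sum(axis=0) / denom                   # spreads s_h^2
    w = (F.T @ R) / denom[:, None]                      # weights w_ih, shape (H, K)
    return m, np.sqrt(s2), w
```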

  • Learning Vector Quantization

    H units per class prelabeled (Kohonen, 1990)

    Given $\mathbf{x}$, $\mathbf{m}_i$ is the closest:

    $$\Delta\mathbf{m}_i = \begin{cases} \eta\,(\mathbf{x}^t - \mathbf{m}_i) & \text{if } \mathbf{x}^t \text{ and } \mathbf{m}_i \text{ have the same class label} \\ -\eta\,(\mathbf{x}^t - \mathbf{m}_i) & \text{otherwise} \end{cases}$$
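    A sketch of one LVQ pass over prelabeled prototypes, following the attract/repel rule above; the learning rate is illustrative.

```python
import numpy as np

def lvq_epoch(X, labels, m, m_labels, eta=0.05):
    """LVQ1 sketch: pull the closest prototype toward x if the class labels
    match, push it away otherwise (prototypes m are prelabeled)."""
    for x, c in zip(X, labels):
        i = np.argmin(np.linalg.norm(x - m, axis=1))   # closest prototype
        if m_labels[i] == c:
            m[i] += eta * (x - m[i])                   # Δm_i =  η (x - m_i)
        else:
            m[i] -= eta * (x - m[i])                   # Δm_i = -η (x - m_i)
    return m
```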

  • Mixture of Experts

    In RBF, each local fit is a constant, w_ih, the second-layer weight

    In MoE, each local fit is a linear function of x, a local expert:

    (Jacobs et al., 1991)

    $$w_{ih}^t = \mathbf{v}_{ih}^T \mathbf{x}^t$$

  • MoE as Models Combined

    Radial gating:

    $$g_h^t = \frac{\exp\!\left[-\|\mathbf{x}^t - \mathbf{m}_h\|^2 / (2 s_h^2)\right]}{\sum_l \exp\!\left[-\|\mathbf{x}^t - \mathbf{m}_l\|^2 / (2 s_l^2)\right]}$$

    Softmax gating:

    $$g_h^t = \frac{\exp\!\left[\mathbf{m}_h^T \mathbf{x}^t\right]}{\sum_l \exp\!\left[\mathbf{m}_l^T \mathbf{x}^t\right]}$$
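    A sketch of a mixture-of-experts prediction with softmax gating and linear experts, for a single output; parameter names and shapes are illustrative.

```python
import numpy as np

def moe_forward(x, V, M):
    """Mixture of experts with softmax gating (single output):
    experts w_h = v_h^T x, gating g_h = softmax(m_h^T x), y = Σ_h g_h w_h.
    V: (H, d) expert weights, M: (H, d) gating weights, x: (d,)."""
    w = V @ x                          # local linear experts
    a = M @ x
    g = np.exp(a - a.max())            # softmax gating (shifted for stability)
    g /= g.sum()
    return g @ w, g, w
```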

  • Cooperative MoE

    Regression:

    $$E(\{\mathbf{m}_h, \mathbf{v}_{ih}\}_{i,h} \mid \mathcal{X}) = \frac{1}{2} \sum_t \sum_i \left(r_i^t - y_i^t\right)^2$$

    $$\Delta\mathbf{v}_{ih} = \eta \sum_t \left(r_i^t - y_i^t\right) g_h^t\, \mathbf{x}^t$$

    $$\Delta m_{hj} = \eta \sum_t \sum_i \left(r_i^t - y_i^t\right)\left(w_{ih}^t - y_i^t\right) g_h^t\, x_j^t$$

  • Competitive MoE: Regression

    $$\mathcal{L}(\{\mathbf{m}_h, \mathbf{v}_{ih}\}_{i,h} \mid \mathcal{X}) = \sum_t \log \sum_h g_h^t \exp\!\left[-\frac{1}{2} \sum_i \left(r_i^t - y_{ih}^t\right)^2\right]$$

    $$y_{ih}^t = \mathbf{v}_{ih}^T \mathbf{x}^t$$

    $$\Delta\mathbf{v}_{ih} = \eta \sum_t \left(r_i^t - y_{ih}^t\right) f_h^t\, \mathbf{x}^t \qquad \Delta\mathbf{m}_h = \eta \sum_t \left(f_h^t - g_h^t\right) \mathbf{x}^t$$

  • Competitive MoE: Classification

    $$\mathcal{L}(\{\mathbf{m}_h, \mathbf{v}_{ih}\}_{i,h} \mid \mathcal{X}) = \sum_t \log \sum_h g_h^t \prod_i \left(y_{ih}^t\right)^{r_i^t} = \sum_t \log \sum_h g_h^t \exp\!\left[\sum_i r_i^t \log y_{ih}^t\right]$$

    $$y_{ih}^t = \frac{\exp w_{ih}^t}{\sum_k \exp w_{kh}^t}, \qquad w_{ih}^t = \mathbf{v}_{ih}^T \mathbf{x}^t$$

  • Hierarchical Mixture of Experts

    Tree of MoE where each MoE is an expert in a higher-level MoE

    Soft decision tree: Takes a weighted (gating) average of all leaves (experts), as opposed to using a single path and a single leaf

    Can be trained using EM (Jordan and Jacobs, 1994)