VII. Radial Basis Function Networks (RBFN)faculty.nps.edu/fargues/teaching/ec4460/ec4460-VII.pdf ·...

8/27/06 EC4460.SuFy06/MPF 1

VII. Radial Basis Function Networks (RBFN)

• Radial function definition

• Radial basis function

– introduction

– examples

– scalar output

– vector output

• Linear model and radial basis functions

• Radial basis function network (RBFN)

– definition

– simple example

– Bayes theorem and MAP algorithm

– EM procedure

– comments

– Example

• Modifications needed to generate a RBFN

• Model complexity issues

• How to select the number of RBFs to use

• Regularization theory

• Network training

• How to select the basis function parameters

• RBFN and BPNN comparisons

• The EM algorithm

Ref: [Hagan, Bishop, Nabney]

Radial Basis Function Networks (RBFN)

• Linear model

– def: a linear model for a function t(x) is of the form $f(x) = \sum_{j} w_j g_j(x)$, where the $g_j(x)$ are the basis functions and the $w_j$ are the weights

– Ex: f(x) = ax + b. What are the associated basis functions?

• Radial functions

– def: functions whose response decreases (or increases) monotonically with the distance from a central point

– Ex: Normal (Gaussian) function, for the scalar and vector cases

* Radial Basis Function

• Introduction

– allows exact interpolation of a data set, which requires every input vector to be mapped exactly onto its specific target vector

– Ex: Given N input vectors $\{x_i\}_{i=1}^{N}$ with associated targets $t_i$, find the transfer function h(·) so that $h: x_i \rightarrow t_i$

– h(·) is a linear model of the form

$$h(x) = \sum_{j=1}^{N} w_j\, g_j(x)$$

– The problem comes down to finding $\{w_j\}_{j=1}^{N}$ such that $h(x_i) = t_i,\; i = 1, \ldots, N$

• How to solve for $\{w_j\}_{j=1}^{N}$?

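$$t_i = \sum_{j=1}^{N} w_j\, g_j(x_i), \quad i = 1, \ldots, N$$

$$\begin{bmatrix} t_1 \\ \vdots \\ t_N \end{bmatrix} = \begin{bmatrix} g_1(x_1) & \cdots & g_N(x_1) \\ \vdots & & \vdots \\ g_1(x_N) & \cdots & g_N(x_N) \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_N \end{bmatrix}$$

$$t = G\, w \;\Longrightarrow\; w = G^{-1} t \quad \text{(when G is invertible)}$$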

• Ex:

– Assume we have the following set $\{x\} = \{1, 2, 3\}$ with associated targets $\{t\} = \{1.1, 1.8, 3.1\}$

– Assume we use the 2 basis functions $g_1(x) = 1$; $g_2(x) = x$, i.e. a model of the form $f(x) = w_1 g_1(x) + w_2 g_2(x) = w_1 + w_2 x$

• Ex: Assume we use the following 3 basis functions: $g_1(x) = 1$; $g_2(x) = x$; $g_3(x) = x^2$
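The two small examples above (x = {1, 2, 3}, t = {1.1, 1.8, 3.1}) can be checked numerically. The sketch below is only an illustration using NumPy; the helper name `design_matrix` is hypothetical and not part of the course material. With 2 basis functions there are more equations than unknowns, so a least-squares fit is used; with 3 basis functions the system is square and the fit is exact.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
t = np.array([1.1, 1.8, 3.1])

def design_matrix(x, basis_functions):
    """Stack g_j(x_i) into G: one row per sample, one column per basis function."""
    return np.column_stack([g(x) for g in basis_functions])

# Two basis functions g1(x) = 1, g2(x) = x: 3 equations, 2 unknowns (least squares).
G2 = design_matrix(x, [np.ones_like, lambda v: v])
w2, *_ = np.linalg.lstsq(G2, t, rcond=None)

# Three basis functions g1 = 1, g2 = x, g3 = x^2: square system, exact interpolation.
G3 = design_matrix(x, [np.ones_like, lambda v: v, lambda v: v ** 2])
w3 = np.linalg.solve(G3, t)

print("2 basis functions: w =", w2, " residual =", t - G2 @ w2)
print("3 basis functions: w =", w3, " residual =", t - G3 @ w3)
```

The 2-function model leaves a nonzero residual at the data points, while the 3-function model passes through all three targets.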

* What have we done in the previous example?

* applied a mapping from one input space to another

$$\{x\} = \{1, 2, 3\} \;\rightarrow\; \{y\} = \left\{ \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 1 \\ 3 \end{pmatrix} \right\}$$

and find weights which satisfy $g(y) = w^T y + w_0$

Example: XOR gate

$$\{x\} = \begin{Bmatrix} x_1 \\ x_2 \end{Bmatrix} \;\rightarrow\; y = \begin{Bmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{Bmatrix}$$

• Linear model and radial basis functions

– Given N input vectors $\{x_i\}_{i=1}^{N}$ with associated targets $t_i$

$$h(x) = \sum_{j=1}^{N} w_j\, \phi\!\left( \lVert x - x_j \rVert_2 \right)$$

– We need to solve for $\{w_j\}_{j=1}^{N}$

* Which φ to use?

• most commonly used is the Gaussian function $\phi(x) = \exp\!\left( -\dfrac{x^2}{2\sigma^2} \right)$

• alternatives:

$$\phi(x) = (x^2 + \sigma^2)^{-\alpha}, \quad \alpha > 0$$

$$\phi(x) = x^2 \ln(x)$$

$$\phi(x) = x^3$$
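As a rough illustration of the candidate basis functions listed above, the sketch below evaluates each one on a few distances r ≥ 0. The function names and the default parameter values (σ = 1, α = 0.5) are assumptions chosen for the example, not values from the slides.

```python
import numpy as np

def gaussian(r, sigma=1.0):
    """phi(r) = exp(-r^2 / (2 sigma^2)); decreases with distance."""
    return np.exp(-np.asarray(r, dtype=float) ** 2 / (2.0 * sigma ** 2))

def inverse_power(r, sigma=1.0, alpha=0.5):
    """phi(r) = (r^2 + sigma^2)^(-alpha), alpha > 0; decreases with distance."""
    return (np.asarray(r, dtype=float) ** 2 + sigma ** 2) ** (-alpha)

def r2_log_r(r):
    """phi(r) = r^2 ln(r), with the limit value 0 used at r = 0."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    out = np.zeros_like(r)
    positive = r > 0
    out[positive] = r[positive] ** 2 * np.log(r[positive])
    return out

def cubic(r):
    """phi(r) = r^3; increases with distance."""
    return np.asarray(r, dtype=float) ** 3

r = np.linspace(0.0, 3.0, 7)
for name, phi in [("gaussian", gaussian), ("inverse power", inverse_power),
                  ("r^2 ln r", r2_log_r), ("cubic", cubic)]:
    print(f"{name:14s}", np.round(phi(r), 3))
```

The Gaussian and inverse-power forms are localized (largest at r = 0), whereas the other two grow with distance.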

* Generalization to Vector Output t

Input vectors $\{x_j\}_{j=1}^{N}$ are associated with output vectors $t_j = [t_{1j}, \ldots, t_{Kj}]^T$ of dimension $K \times 1$

$$\Rightarrow\; h_i(x_j) = t_j(i) = t_{ij}; \quad i = 1, \ldots, K; \; j = 1, \ldots, N$$

where

$$h_i(x) = \sum_{k=1}^{N} w_{ik}\, \phi_k(x); \quad i = 1, \ldots, K$$

N: number of points

K: output dimension

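How to solve for the weights?

For each i = 1, …, K, we need to solve

$$t_{ij} = h_i(x_j) = \sum_{k=1}^{N} w_{ik}\, \phi_k(x_j), \quad j = 1, \ldots, N$$

$$\begin{bmatrix} t_{i1} \\ \vdots \\ t_{iN} \end{bmatrix} = \begin{bmatrix} \phi_1(x_1) & \cdots & \phi_N(x_1) \\ \vdots & & \vdots \\ \phi_1(x_N) & \cdots & \phi_N(x_N) \end{bmatrix} \begin{bmatrix} w_{i1} \\ \vdots \\ w_{iN} \end{bmatrix}, \quad i = 1, \ldots, K$$

$$t_i = \Phi\, w_i, \quad i = 1, \ldots, K$$

$$[t_1 \; \ldots \; t_K] = \Phi\, [w_1 \; \ldots \; w_K] \;\Longrightarrow\; T = \Phi\, W$$

with T of size $N \times K$, Φ of size $N \times N$, and W of size $N \times K$.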
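A minimal sketch of the exact-interpolation case T = ΦW, assuming Gaussian basis functions with one center on each data point. The data set, the width σ = 0.15, and the helper name `gaussian_design` are made up for illustration.

```python
import numpy as np

def gaussian_design(X, centers, sigma):
    """Phi[j, k] = phi_k(x_j) = exp(-||x_j - center_k||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
N, K = 8, 2                                   # N points, K-dimensional targets
X = np.linspace(0.0, 1.0, N).reshape(N, 1)    # inputs (dimension 1 here)
T = np.column_stack([np.sin(2 * np.pi * X[:, 0]),
                     np.cos(2 * np.pi * X[:, 0])])
T += 0.05 * rng.standard_normal((N, K))       # noisy targets, N x K

Phi = gaussian_design(X, X, sigma=0.15)       # N x N, one basis function per point
W = np.linalg.solve(Phi, T)                   # N x K, solves Phi W = T

print("largest interpolation error:", np.abs(Phi @ W - T).max())
```

All K weight columns are obtained in one solve because every output shares the same matrix Φ.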

* Radial Basis Function Network

• Radial basis functions provide an interpolating function which passes exactly through each data point. We need to compute the weights W which lead to this.

• Ex:

Figure 5.1. A simple example of exact interpolation using radial basis functions. A set of 30 data points was generated by sampling the function y = 0.5 + 0.4 sin(2πx), shown by the dashed curve, and adding Gaussian noise with standard deviation 0.05. The solid curve shows the interpolating function which results from using Gaussian basis functions of the form exp(−x²/2σ²) with width parameter σ = 0.067, which corresponds to roughly twice the spacing of the data points. Values for the second-layer weights were found using matrix inversion techniques as discussed in the text.

[Bishop]

Is there a problem with the above result?

- noisy data produces oscillations; we need to average noise effects out to give a smoother curve

- procedure is expensive as N increases

* Modifications Needed to Generate RBFN

1) Number M of basis functions doesn't have to be equal to number N of data points (usually select M << N). Results in computational load decrease but no exact fit (LS fit instead).

2) Each RBF can have a different σ (and y); those can be determined during training.

3) A bias term is introduced.

Before:

$$h_i(x) = \sum_{k=1}^{N} w_{ik}\, \phi_k(x), \qquad \phi_k(x) = \exp\!\left( -\frac{(x - y_k)^2}{2\sigma^2} \right)$$

Now (bias term added):

$$h_i(x) = \sum_{k=1}^{M} w_{ik}\, \phi_k(x) + w_{i0}\, \phi_0(x)$$

ex: $\phi_k(x) = \exp\!\left( -\dfrac{(x - y_k)^2}{2\sigma_k^2} \right)$, with $y_k$, $\sigma_k^2$: basis function mean & var.

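• Consequence of modification

– M < N (number of basis functions < number of data points) ⇒ no exact fit; the weights are obtained from a least-squares fit

• How to select the weights now? Minimize

$$E = \sum_{j=1}^{N} \sum_{i=1}^{K} \left( h_i(x_j) - t_{ij} \right)^2$$

where $t_{ij}$ is the target value for output unit i when the network is presented with input vector $x_j$, and

$$h_i(x_j) = \sum_{k=0}^{M} w_{ik}\, \phi_k(x_j), \quad i = 1, \ldots, K, \; j = 1, \ldots, N, \; M < N$$

$$\begin{bmatrix} t_{i1} \\ \vdots \\ t_{iN} \end{bmatrix} = \begin{bmatrix} \phi_0(x_1) & \cdots & \phi_M(x_1) \\ \vdots & & \vdots \\ \phi_0(x_N) & \cdots & \phi_M(x_N) \end{bmatrix} \begin{bmatrix} w_{i0} \\ \vdots \\ w_{iM} \end{bmatrix}, \quad i = 1, \ldots, K$$

$$t_i = \Phi\, w_i, \quad i = 1, \ldots, K$$

$$[t_1 \; \ldots \; t_K] = \Phi\, [w_1 \; \ldots \; w_K] \;\Longrightarrow\; T = \Phi\, W$$

with T of size $N \times K$, Φ of size $N \times (M+1)$, and W of size $(M+1) \times K$.

$$W = \Phi^{+}\, T$$

where $\Phi^{+}$ is the pseudo-inverse of Φ.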
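The least-squares solution W = Φ⁺T can be sketched as follows, assuming Gaussian basis functions, a bias column φ₀(x) = 1, and centers drawn as a random subset of the inputs. The data set, the seed, and the width σ = 0.4 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: noisy samples of 0.5 + 0.4 sin(2 pi x), stored as an N x K target matrix.
N, M, K = 30, 5, 1
x = np.linspace(0.0, 1.0, N)
T = (0.5 + 0.4 * np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(N)).reshape(N, K)

centers = rng.choice(x, size=M, replace=False)   # assumed: random subset of the inputs
sigma = 0.4                                      # assumed common width

def design_matrix(x, centers, sigma):
    """Return the N x (M+1) matrix Phi with phi_0(x) = 1 in the first column."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

Phi = design_matrix(x, centers, sigma)   # N x (M+1)
W = np.linalg.pinv(Phi) @ T              # (M+1) x K least-squares weights
sse = float(np.sum((Phi @ W - T) ** 2))
print("W shape:", W.shape, " sum-of-squares error:", round(sse, 4))
```

Because M < N, the fit no longer passes through every data point; the pseudo-inverse returns the weights that minimize E.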

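• Ex: XOR gate using RBF

$$X = \left\{ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \end{pmatrix} \right\}, \qquad \{t\} = \{0, 1, 1, 0\}$$

Assume a) M = 2 no bias, b) M = 2 with bias

$$\mu_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \; \mu_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}; \quad \sigma_1^2 = \sigma_2^2 = 1; \quad N = 4, \; K = 1$$

$$\phi_j(x) = \exp\!\left( -\frac{\lVert x - \mu_j \rVert^2}{2\sigma_j^2} \right)$$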
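A small numerical sketch of the XOR example with M = 2 Gaussian basis functions, comparing case a) without bias and case b) with bias. The centers (0, 0) and (1, 1) and the unit variance are assumptions made for this illustration.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # N = 4 inputs
t = np.array([0.0, 1.0, 1.0, 0.0])                            # XOR targets, K = 1

mu = np.array([[0.0, 0.0], [1.0, 1.0]])   # assumed centers of the 2 basis functions
sigma2 = 1.0                              # assumed common variance

d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
Phi = np.exp(-d2 / (2.0 * sigma2))        # 4 x 2 matrix of phi_j(x)

# a) M = 2, no bias: 4 equations, 2 unknowns
w_a, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# b) M = 2, with bias: prepend the column phi_0(x) = 1
Phi_b = np.hstack([np.ones((4, 1)), Phi])
w_b, *_ = np.linalg.lstsq(Phi_b, t, rcond=None)

print("a) outputs without bias:", np.round(Phi @ w_a, 3))
print("b) outputs with bias   :", np.round(Phi_b @ w_b, 3))
```

Under these assumptions the network with a bias term reproduces the XOR targets, while the one without bias only reaches a least-squares compromise.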


* Correspondence with MATLAB Notations

• The ||dist|| box in the above figure accepts the input vector p and the input weight matrix IW^{1,1}, and produces a vector having S^1 elements. The elements are the distances between the input vector and the vectors _iIW^{1,1} formed from the rows of the input weight matrix.

• The bias vector b^1 and the output of ||dist|| are combined with the MATLAB operation .* , which does element-by-element multiplication.

Figure labels: # of basis functions (M); output dim (K)

* Model Complexity → caution: do not overfit the model

Figure 1.10. A schematic example of vectors in two dimensions (x1, x2) belonging to two classes shown by crosses and circles. The solid curve shows the decision boundary of a simple model which gives relatively poor separation of the two classes. (Slide annotation: more error in training data but no overfit; much better.)

Figure 1.11. As in Figure 1.10, but showing the decision boundary corresponding to a more flexible model, which gives better separation of the training data. (Slide annotation: too good a fit for training data; bad.)

[Bishop]

Conclusion: Do not select too complicated a model.

Recall: great training error performance ⇏ good performance with testing data!

How to Select the Number of RBFs to Use

• Progressively add basis functions and see how it affects the overall performance.

• Use a model selection criterion

– Bayesian information criterion (BIC)

– Generalized cross-validation criterion (GCV)

• Use cross-validation

– divide the training set into S distinct segments

– train the network on S-1 segments and test it on the Sth one

– repeat the process for each of the S segments and average the test errors over the S results
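The S-segment procedure above can be sketched as follows for choosing the number of basis functions M. This is only an illustration: the toy data, the evenly spaced centers, the width σ = 0.2, and the function names are assumptions, not part of the course material.

```python
import numpy as np

def design_matrix(x, centers, sigma):
    """Bias column followed by one Gaussian basis function per center."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

def cv_error(x, t, M, sigma, S=5, seed=0):
    """Average test mean-squared error over S train/test splits."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), S)   # S distinct segments
    centers = np.linspace(x.min(), x.max(), M)           # assumed: evenly spaced centers
    errors = []
    for s in range(S):
        test = folds[s]
        train = np.concatenate([folds[r] for r in range(S) if r != s])
        w, *_ = np.linalg.lstsq(design_matrix(x[train], centers, sigma),
                                t[train], rcond=None)
        residual = design_matrix(x[test], centers, sigma) @ w - t[test]
        errors.append(np.mean(residual ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(30)

scores = {M: cv_error(x, t, M, sigma=0.2) for M in range(1, 11)}
best_M = min(scores, key=scores.get)
print("cross-validation error per M:", {M: round(e, 4) for M, e in scores.items()})
print("selected M:", best_M)
```

The M with the lowest averaged test error is retained; a different data set or width would generally lead to a different choice.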

* Regularization Theory

• Goal: keep the complex model but avoid model overfit

– Need to control the smoothness properties of the mapping.

– Add to the error function an extra term which penalizes mappings which are not smooth.

$$E_{reg} = \frac{1}{2} \sum_{i=1}^{N} \left( g(x_i) - t_i \right)^2 + \alpha \int F\big( g(x) \big)\, dx$$

where F(·) is a penalty function, e.g. one type of function chosen so that a large curvature in g(x) results in large values for the integral expression.

Figure 5.1. A simple example of exact interpolation using radial basis functions. A set of 30 data points was generated by sampling the function y = 0.5 + 0.4 sin(2πx), shown by the dashed curve, and adding Gaussian noise with standard deviation 0.05. The solid curve shows the interpolating function which results from using Gaussian basis functions of the form (5.5) with width parameter σ = 0.067, which corresponds to roughly twice the spacing of the data points. Values for the second-layer weights were found using matrix inversion techniques as discussed in the text.

Figure 5.6. This shows the same data set as in Figure 5.1, again with one basis function centered on each data point, and a width parameter σ = 0.067. In this case, however, a regularization term is used, with coefficient α = 40, leading to a smoother mapping (shown by the solid curve) which no longer gives an exact fit to the data, but which now gives a much better approximation to the underlying function which generated the data (shown by the dashed curve).

[Bishop]
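To give a feel for the effect of a regularization term, the sketch below compares exact interpolation with a regularized fit on a data set like the one in Figure 5.1. Note the swap: instead of the curvature integral, it uses a simpler weight-decay (ridge) penalty α‖w‖², which is an assumption made to keep the example short; the value α = 0.1 is likewise arbitrary and unrelated to the α = 40 quoted in Figure 5.6.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, alpha = 30, 0.067, 0.1          # alpha: assumed ridge coefficient
x = np.linspace(0.0, 1.0, N)
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(N)

# One Gaussian basis function centered on each data point: Phi is N x N.
Phi = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * sigma ** 2))

# Exact interpolation: Phi w = t.
w_exact = np.linalg.solve(Phi, t)

# Ridge-regularized weights: minimize ||Phi w - t||^2 + alpha ||w||^2.
w_reg = np.linalg.solve(Phi.T @ Phi + alpha * np.eye(N), Phi.T @ t)

print("max |residual|, exact      :", np.abs(Phi @ w_exact - t).max())
print("max |residual|, regularized:", np.abs(Phi @ w_reg - t).max())
print("weight norms (exact, regularized):",
      np.linalg.norm(w_exact), np.linalg.norm(w_reg))
```

The regularized solution gives up the exact fit at the data points in exchange for much smaller weights.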

Network Training

• Recall: N data points and M basis functions are used, with associated target vectors t of dimension K × 1

• During training, find weights $w_{ik}$ so that

$$h_i(x) = \sum_{k=0}^{M} w_{ik}\, \phi_k(x), \quad i = 1, \ldots, K$$

where $w_{i0}$ is the bias and $\phi_0(x) = 1$.

• Using MATLAB notation, the output vector is

$$a^2 = \begin{bmatrix} h_1 \\ \vdots \\ h_K \end{bmatrix}$$

Notation → MATLAB: M → S1; K → S2

• Training issues

– Recall generic $\phi_k(x) = \exp\!\left[ -(x - \mu_k)^T\, \Sigma^{-1} (x - \mu_k) \right]$

– How to find basis function parameters? (µj, σj, Σj)

– What is the effect of varying σj?


Figure 5.3. This shows the same set of 30 data points as in Figure 5.1, together with a network mapping (solid curve) in which the number of basis functions has been set to 5, which is significantly fewer than the number of data points. The centres of the basis functions have been set to a random subset of the data set input vectors, and the width parameters of the basis functions have been set to a common value of σ = 0.4, which again is roughly equal to twice the average spacing between the centres. The second-layer weights are found by minimizing a sum-of-squares error function using singular value decomposition.

Figure 5.4. As in Figure 5.3, but in which the width parameter has been set to σ = 0.08. The resulting network function is insufficiently smooth and gives a poor representation of the underlying function which generated the data.

Figure 5.5. As in Figure 5.3, but in which the width parameter has been set to σ = 10.0. This leads to a network function which is over-smoothed, and which again gives a poor representation of the underlying function which generated the data.

[Bishop]
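The effect of the width parameter described in Figures 5.3 to 5.5 can be explored with a short experiment. The sketch below is not a reproduction of Bishop's figures: the random seed, the bias column, and the error measure are assumptions, so the printed numbers are only indicative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 30, 5
x = np.linspace(0.0, 1.0, N)
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(N)
centers = rng.choice(x, size=M, replace=False)   # random subset of the inputs

def design_matrix(x, centers, sigma):
    """Bias column followed by M Gaussian basis functions of common width sigma."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

x_dense = np.linspace(0.0, 1.0, 200)
truth = 0.5 + 0.4 * np.sin(2 * np.pi * x_dense)  # underlying noise-free function

rms_error = {}
for sigma in (0.08, 0.4, 10.0):
    # lstsq uses an SVD-based solver, which copes with nearly dependent columns.
    w, *_ = np.linalg.lstsq(design_matrix(x, centers, sigma), t, rcond=None)
    prediction = design_matrix(x_dense, centers, sigma) @ w
    rms_error[sigma] = float(np.sqrt(np.mean((prediction - truth) ** 2)))
    print(f"sigma = {sigma:5.2f}  RMS error vs. underlying function = {rms_error[sigma]:.3f}")
```

Plotting `prediction` against `x_dense` for each σ is a simple way to see the under-smoothed and over-smoothed behavior the captions describe.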

• How to Select Basis Function Parameters

– RBFs do approximation, noisy interpolation, … well.

– RBF parameters should be selected to form a representation of the data probability density.

– An unsupervised procedure is used to select the parameters (view µj as input vector prototypes).

– Trained RBF network plane for inputs of dim 2 × 1.

– NN computed within the training data range. What happens outside this range? We don't know!

– Selection of input data can be crucial.

Figure 5.9. A schematic example of a function y(x1) of an input variable x1 which has been modeled using a set of radial basis functions.

Figure 5.10. As in Figure 5.9, but in which an extra, irrelevant variable x2 has been introduced. Note that the number of basis functions, whose locations are determined using the input data alone, has increased dramatically, even though x2 carries no useful information for determining the output variable.

[Bishop]

Basis Function Parameter Selection Options

µ's

• Select µj equal to a random subset of the input vectors from the training set.

• Use all data points as µj and selectively remove µ's to have minimum disruption of system performance.

• Use a clustering technique to find a set of µ's which represent the data (k-means is an option).

• Use the Gaussian mixture model (EM algorithm).

σ's

• Select all σ's the same and equal to some multiple of the average distance between basis function centers µ. (Some authors recommend 1/4 of the distance between clusters.)

• Select each σ differently, from the average distance of each basis function to its "L" nearest neighbors.
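Two of the options listed above can be sketched together: k-means clustering to place the centers µ, followed by a per-center width σ taken as the average distance to the L nearest neighboring centers. The implementation below is a plain NumPy illustration; the function names, the toy data, and the values M = 3 and L = 2 are assumptions.

```python
import numpy as np

def kmeans(X, M, n_iter=50, seed=0):
    """Basic k-means (Lloyd's algorithm): returns M cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=M, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)              # assign each point to nearest center
        for k in range(M):
            members = X[labels == k]
            if len(members):                    # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
    return centers

def widths_from_neighbors(centers, L=2):
    """sigma_k = average distance from center k to its L nearest other centers."""
    d = np.sqrt(((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    d.sort(axis=1)                              # column 0 is the zero self-distance
    return d[:, 1:L + 1].mean(axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(40, 2))
               for loc in ([0.0, 0.0], [3.0, 0.0], [0.0, 3.0])])

mu = kmeans(X, M=3)
sigma = widths_from_neighbors(mu, L=2)
print("centers:\n", np.round(mu, 2))
print("widths:", np.round(sigma, 2))
```

Both steps use the inputs only; the target values play no role until the second-layer weights are fitted.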

Potential problem with using a clustering technique to find µ's

Figure 5.11. A simple example to illustrate why the use of unsupervised methods based on density estimation to determine the basis function parameters need not be optimal for approximating the target function. Data in one dimension (shown by the circles) is generated from a Gaussian distribution p(x) shown by the dashed curve. Unsupervised training of one Gaussian basis function would cause it to be centered at x = a, giving a good approximation to p(x). Target values for the input data are generated from a Gaussian function centered at b shown by the solid curve. The basis function centered at a can only give a very poor representation of h(x). By contrast, if the basis function were centered at b it could represent the function h(x) exactly.

[Bishop]


Function approximation with RBFN

[Mathworks]

SSE target = 0.02, Spread = 1

SSE target = 0.02, Spread = 0.01

RBFN - Underlapping neurons (spread selected too small)

SSE target = 0.02, Spread = 100

RBFN - Overlapping neurons (spread selected too large)

RBFN and BPNN Comparisons

Both provide techniques for approximating non-linear functions.

• RBFN: Hidden units use distance to a prototype vector followed by a transformation with a local function (Normal).
  BPNN: Hidden units depend on weighted linear summation of inputs transformed by non-linear monotonic activation functions.

• RBFN: Use of localized basis functions leads to a representation in the space of hidden units which is LOCAL, as for a given input vector, typically only a few hidden units have significant activations.
  BPNN: Distributed representation. Many hidden units contribute to the determination of outputs (due to cross coupling between units).

• RBFN: Simple architecture.
  BPNN: Complex architecture.

• RBFN: Usually a two-step process: (1) µ's, σ's; (2) weights.
  BPNN: All parameters determined at same time.

• RBFN: leads to "hyper-ellipsoids".
  BPNN: leads to "hyper-planes".

IV Basis Function Parameter Selection Options

• Why?

– In practice we don't know the class densities ⇒ need to estimate them

– In lots of applications the class density can be approximated as a combination of Gaussians (called a Gaussian mixture).

$$p(x \mid C_k) = \sum_{i=1}^{M} a_i\, p_i^{(k)}(x \mid C_k)$$

where $a_i$ = mixture weight and M = # of components, and

$$p_i^{(k)}(x \mid C_k) = N\!\left( m_i^{(k)}, \Sigma_i^{(k)} \right)$$

with mean vector $m_i^{(k)}$ and covariance matrix $\Sigma_i^{(k)}$.

Unknown parameters to estimate:

– $a_i$

– $m_i^{(k)}$

– $\Sigma_i^{(k)}$

– M (sometimes M assumed known)

EM Goal: Estimate parameters not directly accessible to user from the data.

• EM procedure

Initialization: choose an initial set of values for the unknown parameters

E Step: assign data to the model that fits it best

M Step: update the parameters of each model using only the data assigned to it

Iterate until the unknown parameters converge

* A Simple EM Example

• You are given a set of data {xi} = x; each data point was generated as follows:

(1) It belongs to one of two normal pdf's with uniform probability.

(2) Assume the normal pdf's have some variance σ² known, means unknown.

• Goal: find the means of both distributions given data {xi}.

⇒ find hypothesis h = (m1, m2) that maximizes P(x|h), where h holds the means (which are hidden to user) and x is the data observed {xi}

• Tools to use:

– Bayes theorem

$$P(Y \mid X) = \frac{P(Y, X)}{P(X)} = \frac{P(X \mid Y)\, P(Y)}{P(X)}$$

P(Y|X) increases with P(Y) and P(X|Y)

– Normal pdf definition

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)$$

Information Detours:

(1) Bayes Theorem: provides a way to calculate the posterior probability of a hypothesis for given data.

(2) Maximum a-posteriori (MAP) learning algorithm:

Example on data x: generated as one of two classes C1 or C2

– Compute a-posteriori probabilities P(Ci|x), i = 1, …, K (probability that the class is Ci after making a measurement x)

– Assign x to class j (or "hypothesis" j, for the generic problem) if: P(Cj|x) > P(Ci|x), i = 1, …, K, j ≠ i

Application of the MAP approach to mean estimation:

(3) If {xi} comes from a unique normal pdf, how do you compute the mean µ of the pdf?

Note: generic mean is defined as:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$$

Following MAP approach

$$\mu_{map} = \arg\max_{\mu \in H} P(\mu \mid X) = \arg\max_{\mu \in H} \frac{P(X \mid \mu)\, P(\mu)}{P(X)}$$

Note: P(X) is independent of µ.

Note: if all hypotheses are equally probable, P(µ) = constant ==> can be dropped.

$$P(X \mid \mu) = \prod_{i=1}^{m} P(x_i \mid \mu) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{1}{2\sigma^2} (x_i - \mu)^2 \right]$$

Taking the log on both sides:

$$\ln\left[ P(X \mid \mu) \right] = \sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} (x_i - \mu)^2 \right)$$

$$\Longrightarrow\; \mu_{map} = \arg\min_{\mu \in H} \left[ \sum_{i=1}^{m} \frac{1}{2\sigma^2} (x_i - \mu)^2 \right] \quad \text{(because of the negative sign)}$$

$$= \arg\min_{\mu \in H} \left[ \sum_{i=1}^{m} (x_i - \mu)^2 \right]$$

==> $\mu_{map}$ obtained when:

$$\frac{\partial}{\partial \mu} \left[ \sum_{i=1}^{m} (x_i - \mu)^2 \right] = 0$$

$$-2 \sum_{i=1}^{m} (x_i - \mu) = 0$$

$$\sum_{i=1}^{m} (x_i - \mu) = 0$$

$$\sum_{i=1}^{m} x_i - \sum_{i=1}^{m} \mu = 0$$

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$$
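The result of this derivation, that the MAP estimate of µ under a flat prior is the sample mean, can be checked numerically. The sketch below evaluates the log-likelihood on a grid of candidate means and compares the maximizer with the closed-form answer; the sample size, true mean, σ, and grid resolution are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, true_mu = 1.5, 2.0
x = rng.normal(true_mu, sigma, size=200)      # data from a single normal pdf

def log_likelihood(mu, x, sigma):
    """Ln P(X | mu) for independent samples with known sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (x - mu) ** 2 / (2 * sigma ** 2))

grid = np.linspace(x.min(), x.max(), 2001)    # candidate hypotheses for mu
ll = np.array([log_likelihood(m, x, sigma) for m in grid])

mu_grid = grid[ll.argmax()]                   # numerical maximizer on the grid
mu_closed = x.mean()                          # closed-form result of the derivation

print("grid-search estimate:", mu_grid)
print("sample mean         :", mu_closed)
```

The two estimates agree to within the grid spacing, as the derivation predicts.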

* can be extended to 2 densities: Given a data set {xi} = x where:

(1) Each data point is generated as one of two classes with normal density (σ² known), mean mi unknown.

(2) Find (m1, m2)

• Question 1: how is the 2-class data generated?

Each data point is generated by a single Gaussian: a class c is drawn with probability P(c), then x is drawn from $N(x \mid m_1, \sigma^2)$ if c = 1 or from $N(x \mid m_2, \sigma^2)$ if c = 2.

$$p(x) = \sum_{i=1}^{2} p(c = i)\, N(x \mid m_i, \sigma^2)$$

where p(c = i) is the mixing proportion and $N(x \mid m_i, \sigma^2)$ is the normal distribution.

Data characteristics:

• 2 classes exist c = {1, 2}

• each class c occurs with a specific frequency P(c)

• Examples of each class c have a specific distribution p(x|c)

• Question 2: which information do we have available?

x1, …, xN

We need to estimate m1 and m2

• Question 3: how to fully describe each xi?

$$Y_k = \{ x_k, z_1^{(k)}, z_2^{(k)} \}$$

where $x_k$ is the observed variable and the $z_i^{(k)}$ are hidden variables (unknown to user):

$$z_i^{(k)} = \begin{cases} 1 & \text{if } x_k \text{ in class } i \\ 0 & \text{otherwise} \end{cases}$$

How may the data be represented?

x     c     z = [z1, z2]
x1    1     [0, 1]
x2    0     [1, 0]
x3    0     [1, 0]

Data may get exactly characterized using a vector z, with the ith component of z equal to 1 when xk belongs to class i and zero otherwise

$$z_i^{(k)} = \begin{cases} 1 & \text{if } x_k \text{ in class } i \\ 0 & \text{otherwise} \end{cases}$$

• Recall we need to estimate m1 and m2 but we don't know which class each data point belongs to

• Question 4: how to compute the unknown membership information of each data point?

• Assign a class membership to each data sample based on some available information. What do we need to do that?

• Define new variable $\theta = \{ P(z_1 = 1), m_1, P(z_2 = 1), m_2 \}$ for the 2-class problem

We can define the probability that data belongs to a given class [using Bayes th]:

$$P(z_i = 1 \mid x, \theta) = \frac{p(x \mid z_i = 1, \theta)\, P(z_i = 1)}{p(x)} = \frac{p(x \mid z_i = 1, m_i)\, P(z_i = 1)}{\sum_{j=1}^{2} p(x \mid z_j = 1, m_j)\, P(z_j = 1)}$$

If you don't know which class the data belong to, represent class membership with a probability

x     c     z = [z1, z2]
x1    1     P(z1 = 1|θ), P(z2 = 1|θ)
x2    0     P(z1 = 1|θ), P(z2 = 1|θ)
x3    0     P(z1 = 1|θ), P(z2 = 1|θ)

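* Why use that additional z parameter?

$$z_i^{(k)} = \begin{cases} 1 & \text{if } x_k \text{ in class } i \\ 0 & \text{otherwise} \end{cases}$$

When z is known:

$$n_i = \sum_{k=1}^{N} z_i^{(k)} \quad \text{(number of samples in class i)}$$

$$P(c = i) = \frac{n_i}{N} \quad \text{(probability of occurrence of class i)}$$

$$m_i = \frac{1}{n_i} \sum_{k=1}^{N} z_i^{(k)} x_k \quad \text{(mean of class i)}$$

When z is unknown, it is replaced by the membership probability $\alpha_i^{(k)} = P(z_i = 1 \mid x_k, \theta)$:

$$n_i = \sum_{k=1}^{N} \alpha_i^{(k)}$$

$$P(c = i) = \frac{n_i}{N}$$

$$m_i = \frac{1}{n_i} \sum_{k=1}^{N} \alpha_i^{(k)} x_k$$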
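The update equations above can be assembled into a short EM loop for the two-class problem with known σ. The sketch below uses the soft memberships α (rather than a hard assignment) in the E step; the synthetic data, the initial guess, and the fixed number of iterations are assumptions made for the illustration.

```python
import numpy as np

def normal_pdf(x, m, sigma):
    """Normal density N(x | m, sigma^2)."""
    return np.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def em_two_means(x, sigma, m_init, n_iter=100):
    """Estimate the two class means (and class probabilities) with known sigma."""
    m = np.array(m_init, dtype=float)
    prior = np.array([0.5, 0.5])                  # initial P(c = i)
    for _ in range(n_iter):
        # E step: alpha[k, i] = P(z_i = 1 | x_k, theta), via Bayes theorem
        joint = prior[None, :] * normal_pdf(x[:, None], m[None, :], sigma)
        alpha = joint / joint.sum(axis=1, keepdims=True)
        # M step: n_i, P(c = i) and m_i from the soft memberships
        n = alpha.sum(axis=0)
        prior = n / len(x)
        m = (alpha * x[:, None]).sum(axis=0) / n
    return m, prior, alpha

rng = np.random.default_rng(0)
sigma = 1.0
x = np.concatenate([rng.normal(-2.0, sigma, 300), rng.normal(3.0, sigma, 300)])

m, prior, alpha = em_two_means(x, sigma, m_init=[-1.0, 1.0])
print("estimated means:", np.round(m, 2))
print("estimated class probabilities:", np.round(prior, 2))
```

A convergence test on the change in the means could replace the fixed iteration count; it is omitted here to keep the sketch short.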

The hidden variable z allows the data parameters to be specified completely

Hidden variable z, with P(z) → data collected x, with P(x|z): $N(x \mid z_1 = 1, m_1, \sigma^2)$ or $N(x \mid z_2 = 1, m_2, \sigma^2)$

$$P(x) = \sum_{i=1}^{2} P(x, z_i = 1) = \sum_{i=1}^{2} P(z_i = 1)\, P(x \mid z_i = 1)$$

FIGURE 9.1 In the center is a histogram for 1000 data points that were sampled from a mixture of two Gaussians. The correct mixture parameters are means of -4.0 and 5.0, σ's of 2.0 and 1.5, and equal weights of 0.5. The outermost Gaussian shapes correspond to an initial (poor) guess for the mixture parameters, consisting of means of -12 and 11, σ's of 3.5 and 2.6, and equal weights. The log likelihood of the data for this guess was -5482. After one iteration of EM, the density in the middle is obtained, which is quite a good fit to the data. The corresponding means are -4.1 and 5.0, σ's are 2.0 and 1.5, and weights are 0.49 and 0.51. The log likelihood for this new estimate was -2646. Further iterations did not appreciably change the parameters or the likelihood for this simple example. The vertical axis for the densities is scaled up by 1000 to match the histograms, i.e., having an integral of 1000.

Example: classification with EM (easy problem)

[Figure: panels showing the initial guess, the 5th, 10th, and 15th iterations, the true labeled data, and the posterior probabilities.]

To plot each distribution, I draw an ellipse centered at the mean with an area that covers the expected location of the 1st, 2nd, and 3rd quartile of the data.

Colors indicate probability of belonging to one class or another.

Ref: [http://www.bme.jhu.edu/~reza/Courses/learningtheory_files/EM_1.ppt]

Example: classification with EM (hard problem)

[Figure: panels showing the initial guess, the 30th and 60th iterations, the true labeled data, and the posterior probabilities.]
