Internet Engineering
Jacek Mazurkiewicz, PhD

Softcomputing
Part 13: Radial Basis Function Networks
Contents
• Overview
• The Models of Function Approximator
• The Radial Basis Function Networks
• RBFN's for Function Approximation
• The Projection Matrix
• Learning the Kernels
• Bias-Variance Dilemma
• The Effective Number of Parameters
• Model Selection
RBF
• Linear models have been studied in statistics for about 200 years, and the theory applies to RBF networks, which are just one particular type of linear model.
• However, the fashion for neural networks that started in the mid-1980s has given rise to new names for concepts already familiar to statisticians.
Typical Applications of NN
• Pattern Classification: $l = f(\mathbf{x})$, $\mathbf{x} \in X \subset R^n$, $l \in C = \{1, \ldots, N\}$
• Function Approximation: $\mathbf{y} = f(\mathbf{x})$, $\mathbf{x} \in X \subset R^n$, $\mathbf{y} \in Y \subset R^m$
• Time-Series Forecasting: $\mathbf{x}_t = f(\mathbf{x}_{t-1}, \mathbf{x}_{t-2}, \mathbf{x}_{t-3}, \ldots)$
Function Approximation
• Unknown function: $f : X \to Y$, $X \subset R^m$, $Y \subset R^n$
• Approximator: $\hat{f} : X \to Y$
Linear Models

$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i \, \phi_i(\mathbf{x})$$

where the $\phi_i$ are fixed basis functions and the $w_i$ are weights.
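A minimal sketch of such a linear model, with three fixed 1-D Gaussian basis functions; the centers, width, and weight values below are arbitrary illustrative choices:

```python
import numpy as np

# A linear model in the RBF sense: fixed basis functions, trainable weights.
centers = np.array([-1.0, 0.0, 1.0])   # fixed basis centers (illustrative)
sigma = 0.5                            # fixed common width (illustrative)

def phi(x):
    """Evaluate the m fixed Gaussian basis functions at a scalar input x."""
    return np.exp(-(x - centers) ** 2 / (2 * sigma ** 2))

def f(x, w):
    """Linearly weighted output: f(x) = sum_i w_i * phi_i(x)."""
    return np.dot(w, phi(x))

w = np.array([0.5, 1.0, -0.5])         # only these weights are learned
print(f(0.0, w))
```

Only the weights enter linearly, which is what makes the closed-form training discussed later possible.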
Linear Models

Network topology: the inputs $x_1, \ldots, x_n$ (feature vectors) feed $m$ hidden units $\phi_1, \ldots, \phi_m$, whose linearly weighted outputs with weights $w_1, \ldots, w_m$ give

$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \, \phi_i(\mathbf{x})$$

The hidden layer performs decomposition, feature extraction, and transformation.
Example Linear Models

$$f(x) = \sum_i w_i \, \phi_i(x)$$

• Polynomial: $\phi_i(x) = x^i$, $i = 0, 1, 2, \ldots$
• Fourier Series: $\phi_k(x) = \exp(j 2\pi k \omega_0 x)$, $k = 0, 1, 2, \ldots$
Single-Layer Perceptrons as Universal Approximators

With a sufficient number of sigmoidal hidden units $\phi_i$, the network $y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \, \phi_i(\mathbf{x})$ can be a universal approximator.
Radial Basis Function Networks as Universal Approximators

With a sufficient number of radial-basis-function units $\phi_i$, the network $y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \, \phi_i(\mathbf{x})$ can also be a universal approximator.
Non-Linear Models

$$f(\mathbf{x}) = \sum_{i=1}^{m} w_i \, \phi_i(\mathbf{x})$$

Here both the weights $w_i$ and the basis functions $\phi_i$ are adjusted by the learning process.
Radial Basis Functions

Three parameters for a radial function $\phi_i(\mathbf{x}) = \phi(\|\mathbf{x} - \mathbf{x}_i\|)$:
• Center: $\mathbf{x}_i$
• Distance measure: $r = \|\mathbf{x} - \mathbf{x}_i\|$
• Shape: $\phi$
Typical Radial Functions
• Gaussian: $\phi(r) = e^{-r^2 / 2\sigma^2}$, $\sigma > 0$ and $r \in R$
• Hardy's Multiquadric (1971): $\phi(r) = \sqrt{r^2 + c^2}$, $c > 0$ and $r \in R$
• Inverse Multiquadric: $\phi(r) = c \,/ \sqrt{r^2 + c^2}$, $c > 0$ and $r \in R$
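The three radial functions can be sketched as follows; the width $\sigma$ and shape parameter $c$ are illustrative defaults, and the inverse multiquadric is written with $c$ in the numerator so that $\phi(0) = 1$:

```python
import numpy as np

# Each function depends on the input only through the distance r = ||x - x_i||.
def gaussian(r, sigma=1.0):
    return np.exp(-r ** 2 / (2 * sigma ** 2))

def multiquadric(r, c=1.0):            # Hardy's multiquadric (grows with r)
    return np.sqrt(r ** 2 + c ** 2)

def inverse_multiquadric(r, c=1.0):    # normalized so phi(0) = 1
    return c / np.sqrt(r ** 2 + c ** 2)

print(gaussian(0.0), multiquadric(0.0), inverse_multiquadric(0.0))
```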
Gaussian Basis Function ($\sigma = 0.5, 1.0, 1.5$)

[Figure: the Gaussian $\phi(r) = e^{-r^2/2\sigma^2}$ plotted for $\sigma = 0.5$, $1.0$, $1.5$.]
Inverse Multiquadric

[Figure: $\phi(r) = c / \sqrt{r^2 + c^2}$ plotted for $c = 1, \ldots, 5$ over $r \in [-10, 10]$; values range from 0 to 1.]
Most General RBF

$$\phi_i(\mathbf{x}) = \phi\!\left( (\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right)$$

with centers $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\mu}_3, \ldots$ The basis $\{\phi_i : i = 1, 2, \ldots\}$ is 'nearly' orthogonal.
Properties of RBF's
• On-center, off-surround.
• Analogies with localized receptive fields found in several biological structures, e.g.:
– visual cortex;
– ganglion cells.
The Topology of RBF

As a function approximator: the inputs $x_1, \ldots, x_n$ (feature vectors) are projected onto the hidden units, and the output units $y_1, \ldots, y_m$ interpolate.
The Topology of RBF

As a pattern classifier: the hidden units capture subclasses, and the output units represent classes.
• Radial basis function (RBF) networks are feed-forward networks trained using a supervised training algorithm.
• The activation function is selected from a class of functions called basis functions.
• They usually train much faster than BP networks.
• They are less susceptible to problems with non-stationary inputs.
• Popularized by Broomhead and Lowe (1988), and Moody and Darken (1989), RBF networks have proven to be a useful neural network architecture.
• The major difference between RBF and BP is the behavior of the single hidden layer.
• Rather than using the sigmoidal or S-shaped activation function as in BP, the hidden units in RBF networks use a Gaussian or some other basis kernel function.
The idea

[Figure: training data sampled from an unknown function $y(x)$ that is to be approximated.]
The idea

[Figure: basis functions (kernels) placed along $x$; the function learned is $y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \, \phi_i(\mathbf{x})$.]
The idea

[Figure: the learned function $y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \, \phi_i(\mathbf{x})$ evaluated at a non-training sample, with the kernels shown.]
The idea

[Figure: the learned function evaluated at a non-training sample.]
Radial Basis Function Networks as Universal Approximators

$$y = f(\mathbf{x}) = \sum_{i=1}^{m} w_i \, \phi_i(\mathbf{x})$$

Training set: $T = \left\{ \left( \mathbf{x}^{(k)}, y^{(k)} \right) \right\}_{k=1}^{p}$

Goal: $y^{(k)} \approx f(\mathbf{x}^{(k)})$ for all $k$:

$$\min \; SSE = \sum_{k=1}^{p} \left( y^{(k)} - f(\mathbf{x}^{(k)}) \right)^2 = \sum_{k=1}^{p} \left( y^{(k)} - \sum_{i=1}^{m} w_i \, \phi_i(\mathbf{x}^{(k)}) \right)^2$$
Learn the Optimal Weight Vector

Training set: $T = \left\{ \left( \mathbf{x}^{(k)}, y^{(k)} \right) \right\}_{k=1}^{p}$; goal: $y^{(k)} \approx f(\mathbf{x}^{(k)}) = \sum_{i=1}^{m} w_i \, \phi_i(\mathbf{x}^{(k)})$ for all $k$:

$$\min \; SSE = \sum_{k=1}^{p} \left( y^{(k)} - f(\mathbf{x}^{(k)}) \right)^2 = \sum_{k=1}^{p} \left( y^{(k)} - \sum_{i=1}^{m} w_i \, \phi_i(\mathbf{x}^{(k)}) \right)^2$$
Regularization

$$\min \; C = \sum_{k=1}^{p} \left( y^{(k)} - f(\mathbf{x}^{(k)}) \right)^2 + \sum_{i=1}^{m} \lambda_i w_i^2, \qquad \lambda_i \ge 0$$

If regularization is unneeded, set $\lambda_i = 0$.
Learn the Optimal Weight Vector

Minimize $C = \sum_{k=1}^{p} \left( y^{(k)} - f(\mathbf{x}^{(k)}) \right)^2 + \sum_{i=1}^{m} \lambda_i w_i^2$:

$$\frac{\partial C}{\partial w_j} = -2 \sum_{k=1}^{p} \left( y^{(k)} - f(\mathbf{x}^{(k)}) \right) \frac{\partial f(\mathbf{x}^{(k)})}{\partial w_j} + 2 \lambda_j w_j = -2 \sum_{k=1}^{p} \left( y^{(k)} - f(\mathbf{x}^{(k)}) \right) \phi_j(\mathbf{x}^{(k)}) + 2 \lambda_j w_j = 0$$

$$\Rightarrow \quad \sum_{k=1}^{p} f^*(\mathbf{x}^{(k)}) \, \phi_j(\mathbf{x}^{(k)}) + \lambda_j w_j^* = \sum_{k=1}^{p} y^{(k)} \phi_j(\mathbf{x}^{(k)})$$
Learn the Optimal Weight Vector

Define $\boldsymbol{\phi}_j = \left( \phi_j(\mathbf{x}^{(1)}), \ldots, \phi_j(\mathbf{x}^{(p)}) \right)^T$, $\mathbf{f}^* = \left( f^*(\mathbf{x}^{(1)}), \ldots, f^*(\mathbf{x}^{(p)}) \right)^T$, $\mathbf{y} = \left( y^{(1)}, \ldots, y^{(p)} \right)^T$.

Then $\sum_{k=1}^{p} f^*(\mathbf{x}^{(k)}) \, \phi_j(\mathbf{x}^{(k)}) + \lambda_j w_j^* = \sum_{k=1}^{p} y^{(k)} \phi_j(\mathbf{x}^{(k)})$ becomes

$$\boldsymbol{\phi}_j^T \mathbf{f}^* + \lambda_j w_j^* = \boldsymbol{\phi}_j^T \mathbf{y}, \qquad j = 1, \ldots, m$$
Learn the Optimal Weight Vector

Stacking the $m$ equations $\boldsymbol{\phi}_j^T \mathbf{f}^* + \lambda_j w_j^* = \boldsymbol{\phi}_j^T \mathbf{y}$, $j = 1, \ldots, m$, and defining

$$\boldsymbol{\Phi} = \left( \boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \ldots, \boldsymbol{\phi}_m \right), \qquad \boldsymbol{\Lambda} = \mathrm{diag}\left( \lambda_1, \lambda_2, \ldots, \lambda_m \right), \qquad \mathbf{w}^* = \left( w_1^*, w_2^*, \ldots, w_m^* \right)^T$$

gives

$$\boldsymbol{\Phi}^T \mathbf{f}^* + \boldsymbol{\Lambda} \mathbf{w}^* = \boldsymbol{\Phi}^T \mathbf{y}$$
Learn the Optimal Weight Vector

Since $f^*(\mathbf{x}^{(k)}) = \sum_{i=1}^{m} w_i^* \, \phi_i(\mathbf{x}^{(k)})$, the vector $\mathbf{f}^* = \left( f^*(\mathbf{x}^{(1)}), \ldots, f^*(\mathbf{x}^{(p)}) \right)^T$ can be written as $\mathbf{f}^* = \boldsymbol{\Phi} \mathbf{w}^*$, so

$$\boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{w}^* + \boldsymbol{\Lambda} \mathbf{w}^* = \boldsymbol{\Phi}^T \mathbf{y}$$
Learn the Optimal Weight Vector

$$\left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \boldsymbol{\Lambda} \right) \mathbf{w}^* = \boldsymbol{\Phi}^T \mathbf{y} \quad \Rightarrow \quad \mathbf{w}^* = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \boldsymbol{\Lambda} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y} = \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{y}$$

$\mathbf{A}^{-1}$: variance matrix; $\boldsymbol{\Phi}$: design matrix.
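A minimal sketch of this closed-form solution for a 1-D RBFN with Gaussian kernels; the target function, centers, width, and regularization constant below are illustrative assumptions:

```python
import numpy as np

# Noisy samples of an illustrative target function.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.05 * rng.standard_normal(50)

# Fixed Gaussian kernels -> design matrix Phi of shape (p, m).
centers = np.linspace(-1, 1, 7)
sigma = 0.3
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

# w* = (Phi^T Phi + Lambda)^{-1} Phi^T y, with Lambda = lam * I here.
lam = 1e-3
A = Phi.T @ Phi + lam * np.eye(len(centers))
w = np.linalg.solve(A, Phi.T @ y)

sse = np.sum((y - Phi @ w) ** 2)
print(sse)
```

Solving the linear system directly (rather than inverting `A`) is the numerically preferable way to apply the formula.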
The Empirical-Error Vector

$$\mathbf{w}^* = \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{y}$$

[Figure: the network with kernels $\phi_1, \ldots, \phi_m$ and weights $w_1^*, \ldots, w_m^*$ produces the fitted values $\mathbf{f}^* = \boldsymbol{\Phi} \mathbf{w}^*$, while the unknown function produces the targets $\mathbf{y}$.]
The Empirical-Error Vector

With $\mathbf{w}^* = \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{y}$ and $\mathbf{f}^* = \boldsymbol{\Phi} \mathbf{w}^*$, the error vector is

$$\mathbf{e} = \mathbf{y} - \mathbf{f}^* = \mathbf{y} - \boldsymbol{\Phi} \mathbf{w}^* = \mathbf{y} - \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T \mathbf{y} = \left( \mathbf{I}_p - \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T \right) \mathbf{y} = \mathbf{P} \mathbf{y}$$
Sum-Squared-Error

With $\mathbf{e} = \mathbf{P} \mathbf{y}$, $\mathbf{P} = \mathbf{I}_p - \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T$, $\mathbf{A} = \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \boldsymbol{\Lambda}$, $\boldsymbol{\Phi} = \left( \boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_m \right)$:

$$SSE = \sum_{k=1}^{p} \left( y^{(k)} - f^*(\mathbf{x}^{(k)}) \right)^2 = \left( \mathbf{P} \mathbf{y} \right)^T \left( \mathbf{P} \mathbf{y} \right) = \mathbf{y}^T \mathbf{P}^2 \mathbf{y}$$

If $\boldsymbol{\Lambda} = \mathbf{0}$, the RBFN's learning algorithm minimizes SSE (MSE).
The Projection Matrix

When $\boldsymbol{\Lambda} = \mathbf{0}$, $\mathbf{P}$ is a projection matrix: $\mathbf{P} \left( \mathbf{P} \mathbf{y} \right) = \mathbf{P} \mathbf{e} = \mathbf{e} = \mathbf{P} \mathbf{y}$, and the error vector satisfies $\mathbf{e} \perp \mathrm{span}\left( \boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \ldots, \boldsymbol{\phi}_m \right)$. Hence

$$SSE = \mathbf{y}^T \mathbf{P}^2 \mathbf{y} = \mathbf{y}^T \mathbf{P} \mathbf{y}$$
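These projection properties can be checked numerically. The sketch below assumes $\boldsymbol{\Lambda} = \mathbf{0}$ and uses a random illustrative design matrix:

```python
import numpy as np

# With Lambda = 0, P = I - Phi (Phi^T Phi)^{-1} Phi^T projects onto the
# orthogonal complement of span(phi_1, ..., phi_m).
rng = np.random.default_rng(1)
Phi = rng.standard_normal((10, 3))   # illustrative design matrix (p=10, m=3)
y = rng.standard_normal(10)

P = np.eye(10) - Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
e = P @ y

print(np.allclose(P @ P, P))         # idempotent: P^2 = P
print(np.allclose(Phi.T @ e, 0))     # e is orthogonal to every basis column
print(np.isclose(e @ e, y @ P @ y))  # SSE = y^T P y
```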
RBFN's as Universal Approximators

Network: inputs $x_1, \ldots, x_n$; kernels $\phi_1, \ldots, \phi_m$; outputs $y_1, \ldots, y_l$ with weights $w_{ij}$.

Training set: $T = \left\{ \left( \mathbf{x}^{(k)}, \mathbf{y}^{(k)} \right) \right\}_{k=1}^{p}$

Kernels:

$$\phi_j(\mathbf{x}) = \exp\!\left( -\frac{\| \mathbf{x} - \boldsymbol{\mu}_j \|^2}{2 \sigma_j^2} \right)$$
What to Learn?
• Weights $w_{ij}$'s
• Centers $\boldsymbol{\mu}_j$'s of $\phi_j$'s
• Widths $\sigma_j$'s of $\phi_j$'s
• Number of $\phi_j$'s − Model Selection
One-Stage Learning

$$\Delta w_{ij} = \eta_1 \left( y_i^{(k)} - f_i(\mathbf{x}^{(k)}) \right) \phi_j(\mathbf{x}^{(k)})$$

$$\Delta \boldsymbol{\mu}_j = \eta_2 \sum_{i=1}^{l} w_{ij} \left( y_i^{(k)} - f_i(\mathbf{x}^{(k)}) \right) \phi_j(\mathbf{x}^{(k)}) \, \frac{\mathbf{x}^{(k)} - \boldsymbol{\mu}_j}{\sigma_j^2}$$

$$\Delta \sigma_j = \eta_3 \sum_{i=1}^{l} w_{ij} \left( y_i^{(k)} - f_i(\mathbf{x}^{(k)}) \right) \phi_j(\mathbf{x}^{(k)}) \, \frac{\| \mathbf{x}^{(k)} - \boldsymbol{\mu}_j \|^2}{\sigma_j^3}$$
One-Stage Learning

The simultaneous updates of all three sets of parameters may be suitable for non-stationary environments or on-line settings:

$$\Delta w_{ij} = \eta_1 \left( y_i^{(k)} - f_i(\mathbf{x}^{(k)}) \right) \phi_j(\mathbf{x}^{(k)})$$

$$\Delta \boldsymbol{\mu}_j = \eta_2 \sum_{i=1}^{l} w_{ij} \left( y_i^{(k)} - f_i(\mathbf{x}^{(k)}) \right) \phi_j(\mathbf{x}^{(k)}) \, \frac{\mathbf{x}^{(k)} - \boldsymbol{\mu}_j}{\sigma_j^2}$$

$$\Delta \sigma_j = \eta_3 \sum_{i=1}^{l} w_{ij} \left( y_i^{(k)} - f_i(\mathbf{x}^{(k)}) \right) \phi_j(\mathbf{x}^{(k)}) \, \frac{\| \mathbf{x}^{(k)} - \boldsymbol{\mu}_j \|^2}{\sigma_j^3}$$
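One step of the simultaneous update rules can be traced by hand for the smallest possible network: one Gaussian unit, one input, one output. The learning rates and the training pair below are arbitrary illustrative choices:

```python
import math

# Smallest-case one-stage learning step: one Gaussian kernel, scalar in/out.
mu, sigma, w = 0.0, 1.0, 0.5             # initial center, width, weight
eta1, eta2, eta3 = 0.1, 0.01, 0.01       # illustrative learning rates
x_k, y_k = 1.0, 1.0                      # one training pair (x^(k), y^(k))

phi = math.exp(-(x_k - mu) ** 2 / (2 * sigma ** 2))
err = y_k - w * phi                      # y^(k) - f(x^(k))

# Simultaneous updates, all computed from the pre-step parameters:
w_new = w + eta1 * err * phi
mu_new = mu + eta2 * err * w * phi * (x_k - mu) / sigma ** 2
sigma_new = sigma + eta3 * err * w * phi * (x_k - mu) ** 2 / sigma ** 3

phi_new = math.exp(-(x_k - mu_new) ** 2 / (2 * sigma_new ** 2))
print(abs(y_k - w_new * phi_new) < abs(err))  # the sample error shrinks
```

Each update descends the per-sample squared error, so for small learning rates all three moves push $f(\mathbf{x}^{(k)})$ toward $y^{(k)}$.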
Two-Stage Training

Step 1: Determine
• Centers $\boldsymbol{\mu}_j$'s of $\phi_j$'s.
• Widths $\sigma_j$'s of $\phi_j$'s.
• Number of $\phi_j$'s.

Step 2: Determine the $w_{ij}$'s, e.g., using batch learning.
Train the Kernels
Unsupervised Training
Methods
• Subset Selection
– Random Subset Selection
– Forward Selection
– Backward Elimination
• Clustering Algorithms
– KMEANS
– LVQ
• Mixture Models
– GMM
Subset Selection
Random Subset Selection
• Randomly choose a subset of points from the training set as centers.
• Sensitive to the initially chosen points.
• Use adaptive techniques to tune:
– centers
– widths
– number of points
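A minimal sketch of random subset selection followed by the second-stage weight fit; the target function, subset size, and width heuristic below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-1, 1, 40))
y = np.sin(np.pi * x)                           # illustrative noiseless target

# Stage 1: a random subset of training points becomes the kernel centers,
# with a common width taken heuristically from their average spacing.
centers = rng.choice(x, size=8, replace=False)
sigma = (x.max() - x.min()) / len(centers)

# Stage 2: with centers and widths fixed, the weights follow from
# ordinary linear least squares on the design matrix.
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.mean((y - Phi @ w) ** 2))
```

A different random draw would give different centers, which is exactly the sensitivity the slide points out.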
Clustering Algorithms
[Figures: a clustering algorithm applied to the training points, steps 1–4; the resulting cluster centroids become the kernel centers.]
Questions - How should the user choose the kernel?
• The problem is similar to that of selecting features for other learning algorithms:
– a poor choice makes learning very difficult;
– with a good choice, even poor learners can succeed.
• The requirement placed on the user is thus critical:
– Can this requirement be lessened?
– Is a more automatic selection of features possible?
Goal Revisit
• Ultimate Goal − Generalization: minimize the prediction error.
• Goal of Our Learning Procedure: minimize the empirical error.
Badness of Fit
• Underfitting
– A model (e.g., network) that is not sufficiently complex can fail to fully detect the signal in a complicated data set, leading to underfitting.
– Produces excessive bias in the outputs.
• Overfitting
– A model (e.g., network) that is too complex may fit the noise, not just the signal, leading to overfitting.
– Produces excessive variance in the outputs.
Underfitting/Overfitting Avoidance
• Model selection
• Jittering
• Early stopping
• Weight decay
– Regularization
– Ridge Regression
• Bayesian learning
• Combining networks
Best Way to Avoid Overfitting
• Use lots of training data, e.g.:
– 30 times as many training cases as there are weights in the network;
– for noise-free data, 5 times as many training cases as weights may be sufficient.
• Don't arbitrarily reduce the number of weights for fear of underfitting.
Badness of Fit

[Figures: examples of an underfit model and an overfit model.]
Bias-Variance Dilemma
• Underfit: large bias, small variance.
• Overfit: small bias, large variance.
More on Overfitting
• Overfitting easily leads to predictions that are far beyond the range of the training data.
• It produces wild predictions in multilayer perceptrons even with noise-free data.
Bias-Variance Dilemma

[Figure: increasing variance means the model overfits more; increasing bias means it underfits more; the best fit balances the two.]
Bias-Variance Dilemma

[Figure: the set of realizable functions (e.g., depending on the number of hidden nodes used) sits at some bias from the true model; solutions obtained with training sets 1, 2, and 3 scatter around it, and noise separates the observations from the true model.]
Bias-Variance Dilemma

[Figure: the same picture, with the scatter of the solutions across training sets marked as the variance.]
Model Selection

[Figure: the same bias/variance picture, used to motivate model selection.]
Bias-Variance Dilemma

[Figure: the true model $g(\mathbf{x})$ with noisy observations $y(\mathbf{x}) = g(\mathbf{x}) + \varepsilon(\mathbf{x})$, and an approximator $f(\mathbf{x})$ drawn from the set of functions.]
Bias-Variance Dilemma

$$E\!\left[ \left( y - f(\mathbf{x}) \right)^2 \right] = E\!\left[ \left( g(\mathbf{x}) + \varepsilon - f(\mathbf{x}) \right)^2 \right] = E\!\left[ \left( g(\mathbf{x}) - f(\mathbf{x}) \right)^2 + 2 \varepsilon \left( g(\mathbf{x}) - f(\mathbf{x}) \right) + \varepsilon^2 \right]$$

$$= E\!\left[ \left( g(\mathbf{x}) - f(\mathbf{x}) \right)^2 \right] + 2 E\!\left[ \varepsilon \left( g(\mathbf{x}) - f(\mathbf{x}) \right) \right] + E\!\left[ \varepsilon^2 \right]$$

The middle term is 0 and the last term is constant, so the goal $\min E\!\left[ \left( y - f(\mathbf{x}) \right)^2 \right]$ is equivalent to $\min E\!\left[ \left( g(\mathbf{x}) - f(\mathbf{x}) \right)^2 \right]$.
Bias-Variance Dilemma

$$E\!\left[ \left( g(\mathbf{x}) - f(\mathbf{x}) \right)^2 \right] = E\!\left[ \left( g(\mathbf{x}) - E[f(\mathbf{x})] + E[f(\mathbf{x})] - f(\mathbf{x}) \right)^2 \right]$$

$$= E\!\left[ \left( g(\mathbf{x}) - E[f(\mathbf{x})] \right)^2 \right] + E\!\left[ \left( f(\mathbf{x}) - E[f(\mathbf{x})] \right)^2 \right] - 2 E\!\left[ \left( g(\mathbf{x}) - E[f(\mathbf{x})] \right) \left( f(\mathbf{x}) - E[f(\mathbf{x})] \right) \right]$$

The cross term vanishes, since expanding it term by term gives

$$E\!\left[ g(\mathbf{x}) f(\mathbf{x}) \right] - E\!\left[ g(\mathbf{x}) \right] E\!\left[ f(\mathbf{x}) \right] - E\!\left[ E[f(\mathbf{x})] \, f(\mathbf{x}) \right] + E\!\left[ f(\mathbf{x}) \right] E\!\left[ f(\mathbf{x}) \right] = 0$$
Bias-Variance Dilemma
( ) ( )2 22( ) ( ) ( ) ( )E y f E E g f − = + −
x x x x
( ) ( )2 22 [ ( )] [ ( )( ( ]) )E fE E fE g E f = + + −
−
x x xx
noise bias2 variance
Cannot be minimized
Minimize both bias2 and variance
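The decomposition E[(y − f(x))²] = σ² + bias² + variance can be checked numerically. Below is a minimal Monte Carlo sketch in Python/NumPy; the target g(x) = sin(2πx), the noise level σ = 0.3, and the degree-3 polynomial model family are arbitrary illustrative choices, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    """True function g(x); sin is an arbitrary illustrative choice."""
    return np.sin(2 * np.pi * x)

sigma, degree, n_train, n_runs = 0.3, 3, 30, 2000
x_test = np.linspace(0.05, 0.95, 50)

# Train one model f per independently drawn training set
preds = np.empty((n_runs, x_test.size))
for r in range(n_runs):
    x = rng.uniform(0.0, 1.0, n_train)
    y = g(x) + rng.normal(0.0, sigma, n_train)   # y = g(x) + eps
    preds[r] = np.polyval(np.polyfit(x, y, degree), x_test)

f_bar = preds.mean(axis=0)                       # E[f(x)] over training sets
bias2 = np.mean((g(x_test) - f_bar) ** 2)        # bias^2 term
var = np.mean(preds.var(axis=0))                 # variance term

# Direct estimate of E[(y - f(x))^2] on fresh noisy targets
y_new = g(x_test) + rng.normal(0.0, sigma, preds.shape)
mse = np.mean((y_new - preds) ** 2)

print(f"mse={mse:.4f}  noise+bias2+var={sigma**2 + bias2 + var:.4f}")
```

With enough runs the two printed numbers agree closely, which is exactly the statement of the decomposition.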
Model Complexity vs. Bias-Variance

E[(y − f(x))²] = σ² + (g(x) − E[f(x)])² + E[(f(x) − E[f(x)])²]
                noise    bias²            variance

[Figure: as model complexity (capacity) grows, bias² decreases while
variance increases.]
Example (Polynomial Fits)

y = x² + exp(−16x²)
Example (Polynomial Fits)

[Figure: noisy samples of the target, fitted with polynomials of
degree 1, 5, 10, and 15.]
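The experiment behind these panels can be reproduced in a few lines of NumPy. The target function y = x² + exp(−16x²) is a reading reconstructed from the garbled slide, and the sample count and noise level are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def target(x):
    # Assumed reading of the slide's example function
    return x**2 + np.exp(-16.0 * x**2)

x_tr = np.linspace(-1.0, 1.0, 40)
y_tr = target(x_tr) + rng.normal(0.0, 0.05, x_tr.size)   # noisy samples
x_te = np.linspace(-1.0, 1.0, 400)                        # dense test grid

results = {}
for d in (1, 5, 10, 15):
    c = np.polyfit(x_tr, y_tr, d)                         # fit degree d
    train_mse = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
    test_mse = np.mean((np.polyval(c, x_te) - target(x_te)) ** 2)
    results[d] = (train_mse, test_mse)
    print(f"degree {d:2d}: train MSE {train_mse:.5f}, test MSE {test_mse:.5f}")
```

The training error keeps falling as the degree grows, while the test error follows the bias-variance trade-off of the previous slides.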
Variance Estimation

Mean:      μ
Variance:  σ̂² = (1/p) Σᵢ₌₁ᵖ (xᵢ − μ)²

In general, the true mean μ is not available.
Variance Estimation
Mean1
1ˆ
p
i
i
x xp
=
= =
Variance ( )22 2
1
ˆ1
1ˆ
p
i
i
s xp
=
= = −−
Loss 1 degree of freedom
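A quick numerical check of the two estimators; the data distribution below (μ = 5, σ² = 4) is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, 10_000)     # samples with true mu = 5, sigma^2 = 4

mu_hat = x.sum() / x.size                          # x-bar
s2 = ((x - mu_hat) ** 2).sum() / (x.size - 1)      # divide by p - 1

print(f"mu_hat={mu_hat:.3f}  s2={s2:.3f}")         # near 5 and 4
```

NumPy encodes the lost degree of freedom via `ddof`: `x.var(ddof=1)` computes the same p − 1 estimator.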
Simple Linear Regression

y = β₀ + β₁x + ε,   ε ~ N(0, σ²)

[Figure: data scattered around the true line y = β₀ + β₁x.]
Simple Linear Regression
y
x
0
1ˆˆ
y
x
=
+0 1y x = + +
2~ (0, )N
Mean Squared Error (MSE)

y = β₀ + β₁x + ε,   ε ~ N(0, σ²),   fitted line ŷ = β̂₀ + β̂₁x

σ̂² = MSE = SSE / (p − 2)

Estimating the two coefficients loses 2 degrees of freedom.
Variance Estimation

σ̂² = MSE = SSE / (p − m)

Fitting loses m degrees of freedom, where m is the number of parameters
of the model.
The Number of Parameters

y = f(x) = Σᵢ₌₁ᵐ wᵢ φᵢ(x)

[Figure: RBF network with inputs x = (x₁, …, xₙ) and output weights
w₁, w₂, …, wₘ.]

#degrees of freedom: m
The Effective Number of Parameters (γ)

y = f(x) = Σᵢ₌₁ᵐ wᵢ φᵢ(x)

The projection matrix:

P = I_p − Φ A⁻¹ Φᵀ,   A = ΦᵀΦ + Λ

With Λ = 0:   γ = p − trace(P) = m
The Effective Number of Parameters (γ)

The projection matrix:  P = I_p − Φ A⁻¹ Φᵀ,  A = ΦᵀΦ + Λ

Proof that γ = m when Λ = 0 (so that A = ΦᵀΦ):

trace(P) = trace(I_p − Φ A⁻¹ Φᵀ)
         = p − trace(Φ (ΦᵀΦ)⁻¹ Φᵀ)
         = p − trace((ΦᵀΦ)⁻¹ ΦᵀΦ)     (cyclic property of the trace)
         = p − trace(I_m)
         = p − m

Hence γ = p − trace(P) = m.
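The identity trace(P) = p − m for Λ = 0 is easy to confirm numerically. In the sketch below, a random matrix stands in for a real RBF design matrix Φ (an assumption; any full-column-rank Φ works):

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 50, 5                          # p data points, m basis functions
Phi = rng.normal(size=(p, m))         # stand-in design matrix, full column rank

A = Phi.T @ Phi                       # Lambda = 0 case
P = np.eye(p) - Phi @ np.linalg.inv(A) @ Phi.T   # projection matrix

gamma = p - np.trace(P)               # effective number of parameters
print(gamma)                          # equals m = 5 up to rounding error
```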
Regularization

Cost = Empirical Error + Model Penalty

Cost = Σₖ₌₁ᵖ (y⁽ᵏ⁾ − f(x⁽ᵏ⁾))² + Σᵢ₌₁ᵐ λᵢ wᵢ²

The first sum is the SSE (empirical error); the second penalizes models
with large weights.
Regularization

Cost = Σₖ₌₁ᵖ (y⁽ᵏ⁾ − f(x⁽ᵏ⁾))² + Σᵢ₌₁ᵐ λᵢ wᵢ²

Without the penalty (λᵢ = 0), there are m degrees of freedom available to
minimize the SSE, and the effective number of parameters γ = m.
Regularization

Cost = Σₖ₌₁ᵖ (y⁽ᵏ⁾ − f(x⁽ᵏ⁾))² + Σᵢ₌₁ᵐ λᵢ wᵢ²

With the penalty (λᵢ > 0), the liberty to minimize the SSE is reduced, and
the effective number of parameters γ < m.
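The shrinking of γ under a penalty can also be seen numerically. A sketch assuming the simple ridge case Λ = λI (a special case of the per-weight λᵢ above) and, as before, a random stand-in for Φ:

```python
import numpy as np

rng = np.random.default_rng(4)
p, m = 50, 5
Phi = rng.normal(size=(p, m))         # stand-in design matrix

def effective_params(lam):
    """gamma = p - trace(P) with the ridge penalty Lambda = lam * I."""
    A = Phi.T @ Phi + lam * np.eye(m)
    P = np.eye(p) - Phi @ np.linalg.inv(A) @ Phi.T
    return p - np.trace(P)

print(effective_params(0.0))    # = m
print(effective_params(10.0))   # < m: the penalty shrinks gamma
```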
Variance Estimation

σ̂² = MSE = SSE / (p − γ)

Fitting loses γ (the effective number of parameters) degrees of freedom.
Variance Estimation

σ̂² = MSE = SSE / trace(P)      (since trace(P) = p − γ)
Model Selection

• Goal
  – Choose the fittest model
• Criterion
  – Least prediction error
• Main tools (estimating model fitness)
  – Cross-validation
  – Projection matrix
• Methods
  – Weight decay (ridge regression)
  – Pruning and growing RBFN's
Empirical Error vs. Model Fitness

• Ultimate goal — generalization: minimize the prediction error.
• Goal of our learning procedure: minimize the empirical error (MSE), in
  the hope that a small empirical error also gives a small prediction error.
Estimating Prediction Error

• When you have plenty of data, use independent test sets
  – E.g., use the same training set to train different models, and choose
    the best model by comparing them on the test set.
• When data is scarce, use
  – Cross-validation
  – Bootstrap
Cross Validation

• The simplest and most widely used method for estimating prediction error.
• Partition the original set in several different ways and compute an
  average score over the different partitions, e.g.,
  – K-fold cross-validation
  – Leave-one-out cross-validation
  – Generalized cross-validation
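A minimal K-fold cross-validation sketch, using polynomial models on synthetic quadratic data (all specifics — target, noise, candidate degrees — are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, 60)
y = x**2 + rng.normal(0.0, 0.1, x.size)   # illustrative quadratic data

def kfold_mse(degree, k=5):
    """Average held-out MSE of a degree-d polynomial over k folds."""
    idx = rng.permutation(x.size)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        c = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(c, x[test]) - y[test]) ** 2))
    return float(np.mean(errs))

scores = {d: kfold_mse(d) for d in (1, 2, 6, 12)}
print(scores, "-> pick degree", min(scores, key=scores.get))
```

The model with the lowest cross-validated score is selected, which is exactly the "least prediction error" criterion from the previous slide.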
Comparison of RBF and MLP

MLP                               RBF
Global hyperplane                 Local receptive field
EBP learning                      LMS learning
Serious local minima              Local minima
Smaller number of hidden neurons  Larger number of hidden neurons
Shorter computation time          Longer computation time
Longer learning time              Shorter learning time
Comparison of RBF and MLP

                         RBF networks                         MLP
Learning speed           Very fast                            Very slow
Convergence              Almost guaranteed                    Not guaranteed
Response time            Slow                                 Fast
Memory requirement       Very large                           Small
Hardware implementation  IBM ZISC036, Nestor Ni1000           Voice Direct 364
                         (www-5.ibm.com/fr/cdlab/zisc.html)   (www.sensoryinc.com)
Generalization           Usually better                       Usually poorer