Artificial Neural Networks - University of Minnesota
CS 5751 Machine Learning
Chapter 4 Artificial Neural Networks 1
Artificial Neural Networks
• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face recognition
• Advanced topics
Connectionist Models
Consider humans:
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4 to 10^5
• Scene recognition time ~0.1 second
• 100 inference steps do not seem like enough
→ must use lots of parallel computation!
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction
ALVINN drives 70 mph on highways
[Figure: a 30x32 sensor input retina feeds 4 hidden units, whose outputs select a steering direction: Sharp Left, Straight Ahead, Sharp Right]
Perceptron
[Figure: inputs x0 = 1, x1, ..., xn, weighted by w0, w1, ..., wn, feed a summation unit Σ followed by a threshold]

o(x1, ..., xn) = 1 if w0 + w1x1 + ... + wnxn > 0
                 -1 otherwise

Sometimes we will use simpler vector notation:

o(x) = 1 if w·x > 0
       -1 otherwise
Decision Surface of Perceptron
[Figure: two plots in the (x1, x2) plane, one showing a set of + and - points a line can separate, the other a set no line can separate]
Represents some useful functions
• What weights represent g(x1, x2) = AND(x1, x2)?
But some functions are not representable
• e.g., not linearly separable
• therefore, we will want networks of these ...
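As a minimal sketch (not from the slides), a thresholded perceptron, together with one weight setting that represents AND(x1, x2) over 0/1 inputs; the particular weight values are an illustrative assumption:

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit: output 1 if w . x > 0, else -1.
    w includes the bias weight w0; a constant x0 = 1 is prepended to x."""
    x1 = np.concatenate(([1.0], np.asarray(x, dtype=float)))  # x0 = 1
    return 1 if np.dot(w, x1) > 0 else -1

# One choice representing AND(x1, x2) over 0/1 inputs:
# w0 = -0.8, w1 = w2 = 0.5, so the weighted sum exceeds 0 only when x1 = x2 = 1.
w_and = np.array([-0.8, 0.5, 0.5])
```

No single weight vector can do the same for XOR, which is the non-separable case motivating networks of these units.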
Perceptron Training Rule

wᵢ ← wᵢ + Δwᵢ
where Δwᵢ = η(t − o)xᵢ

where:
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate

Can prove it will converge if the training data is linearly separable and η is sufficiently small.
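A sketch of the rule above (an illustrative implementation, not the slides' code; the ±1 input encoding, η, and epoch count are arbitrary choices):

```python
import numpy as np

def perceptron_train(examples, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    examples: list of (x, t) with t in {-1, +1}; weights include the bias w0."""
    w = np.zeros(len(examples[0][0]) + 1)
    for _ in range(epochs):
        for x, t in examples:
            x1 = np.concatenate(([1.0], np.asarray(x, dtype=float)))  # x0 = 1
            o = 1 if np.dot(w, x1) > 0 else -1   # thresholded perceptron output
            w += eta * (t - o) * x1              # changes w only on mistakes
    return w
```

On linearly separable data such as AND over ±1 inputs, the loop stops changing w once every example is classified correctly, as the convergence result promises.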
Gradient Descent

To understand, consider a simpler linear unit, where

o = w0 + w1x1 + ... + wnxn

Idea: learn wᵢ's that minimize the squared error

E[w] ≡ ½ Σ_{d∈D} (t_d − o_d)²

where D is the set of training examples.
Gradient Descent

Gradient: ∇E[w] ≡ [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn]

Training rule: Δw = −η∇E[w]

i.e., Δwᵢ = −η ∂E/∂wᵢ
Gradient Descent

∂E/∂wᵢ = ∂/∂wᵢ ½ Σ_d (t_d − o_d)²
       = ½ Σ_d ∂/∂wᵢ (t_d − o_d)²
       = ½ Σ_d 2(t_d − o_d) ∂/∂wᵢ (t_d − o_d)
       = Σ_d (t_d − o_d) ∂/∂wᵢ (t_d − w·x_d)
∂E/∂wᵢ = Σ_d (t_d − o_d)(−x_{i,d})
Gradient Descent

GRADIENT-DESCENT(training_examples, η)
  Each training example is a pair of the form <x, t>, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).
  - Initialize each wᵢ to some small random value
  - Until the termination condition is met, do
    - Initialize each Δwᵢ to zero
    - For each <x, t> in training_examples, do
      - Input the instance x and compute the output o
      - For each linear unit weight wᵢ, do
          Δwᵢ ← Δwᵢ + η(t − o)xᵢ
    - For each linear unit weight wᵢ, do
          wᵢ ← wᵢ + Δwᵢ
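The batch algorithm above can be sketched in vectorized form (a minimal illustration; the fitted line, η, and iteration count are assumptions for the demo, not from the slides):

```python
import numpy as np

def gradient_descent(examples, eta=0.05, iterations=1000, seed=0):
    """Batch GRADIENT-DESCENT for a linear unit o = w . x.
    examples: list of (x, t); a leading x0 = 1 is added for the bias w0."""
    rng = np.random.default_rng(seed)
    X = np.array([np.concatenate(([1.0], x)) for x, _ in examples])
    t = np.array([target for _, target in examples])
    w = rng.uniform(-0.05, 0.05, X.shape[1])   # small random initial weights
    for _ in range(iterations):
        o = X @ w                      # linear unit outputs for all examples
        delta_w = eta * (t - o) @ X    # accumulates eta*(t_d - o_d)*x_d over d
        w += delta_w                   # one weight update per pass through D
    return w
```

On noise-free data from t = 1 + 2x, the weights converge to the minimum-squared-error hypothesis w = (1, 2).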
Summary
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H
Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied:
  1. Compute the gradient ∇E_D[w]
  2. w ← w − η∇E_D[w]

Incremental mode Gradient Descent:
Do until satisfied:
- For each training example d in D
  1. Compute the gradient ∇E_d[w]
  2. w ← w − η∇E_d[w]

E_D[w] ≡ ½ Σ_{d∈D} (t_d − o_d)²
E_d[w] ≡ ½ (t_d − o_d)²

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
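The incremental mode can be sketched as follows (an illustrative version of the same linear unit; the demo data, η, and epoch count are assumptions). Note the weight update happens inside the per-example loop, using only the single-example error E_d:

```python
import numpy as np

def stochastic_gradient_descent(examples, eta=0.05, epochs=500, seed=0):
    """Incremental (stochastic) mode: update w after each example d,
    stepping along -grad E_d[w] where E_d = 0.5 * (t_d - o_d)**2."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.05, 0.05, len(examples[0][0]) + 1)
    for _ in range(epochs):
        for x, t in examples:
            x1 = np.concatenate(([1.0], x))  # x0 = 1 for the bias weight
            o = np.dot(w, x1)                # linear unit output on example d
            w += eta * (t - o) * x1          # immediate per-example update
    return w
```

On the same noise-free line t = 1 + 2x it reaches the same solution as batch mode, but with one weight update per example rather than per pass.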
Multilayer Networks of Sigmoid Units
[Figure: a network mapping inputs x1, x2 through hidden units h1, h2 to an output o, shown alongside plots of data classes d0, d1, d2, d3 in the input space and in the hidden-unit space]
Sigmoid Unit
[Figure: inputs x0 = 1, x1, ..., xn, weighted by w0, ..., wn, feed a summation unit Σ producing net, followed by the sigmoid]

net = Σ_{i=0}^{n} wᵢxᵢ

o = σ(net) = 1 / (1 + e^(−net))

σ(x) is the sigmoid function: 1 / (1 + e^(−x))

Nice property: dσ(x)/dx = σ(x)(1 − σ(x))

We can derive gradient descent rules to train:
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
The Sigmoid Function
[Figure: plot of σ(x) = 1/(1 + e^(−x)); the output rises smoothly from 0 to 1 as the net input goes from −6 to 6]
Sort of a rounded step function. Unlike the step function, we can take its derivative (makes learning possible).
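The function and its "nice property" in a few lines (a straightforward transcription of the formulas above):

```python
import math

def sigmoid(x):
    """The sigmoid function: sigma(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    """Nice property: d sigma(x)/dx = sigma(x) * (1 - sigma(x)),
    so the derivative is available from the output alone."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

A central finite difference confirms the closed-form derivative; this reuse of the unit's own output o(1 − o) is exactly what backpropagation exploits.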
Error Gradient for a Sigmoid Unit

∂E/∂wᵢ = ∂/∂wᵢ ½ Σ_{d∈D} (t_d − o_d)²
       = ½ Σ_d ∂/∂wᵢ (t_d − o_d)²
       = ½ Σ_d 2(t_d − o_d) ∂/∂wᵢ (t_d − o_d)
       = Σ_d (t_d − o_d) (−∂o_d/∂wᵢ)
       = −Σ_d (t_d − o_d) (∂o_d/∂net_d)(∂net_d/∂wᵢ)

But we know:

∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d(1 − o_d)

∂net_d/∂wᵢ = ∂(w·x_d)/∂wᵢ = x_{i,d}

So:

∂E/∂wᵢ = −Σ_{d∈D} (t_d − o_d) o_d (1 − o_d) x_{i,d}
Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, do
  For each training example, do
  1. Input the training example and compute the outputs
  2. For each output unit k:
       δ_k ← o_k(1 − o_k)(t_k − o_k)
  3. For each hidden unit h:
       δ_h ← o_h(1 − o_h) Σ_{k∈outputs} w_{h,k} δ_k
  4. Update each network weight w_{i,j}:
       w_{i,j} ← w_{i,j} + Δw_{i,j}
     where Δw_{i,j} = η δ_j x_{i,j}
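The four steps above, sketched for a single hidden layer of sigmoid units (a minimal illustration, not the slides' code; the hidden-layer size, η, epoch count, and seed are arbitrary choices):

```python
import numpy as np

def train_backprop(examples, n_hidden=3, eta=0.5, epochs=5000, seed=0):
    """Backpropagation for one hidden layer; a constant x0 = 1 supplies each bias.
    examples: list of (x, t) arrays. Weight rows index the destination unit."""
    rng = np.random.default_rng(seed)
    n_in, n_out = len(examples[0][0]), len(examples[0][1])
    W_h = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))   # input -> hidden
    W_o = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))  # hidden -> output
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        for x, t in examples:
            # 1. input the example and compute the outputs
            x1 = np.concatenate(([1.0], x))
            h = sig(W_h @ x1)
            h1 = np.concatenate(([1.0], h))
            o = sig(W_o @ h1)
            # 2. delta_k = o_k (1 - o_k) (t_k - o_k) for each output unit
            delta_o = o * (1 - o) * (t - o)
            # 3. delta_h = o_h (1 - o_h) sum_k w_hk delta_k for each hidden unit
            delta_h = h * (1 - h) * (W_o[:, 1:].T @ delta_o)
            # 4. w_ij <- w_ij + eta * delta_j * x_ij
            W_o += eta * np.outer(delta_o, h1)
            W_h += eta * np.outer(delta_h, x1)
    return W_h, W_o

def forward(W_h, W_o, x):
    """Propagate one input through the trained two-layer network."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sig(W_h @ np.concatenate(([1.0], x)))
    return sig(W_o @ np.concatenate(([1.0], h)))
```

For example, trained on OR over 0/1 inputs, the network's outputs end up on the correct side of 0.5 for all four cases.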
More on Backpropagation
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global error minimum
  – In practice, often works well (can run multiple times)
• Often include weight momentum α:
  Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n−1)
• Minimizes error over training examples
  – Will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
  – Using network after training is fast
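As a one-line sketch of the momentum update (the values of η and α below are illustrative, not prescribed by the slides):

```python
def momentum_step(grad_term, prev_delta_w, eta=0.3, alpha=0.9):
    """Weight update with momentum:
    delta_w(n) = eta * delta_j * x_ij + alpha * delta_w(n-1).
    grad_term stands for the current backprop term delta_j * x_ij;
    prev_delta_w is the update applied on the previous iteration."""
    return eta * grad_term + alpha * prev_delta_w
```

Carrying a fraction α of the previous update keeps the weight moving through flat regions and small local wiggles in the error surface.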
Learning Hidden Layer Representations
[Figure: a network with eight inputs and eight outputs, trained to learn the identity function over one-hot patterns]

Input    →  Output
10000000 →  10000000
01000000 →  01000000
00100000 →  00100000
00010000 →  00010000
00001000 →  00001000
00000100 →  00000100
00000010 →  00000010
00000001 →  00000001
Learning Hidden Layer Representations
Learned hidden unit values for each input:

Input    →  Hidden Values  →  Output
00000001 →  .01 .94 .60    →  00000001
00000010 →  .98 .01 .80    →  00000010
00000100 →  .99 .99 .22    →  00000100
00001000 →  .02 .05 .03    →  00001000
00010000 →  .71 .97 .99    →  00010000
00100000 →  .27 .97 .01    →  00100000
01000000 →  .88 .11 .01    →  01000000
10000000 →  .08 .04 .89    →  10000000
Output Unit Error during Training
[Figure: sum of squared errors for each output unit, falling over roughly 2500 training iterations]
Hidden Unit Encoding
[Figure: the three hidden unit values for one input, evolving over roughly 2500 training iterations]
Input to Hidden Weights
[Figure: the weights from the inputs to one hidden unit, evolving over roughly 2500 training iterations]
Convergence of Backpropagation
Gradient descent to some local minimum
• Perhaps not global minimum
• Momentum can cause quicker convergence
• Stochastic gradient descent also results in faster convergence
• Can train multiple networks and get different results (using different initial weights)
Nature of convergence
• Initialize weights near zero
• Therefore, initial networks near-linear
• Increasingly non-linear functions as training progresses
Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• But that might require exponentially many (in the number of inputs) hidden units
Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Overfitting in ANNs
[Figure: error versus number of weight updates (example 1), plotted for the training set and a validation set over roughly 20000 updates]
Overfitting in ANNs
[Figure: error versus number of weight updates (example 2), plotted for the training set and a validation set over roughly 6000 updates]
Neural Nets for Face Recognition
[Figure: typical input images feed a network with 30x32 inputs and outputs left, strt, rgt, up]
90% accurate learning head pose, and recognizing 1-of-20 faces
Learned Network Weights
[Figure: typical input images and the learned weights from the 30x32 inputs, for the network with outputs left, strt, rgt, up]
Alternative Error Functions

Penalize large weights:

E(w) ≡ ½ Σ_{d∈D} Σ_{k∈outputs} (t_{kd} − o_{kd})² + γ Σ_{i,j} w_{ji}²

Train on target slopes as well as values:

E(w) ≡ ½ Σ_{d∈D} Σ_{k∈outputs} [ (t_{kd} − o_{kd})² + μ Σ_{j∈inputs} ( ∂t_{kd}/∂x_d^j − ∂o_{kd}/∂x_d^j )² ]

Tie weights together:
• e.g., in phoneme recognition
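The "penalize large weights" variant for a single training example can be sketched as follows (an illustrative helper; the γ value is an assumption, and the full error would sum this over all examples d):

```python
import numpy as np

def penalized_error(outputs, targets, weights, gamma=0.1):
    """Squared error plus a weight-decay penalty, per the first
    alternative error function: 0.5 * sum_k (t_k - o_k)^2 + gamma * sum_ij w_ji^2."""
    squared = 0.5 * np.sum((targets - outputs) ** 2)   # usual sum of squared errors
    penalty = gamma * np.sum(weights ** 2)             # discourages large weights
    return squared + penalty
```

Minimizing this by gradient descent just adds a −2γw_ji term to each weight update, shrinking weights that the data does not support.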
Recurrent Networks
[Figure: three diagrams: a feedforward network mapping x(t) to y(t+1); a recurrent network whose output feeds back into its input; and the same recurrent network unfolded in time, with copies for x(t), x(t−1), x(t−2) producing y(t+1), y(t), y(t−1)]
Neural Network Summary
• physiologically (neurons) inspired model
• powerful (accurate), slow, opaque (hard to understand resulting model)
• bias: preferential
  – based on gradient descent
  – finds local minimum
  – affected by initial conditions, parameters
• neural units
  – linear
  – linear threshold
  – sigmoid
Neural Network Summary (cont)
• gradient descent
  – convergence
• linear units
  – limitation: hyperplane decision surface
  – learning rule
• multilayer network
  – advantage: can have non-linear decision surface
  – backpropagation to learn
    • backprop learning rule
• learning issues
  – units used
Neural Network Summary (cont)
• learning issues (cont)
  – batch versus incremental (stochastic)
  – parameters
    • initial weights
    • learning rate
    • momentum
  – cost (error) function
    • sum of squared errors
    • can include penalty terms
• recurrent networks
  – simple
  – backpropagation through time