Backpropagation
Introduction to Artificial Intelligence
COS302
Michael L. Littman
Fall 2001
Administration
Questions, concerns?
Classification Percept.
[Diagram: inputs x1, x2, x3, …, xD and a constant input 1 are multiplied by weights w1, w2, w3, …, wD and w0, summed into net, and passed through the squashing function g to give out.]
Perceptrons
Recall that the squashing function makes the output look more like bits: 0 or 1 decisions.
What if we give it inputs that are also bits?
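The slides never pin down g; a standard choice, consistent with outputs in (0, 1) and the 1/2 threshold used on later slides, is the logistic sigmoid. A minimal sketch (the name g is the slide's; everything else is illustrative):

    import math

    def g(z):
        # Logistic sigmoid: squashes any real "net" value into (0, 1).
        return 1.0 / (1.0 + math.exp(-z))

    print(g(10))    # ~0.99995 -> reads as bit 1
    print(g(-10))   # ~0.00005 -> reads as bit 0
    print(g(0))     # 0.5, the decision boundary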
A Boolean Function

A B C D E F G | out
--------------+----
1 0 1 0 1 0 1 |  0
0 1 1 0 0 0 1 |  0
0 0 1 0 0 1 0 |  0
1 0 0 0 1 0 0 |  1
0 0 1 1 0 0 0 |  1
1 1 1 0 1 0 1 |  0
0 1 0 1 0 0 1 |  1
1 1 1 1 1 0 1 |  1
1 1 1 1 1 1 1 |  1
1 1 1 0 0 1 1 |  0
Think Graphically
Can a perceptron learn this?
[Figure: the function drawn on axes C and D; three cells are labeled 1 and one is labeled 0.]
Ands and Ors
out(x) = g(sum_k w_k x_k)
How can we set the weights to represent (v1)(v2)(~v7)? (AND)
  w_i = 0, except w1 = 10, w2 = 10, w7 = -10, w0 = -15
  (net = 5 when the AND is satisfied, at most -5 otherwise)
How about ~v3 + v4 + ~v8? (OR)
  w_i = 0, except w3 = -10, w4 = 10, w8 = -10, w0 = 15
  (net = -5 when no literal holds, at least 5 otherwise)
Majority
Are at least half the bits on?
Set all weights to 1, w0 to -n/2.

A B C D E F G | out
--------------+----
1 0 1 0 1 0 1 |  1
0 1 1 0 0 0 1 |  0
0 0 1 0 0 1 0 |  0
1 0 0 0 1 0 0 |  0
1 1 1 0 1 0 1 |  1
0 1 0 1 0 0 1 |  0
1 1 1 1 1 0 1 |  1
1 1 1 1 1 1 1 |  1

Representation size using a decision tree?
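The rule as a sketch (the threshold test net > 0 is equivalent to g(net) > 1/2 for a sigmoid g):

    def majority(bits):
        # All weights 1, threshold weight w0 = -n/2.
        n = len(bits)
        return 1 if sum(bits) - n / 2 > 0 else 0

    # Rows from the table above (n = 7, so "on" wins at 4 or more):
    print(majority([1, 0, 1, 0, 1, 0, 1]))  # 1 (four bits on)
    print(majority([0, 1, 1, 0, 0, 0, 1]))  # 0 (three bits on)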
Sweet Sixteen?
The 16 Boolean functions of two inputs, paired with their complements:
ab          (~a)+(~b)
a(~b)       (~a)+b
(~a)b       a+(~b)
(~a)(~b)    a+b
a           ~a
b           ~b
1           0
a = b       a exclusive-or b (a ⊕ b)
XOR Constraints

A B | out
0 0 |  0    g(w0) < 1/2
0 1 |  1    g(w_B + w0) > 1/2
1 0 |  1    g(w_A + w0) > 1/2
1 1 |  0    g(w_A + w_B + w0) < 1/2

So w0 < 0, w_A + w0 > 0, w_B + w0 > 0. Adding the last two gives w_A + w_B + 2w0 > 0, i.e., w_A + w_B + w0 > -w0 > 0; but the fourth constraint requires w_A + w_B + w0 < 0. That forces 0 < w_A + w_B + w0 < 0: a contradiction, so no such weights exist.
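The contradiction can also be seen numerically. The brute-force sketch below (illustrative, not a proof) finds no integer weights in [-20, 20] satisfying all four XOR constraints, while many settings compute AND:

    # Can a single unit, out = [w_A*A + w_B*B + w0 > 0], compute XOR?
    def fits(table, w_A, w_B, w0):
        return all((w_A * A + w_B * B + w0 > 0) == bool(y)
                   for (A, B), y in table.items())

    XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

    grid = range(-20, 21)
    print(sum(fits(XOR, a, b, c) for a in grid for b in grid for c in grid))  # 0
    print(sum(fits(AND, a, b, c) for a in grid for b in grid for c in grid))  # many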
Linearly Separable
XOR problematic
[Figure: the four input combinations plotted in the plane (axes C and D), labeled 0, 0, 1, 1; the "?" marks where a single separating line would have to go.]
How Represent XOR?
A xor B = (A+B)(~A+~B)
[Network diagram: inputs A and B (plus constant-1 inputs) feed two hidden units. c1 = g(10A + 10B - 5) computes A+B; c2 = g(-10A - 10B + 15) computes ~A+~B; the output unit out = g(10 c1 + 10 c2 - 15) computes c1 AND c2.]
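Tracing the network with the logistic sigmoid confirms the construction:

    import math

    def g(z):
        return 1.0 / (1.0 + math.exp(-z))

    def xor_net(A, B):
        c1 = g(10 * A + 10 * B - 5)        # c1 ~ A + B (OR)
        c2 = g(-10 * A - 10 * B + 15)      # c2 ~ ~A + ~B (NAND)
        return g(10 * c1 + 10 * c2 - 15)   # out ~ c1 AND c2

    for A in (0, 1):
        for B in (0, 1):
            print(A, B, round(xor_net(A, B)))
    # -> 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0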
Requiem for a Perceptron
Rosenblatt proved that a perceptron will learn any linearly separable function.
Minsky and Papert (1969) in Perceptrons: "there is no reason to suppose that any of the virtues carry over to the many-layered version."
Backpropagation
Bryson and Ho (1969, same year) described a training procedure for multilayer networks. Went unnoticed.
Multiply rediscovered in the 1980s.
Multilayer NetMultilayer Net
xx11
netnet11ii
xx22 xx33 xxDD…WW1111
11WW1212WW1313
hidhid11gg
netnetHHiinetnet22
ii
hidhidhidhid22
UU11 netnetii
outout
…
…UU00
Multiple Outputs
Makes no difference for the perceptron.
Add more outputs off the hidden layer in the multilayer case.
Output Function
out_i(x) = g(sum_j U_ji g(sum_k W_kj x_k))
H: number of "hidden" nodes
Also:
• Use more than one hidden layer
• Use direct input-output weights
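A direct numpy transcription of the output function (array shapes and names are my assumptions; biases are omitted, as in the formula):

    import numpy as np

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W, U):
        # out_i(x) = g(sum_j U_ji * g(sum_k W_kj * x_k))
        hid = g(W.T @ x)      # net_j = sum_k W_kj x_k, j = 1..H
        return g(U.T @ hid)   # net_i = sum_j U_ji hid_j

    D, H, n_out = 3, 4, 2     # example sizes: inputs, hidden nodes, outputs
    rng = np.random.default_rng(0)
    W = rng.normal(size=(D, H))       # W[k, j]
    U = rng.normal(size=(H, n_out))   # U[j, i]
    print(forward(rng.normal(size=D), W, U))  # n_out values, each in (0, 1)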
How Train?
Find a set of weights U, W that minimize
sum_(x,y) sum_i (y_i - out_i(x))^2
using gradient descent.
Incremental version (vs. batch): move the weights a small amount for each training example.
Updating Weights
1. Feed-forward to hidden: net_j = sum_k W_kj x_k; hid_j = g(net_j)
2. Feed-forward to output: net_i = sum_j U_ji hid_j; out_i = g(net_i)
3. Update output weights: Δ_i = g'(net_i) (y_i - out_i); U_ji += α hid_j Δ_i
4. Update hidden weights: Δ_j = g'(net_j) sum_i U_ji Δ_i; W_kj += α x_k Δ_j
(α is the step size.)
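Steps 1-4 as a runnable numpy sketch, trained on XOR. The training data, learning-rate value, initialization, and bias handling (a constant-1 appended to the input and hidden layers, as in the diagrams) are my additions:

    import numpy as np

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    # XOR training set; the third input column is the constant-1 bias.
    X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
    Y = np.array([[0.], [1.], [1.], [0.]])

    rng = np.random.default_rng(1)
    W = rng.normal(scale=0.5, size=(3, 2))   # W[k, j]: input k -> hidden j
    U = rng.normal(scale=0.5, size=(3, 1))   # U[j, i]: hidden j (+ bias) -> output i
    alpha = 0.5                              # step size

    for epoch in range(20000):
        for x, y in zip(X, Y):               # incremental: update per example
            net_j = W.T @ x                  # 1. feed-forward to hidden
            hid = np.append(g(net_j), 1.0)   #    append the constant-1 "hidden" unit
            out = g(U.T @ hid)               # 2. feed-forward to output
            delta_i = out * (1 - out) * (y - out)       # 3. g'(net_i)(y_i - out_i)
            h = hid[:2]
            delta_j = h * (1 - h) * (U[:2] @ delta_i)   # 4. g'(net_j) sum_i U_ji Δ_i
            U += alpha * np.outer(hid, delta_i)         #    U_ji += α hid_j Δ_i
            W += alpha * np.outer(x, delta_j)           #    W_kj += α x_k Δ_j

    hid_all = np.c_[g(X @ W), np.ones(4)]
    print(np.round(g(hid_all @ U).ravel(), 2))  # usually ~[0 1 1 0]; local minima happen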
Multilayer Net (schema)
[Schema: x_k --W_kj--> net_j -> hid_j --U_ji--> net_i -> out_i; out_i is compared with the target y_i, and the errors Δ_i and Δ_j flow back through the U_ji weights.]
Does it Work?
Sort of: lots of practical applications, lots of people play with it. Fun.
However, it can fall prey to the standard problems with local search…
NP-hard to train a 3-node net.
Step Size Issues
Too small? Too big?
Representation Issues
Any continuous function can be represented by a one-hidden-layer net with sufficient hidden nodes.
Any function at all can be represented by a two-hidden-layer net with a sufficient number of hidden nodes.
What's the downside for learning?
Generalization Issues
Pruning weights: "optimal brain damage"
Cross validation
Much, much more to this. Take a class on machine learning.
What to Learn
Representing logical functions using sigmoid units
Majority (net vs. decision tree)
XOR is not linearly separable
Adding layers adds expressibility
Backprop is gradient descent
Homework 10 (due 12/12)
1. Describe a procedure for converting a Boolean formula in CNF (n variables, m clauses) into an equivalent network. How many hidden units does it have?
2. More soon