Backpropagation
Introduction to Artificial Intelligence
COS302
Michael L. Littman
Fall 2001
Administration
Questions, concerns?
Classification Percept.
[Diagram: inputs x1, x2, x3, …, xD and a constant input 1 are multiplied by weights w1, w2, w3, …, wD and w0, summed into net, and passed through the squashing function g to give out.]
Perceptrons
Recall that the squashing function makes the output look more like bits: 0 or 1 decisions.
What if we give it inputs that are also bits?
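The slides never pin down g; a standard choice, consistent with outputs in (0, 1) and the 1/2 threshold used on later slides, is the logistic sigmoid. A minimal sketch (the name g is the slide's; everything else is illustrative):

    import math

    def g(z):
        # Logistic sigmoid: squashes any real "net" value into (0, 1).
        return 1.0 / (1.0 + math.exp(-z))

    print(g(10))    # ~0.99995 -> reads as bit 1
    print(g(-10))   # ~0.00005 -> reads as bit 0
    print(g(0))     # 0.5, the decision boundary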
A Boolean Function

A B C D E F G | out
--------------+----
1 0 1 0 1 0 1 |  0
0 1 1 0 0 0 1 |  0
0 0 1 0 0 1 0 |  0
1 0 0 0 1 0 0 |  1
0 0 1 1 0 0 0 |  1
1 1 1 0 1 0 1 |  0
0 1 0 1 0 0 1 |  1
1 1 1 1 1 0 1 |  1
1 1 1 1 1 1 1 |  1
1 1 1 0 0 1 1 |  0
Think Graphically
Can a perceptron learn this?
[Figure: the function drawn on axes C and D; three cells are labeled 1 and one is labeled 0.]
Ands and Ors
out(x) = g(sum_k w_k x_k)
How can we set the weights to represent (v1)(v2)(~v7)? (AND)
  w_i = 0, except w1 = 10, w2 = 10, w7 = -10, w0 = -15
  (net = 5 when the AND is satisfied, at most -5 otherwise)
How about ~v3 + v4 + ~v8? (OR)
  w_i = 0, except w3 = -10, w4 = 10, w8 = -10, w0 = 15
  (net = -5 when no literal holds, at least 5 otherwise)
Majority
Are at least half the bits on?
Set all weights to 1, w0 to -n/2.

A B C D E F G | out
--------------+----
1 0 1 0 1 0 1 |  1
0 1 1 0 0 0 1 |  0
0 0 1 0 0 1 0 |  0
1 0 0 0 1 0 0 |  0
1 1 1 0 1 0 1 |  1
0 1 0 1 0 0 1 |  0
1 1 1 1 1 0 1 |  1
1 1 1 1 1 1 1 |  1

Representation size using a decision tree?
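The rule as a sketch (the threshold test net > 0 is equivalent to g(net) > 1/2 for a sigmoid g):

    def majority(bits):
        # All weights 1, threshold weight w0 = -n/2.
        n = len(bits)
        return 1 if sum(bits) - n / 2 > 0 else 0

    # Rows from the table above (n = 7, so "on" wins at 4 or more):
    print(majority([1, 0, 1, 0, 1, 0, 1]))  # 1 (four bits on)
    print(majority([0, 1, 1, 0, 0, 0, 1]))  # 0 (three bits on)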
Sweet Sixteen?
The 16 Boolean functions of two inputs, paired with their complements:
ab          (~a)+(~b)
a(~b)       (~a)+b
(~a)b       a+(~b)
(~a)(~b)    a+b
a           ~a
b           ~b
1           0
a = b       a exclusive-or b (a ⊕ b)
XOR Constraints

A B | out
0 0 |  0    g(w0) < 1/2
0 1 |  1    g(w_B + w0) > 1/2
1 0 |  1    g(w_A + w0) > 1/2
1 1 |  0    g(w_A + w_B + w0) < 1/2

So w0 < 0, w_A + w0 > 0, w_B + w0 > 0. Adding the last two gives w_A + w_B + 2w0 > 0, i.e., w_A + w_B + w0 > -w0 > 0; but the fourth constraint requires w_A + w_B + w0 < 0. That forces 0 < w_A + w_B + w0 < 0: a contradiction, so no such weights exist.
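The contradiction can also be seen numerically. The brute-force sketch below (illustrative, not a proof) finds no integer weights in [-20, 20] satisfying all four XOR constraints, while many settings compute AND:

    # Can a single unit, out = [w_A*A + w_B*B + w0 > 0], compute XOR?
    def fits(table, w_A, w_B, w0):
        return all((w_A * A + w_B * B + w0 > 0) == bool(y)
                   for (A, B), y in table.items())

    XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}

    grid = range(-20, 21)
    print(sum(fits(XOR, a, b, c) for a in grid for b in grid for c in grid))  # 0
    print(sum(fits(AND, a, b, c) for a in grid for b in grid for c in grid))  # many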
Linearly Separable
XOR problematic
[Figure: the four input combinations plotted in the plane (axes C and D), labeled 0, 0, 1, 1; the "?" marks where a single separating line would have to go.]
How Represent XOR?
A xor B = (A+B)(~A+~B)
[Network diagram: inputs A and B (plus constant-1 inputs) feed two hidden units. c1 = g(10A + 10B - 5) computes A+B; c2 = g(-10A - 10B + 15) computes ~A+~B; the output unit out = g(10 c1 + 10 c2 - 15) computes c1 AND c2.]
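Tracing the network with the logistic sigmoid confirms the construction:

    import math

    def g(z):
        return 1.0 / (1.0 + math.exp(-z))

    def xor_net(A, B):
        c1 = g(10 * A + 10 * B - 5)        # c1 ~ A + B (OR)
        c2 = g(-10 * A - 10 * B + 15)      # c2 ~ ~A + ~B (NAND)
        return g(10 * c1 + 10 * c2 - 15)   # out ~ c1 AND c2

    for A in (0, 1):
        for B in (0, 1):
            print(A, B, round(xor_net(A, B)))
    # -> 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0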
Requiem for a Perceptron
Rosenblatt proved that a perceptron will learn any linearly separable function.
Minsky and Papert (1969) in Perceptrons: "there is no reason to suppose that any of the virtues carry over to the many-layered version."
Backpropagation
Bryson and Ho (1969, same year) described a training procedure for multilayer networks. Went unnoticed.
Multiply rediscovered in the 1980s.
Multilayer NetMultilayer Net
xx11
netnet11ii
xx22 xx33 xxDD…WW1111
11WW1212WW1313
hidhid11gg
netnetHHiinetnet22
ii
hidhidhidhid22
UU11 netnetii
outout
…
…UU00
Multiple Outputs
Makes no difference for the perceptron.
Add more outputs off the hidden layer in the multilayer case.
Output Function
out_i(x) = g(sum_j U_ji g(sum_k W_kj x_k))
H: number of "hidden" nodes
Also:
• Use more than one hidden layer
• Use direct input-output weights
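A direct numpy transcription of the output function (array shapes and names are my assumptions; biases are omitted, as in the formula):

    import numpy as np

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W, U):
        # out_i(x) = g(sum_j U_ji * g(sum_k W_kj * x_k))
        hid = g(W.T @ x)      # net_j = sum_k W_kj x_k, j = 1..H
        return g(U.T @ hid)   # net_i = sum_j U_ji hid_j

    D, H, n_out = 3, 4, 2     # example sizes: inputs, hidden nodes, outputs
    rng = np.random.default_rng(0)
    W = rng.normal(size=(D, H))       # W[k, j]
    U = rng.normal(size=(H, n_out))   # U[j, i]
    print(forward(rng.normal(size=D), W, U))  # n_out values, each in (0, 1)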
How Train?
Find a set of weights U, W that minimize
sum_(x,y) sum_i (y_i - out_i(x))^2
using gradient descent.
Incremental version (vs. batch): move the weights a small amount for each training example.
Updating Weights
1. Feed-forward to hidden: net_j = sum_k W_kj x_k; hid_j = g(net_j)
2. Feed-forward to output: net_i = sum_j U_ji hid_j; out_i = g(net_i)
3. Update output weights: Δ_i = g'(net_i) (y_i - out_i); U_ji += α hid_j Δ_i
4. Update hidden weights: Δ_j = g'(net_j) sum_i U_ji Δ_i; W_kj += α x_k Δ_j
(α is the step size.)
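Steps 1-4 as a runnable numpy sketch, trained on XOR. The training data, learning-rate value, initialization, and bias handling (a constant-1 appended to the input and hidden layers, as in the diagrams) are my additions:

    import numpy as np

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    # XOR training set; the third input column is the constant-1 bias.
    X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
    Y = np.array([[0.], [1.], [1.], [0.]])

    rng = np.random.default_rng(1)
    W = rng.normal(scale=0.5, size=(3, 2))   # W[k, j]: input k -> hidden j
    U = rng.normal(scale=0.5, size=(3, 1))   # U[j, i]: hidden j (+ bias) -> output i
    alpha = 0.5                              # step size

    for epoch in range(20000):
        for x, y in zip(X, Y):               # incremental: update per example
            net_j = W.T @ x                  # 1. feed-forward to hidden
            hid = np.append(g(net_j), 1.0)   #    append the constant-1 "hidden" unit
            out = g(U.T @ hid)               # 2. feed-forward to output
            delta_i = out * (1 - out) * (y - out)       # 3. g'(net_i)(y_i - out_i)
            h = hid[:2]
            delta_j = h * (1 - h) * (U[:2] @ delta_i)   # 4. g'(net_j) sum_i U_ji Δ_i
            U += alpha * np.outer(hid, delta_i)         #    U_ji += α hid_j Δ_i
            W += alpha * np.outer(x, delta_j)           #    W_kj += α x_k Δ_j

    hid_all = np.c_[g(X @ W), np.ones(4)]
    print(np.round(g(hid_all @ U).ravel(), 2))  # usually ~[0 1 1 0]; local minima happen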
Multilayer Net (schema)
[Schema: x_k --W_kj--> net_j -> hid_j --U_ji--> net_i -> out_i; out_i is compared with the target y_i, and the errors Δ_i and Δ_j flow back through the U_ji weights.]
Does it Work?
Sort of: lots of practical applications, lots of people play with it. Fun.
However, it can fall prey to the standard problems with local search…
NP-hard to train a 3-node net.
Step Size Issues
Too small? Too big?
Representation Issues
Any continuous function can be represented by a one-hidden-layer net with sufficient hidden nodes.
Any function at all can be represented by a two-hidden-layer net with a sufficient number of hidden nodes.
What's the downside for learning?
Generalization Issues
Pruning weights: "optimal brain damage"
Cross validation
Much, much more to this. Take a class on machine learning.
What to Learn
Representing logical functions using sigmoid units
Majority (net vs. decision tree)
XOR is not linearly separable
Adding layers adds expressibility
Backprop is gradient descent
Homework 10 (due 12/12)
1. Describe a procedure for converting a Boolean formula in CNF (n variables, m clauses) into an equivalent network. How many hidden units does it have?
2. More soon