Neural Networks: Backpropagation -...

MachineLearning

NeuralNetworks:Backpropagation

1BasedonslidesandmaterialfromGeoffreyHinton,RichardSocher,DanRoth,Yoav Goldberg,ShaiShalev-Shwartz andShaiBen-David,andothers

Thislecture

• Whatisaneuralnetwork?

• Predictingwithaneuralnetwork

• Trainingneuralnetworks– Backpropagation

• Practicalconcerns

3

Traininganeuralnetwork

• Given– Anetworkarchitecture(layoutofneurons,theirconnectivityand

activations)– Adatasetoflabeledexamples

• S={(xi,yi)}

• Thegoal:Learntheweightsoftheneuralnetwork

• Remember:Forafixedarchitecture,aneuralnetworkisafunctionparameterizedbyitsweights– Prediction:! = $$(&,()

4

Recall:Learningaslossminimization

WehaveaclassifierNN thatiscompletelydefinedbyitsweightsLearntheweightsbyminimizingaloss*

6

Perhapswitharegularizermin(.*($$ &/,( , !/)�

/

Sofar,wesawthatthisstrategyworkedfor:1. LogisticRegression2. SupportVectorMachines3. Perceptron4. LMSregression

Allofthesearelinearmodels

Sameideafornon-linearmodelstoo!

Eachminimizesadifferentlossfunction

Backtoourrunningexample

7

output

Givenaninputx,howistheoutputpredicted

y = 2345 + 244

5 74 + 2845 78

78 = 9(238: + 248

: ;4 + 288: ;8)

z4 = 9(234: + 244

: ;4 + 284: ;8)

Backtoourrunningexample

9

output

Givenaninputx,howistheoutputpredicted

Supposethetruelabelforthisexampleisanumber!/

Wecanwritethesquarelossforthisexampleas:

* = 12!– !/ 8

y = 2345 + 244

5 74 + 2845 78

78 = 9(238: + 248

: ;4 + 288: ;8)

z4 = 9(234: + 244

: ;4 + 284: ;8)

Learningaslossminimization

WehaveaclassifierNN thatiscompletelydefinedbyitsweightsLearntheweightsbyminimizingaloss*

10

Perhapswitharegularizermin(.*($$ ;/, 2 , !/)�

/

Howdowesolvetheoptimizationproblem?

Stochasticgradientdescent

GivenatrainingsetS={(xi,yi)},x 2 <d

1. Initializeparametersw2. Forepoch=1…T:

1. Shufflethetrainingset2. Foreachtrainingexample(xi,yi)2 S:

• TreatthisexampleastheentiredatasetComputethegradientofthelossA*($$ &/, ( , !/)

• Update:( ← (− DEA*($$ &/, ( , !/))

3. Returnw

11

min(.*($$ ;/, 2 , !/)

�

/






• Update:( ← (− DEA*($$ &/, ( , !/))

3. Returnw

12

min(.*($$ ;/, 2 , !/)

�

/






• Update:( ← (− DEA*($$ &/, ( , !/))

3. Returnw

13

min(.*($$ ;/, 2 , !/)

�

/






• Update:( ← (− DEA*($$ &/, ( , !/))

3. Returnw

14

min(.*($$ ;/, 2 , !/)

�

/






• Update:( ← (− DEA*($$ &/, ( , !/))

3. Returnw

15

min(.*($$ ;/, 2 , !/)

�

/






• Update:( ← (− DEA*($$ &/, ( , !/))

3. Returnw

17

°t:learningrate,manytweakspossible

Theobjectiveisnotconvex.Initializationcanbeimportant

min(.*($$ ;/, 2 , !/)

�

/






• Update:( ← (− DEA*($$ &/, ( , !/))

3. Returnw

18



min(.*($$ ;/, 2 , !/)

�

/

Havewesolvedeverything?

Thederivativeofthelossfunction?

Iftheneuralnetworkisadifferentiablefunction,wecanfindthegradient

– Ormaybeitssub-gradient– Thisisdecidedbytheactivationfunctionsandthelossfunction

ItwaseasyforSVMsandlogisticregression– Onlyonelayer

Buthowdowefindthesub-gradientofamorecomplexfunction?

– Eg:Arecentpaperuseda~150layerneuralnetworkforimageclassification!

19Weneedanefficientalgorithm:Backpropagation

A*($$ &/, ( , !/)








A*($$ &/, ( , !/)

Checkpoint

22

Wherearewe

Ifwehaveaneuralnetwork(structure,activationsandweights),wecanmakeapredictionforaninput

Ifwehadthetruelabeloftheinput,thenwecandefinethelossforthatexample

Ifwecantakethederivativeofthelosswithrespecttoeachoftheweights,wecantakeagradientstepinSGD

Questions?

Checkpoint

23

Wherearewe




Questions?

Checkpoint

24

Wherearewe




Questions?

Checkpoint

25

Wherearewe




Questions?

Reminder:Chainruleforderivatives

– If7 isafunctionof!and! isafunctionof;• Then7 isafunctionof;,aswell

– Question:howtofindFGFH

27SlidecourtesyRichardSocher


– If7 =afunctionof!4 +afunctionof!8,andthe!/’sarefunctionsof;• Then7 isafunctionof;,aswell




– If7 isasumoffunctionsof!/’s,andthe!/’sarefunctionsof;• Then7 isafunctionof;,aswell



Backpropagation

30

output

* = 12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78

78 = 9(238: + 248

: ;4 + 288: ;8)

z4 = 9(234: + 244

: ;4 + 284: ;8)

Backpropagation

31

WewanttocomputeFIFJKL

M andFI

FJKLN

output

* = 12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78

78 = 9(238: + 248

: ;4 + 288: ;8)

z4 = 9(234: + 244

: ;4 + 284: ;8)

Backpropagation

32

Applyingthechainruletocomputethegradient(Andrememberingpartialcomputationsalongthewaytospeedupthings)

WewanttocomputeFIFJKL

M andFI

FJKLN

output

* = 12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78

78 = 9(238: + 248

: ;4 + 288: ;8)

z4 = 9(234: + 244

: ;4 + 284: ;8)

Outputlayer

33

output* =

12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78

O*O234

5 = O*O!

O!O234

3

Backpropagationexample

Outputlayer

34

output* =

12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78

O*O234

5 = O*O!

O!O234

5


Outputlayer

35

output* =

12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78

O*O234

5 = O*O!

O!O234

5

O*O!

= ! − !∗


Outputlayer

36

output* =

12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78

O*O234

5 = O*O!

O!O234

5

O*O!

= ! − !∗O!O234

5 = 1


Outputlayer

37

O*O244

5 = O*O!

O!O244

3

output* =

12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78


Outputlayer

38

O*O244

5 = O*O!

O!O244

5

output* =

12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78


Outputlayer

39

O*O244

5 = O*O!

O!O244

5

O*O!

= ! − !∗

output* =

12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78


Outputlayer

40

O*O244

5 = O*O!

O!O244

5

O*O!

= ! − !∗O!O234

5 = 74

output* =

12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78


Outputlayer

41

O*O244

5 = O*O!

O!O244

5

O*O!

= ! − !∗O!O234

5 = 74

output* =

12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78

Wehavealreadycomputedthispartialderivativeforthepreviouscase

Cachetospeedup!


Hiddenlayerderivatives

42

output

* = 12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78

78 = 9(238: + 248

: ;4 + 288: ;8)

z4 = 9(234: + 244

: ;4 + 284: ;8)



43

WewantFIFJPP

N

output

* = 12!–!∗ 8

y = 2345 + 244

5 74 + 2845 78

78 = 9(238: + 248

: ;4 + 288: ;8)

z4 = 9(234: + 244

: ;4 + 284: ;8)



44

O*O288

: = O*O!

O!O288

:

Backpropagationexample * = 12!–!∗ 8

Hiddenlayer

45

O*O288

: = O*O!

O!O288

:

= O*O!

OO288

: (2345 + 244

5 74 + 2845 78)

y = 2345 + 244

5 74 + 2845 78


Hiddenlayer

46

O*O288

: = O*O!

O!O288

:

= O*O!

OO288

: (2345 + 244

5 74 + 2845 78)

y = 2345 + 244

5 74 + 2845 78

= O*O!(244

5 OO288

: 74 + 2845 OO288

: 78)


Hiddenlayer

47

O*O288

: = O*O!

O!O288

:

= O*O!

OO288

: (2345 + 244

5 74 + 2845 78)

y = 2345 + 244

5 74 + 2845 78

= O*O!(244

5 OO288

: 74 + 2845 OO288

: 78)

74 isnotafunctionof288:

0


Hiddenlayer

48

O*O288

: = O*O!

O!O288

:

= O*O!

OO288

: (2345 + 244

5 74 + 2845 78)

y = 2345 + 244

5 74 + 2845 78

= O*O!2845 O78O288

:


Hiddenlayer

49

O*O288

: = O*O!

O!O288

:

= O*O!

OO288

: (2345 + 244

5 74 + 2845 78)

= O*O!2845 O78O288

:

78 = 9(238: + 248

: ;4 + 288: ;8)


Hiddenlayer

50

O*O288

: = O*O!

O!O288

:

= O*O!

OO288

: (2345 + 244

5 74 + 2845 78)

= O*O!2845 O78O288

:

78 = 9(238: + 248

: ;4 + 288: ;8)

Callthiss


Hiddenlayer

51

O*O288

: = O*O!

O!O288

:

= O*O!

OO288

: (2345 + 244

5 74 + 2845 78)

= O*O!2845 O78O288

:

78 = 9(238: + 248

: ;4 + 288: ;8)

Callthiss

=O*O!2845 O78OQ

OQO288

:


Hiddenlayer

52

O*O288

: = O*O!2845 O78OQ

OQO288

:

78 = 9(238: + 248

: ;4 + 288: ;8)

Callthiss


(Frompreviousslide)

Hiddenlayer

53

O*O288

: = O*O!2845 O78OQ

OQO288

:

78 = 9(238: + 248

: ;4 + 288: ;8)

Callthiss

Eachofthesepartialderivativesiseasy


Hiddenlayer

54

O*O288

: = O*O!2845 O78OQ

OQO288

:

78 = 9(238: + 248

: ;4 + 288: ;8)

Callthiss

O*O!

= ! − !∗



Hiddenlayer

55

O*O288

: = O*O!2845 O78OQ

OQO288

:

78 = 9(238: + 248

: ;4 + 288: ;8)

Callthiss

O*O!

= ! − !∗

O78OQ

= 78(1 − 78)Why?Because78 Qisthelogisticfunctionwehavealreadyseen



Hiddenlayer

56

O*O288

: = O*O!2845 O78OQ

OQO288

:

78 = 9(238: + 248

: ;4 + 288: ;8)

Callthiss

O*O!

= ! − !∗

O78OQ

= 78(1 − 78)Why?Because78 QisthelogisticfunctionwehavealreadyseenOQ

O288: = ;8



Hiddenlayer

57

O*O288

: = O*O!2845 O78OQ

OQO288

:

78 = 9(238: + 248

: ;4 + 288: ;8)

Callthiss

O*O!

= ! − !∗

O78OQ

= 78(1 − 78)Why?Because78 QisthelogisticfunctionwehavealreadyseenOQ

O288: = ;8

Moreimportant:Wehavealreadycomputedmanyofthesepartialderivativesbecauseweareproceedingfromtoptobottom(i.e.backwards)



TheBackpropagationAlgorithm

Thesamealgorithmworksformultiplelayers

Repeatedapplicationofthechainruleforpartialderivatives– Firstperformforwardpassfrominputstotheoutput– Computeloss

– Fromtheloss,proceedbackwardstocomputepartialderivativesusingthechainrule

– Cachepartialderivativesasyoucomputethem• Willbeusedforlowerlayers

58

Mechanizinglearning

• Backpropagationgivesyouthegradientthatwillbeusedforgradientdescent– SGDgivesusagenericlearningalgorithm– Backpropagationisagenericmethodforcomputingpartialderivatives

• Arecursivealgorithmthatproceedsfromthetopofthenetworktothebottom

• Modernneuralnetworklibrariesimplementautomaticdifferentiationusingbackpropagation– Allowseasyexplorationofnetworkarchitectures– Don’thavetokeepderivingthegradientsbyhandeachtime

59





• Treatthisexampleastheentiredataset

• ComputethegradientofthelossA*($$ &/,( , !/) usingbackpropagation

• Update:( ← (− DEA*($$ &/,( , !/))

3. Returnw

60



min(.*($$ ;/, 2 , !/)

�

/

Neural Networks: Backpropagation -...

Documents

Transcript of Neural Networks: Backpropagation -...