Transcript of CS224d Deep NLP Lecture 5: Project Information + Neural Networks & Backprop
Richard Socher (richard@metamind.io)
Overview

Today:
• Organizational stuff
• Project tips
• From one-layer to multilayer neural network!
• Max-margin loss and backprop! (This is the hardest lecture of the quarter)
Announcements:
• 1% extra credit for Piazza participation!
• Hint for PSet 1: understand the math and dimensionality, then add print statements, e.g. to check array shapes (a minimal sketch follows this list).
• Student survey sent out last night, please give us feedback to improve the class :)
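As an illustration of that hint, a minimal sketch of such shape-checking print statements (variable names and dimensions here are made up):

```python
import numpy as np

# Hypothetical PSet-1-style shapes: W maps a 10-d input to a 5-d output.
W = np.random.randn(5, 10)
x = np.random.randn(10)

print("W.shape:", W.shape)              # (5, 10)
print("x.shape:", x.shape)              # (10,)
print("(W @ x).shape:", (W @ x).shape)  # (5,) -- the dimensions line up
```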
Class Project
• Most important (40%) and lasting result of the class
• PSet 3 is a little easier, so you have more time
• Start early and clearly define your task and dataset
• Project types:
  1. Apply an existing neural network model to a new task
  2. Implement a complex neural architecture
  3. Come up with a new neural network model
  4. Theory
Class Project: Apply Existing NNets to Tasks
1. Define task:
   • Example: summarization
2. Define dataset:
   1. Search for academic datasets
      • They already have baselines
      • E.g.: Document Understanding Conference (DUC)
   2. Define your own (harder, need more new baselines)
      • If you're a graduate student: connect to your research
      • Summarization, Wikipedia: intro paragraph and rest of large article
      • Be creative: Twitter, blogs, news
Class Project: Apply Existing NNets to Tasks
3. Define your metric
   • Search online for well-established metrics on this task
   • Summarization: ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which measures n-gram overlap with human summaries (a minimal sketch follows this list)
4. Split your dataset!
   • Train/Dev/Test
   • Academic datasets often come pre-split
   • Don't look at the test split until ~1 week before the deadline!
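A minimal sketch of the ROUGE-n recall idea mentioned in step 3; the official ROUGE toolkit adds many variants (ROUGE-L, stemming, multiple references), so treat this only as an illustration of n-gram overlap:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=1):
    """Recall: matched reference n-grams / total reference n-grams."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n("the cat sat on the mat".split(),
              "the cat was on the mat".split()))  # 5/6 = 0.83...
```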
Class Project: Apply Existing NNets to Tasks
5. Establish a baseline
   • Implement the simplest model first (often logistic regression on unigrams and bigrams; see the sketch after this list)
   • Compute metrics on train AND dev
   • Analyze errors
   • If metrics are amazing and there are no errors: done, the problem was too easy, restart :)
6. Implement an existing neural net model
   • Compute metric on train and dev
   • Analyze output and errors
   • Minimum bar for this class
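A sketch of the step-5 baseline using scikit-learn (assumed to be available; the toy sentences stand in for your real train/dev splits):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for real train/dev data (1 = center word is a location).
train_texts = ["museums in Paris are amazing", "not all museums are amazing"]
train_labels = [1, 0]
dev_texts, dev_labels = ["museums in Berlin are great"], [1]

# Logistic regression on unigram + bigram counts.
baseline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
baseline.fit(train_texts, train_labels)
print("train acc:", baseline.score(train_texts, train_labels))
print("dev acc:  ", baseline.score(dev_texts, dev_labels))
```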
Class Project: Apply Existing NNets to Tasks
7. Always be close to your data!
   • Visualize the dataset
   • Collect summary statistics
   • Look at errors
   • Analyze how different hyperparameters affect performance
8. Try out different model variants (a sketch of the first follows this list)
   • Soon you will have more options:
   • Word vector averaging model (neural bag of words)
   • Fixed window neural model
   • Recurrent neural network
   • Recursive neural network
   • Convolutional neural network
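For the first variant above, a sketch of word vector averaging (neural bag of words); the random vectors are stand-ins for vectors you would train or load:

```python
import numpy as np

def nbow(tokens, vectors, dim):
    """Neural bag of words: average the word vectors of the input tokens."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Random 4-d stand-ins for trained word vectors.
vectors = {w: np.random.randn(4) for w in "museums in Paris are amazing".split()}
x = nbow("museums in Paris".split(), vectors, dim=4)
print(x.shape)  # (4,) -- this average is then fed to a classifier
```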
Class Project: A New Model -- Advanced Option
• Do all other steps first (start early!)
• Gain intuition of why existing models are flawed
• Talk to other researchers, come to my office hours a lot
• Implement new models and iterate quickly over ideas
• Set up an efficient experimental framework
• Build simpler new models first
• Example, summarization:
  • Average word vectors per paragraph, then greedy search
  • Implement a language model or autoencoder (introduced later)
  • Stretch goal for a potential paper: generate the summary!
Project Ideas
• Summarization
• NER, like PSet 2 but with larger data: Natural Language Processing (Almost) from Scratch, Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa, http://arxiv.org/abs/1103.0398
• Simple question answering: A Neural Network for Factoid Question Answering over Paragraphs, Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher and Hal Daumé III (EMNLP 2014)
• Image-to-text mapping or generation: Grounded Compositional Semantics for Finding and Describing Images with Sentences, Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, Andrew Y. Ng (TACL 2014), or Deep Visual-Semantic Alignments for Generating Image Descriptions, Andrej Karpathy, Li Fei-Fei
• Entity-level sentiment
• Use DL to solve an NLP challenge on Kaggle, e.g. develop a scoring algorithm for student-written short-answer responses: https://www.kaggle.com/c/asap-sas
Default Project: Sentiment Classification
• Sentiment on movie reviews: http://nlp.stanford.edu/sentiment/
• Lots of deep learning baselines and methods have been tried
A More Powerful Window Classifier
• Revisiting:
  $x_{window} = [\,x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]$
• Assume we want to classify whether the center word is a location or not
A Single Layer Neural Network
• A single layer is a combination of a linear layer and a nonlinearity:
  $z = Wx + b, \qquad a = f(z)$
• The neural activations $a$ can then be used to compute some function
• For instance, an unnormalized score or a softmax probability we care about:
  $s = U^T a$
Summary: Feed-forward Computation
Computing a window's score with a 3-layer neural net:
$s = \mathrm{score}(\text{museums in Paris are amazing}) = U^T f(Wx + b)$
$x_{window} = [\,x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]$
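A numpy sketch of this feed-forward computation, $s = U^T f(Wx + b)$, with made-up sizes (4-d word vectors, 8 hidden units) and a logistic $f$:

```python
import numpy as np

d, h = 4, 8                                       # assumed sizes for illustration
words = ["museums", "in", "Paris", "are", "amazing"]
L = {w: np.random.randn(d) for w in words}        # stand-in word-vector lookup

x_window = np.concatenate([L[w] for w in words])  # shape (5d,) = (20,)
W = np.random.randn(h, 5 * d)
b = np.random.randn(h)
U = np.random.randn(h)

z = W @ x_window + b                              # linear layer
a = 1.0 / (1.0 + np.exp(-z))                      # nonlinearity f (logistic here)
s = U @ a                                         # unnormalized window score
print("s =", s)
```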
Main Intuition for Extra Layer
The layer learns non-linear interactions between the input word vectors.
Example: only if "museums" is the first vector should it matter that "in" is in the second position.
$x_{window} = [\,x_{museums}\;\; x_{in}\;\; x_{Paris}\;\; x_{are}\;\; x_{amazing}\,]$
Summary: Feed-forward Computation
• $s = \mathrm{score}(\text{museums in Paris are amazing})$
• $s_c = \mathrm{score}(\text{Not all museums in Paris})$
• Idea for training objective: make the true window's score larger and the corrupt window's score lower (until they're good enough): minimize
  $J = \max(0,\, 1 - s + s_c)$
• This is continuous, so we can perform SGD
Max-margin Objective Function
• Objective for a single window:
  $J = \max(0,\, 1 - s + s_c)$
• Each window with a location at its center should have a score +1 higher than any window without a location at its center
• xxx |← 1 →| ooo
• For the full objective function: sum over all training windows
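A sketch of this single-window objective, under the same illustrative assumptions (logistic $f$, random parameters):

```python
import numpy as np

def score(x, W, b, U):
    """s = U^T f(Wx + b) with logistic f."""
    a = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return U @ a

def max_margin(x_true, x_corrupt, W, b, U):
    """J = max(0, 1 - s + s_c): the true window should win by a margin of 1."""
    return max(0.0, 1.0 - score(x_true, W, b, U) + score(x_corrupt, W, b, U))

rng = np.random.default_rng(0)
W, b, U = rng.normal(size=(8, 20)), rng.normal(size=8), rng.normal(size=8)
# When J == 0 the margin is satisfied and the window contributes no SGD update.
print(max_margin(rng.normal(size=20), rng.normal(size=20), W, b, U))
```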
Training with Backpropagation
Assuming cost $J$ is > 0, compute the derivatives of $s$ and $s_c$ w.r.t. all the involved variables: $U$, $W$, $b$, $x$
Training with Backpropagation
• Let's consider the derivative of a single weight $W_{ij}$
• This only appears inside $a_i$
• For example: $W_{23}$ is only used to compute $a_2$
[Figure: single-layer net with inputs $x_1, x_2, x_3$ and bias $+1$, hidden units $a_1, a_2$ (weight $W_{23}$, bias $b_2$), and score $s$ with weight $U_2$]
Training with Backpropagation
Derivative of weight $W_{ij}$:
$\frac{\partial s}{\partial W_{ij}} = U_i\, f'(z_i)\, x_j$
Training with Backpropagation
Derivative of single weight $W_{ij}$:
$\frac{\partial s}{\partial W_{ij}} = \underbrace{U_i\, f'(z_i)}_{\text{local error signal}}\ \underbrace{x_j}_{\text{local input signal}}$
where for logistic $f$: $f'(z) = f(z)\,\big(1 - f(z)\big)$
Training with Backpropagation
• From single weight $W_{ij}$ to full $W$:
• We want all combinations of $i = 1, 2$ and $j = 1, 2, 3$ → ?
• Solution: the outer product
  $\frac{\partial s}{\partial W} = \delta\, x^T$
  where $\delta$ is the "responsibility" or error message coming from each activation $a$
Training with Backpropagation
• For the biases $b$, we get:
  $\frac{\partial s}{\partial b} = \delta$
Training with Backpropagation
That's almost backpropagation: it's simply taking derivatives and using the chain rule!
Remaining trick: we can re-use derivatives computed for higher layers in computing derivatives for lower layers!
Example: the last derivatives of the model, the word vectors in $x$
Training with Backpropagation
• Take the derivative of the score with respect to a single element of a word vector
• Now we cannot just take into consideration one $a_i$, because each $x_j$ is connected to all the neurons above, and hence $x_j$ influences the overall score through all of them:
  $\frac{\partial s}{\partial x_j} = \sum_i \underbrace{U_i\, f'(z_i)}_{\text{re-used part of previous derivative}} W_{ij}$
Training with Backpropagation
• With $\frac{\partial s}{\partial x_j} = \sum_i \delta_i W_{ij}$, what is the full gradient? → $\frac{\partial s}{\partial x} = W^T \delta$
• Observation: the error message $\delta$ that arrives at a hidden layer has the same dimensionality as that hidden layer
Putting All Gradients Together
• Remember: the full objective function for each window was:
  $J = \max(0,\, 1 - s + s_c)$
• For example, the gradient for $U$ (when $J > 0$):
  $\frac{\partial J}{\partial U} = -\frac{\partial s}{\partial U} + \frac{\partial s_c}{\partial U} = -a + a_c$
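A numpy sketch collecting these single-layer gradients (same illustrative sizes and logistic $f$ as before); each line mirrors one formula from the preceding slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 20, 8                                # assumed window and hidden sizes
W, b, U = rng.normal(size=(h, n)), rng.normal(size=h), rng.normal(size=h)
x_true, x_corrupt = rng.normal(size=n), rng.normal(size=n)

def forward(x):
    a = 1.0 / (1.0 + np.exp(-(W @ x + b)))  # logistic f, so f'(z) = a(1 - a)
    return a, U @ a                         # activations and score s = U^T a

def grads(x, sign):
    """Gradients of sign * score(x): sign=-1 for the true window, +1 for corrupt."""
    a, _ = forward(x)
    delta = sign * U * a * (1 - a)          # error signal delta_i = U_i f'(z_i)
    dW = np.outer(delta, x)                 # outer product delta x^T
    db = delta
    dx = W.T @ delta                        # gradient w.r.t. this window's input
    dU = sign * a
    return dW, db, dx, dU

_, s = forward(x_true)
_, s_c = forward(x_corrupt)
if 1.0 - s + s_c > 0:                       # J > 0: margin violated, so update
    dW_t, db_t, dx_t, dU_t = grads(x_true, -1.0)
    dW_c, db_c, dx_c, dU_c = grads(x_corrupt, +1.0)
    dW, db, dU = dW_t + dW_c, db_t + db_c, dU_t + dU_c  # shared parameters add up
    # dx_t and dx_c update the word vectors of their own windows separately.
    print(dW.shape, db.shape, dU.shape, dx_t.shape)     # (8, 20) (8,) (8,) (20,)
```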
Two-Layer Neural Nets and Full Backprop
• Let's look at a 2-layer neural network
• Same window definition for $x$
• Same scoring function
• 2 hidden layers (careful, note the superscripts now!)
[Figure: $x \xrightarrow{\,W^{(1)}\,} a^{(2)} \xrightarrow{\,W^{(2)}\,} a^{(3)} \xrightarrow{\,U\,} s$]
Two-Layer Neural Nets and Full Backprop
• Fully written out as one function:
  $s = U^T f\big(W^{(2)}\, f(W^{(1)} x + b^{(1)}) + b^{(2)}\big)$
• Same derivation as before for $W^{(2)}$ (now sitting on $a^{(1)}$)
Two-Layer Neural Nets and Full Backprop
• Same derivation as before for the top $W^{(2)}$:
  $\frac{\partial s}{\partial W^{(2)}_{ij}} = \delta^{(3)}_i\, a^{(2)}_j$
• In matrix notation:
  $\frac{\partial s}{\partial W^{(2)}} = \delta^{(3)} \big(a^{(2)}\big)^T$
  where $\delta^{(3)} = U \circ f'\big(z^{(3)}\big)$ and $\circ$ is the element-wise product, also called the Hadamard product
• Last missing piece for understanding general backprop:
Two-Layer Neural Nets and Full Backprop
• Last missing piece:
• What's the bottom layer's error message $\delta^{(2)}$?
• Similar derivation to the single-layer model
• Main difference: we already have the error $\big(W^{(2)}\big)^T \delta^{(3)}$ arriving at $a^{(2)}$ and need to apply the chain rule again on $f\big(z^{(2)}\big)$
Two-Layer Neural Nets and Full Backprop
• Chain rule for $\delta^{(2)}$:
  $\delta^{(2)} = \Big(\big(W^{(2)}\big)^T \delta^{(3)}\Big) \circ f'\big(z^{(2)}\big)$
• Get intuition by deriving it as if everything were a scalar
• Intuitively, we have to sum over all the nodes coming into the layer
• Putting it all together:
  $\frac{\partial s}{\partial W^{(1)}} = \delta^{(2)}\, x^T$
From the accompanying lecture notes:

The second derivative in eq. 28 for output units is simply

$$\frac{\partial a^{(n_l)}_i}{\partial W^{(n_l-1)}_{ij}} = \frac{\partial}{\partial W^{(n_l-1)}_{ij}} z^{(n_l)}_i = \frac{\partial}{\partial W^{(n_l-1)}_{ij}} \Big( W^{(n_l-1)}_{i\cdot}\, a^{(n_l-1)} \Big) = a^{(n_l-1)}_j. \tag{46}$$

We adopt standard notation and introduce the error $\delta$ related to an output unit:

$$\frac{\partial E_n}{\partial W^{(n_l-1)}_{ij}} = (y_i - t_i)\, a^{(n_l-1)}_j = \delta^{(n_l)}_i\, a^{(n_l-1)}_j. \tag{47}$$

So far, we only computed errors for output units; now we will derive $\delta$'s for normal hidden units and show how these errors are backpropagated to compute weight derivatives of lower levels. We will start with second-to-top layer weights, from which a generalization to arbitrarily deep layers will become obvious. Similar to eq. 28, we start with the error derivative:

$$\frac{\partial E}{\partial W^{(n_l-2)}_{ij}} = \sum_n \underbrace{\frac{\partial E_n}{\partial a^{(n_l)}}}_{\delta^{(n_l)}} \frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} + \lambda W^{(n_l-2)}_{ij}. \tag{48}$$

Now,

$$\begin{aligned}
\big(\delta^{(n_l)}\big)^T \frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} &= \big(\delta^{(n_l)}\big)^T \frac{\partial z^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} && (49)\\
&= \big(\delta^{(n_l)}\big)^T \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, W^{(n_l-1)} a^{(n_l-1)} && (50)\\
&= \big(\delta^{(n_l)}\big)^T \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, W^{(n_l-1)}_{\cdot i}\, a^{(n_l-1)}_i && (51)\\
&= \big(\delta^{(n_l)}\big)^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, a^{(n_l-1)}_i && (52)\\
&= \big(\delta^{(n_l)}\big)^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, f\big(z^{(n_l-1)}_i\big) && (53)\\
&= \big(\delta^{(n_l)}\big)^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, f\big(W^{(n_l-2)}_{i\cdot}\, a^{(n_l-2)}\big) && (54)\\
&= \big(\delta^{(n_l)}\big)^T W^{(n_l-1)}_{\cdot i}\, f'\big(z^{(n_l-1)}_i\big)\, a^{(n_l-2)}_j && (55)\\
&= \Big( \big(\delta^{(n_l)}\big)^T W^{(n_l-1)}_{\cdot i} \Big) f'\big(z^{(n_l-1)}_i\big)\, a^{(n_l-2)}_j && (56)\\
&= \underbrace{\Bigg( \sum_{j=1}^{s_{l+1}} W^{(n_l-1)}_{ji}\, \delta^{(n_l)}_j \Bigg) f'\big(z^{(n_l-1)}_i\big)}_{\delta^{(n_l-1)}_i}\, a^{(n_l-2)}_j && (57)\\
&= \delta^{(n_l-1)}_i\, a^{(n_l-2)}_j && (58)
\end{aligned}$$

where we used in the first line that the top layer is linear. This is a very detailed account of essentially just the chain rule.

So, we can write the $\delta$ errors of all layers $l$ (except the top layer) in vector format, using the Hadamard product $\circ$:

$$\delta^{(l)} = \Big( \big(W^{(l)}\big)^T \delta^{(l+1)} \Big) \circ f'\big(z^{(l)}\big). \tag{59}$$
Two-Layer Neural Nets and Full Backprop
• Last missing piece:
• In general, for any matrix $W^{(l)}$ at internal layer $l$ and any error with regularization $E_R$, all of backprop in standard multilayer neural networks boils down to 2 equations:
  $$\delta^{(l)} = \Big(\big(W^{(l)}\big)^T \delta^{(l+1)}\Big) \circ f'\big(z^{(l)}\big)$$
  $$\frac{\partial E_R}{\partial W^{(l)}} = \delta^{(l+1)} \big(a^{(l)}\big)^T + \lambda W^{(l)}$$
• Top and bottom layers have simpler $\delta$
Continuing the lecture notes: in eq. 59, the sigmoid derivative from eq. 14 gives $f'\big(z^{(l)}\big) = \big(1 - a^{(l)}\big)\, a^{(l)}$. Using that definition, we get the hidden-layer backprop derivatives:

$$\frac{\partial}{\partial W^{(l)}_{ij}} E_R = a^{(l)}_j\, \delta^{(l+1)}_i + \lambda W^{(l)}_{ij} \tag{60}$$

which in one simplified vector notation becomes:

$$\frac{\partial}{\partial W^{(l)}} E_R = \delta^{(l+1)} \big(a^{(l)}\big)^T + \lambda W^{(l)}. \tag{62}$$

In summary, the backprop procedure consists of four steps:

1. Apply an input $x_n$ and forward propagate it through the network to get the hidden and output activations using eq. 18.
2. Evaluate $\delta^{(n_l)}$ for output units using eq. 42.
3. Backpropagate the $\delta$'s to obtain a $\delta^{(l)}$ for each hidden layer in the network using eq. 59.
4. Evaluate the required derivatives with eq. 62 and update all the weights using an optimization procedure such as conjugate gradient or L-BFGS. CG seems to be faster and work better when using mini-batches of training data to estimate the derivatives.
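A minimal numpy sketch of these four steps for a small fully connected net; the sizes are made up, and plain SGD stands in for the CG/L-BFGS optimizers mentioned in step 4:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
sizes = [20, 8, 5]                          # assumed sizes: input, hidden, output
Ws = [0.1 * rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
lam, lr = 1e-4, 0.1                         # regularization weight and step size

def backprop_step(x, t):
    # 1. Forward propagate to get hidden and output activations.
    a = [x]
    for W in Ws[:-1]:
        a.append(sigmoid(W @ a[-1]))
    a.append(Ws[-1] @ a[-1])                # linear output layer
    # 2. Output-unit error, as in eq. 47: delta = y - t for a linear top layer.
    deltas = [a[-1] - t]
    # 3. Backpropagate: delta(l) = (W(l)^T delta(l+1)) o f'(z(l))  (eq. 59),
    #    with f'(z(l)) = (1 - a(l)) a(l) for the sigmoid.
    for W, al in zip(reversed(Ws[1:]), reversed(a[1:-1])):
        deltas.append((W.T @ deltas[-1]) * al * (1 - al))
    deltas.reverse()
    # 4. Derivatives delta(l+1) (a(l))^T + lam W(l)  (eq. 62), then SGD update.
    for i, (W, d, al) in enumerate(zip(Ws, deltas, a[:-1])):
        Ws[i] = W - lr * (np.outer(d, al) + lam * W)

backprop_step(rng.normal(size=20), rng.normal(size=5))
```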
If you have any further questions or found errors, please send an email to [email protected]

5 Recursive Neural Networks

Same as backprop in the previous section, but splitting error derivatives and noting that the derivatives of the same $W$ at each node can all be added up. Lastly, the deltas from the parent node and possible deltas from a softmax classifier at each node are just added.
Visualization of Intuition
• Let's say we want the gradients (e.g. $\partial s / \partial W^{(1)}$), with the previous layer and $f = \sigma$

Our first example: backpropagation using error vectors

[Figure: chain $z^{(1)} \to a^{(1)} \xrightarrow{\,W^{(1)}\,} z^{(2)} \xrightarrow{\,\sigma\,} a^{(2)} \xrightarrow{\,W^{(2)}\,} z^{(3)} = s$, with error vector $\delta^{(3)}$ at the output]

Gradient w.r.t. $W^{(2)} = \delta^{(3)} \big(a^{(2)}\big)^T$
Visualization of Intuition
Our first example: backpropagation using error vectors

[Figure: same chain; $\delta^{(3)}$ is sent back through $W^{(2)}$ as $\big(W^{(2)}\big)^T \delta^{(3)}$]

• Re-using the $\delta^{(3)}$ for downstream updates.
• Moving the error vector across an affine transformation simply requires multiplication with the transpose of the forward matrix.
• Notice that the dimensions will line up perfectly too!
Visualization of Intuition
Our first example: backpropagation using error vectors

[Figure: $\sigma'\big(z^{(2)}\big) \circ \big(W^{(2)}\big)^T \delta^{(3)} = \delta^{(2)}$]

• Moving the error vector across a point-wise non-linearity requires point-wise multiplication with the local gradient of the non-linearity.
Visualization of Intuition
Our first example: backpropagation using error vectors

[Figure: $\delta^{(2)}$ yields the gradient for $W^{(1)}$ and is sent further down as $\big(W^{(1)}\big)^T \delta^{(2)}$]

Gradient w.r.t. $W^{(1)} = \delta^{(2)} \big(a^{(1)}\big)^T$
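A numpy sketch of this error-vector bookkeeping for the pictured chain (widths are made up; the top layer is the linear score, so its incoming error is simply 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 20, 8                              # assumed widths of a(1) and a(2)
W1 = rng.normal(size=(n2, n1))              # W(1): a(1) -> z(2)
W2 = rng.normal(size=(1, n2))               # W(2): a(2) -> z(3) = s (scalar)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward, in the slides' naming: a(1) -> z(2) -> a(2) -> z(3) = s.
a1 = rng.normal(size=n1)
z2 = W1 @ a1
a2 = sigmoid(z2)
s = (W2 @ a2).item()

# Backward, with error vectors:
delta3 = np.array([1.0])                    # ds/dz(3) = 1 at the scalar output
gradW2 = np.outer(delta3, a2)               # grad w.r.t. W(2) = delta(3) a(2)^T
# Across the affine map: multiply by the transpose of the forward matrix.
# Across the point-wise nonlinearity: multiply by sigma'(z(2)) = a(2)(1 - a(2)).
delta2 = (W2.T @ delta3) * a2 * (1 - a2)    # delta(2) = sigma'(z(2)) o W(2)^T delta(3)
gradW1 = np.outer(delta2, a1)               # grad w.r.t. W(1) = delta(2) a(1)^T
print(gradW2.shape, delta2.shape, gradW1.shape)  # (1, 8) (8,) (8, 20) -- dims line up
```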
Backpropagation (Another Explanation)
• Compute the gradient of the example-wise loss w.r.t. the parameters
• Simply applying the derivative chain rule wisely
• If computing the loss(example, parameters) is O(n) computation, then so is computing the gradient
Simple Chain Rule
$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}$
Multiple Paths Chain Rule
$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y_1}\frac{\partial y_1}{\partial x} + \frac{\partial z}{\partial y_2}\frac{\partial y_2}{\partial x}$
Multiple Paths Chain Rule - General
$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\frac{\partial y_i}{\partial x}$
Chain Rule in Flow Graph
Flow graph: any directed acyclic graph
• node = computation result
• arc = computation dependency
$\{y_1, y_2, \ldots, y_n\}$ = successors of $x$:
$\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\frac{\partial y_i}{\partial x}$
Back-Prop in Multi-Layer Net
$h = \mathrm{sigmoid}(Vx)$
Back-Prop in General Flow Graph
Single scalar output
1. Fprop: visit nodes in topo-sort order
   • Compute value of node given predecessors
2. Bprop:
   • initialize output gradient = 1
   • visit nodes in reverse order:
     compute gradient w.r.t. each node using gradient w.r.t. successors
$\{y_1, y_2, \ldots, y_n\}$ = successors of $x$
Automatic Differentiation
• The gradient computation can be automatically inferred from the symbolic expression of the fprop.
• Each node type needs to know how to compute its output and how to compute the gradient w.r.t. its inputs given the gradient w.r.t. its output.
• Easy and fast prototyping (a minimal sketch follows)
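A minimal sketch of this idea: each node type implements its own forward and backward, and the driver visits nodes in topological order for fprop and in reverse for bprop (class and method names are invented for illustration):

```python
import math

class Node:
    """A flow-graph node: computes its value in fprop and, given the gradient
    w.r.t. its output, accumulates gradients w.r.t. its inputs in bprop."""
    def __init__(self, inputs=()):
        self.inputs, self.value, self.grad = list(inputs), None, 0.0
    def forward(self): pass
    def backward(self): pass

class Input(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value

class Mul(Node):
    def forward(self):
        a, b = self.inputs
        self.value = a.value * b.value
    def backward(self):
        a, b = self.inputs
        a.grad += self.grad * b.value   # += sums contributions over successors
        b.grad += self.grad * a.value

class Sigmoid(Node):
    def forward(self):
        self.value = 1.0 / (1.0 + math.exp(-self.inputs[0].value))
    def backward(self):
        self.inputs[0].grad += self.grad * self.value * (1.0 - self.value)

x, w = Input(2.0), Input(-0.5)
m = Mul([x, w])
out = Sigmoid([m])
graph = [x, w, m, out]                  # already in topological order
for node in graph:                      # fprop
    node.forward()
out.grad = 1.0                          # single scalar output
for node in reversed(graph):            # bprop
    node.backward()
print(out.value, x.grad, w.grad)
```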
Summary
• Congrats!
• You survived the hardest part of this class.
• Everything else from now on is just more matrix multiplications and backprop :)
• Next up: Recurrent Neural Networks