DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

Transcript of wcohen/10-605/deep-1.pdf (72 pages)

Page 1:

DEEP LEARNING AND NEURAL NETWORKS: BACKGROUND AND HISTORY

Page 2:

On-line Resources
• http://neuralnetworksanddeeplearning.com/index.html - online book by Michael Nielsen
• http://matlabtricks.com/post-5/3x3-convolution-kernels-with-online-demo - online demo of convolutions
• https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html - demo of a CNN
• http://scs.ryerson.ca/~aharley/vis/conv/ - 3D visualization
• http://cs231n.github.io/ - Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition
• http://www.deeplearningbook.org/ - MIT Press book by Goodfellow, Bengio, and Courville; free online version

Page 3:

A history of neural networks
• 1940s-60s:
  – McCulloch & Pitts; Hebb: modeling real neurons
  – Rosenblatt, Widrow-Hoff: perceptrons
  – 1969: Minsky & Papert's Perceptrons book showed formal limitations of one-layer linear networks
• 1970s - mid-1980s: …
• mid-1980s - mid-1990s:
  – backprop and multi-layer networks
  – Rumelhart and McClelland's PDP book set
  – Sejnowski's NETtalk, BP-based text-to-speech
  – Neural Info Processing Systems (NIPS) conference starts
• Mid-1990s - early 2000s: …
• Mid-2000s to current:
  – More and more interest and experimental success

Page 4:

Page 5:

Page 6:

Page 7:

Page 8:

Multilayer networks
• Simplest case: classifier is a multilayer network of logistic units
• Each unit takes some inputs and produces one output using a logistic classifier
• Output of one unit can be the input of another

[Figure: input layer (x1, x2, constant 1), a hidden layer with units v1 = σ(wᵀx) and v2 = σ(wᵀx) reached by weights w0,1, w1,1, w2,1 and w0,2, w1,2, w2,2, and an output layer with unit z1 = σ(wᵀv) reached by weights w1, w2]
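To make the wiring concrete, here is a minimal numpy sketch of this network's forward pass (a sketch only; the weight values and variable names are invented for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invented weights: column i of W_hidden holds (w0,i, w1,i, w2,i),
# the bias weight plus the two input weights for hidden unit i.
W_hidden = np.array([[ 0.1, -0.4],   # w0,1  w0,2
                     [ 0.8,  0.3],   # w1,1  w1,2
                     [-0.5,  0.9]])  # w2,1  w2,2
w_out = np.array([0.7, -1.1])        # w1, w2 for the output unit

x = np.array([1.0, 0.0])                    # one example (x1, x2)
v = sigmoid(np.append(1.0, x) @ W_hidden)   # hidden outputs v1, v2
z1 = sigmoid(v @ w_out)                     # network output z1 = σ(wᵀv)
print(v, z1)
```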

Page 9:

Learning a multilayer network
• Define a loss (simplest case: squared error)
  – But over a network of “units”
• Minimize loss with gradient descent
  – You can do this over complex networks if you can take the gradient of each unit: every computation is differentiable

$$J_{X,y}(w) = \sum_i \left(y_i - \hat{y}_i\right)^2$$

[Figure: the same two-layer network of logistic units as on page 8]

Page 10:

ANNs in the 90s
• Mostly 2-layer networks, or else carefully constructed “deep” networks (e.g., CNNs)
• Worked well, but training was slow and finicky

Nov 1998 - Yann LeCun, Bottou, Bengio, Haffner

Page 11:

ANNs in the 90s
• Mostly 2-layer networks, or else carefully constructed “deep” networks
• Worked well, but training typically took weeks when guided by an expert

SVM: 98.9-99.2% accurate
CNNs: 98.3-99.3% accurate

Page 12:

Learning a multilayer network
• Define a loss (simplest case: squared error)
  – But over a network of “units”
• Minimize loss with gradient descent
  – You can do this over complex networks if you can take the gradient of each unit: every computation is differentiable

$$J_{X,y}(w) = \sum_i \left(y_i - \hat{y}_i\right)^2$$

Page 13:

Example: weight updates for a multilayer ANN with square loss and logistic units

For nodes k in the output layer:

$$\delta_k \equiv (t_k - a_k)\, a_k (1 - a_k)$$

For nodes j in the hidden layer:

$$\delta_j \equiv \Big(\sum_k \delta_k w_{kj}\Big)\, a_j (1 - a_j)$$

For all weights (δ already carries the negative gradient, so gradient descent steps along +δ·a, as sketched below):

$$w_{kj} \leftarrow w_{kj} + \varepsilon\, \delta_k a_j \qquad w_{ji} \leftarrow w_{ji} + \varepsilon\, \delta_j a_i$$

“Propagate errors backward”: BACKPROP

Can carry this recursion out further if you have multiple hidden layers.
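A minimal numpy sketch of one such update on a single example, assuming one hidden layer, logistic units, and square loss (function and variable names are invented for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_hid, W_out, eps=0.1):
    """One SGD step for a 1-hidden-layer logistic network with square loss."""
    a_hid = sigmoid(x @ W_hid)        # hidden activations a_j
    a_out = sigmoid(a_hid @ W_out)    # output activations a_k

    # delta_k = (t_k - a_k) a_k (1 - a_k) for output nodes k
    delta_out = (t - a_out) * a_out * (1 - a_out)
    # delta_j = (sum_k delta_k w_kj) a_j (1 - a_j) for hidden nodes j
    delta_hid = (W_out @ delta_out) * a_hid * (1 - a_hid)

    # delta already contains the negative gradient: descend along +delta*a
    W_out += eps * np.outer(a_hid, delta_out)   # w_kj += eps * delta_k * a_j
    W_hid += eps * np.outer(x, delta_hid)       # w_ji += eps * delta_j * a_i
    return W_hid, W_out
```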

Page 14:

BACKPROP FOR MLPS

Page 15:

BackProp in Matrix-Vector Notation

Michael Nielsen: http://neuralnetworksanddeeplearning.com/

Page 16:

Notation

Page 17:

Notation

Each digit is 28x28 pixels = 784 inputs

Page 18:

Notation

w^l is the weight matrix for layer l

Page 19:

Notation

• w^l: weight matrix
• a^l and b^l: activation and bias vectors

Page 20:

Notation

• Matrix w^l: weight
• Vector b^l: bias
• Vector a^l: activation
• Vector z^l: pre-sigmoid activation
• σ: vector→vector function, componentwise logistic

Page 21:

Computation is “feedforward”

for l = 1, 2, … L:
  a^l = σ(w^l a^(l-1) + b^l)
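A compact numpy sketch of this loop (a sketch; it assumes the weights and biases are kept in Python lists `Ws` and `bs`, one entry per layer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, Ws, bs):
    """a^l = sigmoid(w^l a^(l-1) + b^l) for l = 1, 2, ..., L."""
    a = x
    for W, b in zip(Ws, bs):   # layer l = 1, 2, ..., L
        z = W @ a + b          # pre-sigmoid activation z^l
        a = sigmoid(z)         # activation a^l
    return a                   # the network output a^L
```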

Page 22:

Notation

• Matrix: w^l
• Vectors: a^l, b^l, z^l
• Vector: y (target output)

Cost function to optimize, a sum over examples x:

$$C = \frac{1}{2n} \sum_x \left\lVert y(x) - a^L(x) \right\rVert^2$$

where a^L(x) is the network's output for input x.

Page 23:

Notation

Page 24:

BackProp: last layer

Level l for l = 1, …, L
• Matrix: w^l
• Vectors: bias b^l, activation a^l, pre-sigmoid activation z^l, target output y, “local error” δ^l

Matrix form:

$$\delta^L = \nabla_a C \odot \sigma'(z^L)$$

where ⊙ is the componentwise product of vectors, and the components are δ^L_j = (∂C/∂a^L_j) · σ'(z^L_j).

Page 25:

BackProp: last layer

Level l for l = 1, …, L
• Matrix: w^l
• Vectors: bias b^l, activation a^l, pre-sigmoid activation z^l, target output y, “local error” δ^l

Matrix form for square loss:

$$\delta^L = (a^L - y) \odot \sigma'(z^L)$$

Page 26:

BackProp: error at level l in terms of error at level l+1

Level l for l = 1, …, L
• Matrix: w^l
• Vectors: bias b^l, activation a^l, pre-sigmoid activation z^l, target output y, “local error” δ^l

$$\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$$

which we can use to compute

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j \qquad \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\, \delta^l_j$$

Page 27:

BackProp: summary

Level l for l = 1, …, L
• Matrix: w^l
• Vectors: bias b^l, activation a^l, pre-sigmoid activation z^l, target output y, “local error” δ^l

$$\delta^L = (a^L - y) \odot \sigma'(z^L) \qquad \delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$$

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j \qquad \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\, \delta^l_j$$

Page 28:

Computation propagates errors backward

for l = L, L-1, … 1: compute δ^l (from δ^(l+1), as on the previous pages)

for l = 1, 2, … L: update w^l and b^l using ∂C/∂w^l and ∂C/∂b^l

A full forward/backward pass is sketched below.
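Putting pages 21-28 together, here is a minimal numpy sketch of one full forward/backward pass in this matrix-vector notation, assuming square loss and logistic units (names invented; gradients are for a single example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, Ws, bs):
    """Gradients of C = 0.5 * ||y - a^L||^2 for one example (x, y)."""
    # Feedforward: z^l = w^l a^(l-1) + b^l, a^l = sigmoid(z^l).
    a, As, Zs = x, [x], []
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigmoid(z)
        Zs.append(z)
        As.append(a)

    # delta^L = (a^L - y) ⊙ sigma'(z^L)
    delta = (As[-1] - y) * sigmoid_prime(Zs[-1])
    grads_W, grads_b = [], []
    for l in range(len(Ws) - 1, -1, -1):           # l = L, L-1, ..., 1
        grads_W.insert(0, np.outer(delta, As[l]))  # dC/dw^l = delta^l (a^(l-1))^T
        grads_b.insert(0, delta)                   # dC/db^l = delta^l
        if l > 0:  # delta^l = ((w^(l+1))^T delta^(l+1)) ⊙ sigma'(z^l)
            delta = (Ws[l].T @ delta) * sigmoid_prime(Zs[l - 1])
    return grads_W, grads_b
```

An SGD step then subtracts a small multiple of each returned gradient from the corresponding w^l and b^l.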

Page 29:

EXPRESSIVENESS OF DEEP NETWORKS

Page 30:

Deep ANNs are expressive
• One logistic unit can implement an AND or an OR of a subset of inputs (sketched below)
  – e.g., (x3 AND x5 AND … AND x19)
• Every boolean function can be expressed as an OR of ANDs
  – e.g., (x3 AND x5) OR (x7 AND x19) OR …
• So one hidden layer can express any BF

(But it might need lots and lots of hidden units)
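For instance, a single logistic unit computes an AND by giving each relevant input a large positive weight and setting the bias so the unit fires only when all of them are on. A sketch (the weight value 10 is invented for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_and(x, idxs, w=10.0):
    """Output ~1 only when all inputs in idxs are 1."""
    bias = -w * (len(idxs) - 0.5)           # threshold just below "all on"
    return sigmoid(w * sum(x[i] for i in idxs) + bias)

x = [0, 0, 0, 1, 0, 1]                      # x3 = x5 = 1 (0-indexed)
print(logistic_and(x, [3, 5]))              # ~1.0: x3 AND x5 holds
print(logistic_and(x, [3, 4, 5]))           # ~0.0: x4 is off
```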

Page 31:

Deep ANNs are expressive
• One logistic unit can implement an AND or an OR of a subset of inputs
  – e.g., (x3 AND x5 AND … AND x19)
• Every boolean function can be expressed as an OR of ANDs
  – e.g., (x3 AND x5) OR (x7 AND x19) OR …
• So one hidden layer can express any BF
• Example: parity(x1,…,xN) = 1 iff an odd number of the xi's are set to one

Parity(a,b,c,d) = (a & -b & -c & -d) OR (-a & b & -c & -d) OR …   # list all the “1s”
                  (a & b & c & -d) OR (a & b & -c & d) OR …       # list all the “3s”

Size in general is O(2^N)

Page 32:

Deeper ANNs are more expressive
• A two-layer network needs O(2^N) units
• A two-layer network can express binary XOR
• A 2·log(N) layer network can express the parity of N inputs (even/odd number of 1s)
  – With units arranged in a binary tree of depth O(log N) (see the sketch below)
• Deep network + parameter tying ~= subroutines

[Figure: a binary tree of units over inputs x1 … x8]
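A sketch of the binary-tree idea in plain Python (names invented): XOR-ing pairs level by level computes parity in O(log N) levels, mirroring the 2·log(N)-layer network:

```python
def parity(xs):
    """Parity of xs via a binary tree of XORs, depth O(log N)."""
    level = list(xs)
    while len(level) > 1:                 # each pass is one tree level
        if len(level) % 2:                # odd count: pad with a zero
            level.append(0)
        level = [a ^ b for a, b in zip(level[0::2], level[1::2])]
    return level[0]

print(parity([1, 0, 1, 1]))              # 1 (three ones: odd)
print(parity([1, 0, 1, 0, 1, 0, 1, 1]))  # 1 (five ones: odd)
```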

Page 33:

Hypothetical code for face recognition

http://neuralnetworksanddeeplearning.com/chap1.html

….

….

Page 34:

PARALLEL TRAINING FOR ANNS

Page 35:

How are ANNs trained?
• Typically, with some variant of streaming SGD
  – Keep the data on disk, in a preprocessed form
  – Loop over it multiple times
  – Keep the model in memory
• Solution to big data: but long training times!
• However, some parallelism is often used….

Page 36:

Recap: logistic regression with SGD

$$P(Y=1 \mid X=x) = p = \frac{1}{1 + e^{-x\cdot w}}$$

[Figure annotations: the inner part computes the inner product <x,w>; the outer part takes the logistic of <x,w> — see the sketch below]
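The two annotated parts, as a two-line numpy sketch:

```python
import numpy as np

def predict_prob(x, w):
    z = np.dot(x, w)                  # this part: inner product <x, w>
    return 1.0 / (1.0 + np.exp(-z))   # this part: logistic of <x, w>
```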

Page 37:

Recap: logistic regression with SGD

$$P(Y=1 \mid X=x) = p = \frac{1}{1 + e^{-x\cdot w}}$$

On one example this computes the inner product <x,w>. There's some chance to compute this in parallel… can we do more?

[Figure: a single logistic unit with pre-sigmoid value a and output z]

Page 38:

In ANNs we have many, many logistic regression nodes

Page 39:

Recap: logistic regression with SGD

Let x be an example
Let wi be the input weights for the i-th hidden unit
Then the output is ai = x · wi

[Figure: hidden units with pre-sigmoid values ai and outputs zi]

Page 40:

Recap: logistic regression with SGD

Let x be an example
Let wi be the input weights for the i-th hidden unit
Then a = x W is the output for all m units

[Figure: W = [w1 w2 w3 … wm], a matrix whose i-th column holds the weights of hidden unit i, with entries like 0.1, -0.3, -1.7, …]

Page 41:

Recap: logistic regression with SGD

Let X be a matrix with k examples (one row per example)
Let wi be the input weights for the i-th hidden unit
Then A = X W is the output for all m units for all k examples:

X W =   x1·w1   x1·w2   …   x1·wm
        …
        xk·w1   …       …   xk·wm

There are a lot of chances to do this in parallel (see the sketch below)
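A minimal numpy sketch of the batched computation (shapes invented); the single matrix multiply X @ W is exactly what optimized libraries and GPUs parallelize:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, m = 128, 784, 30           # examples, input features, hidden units
X = rng.random((k, d))           # k examples, one per row
W = rng.standard_normal((d, m))  # column i holds the weights w_i of unit i

A = X @ W                        # all pre-sigmoid activations, a k x m matrix
Z = 1.0 / (1.0 + np.exp(-A))     # all unit outputs, still one matrix op
```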

Page 42:

ANNs and multicore CPUs
• Modern libraries (Matlab, numpy, …) do matrix operations fast, in parallel
• Many ANN implementations exploit this parallelism automatically
• Key implementation issue is working with matrices comfortably

Page 43:

ANNs and GPUs
• GPUs do matrix operations very fast, in parallel
  – For dense matrices, not sparse ones!
• Training ANNs on GPUs is common
  – SGD and minibatch sizes of 128
• Modern ANN implementations can exploit this
• GPUs are not super-expensive
  – $500 for a high-end one
  – large models with O(10^7) parameters can fit in a large-memory GPU (12GB)
• Speedups of 20x-50x have been reported

Page 44:

ANNs and multi-GPU systems
• There are ways to set up ANN computations so that they are spread across multiple GPUs
  – Sometimes involves some sort of IPM
  – Sometimes involves partitioning the model across multiple GPUs
  – Often needed for very large networks
  – Not especially easy to implement and do with most current tools

Page 45:

WHY ARE DEEP NETWORKS HARD TO TRAIN?

Page 46:

Recap: weight updates for a multilayer ANN

For nodes k in the output layer L:

$$\delta^L_k \equiv (t_k - a_k)\, a_k (1 - a_k)$$

For nodes j in hidden layer h:

$$\delta^h_j \equiv \Big(\sum_k \delta^{h+1}_k w_{kj}\Big)\, a_j (1 - a_j)$$

What happens as the layers get further and further from the output layer? E.g., what's the gradient for the bias term with several layers after it?

Page 47:

Gradients are unstable

What happens as the layers get further and further from the output layer? E.g., what's the gradient for the bias term with several layers after it in a trivial net?

[Figure: plot of σ'(z), with its maximum at 1/4]

If weights are usually < 1 then we are multiplying by many numbers < 1, so the gradients get very small.

The vanishing gradient problem
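A numeric sketch of the effect (assuming a chain of single-unit layers, each contributing a factor σ'(z)·w to the bias gradient; the values are invented): even with weights near 1, the σ' ≤ 1/4 factors shrink the gradient geometrically with depth:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

w, z = 0.9, 0.0                              # sigmoid_prime(0) = 1/4, the max
for depth in [1, 2, 5, 10]:
    grad = (sigmoid_prime(z) * w) ** depth   # one factor per layer behind the bias
    print(depth, grad)                       # shrinks like (w/4)^depth
# depth 10 -> ~3e-7: the vanishing gradient problem
```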

Page 48:

Gradients are unstable

What happens as the layers get further and further from the output layer? E.g., what's the gradient for the bias term with several layers after it in a trivial net?

[Figure: plot of σ'(z), with its maximum at 1/4]

If weights are usually > 1 then we are multiplying by many numbers > 1, so the gradients get very big.

The exploding gradient problem (less common but possible)

Page 49:

AISTATS 2010

Histogram of gradients in a 5-layer network for an artificial image recognition task

[Figure: per-layer gradient histograms, from the input layer to the output layer]

Page 50:

AISTATS 2010

We will get to these tricks eventually….

Page 51:

It's easy for sigmoid units to saturate

The components of the gradient all carry a σ'(z) factor, so when a unit saturates the learning rate approaches zero and the unit is “stuck”.

[Figure: the sigmoid curve, flat at both tails]

Page 52:

It's easy for sigmoid units to saturate

For a big network there are lots of weighted inputs to each neuron. If any of them are too large then the neuron will saturate. So neurons get stuck with a few large inputs OR many small ones.

Page 53:

It's easy for sigmoid units to saturate
• If there are 500 non-zero inputs initialized with a Gaussian ~N(0,1) then the SD is √500 ≈ 22.4 (see the check below)
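A quick numeric check of that claim (a sketch): summing 500 N(0,1) terms gives a pre-activation whose standard deviation is √500, far out on the sigmoid's flat tails:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((100_000, 500)).sum(axis=1)  # 500 N(0,1) inputs each
print(z.std())   # ~22.4 = sqrt(500)
```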

Page 54:

It's easy for sigmoid units to saturate

• Saturation visualization from Glorot & Bengio 2010 -- using a smarter initialization scheme

[Figure: the closest-to-output hidden layer is still stuck for the first 100 epochs]

Page 55:

WHAT'S DIFFERENT ABOUT MODERN ANNS?

Page 56:

Some key differences
• Use of softmax and entropic loss instead of quadratic loss.
• Use of alternate non-linearities
  – reLU and hyperbolic tangent
• Better understanding of weight initialization
• Data augmentation
  – Especially for image data

Page 57:

Cross-entropy loss

$$\frac{\partial C}{\partial w} = \sigma(z) - y$$

Compare to the gradient for square loss when a ≈ 1, y = 0 and x = 1: the square-loss gradient carries an extra σ'(z) factor, which is nearly zero for a saturated unit.

[Figure: a single logistic unit]
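A sketch comparing the two gradients at a badly saturated unit (a ≈ 1, y = 0, x = 1); the square-loss gradient is crushed by the σ'(z) factor while the cross-entropy gradient stays large:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, y = 5.0, 0.0                       # saturated unit: a = sigmoid(5) ~ 0.993
a = sigmoid(z)
grad_square = (a - y) * a * (1 - a)   # square loss: extra sigma'(z) factor
grad_xent = a - y                     # cross-entropy: no sigma'(z) factor
print(grad_square)                    # ~0.0066  (learning crawls)
print(grad_xent)                      # ~0.993   (learning proceeds)
```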

Page 58:

Cross-entropy loss

Page 59:

Cross-entropy loss

Page 60:

Softmax output layer

[Figure: a network whose last layer is a softmax]

The network outputs a probability distribution! Cross-entropy loss after a softmax layer gives a very simple, numerically stable gradient:

$$\Delta w_{ij} = (y_i - z_i)\, y_j$$
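A numpy sketch of the fact being used (a sketch under assumed notation: z for the softmax outputs, one-hot target y, and h standing in for the input activations to the output layer, a name invented here): the cross-entropy gradient reduces to a simple outer product, with no σ' factor to vanish:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())          # subtract max for numerical stability
    return e / e.sum()

h = np.array([0.2, 0.9, 0.4])        # input activations to the output layer
W = np.zeros((3, 3))                 # weights, shape (outputs, inputs)
y = np.array([0.0, 1.0, 0.0])        # one-hot target
z = softmax(W @ h)                   # predicted distribution

grad_W = np.outer(z - y, h)          # d(cross-entropy)/dW = (z - y) h^T
W -= 0.5 * grad_W                    # one SGD step
```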

Page 61:

Some key differences
• Use of softmax and entropic loss instead of quadratic loss.
  – Often learning is faster and more stable, as well as getting better accuracies in the limit
• Use of alternate non-linearities
• Better understanding of weight initialization
• Data augmentation
  – Especially for image data

Page 62:

Some key differences
• Use of softmax and entropic loss instead of quadratic loss.
  – Often learning is faster and more stable, as well as getting better accuracies in the limit
• Use of alternate non-linearities
  – reLU and hyperbolic tangent
• Better understanding of weight initialization
• Data augmentation
  – Especially for image data

Page 63:

Alternative non-linearities
• Changes so far
  – Changed the loss from square error to cross-entropy
  – Proposed adding another output layer (softmax)
• A new change: modifying the nonlinearity
  – The logistic is not widely used in modern ANNs

Page 64:

Alternative non-linearities
• A new change: modifying the nonlinearity
  – The logistic is not widely used in modern ANNs

Alternate 1: tanh

Like the logistic function but shifted to range [-1, +1]
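The shift can be written exactly: tanh is a rescaled, recentered logistic,

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\sigma(2z) - 1$$

so it keeps the same S-shape but its outputs are centered at zero.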

Page 65:

AISTATS 2010

We will get to these tricks eventually….

[Figure: results for networks of depth 5]

Page 66:

Alternative non-linearities
• A new change: modifying the nonlinearity
  – reLU often used in vision tasks

Alternate 2: rectified linear unit

Linear with a cutoff at zero

(Implement: clip the gradient when you pass zero)

Page 67:

Alternative non-linearities
• A new change: modifying the nonlinearity
  – reLU often used in vision tasks

Alternate 2: rectified linear unit

Soft version: log(exp(x) + 1)

Doesn't saturate (at one end); sparsifies outputs; helps with the vanishing gradient (see the sketch below)
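A small sketch of both variants and their gradients (the zero-clip is the subgradient choice mentioned above):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # linear with a cutoff at zero

def relu_grad(x):
    return (x > 0).astype(float)     # clip the gradient below zero

def softplus(x):
    return np.log1p(np.exp(x))       # soft version: log(exp(x) + 1)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.] -- no sigma'-style shrinking for x > 0
print(softplus(x))   # smooth approximation; its gradient is sigmoid(x)
```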

Page 68:

Some key differences
• Use of softmax and entropic loss instead of quadratic loss.
  – Often learning is faster and more stable, as well as getting better accuracies in the limit
• Use of alternate non-linearities
  – reLU and hyperbolic tangent
• Better understanding of weight initialization
• Data augmentation
  – Especially for image data

Page 69:

It's easy for sigmoid units to saturate

For a big network there are lots of weighted inputs to each neuron. If any of them are too large then the neuron will saturate. So neurons get stuck with a few large inputs OR many small ones.

Page 70:

It's easy for sigmoid units to saturate
• If there are 500 non-zero inputs initialized with a Gaussian ~N(0,1) then the SD is √500 ≈ 22.4
• Common heuristics for initializing weights (sketched below):

$$w \sim U\!\left[\frac{-1}{\sqrt{\#\text{inputs}}},\ \frac{1}{\sqrt{\#\text{inputs}}}\right] \qquad w \sim N\!\left(0,\ \frac{1}{\#\text{inputs}}\right)$$
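A numpy sketch of both heuristics (a sketch, assuming the Gaussian's second parameter is the variance 1/#inputs, i.e. a standard deviation of 1/√#inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_units = 500, 30

# Uniform heuristic: U[-1/sqrt(n), +1/sqrt(n)]
bound = 1.0 / np.sqrt(n_inputs)
W_uniform = rng.uniform(-bound, bound, size=(n_inputs, n_units))

# Gaussian heuristic: N(0, 1/n), i.e. standard deviation 1/sqrt(n)
W_gauss = rng.normal(0.0, 1.0 / np.sqrt(n_inputs), size=(n_inputs, n_units))

# Pre-activation SD is now ~1 instead of ~22.4, so units start unsaturated.
x = rng.standard_normal(n_inputs)
print((x @ W_gauss).std())
```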

Page 71:

It's easy for sigmoid units to saturate

• Saturation visualization from Glorot & Bengio 2010 using

$$w \sim U\!\left[\frac{-1}{\sqrt{\#\text{inputs}}},\ \frac{1}{\sqrt{\#\text{inputs}}}\right]$$

Page 72:

Initializing to avoid saturation
• Glorot and Bengio suggest initializing the weights of level j (with n_j inputs) from

$$w \sim U\!\left[\frac{-\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\ \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]$$

This is not always the solution - but good initialization is very important for deep nets!

The first breakthrough deep learning results were based on clever pre-training initialization schemes, where deep networks were seeded with weights learned from unsupervised strategies.
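A sketch of that normalized initialization for a single weight matrix (the layer sizes are invented):

```python
import numpy as np

def glorot_uniform(n_in, n_out, seed=0):
    """Glorot & Bengio's normalized initialization for one weight matrix."""
    rng = np.random.default_rng(seed)
    bound = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

W1 = glorot_uniform(784, 100)   # e.g. 784 MNIST inputs -> 100 hidden units
```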
