Lecture 9: Neural Networks


Transcript of Lecture 9: Neural Networks

Page 1: Lecture 9: Neural Networks

Lecture 9: Neural Networks

Shuai Li

John Hopcroft Center, Shanghai Jiao Tong University

https://shuaili8.github.io

https://shuaili8.github.io/Teaching/CS410/index.html

1

Page 2: Lecture 9: Neural Networks

Brief history of artificial neural nets

β€’ The first wave
  β€’ 1943: McCulloch and Pitts proposed the McCulloch-Pitts neuron model
  β€’ 1958: Rosenblatt introduced the simple single-layer networks now called perceptrons
  β€’ 1969: Minsky and Papert's book Perceptrons demonstrated the limitations of single-layer perceptrons, and almost the whole field went into hibernation

β€’ The second wave
  β€’ 1986: The back-propagation learning algorithm for multi-layer perceptrons was rediscovered and the whole field took off again

β€’ The third wave
  β€’ 2006: Deep learning (deep neural networks) gained popularity
  β€’ 2012: Deep learning made significant breakthroughs in many applications

2

Page 3: Lecture 9: Neural Networks

Biological neuron structure

β€’ A neuron receives signals through its dendrites and sends its own signal out along the axon to the axon terminals

3

Page 4: Lecture 9: Neural Networks

Biological neural communication

β€’ The electrical potential across the cell membrane exhibits spikes called action potentials

β€’ Spike originates in cell body, travels down axon, and causes synaptic terminals to release neurotransmitters

β€’ Chemical diffuses across synapse to dendrites of other neurons

β€’ Neurotransmitters can be excitatory or inhibitory

β€’ If the net input of neurotransmitters to a neuron from other neurons is excitatory and exceeds some threshold, the neuron fires an action potential

4

Page 5: Lecture 9: Neural Networks

Perceptron

β€’ Inspired by the biological neurons of humans and animals, researchers built a simple model called the perceptron

β€’ It receives input signals $x_i$, multiplies them by weights $w_i$, and passes the weighted sum of the signals through an activation function (here a step function) to produce the output
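A minimal sketch of this computation in Python (the specific input values, weights, and bias are made-up illustrations, not from the slide):

```python
import numpy as np

def perceptron_output(x, w, b=0.0):
    """Weighted sum of the input signals followed by a step activation."""
    z = np.dot(w, x) + b           # weighted sum w_1*x_1 + ... + w_n*x_n + b
    return 1.0 if z > 0 else 0.0   # step activation function

x = np.array([0.5, -1.0])          # input signals x_i
w = np.array([0.8, 0.3])           # weights w_i
print(perceptron_output(x, w))     # 1.0, since 0.8*0.5 + 0.3*(-1.0) = 0.1 > 0
```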

5

Page 6: Lecture 9: Neural Networks

Neuron vs. Perceptron

6

Page 7: Lecture 9: Neural Networks

Artificial neural networks

β€’ Multilayer perceptron network

β€’ Convolutional neural network

β€’ Recurrent neural network

7

Page 8: Lecture 9: Neural Networks

Perceptron

8

Page 9: Lecture 9: Neural Networks

Training

β€’ $w_i \leftarrow w_i - \eta\,(o - y)\,x_i$
  β€’ $y$: the real label
  β€’ $o$: the output of the perceptron
  β€’ $\eta$: the learning rate

β€’ Explanation
  β€’ If the output is correct, do nothing
  β€’ If the output is too high, lower the weights
  β€’ If the output is too low, increase the weights
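A small sketch of this training rule on a toy linearly separable problem (the data, learning rate, and number of epochs are illustrative assumptions):

```python
import numpy as np

def step(z):
    return 1.0 if z > 0 else 0.0

def train_perceptron(X, y, eta=0.1, epochs=20):
    """Repeatedly apply w_i <- w_i - eta * (o - y) * x_i for each example."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            o = step(np.dot(w, x_i))     # current perceptron output
            w -= eta * (o - y_i) * x_i   # no change when the output is correct
    return w

# OR-like data; the constant last column plays the role of a bias input
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
w = train_perceptron(X, y)
print([step(np.dot(w, x)) for x in X])   # [0.0, 1.0, 1.0, 1.0]
```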

9

Page 10: Lecture 9: Neural Networks

Properties

β€’ Rosenblatt [1958] proved that the training converges if the two classes are linearly separable and $\eta$ is reasonably small

10

Page 11: Lecture 9: Neural Networks

Limitation

β€’ Minsky and Papert [1969] showed that some rather elementary computations, such as the XOR problem, could not be done by Rosenblatt's one-layer perceptron

β€’ However, Rosenblatt believed the limitations could be overcome if more layers of units were added, but no learning algorithm to obtain the weights was known yet

11

Page 12: Lecture 9: Neural Networks

Solution: Add hidden layers

β€’ Two-layer feedforward neural network

12

Page 13: Lecture 9: Neural Networks

Demo

β€’ Larger neural networks can represent more complicated functions. The data are shown as circles colored by their class, and the decision regions learned by a trained neural network are shown underneath. You can play with these examples in the ConvNetJS demo.

13

Page 14: Lecture 9: Neural Networks

Activation Functions

14

Page 15: Lecture 9: Neural Networks

Activation functions

β€’ Sigmoid: $\sigma(z) = \frac{1}{1+e^{-z}}$ (most popular in fully connected neural networks)

β€’ Tanh: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$

β€’ ReLU (Rectified Linear Unit): $\mathrm{ReLU}(z) = \max(0, z)$ (most popular in deep learning)

15
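These three functions in Python (a minimal sketch; NumPy is assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)               # (e^z - e^-z) / (e^z + e^-z)

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # ~[0.119, 0.5, 0.881]
print(tanh(z))      # ~[-0.964, 0.0, 0.964]
print(relu(z))      # [0.0, 0.0, 2.0]
```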

Page 16: Lecture 9: Neural Networks

Activation function values and derivatives

16

Page 17: Lecture 9: Neural Networks

Sigmoid activation function

β€’ Sigmoid: $\sigma(z) = \frac{1}{1+e^{-z}}$
β€’ Its derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
β€’ Output range $(0, 1)$
β€’ Motivated by biological neurons; can be interpreted as the probability of an artificial neuron "firing" given its inputs
β€’ However, saturated neurons make the signal (and its gradient) vanish (why?)
  β€’ Consider iterating $f(f(f(\cdots)))$ with $f = \sigma$
  β€’ $f([0, 1]) \subseteq [0.5, 0.732]$
  β€’ $f([0.5, 0.732]) \subseteq [0.622, 0.676]$

17

Page 18: Lecture 9: Neural Networks

Tanh activation function

β€’ Tanh function: $\tanh(z) = \frac{\sinh(z)}{\cosh(z)} = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
β€’ Its derivative: $\tanh'(z) = 1 - \tanh^2(z)$
β€’ Output range $(-1, 1)$
β€’ Thus strongly negative inputs to the tanh will map to negative outputs
β€’ Only zero-valued inputs are mapped to near-zero outputs
β€’ These properties make the network less likely to get "stuck" during training

18

Page 19: Lecture 9: Neural Networks

ReLU activation function

β€’ ReLU (Rectified Linear Unit) function: $\mathrm{ReLU}(z) = \max(0, z)$
β€’ Its derivative: $\mathrm{ReLU}'(z) = 1$ if $z > 0$, and $0$ if $z \le 0$
β€’ ReLU can be approximated by the softplus function
β€’ ReLU's gradient doesn't vanish as $z$ increases
β€’ It speeds up training of neural networks
  β€’ The gradient computation is very simple
  β€’ The computational step is simple: no exponentials, no multiplication or division operations (compared to the others)
  β€’ The gradient on the positive portion is larger than for the sigmoid or tanh functions
  β€’ So the weights update more rapidly
β€’ The "dead neuron" behaviour on the left part can be ameliorated by Leaky ReLU (see the sketch below)

19
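A small sketch of ReLU, its derivative, and the Leaky ReLU variant (the leak coefficient 0.01 is a common but arbitrary choice, not taken from the slide):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)             # 1 if z > 0, else 0

def leaky_relu(z, alpha=0.01):
    # Keeps a small slope alpha for z <= 0, so "dead" neurons still receive gradient
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z), relu_grad(z))                 # [0. 0. 0. 2.]        [0. 0. 0. 1.]
print(leaky_relu(z), leaky_relu_grad(z))     # [-0.03 -0.005 0. 2.] [0.01 0.01 0.01 1.]
```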

Page 20: Lecture 9: Neural Networks

ReLU activation function (cont.)

β€’ ReLU function: $\mathrm{ReLU}(z) = \max(0, z)$

20

β€’ The only non-linearity comes from the path selection, with individual neurons being active or not
β€’ It allows sparse representations:
  β€’ for a given input, only a subset of neurons are active

Sparse propagation of activations and gradients

Page 21: Lecture 9: Neural Networks

Multilayer Perceptron Networks

21

Page 22: Lecture 9: Neural Networks

Neural networks

β€’ Neural networks are built by connecting many perceptrons together, layer by layer

22

Page 23: Lecture 9: Neural Networks

Universal approximation theorem

β€’ A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions (on compact domains) to arbitrary accuracy

23

Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer feedforward networks are universal approximators." Neural Networks 2.5 (1989): 359-366

1-20-1 NN approximates a noisy sine function
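A rough sketch of the "1-20-1 network fits a noisy sine" experiment, assuming scikit-learn is available (the noise level, solver, and random seeds are illustrative choices):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)   # noisy sine samples

# 1-20-1 architecture: 1 input, one hidden layer with 20 units, 1 output
net = MLPRegressor(hidden_layer_sizes=(20,), activation='tanh',
                   solver='lbfgs', max_iter=5000, random_state=0)
net.fit(X, y)

print("training MSE:", np.mean((net.predict(X) - y) ** 2))  # roughly the noise variance (~0.01)
```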

Page 24: Lecture 9: Neural Networks

Increasing power of approximation

β€’ With more neurons, its approximation power increases and the decision boundary captures finer details

β€’ In applications, we usually use more layers with structure to approximate complex functions, instead of one hidden layer with many neurons

24

Page 25: Lecture 9: Neural Networks

Common neural network structures at a glance

25

Page 26: Lecture 9: Neural Networks

Multilayer perceptron network

26

Page 27: Lecture 9: Neural Networks

Single / Multiple layers of calculation

β€’ Single-layer function: $f_\theta(x) = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$
β€’ Multiple-layer function:
  β€’ $h_1(x) = \sigma(\theta_0^1 + \theta_1^1 x_1 + \theta_2^1 x_2)$
  β€’ $h_2(x) = \sigma(\theta_0^2 + \theta_1^2 x_1 + \theta_2^2 x_2)$
  β€’ $f_\theta(h) = \sigma(\theta_0 + \theta_1 h_1 + \theta_2 h_2)$

27

(Figures: a single-layer network with inputs $x_1, x_2$ and output $f_\theta$, and a two-layer network with inputs $x_1, x_2$, hidden units $h_1, h_2$, and output $f_\theta$)
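A direct transcription of these two functions in Python (the parameter values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def single_layer(x, theta):
    # f_theta(x) = sigma(theta_0 + theta_1 * x_1 + theta_2 * x_2)
    return sigmoid(theta[0] + theta[1] * x[0] + theta[2] * x[1])

def two_layer(x, theta_h1, theta_h2, theta_out):
    h1 = sigmoid(theta_h1[0] + theta_h1[1] * x[0] + theta_h1[2] * x[1])
    h2 = sigmoid(theta_h2[0] + theta_h2[1] * x[0] + theta_h2[2] * x[1])
    return sigmoid(theta_out[0] + theta_out[1] * h1 + theta_out[2] * h2)

x = np.array([1.0, 2.0])                       # illustrative input
print(single_layer(x, [0.1, 0.2, -0.3]))
print(two_layer(x, [0.1, 0.2, -0.3], [-0.2, 0.4, 0.1], [0.0, 1.0, -1.0]))
```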

Page 28: Lecture 9: Neural Networks

Training: Backpropagation

28

Page 29: Lecture 9: Neural Networks

How to train?

β€’ As with previous models, we use the gradient descent method to train the neural network

β€’ Given the topology of the network (number of layers, number of neurons, and their connections), find a set of weights that minimizes the error function

β€’ Error function: $E = \frac{1}{2}\sum_{d \in D}\sum_k (t_{kd} - o_{kd})^2$, where $o$ is the network output, $t$ is the target, and $D$ is the set of training examples

29

Page 30: Lecture 9: Neural Networks

Gradient interpretation

β€’ The gradient is the vector (the red one in the figure) along which the value of the function increases most rapidly. Thus its opposite direction is the one along which the value decreases most rapidly

30

Page 31: Lecture 9: Neural Networks

Gradient descent

β€’ To find a (local) minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or an approximation) of the function at the current point

β€’ For a smooth function $f(x)$, $\frac{\partial f}{\partial x}$ is the direction along which $f$ increases most rapidly. So we apply

$x_{t+1} = x_t - \eta\, \frac{\partial f}{\partial x}(x_t)$

until $x$ converges

31

Page 32: Lecture 9: Neural Networks

Gradient descent

(Equations on the slide: the gradient of the error, the resulting training rule, and the weight update)

Partial derivatives are the key

32

Page 33: Lecture 9: Neural Networks

The chain rule

β€’ The challenge in neural network models is that we only know the target of the output layer, but not the targets for the hidden and input layers; how can we update their connection weights using gradient descent?

β€’ The answer is the chain rule that you have learned in calculus

𝑦 = 𝑓(𝑔(π‘₯))

⇒𝑑𝑦

𝑑π‘₯= 𝑓′(𝑔(π‘₯))𝑔′(π‘₯)

33

Page 34: Lecture 9: Neural Networks

Feed forward vs. Backpropagation

34

Page 35: Lecture 9: Neural Networks

Make a prediction

35

(Figure: the forward pass through a two-layer network; the targets $t_1, \dots, t_k$ are shown next to the output units)

Page 36: Lecture 9: Neural Networks

Make a prediction (cont.)

36


Page 37: Lecture 9: Neural Networks

Make a prediction (cont.)

37


Page 38: Lecture 9: Neural Networks

Backpropagation

β€’ Assume all the activation functions are sigmoid
β€’ Error function: $E = \frac{1}{2}\sum_k (y_k - t_k)^2$
β€’ $\frac{\partial E}{\partial y_k} = y_k - t_k$
β€’ $\frac{\partial y_k}{\partial w_{k,j}^{(2)}} = f^{(2)\prime}(\mathrm{net}_k^{(2)})\, h_j^{(1)} = y_k (1 - y_k)\, h_j^{(1)}$, where $h_j^{(1)}$ is the output of hidden unit $j$
β€’ $\Rightarrow \frac{\partial E}{\partial w_{k,j}^{(2)}} = -(t_k - y_k)\, y_k (1 - y_k)\, h_j^{(1)}$
β€’ $\Rightarrow w_{k,j}^{(2)} \leftarrow w_{k,j}^{(2)} + \eta\, \delta_k^{(2)} h_j^{(1)}$, where $\delta_k^{(2)} = (t_k - y_k)\, y_k (1 - y_k)$

38

(Figure: two-layer network with targets $t_1, \dots, t_k$; $\delta_k^{(2)}$ is attached to output unit $k$)

Page 39: Lecture 9: Neural Networks

Backpropagation (cont.)

β€’ Error function: $E = \frac{1}{2}\sum_k (y_k - t_k)^2$
β€’ $\frac{\partial E}{\partial y_k} = y_k - t_k$
β€’ $\delta_k^{(2)} = (t_k - y_k)\, y_k (1 - y_k)$
β€’ $\Rightarrow w_{k,j}^{(2)} \leftarrow w_{k,j}^{(2)} + \eta\, \delta_k^{(2)} h_j^{(1)}$
β€’ $\frac{\partial y_k}{\partial h_j^{(1)}} = y_k (1 - y_k)\, w_{k,j}^{(2)}$
β€’ $\frac{\partial h_j^{(1)}}{\partial w_{j,m}^{(1)}} = f^{(1)\prime}(\mathrm{net}_j^{(1)})\, x_m = h_j^{(1)} (1 - h_j^{(1)})\, x_m$
β€’ $\frac{\partial E}{\partial w_{j,m}^{(1)}} = -h_j^{(1)} (1 - h_j^{(1)}) \sum_k w_{k,j}^{(2)} (t_k - y_k)\, y_k (1 - y_k)\, x_m = -h_j^{(1)} (1 - h_j^{(1)}) \sum_k w_{k,j}^{(2)} \delta_k^{(2)}\, x_m$
β€’ $\Rightarrow w_{j,m}^{(1)} \leftarrow w_{j,m}^{(1)} + \eta\, \delta_j^{(1)} x_m$, where $\delta_j^{(1)} = h_j^{(1)} (1 - h_j^{(1)}) \sum_k w_{k,j}^{(2)} \delta_k^{(2)}$

39

(Figure: the hidden-layer delta $\delta_j^{(1)}$ is computed by propagating the output deltas back toward the inputs; targets $t_1, \dots, t_k$)

Page 40: Lecture 9: Neural Networks

Backpropagation algorithm

β€’ Activation function: sigmoid

Initialize all weights to small random numbers.
Do until convergence:
β€’ For each training example:
  1. Input it to the network and compute the network output
  2. For each output unit $k$ ($o_k$ is the output of unit $k$): $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
  3. For each hidden unit $j$ ($o_j$ is the output of unit $j$): $\delta_j \leftarrow o_j (1 - o_j) \sum_{k \in \text{next layer}} w_{k,j}\, \delta_k$
  4. Update each network weight, where $x_i$ is the output of unit $i$ (the input to unit $j$): $w_{j,i} \leftarrow w_{j,i} + \eta\, \delta_j x_i$

40

Recap of the derivation:
β€’ Error function $E = \frac{1}{2}\sum_k (y_k - t_k)^2$
β€’ $\delta_k^{(2)} = (t_k - y_k)\, y_k (1 - y_k)$, so $w_{k,j}^{(2)} \leftarrow w_{k,j}^{(2)} + \eta\, \delta_k^{(2)} h_j^{(1)}$
β€’ $\delta_j^{(1)} = h_j^{(1)} (1 - h_j^{(1)}) \sum_k w_{k,j}^{(2)} \delta_k^{(2)}$, so $w_{j,m}^{(1)} \leftarrow w_{j,m}^{(1)} + \eta\, \delta_j^{(1)} x_m$

Page 41: Lecture 9: Neural Networks

See the backpropagation demo

β€’ https://shuaili8.github.io/Teaching/VE445/L6%20backpropagation%20demo.html

41

Page 42: Lecture 9: Neural Networks

Formula example for backpropagation

42

Page 43: Lecture 9: Neural Networks

Formula example for backpropagation (cont.)

43

Page 44: Lecture 9: Neural Networks

Calculation example

β€’ Consider the simple network below:

β€’ Assume that the neurons have a sigmoid activation function and the inputs and weights shown in the figure
β€’ Perform a forward pass on the network and find the predicted output
β€’ Perform a reverse pass (training) once (target = 0.5) with $\eta = 1$
β€’ Perform a further forward pass and comment on the result

44

Page 45: Lecture 9: Neural Networks

Calculation example (cont.)

45

β€’ For each output unit $k$ ($o_k$ is the output of unit $k$): $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$

β€’ For each hidden unit $j$ ($o_j$ is the output of unit $j$): $\delta_j \leftarrow o_j (1 - o_j) \sum_{k \in \text{next layer}} w_{k,j}\, \delta_k$

β€’ Update each network weight, where $x_i$ is the input to unit $j$: $w_{j,i} \leftarrow w_{j,i} + \eta\, \delta_j x_i$

Page 46: Lecture 9: Neural Networks

Calculation example (cont.)

β€’ Answer (i)
  β€’ Input to top hidden neuron = $0.35 \times 0.1 + 0.9 \times 0.8 = 0.755$. Output = 0.68
  β€’ Input to bottom hidden neuron = $0.35 \times 0.4 + 0.9 \times 0.6 = 0.68$. Output = 0.6637
  β€’ Input to final neuron = $0.3 \times 0.68 + 0.9 \times 0.6637 = 0.80133$. Output = 0.69

β€’ (ii) It is OK to use either the new or the old weights when computing $\delta_j$ for the hidden units
  β€’ Output error $\delta = (t - o)\, o\, (1 - o) = (0.5 - 0.69) \times 0.69 \times (1 - 0.69) = -0.0406$
  β€’ Error for the top hidden neuron: $\delta_1 = 0.68 \times (1 - 0.68) \times 0.3 \times (-0.0406) = -0.00265$
  β€’ Error for the bottom hidden neuron: $\delta_2 = 0.6637 \times (1 - 0.6637) \times 0.9 \times (-0.0406) = -0.008156$
  β€’ New weights for the output layer
    β€’ $w_{o1} = 0.3 - 0.0406 \times 0.68 = 0.272392$
    β€’ $w_{o2} = 0.9 - 0.0406 \times 0.6637 = 0.87305$
  β€’ New weights for the hidden layer
    β€’ $w_{1A} = 0.1 - 0.00265 \times 0.35 = 0.0991$
    β€’ $w_{1B} = 0.8 - 0.00265 \times 0.9 = 0.7976$
    β€’ $w_{2A} = 0.4 - 0.008156 \times 0.35 = 0.3971$
    β€’ $w_{2B} = 0.6 - 0.008156 \times 0.9 = 0.5927$

β€’ (iii)
  β€’ Input to top hidden neuron = $0.35 \times 0.0991 + 0.9 \times 0.7976 = 0.7525$. Output = 0.6797
  β€’ Input to bottom hidden neuron = $0.35 \times 0.3971 + 0.9 \times 0.5927 = 0.6724$. Output = 0.662
  β€’ Input to final neuron = $0.272392 \times 0.6797 + 0.87305 \times 0.662 = 0.7631$. Output = 0.682
  β€’ The new error is $-0.182$, which is reduced compared with the old error $-0.19$
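A short numeric check of this worked example (the inputs 0.35 and 0.9 and the initial weights are taken from the answer above; rounding matches the slide only approximately):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x_A, x_B, target, eta = 0.35, 0.9, 0.5, 1.0
w1A, w1B, w2A, w2B = 0.1, 0.8, 0.4, 0.6      # input -> hidden weights
wo1, wo2 = 0.3, 0.9                          # hidden -> output weights

# (i) forward pass
h1 = sigmoid(w1A * x_A + w1B * x_B)          # ~0.68
h2 = sigmoid(w2A * x_A + w2B * x_B)          # ~0.6637
o = sigmoid(wo1 * h1 + wo2 * h2)             # ~0.69

# (ii) reverse pass, using the old weights for the hidden deltas
delta_o = (target - o) * o * (1 - o)         # ~-0.0406
d1 = h1 * (1 - h1) * wo1 * delta_o           # ~-0.00265
d2 = h2 * (1 - h2) * wo2 * delta_o           # ~-0.008156
wo1, wo2 = wo1 + eta * delta_o * h1, wo2 + eta * delta_o * h2
w1A, w1B = w1A + eta * d1 * x_A, w1B + eta * d1 * x_B
w2A, w2B = w2A + eta * d2 * x_A, w2B + eta * d2 * x_B

# (iii) second forward pass: the error shrinks from about -0.19 to about -0.18
h1 = sigmoid(w1A * x_A + w1B * x_B)
h2 = sigmoid(w2A * x_A + w2B * x_B)
o = sigmoid(wo1 * h1 + wo2 * h2)
print(round(o, 3), round(target - o, 3))     # ~0.682 and ~-0.182
```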

46


Page 47: Lecture 9: Neural Networks

More on backpropagation

β€’ It is gradient descent over the entire network weight vector
β€’ Easily generalized to arbitrary directed graphs
β€’ Will find a local, not necessarily global, error minimum
  β€’ In practice, it often works well (can run multiple times)
β€’ Often includes a weight momentum term $\alpha$: $\Delta w_{j,i}(n) \leftarrow \eta\, \delta_j x_i + \alpha\, \Delta w_{j,i}(n-1)$ (see the sketch below)
β€’ Minimizes error over the training examples
  β€’ Will it generalize well to unseen examples?
β€’ Training can take thousands of iterations, so it is slow!
β€’ Using the network after training is very fast

47
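A tiny sketch of the momentum update named above (the learning rate and $\alpha = 0.9$ are common but assumed values):

```python
def momentum_update(w, grad_term, prev_delta, eta=0.1, alpha=0.9):
    """delta_w(n) = eta * delta_j * x_i + alpha * delta_w(n - 1)."""
    delta_w = eta * grad_term + alpha * prev_delta
    return w + delta_w, delta_w            # keep delta_w for the next iteration

w, prev = 0.0, 0.0
for grad_term in [1.0, 1.0, 1.0]:          # a constant gradient signal (delta_j * x_i)
    w, prev = momentum_update(w, grad_term, prev)
print(round(w, 3))   # 0.1 + 0.19 + 0.271 = 0.561: steps grow as momentum accumulates
```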

Page 48: Lecture 9: Neural Networks

A Training Example

48

Page 49: Lecture 9: Neural Networks

Network structure

β€’ What happens when we train a network?

β€’ Example:

49

Page 50: Lecture 9: Neural Networks

Input and output

β€’ A target function:

β€’ Can this be learned?

50

Page 51: Lecture 9: Neural Networks

Learned hidden layer

51

Page 52: Lecture 9: Neural Networks

Convergence of sum of squared errors

52

Page 53: Lecture 9: Neural Networks

Hidden unit encoding for 01000000

53

Page 54: Lecture 9: Neural Networks

Convergence of first layer weights

54

Page 55: Lecture 9: Neural Networks

Convergence of backpropagation

β€’ Gradient descent goes to some local minimum
  β€’ Perhaps not the global minimum
  β€’ Add momentum
  β€’ Stochastic gradient descent
  β€’ Train multiple times with different initial weights

β€’ Nature of convergence
  β€’ Initialize weights near zero
  β€’ Therefore, initial networks are near-convex
  β€’ Increasingly approximate non-linear functions as training progresses

55

Page 56: Lecture 9: Neural Networks

Expressiveness of neural nets

β€’ Boolean functions:
  β€’ Every Boolean function can be represented by a network with a single hidden layer
  β€’ But it might require a number of hidden units exponential in the number of inputs

β€’ Continuous functions:
  β€’ Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer
  β€’ Any function can be approximated to arbitrary accuracy by a network with two hidden layers

56

Page 57: Lecture 9: Neural Networks

Tensorflow playground

β€’ http://playground.tensorflow.org/

57

Page 58: Lecture 9: Neural Networks

Overfitting

58

Page 59: Lecture 9: Neural Networks

Two examples of overfitting

59

Page 60: Lecture 9: Neural Networks

Avoid overfitting

β€’ Penalize large weights:

β€’ Train on target slopes as well as values:

β€’ Weight sharing
  β€’ Reduce total number of weights
  β€’ Using structures, like convolutional neural networks

β€’ Early stopping

β€’ Dropout

60
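A minimal sketch of the two simplest remedies in this list, a weight penalty (L2 regularization) and early stopping, applied for brevity to a plain gradient-descent fit of a linear model (the data, penalty strength, and patience are illustrative assumptions):

```python
import numpy as np

def train_regularized(X, y, eta=0.01, lam=0.1, patience=20, max_epochs=5000, seed=0):
    """Gradient descent on 1/2*||Xw - y||^2 + lam/2*||w||^2 with early stopping."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    val, tr = idx[:len(X) // 4], idx[len(X) // 4:]   # hold out a validation split
    w = np.zeros(X.shape[1])
    best_w, best_err, bad = w.copy(), np.inf, 0
    for _ in range(max_epochs):
        grad = X[tr].T @ (X[tr] @ w - y[tr]) / len(tr) + lam * w   # penalty term lam*w
        w -= eta * grad
        val_err = np.mean((X[val] @ w - y[val]) ** 2)
        if val_err < best_err:
            best_w, best_err, bad = w.copy(), val_err, 0
        else:
            bad += 1
            if bad >= patience:      # early stopping: validation error stopped improving
                break
    return best_w

# Toy data: y depends only on the first of 10 features; the rest invite overfitting
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(200)
print(np.round(train_regularized(X, y), 2))   # first weight near 2 (slightly shrunk), others near 0
```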

Page 61: Lecture 9: Neural Networks

Applications

61

Page 62: Lecture 9: Neural Networks

Autonomous self-driving cars

β€’ Use neural networks to summarize observation information and learn path planning

62

Page 63: Lecture 9: Neural Networks

Autonomous self-driving cars (cont.)

63

Page 64: Lecture 9: Neural Networks

A recipe for training neural networks

β€’ By Andrej Karpathy

β€’ https://karpathy.github.io/2019/04/25/recipe/

64

Page 65: Lecture 9: Neural Networks

Summary

β€’ Perceptron

β€’ Activation functions

β€’ Multilayer perceptron networks

β€’ Training: backpropagation

β€’ Examples

β€’ Overfitting

β€’ Applications

Questions?

https://shuaili8.github.io

Shuai Li

65