
Neural Networks

CSE 6363 – Machine Learning Vassilis Athitsos

Computer Science and Engineering Department University of Texas at Arlington


Perceptrons

• A perceptron is a function that maps D-dimensional vectors to real numbers.

• For notational convenience, we add a zero-th dimension to every input vector, that is always equal to 1.

• 𝑥0 is called the bias input. It is always equal to 1.

• 𝑤0 is called the bias weight. It is optimized during training.

[Figure: a perceptron with inputs 𝑥0 = 1, 𝑥1, 𝑥2, …, 𝑥𝐷 and output 𝑧 = ℎ(𝒘𝑇𝒙), where 𝐱 = (1, 𝑥1, 𝑥2, …, 𝑥𝐷)𝑇.]

Perceptrons

• A perceptron computes its output 𝑧 in two steps:

First step: 𝑎 = 𝒘𝑇𝒙 = ∑ᵢ₌₀ᴰ 𝑤𝑖𝑥𝑖

Second step: 𝑧 = ℎ(𝑎)

• In a single formula: 𝑧 = ℎ(∑ᵢ₌₀ᴰ 𝑤𝑖𝑥𝑖)


Perceptrons

• A perceptron computes its output 𝑧 in two steps:

First step: 𝑎 = 𝒘𝑇𝒙 = ∑ᵢ₌₀ᴰ 𝑤𝑖𝑥𝑖

Second step: 𝑧 = ℎ(𝑎)

• ℎ is called an activation function.

• For example, ℎ could be the sigmoidal function 𝜎(𝑎) = 1 / (1 + 𝑒−𝑎)


Perceptrons

• We have seen perceptrons before, we just did not call them perceptrons.

• For example, logistic regression produces a classifier function 𝑦(𝐱) = 𝜎(𝒘𝑇𝜑(𝒙)).

• If we set 𝜑(𝒙) = 𝒙 and ℎ = 𝜎, then 𝑦(𝐱) is a perceptron.


Perceptrons and Neurons

• Perceptrons are inspired by neurons.

– Neurons are the cells forming the nervous system and the brain.

– Neurons somehow sum up their inputs, and if the sum exceeds a threshold, they "fire".

• Since brains are "intelligent", computer scientists have been hoping that perceptron-based systems can be used to model intelligence.


Activation Functions

• A perceptron produces output 𝑧 = ℎ(𝒘𝑇𝒙).

• One choice for the activation function ℎ: the step function.

ℎ(𝑎) = 0, if 𝑎 < 0;  1, if 𝑎 ≥ 0

• The step function is useful for providing some intuitive examples.

• It is not useful for actual real-world systems.

– Reason: it is not differentiable, so it does not allow optimization via gradient descent.

Activation Functions

• A perceptron produces output 𝑧 = ℎ(𝒘𝑇𝒙).

• Another choice for the activation function ℎ(𝑎): the sigmoidal function.

𝜎(𝑎) = 1 / (1 + 𝑒−𝑎)

• The sigmoidal function is often used in real-world systems.

• It is a differentiable function, so it allows use of gradient descent.
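The two-step computation with a sigmoidal activation can be sketched as follows (a minimal example; the weight values are arbitrary, chosen only for the demonstration):

```python
import math

def perceptron_output(w, x):
    """Compute z = h(w^T x) with h = sigmoid.
    Both w and x include the bias dimension 0 (x[0] must be 1)."""
    a = sum(wi * xi for wi, xi in zip(w, x))   # first step: a = w^T x
    return 1.0 / (1.0 + math.exp(-a))          # second step: z = sigma(a)

# Example: D = 2 inputs, plus the bias input x0 = 1.
w = [-1.0, 0.5, 0.5]    # arbitrary weights [w0, w1, w2]
x = [1.0, 2.0, 2.0]     # input vector [x0 = 1, x1, x2]
print(perceptron_output(w, x))   # sigma(-1 + 1 + 1) = sigma(1), about 0.731
```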

Example: The AND Perceptron

• Suppose we use the step function for activation.

• Suppose boolean value false is represented as number 0.

• Suppose boolean value true is represented as number 1.

• Then, the perceptron below computes the boolean AND function:

[Figure: the AND perceptron, with weights 𝑤0 = −1.5, 𝑤1 = 1, 𝑤2 = 1. Truth table: false AND false = false; false AND true = false; true AND false = false; true AND true = true.]
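The truth table can be verified mechanically; here is a small sketch using the step function and the weights 𝑤0 = −1.5, 𝑤1 = 𝑤2 = 1 that appear in the verification steps:

```python
def step(a):
    """Step activation: h(a) = 0 if a < 0, 1 if a >= 0."""
    return 1 if a >= 0 else 0

def and_perceptron(x1, x2):
    """z = h(w^T x) with w = [-1.5, 1, 1] and bias input x0 = 1."""
    return step(-1.5 * 1 + 1 * x1 + 1 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, "AND", x2, "=", and_perceptron(x1, x2))
# Outputs 0, 0, 0, 1: exactly the AND truth table.
```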

Example: The AND Perceptron

• Verification: If 𝑥1 = 0 and 𝑥2 = 0:

– 𝒘𝑇𝒙 = −1.5 + 1 ∗ 0 + 1 ∗ 0 = −1.5.

– ℎ(𝒘𝑇𝒙) = ℎ(−1.5) = 0.

• Corresponds to case false AND false = false.

Example: The AND Perceptron

• Verification: If 𝑥1 = 0 and 𝑥2 = 1:

– 𝒘𝑇𝒙 = −1.5 + 1 ∗ 0 + 1 ∗ 1 = −0.5.

– ℎ(𝒘𝑇𝒙) = ℎ(−0.5) = 0.

• Corresponds to case false AND true = false.

Example: The AND Perceptron

• Verification: If 𝑥1 = 1 and 𝑥2 = 0:

– 𝒘𝑇𝒙 = −1.5 + 1 ∗ 1 + 1 ∗ 0 = −0.5.

– ℎ(𝒘𝑇𝒙) = ℎ(−0.5) = 0.

• Corresponds to case true AND false = false.

Example: The AND Perceptron

• Verification: If 𝑥1 = 1 and 𝑥2 = 1:

– 𝒘𝑇𝒙 = −1.5 + 1 ∗ 1 + 1 ∗ 1 = 0.5.

– ℎ(𝒘𝑇𝒙) = ℎ(0.5) = 1.

• Corresponds to case true AND true = true.

Example: The OR Perceptron

• Suppose we use the step function for activation.

• Suppose boolean value false is represented as number 0.

• Suppose boolean value true is represented as number 1.

• Then, the perceptron below computes the boolean OR function:

[Figure: the OR perceptron, with weights 𝑤0 = −0.5, 𝑤1 = 1, 𝑤2 = 1. Truth table: false OR false = false; false OR true = true; true OR false = true; true OR true = true.]

Example: The OR Perceptron

• Verification: If 𝑥1 = 0 and 𝑥2 = 0:

– 𝒘𝑇𝒙 = −0.5 + 1 ∗ 0 + 1 ∗ 0 = −0.5.

– ℎ(𝒘𝑇𝒙) = ℎ(−0.5) = 0.

• Corresponds to case false OR false = false.

Example: The OR Perceptron

• Verification: If 𝑥1 = 0 and 𝑥2 = 1:

– 𝒘𝑇𝒙 = −0.5 + 1 ∗ 0 + 1 ∗ 1 = 0.5.

– ℎ(𝒘𝑇𝒙) = ℎ(0.5) = 1.

• Corresponds to case false OR true = true.

Example: The OR Perceptron

• Verification: If 𝑥1 = 1 and 𝑥2 = 0:

– 𝒘𝑇𝒙 = −0.5 + 1 ∗ 1 + 1 ∗ 0 = 0.5.

– ℎ(𝒘𝑇𝒙) = ℎ(0.5) = 1.

• Corresponds to case true OR false = true.

Example: The OR Perceptron

• Verification: If 𝑥1 = 1 and 𝑥2 = 1:

– 𝒘𝑇𝒙 = −0.5 + 1 ∗ 1 + 1 ∗ 1 = 1.5.

– ℎ(𝒘𝑇𝒙) = ℎ(1.5) = 1.

• Corresponds to case true OR true = true.

Example: The NOT Perceptron

• Suppose we use the step function for activation.

• Suppose boolean value false is represented as number 0.

• Suppose boolean value true is represented as number 1.

• Then, the perceptron below computes the boolean NOT function:

[Figure: the NOT perceptron, with weights 𝑤0 = 0.5, 𝑤1 = −1. Truth table: NOT(false) = true; NOT(true) = false.]

Example: The NOT Perceptron

• Verification: If 𝑥1 = 0:

– 𝒘𝑇𝒙 = 0.5 − 1 ∗ 0 = 0.5.

– ℎ(𝒘𝑇𝒙) = ℎ(0.5) = 1.

• Corresponds to case NOT(false) = true.

Example: The NOT Perceptron

• Verification: If 𝑥1 = 1:

– 𝒘𝑇𝒙 = 0.5 − 1 ∗ 1 = −0.5.

– ℎ(𝒘𝑇𝒙) = ℎ(−0.5) = 0.

• Corresponds to case NOT(true) = false.

The XOR Function

• As before, we represent false with 0 and true with 1.

• The figure shows the four input points of the XOR function.

– Green corresponds to output value true.

– Red corresponds to output value false.

• The two classes (true and false) are not linearly separable.

• Therefore, no perceptron can compute the XOR function.

[Figure: XOR truth table: false XOR false = false; false XOR true = true; true XOR false = true; true XOR true = false.]

Neural Networks

• A neural network is built using perceptrons as building blocks.

• The inputs to some perceptrons are outputs of other perceptrons.

• Here is an example neural network computing the XOR function.

[Figure: a feedforward network computing XOR; inputs 𝑥1 and 𝑥2 feed hidden units 3 and 4, whose outputs feed output unit 5.]
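Such a network can be sketched in code using the step-activation perceptrons introduced earlier. The weights below are one valid choice built from the AND and OR perceptrons (the figure's actual weights are not given in the text, so treat these as illustrative):

```python
def step(a):
    """Step activation: h(a) = 0 if a < 0, 1 if a >= 0."""
    return 1 if a >= 0 else 0

def xor_network(x1, x2):
    """A two-layer feedforward network computing XOR.
    Unit 3 acts as OR, unit 4 acts as AND, and output unit 5
    fires when unit 3 fires but unit 4 does not."""
    z3 = step(-0.5 + x1 + x2)    # unit 3: OR(x1, x2)
    z4 = step(-1.5 + x1 + x2)    # unit 4: AND(x1, x2)
    z5 = step(-0.5 + z3 - z4)    # unit 5: z3 AND NOT z4
    return z5

print([xor_network(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```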

Neural Networks

• To simplify the picture, we do not show the bias input anymore.

– We just show the bias weights 𝑤𝑗,0.

• Besides the bias input, there are two inputs: 𝑥1, 𝑥2.


Neural Networks

• This neural network example consists of six units:

– Three input units (including the not-shown bias input).

– Three perceptrons.

• Yes, inputs count as units.

Neural Networks

• Weights are denoted as 𝑤𝑗𝑖.

– Weight 𝑤𝑗𝑖 belongs to the edge that connects the output of unit 𝑖 with an input of unit 𝑗.

• Units 0, 1, … , 𝐷 are the input units (units 0, 1, 2 in this example).

Neural Network Layers

• Oftentimes, neural networks are organized into layers.

• The input layer is the initial layer of input units (units 0, 1, 2 in our example).

• The output layer is at the end (unit 5 in our example).

• Zero, one or more hidden layers can be between the input and output layers.


Neural Network Layers

• There is only one hidden layer in our example, containing units 3 and 4.

• Each hidden layer's inputs are outputs from the previous layer.

• Each hidden layer's outputs are inputs to the next layer.

• The first hidden layer's inputs come from the input layer.

• The last hidden layer's outputs are inputs to the output layer.

Feedforward Networks

• Feedforward networks are networks where there are no directed loops.

• If there are no loops, the output of a neuron cannot (directly or indirectly) influence its input.

• While there are varieties of neural networks that are not feedforward or layered, our main focus will be layered feedforward networks.


Computing the Output

• Notation: L is the number of layers.

– Layer 1 is the input layer, layer L is the output layer.

• Given values for the input units, output is computed as follows:

• For (𝑙 = 2; 𝑙 ≤ 𝐿; 𝑙 = 𝑙 + 1):

– Compute the outputs of layer 𝑙, given the outputs of layer 𝑙 − 1.

Computing the Output

• To compute the outputs of layer 𝑙 (where 𝑙 > 1), we simply need to compute the output of each perceptron belonging to layer 𝑙.

– For each such perceptron, its inputs come from outputs of perceptrons at layer 𝑙 − 1.

– Remember, we compute layer outputs in increasing order of 𝑙.
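The layer-by-layer computation can be sketched as follows. This is a sketch under assumed data structures: the network is a list of layers, each layer a list of per-unit weight vectors over the bias plus the previous layer's outputs:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(layers, x):
    """Compute network outputs layer by layer, in increasing order of l.
    `layers` lists the hidden and output layers; each layer is a list of
    weight vectors, one per unit, over [bias, previous layer's outputs]."""
    z = list(x)                        # outputs of the input layer
    for layer in layers:               # l = 2, ..., L
        inputs = [1.0] + z             # prepend the bias input
        z = [sigmoid(sum(w_i * in_i for w_i, in_i in zip(w, inputs)))
             for w in layer]           # outputs of layer l
    return z

# Example: 2 inputs -> 2 hidden units -> 1 output (arbitrary weights).
layers = [[[0.1, 0.2, -0.3], [-0.2, 0.4, 0.1]],   # hidden layer
          [[0.05, 0.3, -0.4]]]                    # output layer
print(forward(layers, [0.5, -1.0]))
```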

What Neural Networks Can Compute

• An individual perceptron is a linear classifier.

– The weights of the perceptron define a linear boundary between two classes.

• Layered feedforward neural networks with one hidden layer can approximate any continuous function, given enough hidden units.

• Layered feedforward neural networks with two hidden layers can approximate any mathematical function.

• This has been known for decades, and is one reason scientists have been optimistic about the potential of neural networks to model intelligent systems.

• Another reason is the analogy between neural networks and biological brains, which have been a standard of intelligence we are still trying to achieve.

• There is only one catch: how do we find the right weights?

Training a Neural Network

• In linear regression, for the sum-of-squares error, we could find the best weights using a closed-form formula.

• In logistic regression, for the cross-entropy error, we could find the best weights using an iterative method.

• In neural networks, we cannot find the best weights (unless we have an astronomical amount of luck).

– We only have optimization methods that find local minima of the error function.

– Still, in recent years such methods have produced spectacular results in real-world applications.


Notation for Training Set

• We define 𝒘 to be the vector of all weights in the neural network.

• We have a set 𝑋 of 𝑁 training examples.

– 𝑋 = {𝒙1, 𝒙2, … , 𝒙𝑁}

• Each 𝒙𝑛 is a (𝐷+1)-dimensional column vector.

– Dimension 0 is the bias input, always set to 1.

– 𝒙𝑛 = (1, 𝑥𝑛1, 𝑥𝑛2, … , 𝑥𝑛𝐷)𝑇

• We also have a set 𝑇 of 𝑁 target outputs.

– 𝑇 = {𝒕1, 𝒕2, … , 𝒕𝑁}

– 𝒕𝑛 is the target output for training example 𝒙𝑛.

• Each 𝒕𝑛 is a 𝐾-dimensional column vector:

– 𝒕𝑛 = (𝑡𝑛1, 𝑡𝑛2, … , 𝑡𝑛𝐾)𝑇

• Note: 𝐾 typically is not equal to 𝐷.

Perceptron Learning

• Before we discuss how to train an entire neural network, we start with a single perceptron.

• Remember: given input 𝒙𝑛, a perceptron computes its output 𝑧 using this formula: 𝑧(𝒙) = ℎ(𝒘𝑇𝒙)

• We use sum-of-squares as our error function.

• 𝐸𝑛(𝒘) is the contribution of training example 𝒙𝑛:

𝐸𝑛(𝒘) = (1/2) (𝑧(𝒙𝑛) − 𝑡𝑛)²

• The overall error 𝐸 is defined as: 𝐸(𝒘) = ∑ₙ₌₁ᴺ 𝐸𝑛(𝒘)

Perceptron Learning

• Suppose that a perceptron is using the step function as its activation function ℎ.

• Can we apply gradient descent in that case?

• No, because 𝐸(𝒘) is not differentiable.

– Small changes of 𝒘 usually lead to no changes in ℎ(𝒘𝑇𝒙).

– The only exception is when the change in 𝒘 causes 𝒘𝑇𝒙 to switch signs (from positive to negative, or from negative to positive), which makes ℎ(𝒘𝑇𝒙) jump between 0 and 1.

𝑧(𝒙) = ℎ(𝒘𝑇𝒙) = 0, if 𝒘𝑇𝒙 < 0;  1, if 𝒘𝑇𝒙 ≥ 0

ℎ(𝑎) = 0, if 𝑎 < 0;  1, if 𝑎 ≥ 0

Perceptron Learning

• A better option is setting ℎ to the sigmoid function:

𝑧(𝒙) = ℎ(𝒘𝑇𝒙) = 1 / (1 + 𝑒−𝒘𝑇𝒙)

• Then, measured just on a single training object 𝒙𝑛, the error 𝐸𝑛(𝒘) is defined as:

𝐸𝑛(𝒘) = (1/2) (𝑡𝑛 − 𝑧(𝒙𝑛))² = (1/2) (𝑡𝑛 − 1 / (1 + 𝑒−𝒘𝑇𝒙𝑛))²

• Note: here we use the sum-of-squares error, and not the cross-entropy error that we used for logistic regression.

• Also note: if our neural network is a single perceptron, then the target output 𝑡𝑛 is one-dimensional.

Computing the Gradient

• 𝐸𝑛(𝒘) = (1/2) (𝑡𝑛 − 𝑧(𝒙𝑛))² = (1/2) (𝑡𝑛 − 1 / (1 + 𝑒−𝒘𝑇𝒙𝑛))²

• In this form, 𝐸𝑛(𝒘) is differentiable.

• If we do the calculations, the gradient turns out to be:

𝜕𝐸𝑛/𝜕𝒘 = (𝑧(𝒙𝑛) − 𝑡𝑛) ∗ 𝑧(𝒙𝑛) ∗ (1 − 𝑧(𝒙𝑛)) ∗ 𝒙𝑛

• Note that 𝜕𝐸𝑛/𝜕𝒘 is a (𝐷+1)-dimensional vector. It is a scalar multiplied by vector 𝒙𝑛.

Weight Update

𝜕𝐸𝑛/𝜕𝒘 = (𝑧(𝒙𝑛) − 𝑡𝑛) ∗ 𝑧(𝒙𝑛) ∗ (1 − 𝑧(𝒙𝑛)) ∗ 𝒙𝑛

• So, we update the weight vector 𝒘 as follows:

𝒘 = 𝒘 − 𝜂 ∗ (𝑧(𝒙𝑛) − 𝑡𝑛) ∗ 𝑧(𝒙𝑛) ∗ (1 − 𝑧(𝒙𝑛)) ∗ 𝒙𝑛

• As before, 𝜂 is the learning rate parameter.

– It is a positive real number that should be chosen carefully, so as not to be too big or too small.

• In terms of individual weights 𝑤𝑑, the update rule is:

𝑤𝑑 = 𝑤𝑑 − 𝜂 ∗ (𝑧(𝒙𝑛) − 𝑡𝑛) ∗ 𝑧(𝒙𝑛) ∗ (1 − 𝑧(𝒙𝑛)) ∗ 𝑥𝑛𝑑

Perceptron Learning - Summary

• Input: Training inputs 𝒙1, … , 𝒙𝑁, target outputs 𝑡1, … , 𝑡𝑁

1. Extend each 𝒙𝑛 to a (𝐷+1)-dimensional vector, by adding 1 (the bias input) as the value for dimension 0.

2. Initialize weights 𝑤𝑑 to small random numbers.

– For example, set each 𝑤𝑑 between −0.1 and 0.1.

3. For 𝑛 = 1 to 𝑁:

1. Compute 𝑧(𝒙𝑛).

2. For 𝑑 = 0 to 𝐷:

𝑤𝑑 = 𝑤𝑑 − 𝜂 ∗ (𝑧(𝒙𝑛) − 𝑡𝑛) ∗ 𝑧(𝒙𝑛) ∗ (1 − 𝑧(𝒙𝑛)) ∗ 𝑥𝑛𝑑

4. If some stopping criterion has been met, exit.

5. Else, go to step 3.
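The summary above translates directly into code. A minimal sketch follows; the learning rate, initialization range, and stopping threshold are example choices, not prescribed values:

```python
import math, random

def train_perceptron(X, t, eta=0.5, threshold=1e-5, max_epochs=10000):
    """Sequential gradient descent for a single sigmoid perceptron.
    X: list of (D+1)-dim inputs (dimension 0 already set to 1); t: targets."""
    D = len(X[0]) - 1
    w = [random.uniform(-0.1, 0.1) for _ in range(D + 1)]  # step 2
    z = lambda x: 1.0 / (1.0 + math.exp(-sum(wd * xd for wd, xd in zip(w, x))))
    last_error = float("inf")
    for _ in range(max_epochs):
        for xn, tn in zip(X, t):                           # step 3
            zn = z(xn)
            # w_d = w_d - eta * (z - t) * z * (1 - z) * x_d
            for d in range(D + 1):
                w[d] -= eta * (zn - tn) * zn * (1 - zn) * xn[d]
        error = sum(0.5 * (tn - z(xn)) ** 2 for xn, tn in zip(X, t))
        if abs(last_error - error) < threshold:            # stopping criterion
            break
        last_error = error
    return w

# Learn the OR function (linearly separable, so this converges).
random.seed(0)
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
t = [0, 1, 1, 1]
w = train_perceptron(X, t)
```

The stopping criterion used here (a small change in the cumulative squared error between passes) is the one described on the next slide.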

Stopping Criterion

• At step 4 of the perceptron learning algorithm, we need to decide whether to stop or not.

• One thing we can do is:

– Compute the cumulative squared error 𝐸(𝒘) of the perceptron at that point:

𝐸(𝒘) = ∑ₙ₌₁ᴺ 𝐸𝑛(𝒘) = ∑ₙ₌₁ᴺ (1/2) (𝑡𝑛 − 𝑧(𝒙𝑛))²

– Compare the current value of 𝐸(𝒘) with the value of 𝐸(𝒘) computed at the previous iteration.

– If the difference is too small (e.g., smaller than 0.00001) we stop.

Using Perceptrons for Multiclass Problems

• A perceptron outputs a number between 0 and 1.

• This is sufficient only for binary classification problems.

• For more than two classes, there are many different options.

• We will follow a general approach called one-versus-all classification.


One-Versus-All Perceptrons

• Suppose we have 𝐾 classes 𝐶1, … , 𝐶𝐾 , where 𝐾 > 2.

• We have training inputs 𝒙1, … , 𝒙𝑁, and target values 𝒕1, … , 𝒕𝑁.

• Each target value 𝒕𝑛 is a 𝐾-dimensional vector:

– 𝒕𝑛 = (𝑡𝑛1, 𝑡𝑛2, … , 𝑡𝑛𝐾)

– 𝑡𝑛𝑘 = 0 if the class of 𝒙𝑛 is not 𝐶𝑘.

– 𝑡𝑛𝑘 = 1 if the class of 𝒙𝑛 is 𝐶𝑘.

• For each class 𝐶𝑘, train a perceptron 𝑧𝑘 by using 𝑡𝑛𝑘 as the target value for 𝒙𝑛.

– So, perceptron 𝑧𝑘 is trained to recognize if an object belongs to class 𝐶𝑘 or not.

– In total, we train 𝐾 perceptrons, one for each class.

One-Versus-All Perceptrons

• To classify a test pattern 𝒙:

– Compute the responses 𝑧𝑘(𝒙) for all 𝐾 perceptrons.

– Find the perceptron 𝑧𝑘∗ such that the value 𝑧𝑘∗(𝒙) is higher than all other responses.

– Output that the class of x is 𝐶𝑘∗.

• In summary: we assign 𝒙 to the class whose perceptron produced the highest output value for 𝒙.

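Prediction with 𝐾 trained perceptrons then reduces to an argmax over the responses. A sketch, assuming each perceptron is given as a weight vector and inputs already carry the bias dimension:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def classify_one_vs_all(perceptrons, x):
    """perceptrons: list of K weight vectors, one per class C_1..C_K.
    Returns the index k* (0-based) whose perceptron responds most strongly."""
    responses = [sigmoid(sum(w_d * x_d for w_d, x_d in zip(w, x)))
                 for w in perceptrons]
    return max(range(len(responses)), key=responses.__getitem__)

# Example with K = 3 hypothetical trained perceptrons (D = 2, bias included):
perceptrons = [[-1.0, 2.0, 0.0], [-1.0, 0.0, 2.0], [1.0, -2.0, -2.0]]
print(classify_one_vs_all(perceptrons, [1.0, 0.9, 0.1]))  # class index 0
```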

Neural Network Notation

• 𝑈 is the total number of units in the neural network.

• Each unit is denoted as 𝑃𝑗, where 0 ≤ 𝑗 ≤ 𝑈 − 1.

– Units 𝑃0, … , 𝑃𝐷 are the input units.

– Unit 𝑃0 is the bias input, always equal to 1.

• We denote by 𝑤𝑗𝑖 the weight of the edge connecting the output of 𝑃𝑖 to an input of 𝑃𝑗.

• We denote by 𝑧𝑗 the output of unit 𝑃𝑗.

– If 0 ≤ 𝑗 ≤ 𝐷, then 𝑧𝑗 = 𝑥𝑗.

• We denote by vector 𝒛 the vector of all outputs: 𝒛 = (𝑧0, 𝑧1, … , 𝑧𝑈−1)

Target Value Notation

• To make notation more convenient, we treat target value 𝒕𝑛 as a U-dimensional vector.

– In other words, the dimensionality of 𝒕𝑛 is equal to the number of units in the network.

• If 𝑃𝑗 is the 𝑘-th output unit, and 𝒙𝑛 belongs to class 𝐶𝑘, then:

– 𝑡𝑛𝑗 = 1.

– 𝒕𝑛 will have values 0 in all other dimensions.

• This way, there is a one-to-one correspondence between dimensions of 𝒕𝑛 and dimensions of 𝒛.

• We do this only to simplify notation.

– See formula for defining error, in the next slide.

Squared Error for Neural Networks

• We define 𝕐 to be the set of output units:

𝕐 = {𝑃𝑗 : 𝑃𝑗 belongs to the output layer}

• As usual, we denote by 𝐸𝑛(𝒘) the contribution that training input 𝒙𝑛 makes to the overall error 𝐸(𝒘):

𝐸𝑛(𝒘) = (1/2) ∑𝑗:𝑃𝑗∈𝕐 (𝑡𝑛𝑗 − 𝑧𝑗)²

• Note that only the output from output units contributes to the error.

• If 𝑃𝑗 is not an output unit, then 𝑡𝑛𝑗 and 𝑧𝑗 get ignored.

Squared Error for Neural Networks

• As usual, we denote by 𝐸(𝒘) the overall error over all training examples:

𝐸(𝒘) = ∑ₙ₌₁ᴺ 𝐸𝑛(𝒘) = ∑ₙ₌₁ᴺ (1/2) ∑𝑗:𝑃𝑗∈𝕐 (𝑡𝑛𝑗 − 𝑧𝑗)²

• This is now a double summation.

– We sum over all training examples 𝒙𝑛.

– For each 𝒙𝑛, we sum over all perceptrons in the output layer.
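In code, the double summation reads as follows. This is a sketch: `outputs(xn)` stands for any function returning the network's output-unit values for input `xn`:

```python
def squared_error(outputs, X, T):
    """E(w) = sum over examples of 0.5 * sum over output units of (t - z)^2.
    outputs: function mapping an input vector to the list of output-unit
    values z_j; X: training inputs; T: target vectors (one per example)."""
    total = 0.0
    for xn, tn in zip(X, T):
        zn = outputs(xn)                     # output-unit values for x_n
        total += 0.5 * sum((t_j - z_j) ** 2 for t_j, z_j in zip(tn, zn))
    return total

# Tiny example with a fixed (constant) output function:
E = squared_error(lambda x: [0.8, 0.2], [[1, 0], [0, 1]], [[1, 0], [0, 1]])
print(E)  # 0.5*(0.04 + 0.04) + 0.5*(0.64 + 0.64) = 0.68
```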

Training Neural Networks

• To train a neural network, we follow the same approach of sequential learning that we followed for training single perceptrons:

• Given a training example 𝒙𝑛 and target output 𝒕𝑛:

– Compute the training error 𝐸𝑛(𝑤).

– Compute the gradient 𝜕𝐸𝑛/𝜕𝒘.

– Based on the gradient, we can update all weights.

• The process of computing the gradient and updating neural network weights is called backpropagation.

• We will see the solution when we use the sigmoidal function as activation function ℎ.

Computing the Gradient

• Overall, we want to compute 𝜕𝐸𝑛/𝜕𝒘.

• 𝜕𝐸𝑛/𝜕𝒘 is a vector containing one partial derivative per weight 𝑤𝑗𝑖.

• Therefore, it suffices to compute, for each 𝑤𝑗𝑖, the partial derivative 𝜕𝐸𝑛/𝜕𝑤𝑗𝑖.

• In order to compute 𝜕𝐸𝑛/𝜕𝑤𝑗𝑖, we will use this strategy:

– Decompose 𝐸𝑛 into a composition of simpler functions.

– Compute the derivative of each of those simpler functions.

– Apply the chain rule to obtain 𝜕𝐸𝑛/𝜕𝑤𝑗𝑖.

Decomposing the Error Function

• Let 𝑃𝑗 be a perceptron in the neural network.

• Define function 𝑎𝑗(𝒙𝑛, 𝐰) = ∑ᵢ 𝑤𝑗𝑖 𝑧𝑖(𝒙𝑛, 𝐰)

– 𝑎𝑗(𝒙𝑛, 𝐰) is simply the weighted sum of the inputs of 𝑃𝑗, given current input 𝒙𝑛 and given the current value of 𝐰.

• Define function 𝑧𝑗(𝑎) = 𝜎(𝑎) = 1 / (1 + 𝑒−𝑎).

– The output of 𝑃𝑗 is 𝑧𝑗(𝑎𝑗(𝒙𝑛, 𝐰)).

Decomposing the Error Function

• Define 𝐲𝑗 to be a vector containing all outputs of all perceptrons belonging to the same layer as 𝑃𝑗.

• Define function 𝐸𝑛𝑗(𝐲𝑗) to be the error of the network given the outputs of all perceptrons of the layer that 𝑃𝑗 belongs to.

• Intuition for 𝐸𝑛𝑗(𝐲𝑗):

– Suppose that you do not know anything about layers before the layer of perceptron 𝑃𝑗.

– Suppose that you know 𝐲𝑗, and all layers after the layer of 𝑃𝑗.

– Then, you can still compute the output of the network, and the error 𝐸𝑛.

Visualizing Function 𝐸𝑛𝑗

• As long as we can see the outputs of the layer that 𝑃𝑗 belongs to, and all layers after the layer of 𝑃𝑗, we can compute the output of the network, and the error 𝐸𝑛.

[Figure: the layers before the layer of 𝑃𝑗 are unknown ("?"); the layer of 𝑃𝑗 contains units 𝑃𝑞, 𝑃𝑞+1, …, 𝑃𝑟, followed by later layers up to the output layer.]

Decomposing the Error Function

• We have defined three auxiliary functions:

– 𝑎𝑗(𝒙𝑛, 𝐰)

– 𝑧𝑗(𝑎)

– 𝐸𝑛𝑗(𝐲𝑗)

• Suppose that the perceptrons belonging to the same layer as 𝑃𝑗 are indexed as 𝑃𝑞 , 𝑃𝑞+1, … , 𝑃𝑟−1, 𝑃𝑟

• Then, 𝐸𝑛 is a composition of functions 𝐸𝑛𝑗, 𝑧𝑗, 𝑎𝑗:

𝐸𝑛(𝒙𝑛, 𝐰) = 𝐸𝑛𝑗(𝐲𝑗) = 𝐸𝑛𝑗(𝑧𝑞(𝑎𝑞(𝒙𝑛, 𝐰)), … , 𝑧𝑟(𝑎𝑟(𝒙𝑛, 𝐰)))

Computing the Gradient of 𝐸𝑛

𝐸𝑛(𝒙𝑛, 𝐰) = 𝐸𝑛𝑗(𝑧𝑞(𝑎𝑞(𝒙𝑛, 𝐰)), … , 𝑧𝑟(𝑎𝑟(𝒙𝑛, 𝐰)))

• Then, we can compute 𝜕𝐸𝑛/𝜕𝑤𝑗𝑖 by applying the chain rule:

𝜕𝐸𝑛/𝜕𝑤𝑗𝑖 = 𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 ∗ 𝜕𝑧𝑗/𝜕𝑎𝑗 ∗ 𝜕𝑎𝑗/𝜕𝑤𝑗𝑖

• We will compute each of these three terms.

Computing 𝜕𝑎𝑗/𝜕𝑤𝑗𝑖

𝜕𝐸𝑛/𝜕𝑤𝑗𝑖 = 𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 ∗ 𝜕𝑧𝑗/𝜕𝑎𝑗 ∗ 𝜕𝑎𝑗/𝜕𝑤𝑗𝑖

𝜕𝑎𝑗/𝜕𝑤𝑗𝑖 = 𝜕(∑ᵤ 𝑤𝑗𝑢 𝑧𝑢(𝒙𝑛, 𝐰)) / 𝜕𝑤𝑗𝑖 = 𝑧𝑖(𝒙𝑛, 𝐰)

• Remember, 𝑧𝑖(𝒙𝑛, 𝐰) is just the output of unit 𝑃𝑖.

• The outputs of all units are straightforward to compute, given 𝒙𝑛 and 𝐰.

• So, computing 𝜕𝑎𝑗/𝜕𝑤𝑗𝑖 is entirely straightforward.

Computing 𝜕𝑧𝑗/𝜕𝑎𝑗

𝜕𝐸𝑛/𝜕𝑤𝑗𝑖 = 𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 ∗ 𝜕𝑧𝑗/𝜕𝑎𝑗 ∗ 𝜕𝑎𝑗/𝜕𝑤𝑗𝑖

𝜕𝑧𝑗/𝜕𝑎𝑗 = 𝜕𝜎(𝑎𝑗)/𝜕𝑎𝑗 = 𝜎(𝑎𝑗)(1 − 𝜎(𝑎𝑗)) = 𝑧𝑗(1 − 𝑧𝑗)

• One of the reasons we like using the sigmoidal function for activation is that its derivative has such a simple form.

Computing 𝜕𝐸𝑛𝑗/𝜕𝑧𝑗, Case 1: If 𝑃𝑗 Is an Output Unit

• If 𝑃𝑗 is an output unit, then 𝑧𝑗 is an output of the entire network.

• 𝑧𝑗 contributes to the error the term (1/2)(𝑡𝑛𝑗 − 𝑧𝑗)².

• Therefore:

𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 = 𝜕((1/2)(𝑡𝑛𝑗 − 𝑧𝑗)²)/𝜕𝑧𝑗 = 𝑧𝑗 − 𝑡𝑛𝑗

Updating Weights of Output Units

• If 𝑃𝑗 is an output unit, then we have computed all the terms we need for 𝜕𝐸𝑛/𝜕𝑤𝑗𝑖:

𝜕𝐸𝑛/𝜕𝑤𝑗𝑖 = 𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 ∗ 𝜕𝑧𝑗/𝜕𝑎𝑗 ∗ 𝜕𝑎𝑗/𝜕𝑤𝑗𝑖 = (𝑧𝑗 − 𝑡𝑛𝑗) ∗ 𝑧𝑗 ∗ (1 − 𝑧𝑗) ∗ 𝑧𝑖

• So, if 𝑤𝑗𝑖 is the weight of an output unit, we update it as follows:

𝑤𝑗𝑖 = 𝑤𝑗𝑖 − 𝜂 ∗ (𝑧𝑗 − 𝑡𝑛𝑗) ∗ 𝑧𝑗 ∗ (1 − 𝑧𝑗) ∗ 𝑧𝑖

Computing 𝜕𝐸𝑛𝑗/𝜕𝑧𝑗, Case 2: If 𝑃𝑗 Is a Hidden Unit

• Let 𝑃𝑗 be a hidden unit.

• Define 𝕃 to be the set of all units in the layer after 𝑃𝑗.

𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 = ∑𝑢∈𝕃 𝜕𝐸𝑛𝑢/𝜕𝑧𝑢 ∗ 𝜕𝑧𝑢/𝜕𝑎𝑢 ∗ 𝜕𝑎𝑢/𝜕𝑧𝑗

• We need to compute these three terms.

Computing 𝜕𝐸𝑛𝑗/𝜕𝑧𝑗, Case 2: If 𝑃𝑗 Is a Hidden Unit

𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 = ∑𝑢∈𝕃 𝜕𝐸𝑛𝑢/𝜕𝑧𝑢 ∗ 𝜕𝑧𝑢/𝜕𝑎𝑢 ∗ 𝜕𝑎𝑢/𝜕𝑧𝑗

𝜕𝑎𝑢/𝜕𝑧𝑗 = 𝜕(∑ᵢ 𝑤𝑢𝑖 𝑧𝑖(𝒙𝑛, 𝐰)) / 𝜕𝑧𝑗 = 𝑤𝑢𝑗

𝜕𝑧𝑢/𝜕𝑎𝑢 = 𝑧𝑢(1 − 𝑧𝑢)    (we computed this already, a few slides ago)

Computing 𝜕𝐸𝑛𝑗/𝜕𝑧𝑗, Case 2: If 𝑃𝑗 Is a Hidden Unit

𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 = ∑𝑢∈𝕃 𝜕𝐸𝑛𝑢/𝜕𝑧𝑢 ∗ 𝜕𝑧𝑢/𝜕𝑎𝑢 ∗ 𝜕𝑎𝑢/𝜕𝑧𝑗

• In the previous slide, we computed 𝜕𝑧𝑢/𝜕𝑎𝑢 and 𝜕𝑎𝑢/𝜕𝑧𝑗.

• So, the formula becomes:

𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 = ∑𝑢∈𝕃 𝜕𝐸𝑛𝑢/𝜕𝑧𝑢 ∗ 𝑧𝑢 ∗ (1 − 𝑧𝑢) ∗ 𝑤𝑢𝑗

Computing 𝜕𝐸𝑛𝑗/𝜕𝑧𝑗, Case 2: If 𝑃𝑗 Is a Hidden Unit

𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 = ∑𝑢∈𝕃 𝜕𝐸𝑛𝑢/𝜕𝑧𝑢 ∗ 𝑧𝑢 ∗ (1 − 𝑧𝑢) ∗ 𝑤𝑢𝑗

• Notice that 𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 is defined using 𝜕𝐸𝑛𝑢/𝜕𝑧𝑢.

– This is a recursive definition. To compute the values for a layer, we use the values from the next layer.

– This is why the whole algorithm is called backpropagation.

– We propagate computations from the output layer backwards towards the input layer.

Formula for Hidden Units

• From the previous slides, we have these formulas:

𝜕𝐸𝑛/𝜕𝑤𝑗𝑖 = 𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 ∗ 𝜕𝑧𝑗/𝜕𝑎𝑗 ∗ 𝜕𝑎𝑗/𝜕𝑤𝑗𝑖 = 𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 ∗ 𝑧𝑗 ∗ (1 − 𝑧𝑗) ∗ 𝑧𝑖

𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 = ∑𝑢∈𝕃 𝜕𝐸𝑛𝑢/𝜕𝑧𝑢 ∗ 𝑧𝑢 ∗ (1 − 𝑧𝑢) ∗ 𝑤𝑢𝑗

• We can combine these formulas, to compute 𝜕𝐸𝑛/𝜕𝑤𝑗𝑖 for any weight of any hidden unit.

– Just remember to start the computations from the output layer, and move backwards towards the input layer.

Simplifying Notation

• The previous formulas are sufficient and will work, but look complicated.

• We can simplify the formulas considerably, by defining:

𝛿𝑗 = 𝜕𝐸𝑛𝑗/𝜕𝑧𝑗 ∗ 𝜕𝑧𝑗/𝜕𝑎𝑗

• Then, if we combine calculations we already did:

– If 𝑃𝑗 is an output unit, then: 𝛿𝑗 = (𝑧𝑗 − 𝑡𝑛𝑗) ∗ 𝑧𝑗 ∗ (1 − 𝑧𝑗)

– If 𝑃𝑗 is a hidden unit, then: 𝛿𝑗 = (∑𝑢∈𝕃 𝛿𝑢 𝑤𝑢𝑗) ∗ 𝑧𝑗 ∗ (1 − 𝑧𝑗)

Final Backpropagation Formula

• Using the definition of 𝛿𝑗 from the previous slide, we finally get a very simple formula:

𝜕𝐸𝑛/𝜕𝑤𝑗𝑖 = 𝛿𝑗 𝑧𝑖

• Therefore, given a training input 𝒙𝑛, and given a positive learning rate 𝜂, each weight 𝑤𝑗𝑖 is updated using this formula:

𝑤𝑗𝑖 = 𝑤𝑗𝑖 − 𝜂 𝛿𝑗 𝑧𝑖

Backpropagation for One Object, Step 1: Initialize Input Layer

• We will now see how to apply backpropagation, step by step, in pseudocode style, for a single training object.

• First, given a training example 𝒙𝑛, and its target output 𝒕𝑛, we must initialize the input units:

// Array 𝑧 will store, for every unit 𝑃𝑗, its output.

• double 𝑧[] = new double[𝑈]

// Update the input layer: set inputs equal to 𝒙𝑛.

• For 𝑗 = 0 to 𝐷:

– 𝑧[𝑗] = 𝑥𝑛𝑗 // 𝑥𝑛𝑗 is the 𝑗-th dimension of 𝒙𝑛.

Backpropagation for One Object, Step 2: Compute Outputs

// We create an array 𝑎, which will store, for every perceptron 𝑃𝑗, the weighted sum of the inputs of 𝑃𝑗.

• double 𝑎[] = new double[𝑈]

// Update the rest of the layers:

• For 𝑙 = 2 to 𝐿: // 𝐿 is the number of layers

– For each perceptron 𝑃𝑗 in layer 𝑙:

• 𝑎[𝑗] = ∑ᵢ 𝑤𝑗𝑖 𝑧[𝑖] // weighted sum of inputs of 𝑃𝑗

• 𝑧[𝑗] = ℎ(𝑎[𝑗]) = 1 / (1 + 𝑒−𝑎[𝑗]) // output of unit 𝑃𝑗

Backpropagation for One Object, Step 3: Compute New 𝛿 Values

// We create an array 𝛿, which will store, for every perceptron 𝑃𝑗, value 𝛿𝑗.

• double 𝛿[] = new double[𝑈]

• For each output unit 𝑃𝑗:

– 𝛿[𝑗] = (𝑧[𝑗] − 𝑡𝑛𝑗) ∗ 𝑧[𝑗] ∗ (1 − 𝑧[𝑗])

• For 𝑙 = 𝐿 − 1 to 2: // MUST be decreasing order of 𝑙

– For each perceptron 𝑃𝑗 in layer 𝑙:

• 𝛿[𝑗] = (∑𝑢:𝑃𝑢 ∈ layer 𝑙+1 𝛿[𝑢] 𝑤𝑢𝑗) ∗ 𝑧[𝑗] ∗ (1 − 𝑧[𝑗])

Backpropagation for One Object, Step 4: Update Weights

• For 𝑙 = 2 to 𝐿: // Order does not matter here; we can go from 2 to 𝐿 or from 𝐿 to 2.

– For each perceptron 𝑃𝑗 in layer 𝑙:

• For each perceptron 𝑃𝑖 in the preceding layer 𝑙 − 1:

– 𝑤𝑗𝑖 = 𝑤𝑗𝑖 − 𝜂 ∗ 𝛿[𝑗] ∗ 𝑧[𝑖]

IMPORTANT: Do Step 3 before Step 4. Do NOT do steps 3 and 4 as a single loop.

• All 𝛿 values must be computed using the old values of the weights.

• Then, all weights must be updated using the new 𝛿 values.
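Steps 1 through 4 can be put together as a single weight update for one training object. A sketch under assumed data structures (a list of layers, each a list of per-unit weight vectors that include the bias weight):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def backprop_one_object(layers, xn, tn, eta):
    """One backpropagation update. `layers` lists the hidden and output
    layers; each layer is a list of weight vectors, one per unit, over
    [bias, previous layer's outputs]. tn lists the targets for the
    output units. Weights are modified in place."""
    # Steps 1-2: forward pass, remembering each layer's outputs.
    zs = [list(xn)]                       # outputs per layer (inputs first)
    for layer in layers:
        inputs = [1.0] + zs[-1]
        zs.append([sigmoid(sum(w_i * i for w_i, i in zip(w, inputs)))
                   for w in layer])
    # Step 3: delta values, starting from the output layer.
    deltas = [[(z - t) * z * (1 - z) for z, t in zip(zs[-1], tn)]]
    for l in range(len(layers) - 2, -1, -1):   # MUST be decreasing order
        next_layer, next_d = layers[l + 1], deltas[0]
        deltas.insert(0, [
            sum(d_u * w_u[j + 1] for d_u, w_u in zip(next_d, next_layer))
            * z * (1 - z)                      # w_u[0] is the bias weight
            for j, z in enumerate(zs[l + 1])])
    # Step 4: only after ALL deltas are known, update: w_ji -= eta*d_j*z_i.
    for layer, d_layer, z_prev in zip(layers, deltas, zs):
        for w, d_j in zip(layer, d_layer):
            for i, z_i in enumerate([1.0] + z_prev):
                w[i] -= eta * d_j * z_i

# Example: 2 inputs -> 2 hidden units -> 1 output (arbitrary weights).
net = [[[0.1, 0.2, -0.1], [0.0, -0.2, 0.3]], [[0.1, 0.3, -0.2]]]
backprop_one_object(net, [0.5, 1.0], [1.0], eta=0.5)
```

Note that all 𝛿 values are collected before any weight changes, matching the warning above.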

Backpropagation Summary

• Inputs:

– 𝑁 𝐷-dimensional training objects 𝒙1, … , 𝒙𝑁.

– The associated target values 𝒕1, … , 𝒕𝑁, which are 𝑈-dimensional vectors.

1. Extend each 𝒙𝑛 to a (𝐷+1)-dimensional vector, by adding the bias input as the value for the zero-th dimension.

2. Initialize weights 𝑤𝑗𝑖 to small random numbers.

– For example, set each 𝑤𝑗𝑖 between −0.1 and 0.1.

3. last_error = 𝐸(𝒘)

4. For 𝑛 = 1 to 𝑁:

– Given 𝒙𝑛, update weights 𝑤𝑗𝑖 as described in the previous slides.

5. err = 𝐸(𝒘)

6. If |err − last_error| < threshold, exit. // threshold can be 0.00001.

7. Else: last_error = err, go to step 4.

Classification with Neural Networks

• Suppose we have 𝐾 classes 𝐶1, … , 𝐶𝐾 , where 𝐾 > 2.

• Each class 𝐶𝑘 corresponds to an output perceptron 𝑃𝑢.

• Given a test pattern 𝒙 to classify:

– Extend 𝒙 to a (D+1)-dimensional vector, by adding the bias input as value for dimension 0.

– Compute outputs for all units of the network, working from the input layer towards the output layer.

• Find the output unit 𝑃𝑢 with the highest output 𝑧𝑢.

• Return the class that corresponds to 𝑃𝑢.


Structure of Neural Networks

• Backpropagation describes how to learn weights.

• However, it does not describe how to learn the structure:

– How many layers?

– How many units at each layer?

• These are parameters that we have to choose somehow.

• A good way to choose such parameters is by using a validation set, containing examples and their class labels.

– The validation set should be separate (disjoint) from the training set.


Structure of Neural Networks

• To choose the best structure for a neural network using a validation set, we try many different parameters (number of layers, number of units per layer).

• For each choice of parameters:

– We train several neural networks using backpropagation.

– We measure how well each neural network classifies the validation examples.

– Why not train just one neural network?


– Each network is randomly initialized, so after backpropagation it can end up being different from the other networks.

• At the end, we select the neural network that did best on the validation set.
