Statistical Methods for Data Mining
Kuangnan Fang
Xiamen University Email: [email protected]
May 18, 2019
Neural Networks
• Multi-layer Perceptron
• Forward propagation
• Backward propagation
Motivating Examples
Are You Excited about Deep Learning?
Limitations of Linear Classifiers
• Linear classifiers (e.g., logistic regression) classify inputs based on linear combinations of features xi
• Many decisions involve non-linear functions of the input
• Canonical example: do 2 input elements have the same value?
• The positive and negative cases cannot be separated by a plane
• What can we do?
How to Construct Nonlinear Classifiers?
• We would like to construct non-linear discriminative classifiers that utilize functions of the input variables
• Use a large number of simpler functions
• If these functions are fixed (Gaussian, sigmoid, polynomial basis functions), then optimization still involves linear combinations of (fixed functions of) the inputs
• Or we can make these functions depend on additional parameters → need an efficient method of training the extra parameters
Inspiration: The Brain
• Many machine learning methods are inspired by biology, e.g., the (human) brain
• Our brain has ∼ 10^11 neurons, each of which communicates with (is connected to) ∼ 10^4 other neurons
Figure : The basic computational unit of the brain: Neuron
Mathematical Model of a Neuron
• Neural networks define functions of the inputs (hidden features), computed by neurons
• Artificial neurons are called units
Figure : A mathematical model of the neuron in a neural network
Activation Functions
Most commonly used activation functions:
• Sigmoid: σ(z) = 1 / (1 + exp(−z))
• Tanh: tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))
• ReLU (Rectified Linear Unit): ReLU(z) = max(0, z)
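For reference, a minimal NumPy rendering of the three definitions above (the function names are ours):

```python
import numpy as np

# The three activation functions defined above, written out in NumPy.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # equivalent to np.tanh(z)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)
```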
Neuron in Python
• Example in Python of a neuron with a sigmoid activation function
Figure : Example code for computing the activation of a single neuron
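The code in the figure did not survive extraction; a minimal sketch of such a neuron, with illustrative weight and bias values, might look like:

```python
import numpy as np

class Neuron:
    """A single artificial neuron with a sigmoid activation."""

    def __init__(self, weights, bias):
        self.weights = np.asarray(weights, dtype=float)
        self.bias = float(bias)

    def forward(self, inputs):
        # linear combination of the inputs, plus the bias
        z = np.dot(self.weights, np.asarray(inputs, dtype=float)) + self.bias
        # sigmoid activation ("firing rate" between 0 and 1)
        return 1.0 / (1.0 + np.exp(-z))

neuron = Neuron(weights=[2.0, -1.0], bias=0.5)   # illustrative values
y = neuron.forward([1.0, 2.0])                    # z = 2 - 2 + 0.5 = 0.5
```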
Neural Network Architecture (Multi-Layer Perceptron)
• Network with one layer of four hidden units:
Figure : Two different visualizations of a 2-layer neural network. In this example: 3 input units, 4 hidden units and 2 output units
• Each unit computes its value based on a linear combination of the values of the units that point into it, and an activation function
• Naming conventions; a 2-layer neural network:
  • One layer of hidden units
  • One output layer
  (we do not count the inputs as a layer)
Neural Network Architecture (Multi-Layer Perceptron)
• Going deeper: a 3-layer neural network with two layers of hidden units
Figure : A 3-layer neural net with 3 input units, 4 hidden units in each of the first and second hidden layers, and 1 output unit
• Naming conventions; an N-layer neural network:
  • N − 1 layers of hidden units
  • One output layer
Representational Power
• A neural network with at least one hidden layer is a universal approximator (can represent any function)
  Proof in: Approximation by Superpositions of a Sigmoidal Function, Cybenko
• The capacity of the network increases with more hidden units and more hidden layers
• Why go deeper? Read e.g.: Do Deep Nets Really Need to be Deep?, Jimmy Ba and Rich Caruana
Neural Networks
• We only need to know two algorithms:
  • Forward pass: performs inference
  • Backward pass: performs learning
Forward Pass: What does the Network Compute?
• Output of the network can be written as:

  h_j(x) = f(v_{j0} + Σ_{i=1}^{D} x_i v_{ji})

  o_k(x) = g(w_{k0} + Σ_{j=1}^{J} h_j(x) w_{kj})

• (j indexes the hidden units, k indexes the output units, D is the number of inputs)
• Activation functions f, g: sigmoid/logistic, tanh, or rectified linear (ReLU)

  σ(z) = 1 / (1 + exp(−z)),  tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)),  ReLU(z) = max(0, z)
Forward Pass in Python
• Example code for a forward pass for a 3-layer network in Python:
• Can be implemented efficiently using matrix operations
• Example above: W1 is a matrix of size 4 × 3, W2 is 4 × 4. What about the biases and W3?
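The code figure itself was lost in extraction; a sketch consistent with the shapes quoted above (W1 of size 4 × 3, W2 of size 4 × 4), with the biases and a 1 × 4 output matrix W3 filled in as our assumed answers to the slide's question:

```python
import numpy as np

f = lambda z: 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

rng = np.random.default_rng(0)
# Shapes from the slide: W1 is 4x3, W2 is 4x4; W3 (1x4) and the
# biases b1, b2, b3 are our assumed completions.
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 1))
W3, b3 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

x = rng.standard_normal((3, 1))             # a single 3-dimensional input
h1 = f(W1 @ x + b1)                         # first hidden layer (4 units)
h2 = f(W2 @ h1 + b2)                        # second hidden layer (4 units)
out = W3 @ h2 + b3                          # output layer (1 unit)
```

Each layer is one matrix multiply plus a bias followed by the activation; stacking several inputs as columns of x would evaluate a whole batch with the same three lines.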
Special Case
• What is a single-layer (no hidden units) network with a sigmoid activation function?
• Network:

  o_k(x) = 1 / (1 + exp(−z_k)),  z_k = w_{k0} + Σ_{j=1}^{J} x_j w_{kj}

• Logistic regression!
Example Application
• Classify an image of a handwritten digit (32 × 32 pixels): 4 vs non-4
• How would you build your network?
• For example, use one hidden layer and the sigmoid activation function:

  o_k(x) = 1 / (1 + exp(−z_k)),  z_k = w_{k0} + Σ_{j=1}^{J} h_j(x) w_{kj}

• How can we train the network, that is, adjust all the parameters w?
Training Neural Networks
• Find weights:

  w* = argmin_w Σ_{n=1}^{N} loss(o^(n), t^(n))

  where o = f(x; w) is the output of the neural network
• Define a loss function, e.g.:
  • Squared loss: Σ_k (1/2) (o_k^(n) − t_k^(n))^2
  • Cross-entropy loss: − Σ_k t_k^(n) log o_k^(n)
• Gradient descent:

  w_{t+1} = w_t − η ∂E/∂w_t

  where η is the learning rate (and E is the error/loss)
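Putting these pieces together, a toy sketch of the recipe (squared loss plus plain gradient descent) for a single sigmoid unit learning OR of two binary inputs; the dataset, learning rate, and iteration count are illustrative choices of ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: OR of two binary inputs (targets t^(n))
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 1.0])

w = np.zeros(2)     # weights
b = 0.0             # bias w_0
eta = 1.0           # learning rate

def loss(w, b):
    # E = sum_n 1/2 * (o^(n) - t^(n))^2
    return 0.5 * np.sum((sigmoid(X @ w + b) - T) ** 2)

E0 = loss(w, b)
for _ in range(5000):
    o = sigmoid(X @ w + b)
    delta = (o - T) * o * (1 - o)    # dE/dz^(n), via the chain rule
    w -= eta * (X.T @ delta)         # gradient descent step on the weights
    b -= eta * delta.sum()           # ... and on the bias
E1 = loss(w, b)
```

The loss drops steadily from its initial value; after training, the output for (1, 1) is well above the output for (0, 0).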
![Page 39: Statistical Methods for Data Miningkuangnanfang.com/zb_users/upload/2019/05/201905201558318177337693.pdf · basis functions), then optimization still involves linear combinations](https://reader031.fdocuments.net/reader031/viewer/2022022809/5e521a2ab95aa2777c1710c4/html5/thumbnails/39.jpg)
Training Neural Networks
• Find weights:
w∗ = argminw
N∑n=1
loss(o(n), t(n)
)where o = f(x;w) is the output of a neural network
• Define a loss function, eg:
• Squared loss:∑
k12
(o(n)k − t(n)k
)2• Cross-entropy loss: −
∑k t
(n)k log o
(n)k
• Gradient descent:
wt+1 = wt − η ∂E∂wt
where η is the learning rate (and E is error/loss)
![Page 40: Statistical Methods for Data Miningkuangnanfang.com/zb_users/upload/2019/05/201905201558318177337693.pdf · basis functions), then optimization still involves linear combinations](https://reader031.fdocuments.net/reader031/viewer/2022022809/5e521a2ab95aa2777c1710c4/html5/thumbnails/40.jpg)
Training Neural Networks
• Find weights:
w∗ = argminw
N∑n=1
loss(o(n), t(n)
)where o = f(x;w) is the output of a neural network
• Define a loss function, eg:
• Squared loss:∑
k12
(o(n)k − t(n)k
)2• Cross-entropy loss: −
∑k t
(n)k log o
(n)k
• Gradient descent:
wt+1 = wt − η ∂E∂wt
where η is the learning rate (and E is error/loss)
![Page 41: Statistical Methods for Data Miningkuangnanfang.com/zb_users/upload/2019/05/201905201558318177337693.pdf · basis functions), then optimization still involves linear combinations](https://reader031.fdocuments.net/reader031/viewer/2022022809/5e521a2ab95aa2777c1710c4/html5/thumbnails/41.jpg)
Useful Derivatives
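The table on this slide was lost in extraction; the standard derivatives of the activation functions used in these slides (presumably the table's contents) are:

  dσ(z)/dz = σ(z)(1 − σ(z)),  d tanh(z)/dz = 1 − tanh(z)^2,  dReLU(z)/dz = 1 if z > 0, else 0

and for the squared error E = (1/2) Σ_k (o_k − t_k)^2, ∂E/∂o_k = o_k − t_k.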
Training Neural Networks: Back-propagation
• Back-propagation: an efficient method for computing the gradients needed to perform gradient-based optimization of the weights in a multi-layer network
• Given any error function E and activation functions g() and f(), we just need to derive the gradients
Key Idea behind Backpropagation
• We don’t have targets for a hidden unit, but we can compute how fast the error changes as we change its activity
• Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities
• Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined
• We can compute the error derivatives for all the hidden units efficiently
• Once we have the error derivatives for the hidden activities, it’s easy to get the error derivatives for the weights going into a hidden unit
• This is just the chain rule!
Computing Gradients: Single Layer Network

• Let's take a single layer network
• Error gradients for a single layer network:

$$\frac{\partial E}{\partial w_{ki}} = \underbrace{\frac{\partial E}{\partial o_k}}_{\delta^o_k}\,\frac{\partial o_k}{\partial z_k}\,\frac{\partial z_k}{\partial w_{ki}} = \underbrace{\delta^o_k \cdot \frac{\partial o_k}{\partial z_k}}_{\delta^z_k}\,\frac{\partial z_k}{\partial w_{ki}} = \delta^z_k \cdot x_i$$

• The error gradient is computable for any continuous activation function g(·) and any continuous error function
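The factorization into δ terms can be checked numerically for one concrete choice of g and E (here a logistic unit with squared error; both this choice and the finite-difference check are illustrative, not part of the slide):

```python
import numpy as np

def E_of_w(w, x, t):
    # E = 0.5 (o - t)^2 with o = g(z), z = sum_i w_i x_i, g logistic
    o = 1.0 / (1.0 + np.exp(-(w @ x)))
    return 0.5 * (o - t) ** 2

rng = np.random.default_rng(0)
w, x, t = rng.normal(size=3), rng.normal(size=3), 1.0

# Chain-rule gradient, factored exactly as on the slide
o = 1.0 / (1.0 + np.exp(-(w @ x)))
delta_o = o - t                   # delta^o_k = dE/do_k
delta_z = delta_o * o * (1 - o)   # delta^z_k = delta^o_k * do_k/dz_k
grad = delta_z * x                # dE/dw_ki = delta^z_k * x_i

# Central finite differences agree with the factored gradient
eps = 1e-6
num = np.array([(E_of_w(w + eps * np.eye(3)[i], x, t)
                 - E_of_w(w - eps * np.eye(3)[i], x, t)) / (2 * eps)
                for i in range(3)])
print(np.allclose(grad, num, atol=1e-7))  # True
```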
Gradient Descent for Single Layer Network

• Assuming the error function is mean-squared error (MSE), on a single training example n, we have

$$\frac{\partial E}{\partial o_k^{(n)}} = o_k^{(n)} - t_k^{(n)} := \delta^o_k$$

Using logistic activation functions:

$$o_k^{(n)} = g\big(z_k^{(n)}\big) = \Big(1 + \exp\big(-z_k^{(n)}\big)\Big)^{-1}, \qquad \frac{\partial o_k^{(n)}}{\partial z_k^{(n)}} = o_k^{(n)}\big(1 - o_k^{(n)}\big)$$

• The error gradient is then:

$$\frac{\partial E}{\partial w_{ki}} = \sum_{n=1}^N \frac{\partial E}{\partial o_k^{(n)}}\,\frac{\partial o_k^{(n)}}{\partial z_k^{(n)}}\,\frac{\partial z_k^{(n)}}{\partial w_{ki}} = \sum_{n=1}^N \big(o_k^{(n)} - t_k^{(n)}\big)\,o_k^{(n)}\big(1 - o_k^{(n)}\big)\,x_i^{(n)}$$

• The gradient descent update rule is given by:

$$w_{ki} \leftarrow w_{ki} - \eta\,\frac{\partial E}{\partial w_{ki}} = w_{ki} - \eta \sum_{n=1}^N \big(o_k^{(n)} - t_k^{(n)}\big)\,o_k^{(n)}\big(1 - o_k^{(n)}\big)\,x_i^{(n)}$$
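The update rule above can be sketched on a hypothetical 1-D toy problem (the data, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

# Toy data: output t = 1 for positive x, t = 0 for negative x
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
T = np.array([0.0, 0.0, 1.0, 1.0])
X = np.hstack([np.ones((4, 1)), X])   # prepend bias input x_0 = 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
eta = 0.5                              # fixed learning rate (assumed)

for _ in range(500):
    O = sigmoid(X @ w)                 # outputs o^(n)
    # w <- w - eta * sum_n (o - t) o (1 - o) x   (the slide's update rule)
    w -= eta * ((O - T) * O * (1 - O)) @ X

O = sigmoid(X @ w)
print(np.round(O))   # predictions match the targets: [0. 0. 1. 1.]
```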
Multi-layer Neural Network
Back-propagation: Sketch on One Training Case

• Convert the discrepancy between each output and its target value into an error derivative

$$E = \frac{1}{2}\sum_k (o_k - t_k)^2; \qquad \frac{\partial E}{\partial o_k} = o_k - t_k$$

• Compute error derivatives in each hidden layer from error derivatives in the layer above. [Assign blame for the error at k to each unit j according to its influence on k (depends on w_kj)]
• Use error derivatives w.r.t. activities to get error derivatives w.r.t. the weights
Gradient Descent for Multi-layer Network

• The output weight gradients for a multi-layer network are the same as for a single layer network:

$$\frac{\partial E}{\partial w_{kj}} = \sum_{n=1}^N \frac{\partial E}{\partial o_k^{(n)}}\,\frac{\partial o_k^{(n)}}{\partial z_k^{(n)}}\,\frac{\partial z_k^{(n)}}{\partial w_{kj}} = \sum_{n=1}^N \delta_k^{z,(n)} h_j^{(n)}$$

where δ_k is the error w.r.t. the net input for unit k

• Hidden weight gradients are then computed via back-prop:

$$\frac{\partial E}{\partial h_j^{(n)}} = \sum_k \frac{\partial E}{\partial o_k^{(n)}}\,\frac{\partial o_k^{(n)}}{\partial z_k^{(n)}}\,\frac{\partial z_k^{(n)}}{\partial h_j^{(n)}} = \sum_k \delta_k^{z,(n)} w_{kj} := \delta_j^{h,(n)}$$

$$\frac{\partial E}{\partial v_{ji}} = \sum_{n=1}^N \frac{\partial E}{\partial h_j^{(n)}}\,\frac{\partial h_j^{(n)}}{\partial u_j^{(n)}}\,\frac{\partial u_j^{(n)}}{\partial v_{ji}} = \sum_{n=1}^N \delta_j^{h,(n)} f'\big(u_j^{(n)}\big)\,\frac{\partial u_j^{(n)}}{\partial v_{ji}} = \sum_{n=1}^N \delta_j^{u,(n)} x_i^{(n)}$$
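The two stages above can be sketched for one training case of a small two-layer network (the layer sizes, logistic activations for both g and f, MSE error, and the finite-difference check are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)         # input
t = rng.normal(size=2)         # target
V = rng.normal(size=(2, 3))    # hidden weights v_ji
W = rng.normal(size=(2, 2))    # output weights w_kj

sig = lambda a: 1.0 / (1.0 + np.exp(-a))

# Forward pass: u = Vx, h = f(u), z = Wh, o = g(z), E = MSE
u = V @ x;  h = sig(u)
z = W @ h;  o = sig(z)
E = 0.5 * np.sum((o - t) ** 2)

# Backward pass, following the slide's deltas
delta_z = (o - t) * o * (1 - o)   # delta^z_k
dW = np.outer(delta_z, h)         # dE/dw_kj = delta^z_k h_j
delta_h = W.T @ delta_z           # delta^h_j = sum_k delta^z_k w_kj
delta_u = delta_h * h * (1 - h)   # delta^u_j = delta^h_j f'(u_j)
dV = np.outer(delta_u, x)         # dE/dv_ji = delta^u_j x_i

# Central finite-difference check on one hidden weight
eps = 1e-6
Vp = V.copy(); Vp[0, 0] += eps
Vm = V.copy(); Vm[0, 0] -= eps
Ep = 0.5 * np.sum((sig(W @ sig(Vp @ x)) - t) ** 2)
Em = 0.5 * np.sum((sig(W @ sig(Vm @ x)) - t) ** 2)
print(abs((Ep - Em) / (2 * eps) - dV[0, 0]) < 1e-6)  # True
```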
Choosing Activation and Loss Functions

• When using a neural network for regression, sigmoid activation and MSE as the loss function work well
• For classification, if it is a binary (2-class) problem, then the cross-entropy error function often does better (as we saw with logistic regression)

$$E = -\sum_{n=1}^N t^{(n)} \log o^{(n)} + \big(1 - t^{(n)}\big)\log\big(1 - o^{(n)}\big)$$

$$o^{(n)} = \Big(1 + \exp\big(-z^{(n)}\big)\Big)^{-1}$$

• We can then compute via the chain rule:

$$\frac{\partial E}{\partial o} = \frac{o - t}{o(1 - o)}, \qquad \frac{\partial o}{\partial z} = o(1 - o), \qquad \frac{\partial E}{\partial z} = \frac{\partial E}{\partial o}\,\frac{\partial o}{\partial z} = o - t$$
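The neat result ∂E/∂z = o − t can be checked numerically; the particular values of z and t below are arbitrary:

```python
import numpy as np

def bce_and_grad(z, t):
    """Binary cross-entropy of a logistic output, and its gradient in z."""
    o = 1.0 / (1.0 + np.exp(-z))
    E = -(t * np.log(o) + (1 - t) * np.log(1 - o))
    return E, o - t            # the slide's result: dE/dz = o - t

z, t = 0.7, 1.0
E, g = bce_and_grad(z, t)

# Central difference agrees with o - t
eps = 1e-6
num = (bce_and_grad(z + eps, t)[0] - bce_and_grad(z - eps, t)[0]) / (2 * eps)
print(abs(num - g) < 1e-8)   # True
```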
Multi-class Classification

• For multi-class classification problems, use cross-entropy as the loss and the softmax activation function

$$E = -\sum_n \sum_k t_k^{(n)} \log o_k^{(n)}, \qquad o_k^{(n)} = \frac{\exp\big(z_k^{(n)}\big)}{\sum_j \exp\big(z_j^{(n)}\big)}$$

• And the derivatives become

$$\frac{\partial o_k}{\partial z_k} = o_k(1 - o_k), \qquad \frac{\partial E}{\partial z_k} = \sum_j \frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial z_k} = o_k - t_k$$
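A minimal sketch of the softmax and cross-entropy definitions above (the logits and one-hot target are illustrative; subtracting the max before exponentiating is a standard numerical-stability trick, not part of the slide):

```python
import numpy as np

def softmax(z):
    # Subtract the max so exp never overflows; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(o, t):
    # E = -sum_k t_k log o_k for a one-hot target t
    return -np.sum(t * np.log(o))

z = np.array([2.0, 1.0, 0.1])   # logits z_k
o = softmax(z)
t = np.array([1.0, 0.0, 0.0])   # one-hot target

print(o.sum())                  # 1.0: softmax outputs form a distribution
print(cross_entropy(o, t) > 0)  # True
```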
Example Application

• Now trying to classify images of handwritten digits: 32 × 32 pixels
• 10 output units, 1 per digit
• Use the softmax function:

$$o_k = \frac{\exp(z_k)}{\sum_j \exp(z_j)}, \qquad z_k = w_{k0} + \sum_{j=1}^J h_j(x)\,w_{kj}$$

• What is J?
Ways to Use Weight Derivatives

• How often to update
  • after a full sweep through the training data (batch gradient descent)

$$w_{ki} \leftarrow w_{ki} - \eta\,\frac{\partial E}{\partial w_{ki}} = w_{ki} - \eta \sum_{n=1}^N \frac{\partial E\big(o^{(n)}, t^{(n)}; w\big)}{\partial w_{ki}}$$

  • after each training case (stochastic gradient descent)
  • after a mini-batch of training cases
• How much to update
  • Use a fixed learning rate
  • Adapt the learning rate
  • Add momentum:

$$v \leftarrow \gamma v + \eta\,\frac{\partial E}{\partial w_{ki}}, \qquad w_{ki} \leftarrow w_{ki} - v$$
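The momentum update can be sketched on a toy one-dimensional quadratic (the objective, η = 0.1, and γ = 0.9 are illustrative assumptions):

```python
# Minimize E(w) = 0.5 * w^2 with the momentum rule:
#   v <- gamma * v + eta * dE/dw
#   w <- w - v
eta, gamma = 0.1, 0.9          # learning rate and momentum (assumed values)
w, v = 5.0, 0.0                # start far from the minimum at w = 0

for _ in range(200):
    grad = w                   # dE/dw for E = 0.5 * w^2
    v = gamma * v + eta * grad
    w = w - v

print(abs(w) < 1e-3)           # True: converged near the minimum
```

The velocity v accumulates past gradients, so steps keep pointing the same way along consistent directions and oscillations partially cancel.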
Comparing Optimization Methods
Monitor Loss During Training

• Check how your loss behaves during training, to spot wrong hyperparameters, bugs, etc.

Figure: Left: good vs. bad parameter choices. Right: how a real loss might look during training. What causes the bumps? How could we get a smoother loss?
Monitor Accuracy on Train/Validation During Training

• Check how your desired performance metrics behave during training
Why "Deep"?
Neural Networks

• Deep learning uses composites of simple functions (e.g., ReLU, sigmoid, tanh, max) to create complex non-linear functions
• Note: a composite of linear functions is linear!
• Example: a 2-hidden-layer NNet (now in matrix and vector form!) with ReLU as the nonlinearity
  • x is the input
  • y is the output (what we want to predict)
  • h_i is the i-th hidden layer
  • W_i are the parameters of the i-th layer
Evaluating the Function
• Assume we have learned the weights and we want to do inference
• Forward Propagation: compute the output given the input
• Do it in a compositional way:

$$h_1 = \max(0, W_1^T x + b_1)$$
$$h_2 = \max(0, W_2^T h_1 + b_2)$$
$$y = W_3^T h_2 + b_3$$
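The forward pass above can be sketched in numpy. The layer sizes (5, 4, 3, 2) are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative dimensions: input 5, hidden layers 4 and 3, output 2.
W1, b1 = rng.normal(size=(5, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 3)), np.zeros(3)
W3, b3 = rng.normal(size=(3, 2)), np.zeros(2)

def forward(x):
    """Forward propagation: compose the layers, with ReLU as the nonlinearity."""
    h1 = np.maximum(0.0, W1.T @ x + b1)   # h1 = max(0, W1^T x + b1)
    h2 = np.maximum(0.0, W2.T @ h1 + b2)  # h2 = max(0, W2^T h1 + b2)
    y = W3.T @ h2 + b3                    # y  = W3^T h2 + b3
    return h1, h2, y

h1, h2, y = forward(rng.normal(size=5))
```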
Learning
• We want to estimate the parameters, biases and hyper-parameters (e.g., number of layers, number of units) such that we make good predictions
• Collect a training set of input-output pairs $\{x^{(n)}, t^{(n)}\}$
• For classification: encode the output with 1-of-K encoding, $t = [0, \ldots, 1, \ldots, 0]$
• Define a loss per training example and minimize the empirical risk

$$L(w) = \frac{1}{N} \sum_n \ell(w, x^{(n)}, t^{(n)})$$

with $N$ the number of examples and $w$ containing all parameters
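The 1-of-K encoding and the empirical risk can be sketched directly. The per-example loss `ell` below is a placeholder (squared error, just for illustration; the classification loss is defined on the next slide):

```python
import numpy as np

def one_hot(k, K):
    """1-of-K encoding: t = [0, ..., 1, ..., 0] with the 1 in position k."""
    t = np.zeros(K)
    t[k] = 1.0
    return t

def empirical_risk(ell, preds, targets):
    # L(w) = (1/N) * sum_n ell(y^(n), t^(n))
    return sum(ell(y, t) for y, t in zip(preds, targets)) / len(preds)

# Placeholder per-example loss, for illustration only.
ell = lambda y, t: float(np.sum((y - t) ** 2))

targets = [one_hot(k, 3) for k in [0, 2, 1]]
preds = [np.array([1.0, 0.0, 0.0]),
         np.array([0.0, 0.0, 1.0]),
         np.array([0.0, 0.0, 1.0])]   # last prediction is wrong
risk = empirical_risk(ell, preds, targets)
```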
Loss Function: Classification
$$L(w) = \frac{1}{N} \sum_n \ell(w, x^{(n)}, t^{(n)})$$

• Probability of class k given the input (softmax):

$$p(c_k = 1 \mid x) = \frac{\exp(y_k)}{\sum_{j=1}^{C} \exp(y_j)}$$

• Cross entropy is the most used loss function for classification:

$$\ell(w, x^{(n)}, t^{(n)}) = -\sum_k t_k^{(n)} \log p(c_k \mid x)$$

• Use gradient descent to train the network:

$$\min_w \frac{1}{N} \sum_n \ell(w, x^{(n)}, t^{(n)})$$
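The softmax and cross-entropy formulas translate directly into code. This sketch shifts by max(y) inside the softmax, a standard trick for numerical stability that leaves the probabilities unchanged:

```python
import numpy as np

def softmax(y):
    z = np.exp(y - y.max())          # shift for numerical stability
    return z / z.sum()               # p(c_k = 1 | x) = exp(y_k) / sum_j exp(y_j)

def cross_entropy(y, t):
    p = softmax(y)
    return -np.sum(t * np.log(p))    # -sum_k t_k log p(c_k | x)

y = np.array([2.0, 1.0, 0.1])        # network outputs (logits)
t = np.array([1.0, 0.0, 0.0])        # 1-of-K target: true class is k = 0
loss = cross_entropy(y, t)
```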
Backpropagation

• Efficient computation of the gradients by applying the chain rule

$$p(c_k = 1 \mid x) = \frac{\exp(y_k)}{\sum_{j=1}^{C} \exp(y_j)}$$
$$\ell(x^{(n)}, t^{(n)}, w) = -\sum_k t_k^{(n)} \log p(c_k \mid x)$$

• Compute the derivative of the loss w.r.t. the output:

$$\frac{\partial \ell}{\partial y} = p(c \mid x) - t$$

• Note that the forward pass is necessary to compute $\frac{\partial \ell}{\partial y}$
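The identity dℓ/dy = p(c|x) − t can be verified numerically with finite differences; this sketch does exactly that on arbitrary logits:

```python
import numpy as np

def softmax(y):
    z = np.exp(y - y.max())
    return z / z.sum()

def loss(y, t):
    return -np.sum(t * np.log(softmax(y)))

y = np.array([0.5, -1.0, 2.0])                # arbitrary logits
t = np.array([0.0, 1.0, 0.0])                 # 1-of-K target
analytic = softmax(y) - t                     # dl/dy = p(c|x) - t

# Central finite differences, one output coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(y)
for i in range(len(y)):
    e = np.zeros_like(y)
    e[i] = eps
    numeric[i] = (loss(y + e, t) - loss(y - e, t)) / (2 * eps)
```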
Backpropagation

• Efficient computation of the gradients by applying the chain rule
• We have computed the derivative of the loss w.r.t. the output:

$$\frac{\partial \ell}{\partial y} = p(c \mid x) - t$$

• Given $\frac{\partial \ell}{\partial y}$, if we can compute the Jacobian of each module:

$$\frac{\partial \ell}{\partial W_3} = \frac{\partial \ell}{\partial y} \frac{\partial y}{\partial W_3} = (p(c \mid x) - t)(h_2)^T$$
$$\frac{\partial \ell}{\partial h_2} = \frac{\partial \ell}{\partial y} \frac{\partial y}{\partial h_2} = (W_3)^T (p(c \mid x) - t)$$

• Need to compute the gradient w.r.t. inputs and parameters in each layer
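These output-layer gradients can be checked numerically. Note a convention detail: the forward pass earlier writes y = W3^T h2 + b3, under which dℓ/dW3 is the outer product h2(p − t)^T (the transpose of the displayed expression) and dℓ/dh2 = W3(p − t); the sketch below fixes that convention. Shapes and values are arbitrary:

```python
import numpy as np

def softmax(y):
    z = np.exp(y - y.max())
    return z / z.sum()

rng = np.random.default_rng(2)
H2, C = 4, 3
W3 = rng.normal(size=(H2, C))
h2 = np.abs(rng.normal(size=H2))      # ReLU outputs are non-negative
t = np.array([0.0, 1.0, 0.0])

def loss(W3, h2):
    return -np.sum(t * np.log(softmax(W3.T @ h2)))   # y = W3^T h2

p = softmax(W3.T @ h2)
grad_W3 = np.outer(h2, p - t)         # dl/dW3, shape (H2, C)
grad_h2 = W3 @ (p - t)                # dl/dh2, shape (H2,)

# Finite-difference check on one entry of W3.
eps = 1e-6
E = np.zeros_like(W3)
E[1, 2] = eps
fd = (loss(W3 + E, h2) - loss(W3 - E, h2)) / (2 * eps)
```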
Backpropagation
• Efficient computation of the gradients by applying the chain rule

$$\frac{\partial \ell}{\partial h_2} = \frac{\partial \ell}{\partial y} \frac{\partial y}{\partial h_2} = (W_3)^T (p(c \mid x) - t)$$

• Given $\frac{\partial \ell}{\partial h_2}$, if we can compute the Jacobian of each module:

$$\frac{\partial \ell}{\partial W_2} = \frac{\partial \ell}{\partial h_2} \frac{\partial h_2}{\partial W_2}
\qquad
\frac{\partial \ell}{\partial h_1} = \frac{\partial \ell}{\partial h_2} \frac{\partial h_2}{\partial h_1}$$
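Propagating through a ReLU hidden layer, h2 = max(0, W2^T h1 + b2), means masking the incoming gradient by the ReLU derivative (1 where the pre-activation is positive, 0 elsewhere) before applying the layer's Jacobians. A sketch with arbitrary shapes and values, plus a finite-difference spot check:

```python
import numpy as np

rng = np.random.default_rng(3)
H1, H2 = 5, 4
W2 = rng.normal(size=(H1, H2))
b2 = rng.normal(size=H2)
h1 = np.abs(rng.normal(size=H1))
grad_h2 = rng.normal(size=H2)             # pretend dl/dh2 arrived from above

a2 = W2.T @ h1 + b2                       # pre-activation
delta = grad_h2 * (a2 > 0)                # chain rule through the ReLU
grad_W2 = np.outer(h1, delta)             # dl/dW2 = dl/dh2 * dh2/dW2
grad_h1 = W2 @ delta                      # dl/dh1 = dl/dh2 * dh2/dh1

# Finite-difference check on dl/dh1[0], using l = grad_h2 . h2 as a
# stand-in loss whose gradient w.r.t. h2 is exactly grad_h2.
def l(h1v):
    return float(grad_h2 @ np.maximum(0.0, W2.T @ h1v + b2))

eps = 1e-6
e0 = np.zeros(H1)
e0[0] = eps
fd0 = (l(h1 + e0) - l(h1 - e0)) / (2 * eps)
```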
Toy Code (Matlab): Neural Net Trainer
(The slide's Matlab listing, which has a few bugs with indices, is not reproduced here.)
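As a stand-in, here is a hypothetical Python re-sketch of such a toy trainer, not the slide's original code: one ReLU hidden layer, softmax cross-entropy output, plain full-batch gradient descent. The data, layer sizes, learning rate, and step count are all arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(Y):
    Z = np.exp(Y - Y.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

# Toy data (illustrative): 2-D inputs, 2 classes, not linearly separable,
# so the hidden layer actually has work to do.
N, D, H, C = 100, 2, 8, 2
X = rng.normal(size=(N, D))
labels = (X[:, 0] * X[:, 1] > 0).astype(int)
T = np.eye(C)[labels]                       # 1-of-K targets

W1 = 0.1 * rng.normal(size=(D, H)); b1 = np.zeros(H)
W2 = 0.1 * rng.normal(size=(H, C)); b2 = np.zeros(C)
lr = 0.1

def forward(X):
    A1 = X @ W1 + b1
    H1 = np.maximum(0.0, A1)                # ReLU hidden layer
    return A1, H1, softmax(H1 @ W2 + b2)

def mean_loss(P):
    return -np.mean(np.sum(T * np.log(P + 1e-12), axis=1))

initial_loss = mean_loss(forward(X)[2])
for step in range(1000):
    A1, H1, P = forward(X)
    dY = (P - T) / N                        # gradient of the mean cross-entropy
    dW2 = H1.T @ dY;  db2 = dY.sum(axis=0)
    dA1 = (dY @ W2.T) * (A1 > 0)            # backprop through the ReLU
    dW1 = X.T @ dA1;  db1 = dA1.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1          # gradient descent step
    W2 -= lr * dW2; b2 -= lr * db2
final_loss = mean_loss(forward(X)[2])
```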
Overfitting
• The training data contains information about the regularities in the mapping from input to output. But it also contains noise:
  • The target values may be unreliable.
  • There is sampling error: there will be accidental regularities just because of the particular training cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
  • So it fits both kinds of regularity.
  • If the model is very flexible it can model the sampling error really well. This is a disaster.
Preventing Overfitting
• Use a model that has the right capacity:
  • enough to model the true regularities
  • not enough to also model the spurious regularities (assuming they are weaker)
• Standard ways to limit the capacity of a neural net:
  • Limit the number of hidden units.
  • Limit the norm of the weights.
  • Stop the learning before it has time to overfit.
Limiting the size of the Weights
• Weight-decay involves adding an extra term to the cost function that penalizes the squared weights:

$$C = \ell + \frac{\lambda}{2} \sum_i w_i^2$$

• Keeps weights small unless they have big error derivatives:

$$\frac{\partial C}{\partial w_i} = \frac{\partial \ell}{\partial w_i} + \lambda w_i$$

When $\frac{\partial C}{\partial w_i} = 0$, $w_i = -\frac{1}{\lambda} \frac{\partial \ell}{\partial w_i}$.
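The weight-decay gradient and its stationary condition can be sketched in a few lines (the weight and gradient values are placeholders for illustration):

```python
import numpy as np

lam = 0.1
w = np.array([0.5, -2.0, 0.01])           # illustrative weights
grad_l = np.array([0.2, -0.3, 0.0])       # pretend dl/dw from backprop

# dC/dw_i = dl/dw_i + lambda * w_i : every weight is pulled toward zero
# with a force proportional to its size.
grad_C = grad_l + lam * w

# Setting dC/dw_i = 0 gives w_i = -(1/lambda) * dl/dw_i, so only weights
# with big error derivatives stay large.
w_stationary = -grad_l / lam
```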
The Effect of Weight-decay
• It prevents the network from using weights that it does not need:
  • This can often improve generalization a lot.
  • It helps to stop it from fitting the sampling error.
  • It makes a smoother model in which the output changes more slowly as the input changes.
• But, if the network has two very similar inputs, it prefers to put half the weight on each rather than all the weight on one → other form of weight decay?
Deciding How Much to Restrict the Capacity
• How do we decide which regularizer to use and how strong to make it?
• Use a separate validation set to do model selection.
![Page 154: Statistical Methods for Data Miningkuangnanfang.com/zb_users/upload/2019/05/201905201558318177337693.pdf · basis functions), then optimization still involves linear combinations](https://reader031.fdocuments.net/reader031/viewer/2022022809/5e521a2ab95aa2777c1710c4/html5/thumbnails/154.jpg)
Using a Validation Set
• Divide the total dataset into three subsets:
• Training data is used for learning the parameters of the model.
• Validation data is not used for learning but is used for deciding what type of model and what amount of regularization works best.
• Test data is used to get a final, unbiased estimate of how well the network works. We expect this estimate to be worse than on the validation data.
• We could then re-divide the total dataset to get another unbiased estimate of the true error rate.
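The three-way split above can be sketched like this (the sizes and names are illustrative assumptions, not from the slides):

```python
import numpy as np

# Shuffle indices, then carve out three disjoint subsets (illustrative 60/20/20 split).
rng = np.random.default_rng(0)
n = 1000
idx = rng.permutation(n)

train_idx = idx[:600]      # learn the parameters of the model
val_idx   = idx[600:800]   # choose model type / regularization strength
test_idx  = idx[800:]      # final, unbiased error estimate (touched only once)

# The subsets must not overlap, or the test estimate is no longer unbiased.
assert set(train_idx).isdisjoint(val_idx)
assert set(train_idx).isdisjoint(test_idx)
assert set(val_idx).isdisjoint(test_idx)
```

Re-dividing, as the last bullet suggests, would mean drawing a fresh permutation and repeating model selection from scratch.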
![Page 159: Statistical Methods for Data Miningkuangnanfang.com/zb_users/upload/2019/05/201905201558318177337693.pdf · basis functions), then optimization still involves linear combinations](https://reader031.fdocuments.net/reader031/viewer/2022022809/5e521a2ab95aa2777c1710c4/html5/thumbnails/159.jpg)
Preventing Overfitting by Early Stopping
• If we have lots of data and a big model, it's very expensive to keep re-training it with different amounts of weight decay.
• It is much cheaper to start with very small weights and let them grow until the performance on the validation set starts getting worse.
• The capacity of the model is limited because the weights have not had time to grow big.
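The stopping rule above can be sketched as follows (a minimal illustration with a made-up validation-loss curve; the `patience` parameter is one common, assumed way of operationalizing "starts getting worse"):

```python
def early_stop(val_losses, patience=3):
    """Return the epoch with the best validation loss, stopping the scan
    once the loss has failed to improve for `patience` consecutive epochs."""
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # validation performance is getting worse: stop training
    return best_epoch  # in practice, restore the weights saved at this epoch

# Validation loss improves, then rises as the weights grow and overfitting sets in:
assert early_stop([1.0, 0.8, 0.7, 0.75, 0.9, 1.1, 1.2]) == 2
```

One pass of this loop replaces many full re-trainings with different weight-decay strengths, which is the cost argument in the first bullet.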
![Page 162: Statistical Methods for Data Miningkuangnanfang.com/zb_users/upload/2019/05/201905201558318177337693.pdf · basis functions), then optimization still involves linear combinations](https://reader031.fdocuments.net/reader031/viewer/2022022809/5e521a2ab95aa2777c1710c4/html5/thumbnails/162.jpg)
Why Early Stopping Works
• When the weights are very small, every hidden unit is in its linear range.
• So a net with a large layer of hidden units is linear.
• It has no more capacity than a linear net in which the inputs are directly connected to the outputs!
• As the weights grow, the hidden units start using their non-linear ranges, so the capacity grows.
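The linear-range argument can be checked numerically (a small sketch assuming tanh hidden units; the layer sizes and weight scale are made up):

```python
import numpy as np

# Tiny random weights keep every pre-activation near 0, where tanh(z) ≈ z.
rng = np.random.default_rng(0)
W1 = 0.001 * rng.normal(size=(10, 5))   # input -> hidden
W2 = 0.001 * rng.normal(size=(1, 10))   # hidden -> output
x = rng.normal(size=5)

net_out    = W2 @ np.tanh(W1 @ x)  # the actual network
linear_out = W2 @ (W1 @ x)         # inputs wired directly (one linear map W2 @ W1)

# With small weights the two agree: the net has only linear capacity.
assert np.allclose(net_out, linear_out, atol=1e-6)
```

Scaling the weights up (say by 1000×) pushes the pre-activations into tanh's saturating region, the approximation breaks, and the extra non-linear capacity appears.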
![Page 163: Statistical Methods for Data Miningkuangnanfang.com/zb_users/upload/2019/05/201905201558318177337693.pdf · basis functions), then optimization still involves linear combinations](https://reader031.fdocuments.net/reader031/viewer/2022022809/5e521a2ab95aa2777c1710c4/html5/thumbnails/163.jpg)
Thank you