Agenda
Review of Neural Nets and Backpropagation
Backpropagation: The Math
Advantages and Disadvantages of Gradient Descent and other algorithms
Enhancements of Gradient Descent
Other ways of minimizing error
Review
Approach that developed from an analysis of the human brain
Nodes created as an analog to neurons
Mainly used for classification problems (e.g. character recognition, voice recognition, medical applications, etc.)
Review
Neurons have weighted inputs, threshold values, an activation function, and an output
[Figure: a single neuron combining its weighted inputs into an output]
Output = f(Σ inputᵢ × weightᵢ)
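As a concrete illustration (not from the original slides), here is a minimal sketch of such a neuron in Python, assuming a simple step activation driven by the threshold; the function and variable names are ours:

    def neuron_output(inputs, weights, threshold):
        """Weighted-sum neuron with a step activation function."""
        activation = sum(x * w for x, w in zip(inputs, weights))
        return 1 if activation > threshold else 0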
Review: 4-Input AND
[Figure: a network of three threshold units, each with Threshold = 1.5, combining four inputs into one output]
All weights = 1, and every unit outputs 1 if active, 0 otherwise
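A hedged sketch of one way such a 4-input AND could be wired from three threshold-1.5 units, reusing the neuron_output helper above (the pairing of inputs is our assumption, not stated on the slide):

    def and4(x1, x2, x3, x4):
        """4-input AND from three threshold units (all weights = 1, threshold = 1.5)."""
        h1 = neuron_output([x1, x2], [1, 1], 1.5)    # fires only when both x1 and x2 are 1
        h2 = neuron_output([x3, x4], [1, 1], 1.5)    # fires only when both x3 and x4 are 1
        return neuron_output([h1, h2], [1, 1], 1.5)  # fires only when both sub-units fire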
Review: Output space for the XOR gate demonstrates the need for a hidden layer
[Figure: the four XOR input points (0,0), (0,1), (1,0), (1,1) plotted on Input 1 vs. Input 2 axes; no single line separates the two output classes]
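One common construction (our illustration, not from the slides) adds a hidden layer of two threshold units, an OR and an AND, whose outputs are combined so the network fires only when exactly one input is active:

    def xor(x1, x2):
        """XOR from a two-unit hidden layer of threshold neurons (illustrative weights)."""
        h_or  = neuron_output([x1, x2], [1, 1], 0.5)   # fires when at least one input is 1
        h_and = neuron_output([x1, x2], [1, 1], 1.5)   # fires when both inputs are 1
        # Output fires when OR is active but AND is not, i.e. exactly one input is 1.
        return neuron_output([h_or, h_and], [1, -1], 0.5)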
Backpropagation: The Math
General multi-layered neural network
[Figure: input layer (nodes 0–9), hidden layer (nodes 0, 1, …, i), and output layer (nodes 0, 1); weights X connect the input layer to the hidden layer, and weights W connect the hidden layer to the output layer]
Backpropagation: The Math
Backpropagation
Gradient Descent objective function
Gradient Descent termination condition
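The equations themselves were lost in extraction; assuming the usual sum-of-squared-errors setup, the forms these labels most likely refer to are (a reconstruction, not taken from the slides):

    % Objective: squared error summed over output units k (targets t_k, outputs o_k)
    E(\mathbf{w}) = \tfrac{1}{2} \sum_k (t_k - o_k)^2
    % Termination: stop once the error falls below a chosen tolerance \varepsilon
    E(\mathbf{w}) < \varepsilon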
Backpropagation: The Math
Backpropagation
Output layer weight recalculation
Learning rate (e.g. 0.25)
Error at output node k
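The update rule itself did not survive extraction; for sigmoid output units, the standard rule these labels appear to annotate would be (a reconstruction under that assumption):

    % Weight from hidden node j to output node k, learning rate \eta (e.g. 0.25)
    \Delta w_{j,k} = \eta \, \delta_k \, x_{j,k}
    % \delta_k is the error at output node k, for target t_k and output o_k:
    \delta_k = o_k (1 - o_k)(t_k - o_k)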
Backpropagation Using Gradient Descent
Advantages
Relatively simple implementation
Standard method and generally works well
Disadvantages
Slow and inefficient
Can get stuck in local minima, resulting in sub-optimal solutions
Alternatives To Gradient Descent
Simulated Annealing
Advantages
Can guarantee optimal solution (global minimum)
Disadvantages
May be slower than gradient descent
Much more complicated implementation
Alternatives To Gradient Descent
Genetic Algorithms / Evolutionary Strategies
Advantages
Faster than simulated annealing
Less likely to get stuck in local minima
Disadvantages
Slower than gradient descent
Memory intensive for large nets
Alternatives To Gradient Descent
Simplex Algorithm
Advantages
Similar to gradient descent but faster
Easy to implement
Disadvantages
Does not guarantee a global minimum
Enhancements To Gradient Descent
Momentum
Adds a percentage of the last movement to the current movement
Enhancements To Gradient Descent
Momentum
Useful to get over small bumps in the error function
Often finds a minimum in fewer steps
Δw(t) = −η·δ·y + α·Δw(t−1)
Δw is the change in weight, η is the learning rate, δ is the error, y differs depending on which layer we are calculating, and α is the momentum parameter
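A minimal sketch of this update in Python, assuming δ·y has already been computed as the gradient term for the weight (names and default values are ours):

    def momentum_step(grad_term, prev_delta, learning_rate=0.25, momentum=0.9):
        """One momentum update: new step = -η * gradient term + α * previous step."""
        return -learning_rate * grad_term + momentum * prev_delta

    # Usage: carry the previous step along between iterations.
    # delta = momentum_step(grad_term, delta); weight += delta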
Enhancements To Gradient Descent
Adaptive Backpropagation Algorithm
Assigns each weight its own learning rate
That learning rate is determined by the sign of the gradient of the error function from the last iteration
If the signs are equal, the slope is more likely shallow, so the learning rate is increased
The signs are more likely to differ on a steep slope, so the learning rate is decreased
This speeds up the advancement on gradual slopes (see the sketch below)
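A hedged sketch of this sign-comparison rule; the increase and decrease factors are our assumptions, since the slides do not give values:

    def adapt_learning_rate(lr, grad, prev_grad, up=1.1, down=0.5):
        """Per-weight learning-rate adaptation from the sign of successive gradients."""
        if grad * prev_grad > 0:    # same sign: likely a shallow slope, speed up
            return lr * up
        if grad * prev_grad < 0:    # sign flipped: likely a steep slope, slow down
            return lr * down
        return lr                   # zero gradient: leave the rate unchanged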
Enhancements To Gradient Descent
Adaptive Backpropagation
Possible problems:
Since we minimize the error for each weight separately, the overall error may increase
Solution: calculate the total output error after each adaptation; if it is greater than the previous error, reject that adaptation and calculate new learning rates
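A sketch of that accept/reject check; the error() helper and function names are hypothetical:

    def try_adaptation(weights, candidate_weights, error):
        """Keep an adaptation only if it does not increase the total output error."""
        if error(candidate_weights) > error(weights):
            return weights            # reject: fall back and compute new learning rates
        return candidate_weights      # accept the adapted weights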
Enhancements To Gradient Descent
SuperSAB (Super Self-Adapting Backpropagation)
Combines the momentum and adaptive methods
Uses the adaptive method and momentum so long as the sign of the gradient does not change (sketched below)
This gives an additive effect of both methods, resulting in faster traversal of gradual slopes
When the sign of the gradient does change, the momentum will cancel the drastic drop in learning rate
This allows the function to roll up the other side of the minimum, possibly escaping local minima
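A minimal sketch of one way the two rules could be combined per weight; the exact SuperSAB constants are not given on the slides, so the factors below are assumptions. The state dict would start as {'lr': 0.25, 'prev_grad': 0.0, 'prev_delta': 0.0} for each weight:

    def supersab_step(state, grad, up=1.05, down=0.5, momentum=0.9):
        """One per-weight SuperSAB-style step: adaptive learning rate plus momentum."""
        if grad * state["prev_grad"] >= 0:
            state["lr"] *= up        # sign unchanged: gradual slope, speed up
        else:
            state["lr"] *= down      # sign changed: steep slope, cut the learning rate
        # The momentum term is always applied, softening the drop when the rate is cut.
        delta = -state["lr"] * grad + momentum * state["prev_delta"]
        state["prev_grad"], state["prev_delta"] = grad, delta
        return delta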
Enhancements To Gradient Descent
SuperSAB
Experiments show that SuperSAB converges faster than gradient descent
Overall, this algorithm is less sensitive (and so is less likely to get caught in local minima)
Other Ways To Minimize Error
Varying training data
Cycle through input classes
Randomly select from input classes
Add noise to training data
Randomly change the value of an input node (with low probability)
Retrain with expected inputs after initial training
E.g. speech recognition
Other Ways To Minimize Error
Adding and removing neurons from layers
Adding neurons speeds up learning but may cause a loss in generalization
Removing neurons has the opposite effect