Basic Math for Neural Networks
Xiaodong Cui
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
Spring, 2017
Deep Learning for Computer Vision, Speech, and Language
Outline
• Feedforward neural networks
• Forward propagation
• Neural networks as universal approximators
• Back propagation
• Jacobian
• Vanishing gradient problem
Feedforward Neural Networks
[Figure: a feedforward network with an input layer x, hidden layers with activations a and outputs z = h(a), bias units +1, and an output layer]
a^{(l)}_i = \sum_j w^{(l)}_{ij} z^{(l-1)}_j + b^{(l)}_i

z^{(l)}_i = h^{(l)}_i\big(a^{(l)}_i\big)
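As a concrete illustration of one layer of forward propagation, here is a minimal plain-Lua sketch; the weight matrix, biases, and the choice of a sigmoid nonlinearity are illustrative assumptions of mine, not values from the slides:

    -- one layer of forward propagation: a_i = sum_j W[i][j] * z_prev[j] + b[i], z_i = h(a_i)
    local function sigmoid(a) return 1 / (1 + math.exp(-a)) end

    local function layer_forward(W, b, z_prev)
      local z = {}
      for i = 1, #W do
        local a = b[i]                      -- start from the bias
        for j = 1, #z_prev do
          a = a + W[i][j] * z_prev[j]       -- weighted sum of previous-layer outputs
        end
        z[i] = sigmoid(a)                   -- elementwise nonlinearity
      end
      return z
    end

    -- toy example: 2 inputs feeding 2 hidden units
    local z = layer_forward({{0.5, -0.3}, {0.8, 0.1}}, {0.0, 0.1}, {1.0, 2.0})
    print(z[1], z[2])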
Activation Functions – Sigmoid
Also known as the logistic function.
h(a) = \frac{1}{1 + e^{-a}}

h'(a) = \frac{e^{-a}}{(1 + e^{-a})^2} = h(a)\,\big[1 - h(a)\big]
[Plot: the sigmoid f = h(a) and its derivative df/dx on [-10, 10]]
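A quick numerical sanity check of the identity h'(a) = h(a)[1 - h(a)], using a symmetric finite difference; the test point and step size are arbitrary choices of mine:

    -- verify the sigmoid derivative identity at one sample point
    local function sigmoid(a) return 1 / (1 + math.exp(-a)) end
    local a, eps = 0.7, 1e-5
    local analytic = sigmoid(a) * (1 - sigmoid(a))
    local numeric  = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
    print(analytic, numeric)   -- the two values should agree to several decimal places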
Activation Functions – Hyperbolic Tangent (tanh)
h(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}

h'(a) = \frac{4}{(e^{a} + e^{-a})^2} = 1 - \big[h(a)\big]^2
[Plot: tanh f = h(a) and its derivative df/dx on [-5, 5]]
Activation Functions – Rectified Linear Unit (ReLU)
h(a) = \max(0, a) =
\begin{cases}
a & \text{if } a > 0, \\
0 & \text{if } a \le 0.
\end{cases}

h'(a) =
\begin{cases}
1 & \text{if } a > 0, \\
0 & \text{if } a \le 0.
\end{cases}
[Plot: ReLU f = h(a) and its derivative df/dx on [-3, 3]]
Activation Functions – Softmax
Also known as the normalized exponential function, a generalization of the logistic function to multiple classes.
h(a_k) = \frac{e^{a_k}}{\sum_j e^{a_j}}

\frac{\partial h(a_k)}{\partial a_j} = \frac{\partial h(a_k)}{\partial \log h(a_k)} \, \frac{\partial \log h(a_k)}{\partial a_j} = h(a_k)\,\big(\delta_{kj} - h(a_j)\big)
Why "softmax"?

\max\{x_1, \cdots, x_n\} \le \log\big(e^{x_1} + \cdots + e^{x_n}\big) \le \max\{x_1, \cdots, x_n\} + \log n
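The softmax and the log-sum-exp bound above can be checked with a small plain-Lua sketch (my own illustration; subtracting the maximum before exponentiating is a common numerical-stability trick, not something stated on the slide):

    -- softmax: h(a_k) = exp(a_k) / sum_j exp(a_j), plus the log-sum-exp value
    local function softmax(a)
      local m = -math.huge
      for _, v in ipairs(a) do m = math.max(m, v) end   -- subtract the max for stability
      local p, s = {}, 0
      for k, v in ipairs(a) do p[k] = math.exp(v - m); s = s + p[k] end
      for k = 1, #p do p[k] = p[k] / s end
      return p, m + math.log(s)                         -- probabilities and log-sum-exp
    end

    local a = {1.0, 2.0, 3.0}
    local p, lse = softmax(a)
    print(p[1] + p[2] + p[3])                           -- sums to 1
    -- check: max(a) <= log(sum_j exp(a_j)) <= max(a) + log(n)
    print(3.0 <= lse, lse <= 3.0 + math.log(#a))        -- true  true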
Loss Functions
• Regression
• Classification
Regression: Least Square Error
Let \hat{y}_{nk} and y_{nk} be the network output and the target, respectively. For regression, the typical loss function is the least square error (LSE):

L(\hat{y}, y) = \frac{1}{2} \sum_n \|\hat{y}_n - y_n\|^2 = \frac{1}{2} \sum_n \sum_k (\hat{y}_{nk} - y_{nk})^2

where n is the sample index and k is the dimension index. The derivative of the LSE with respect to each dimension of each sample is

\frac{\partial L(\hat{y}, y)}{\partial \hat{y}_{nk}} = \hat{y}_{nk} - y_{nk}
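A tiny numerical illustration of the LSE loss and its gradient for one sample (toy numbers of my own choosing):

    -- L = 1/2 * sum_k (yhat_k - y_k)^2 and dL/dyhat_k = yhat_k - y_k
    local yhat, y = {0.8, 0.3}, {1.0, 0.0}
    local L, grad = 0, {}
    for k = 1, #y do
      local d = yhat[k] - y[k]
      L = L + 0.5 * d * d
      grad[k] = d
    end
    print(L, grad[1], grad[2])   -- 0.065  -0.2  0.3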
Classification: Cross-Entropy (CE)
Given two (discrete) probability distributions p and q, the "distance" between them can be measured by the Kullback-Leibler divergence:

D_{KL}(p \| q) = \sum_k p_k \log \frac{p_k}{q_k}

For classification, the targets y_n are given as

y_n = \{y_{n1}, \cdots, y_{nk}, \cdots, y_{nK}\}

The predictions after the softmax layer are posterior probabilities (after normalization):

\hat{y}_n = \{\hat{y}_{n1}, \cdots, \hat{y}_{nk}, \cdots, \hat{y}_{nK}\}

Now measure the "distance" between these two distributions over all samples:

\sum_n D_{KL}(y_n \| \hat{y}_n) = \sum_n \sum_k y_{nk} \log \frac{y_{nk}}{\hat{y}_{nk}}
= \sum_n \sum_k y_{nk} \log y_{nk} - \sum_n \sum_k y_{nk} \log \hat{y}_{nk}
= -\sum_n \sum_k y_{nk} \log \hat{y}_{nk} + C

where C = \sum_n \sum_k y_{nk} \log y_{nk} does not depend on the network outputs.
Classification: Cross-Entropy (CE)
Cross-entropy as the loss function:

L(\hat{y}, y) = -\sum_n \sum_k y_{nk} \log \hat{y}_{nk}

Typically, targets are given in the form of 1-of-K coding:

y_n = \{0, \cdots, 1, \cdots, 0\}

where

y_{nk} =
\begin{cases}
1 & \text{if } k = k_n, \\
0 & \text{otherwise}
\end{cases}

and k_n is the index of the correct class of sample n.
It follows that the cross-entropy has an even simpler form:
L(\hat{y}, y) = -\sum_n \log \hat{y}_{n k_n}
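With 1-of-K targets the loss for one sample is simply the negative log-probability assigned to the correct class; a short illustration with toy numbers of my own choosing:

    -- cross-entropy with a 1-of-K target: L = -log(yhat[k_n])
    local yhat = {0.1, 0.7, 0.2}    -- softmax outputs for one sample
    local k_n  = 2                  -- index of the correct class
    print(-math.log(yhat[k_n]))     -- about 0.357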
Classification: Cross-Entropy (CE)
The derivative of the composite function of cross-entropy and softmax:

L(\hat{y}, y) = \sum_n E_n = \sum_n \Big( -\sum_i y_{ni} \log \hat{y}_{ni} \Big), \qquad \hat{y}_{ni} = \frac{e^{a_{ni}}}{\sum_j e^{a_{nj}}}

\frac{\partial E_n}{\partial a_{nk}} = \frac{\partial}{\partial a_{nk}} \Big( -\sum_i y_{ni} \log \hat{y}_{ni} \Big) = -\sum_i y_{ni} \frac{\partial \log \hat{y}_{ni}}{\partial a_{nk}}

= -\sum_i y_{ni} \frac{\partial \log \hat{y}_{ni}}{\partial \hat{y}_{ni}} \frac{\partial \hat{y}_{ni}}{\partial a_{nk}} = -\sum_i \frac{y_{ni}}{\hat{y}_{ni}} \frac{\partial \hat{y}_{ni}}{\partial \log \hat{y}_{ni}} \frac{\partial \log \hat{y}_{ni}}{\partial a_{nk}}

= -\sum_i \frac{y_{ni}}{\hat{y}_{ni}} \cdot \hat{y}_{ni} \cdot (\delta_{ik} - \hat{y}_{nk}) = -\sum_i y_{ni} (\delta_{ik} - \hat{y}_{nk})

= -\sum_i y_{ni} \delta_{ik} + \sum_i y_{ni} \hat{y}_{nk} = -y_{nk} + \hat{y}_{nk} \Big( \sum_i y_{ni} \Big) = \hat{y}_{nk} - y_{nk}

since \sum_i y_{ni} = 1.
What's the derivative for an arbitrary differentiable function? – left as an exercise.
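The result above can be verified numerically; a plain-Lua sketch (toy activations and target of my own choosing) compares yhat_k - y_k against a symmetric finite difference of the composite cross-entropy/softmax loss:

    -- E(a) = -sum_i y_i * log(softmax(a)_i); check that dE/da_k = softmax(a)_k - y_k
    local function softmax(a)
      local s, p = 0, {}
      for _, v in ipairs(a) do s = s + math.exp(v) end
      for k, v in ipairs(a) do p[k] = math.exp(v) / s end
      return p
    end
    local function loss(a, y)
      local p, E = softmax(a), 0
      for i = 1, #y do E = E - y[i] * math.log(p[i]) end
      return E
    end

    local a, y, eps, k = {0.5, -1.0, 2.0}, {0, 0, 1}, 1e-5, 1
    local analytic = softmax(a)[k] - y[k]
    local ap = {a[1] + eps, a[2], a[3]}   -- perturb a[k] with k = 1
    local am = {a[1] - eps, a[2], a[3]}
    local numeric = (loss(ap, y) - loss(am, y)) / (2 * eps)
    print(analytic, numeric)   -- the two values should agree closely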
A DNN Example In Torch
Algorithm 1: Definition of a DNN with 3 hidden layers

    require 'nn'
    require 'cunn'    -- needed for the :cuda() calls

    function create_model()
        -- MODEL
        local n_inputs  = 460
        local n_outputs = 3000
        local n_hidden  = 1024
        local model = nn.Sequential()
        model:add(nn.Linear(n_inputs, n_hidden))
        model:add(nn.Sigmoid())
        model:add(nn.Linear(n_hidden, n_hidden))
        model:add(nn.Sigmoid())
        model:add(nn.Linear(n_hidden, n_hidden))
        model:add(nn.Sigmoid())
        model:add(nn.Linear(n_hidden, n_outputs))
        model:add(nn.LogSoftMax())

        -- LOSS FUNCTION
        local criterion = nn.ClassNLLCriterion()

        return model:cuda(), criterion:cuda()
    end
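A minimal single-step usage sketch for the model above (my own illustration, not from the slides): it assumes a CPU variant of create_model, i.e. one that returns model and criterion without the :cuda() calls, and uses a random input and an arbitrary target class:

    -- one-sample forward/backward/update with the Torch7 'nn' API
    local model, criterion = create_model()   -- assumed CPU variant (no :cuda())
    local x      = torch.randn(460)           -- a random 460-dim input frame (illustrative)
    local target = 7                           -- class index in 1..3000 (illustrative)

    local logprobs = model:forward(x)          -- log-posteriors from LogSoftMax
    local loss     = criterion:forward(logprobs, target)

    model:zeroGradParameters()
    local gradOut = criterion:backward(logprobs, target)
    model:backward(x, gradOut)                 -- back-propagate the errors
    model:updateParameters(0.01)               -- plain SGD step, learning rate 0.01

    print(loss)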
Neural Networks As Universal Approximators
Universal Approximation Theorem [1][2][3]: Let \phi(\cdot) be a nonconstant, bounded, and monotonically-increasing continuous function. Let I_n represent the n-dimensional unit cube [0, 1]^n and C(I_n) be the space of continuous functions on I_n. Then, given any function f \in C(I_n) and \epsilon > 0, there exist an integer N and real constants a_i, b_i \in \mathbb{R} and w_i \in \mathbb{R}^n, where i = 1, 2, \cdots, N, such that one may define

F(x) = \sum_{i=1}^{N} a_i \phi(w_i^T x + b_i)

as an approximate realization of the function f, where f is independent of \phi, such that

|F(x) - f(x)| < \epsilon

for all x \in I_n.
[1] Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, 2, 303-314.
[2] Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366.
[3] Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks," Neural Networks, 4(2), 251-257.
Neural Networks As Universal Classifiers
Extend f(x) to decision functions of the form [1]:

f(x) = j, \text{ iff } x \in P_j, \quad j = 1, 2, \cdots, K

where the P_j partition A_n into K disjoint measurable subsets and A_n is a compact subset of \mathbb{R}^n.

Arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity!
[1] Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, 2, 303-314.
[2] Huang, W. Y. and Lippmann, R. P. (1988). "Neural Nets and Traditional Classifiers," in Neural Information Processing Systems (Denver 1987), D. Z. Anderson, Editor. American Institute of Physics, New York, 387-396.
Back Propagation
[Figure: units i, j, k in successive layers, connected by weights w_ij and w_ki, feeding into the loss L]

How do we compute the gradient of the loss function with respect to a weight w_ij?

\frac{\partial L}{\partial w_{ij}}
D. E. Rumelhart, G. E. Hinton and R. J. Williams (1986). "Learning representations by back-propagating errors," Nature, vol. 323, 533-536.
Back Propagation
a^{(l)}_i = \sum_j w^{(l)}_{ij} z^{(l-1)}_j, \qquad z^{(l)}_i = h^{(l)}_i\big(a^{(l)}_i\big)

\frac{\partial L}{\partial w^{(l)}_{ij}} = \frac{\partial L}{\partial a^{(l)}_i} \frac{\partial a^{(l)}_i}{\partial w^{(l)}_{ij}}

It follows that

\frac{\partial a^{(l)}_i}{\partial w^{(l)}_{ij}} = z^{(l-1)}_j \qquad \text{and we define} \qquad \delta^{(l)}_i \triangleq \frac{\partial L}{\partial a^{(l)}_i}

The \delta_i are often referred to as "errors". By the chain rule,

\delta^{(l)}_i = \frac{\partial L}{\partial a^{(l)}_i} = \sum_k \frac{\partial L}{\partial a^{(l+1)}_k} \frac{\partial a^{(l+1)}_k}{\partial a^{(l)}_i} = \sum_k \delta^{(l+1)}_k \frac{\partial a^{(l+1)}_k}{\partial a^{(l)}_i}
Back Propagation
a^{(l+1)}_k = \sum_i w^{(l+1)}_{ki} h\big(a^{(l)}_i\big)

therefore

\frac{\partial a^{(l+1)}_k}{\partial a^{(l)}_i} = w^{(l+1)}_{ki} h'\big(a^{(l)}_i\big)

It follows that

\delta^{(l)}_i = h'\big(a^{(l)}_i\big) \sum_k w^{(l+1)}_{ki} \delta^{(l+1)}_k
which tells us that the errors can be evaluated recursively.
Back Propagation: An Example
• A neural network with multiple layers
• Softmax output layer
• Cross-entropy loss function
• Forward propagation:
  - push the input x_n through the network to get the activations of the hidden layers:

    a^{(l)}_i = \sum_j w^{(l)}_{ij} z^{(l-1)}_j, \qquad z^{(l)}_i = h^{(l)}_i\big(a^{(l)}_i\big)

• Back propagation:
  - start from the output layer:

    \delta^{(L)}_k = \hat{y}_{nk} - y_{nk}

  - back-propagate the errors from the output layer all the way down to the input layer:

    \delta^{(l)}_i = h'\big(a^{(l)}_i\big) \sum_k w^{(l+1)}_{ki} \delta^{(l+1)}_k

  - evaluate the gradients:

    \frac{\partial L}{\partial w^{(l)}_{ij}} = \delta^{(l)}_i z^{(l-1)}_j
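Putting the three steps together, here is a compact plain-Lua sketch for a toy network with one sigmoid hidden layer and a softmax/cross-entropy output; the weights, input, and target are illustrative values of my own choosing, and biases are omitted for brevity:

    -- toy network: 2 inputs -> 2 sigmoid hidden units -> 2 softmax outputs
    local function sigmoid(a) return 1 / (1 + math.exp(-a)) end

    local W1 = {{0.1, 0.4}, {-0.3, 0.2}}   -- hidden-layer weights w(1)_ij
    local W2 = {{0.5, -0.2}, {0.1, 0.3}}   -- output-layer weights w(2)_ki
    local x, y = {1.0, 0.5}, {0, 1}        -- input and 1-of-K target

    -- forward propagation
    local a1, z1, a2 = {}, {}, {}
    for i = 1, 2 do
      a1[i] = W1[i][1] * x[1] + W1[i][2] * x[2]
      z1[i] = sigmoid(a1[i])
    end
    for k = 1, 2 do
      a2[k] = W2[k][1] * z1[1] + W2[k][2] * z1[2]
    end
    local s = math.exp(a2[1]) + math.exp(a2[2])
    local yhat = {math.exp(a2[1]) / s, math.exp(a2[2]) / s}

    -- back propagation
    local d2 = {}   -- output errors: delta(2)_k = yhat_k - y_k
    for k = 1, 2 do d2[k] = yhat[k] - y[k] end
    local d1 = {}   -- hidden errors: delta(1)_i = h'(a1_i) * sum_k w(2)_ki * delta(2)_k
    for i = 1, 2 do
      d1[i] = z1[i] * (1 - z1[i]) * (W2[1][i] * d2[1] + W2[2][i] * d2[2])
    end

    -- gradients: dL/dw(2)_ki = delta(2)_k * z1_i and dL/dw(1)_ij = delta(1)_i * x_j
    print(d2[1] * z1[1], d1[1] * x[1])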
Gradient Checking
In practice, when you implement the gradient of a network or a module, one way to check the correctness of your implementation is the following gradient check using a (symmetric) finite difference:

\frac{\partial L}{\partial w_{ij}} \approx \frac{L(w_{ij} + \epsilon) - L(w_{ij} - \epsilon)}{2\epsilon}

Then check the "closeness" of the two gradients (your own implementation and the one from the above finite difference):

\frac{\| g_1 - g_2 \|}{\| g_1 + g_2 \|}
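A generic version of this check in plain Lua (the quadratic test function, its hand-written gradient, and the ε value are illustrative assumptions of mine):

    -- relative error between an analytic gradient and the symmetric finite difference
    local function gradcheck(f, grad, w, eps)
      eps = eps or 1e-5
      local ana, diff, sum = grad(w), 0, 0
      for i = 1, #w do
        local orig = w[i]
        w[i] = orig + eps; local fp = f(w)
        w[i] = orig - eps; local fm = f(w)
        w[i] = orig
        local num = (fp - fm) / (2 * eps)
        diff = diff + (ana[i] - num)^2
        sum  = sum  + (ana[i] + num)^2
      end
      return math.sqrt(diff) / math.sqrt(sum)   -- ||g1 - g2|| / ||g1 + g2||
    end

    -- example: f(w) = w1^2 + 3*w1*w2 with its hand-written gradient
    local f    = function(w) return w[1]^2 + 3 * w[1] * w[2] end
    local grad = function(w) return {2 * w[1] + 3 * w[2], 3 * w[1]} end
    print(gradcheck(f, grad, {0.7, -1.2}))      -- should be very close to 0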
Jacobian Matrix

[Figure: a (sub-)network mapping inputs x_i to outputs y_k]
• measures the sensitivity of the outputs with respect to the inputs of a (sub-)network
• can be used as a modular operation for error backprop in a larger network
Jacobian Matrix
The Jacobian matrix can be computed using a similar back-propagation procedure.

\frac{\partial y_k}{\partial x_i} = \sum_l \frac{\partial y_k}{\partial a_l} \frac{\partial a_l}{\partial x_i}

where the a_l are the activations that have immediate connections with the inputs x_i. Hence

\frac{\partial y_k}{\partial x_i} = \sum_l w_{li} \frac{\partial y_k}{\partial a_l}

Analogous to the error propagation, define

\delta_{kl} \triangleq \frac{\partial y_k}{\partial a_l}

which can be computed recursively:

\delta_{kl} = \frac{\partial y_k}{\partial a_l} = \sum_j \frac{\partial y_k}{\partial a_j} \frac{\partial a_j}{\partial a_l} = \sum_j \frac{\partial y_k}{\partial a_j} \big[ w_{jl} h'(a_l) \big] = \sum_j \delta_{kj} \big[ w_{jl} h'(a_l) \big]

where j sweeps over all units connected to unit l through weights w_{jl}.
What’s the Jacobian matrix of softmax outputs to the inputs? – leave it as an exercise
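Independently of that exercise, the Jacobian of any small (sub-)network can be sanity-checked column by column with symmetric finite differences; the tiny tanh network below is an illustrative assumption of mine:

    -- finite-difference Jacobian J[k][i] = dy_k / dx_i of a function y = net(x)
    local function jacobian(net, x, eps)
      eps = eps or 1e-5
      local J = {}
      for i = 1, #x do
        local orig = x[i]
        x[i] = orig + eps; local yp = net(x)
        x[i] = orig - eps; local ym = net(x)
        x[i] = orig
        for k = 1, #yp do
          J[k] = J[k] or {}
          J[k][i] = (yp[k] - ym[k]) / (2 * eps)
        end
      end
      return J
    end

    -- toy (sub-)network: one tanh hidden unit feeding two linear outputs
    local function tanh(a) return (math.exp(a) - math.exp(-a)) / (math.exp(a) + math.exp(-a)) end
    local function net(x)
      local h = tanh(0.3 * x[1] - 0.7 * x[2])
      return {2.0 * h, -1.0 * h}
    end
    local J = jacobian(net, {0.5, 1.0})
    print(J[1][1], J[1][2], J[2][1], J[2][2])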
Vanishing Gradients
function    nonlinearity                           gradient
sigmoid     x = 1 / (1 + e^{-a})                   ∂L/∂a = x (1 - x) ∂L/∂x
tanh        x = (e^a - e^{-a}) / (e^a + e^{-a})    ∂L/∂a = (1 - x^2) ∂L/∂x
softsign    x = a / (1 + |a|)                      ∂L/∂a = (1 - |x|)^2 ∂L/∂x
What's the problem?

|x| \le 1 \;\Rightarrow\; \left| \frac{\partial L}{\partial a} \right| \le \left| \frac{\partial L}{\partial x} \right|
When many such layers are stacked, these factors multiply, so the back-propagated gradient shrinks layer by layer: the saturating nonlinearity causes vanishing gradients.
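A quick numerical illustration of the effect (a toy setting of my own: unit weights and pre-activation a = 0 at every layer, where the sigmoid derivative reaches its maximum of 0.25):

    -- gradient magnitude after passing backward through n sigmoid layers
    local function sigmoid(a) return 1 / (1 + math.exp(-a)) end
    local grad = 1.0
    for layer = 1, 20 do
      local x = sigmoid(0.0)            -- x = 0.5, so h'(a) = x * (1 - x) = 0.25
      grad = grad * x * (1 - x)
      if layer % 5 == 0 then print(layer, grad) end
    end
    -- after 20 layers the factor is 0.25^20, roughly 9e-13: the gradient has vanished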