Speech recognition
Lecture 14: Neural Networks
Andrew Senior <[email protected]>
Google NYC
December 12, 2013
1 Introduction to Neural networks
2 Neural networks for speech recognition
  • Neural network features for speech recognition
  • Hybrid neural networks
  • History
  • Variations
3 Language modelling
The perceptron
[Figure: a perceptron with five inputs x1 to x5, weighted by w1 to w5, feeding a single output.]
A perceptron is a linear classifier:
\begin{align}
f(x) &= 1 \quad \text{if } w \cdot x > 0 \tag{1} \\
     &= 0 \quad \text{otherwise.} \tag{2}
\end{align}
Add an extra “always one” input to provide an offset or “bias”. The weights w can be learned for a given task with the Perceptron Algorithm.
Perceptron algorithm (Rosenblatt, 1957)
Adapt the weights w, example by example:
1 Initialise the weights and the threshold.
2 For each example j in our training set D, perform the following steps over the input $x_j$ and desired output $\hat{y}_j$:
  1 Calculate the actual output:
    $$y_j(t) = f[w(t) \cdot x_j] = f[w_0(t) + w_1(t) x_{j,1} + w_2(t) x_{j,2} + \cdots + w_n(t) x_{j,n}]$$
  2 Update the weights:
    $$w_i(t+1) = w_i(t) + \alpha (\hat{y}_j - y_j(t)) x_{j,i}, \quad \text{for all nodes } 0 \le i \le n.$$
3 Repeat Step 2 until the iteration error $\frac{1}{s} \sum_{j=1}^{s} |\hat{y}_j - y_j(t)|$ over the s training examples is less than a user-specified error threshold γ, or a predetermined number of iterations have been completed.
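The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the names `predict`, `train_perceptron` and `AND_DATA`, the zero initialisation, and the toy logical-AND task are all assumptions made for the example.

```python
def predict(w, x):
    """Perceptron output: 1 if w . x > 0, else 0 (x[0] is the always-one bias input)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


def train_perceptron(data, alpha=0.1, gamma=0.0, max_iters=100):
    """data: list of (x, target) pairs, where x already includes the leading 1."""
    n = len(data[0][0])
    w = [0.0] * n                        # step 1: initialise the weights (assumed zero)
    for _ in range(max_iters):
        total_error = 0
        for x, target in data:           # step 2: visit each training example
            y = predict(w, x)            # 2.1: actual output
            for i in range(n):           # 2.2: update every weight, bias included
                w[i] += alpha * (target - y) * x[i]
            total_error += abs(target - y)
        if total_error / len(data) <= gamma:   # step 3: stop once the mean error is small
            break
    return w


# Toy task (hypothetical): logical AND, with a constant 1 prepended as the bias input.
AND_DATA = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w_and = train_perceptron(AND_DATA)
```

AND is linearly separable, so the loop stops with every training example classified correctly; for a task such as XOR the same loop would simply run until `max_iters`.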
Nonlinear perceptrons
Introduce a nonlinearity:
$$y_i = \sigma\Big(\sum_j w_{ij} x_j\Big)$$
Each unit is a simple nonlinear function of a linear combination of its inputs.
Typically logistic sigmoid:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
or tanh:
$$\sigma(z) = \tanh z$$
Multilayer perceptrons
Extend the network to multiple layers.
Now a hidden layer of nodes computes a function of the inputs, and output nodes compute a function of the hidden nodes’ “activations”.
[Figure: a multilayer perceptron with an input layer (x1 to x4), one hidden layer, and an output layer (y1 to y3).]
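A forward pass through such a network might look like the sketch below. The weight values, the hidden-layer size of three, and the helper name `layer` are hypothetical; logistic units are assumed in both layers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(weights, inputs):
    """One layer: row i of `weights` holds the weights w_ij of unit i over inputs x_j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


# Hypothetical weights: 4 inputs -> 3 hidden units -> 3 outputs.
W_hidden = [[0.5, -0.2, 0.1, 0.0],
            [0.3, 0.8, -0.5, 0.2],
            [-0.7, 0.1, 0.4, 0.6]]
W_output = [[1.0, -1.0, 0.5],
            [0.2, 0.3, -0.4],
            [-0.6, 0.9, 0.1]]

x = [1.0, 0.0, -1.0, 2.0]
hidden = layer(W_hidden, x)      # hidden activations: a function of the inputs
y = layer(W_output, hidden)      # outputs: a function of the hidden activations
```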
Cost function
Such networks can be optimized (“trained”) to minimize a cost function (loss function or objective function) that is a numerical score of the network’s performance with respect to targets $\hat{y}_i(t)$.
• Squared Error
$$L_{SE} = \frac{1}{2} \sum_t \sum_i \big(y_i(t) - \hat{y}_i(t)\big)^2$$
This is a frame-based criterion, where t would ideally range over the entire space of decoding frames, but in practice ranges over the training set, and we measure it on a development set.
• Cross Entropy
$$L_{CE} = \sum_t \sum_i \hat{y}_i(t) \log y_i(t)$$
With this sign convention $L_{CE} \le 0$, and the quantity minimized in training is $-L_{CE}$.
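As a small numerical illustration, both criteria can be evaluated for a single frame. The function names are hypothetical, and the cross entropy here uses the conventional leading minus sign, so that lower values are better for both criteria.

```python
import math


def squared_error(outputs, targets):
    """0.5 * sum_i (y_i - target_i)^2 for one frame."""
    return 0.5 * sum((y - t) ** 2 for y, t in zip(outputs, targets))


def cross_entropy(outputs, targets):
    """-sum_i target_i * log(y_i) for one frame; zero-probability targets contribute nothing."""
    return -sum(t * math.log(y) for y, t in zip(outputs, targets) if t > 0)


outputs = [0.7, 0.2, 0.1]     # network outputs for one frame
targets = [1.0, 0.0, 0.0]     # one-hot target for the same frame

se = squared_error(outputs, targets)
ce = cross_entropy(outputs, targets)
```

For a one-hot target the cross entropy reduces to minus the log of the probability assigned to the correct class, here `-log(0.7)`.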
Targets
We need targets / labels $\hat{y}_i(t)$ for each frame, usually provided by forced alignment (Lecture 8).
• Viterbi alignment gives one target class for each frame t.
• Baum-Welch soft alignment gives a target distribution across $\hat{y}_i(t)$ for each t.
Softmax output layer
If the output units are logistic, then they are suitable for representing multivariate Bernoulli random variables $P(y_i = 1 | x)$.
To model a multi-class “categorical” distribution, we use the softmax:
$$y_i = P(c_i | x) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
which is normalized to sum to one.
This reduces to the logistic sigmoid (applied to the difference of the two inputs) when there are two output classes.
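The following sketch computes a softmax and checks the two-class case numerically. Subtracting the maximum input before exponentiating is a common numerical safeguard added here as an assumption; it does not change the result.

```python
import math


def softmax(z):
    m = max(z)                                # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


probs = softmax([2.0, 1.0, 0.1])              # three-class example
two_class = softmax([1.5, 0.0])               # two-class example
two_class_via_sigmoid = sigmoid(1.5 - 0.0)    # sigmoid of the input difference
```

Here `two_class[0]` and `two_class_via_sigmoid` agree to floating-point precision.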
Gradient descent
To minimize the loss L, compute a gradient $\partial L / \partial w$ for each parameter w and update it using simple gradient descent:
$$w' = w - \eta \frac{\partial L}{\partial w}$$
η is a learning rate which is chosen (typically by cross-validation) but may be set automatically.
We can apply the chain rule to compute $\partial L / \partial w$ for parameters deep in the network.
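A one-parameter toy example shows the update rule in action. The loss $L(w) = (w - 3)^2$, the starting point and the learning rate are all hypothetical choices for illustration.

```python
def grad(w):
    """dL/dw for the toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w = 0.0       # arbitrary starting point
eta = 0.1     # learning rate
for _ in range(100):
    w = w - eta * grad(w)     # w' = w - eta * dL/dw
```

Each step shrinks the distance to the minimum at w = 3 by a factor of 0.8, so after 100 steps w is very close to 3.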
Back-propagation 0
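Derivatives of the loss functions:
\begin{align}
\frac{\partial L_{CE}}{\partial y_i} &= \frac{\partial}{\partial y_i} \sum_j \hat{y}_j(t) \log y_j(t) \tag{3} \\
&= \frac{\hat{y}_i(t)}{y_i(t)} \tag{4} \\
\frac{\partial L_{SE}}{\partial y_i} &= \frac{\partial}{\partial y_i} \frac{1}{2} \sum_j \big(y_j(t) - \hat{y}_j(t)\big)^2 \tag{5} \\
&= y_i(t) - \hat{y}_i(t) \tag{6}
\end{align}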
Back-propagation 0
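Derivative of Logistic activation function:
\begin{align}
\frac{\partial y_i}{\partial z_i} &= \frac{\partial}{\partial z_i} \frac{1}{1 + e^{-z_i}} \tag{7} \\
&= \frac{e^{-z_i}}{(1 + e^{-z_i})^2} \tag{8} \\
&= y_i (1 - y_i) \tag{9}
\end{align}
Because
$$1 - y = \frac{(1 + e^{-z}) - 1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} \tag{11}$$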
Back-propagation 0
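Derivative of Softmax activation function:
\begin{align}
\frac{\partial y_k}{\partial z_i} &= \frac{\partial}{\partial z_i} \frac{e^{z_k}}{\sum_j e^{z_j}} \tag{13} \\
&= \frac{\delta_{ik} \big(\sum_j e^{z_j}\big) e^{z_k} - e^{z_k} e^{z_i}}{\big(\sum_j e^{z_j}\big)^2} \tag{14} \\
&= \frac{e^{z_i}}{\sum_j e^{z_j}} \cdot \frac{\big(\sum_j e^{z_j}\big) \delta_{ik} - e^{z_k}}{\sum_j e^{z_j}} \tag{15} \\
&= y_i (\delta_{ik} - y_k) \tag{16}
\end{align}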
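One way to gain confidence in eqn. 16 is to compare it with a finite-difference estimate. The sketch below does that for an arbitrary input vector; the function names, the test point and the step size `eps` are assumptions made for the check.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(z):
    """J[k][i] = dy_k/dz_i = y_i * (delta_ik - y_k), as in eqn. 16."""
    y = softmax(z)
    n = len(z)
    return [[y[i] * ((1.0 if i == k else 0.0) - y[k]) for i in range(n)]
            for k in range(n)]


def numeric_jacobian(z, eps=1e-6):
    """Central-difference estimate of the same Jacobian."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        zp = list(z); zp[i] += eps
        zm = list(z); zm[i] -= eps
        yp, ym = softmax(zp), softmax(zm)
        for k in range(n):
            J[k][i] = (yp[k] - ym[k]) / (2.0 * eps)
    return J


z = [0.5, -1.0, 2.0]
analytic = softmax_jacobian(z)
numeric = numeric_jacobian(z)
max_diff = max(abs(a - b) for ra, rb in zip(analytic, numeric) for a, b in zip(ra, rb))
```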
Back-propagation I
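For a weight in the final layer, by the chain rule for one example:
$$\frac{\partial L}{\partial w_{ij}} = \sum_k \frac{\partial L}{\partial y_k} \frac{\partial y_k}{\partial z_i} \frac{\partial z_i}{\partial w_{ij}} \tag{17}$$
For Softmax & $L_{CE}$:
\begin{align}
\frac{\partial L_{CE}}{\partial y_k} &= \frac{\hat{y}_k}{y_k} && \text{[Outer gradient.]} \tag{18} \\
\frac{\partial y_k}{\partial z_i} &= y_i (\delta_{ik} - y_k) && \text{[Derivative of softmax activation function.]} \tag{19} \\
\frac{\partial z_i}{\partial w_{ij}} &= x_j \tag{20} \\
\frac{\partial L}{\partial w_{ij}} &= \sum_k \frac{\hat{y}_k}{y_k} y_k (\delta_{ik} - y_i) x_j && \text{[using } y_i (\delta_{ik} - y_k) = y_k (\delta_{ik} - y_i) \text{]} \tag{21} \\
&= x_j \sum_k \hat{y}_k (\delta_{ik} - y_i) \tag{22} \\
&= x_j (\hat{y}_i - y_i) && \text{[since } \textstyle\sum_k \hat{y}_k = 1 \text{]} \tag{23}
\end{align}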
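Eqn. 23 can be checked numerically for a small softmax layer. The sketch below evaluates the sum exactly as written in the derivation (no leading minus sign) and compares $x_j(\hat{y}_i - y_i)$ with central differences; with the conventional minus sign in front of the sum, the gradient would simply flip to $x_j(y_i - \hat{y}_i)$. The weights, input and soft target are hypothetical.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def outputs(W, x):
    """y = softmax(z), with z_i = sum_j w_ij x_j."""
    return softmax([sum(w * xj for w, xj in zip(row, x)) for row in W])


def loss(W, x, target):
    """sum_k target_k * log(y_k), the form used in the derivation."""
    return sum(t * math.log(yk) for t, yk in zip(target, outputs(W, x)))


def analytic_grad(W, x, target):
    """G[i][j] = x_j * (target_i - y_i), eqn. 23."""
    y = outputs(W, x)
    return [[xj * (t_i - y_i) for xj in x] for t_i, y_i in zip(target, y)]


def numeric_grad(W, x, target, eps=1e-6):
    G = [[0.0] * len(x) for _ in W]
    for i in range(len(W)):
        for j in range(len(x)):
            Wp = [row[:] for row in W]; Wp[i][j] += eps
            Wm = [row[:] for row in W]; Wm[i][j] -= eps
            G[i][j] = (loss(Wp, x, target) - loss(Wm, x, target)) / (2.0 * eps)
    return G


W = [[0.2, -0.4], [0.1, 0.3], [-0.5, 0.6]]    # 3 outputs, 2 inputs
x = [1.0, -2.0]
target = [0.7, 0.2, 0.1]                      # soft target that sums to one

ga = analytic_grad(W, x, target)
gn = numeric_grad(W, x, target)
max_diff = max(abs(a - b) for ra, rb in zip(ga, gn) for a, b in zip(ra, rb))
```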
Back-propagation II
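Back-propagating (Rumelhart et al., 1986) to an earlier hidden layer with weights $w'_{jk}$, activations $x_j$ and inputs $x'_k$:
$$x_j = \sigma(z'_j) = \sigma\Big(\sum_k w'_{jk} x'_k\Big) \tag{24}$$
First find the gradient w.r.t. the hidden layer activation $x_j$:
\begin{align}
\frac{\partial L}{\partial x_j} &= \sum_i \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial z_i} \frac{\partial z_i}{\partial x_j} \tag{25} \\
\frac{\partial z_i}{\partial x_j} &= w_{ij} \tag{26}
\end{align}
i.e. we pass the vector of gradients $\partial L / \partial y_i$ back through the nonlinearity and then project the result back through the layer’s weight matrix W.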
Back-propagation III
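\begin{align}
\frac{\partial L}{\partial w'_{jk}} &= \frac{\partial L}{\partial x_j} \frac{\partial x_j}{\partial z'_j} \frac{\partial z'_j}{\partial w'_{jk}} && \text{[Same form as eqn. 17.]} \tag{27} \\
\frac{\partial x_j}{\partial z'_j} &= x_j (1 - x_j) && \text{[Derivative of sigmoid activation function.]} \tag{28} \\
\frac{\partial z'_j}{\partial w'_{jk}} &= x'_k \tag{29}
\end{align}
Continue to arbitrary depth: compute activations’ gradients and then weight gradients for each layer.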
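Putting the pieces together, the sketch below back-propagates through a softmax output layer into one sigmoid hidden layer and checks the hidden-layer weight gradients against central differences. The network sizes, weights and target are hypothetical, and the loss is the sum as written in the derivation (no leading minus sign), so the gradient at the output pre-activations is the target minus the output.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def forward(W1, W2, x_in):
    hidden = [sigmoid(z) for z in matvec(W1, x_in)]   # eqn. 24
    y = softmax(matvec(W2, hidden))
    return hidden, y


def loss(W1, W2, x_in, target):
    _, y = forward(W1, W2, x_in)
    return sum(t * math.log(yi) for t, yi in zip(target, y))


def hidden_weight_grads(W1, W2, x_in, target):
    hidden, y = forward(W1, W2, x_in)
    dL_dz = [t - yi for t, yi in zip(target, y)]      # softmax + cross entropy at the output
    # Project back through the output weights (eqns. 25-26).
    dL_dhidden = [sum(dL_dz[i] * W2[i][j] for i in range(len(W2)))
                  for j in range(len(hidden))]
    # Through the sigmoid and onto the hidden weights (eqns. 27-29).
    return [[dL_dhidden[j] * hidden[j] * (1.0 - hidden[j]) * xk for xk in x_in]
            for j in range(len(hidden))]


def numeric_grads(W1, W2, x_in, target, eps=1e-6):
    G = [[0.0] * len(x_in) for _ in W1]
    for j in range(len(W1)):
        for k in range(len(x_in)):
            Wp = [row[:] for row in W1]; Wp[j][k] += eps
            Wm = [row[:] for row in W1]; Wm[j][k] -= eps
            G[j][k] = (loss(Wp, W2, x_in, target) - loss(Wm, W2, x_in, target)) / (2.0 * eps)
    return G


W1 = [[0.1, -0.3, 0.5], [0.7, 0.2, -0.6]]             # 2 hidden units, 3 inputs
W2 = [[0.4, -0.2], [-0.1, 0.3], [0.6, 0.5]]           # 3 outputs, 2 hidden units
x_in = [1.0, 0.5, -1.5]
target = [0.1, 0.6, 0.3]

ga = hidden_weight_grads(W1, W2, x_in, target)
gn = numeric_grads(W1, W2, x_in, target)
max_diff = max(abs(a - b) for ra, rb in zip(ga, gn) for a, b in zip(ra, rb))
```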
Stochastic Gradient Descent
Since L is typically defined on the entire training set, it takes a long time to compute it and its derivatives (summed across all exemplars), and it’s only an approximation to the true loss on the theoretical set of all utterances.
We can compute a noisy estimate of $\partial L / \partial w$ on a small subset of the training set, and make a Stochastic Gradient Descent (SGD) update very quickly.
In the limit, we could update on every frame, but a useful compromise is to use a minibatch of around 200 frames.
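A minimal minibatch SGD loop, on a toy problem rather than speech data, might look as follows. Everything here is an assumption for illustration: the task (fitting a line with squared error), the learning rate, the number of epochs, and a batch size of 20, scaled down from the roughly 200 frames mentioned above to suit 1,000 toy examples.

```python
import random

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
data = [(x, 2.0 * x + 1.0) for x in xs]       # toy targets generated from a known line

w, b = 0.0, 0.0
eta = 0.1
batch_size = 20

for epoch in range(50):
    random.shuffle(data)                      # visit the data in a new order each epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Noisy gradient estimate from this minibatch only (mean of 1/2 * error^2).
        gw = sum((w * x + b - t) * x for x, t in batch) / len(batch)
        gb = sum((w * x + b - t) for x, t in batch) / len(batch)
        w -= eta * gw
        b -= eta * gb
```

Each update touches only 20 examples, so 50 updates are made per pass over the data instead of one.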
Second-order optimization
• Compute the second derivative and optimize a second-order approximation to the error surface.
• More computation per step.
• Requires less-noisy estimates of gradient / curvature (bigger batches).
• Each step is more effective.
• Variants:
  • Newton-Raphson
  • Quickprop
  • LBFGS
  • Hessian-free
  • Conjugate gradient
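As an illustration of the first variant only, here is a one-dimensional Newton-Raphson iteration on a toy loss $L(w) = e^w - 2w$, chosen (hypothetically) because its minimum at $w = \ln 2$ is known exactly. Each step divides the gradient by the second derivative instead of multiplying it by a fixed learning rate.

```python
import math


def d1(w):
    """First derivative of the toy loss L(w) = exp(w) - 2w."""
    return math.exp(w) - 2.0


def d2(w):
    """Second derivative of the same loss."""
    return math.exp(w)


w = 0.0
for _ in range(6):
    w = w - d1(w) / d2(w)     # Newton-Raphson step: gradient scaled by inverse curvature
```

Six steps are enough to reach the minimum to within floating-point precision on this toy problem, which is the sense in which each step is more effective; in a network the second derivative becomes a large matrix, which is where the extra computation comes from.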
Two main paradigms for neural networks for speech
• Use neural networks to compute nonlinear feature representations.
  • “Bottleneck” or “tandem” features (Hermansky et al., 2000)
  • Low-dimensional representation is modelled conventionally with GMMs.
  • Allows all the GMM machinery and tricks to be exploited.
• Use neural networks to estimate context-dependent (CD) state probabilities.
Neural network features
Train a neural network to discriminate classes.
Use the output or a low-dimensional bottleneck layer representation as features.
[Figure: a network with inputs x1 to x4, hidden layers including a narrow bottleneck layer, and outputs y1 to y5.]
Neural network features
TRAP: Concatenate PLP-HLDA features and NN features.
Bottleneck outperforms posterior features (Grezl et al., 2007).
Generally DNN features + GMMs reach about the same performance as hybrid DNN-HMM systems, but are much more complex.
Hybrid networks: Decoding (recap)
Recall (Lecture 1) that we choose the decoder output as the optimal word sequence $\hat{w}$ for an observation sequence o:
\begin{align}
\hat{w} &= \arg\max_{w \in \Sigma^*} \Pr[w | o] \tag{30} \\
&= \arg\max_{w \in \Sigma^*} \Pr[o | w] \Pr[w] \tag{31}
\end{align}
and
$$\Pr(o | w) = \sum_{d, c, p} \Pr(o | c) \Pr(c | p) \Pr(p | w) \tag{32}$$
where p is the phone sequence and c is the CD state sequence.
Hybrid Neural network decoding
Now we model P(o|c) with a neural network instead of a Gaussian mixture model. Everything else stays the same.
\begin{align}
P(o | c) &= \prod_t P(o_t | c_t) \tag{33} \\
P(o_t | c_t) &= \frac{P(c_t | o_t) P(o_t)}{P(c_t)} \tag{34} \\
&\propto \frac{P(c_t | o_t)}{P(c_t)} \tag{35}
\end{align}
for observations $o_t$ at time t and a CD state sequence $c_t$.
We can ignore $P(o_t)$ since it is the same for all decoding paths.
The last term is called the “scaled posterior”:
$$\log P(o_t | c_t) = \log P(c_t | o_t) - \alpha \log P(c_t) \tag{36}$$
Empirically (by cross validation) we actually find better results with a “prior smoothing” term α ≈ 0.8.
Input features
Neural networks can handle high-dimensional, correlated features.
Use 26 stacked frames of filterbank inputs (40-dimensional mel-spaced filterbanks).
[Figure: example filters learned in the first layer.]
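Frame stacking can be sketched as below. The split of the 26-frame window into 20 past frames, the current frame and 5 future frames is an assumption made for the example (the slide gives only the total), as is the choice to repeat the first and last frames at the utterance edges.

```python
def stack_frames(frames, left, right):
    """For each time t, concatenate frames t-left .. t+right (edge frames are repeated)."""
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), last)   # clamp to the utterance boundaries
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Dummy utterance: 100 frames of 40 filterbank values, each frame filled with its own index.
frames = [[float(t)] * 40 for t in range(100)]
stacked = stack_frames(frames, left=20, right=5)   # 20 + 1 + 5 = 26 frames per input
```

Each stacked vector has 26 × 40 = 1040 dimensions, which is the network’s input size.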
Rough History
• Multi-layer perceptron 1986
• Speech recognition with neural networks 1987–1995
• Superseded by GMMs 1995–2009
• Neural network features 2002–
• Deep networks 2006– (Hinton, 2002)
• Deep networks for speech recognition
  • Good results on TIMIT (Mohamed et al., 2009)
  • Results on large vocabulary systems 2010 (Dahl et al., 2011)
  • Google launches DNN ASR product 2011
  • Dominant paradigm for ASR 2012 (Hinton et al., 2012)
What is new?
• Fast GPU-based training (distributed CPU-based training is even faster)
• Pretraining (turns out not to be important)
• Deeper networks - enabled by faster training
• Large datasets
• Machine learning understanding
State of the art
Google’s current production speech systems:
• 26 frames of 40-dimensional filterbank inputs
• 8 hidden layers of 2560 hidden units.
• Rectified Linear nonlinearity (Zeiler et al., 2013)
• 14,000 outputs
• 85 million parameters, trained on 2,000 hours of speech data.
• Running quantized with 8 bit integer weights.
On Android phones we run a smaller model with 2.7M parameters.
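The quoted parameter count for the large model can be reproduced from the other figures on the slide, assuming plain fully connected layers with one bias per unit (an assumption; the slide does not spell out the connectivity).

```python
# Layer widths: stacked input, 8 hidden layers, output layer.
layer_sizes = [26 * 40] + [2560] * 8 + [14000]

# One weight per connection between consecutive layers, plus one bias per non-input unit.
weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
biases = sum(layer_sizes[1:])
total = weights + biases
```

This comes to 84,412,080 parameters, in line with the 85 million quoted; the seven hidden-to-hidden matrices and the 2560 × 14,000 output matrix account for almost all of them.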
Sequence training for neural networks
Neural networks are trained with a frame-level discriminative criterion (cross-entropy $L_{CE}$).
This is far from the minimum-WER criterion we care about.
GMM-HMMs trained with sequence-level discriminative training (MMI, bMMI (Povey et al., 2008), MPE, MBR etc.) outperform maximum-likelihood models.
Kingsbury (2009) shows how to compute a gradient for back-propagation from the numerator and denominator statistics for truth / alternative hypothesis lattices.
Given this “outer gradient” we use back-propagation to compute parameter updates for the neural network.
Pretraining
If we have a small amount of supervised data, we can use unlabelled data to get the parameters into reasonable places to model the distribution of the inputs, without knowing the labels.
Pretraining is done layer by layer, so it is faster than supervised training.
There are several methods:
• Contrastive divergence RBM training;
• Autoencoder;
• Greedy-layerwise [actually supervised]
but none seems necessary for large speech corpora.
Alternative nonlinearities
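\begin{align}
\text{Sigmoid} \quad \sigma(z) &= \frac{1}{1 + e^{-z}} \tag{37} \\
\text{Tanh} \quad \sigma(z) &= \tanh(z) \tag{38} \\
\text{ReLU} \quad \sigma(z) &= \max(z, 0) \tag{39} \\
\text{Softsign} \quad \sigma(z) &= \frac{z}{1 + |z|} \tag{40} \\
\text{Softplus} \quad \sigma(z) &= \log(1 + e^{z}) \tag{41}
\end{align}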
Alternative nonlinearities
Note:
• ReLU gives sparse activations.
• ReLU gradient is zero for z < 0 and one for z > 0, so propagated gradients don’t attenuate as much as in other nonlinearities.
• ReLU & softplus are unbounded.
• Gradients asymptote differently for other nonlinearities.
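For reference, the five nonlinearities of eqns. 37 to 41 and their gradients can be written out directly. The function names are hypothetical, the gradient formulas are standard results rather than taken from the slides, and the ReLU gradient at exactly z = 0 is set to zero by convention here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(z, 0.0)

def softsign(z):
    return z / (1.0 + abs(z))

def softplus(z):
    return math.log(1.0 + math.exp(z))


def sigmoid_grad(z):
    y = sigmoid(z)
    return y * (1.0 - y)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def softsign_grad(z):
    return 1.0 / (1.0 + abs(z)) ** 2

def softplus_grad(z):
    return sigmoid(z)          # d/dz log(1 + e^z) = 1 / (1 + e^-z)


# Values and gradients at a few sample points, for side-by-side comparison.
points = [-10.0, -1.0, 0.5, 10.0]
table = {
    "sigmoid": [(sigmoid(z), sigmoid_grad(z)) for z in points],
    "tanh": [(math.tanh(z), tanh_grad(z)) for z in points],
    "relu": [(relu(z), relu_grad(z)) for z in points],
    "softsign": [(softsign(z), softsign_grad(z)) for z in points],
    "softplus": [(softplus(z), softplus_grad(z)) for z in points],
}
```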
Neural network variants
Many variations
• Convolutional neural networks (Abdel-Hamid et al., 2012)
  Convolve a filter with the input: weight sharing saves parameters and gives invariance to frequency shifts.
• Recurrent neural networks
  Take one frame at a time but store a history of the previous frames, so could theoretically model long-term context.
• Long Short-Term Memory (Graves et al., 2013)
  A successful specialization of the recurrent neural network, with complex memory “cells”.
Recurrent neural networks
A recurrent neural network has additional output nodes which are copied back to its inputs with a time delay (Robinson et al., 1993).
Training is with Back-Propagation Through Time.
[Figure: a recurrent network with inputs x1 to x4, outputs y1 to y5, and recurrent nodes r1 to r6.]
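The copy-back mechanism can be sketched as a single step function applied frame by frame. The layer sizes, the random weights, the use of tanh units and the name `rnn_step` are all assumptions for illustration; training with Back-Propagation Through Time is not shown.

```python
import math
import random


def rnn_step(W, x, r_prev, n_out):
    """One time step: the network sees [x, r_prev]; its first n_out units are the
    outputs y, and the remaining units are the state r fed back at the next step."""
    inp = list(x) + list(r_prev)
    units = [math.tanh(sum(w * a for w, a in zip(row, inp))) for row in W]
    return units[:n_out], units[n_out:]


random.seed(0)
n_in, n_out, n_state = 4, 5, 3                # hypothetical sizes
W = [[random.uniform(-0.5, 0.5) for _ in range(n_in + n_state)]
     for _ in range(n_out + n_state)]

sequence = [[0.1, 0.2, 0.3, 0.4],
            [0.0, -0.1, 0.5, 0.2],
            [0.3, 0.3, -0.2, 0.1]]
r = [0.0] * n_state                           # initial state
ys = []
for x in sequence:
    y, r = rnn_step(W, x, r, n_out)           # r carries history forward in time
    ys.append(y)
```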
Neural network language modelling
Model $P(w_n | w_{n-1}, w_{n-2}, w_{n-3}, \ldots)$ with a neural network instead of with an n-gram (pure frequency counts with back-off).
Simply train a softmax for each $w_n$, and use an input representation of $w_{n-1}, w_{n-2}, w_{n-3}, \ldots$. Even more effectively, train a recurrent neural network (Mikolov et al., 2010).
Leads to word embeddings: a linear projection of sparse word identities (O(millions)) into a lower-dimensional (O(hundreds)) dense vector space.
Easy to add other features (class, part-of-speech).
Best performance when combined with an n-gram.
Hard to do real-time decoding, though much of the performance can be retained when knowledge is extracted and stored in a WFST (Arisoy et al., 2013).
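The word-embedding point can be made concrete with a tiny example: multiplying a one-hot word vector by a projection matrix simply selects one row of that matrix. The four-word vocabulary and the two-dimensional embedding values below are hypothetical.

```python
vocab = ["the", "cat", "sat", "mat"]
E = [[0.1, 0.2],          # one row of the projection matrix per vocabulary word
     [0.3, -0.1],
     [-0.4, 0.5],
     [0.0, 0.7]]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def project(sparse, matrix):
    """Linear projection of a sparse word identity: sum_i sparse[i] * matrix[i]."""
    dim = len(matrix[0])
    return [sum(s * row[d] for s, row in zip(sparse, matrix)) for d in range(dim)]


word = vocab.index("sat")
embedding = project(one_hot(word, len(vocab)), E)
```

Because the result equals `E[word]`, implementations normally skip the multiplication and index the matrix directly, which is what keeps vocabularies of millions of words tractable.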
Bibliography I
Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., and Penn, G. (2012). Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In ICASSP, pages 4277–4280. IEEE.
Arisoy, E., Chen, S. F., Ramabhadran, B., and Sethy, A. (2013). Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition. In ICASSP, pages 8242–8246. IEEE.
Dahl, G., Yu, D., Deng, L., and Acero, A. (2011). Large vocabulary continuous speech recognition with context-dependent DBN-HMMs. In ICASSP.
Graves, A., Jaitly, N., and Mohamed, A. (2013). Hybrid speech recognition with deep bidirectional LSTM. In ASRU.
Grezl, Karafiat, and Cernocky (2007). Neural network topologies and bottleneck features. Speech Recognition.
Hermansky, H., Ellis, D., and Sharma, S. (2000). Tandem connectionist feature extraction for conventional HMM systems. In ICASSP.
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29:82–97.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation.
Kingsbury, B. (2009). Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In ICASSP, pages 3761–3764.
Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In Interspeech.
Mohamed, A., Dahl, G., and Hinton, G. (2009). Deep belief networks for phone recognition. In NIPS.
Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G., and Visweswariah, K. (2008). Boosted MMI for model and feature-space discriminative training. In ICASSP.
Robinson, A. J., Almeida, L., Boite, J.-M., Bourlard, H., Fallside, F., Hochberg, M., Kershaw, D., Kohn, P., Konig, Y., Morgan, N., Neto, J. P., Renals, S., Saerens, M., and Wooters, C. (1993). A neural network based, speaker independent, large vocabulary, continuous speech recognition system: The Wernicke project. In Eurospeech ’93, pages 1941–1944.
Rosenblatt, F. (1957). The perceptron–a perceiving and recognizing automaton. Technical Report 85-460-1, Cornell Aeronautical Laboratory.
Rumelhart, D. E., Hinton, G., and Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.
Zeiler, M., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., and Hinton, G. (2013). On rectified linear units for speech processing. In ICASSP.