Lecture 14: Neural Networks Andrew Senior


Transcript of Lecture 14: Neural Networks Andrew Senior

Page 1:

Speech recognition: Lecture 14: Neural Networks

Andrew Senior <[email protected]>

Google NYC

December 12, 2013

Page 2:

1 Introduction to Neural networks

2 Neural networks for speech recognition
      Neural network features for speech recognition
      Hybrid neural networks
      History
      Variations

3 Language modelling

Page 3:

The perceptron

[Figure: a perceptron with inputs x_1 … x_5, weights w_1 … w_5, and a single output.]

A perceptron is a linear classifier:

f(x) = 1 if w · x > 0   (1)
     = 0 otherwise.     (2)

Add an extra “always one” input to provide an offset or “bias”. The weights w can be learned for a given task with the Perceptron Algorithm.

Page 4:

Perceptron algorithm (Rosenblatt, 1957)

Adapt the weights w, example by example:

1 Initialise the weights and the threshold.

2 For each example j in our training set D, perform the following steps over the input x_j and desired output y_j:

   1 Calculate the actual output:

      y_j(t) = f[w(t) · x_j] = f[w_0(t) + w_1(t) x_j,1 + w_2(t) x_j,2 + · · · + w_n(t) x_j,n]

   2 Update the weights:

      w_i(t+1) = w_i(t) + α (y_j − y_j(t)) x_j,i , for all nodes 0 ≤ i ≤ n.

3 Repeat Step 2 until the iteration error (1/s) ∑_j |y_j − y_j(t)| (with s the number of training examples) is less than a user-specified error threshold γ, or a predetermined number of iterations have been completed.
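
A minimal NumPy sketch of this procedure (the function name train_perceptron, the data X, labels y, learning rate alpha and threshold gamma are illustrative names, not from the slides):

```python
import numpy as np

def train_perceptron(X, y, alpha=0.1, gamma=0.01, max_iters=100):
    """Rosenblatt's perceptron algorithm: X is (s, n), y is (s,) with 0/1 labels."""
    s, n = X.shape
    Xb = np.hstack([np.ones((s, 1)), X])     # add the "always one" bias input
    w = np.zeros(n + 1)                      # initialise the weights
    for _ in range(max_iters):
        errors = 0.0
        for xj, yj in zip(Xb, y):
            out = 1.0 if np.dot(w, xj) > 0 else 0.0   # f(x) = 1 if w.x > 0, else 0
            w += alpha * (yj - out) * xj              # update all weights w_i
            errors += abs(yj - out)
        if errors / s < gamma:               # iteration error below threshold
            break
    return w

# Example: learn a linearly separable function (logical OR)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w = train_perceptron(X, y)
```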

Page 5:

Nonlinear perceptrons

Introduce a nonlinearity:

y_i = σ( ∑_j w_ij x_j )

Each unit is a simple nonlinear function of a linear combination of its inputs.

Typically the logistic sigmoid:

σ(z) = 1 / (1 + e^(−z))

or tanh:

σ(z) = tanh z

Page 6:

Multilayer perceptrons

Extend the network to multiple layers. Now a hidden layer of nodes computes a function of the inputs, and output nodes compute a function of the hidden nodes’ “activations”.

[Figure: a multilayer perceptron with an input layer (x_1 … x_4), a hidden layer, and an output layer (y_1, y_2, y_3).]
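
For illustration, a single forward pass through such a network in NumPy (the layer sizes, the weight matrices W_hid and W_out, and the sigmoid output layer are assumptions of the sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W_hid, W_out):
    """x: (n_in,) input; W_hid: (n_hidden, n_in); W_out: (n_out, n_hidden)."""
    h = sigmoid(W_hid @ x)    # hidden activations: a function of the inputs
    y = sigmoid(W_out @ h)    # outputs: a function of the hidden activations
    return h, y

rng = np.random.default_rng(0)
x = rng.normal(size=4)                       # four inputs, as in the figure
W_hid = rng.normal(scale=0.1, size=(5, 4))
W_out = rng.normal(scale=0.1, size=(3, 5))
h, y = mlp_forward(x, W_hid, W_out)
```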

Page 7:

Cost function

Such networks can be optimized (“trained”) to minimize a cost function (loss function or objective function) that is a numerical score of the network’s performance with respect to targets ŷ_i(t); we write y_i(t) for the network outputs.

• Squared Error

L_SE = (1/2) ∑_t ∑_i (y_i(t) − ŷ_i(t))²

This is a frame-based criterion, where t would ideally range over the entire space of decoding frames, but in practice ranges over the training set, and we measure it across a development set.

• Cross Entropy

L_CE = −∑_t ∑_i ŷ_i(t) log y_i(t)
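
As an illustration, both criteria in NumPy, writing y for the network outputs and y_hat for the targets ŷ (array shapes are illustrative):

```python
import numpy as np

def squared_error(y, y_hat):
    """L_SE = 1/2 * sum over frames t and outputs i of (y_i(t) - yhat_i(t))^2."""
    return 0.5 * np.sum((y - y_hat) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):
    """L_CE = -sum_t sum_i yhat_i(t) * log y_i(t); eps avoids log(0)."""
    return -np.sum(y_hat * np.log(y + eps))

# y, y_hat: (num_frames, num_classes); rows of y sum to 1,
# rows of y_hat are one-hot (Viterbi) or soft (Baum-Welch) targets.
```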

Page 8:

Targets

We need targets / labels ŷ_i(t) for each frame, usually provided by forced alignment (Lecture 8).

• Viterbi alignment gives one target class for each frame t.

• Baum-Welch soft alignments give a target distribution over ŷ_i(t) for each t.

Page 9:

Softmax output layer

If the output units are logistic, then they are suitable for representing multivariate Bernoulli random variables P(y_i = 1 | x). To model a multi-class “categorical” distribution we use the softmax:

y_i = P(c_i | x) = exp(z_i) / ∑_j exp(z_j)

which is normalized to sum to one. This reduces to the logistic sigmoid when there are two output classes.
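
A numerically stable version in NumPy (subtracting the maximum before exponentiating is a standard implementation trick, not something stated on the slide):

```python
import numpy as np

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j), computed stably by shifting by max(z)."""
    z = z - np.max(z)        # does not change the result, avoids overflow
    e = np.exp(z)
    return e / e.sum()

softmax(np.array([2.0, 1.0, 0.1]))   # approx. [0.659, 0.242, 0.099], sums to 1
```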

Page 10:

Gradient descent

To minimize the loss L, compute a gradient ∂L/∂w for each parameter w and update it using simple gradient descent:

w′ = w − η ∂L/∂w

η is a learning rate which is chosen (typically by cross-validation) but may be set automatically. We can apply the chain rule to compute ∂L/∂w for parameters deep in the network.

Page 11:

Back-propagation 0

Derivatives of the loss functions:

∂L_CE/∂y_i = ∂/∂y_i [ −∑_j ŷ_j(t) log y_j(t) ]   (3)
           = −ŷ_i(t) / y_i(t)                     (4)

∂L_SE/∂y_i = ∂/∂y_i [ (1/2) ∑_j (y_j(t) − ŷ_j(t))² ]   (5)
           = y_i(t) − ŷ_i(t)                            (6)

Page 12:

Back-propagation 0

Derivative of Logistic activation function:

∂y_i/∂z_i = ∂/∂z_i [ 1 / (1 + e^(−z_i)) ]   (7)
          = e^(−z_i) / (1 + e^(−z_i))²       (8)
          = y_i (1 − y_i)                    (9)

Because

1 − y = [ (1 + e^(−z)) − 1 ] / (1 + e^(−z)) = e^(−z) / (1 + e^(−z))

Page 13:

Back-propagation 0

Derivative of Softmax activation function:

∂y_k/∂z_i = ∂/∂z_i [ e^(z_k) / ∑_j e^(z_j) ]                                         (13)
          = [ δ_ik (∑_j e^(z_j)) e^(z_k) − e^(z_k) e^(z_i) ] / (∑_j e^(z_j))²         (14)
          = [ e^(z_i) / ∑_j e^(z_j) ] · [ (∑_j e^(z_j)) δ_ik − e^(z_k) ] / ∑_j e^(z_j) (15)
          = y_i (δ_ik − y_k)                                                           (16)

Page 14:

Back-propagation I

For a weight in the final layer, by the chain rule for one example:

∂L/∂w_ij = ∑_k (∂L/∂y_k) (∂y_k/∂z_i) (∂z_i/∂w_ij)   (17)

For Softmax & L_CE:

∂L_CE/∂y_k = −ŷ_k / y_k   [Outer gradient.]   (18)

∂y_k/∂z_i = y_i (δ_ik − y_k) = y_k (δ_ik − y_i)   [Derivative of softmax activation function.]   (19)

∂z_i/∂w_ij = x_j   (20)

∂L/∂w_ij = ∑_k (−ŷ_k / y_k) y_k (δ_ik − y_i) x_j   (21)
         = −x_j ∑_k ŷ_k (δ_ik − y_i)               (22)
         = x_j (y_i − ŷ_i)                          (23)

since the targets ŷ_k sum to one.

Page 15:

Back-propagation II

Back-propagating (Rumelhart et al., 1986) to an earlier hidden layer with weights w′_jk, activations x_j and inputs x′_k:

x_j = σ(z′_j) = σ( ∑_k w′_jk x′_k )   (24)

First find the gradient w.r.t. the hidden layer activation x_j:

∂L/∂x_j = ∑_i (∂L/∂y_i) (∂y_i/∂z_i) (∂z_i/∂x_j)   (25)

∂z_i/∂x_j = w_ij   (26)

i.e. we take the vector of gradients ∂L/∂y_i, pass it back through the nonlinearity, and then project it back through the weight matrix W.

Page 16:

Back-propagation III

∂L/∂w′_jk = (∂L/∂x_j) (∂x_j/∂z′_j) (∂z′_j/∂w′_jk)   [Same form as eqn. 17.]   (27)

∂x_j/∂z′_j = x_j (1 − x_j)   [Derivative of sigmoid activation function.]   (28)

∂z′_j/∂w′_jk = x′_k   (29)

Continue to arbitrary depth: compute activations’ gradients and then weight gradients for each layer.
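
Putting eqns. 17–29 together, a minimal sketch of one forward/backward pass for a network with one sigmoid hidden layer and a softmax output trained with cross entropy; variable names and shapes are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_backward(x_in, y_hat, W1, W2):
    """x_in: (n_in,) inputs x'_k; y_hat: (n_out,) target distribution;
    W1: (n_hid, n_in); W2: (n_out, n_hid). Returns gradients dW1, dW2."""
    # Forward pass
    z1 = W1 @ x_in
    x = sigmoid(z1)              # hidden activations x_j
    z = W2 @ x
    y = softmax(z)               # outputs y_i

    # Backward pass
    dz = y - y_hat               # dL/dz_i = y_i - yhat_i for softmax + cross entropy
    dW2 = np.outer(dz, x)        # dL/dw_ij = (y_i - yhat_i) x_j   (eqn. 23)
    dx = W2.T @ dz               # dL/dx_j = sum_i dL/dz_i * w_ij  (eqns. 25-26)
    dz1 = dx * x * (1.0 - x)     # through the sigmoid: x_j (1 - x_j)  (eqn. 28)
    dW1 = np.outer(dz1, x_in)    # dL/dw'_jk = dL/dz'_j * x'_k     (eqn. 29)
    return dW1, dW2
```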

Page 17:

Stochastic Gradient Descent

Since L is typically defined on the entire training set, it takes a long time to compute it and its derivatives (summed across all exemplars), and it’s only an approximation to the true loss on the theoretical set of all utterances.
We can compute a noisy estimate of ∂L/∂w on a small subset of the training set, and make a Stochastic Gradient Descent (SGD) update very quickly.
In the limit, we could update on every frame, but a useful compromise is to use a minibatch of around 200 frames.
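
A sketch of the minibatch SGD loop described here; compute_gradients stands in for the back-propagation above, and all names are illustrative:

```python
import numpy as np

def sgd(params, frames, targets, compute_gradients, eta=0.01,
        batch_size=200, num_steps=10000):
    """params: list of weight matrices, updated in place.
    frames, targets: NumPy arrays indexable by an integer index array."""
    rng = np.random.default_rng(0)
    n = len(frames)
    for _ in range(num_steps):
        idx = rng.integers(0, n, size=batch_size)          # random minibatch of frames
        grads = compute_gradients(params, frames[idx], targets[idx])
        for w, g in zip(params, grads):                    # noisy gradient step
            w -= eta * g
    return params
```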

Page 18:

Second-order optimization

• Compute the second derivative and optimize a second-order approximation to the error surface.

• More computation per step.

• Requires less-noisy estimates of gradient / curvature (bigger batches).

• Each step is more effective.

• Variants:

• Newton-Raphson

• Quickprop

• LBFGS

• Hessian-free

• Conjugate gradient

Page 19:

1 Introduction to Neural networks

2 Neural networks for speech recognition
      Neural network features for speech recognition
      Hybrid neural networks
      History
      Variations

3 Language modelling

Page 20:

Two main paradigms for neural networks for speech

• Use neural networks to compute nonlinear feature representations.
  • “Bottleneck” or “tandem” features (Hermansky et al., 2000)
  • Low-dimensional representation is modelled conventionally with GMMs.
  • Allows all the GMM machinery and tricks to be exploited.

• Use neural networks to estimate context-dependent (CD) state probabilities.

Page 21:

Outline

1 Introduction to Neural networks

2 Neural networks for speech recognition
      Neural network features for speech recognition
      Hybrid neural networks
      History
      Variations

3 Language modelling

Page 22:

Neural network features

Train a neural network to discriminate classes. Use the output or a low-dimensional bottleneck layer representation as features.

[Figure: a deep network with an input layer (x_1 … x_4), several hidden layers, a narrow bottleneck layer, and an output layer (y_1 … y_5).]
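
A sketch of how such bottleneck features might be extracted once the network is trained: run the forward pass up to the narrow layer and use those activations as the feature vector (the layer count, sizes, and sigmoid nonlinearity are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bottleneck_features(x, weights, bottleneck_index):
    """Forward propagate through the layers up to and including the narrow
    bottleneck layer; weights is a list of (n_out, n_in) matrices."""
    h = x
    for W in weights[:bottleneck_index + 1]:
        h = sigmoid(W @ h)
    return h   # low-dimensional features, e.g. modelled with a GMM downstream
```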

Page 23:

Neural network features

TRAP: Concatenate PLP-HLDA features and NN features. Bottleneck features outperform posterior features (Grezl et al., 2007). Generally DNN features + GMMs reach about the same performance as hybrid DNN-HMM systems, but are much more complex.

Page 24:

Outline

1 Introduction to Neural networks

2 Neural networks for speech recognition
      Neural network features for speech recognition
      Hybrid neural networks
      History
      Variations

3 Language modelling

Page 25:

Hybrid networks: Decoding (recap)

Recall (Lecture 1) that we choose the decoder output as the optimal word sequence w for an observation sequence o:

w = argmax_{w ∈ Σ*} Pr[w | o]          (30)
  = argmax_{w ∈ Σ*} Pr[o | w] Pr[w]    (31)

and

Pr(o | w) = ∑_{d,c,p} Pr(o | c) Pr(c | p) Pr(p | w)   (32)

where p is the phone sequence and c is the CD state sequence.

Page 26:

Hybrid Neural network decoding

Now we model P(o|c) with a Neural network instead of a Gaussian Mixture model. Everything else stays the same.

P(o | c) = ∏_t P(o_t | c_t)   (33)

P(o_t | c_t) = P(c_t | o_t) P(o_t) / P(c_t)   (34)
             ∝ P(c_t | o_t) / P(c_t)           (35)

for observations o_t at time t and a CD state sequence c_t.
We can ignore P(o_t) since it is the same for all decoding paths.
The last term is called the “scaled posterior”:

log P(o_t | c_t) = log P(c_t | o_t) − α log P(c_t)   (36)

Empirically (by cross-validation) we actually find better results with a “prior smoothing” term α ≈ 0.8.
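
In the log domain eqn. 36 is a one-line operation; a sketch, assuming per-frame log posteriors from the softmax and log priors estimated from state counts in the alignment:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors, alpha=0.8):
    """log P(o_t | c) = log P(c | o_t) - alpha * log P(c)  (up to a constant).
    log_posteriors: (num_frames, num_states) log P(c | o_t) from the softmax;
    log_priors: (num_states,) log P(c) from state counts in the alignment."""
    return log_posteriors - alpha * log_priors   # broadcasts over frames
```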

Page 27:

Input features

Neural networks can handle high-dimensional inputs with correlated features.
Use 26 stacked frames of filterbank inputs (40-dimensional mel-spaced filterbanks).

[Figure: example filters learned in the first layer.]

Page 28:

Outline

1 Introduction to Neural networks

2 Neural networks for speech recognition
      Neural network features for speech recognition
      Hybrid neural networks
      History
      Variations

3 Language modelling

Page 29:

Rough History

• Multi-layer perceptron 1986

• Speech recognition with neural networks 1987–1995

• Superseded by GMMs 1995–2009

• Neural network features 2002–

• Deep networks 2006– (Hinton, 2002)

• Deep networks for speech recognition
  • Good results on TIMIT (Mohamed et al., 2009)
  • Results on large vocabulary systems 2010 (Dahl et al., 2011)
  • Google launches DNN ASR product 2011
  • Dominant paradigm for ASR 2012 (Hinton et al., 2012)

Page 30:

What is new?

• Fast GPU-based training (distributed CPU-based training is even faster)

• Pretraining (turns out not to be important)

• Deeper networks - enabled by faster training

• Large datasets

• Machine learning understanding

Page 31:

State of the art

Google’s current speech production systems

• 26 frames of 40-dimensional filterbank inputs

• 8 hidden layers of 2560 hidden units.

• Rectified Linear nonlinearity (Zeiler et al., 2013)

• 14,000 outputs

• 85 million parameters, trained on 2,000 hours of speech data.

• Running quantized with 8-bit integer weights.

On Android phones we run a smaller model with 2.7M parameters.

Page 32:

Outline

1 Introduction to Neural networks

2 Neural networks for speech recognition
      Neural network features for speech recognition
      Hybrid neural networks
      History
      Variations

3 Language modelling

Page 33:

Sequence training for neural networks

Neural networks are trained with a frame-level discriminative criterion (cross-entropy L_CE), far from the minimum-WER criterion we care about.
GMM-HMMs trained with sequence-level discriminative training (MMI, bMMI (Povey et al., 2008), MPE, MBR etc.) outperform Maximum-Likelihood models.
Kingsbury (2009) shows how to compute a gradient for back-propagation from the numerator and denominator statistics for truth / alternative hypothesis lattices.
Given this “outer gradient” we use back-propagation to compute parameter updates for the neural network.

Page 34:

Pretraining

If we have a small amount of supervised data, we can use unlabelled data to get the parameters into reasonable places to model the distribution of the inputs, without knowing the labels.
Pretraining is done layer by layer so is faster than supervised training.
There are several methods:

• Contrastive divergence RBM training;

• Autoencoder;

• Greedy-layerwise [actually supervised]

but none seems necessary for large speech corpora.

Page 35:

Alternative nonlinearities

Sigmoid:   σ(z) = 1 / (1 + e^(−z))   (37)
Tanh:      σ(z) = tanh(z)            (38)
ReLU:      σ(z) = max(z, 0)          (39)
Softsign:  σ(z) = z / (1 + |z|)      (40)
Softplus:  σ(z) = log(1 + e^z)       (41)
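
The same five nonlinearities written out in NumPy:

```python
import numpy as np

def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
def tanh(z):     return np.tanh(z)
def relu(z):     return np.maximum(z, 0.0)
def softsign(z): return z / (1.0 + np.abs(z))
def softplus(z): return np.log1p(np.exp(z))
```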

Page 36:

Alternative nonlinearities

Note:

• ReLU gives sparse activations.

• The ReLU gradient is zero for x < 0 and one for x > 0, so propagated gradients don’t attenuate as much as with other nonlinearities.

• ReLU & softplus are unbounded.

• Gradients of the other nonlinearities asymptote to zero as |z| grows (they saturate).

Page 37:

Neural network variants

Many variations

• Convolutional neural networks (Abdel-Hamid et al., 2012): convolve a filter with the input; weight sharing saves parameters and gives invariance to frequency shifts.

• Recurrent neural networks: take one frame at a time but store a history of the previous frames, so could theoretically model long-term context.

• Long Short-Term Memory (Graves et al., 2013): a successful specialization of the recurrent neural network, with complex memory “cells”.

Page 38:

Recurrent neural networks

A recurrent neural network has additional output nodes which are copied back to its inputs with a time delay (Robinson et al., 1993). Training is with Back-Propagation Through Time.

[Figure: a recurrent network with inputs x_1 … x_4, outputs y_1 … y_5, and recurrent nodes r_1 … r_6 whose activations are fed back to the input with a time delay.]

Page 39:

Neural network language modelling

Model P(w_n | w_{n−1}, w_{n−2}, w_{n−3}, …) with a neural network instead of with an n-gram (pure frequency counts with back-off).
Simply train a softmax for each w_n, and use an input representation of w_{n−1}, w_{n−2}, w_{n−3}, …. Even more effectively, train a recurrent neural network (Mikolov et al., 2010).
Leads to word embeddings: a linear projection of sparse word identities (O(millions)) into a lower-dimensional (O(hundreds)) dense vector space.
Easy to add other features (class, part-of-speech).
Best performance when combined with an n-gram.
Hard to do real-time decoding, though much of the performance can be retained when knowledge is extracted and stored in a WFST (Arisoy et al., 2013).
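
A minimal sketch of a feed-forward neural network language model of this kind; the vocabulary size, embedding and hidden dimensions, and the three-word context are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nnlm_probs(context_ids, E, W_hid, W_out):
    """P(w_n | w_{n-1}, w_{n-2}, w_{n-3}).
    context_ids: the 3 previous word ids; E: (V, d) embedding matrix;
    W_hid: (h, 3*d); W_out: (V, h)."""
    x = np.concatenate([E[i] for i in context_ids])   # dense word embeddings
    h = sigmoid(W_hid @ x)
    return softmax(W_out @ h)                         # distribution over the vocabulary

rng = np.random.default_rng(0)
V, d, h = 1000, 32, 64
E = rng.normal(scale=0.1, size=(V, d))
W_hid = rng.normal(scale=0.1, size=(h, 3 * d))
W_out = rng.normal(scale=0.1, size=(V, h))
p = nnlm_probs([5, 42, 7], E, W_hid, W_out)           # p sums to 1 over the V words
```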

Page 40:

Bibliography I

Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., and Penn, G. (2012). Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In ICASSP, pages 4277–4280. IEEE.

Arisoy, E., Chen, S. F., Ramabhadran, B., and Sethy, A. (2013). Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition. In ICASSP, pages 8242–8246. IEEE.

Dahl, G., Yu, D., Deng, L., and Acero, A. (2011). Large vocabulary continuous speech recognition with context-dependent DBN-HMMs. In ICASSP.

Graves, A., Jaitly, N., and Mohamed, A. (2013). Hybrid speech recognition with deep bidirectional LSTM. In ASRU.

Grezl, Karafiat, and Cernocky (2007). Neural network topologies and bottleneck features. Speech Recognition.

Hermansky, H., Ellis, D., and Sharma, S. (2000). Tandem connectionist feature extraction for conventional HMM systems. In ICASSP.

Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29:82–97.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation.

Kingsbury, B. (2009). Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In ICASSP, pages 3761–3764.

Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In Interspeech.

Mohamed, A., Dahl, G., and Hinton, G. (2009). Deep belief networks for phone recognition. In NIPS.

Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G., and Visweswariah, K. (2008). Boosted MMI for model and feature-space discriminative training. In Proc. ICASSP.

Robinson, A. J., Almeida, L., Boite, J.-M., Bourlard, H., Fallside, F., Hochberg, M., Kershaw, D., Kohn, P., Konig, Y., Morgan, N., Neto, J. P., Renals, S., Saerens, M., and Wooters, C. (1993). A neural network based, speaker independent, large vocabulary, continuous speech recognition system: The Wernicke project. In Proc. EUROSPEECH ’93, pages 1941–1944.

Rosenblatt, F. (1957). The perceptron–a perceiving and recognizing automaton. Technical Report 85-460-1, Cornell Aeronautical Laboratory.

Rumelhart, D. E., Hinton, G., and Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.

Zeiler, M., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., and Hinton, G. (2013). On rectified linear units for speech processing. In ICASSP.
