Automatic Speech Recognition (CS753)
Lecture 20: Pronunciation Modeling
Instructor: Preethi Jyothi
Oct 16, 2017

Page 1: Title slide

Page 2:

Pronunciation Dictionary/Lexicon

• Pronunciation model/dictionary/lexicon: Lists one or more pronunciations for a word

• Typically derived from language experts: Sequence of phones written down for each word

• Dictionary construction involves:

1. Selecting what words to include in the dictionary

2. Specifying the pronunciation of each word (also checking for multiple pronunciations); a minimal lexicon-loading sketch follows below
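A pronunciation lexicon is often stored as a plain-text file with one "WORD PH1 PH2 ..." entry per line (CMUdict-style). The following minimal Python sketch loads such a file into a word-to-pronunciations map; the file format, the ";;;" comment convention and the "WORD(2)" variant marker are assumptions based on common CMUdict practice, not something specified on this slide:

from collections import defaultdict

def load_lexicon(path):
    # Returns: word -> list of pronunciations, each a list of phone symbols.
    lexicon = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";;;"):      # skip blanks and comments
                continue
            word, *phones = line.split()
            if "(" in word and word.endswith(")"):      # WORD(2), WORD(3): variants
                word = word[:word.index("(")]
            lexicon[word].append(phones)
    return lexicon

For example, lexicon["TOMATO"] might then hold two variants, ['T', 'AH', 'M', 'EY', 'T', 'OW'] and ['T', 'AH', 'M', 'AA', 'T', 'OW'].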

Page 3:

Grapheme-based models

Page 4:

Graphemes vs. Phonemes

• Instead of a pronunciation dictionary, one could represent a pronunciation as a sequence of graphemes (or letters). That is, model at the grapheme level.

• Useful technique for low-resourced/under-resourced languages

• Main advantages:

1. Avoid the need for phone-based pronunciations

2. Avoid the need for a phone alphabet

3. Works pretty well for languages with a direct link between graphemes (letters) and phonemes (sounds)

Page 5:

Grapheme-based ASR

Image from: Gales et al., Unicode-based graphemic systems for limited resource languages, ICASSP 15

Language            System     Script            Graphemes†
Kurmanji Kurdish    Alphabet   Latin             62
Tok Pisin           Alphabet   Latin             52
Cebuano             Alphabet   Latin             53
Kazakh              Alphabet   Cyrillic/Latin    126
Telugu              Abugida    Telugu            60
Lithuanian          Alphabet   Latin             62
Levantine Arabic    Abjad      Arabic            36

Table 2: Attributes of Babel Option Period 2 Languages. † the number of graphemes in the FLP, excluding apostrophe.

Table 2 shows some of the attributes of the seven languages investigated. Three different writing schemes were evaluated: Alphabet, Abugida, and Abjad. Four forms of writing script were examined: Latin, Cyrillic, Arabic and Telugu. Additionally the table gives the number of "raw" graphemes, with no mappings, that are observed in the FLP training transcriptions, or the complete Levantine Arabic training transcriptions.

Language Pack    Grapheme mapping (# graphemes)             Phn
                 —      cap     scr     atr     sgn
FLP              126    67      62      54      52          59
LLP              117    66      61      53      51          59
VLLP             95     59      54      46      44          59
ALP              81     55      51      43      42          59

Table 3: Number of unique tokens in Kazakh (302), (incrementally) removing: cap capitalisation; scr writing script; atr attributes; sgn signs.

It is interesting to see how the number of graphemes varies with the form of grapheme mapping used, and the size of the data (or LP). Table 3 shows the statistics for Kazakh, which has the greatest number of observed graphemes as both Cyrillic and Latin scripts are used. The first point to note is that going from the FLP to the ALP, 45 graphemes are not observed in the ALP compared to the FLP.

As the forms of mapping are increased: removing capitalisation; writing script; remaining grapheme attributes; and sign information, the number of graphemes decreases. However, comparing the FLP and ALP, there are still 10 graphemes not seen in the ALP. If the language model is only based on the acoustic data transcriptions this is not an issue. However, if additional language model training data is available, then acoustic models are required for these unseen graphemes. In contrast, all the phones are observed in all LPs. Note that for all the phonetic systems, diphthongs are mapped to their individual constituents.
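The kinds of grapheme mappings described above (dropping capitalisation, stripping attribute marks such as diacritics) can be prototyped with Python's standard unicodedata module. This is only an illustrative sketch under those assumptions; it does not reproduce the exact mapping rules used in the paper:

import unicodedata

def map_grapheme(ch, drop_case=True, drop_marks=True):
    # Reduce one character, roughly in the spirit of "removing capitalisation"
    # and "removing attributes" described above.
    if drop_case:
        ch = ch.lower()
    if drop_marks:
        # NFD-decompose and discard combining marks (accents, etc.).
        decomposed = unicodedata.normalize("NFD", ch)
        ch = "".join(c for c in decomposed if not unicodedata.combining(c))
    return ch

def graphemic_entry(word):
    # A graphemic "pronunciation" is just the mapped character sequence.
    return [map_grapheme(c) for c in word if not c.isspace()]

print(graphemic_entry("Café"))   # ['c', 'a', 'f', 'e']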

4. EXPERIMENTAL RESULTS

This section contrasts the performance of the proposed unicode-based graphemic systems with phonetic systems, and also with an expert-derived Levantine Arabic graphemic system. The performance using limited resources on CTS data is poor compared to using larger amounts of resources, or simpler tasks.

4.1. Acoustic and Language Models

The acoustic and language models built on the six Babel languages were built in a Babel BaseLR configuration [14]. Thus no additional information from other languages, or LPs, was used in building the systems. HTK [15] was used for training and test, with MLPs trained using QuickNet [16]. All acoustic models were constructed from a flat-start based on PLP features, including HLDA and MPE training. The decision trees used to construct the context-dependent models were based on state-specific roots. This enables unseen phones and graphemes to be synthesised and recognised, even if they do not occur in the acoustic model training data [17]. Additionally it allows rarely seen phones and graphemes to be handled without always backing off to monophone models. These baseline acoustic models were then extended to Tandem-SAT systems. Here Bottle-Neck (BN) features were derived using DNNs with PLP plus pitch and probability of voicing (PoV) obtained using the Kaldi toolkit [18]⁴. Context-dependent targets were used. These 26-dimensional BN features were added to the HLDA-projected PLP features and pitch features to yield a 71-dimensional feature vector. The baseline models for the Levantine Arabic system were identical to the Babel systems. However the Tandem-SAT system did not include any pitch or PoV features, so the final feature-vector size was 65.

For all systems only the manual transcriptions for the audio training data were used for training the language models. To give an idea of the available data for Kazakh, the number of words are: FLP 290.9K; LLP 71.2K; VLLP 25.5K; and ALP 8.8K. Trigram language models were built for all languages. For all experiments in this section, manual segmentation of the test data was used. This allows the impact of the quantity of data and lexicon to be assessed without having to consider changes in the segmentation.

4.2. Full Language Pack Systems

Language           ID    System      WER (%)
                                     Vit    CN     CNC
Kurmanji Kurdish   205   Phonetic    67.6   65.8   64.1
                         Graphemic   67.0   65.3
Tok Pisin          207   Phonetic    41.8   40.6   39.4
                         Graphemic   42.1   41.1
Cebuano            301   Phonetic    55.5   54.0   52.6
                         Graphemic   55.5   54.2
Kazakh             302   Phonetic    54.9   53.5   51.5
                         Graphemic   54.0   52.7
Telugu             303   Phonetic    70.6   69.1   67.5
                         Graphemic   70.9   69.5
Lithuanian         304   Phonetic    51.5   50.2   48.3
                         Graphemic   50.9   49.5

Table 4: Babel FLP Tandem-SAT performance: Vit Viterbi decoding, CN confusion network (CN) decoding, CNC CN-combination.

To give an idea of relative performance when all available data is used, FLP graphemic and phonetic systems were built for all six Babel languages. The results for these are shown in Table 4. For all languages the graphemic and phonetic systems yield comparable performance. It is clear that some languages, such as Kurmanji Kurdish and Telugu, are harder to recognise, with Tok Pisin (a Creole language) being the easiest. As expected, combining the phonetic and graphemic systems together yields consistent performance gains of 1.2% to 1.6% absolute over the best individual systems.

⁴ Though performance gains were obtained using FBANK features over PLP, these gains disappeared when pitch features were added in initial experiments.


Page 6:

Graphemes vs. Phonemes

• Instead of a pronunciation dictionary, one could represent a pronunciation as a sequence of graphemes (or letters)

• Useful technique for low-resourced/under-resourced languages

• Main advantages:

1. Avoid the need for phone-based pronunciations

2. Avoid the need for a phone alphabet

3. Works pretty well for languages with a direct link between graphemes (letters) and phonemes (sounds)

Page 7:

Grapheme to phoneme (G2P) conversion

Page 8:

Grapheme to phoneme (G2P) conversion

• Produce a pronunciation (phoneme sequence) given a written word (grapheme sequence)

• Learn G2P mappings from a pronunciation dictionary

• Useful for:

• ASR systems in languages with no pre-built lexicons

• Speech synthesis systems

• Deriving pronunciations for out-of-vocabulary (OOV) words (see the lookup-with-fallback sketch below)
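A tiny, hypothetical sketch of the OOV use case: look a word up in a lexicon such as the one loaded earlier, and back off to a G2P function (for instance the toy graphone decoder sketched later in these notes) when the word is missing. The function names are illustrative, not from any particular toolkit:

def pronounce(word, lexicon, g2p):
    # Return one pronunciation: the first lexicon variant if present,
    # otherwise whatever the G2P model predicts for this OOV word.
    prons = lexicon.get(word.upper())     # CMUdict-style lexica use upper case
    return prons[0] if prons else g2p(word.lower())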

Page 9:

G2P conversion (I)

• One popular paradigm: Joint sequence models [BN12]

• Grapheme and phoneme sequences are first aligned using an EM-based algorithm

• Results in a sequence of graphones (joint G-P tokens)

• N-gram models are trained on these graphone sequences

• WFST-based implementation of such a joint graphone model [Phonetisaurus] (a toy decoding sketch follows the citations below)

[BN12]: Bisani & Ney, "Joint sequence models for grapheme-to-phoneme conversion", Specom 2012. [Phonetisaurus]: J. Novak, Phonetisaurus Toolkit.
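To make the graphone idea concrete, here is a toy, self-contained Python sketch (not Phonetisaurus, and not the EM training of [BN12]): it assumes a small hand-written graphone inventory with made-up unigram log-scores and searches over segmentations of a word by dynamic programming. A real joint sequence model learns the graphone units and an n-gram model over graphone histories from a pronunciation dictionary:

import math

# Toy graphone inventory: (grapheme chunk, phoneme chunk) -> log-probability.
# These units and scores are invented purely for illustration.
GRAPHONES = {
    ("ph", "f"): -1.0, ("o", "ow"): -1.2, ("o", "aa"): -1.6,
    ("n", "n"): -0.8,  ("e", ""): -1.1,   ("t", "t"): -0.9,
    ("i", "ih"): -1.0, ("c", "k"): -1.3,  ("s", "s"): -1.0,
}
MAX_G = max(len(g) for g, _ in GRAPHONES)      # longest grapheme chunk

def g2p(word):
    # Best phoneme sequence under a unigram graphone model, by dynamic
    # programming over all segmentations of the word into grapheme chunks.
    n = len(word)
    best = [(-math.inf, None)] * (n + 1)       # per prefix: (score, backpointer)
    best[0] = (0.0, None)
    for i in range(n):
        if best[i][0] == -math.inf:
            continue
        for j in range(i + 1, min(n, i + MAX_G) + 1):
            for (g, p), logp in GRAPHONES.items():
                if g == word[i:j] and best[i][0] + logp > best[j][0]:
                    best[j] = (best[i][0] + logp, (i, p))
    phones, k = [], n                          # backtrace the best segmentation
    while k > 0:
        _, (prev, p) = best[k]
        if p:
            phones.append(p)
        k = prev
    return list(reversed(phones))

print(g2p("phonetics"))   # ['f', 'ow', 'n', 't', 'ih', 'k', 's']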

Page 10:

G2P conversion (II)

• Neural network based methods are the new state-of-the-art for G2P

• Bidirectional LSTM-based networks using a CTC output layer [Rao15]. Comparable to Ngram models.

• Incorporate alignment information [Yao15]. Beats Ngram models.

• No alignment. Encoder-decoder with attention. Beats the above systems [Toshniwal16].

Page 11:

LSTM + CTC for G2P conversion [Rao15]

4.1.1. Zero-delay

In the simplest approach, without any output delay, the input sequence is the series of graphemes and the output sequence is the series of phonemes. In the (common) case of an unequal number of graphemes and phonemes, we pad the sequence with an empty marker, φ. For example, we have:
Input: {g, o, o, g, l, e}
Output: {g, u, g, @, l, φ}

4.1.2. Fixed-delay

In this mode, we pad the output phoneme sequence with a fixed delay; this allows the LSTM to see several graphemes before outputting any phoneme, and builds a contextual window to help predict the correct phoneme. As before, in the case of unequal input and output size, we pad the sequence with φ. For example, with a fixed delay of 2, we have:
Input: {g, o, o, g, l, e, φ}
Output: {φ, φ, g, u, g, @, l}

4.1.3. Full-delay

In this approach, we allow the model to see the entire input sequence before outputting any phoneme. The input sequence is the series of graphemes followed by an end marker, ∆, and the output sequence contains a delay equal to the size of the input followed by the series of phonemes. Again we pad unequal input and output sequences with φ. For example:
Input: {g, o, o, g, l, e, ∆, φ, φ, φ, φ}
Output: {φ, φ, φ, φ, φ, φ, g, u, g, @, l}

With the full-delay setup we use an additional end marker to indicate that all the input graphemes have been seen and that the LSTM can start outputting phonemes. We discuss the impact of these various configurations of output delay on the G2P performance in Section 6.1.
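The three padding schemes can be written down in a few lines of Python. This is a reconstruction of the setup described above, not code from the paper; the strings PHI and DELTA stand in for the φ and ∆ markers:

PHI, DELTA = "<phi>", "<delta>"   # empty marker and end-of-input marker

def make_delayed_pair(graphemes, phonemes, delay):
    # Build (input, output) sequences for the zero-, fixed- and full-delay
    # setups; `delay` is an integer or the string "full".
    if delay == "full":
        inp = list(graphemes) + [DELTA]        # Delta marks "input is finished"
        d = len(graphemes)                     # wait for the whole word
    else:
        inp = list(graphemes)
        d = delay
    out = [PHI] * d + list(phonemes)
    length = max(len(inp), len(out))           # pad the shorter side with phi
    inp += [PHI] * (length - len(inp))
    out += [PHI] * (length - len(out))
    return inp, out

# Reproduces the "google" examples above for delay = 0, 2 and "full".
print(make_delayed_pair("google", ["g", "u", "g", "@", "l"], 2))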

4.2. Bidirectional models

While unidirectional models require artificial delays to build a contextual window, bidirectional LSTMs (BLSTMs) achieve this naturally as they see the entire input before outputting any phoneme. The BLSTM setup is nearly identical to the unidirectional model, but has "backward" LSTM layers (as described in [14]) which process the input in the reverse direction.

4.2.1. Deep Bidirectional LSTM

We found that deep BLSTMs (DBLSTMs) with multiple hidden layers perform slightly better than a BLSTM with a single hidden layer. The optimal performance was achieved with an architecture, shown in Figure 1, where a single input layer was fully connected to two parallel layers of 512 units each; one unidirectional and one bidirectional. This first hidden layer was fully connected to a single unidirectional layer of 128 units. The second hidden layer was connected to an output layer. The model was initialized with random weights and trained with a learning rate of 0.01.

Fig. 1. The best performing G2P neural network architecture using a DBLSTM-CTC.

4.2.2. Connectionist Temporal Classification

Along with the DBLSTM we use a connectionist temporal classification (CTC) [18] output layer, which interprets the network outputs as a probability distribution over all possible output label sequences, conditioned on the input data. The CTC objective function directly maximizes the probabilities of the correct labelings.

The CTC output layer is a softmax output layer with 41 units, one each for the 40 output phoneme labels and an additional "blank" unit. The probability of the CTC "blank" unit is interpreted as observing no label at the given time step. This is similar to the use of ε described earlier in the joint-sequence models; however, the key difference here is that this is handled implicitly by the DBLSTM-CTC model instead of having explicit alignments as in the joint-sequence models.
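For readers who want to see the shape of such a system in code, below is a compact PyTorch sketch of a bidirectional LSTM with a CTC output layer for G2P. The embedding size, the two stacked BLSTM layers, the 30-grapheme/40-phoneme inventories and the toy random batch are assumptions for illustration; the paper's exact architecture (parallel 512-unit unidirectional and bidirectional layers feeding a 128-unit layer) is not reproduced:

import torch
import torch.nn as nn

class G2PBLSTMCTC(nn.Module):
    # Grapheme ids in, per-step log-posteriors over phonemes + CTC blank out.
    def __init__(self, n_graphemes=30, n_phonemes=40, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_graphemes, 64)
        self.blstm = nn.LSTM(64, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_phonemes + 1)   # +1 for "blank"

    def forward(self, grapheme_ids):                  # (batch, T_g)
        h, _ = self.blstm(self.embed(grapheme_ids))
        return self.proj(h).log_softmax(-1)           # (batch, T_g, n_phonemes + 1)

model = G2PBLSTMCTC()
ctc = nn.CTCLoss(blank=0)                             # blank = "no label emitted here"
graphemes = torch.randint(1, 30, (8, 12))             # fake batch: 8 words, 12 letters
targets = torch.randint(1, 41, (8, 6))                # fake phoneme ids, 6 per word
log_probs = model(graphemes).transpose(0, 1)          # CTCLoss expects (T, batch, C)
loss = ctc(log_probs, targets,
           torch.full((8,), 12, dtype=torch.long),    # input lengths
           torch.full((8,), 6, dtype=torch.long))     # target lengths
loss.backward()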

4.3. Combination G2P Implementation

LSTMs and joint n-gram models are two very different approaches to G2P modeling, since LSTMs model the G2P task at the full sequence (word) level instead of the n-gram (grapheme) level. These two models may generalize in different ways and a combination of both approaches may result in a better overall model. We combine both models by representing the output of the LSTM G2P as a finite state transducer (FST) and then intersecting it with the output of the n-gram model, which is also represented as an FST. We select the single best path in the resulting FST, which corresponds to a single best pronunciation. (We did not find any significant gains by using a scaling factor between the two models.)

5. EXPERIMENTS

In this paper, we report G2P performance on the publicly available CMU pronunciation dictionary. We evaluate performance using phoneme error rate (PER) and word error rate (WER) metrics. PER is defined as the number of insertions, deletions and substitutions divided by the number of true phonemes, while WER is the number of word errors divided by the total number of words. The CMU dataset contains 106,837 words, and of these we construct a development set of 2,670 words to determine stopping criteria while training, and a test set of 12,000 words. We use the same training and testing split as found in [12, 7, 4] and thus the results are directly comparable.

6. RESULTS AND DISCUSSION

6.1. Impact of Output Delay

Table 1 compares the performance of unidirectional models with varying output delays. As expected, we find that when using fixed delays, increasing the size of the delay helps, and that full delay outperforms any fixed delay. This confirms the importance of exploiting future context for the G2P task.

Output Delay    Phoneme Error Rate (%)
0               32.0
3               10.2
4               9.8
5               9.5
7               9.5
Full-delay      9.1

Table 1. Accuracy of ULSTM G2P with output delays.

6.2. Impact of CTC and Bi-directional Modeling

Table 2 compares LSTM models to various approaches proposed in the literature. The numbers reported for the LSTM are raw outputs, i.e. we do not decode the output with any language model. In our experiments, we found that while unidirectional models benefitted from decoding with a phoneme language model (which we implemented as another LSTM trained on the same training data), the BLSTM with CTC outputs did not see any improvement with the additional phoneme language model, likely because it already memorizes and enforces contextual dependencies similar to those imposed by an external language model.

Model                                  Word Error Rate (%)
Galescu and Allen [4]                  28.5
Chen [7]                               24.7
Bisani and Ney [2]                     24.5
Novak et al. [6]                       24.4
Wu et al. [12]                         23.4
5-gram FST                             27.2
8-gram FST                             26.5
Unidirectional LSTM with Full-delay    30.1
DBLSTM-CTC 128 Units                   27.9
DBLSTM-CTC 512 Units                   25.8
DBLSTM-CTC 512 + 5-gram FST            21.3

Table 2. Comparison of various G2P technologies.

The table shows that BLSTM architectures outperform unidirectional LSTMs, and also that they compare favorably to WFST-based n-gram models (25.8% WER vs 26.5%). Furthermore, a combination of the two technologies as described in 4.3 outperforms both models, and other approaches proposed in the literature.

Table 3 compares the sizes of some of the models we trained and also their execution time in terms of average number of milliseconds per word. It shows that BLSTM architectures are quite competitive with n-gram models: the 128-unit BLSTM which performs at about the same level of accuracy as the 5-gram model is 10 times smaller and twice as fast, and the 512-unit model remains extremely compact if arguably a little slow (no special attempt was made so far at optimizing our LSTM code for speed, so this is less of a concern). This makes LSTM G2Ps quite appealing for on-device implementations.

Model                    Model Size    Model Speed
5-gram FST               30 MB         35 ms/word
8-gram FST               130 MB        30 ms/word
DBLSTM-CTC 128 Units     3 MB          12 ms/word
DBLSTM-CTC 512 Units     11 MB         64 ms/word

Table 3. Model size and speed for n-gram and LSTM G2P.

7. CONCLUSION

We suggested LSTM-based architectures to perform G2P conversions. We approached the problem as a word-to-pronunciation sequence transcription problem, in contrast to the traditional joint grapheme-to-phoneme modeling approach, and thus do not require explicit grapheme-to-phoneme alignment for training. We trained unidirectional models with various output delays to capture some amount of future context, and found that models with greater contextual information perform better. We also trained deep BLSTM models

[Rao15] Grapheme-to-phoneme conversion using LSTM RNNs, ICASSP 2015

Page 12:

G2P conversion (II)

• Neural network based methods are the new state-of-the-art for G2P

• Bidirectional LSTM-based networks using a CTC output layer [Rao15]. Comparable to Ngram models.

• Incorporate alignment information [Yao15]. Beats Ngram models.

• No alignment. Encoder-decoder with attention. Beats the above systems [Toshniwal16].

Page 13:

Seq2seq models (with alignment information [Yao15])

Figure 1: An encoder-decoder LSTM with two layers. The encoder LSTM, to the left of the dotted line, reads a time-reversed sequence "⟨s⟩ T A C" and produces the last hidden layer activation to initialize the decoder LSTM. The decoder LSTM, to the right of the dotted line, reads "⟨os⟩ K AE T" as the past phoneme prediction sequence and uses "K AE T ⟨/os⟩" as the output sequence to generate. Notice that the input sequence for the encoder LSTM is time reversed, as in [5]. ⟨s⟩ denotes the letter-side sentence beginning. ⟨os⟩ and ⟨/os⟩ are the output-side sentence begin and end symbols.

Following [21, 22], Eq. (1) can be estimated using an exponential (or maximum entropy) model of the form

p(p_t \mid x = (p_{t-k}^{t-1}, l_{t-k}^{t+k})) = \frac{\exp\big(\sum_i \lambda_i f_i(x, p_t)\big)}{\sum_q \exp\big(\sum_i \lambda_i f_i(x, q)\big)}    (2)

where the features f_i(·) are usually 0 or 1, indicating the identities of phones and letters in specific contexts.

Joint modeling has been proposed for grapheme-to-phoneme conversion [20, 21, 23]. In these models, one has a vocabulary of grapheme and phoneme pairs, which are called graphones. The probability of a graphone sequence is

p(C = c_1 \cdots c_T) = \prod_{t=1}^{T} p(c_t \mid c_1 \cdots c_{t-1})    (3)

where each c is a graphone unit. The conditional probability p(c_t \mid c_1 \cdots c_{t-1}) is estimated using an n-gram language model.

To date, these models have produced the best performance on common benchmark datasets, and they are used for comparison with the architectures in the following sections.

3. Side-conditioned Generation Models

In this section, we explore the use of side-conditioned language models for generation. This approach is appealing for its simplicity, and especially because no explicit alignment information is needed.

3.1. Encoder-decoder LSTM

In the context of general sequence-to-sequence learning, the concept of encoder and decoder networks has recently been proposed [3, 5, 19, 24, 25]. The main idea is mapping the entire input sequence to a vector, and then using a recurrent neural network (RNN) to generate the output sequence conditioned on the encoding vector. Our implementation follows the method in [5], which we denote as encoder-decoder LSTM. Figure 1 depicts a model of this method. As in [5], we use an LSTM [19] as the basic recurrent network unit because it has shown better performance than simple RNNs on language understanding [26] and acoustic modeling [27] tasks.

In this method, there are two sets of LSTMs: one is an encoder that reads the source-side input sequence and the other

Figure 2: The uni-directional LSTM reads the letter sequence "⟨s⟩ C A T ⟨/s⟩" and the past phoneme prediction "⟨os⟩ ⟨os⟩ K AE T". It outputs the phoneme sequence "⟨os⟩ K AE T ⟨/os⟩". Note that there are separate output-side begin- and end-of-sentence symbols, prefixed by "o".

is a decoder that functions as a language model and generates the output. The encoder is used to represent the entire input sequence in the last-time hidden layer activities. These activities are used as the initial activities of the decoder network. The decoder is a language model that uses the past phoneme sequence φ_1^{t-1} to predict the next phoneme φ_t, with its hidden state initialized as described. It stops predicting after outputting ⟨/os⟩, the output-side end-of-sentence symbol. Note that in our models, we use ⟨s⟩ and ⟨/s⟩ as input-side begin-of-sentence and end-of-sentence tokens, and ⟨os⟩ and ⟨/os⟩ for the corresponding output symbols.

To train these encoder and decoder networks, we used back-propagation through time (BPTT) [28, 29], with the error signal originating in the decoder network.

We use a beam search decoder to generate the phoneme sequence during the decoding phase. The hypothesis sequence with the highest posterior probability is selected as the decoding result.
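A minimal PyTorch sketch of the encoder-decoder LSTM just described: the encoder's final hidden state initialises the decoder, which then predicts phonemes one at a time conditioned on its own previous output. Greedy decoding is shown instead of the paper's beam search, the input is not time-reversed, and the vocabulary sizes, dimensions and special-symbol ids are illustrative assumptions:

import torch
import torch.nn as nn

class EncoderDecoderG2P(nn.Module):
    def __init__(self, n_graphemes=30, n_phonemes=42, hidden=256):
        super().__init__()
        self.g_embed = nn.Embedding(n_graphemes, hidden)
        self.p_embed = nn.Embedding(n_phonemes, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, graphemes, max_len=30, bos=1):
        # Encode the whole letter sequence; its final state seeds the decoder,
        # which then runs as a phoneme language model (greedy, beam width 1).
        _, state = self.encoder(self.g_embed(graphemes))
        prev = torch.full((graphemes.size(0), 1), bos, dtype=torch.long)
        outputs = []
        for _ in range(max_len):
            dec, state = self.decoder(self.p_embed(prev), state)
            prev = self.out(dec).argmax(-1)           # most likely next phoneme
            outputs.append(prev)
        return torch.cat(outputs, dim=1)              # truncate at ⟨/os⟩ downstream

word = torch.randint(3, 30, (1, 6))                   # one fake six-letter word
print(EncoderDecoderG2P()(word).shape)                # torch.Size([1, 30])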

4. Alignment Based Models

In this section, we relax the earlier constraint that the model translates directly from the source-side letters to the target-side phonemes without the benefit of an explicit alignment.

4.1. Uni-directional LSTM

A model of the uni-directional LSTM is in Figure 2. Given a pair of source-side input and target-side output sequences and an alignment A, the posterior probability of the output sequence given the input sequence is

p(\phi_1^T \mid A, l_1^T) = \prod_{t=1}^{T} p(\phi_t \mid \phi_1^{t-1}, l_1^t)    (4)

where the current phoneme prediction φ_t depends both on its past prediction φ_{t-1} and the input letter sequence l_1^t. Because of the recurrence in the LSTM, the prediction of the current phoneme depends on the phoneme predictions and letter sequence from the sentence beginning. Decoding uses the same beam search decoder described in Sec. 3.

4.2. Bi-directional LSTM

The bi-directional recurrent neural network was proposed in [30]. In this architecture, one RNN processes the input from


Figure 3: The bi-directional LSTM reads the letter sequence "⟨s⟩ C A T ⟨/s⟩" for the forward-directional LSTM, the time-reversed sequence "⟨/s⟩ T A C ⟨s⟩" for the backward-directional LSTM, and the past phoneme prediction "⟨os⟩ ⟨os⟩ K AE T". It outputs the phoneme sequence "⟨os⟩ K AE T ⟨/os⟩".

left-to-right, while another processes it right-to-left. The outputs of the two sub-networks are then combined, for example by being fed into a third RNN. The idea has been used for speech recognition [30] and more recently for language understanding [31]. Bi-directional LSTMs have been applied to speech recognition [19] and machine translation [6].

In the bi-directional model, the phoneme prediction depends on the whole source-side letter sequence as follows:

p(\phi_1^T \mid A, l_1^T) = \prod_{t=1}^{T} p(\phi_t \mid \phi_1^{t-1}, l_1^T)    (5)

Figure 3 illustrates this model. Focusing on the third set of inputs, for example, letter l_t = A is projected to a hidden layer, together with the past phoneme prediction φ_{t-1} = K. The letter l_t = A is also projected to a hidden layer in the network that runs in the backward direction. The hidden layer activation from the forward and backward networks is then used as the input to a final network running in the forward direction. The output of the topmost recurrent layer is used to predict the current phoneme φ_t = AE.

We found that performance is better when feeding the past phoneme prediction to the bottom LSTM layer, instead of other layers such as the softmax layer. However, this architecture can be further extended, e.g., by feeding the past phoneme predictions to both the top and bottom layers, which we may investigate in future work.

In the figure, we draw one layer of bi-directional LSTMs. In Section 5, we also report results for deeper networks, in which the forward and backward layers are duplicated several times; each layer in the stack takes the concatenated outputs of the forward-backward networks below as its input.

Note that the backward-direction LSTM is independent of the past phoneme predictions. Therefore, during decoding, we first pre-compute its activities. We then treat the output from the backward-direction LSTM as additional input to the top-layer LSTM that also has input from the lower-layer forward-direction LSTM. The same beam search decoder described before can then be used.

5. Experiments

5.1. Datasets

Our experiments were conducted on three US English datasets¹: the CMUDict, NetTalk, and Pronlex datasets that have been evaluated in [20, 21]. We report phoneme error rate (PER) and word error rate (WER)². In the phoneme error rate computation, following [20, 21], in the case of multiple reference pronunciations, the variant with the smallest edit distance is used. Similarly, if there are multiple reference pronunciations for a word, a word error occurs only if the predicted pronunciation doesn't match any of the references.

The CMUDict contains 107877 training words, 5401 validation words, and 12753 words for testing. The Pronlex data contains 83182 words for training, 1000 words for validation, and 4800 words for testing. The NetTalk data contains 14985 words for training and 5002 words for testing, and does not have a validation set.

5.2. Training details

For the CMUDict and Pronlex experiments, all meta-parameters were set via experimentation with the validation set. For the NetTalk experiments, we used the same model structures as with the Pronlex experiments.

To generate the alignments used for training the alignment-based methods of Sec. 4, we used the alignment package of [32]. We used BPTT to train the LSTMs. We used sentence-level minibatches without truncation. To speed up training, we used data parallelism with 100 sentences per minibatch, except for the CMUDict data, where one sentence per minibatch gave the best performance on the development data. For the alignment-based methods, we sorted sentences according to their lengths, and each minibatch had sentences of the same length. For encoder-decoder LSTMs, we didn't sort sentences by length as done in the alignment-based methods, and instead followed [5].

For the encoder-decoder LSTM in Sec. 3, we used 500-dimensional projection and hidden layers. When increasing the depth of the encoder-decoder LSTMs, we increased the depth of both the encoder and decoder networks. For the bi-directional LSTMs, we used a 50-dimensional projection layer and a 300-dimensional hidden layer. For the uni-directional LSTM experiments on CMUDict, we used a 400-dimensional projection layer, a 400-dimensional hidden layer, and the above described data parallelism.

For both encoder-decoder LSTMs and the alignment-based methods, we randomly permuted the order of the training sentences in each epoch. We found that the encoder-decoder LSTM needed to start from a small learning rate, approximately 0.007 per sample. For bi-directional LSTMs, we used initial learning rates of 0.1 or 0.2. For the uni-directional LSTM, the initial learning rate was 0.05. The learning rate was controlled by monitoring the improvement of cross-entropy scores on validation sets. If there was no improvement of the cross-entropy score, we halved the learning rate. The NetTalk dataset doesn't have a validation set. Therefore, on NetTalk, we first ran 10 iterations with a fixed per-sample learning rate of 0.1, reduced the learning rate by half for 2 more iterations, and finally used 0.01 for 70 iterations.

¹ We thank Stanley F. Chen, who kindly shared the data set partition he used in [21].

² We observed a strong correlation of BLEU and WER scores on these tasks. Therefore we didn't report BLEU scores in this paper.

Method                                  PER (%)   WER (%)
encoder-decoder LSTM                    7.53      29.21
encoder-decoder LSTM (2 layers)         7.63      28.61
uni-directional LSTM                    8.22      32.64
uni-directional LSTM (window size 6)    6.58      28.56
bi-directional LSTM                     5.98      25.72
bi-directional LSTM (2 layers)          5.84      25.02
bi-directional LSTM (3 layers)          5.45      23.55

Table 2: Results on the CMUDict dataset.

The models of Secs. 3 and 4 require using a beam search decoder. Based on validation results, we report results with a beam width of 1.0 in likelihood. We did not observe an improvement with larger beams. Unless otherwise noted, we used a window of 3 letters in the models. We plan to release our training recipes to the public through the Computational Network Toolkit (CNTK) [33].

5.3. Results

We first report results for all our models on the CMUDict dataset [21]. The first two lines of Table 2 show results for the encoder-decoder models. While the error rates are reasonable, the best previously reported result of 24.53% WER [20] is somewhat better. While it is possible that combining multiple systems as in [5] would achieve the same result, we have chosen not to engage in system combination.

The effect of using alignment-based models is shown at the bottom of Table 2. Here, the bi-directional models produce an unambiguous improvement over the earlier models, and by training a three-layer bi-directional LSTM, we are able to significantly exceed the previous state of the art.

We noticed that the uni-directional LSTM with the default window size had the highest WER, perhaps because one does not observe the entire input sequence as is the case with both the encoder-decoder and bi-directional LSTMs. To validate this claim, we increased the window size to 6 to include the current and five future letters as its source-side input. Because the average number of letters is 7.5 on the CMUDict dataset, the uni-directional model in many cases thus sees the entire letter sequence. With a window size of 6 and additional information from the alignments, the uni-directional model was able to perform better than the encoder-decoder LSTM.

5.4. Comparison with past results

We now present additional results for the NetTalk and Pronlex datasets, and compare with the best previous results. The method of [20] uses 9-gram graphone models, and [21] uses an 8-gram maximum entropy model.

Changes in WER of 0.77, 1.30, and 1.27 for the CMUDict, NetTalk and Pronlex datasets respectively are significant at the 95% confidence level. For PER, the corresponding values are 0.15, 0.29, and 0.28. On both the CMUDict and NetTalk datasets, the bi-directional LSTM outperforms the previous results at the 95% significance level.

6. Related Work

Grapheme-to-phoneme conversion has important applications in text-to-speech and speech recognition. It has been well studied in the past decades. Although many methods have been proposed in the past, the best performance on the standard dataset so far

Data      Method                  PER (%)   WER (%)
CMUDict   past results [20]       5.88      24.53
          bi-directional LSTM     5.45      23.55
NetTalk   past results [20]       8.26      33.67
          bi-directional LSTM     7.38      30.77
Pronlex   past results [20, 21]   6.78      27.33
          bi-directional LSTM     6.51      26.69

Table 3: The PERs and WERs using the bi-directional LSTM in comparison to the previous best performances in the literature.

was achieved using a joint sequence model [20] of grapheme-phoneme joint multi-grams, or graphones, and a maximum entropy model [21].

To the best of our knowledge, our methods are the first single neural-network-based systems that outperform the previous state-of-the-art methods [20, 21] on these common datasets. While it is possible to improve performance by combining multiple systems and methods [34, 35], we have chosen not to engage in building hybrid models.

Our work can be cast in the general sequence-to-sequence translation category, which includes tasks such as machine translation and speech recognition. Therefore, perhaps the most closely related work is [6]. However, in contrast to the marginal gains in their bi-directional models, our model obtained significant gains from using bi-directional information. Also, their work doesn't include experimenting with deeper structures, which we found beneficial. We plan to conduct machine translation tasks to compare our models and theirs.

7. Conclusion

In this paper, we have applied both encoder-decoder neural networks and alignment-based models to the grapheme-to-phoneme task. The encoder-decoder models have the significant advantage of not requiring a separate alignment step. Performance with these models comes close to the best previous alignment-based results. When we go further and inform a bi-directional neural network model with alignment information, we are able to make significant advances over previous methods.

8. References

[1] L. H. Son, A. Allauzen, and F. Yvon, "Continuous space translation models with neural networks," in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2012, pp. 39–48.

[2] M. Auli, M. Galley, C. Quirk, and G. Zweig, "Joint language and translation modeling with recurrent neural networks," in EMNLP, 2013, pp. 1044–1054.

[3] N. Kalchbrenner and P. Blunsom, "Recurrent continuous translation models," in EMNLP, 2013.

[4] J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul, "Fast and robust neural network joint models for statistical machine translation," in ACL, 2014.

[5] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in NIPS, 2014.


[Yao15] Sequence-to-sequence neural net models for G2P conversion, Interspeech 2015

Page 14:

G2P conversion (II)

• Neural network based methods are the new state-of-the-art for G2P

• Bidirectional LSTM-based networks using a CTC output layer [Rao15]. Comparable to Ngram models.

• Incorporate alignment information [Yao15]. Beats Ngram models.

• No alignment. Encoder-decoder with attention. Beats the above systems [Toshniwal16].

[Rao15] Grapheme-to-phoneme conversion using LSTM RNNs, ICASSP 2015 [Yao15] Sequence-to-sequence neural net models for G2P conversion, Interspeech 2015

[Toshniwal16] Jointly learning to align and convert graphemes to phonemes with neural attention models, SLT 2016.

Page 15:

Encoder-decoder + attention for G2P [Toshniwal16]

[Toshniwal16] Jointly learning to align and convert graphemes to phonemes with neural attention models, SLT 2016.

LSTM network with explicit alignments. Among these previous approaches, the best performance by a single model is obtained by Yao and Zweig's alignment-based approach, although Rao et al. obtain even better performance on one data set by combining their LSTM model with an (alignment-based) n-gram model.

In this paper, we explore the use of attention in the encoder-decoder framework as a way of removing the dependency on alignments. The use of a neural attention model was first explored by Bahdanau et al. for machine translation [7] (though a precursor of this model was the windowing approach of Graves [14]), which has since been applied to a variety of tasks including speech recognition [8] and image caption generation [9]. The G2P problem is in fact largely analogous to the translation problem, with a many-to-many mapping between subsequences of input labels and subsequences of output labels, and with potentially long-range dependencies (as in the effect of the final "e" in paste on the pronunciation of the "a"). In experiments presented below, we find that this type of attention model indeed removes our dependency on an external aligner and achieves improved performance on standard data sets.

3. MODEL

We next describe the main components of our models both without and with attention.

Fig. 1. A global attention encoder-decoder model reading the input sequence x_1, · · · , x_{T_g} and outputting the sequence y_1, · · · , y_t, · · · (the diagram shows the encoder states h_i, the attention layer with weights α_t and context vector c_t, and the decoder state d_t).

3.1. Encoder-decoder models

We briefly describe the encoder-decoder ("sequence-to-sequence") approach, as proposed by [13]. An encoder-decoder model includes an encoder, which reads in the input (grapheme) sequence, and a decoder, which generates the output (phoneme) sequence. A typical encoder-decoder model is shown in Figure 1. In our model, the encoder is a bidirectional long short-term memory (BiLSTM) network; we use a bidirectional network in order to capture the context on both sides of each grapheme. The encoder takes as input the grapheme sequence, represented as a sequence of vectors x = (x_1, · · · , x_{T_g}), obtained by multiplying the one-hot vectors representing the input characters with a character embedding matrix which is learned jointly with the rest of the model. The encoder computes a sequence of hidden state vectors, h = (h_1, · · · , h_{T_g}),

given by:

\overrightarrow{h}_i = f(x_i, \overrightarrow{h}_{i-1})
\overleftarrow{h}_i = f'(x_i, \overleftarrow{h}_{i+1})
h_i = (\overrightarrow{h}_i ; \overleftarrow{h}_i)

We use separate stacked (deep) LSTMs to model f and f'.³ A "context vector" c is computed from the encoder's state sequence:

c = q(\{h_1, · · · , h_{T_g}\})

In our case, we use a linear combination of \overrightarrow{h}_{T_g} and \overleftarrow{h}_1, with parameters learned during training. Since our models are stacked, we carry out this linear combination at every layer.

This context vector is passed as an input to the decoder. The decoder, g(·), is modeled as another stacked (unidirectional) LSTM, which predicts each phoneme y_t given the context vector c and all of the previously predicted phonemes \{y_1, · · · , y_{t-1}\} in the following way:

d_t = g(y_{t-1}, d_{t-1}, c)
p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_s d_t + b_s)

where d_{t-1} is the hidden state of the decoder LSTM and y_{t-1} is the vector obtained by projecting the one-hot vector corresponding to y_{t-1} using a phoneme embedding matrix E. The embedding matrix E is jointly learned with the other parameters of the model. In basic encoder-decoder models, the context vector c is just used as the initial state for the decoder LSTM, d_0 = c, and is not used after that.

3.2. Global Attention

One of the important extensions of encoder-decoder models is the use of an attention mechanism to adapt the context vector c for every output label prediction [7]. Rather than just using the context vector as an initial state for the decoder LSTM, we use a different context vector c_t at every decoder time step, where c_t is a linear combination of all of the encoder hidden states. The choice of initial state for the decoder LSTM is now less important; we simply use the last hidden state of the encoder's backward LSTM. The ability to attend to different encoder states when decoding each output label means that the attention mechanism can be seen as a soft alignment between the input (grapheme) sequence and the output (phoneme) sequence. We use the attention mechanism proposed by [16], where the context vector c_t at time t is given by:

u_{it} = v^T \tanh(W_1 h_i + W_2 d_t + b_a)
\alpha_t = \mathrm{softmax}(u_t)
c_t = \sum_{i=1}^{T_g} \alpha_{it} h_i

where the vectors v, b_a and the matrices W_1, W_2 are parameters learned jointly with the rest of the encoder-decoder model. The score α_{it} is a weight that represents the importance of the hidden encoder state h_i in generating the phoneme y_t. It should be noted that the vector h_i is really a stack of vectors and for attention calculations we only use its top layer.

The decoder then uses c_t in the following way:

p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_s [c_t ; d_t] + b_s)

³ For brevity we exclude the LSTM equations. The details can be found in Zaremba et al. [15].
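The attention equations above translate almost line for line into code. The PyTorch sketch below computes u_{it}, α_t and c_t for a batch of encoder states h and a single decoder state d_t; the dimensions are arbitrary choices and masking of padded positions is omitted:

import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    # c_t = sum_i alpha_{it} h_i, with u_{it} = v^T tanh(W1 h_i + W2 d_t + b_a).
    def __init__(self, enc_dim, dec_dim, att_dim):
        super().__init__()
        self.W1 = nn.Linear(enc_dim, att_dim, bias=False)
        self.W2 = nn.Linear(dec_dim, att_dim)        # its bias plays the role of b_a
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, h, d_t):
        # h: (batch, T_g, enc_dim) encoder states; d_t: (batch, dec_dim).
        u_t = self.v(torch.tanh(self.W1(h) + self.W2(d_t).unsqueeze(1)))
        alpha_t = torch.softmax(u_t, dim=1)          # weights over the T_g positions
        c_t = (alpha_t * h).sum(dim=1)               # (batch, enc_dim)
        return c_t, alpha_t.squeeze(-1)

attn = GlobalAttention(enc_dim=512, dec_dim=256, att_dim=128)
h, d_t = torch.randn(4, 9, 512), torch.randn(4, 256)   # 4 words, 9 graphemes each
c_t, alpha_t = attn(h, d_t)
print(c_t.shape, alpha_t.shape)                         # (4, 512) and (4, 9)

The decoder would then form softmax(W_s [c_t ; d_t] + b_s) over the phoneme inventory, as in the final equation above.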

Page 16:

Encoder-decoder + attention for G2P [Toshniwal16]

[Toshniwal16] Jointly learning to align and convert graphemes to phonemes with neural attention models, SLT 2016.

LSTM network with explicit alignments. Among these previous ap-proaches, the best performance by a single model is obtained by Yaoand Zweig’s alignment-based approach, although Rao et al. obtaineven better performance on one data set by combining their LSTMmodel with an (alignment-based) n-gram model.

In this paper, we explore the use of attention in the encoder-decoder framework as a way of removing the dependency on align-ments. The use of a neural attention model was first explored byBahdanau et al. for machine translation [7] (though a precursor ofthis model was the windowing approach of Graves [14]), which hassince been applied to a variety of tasks including speech recogni-tion [8] and image caption generation [9]. The G2P problem is in factlargely analogous to the translation problem, with a many-to-manymapping between subsequences of input labels and subsequences ofoutput labels and with potentially long-range dependencies (as in theeffect of the final “e” in paste on the pronunciation of the “a”). In ex-periments presented below, we find that this type of attention modelindeed removes our dependency on an external aligner and achievesimproved performance on standard data sets.

3. MODEL

We next describe the main components of our models both withoutand with attention.

ct

↵t

yt

dt

h1

x1x2xTg

h2h3hTg

Attention Layer

Encoder

x3

Decoder

Fig. 1. A global attention encoder-decoder model reading the inputsequence x1, · · · , xTg and outputting the sequence y1, · · · , yt, · · ·

3.1. Encoder-decoder models

We briefly describe the encoder-decoder (“sequence-to-sequence”)approach, as proposed by [13]. An encoder-decoder model includesan encoder, which reads in the input (grapheme) sequence, and adecoder, which generates the output (phoneme) sequence. A typ-ical encoder-decoder model is shown in Figure 1. In our model,the encoder is a bidirectional long short-term memory (BiLSTM)network; we use a bidirectional network in order to capture thecontext on both sides of each grapheme. The encoder takes as in-put the grapheme sequence, represented as a sequence of vectorsx = (x1, · · · ,xTg ), obtained by multiplying the one-hot vectorsrepresenting the input characters with a character embedding matrixwhich is learned jointly with the rest of the model. The encodercomputes a sequence of hidden state vectors, h = (h1, · · · ,hTg ),

given by:�!hi = f(xi,

��!hi�1)

�hi = f

0(xi, ��hi+1)

hi = (�!hi; �hi)

We use separate stacked (deep) LSTMs to model f and f

0.3 A “con-text vector” c is computed from the encoder’s state sequence:

c = q({h1, · · · ,hTg})

In our case, we use a linear combination of��!hTg and

�h1, with pa-

rameters learned during training. Since our models are stacked, wecarry out this linear combination at every layer.

This context vector is passed as an input to the decoder. Thedecoder, g(·), is modeled as another stacked (unidirectional) LSTM,which predicts each phoneme yt given the context vector c and all ofthe previously predicted phonemes {y1, · · · , yt�1} in the followingway:

dt = g(yt�1,dt�1, c)

p(yt|y<t,x) = softmax(Wsdt + bs)

where dt�1 is the hidden state of the decoder LSTM and yt�1 is thevector obtained by projecting the one hot vector corresponding toyt�1 using a phoneme embedding matrix E. The embedding matrixE is jointly learned with other parameters of the model. In basicencoder-decoder models, the context vector c is just used as an initialstate for the decoder LSTM, d0 = c, and is not used after that.

3.2. Global Attention

One of the important extensions of encoder-decoder models is theuse of attention mechanism to adapt the context vector c for everyoutput label prediction [7]. Rather than just using the context vectoras an initial state for the decoder LSTM, we use a different contextvector ct at every decoder time step, where ct is a linear combinationof all of the encoder hidden states. The choice of initial state forthe decoder LSTM is now less important; we simply use the lasthidden state of the encoder’s backward LSTM. The ability to attendto different encoder states when decoding each output label meansthat the attention mechanism can be seen as a soft alignment betweenthe input (grapheme) sequence and output (phoneme) sequence. Weuse the attention mechanism proposed by [16], where the contextvector ct at time t is given by:

uit = v

T tanh(W1hi +W2dt + ba)

↵t = softmax(ut)

ct =

TgX

i=1

↵ithi

where the vectors v, ba and the matrices W1,W2 are parameterslearned jointly with the rest of the encoder-decoder model. The score↵it is a weight that represents the importance of the hidden encoderstate hi in generating the phoneme yt. It should be noted that thevector hi is really a stack of vectors and for attention calculationswe only use its top layer.

The decoder then uses c_t in the following way:

p(y_t | y_{<t}, x) = softmax(W_s [c_t; d_t] + b_s)
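The attention step itself can be sketched directly from the equations above; the module name GlobalAttention and the attention dimension are assumptions, not the paper's code:

    import torch
    import torch.nn as nn

    class GlobalAttention(nn.Module):
        """u_it = v^T tanh(W1 h_i + W2 d_t + b_a); alpha_t = softmax(u_t); c_t = sum_i alpha_it h_i."""
        def __init__(self, enc_dim, dec_dim, attn_dim=128):
            super().__init__()
            self.W1 = nn.Linear(enc_dim, attn_dim, bias=False)
            self.W2 = nn.Linear(dec_dim, attn_dim)   # this layer's bias plays the role of b_a
            self.v = nn.Linear(attn_dim, 1, bias=False)

        def forward(self, enc_states, dec_state):
            # enc_states: (B, T_g, enc_dim) top-layer encoder states; dec_state: (B, dec_dim)
            scores = self.v(torch.tanh(self.W1(enc_states) + self.W2(dec_state).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)      # (B, T_g, 1): soft alignment weights
            c_t = (alpha * enc_states).sum(dim=1)     # (B, enc_dim): context vector c_t
            return c_t, alpha.squeeze(-1)

The decoder would then concatenate c_t with its state d_t and apply the output layer to obtain p(y_t | y_{<t}, x) = softmax(W_s [c_t; d_t] + b_s).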

³ For brevity we exclude the LSTM equations. The details can be found in Zaremba et al. [15].

Data      Method                                                 PER (%)        WER (%)
CMUDict   BiDir LSTM + Alignment [6]                             5.45           23.55
          DBLSTM-CTC [5]                                         -              25.8
          DBLSTM-CTC + 5-gram model [5]                          -              21.3
          Encoder-decoder + global attn                          5.04 ± 0.03    21.69 ± 0.21
          Encoder-decoder + local-m attn                         5.11 ± 0.03    21.85 ± 0.21
          Encoder-decoder + local-p attn                         5.39 ± 0.04    22.83 ± 0.22
          Ensemble of 5 [Encoder-decoder + global attn] models   4.69           20.24
Pronlex   BiDir LSTM + Alignment [6]                             6.51           26.69
          Encoder-decoder + global attn                          6.24 ± 0.1     25.39 ± 0.61
          Encoder-decoder + local-m attn                         5.99 ± 0.11    24.23 ± 0.42
          Encoder-decoder + local-p attn                         6.49 ± 0.06    25.64 ± 0.42
NetTalk   BiDir LSTM + Alignment [6]                             7.38           30.77
          Encoder-decoder + global attn                          7.14 ± 0.72    29.20 ± 2.18
          Encoder-decoder + local-m attn                         7.13 ± 0.11    29.67 ± 0.49
          Encoder-decoder + local-p attn                         8.41 ± 0.19    32.32 ± 0.41

Table 1. Comparison of our models’ performance with the best previous results for CMUDict, Pronlex and NetTalk. ± indicates the standard deviation across 5 training runs of the model.

4.4. Inference

We use a greedy decoder (beam size = 1) to decode the phoneme sequence during inference. That is, at each decoding time step we take the output phone to be the argmax of the softmax output of the decoder at that step. We saw no reliable gains from using beam search with any beam size greater than 1.
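A sketch of this greedy loop, with a hypothetical step_fn wrapping one step of a trained decoder (plus attention) and placeholder start/end token ids:

    def greedy_decode(step_fn, init_state, sos_id, eos_id, max_len=50):
        """Greedy (beam size = 1) decoding: emit the argmax phoneme at each step.
        step_fn(prev_phoneme_id, state) -> (scores, new_state) is assumed to wrap
        one decoder step; scores is a list/array of per-phoneme softmax outputs."""
        state, prev, output = init_state, sos_id, []
        for _ in range(max_len):
            scores, state = step_fn(prev, state)
            prev = max(range(len(scores)), key=lambda i: scores[i])  # argmax phone
            if prev == eos_id:
                break
            output.append(prev)
        return output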

4.5. Results

Table 1 presents the main results with our tuned models on the three test sets, compared to the best previously reported results. For all three data sets, the best prior results to our knowledge with a single model are those of Yao and Zweig [6] with an alignment-based deep bidirectional LSTM. For CMUDict, a better WER was obtained by [5] by ensembling their CTC bidirectional LSTM with an alignment-based 5-gram model, but no corresponding PER was reported.

Our best attention models clearly outperform all of the previous best models in terms of PER. In terms of WER, the attention model outperforms all prior single (non-ensembled) models. For CMUDict, we also include the result of ensembling five of our global attention models using different random initializations, by voting on their outputs (with random tie-breaking), which outperforms the ensemble of Rao et al.
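The voting step can be sketched as follows; the function name and the assumption that each model contributes one whole predicted pronunciation per word are ours, not the paper's:

    import random
    from collections import Counter

    def ensemble_vote(predictions):
        """Pick the pronunciation predicted by the most models, breaking ties
        uniformly at random. predictions: list of phoneme tuples, one per model."""
        counts = Counter(predictions)
        best = max(counts.values())
        return random.choice([p for p, c in counts.items() if c == best])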

Among the three attention models, the global and local-m attention models perform well across all three data sets, while the local-p model performs well on CMUDict and Pronlex but not on NetTalk. The success of the global model may be explained by the fact that the source sequence length in this task (i.e., the word length) is rather short, always less than 25 characters in these three data sets. It is therefore feasible for the global attention model to consider all of the encoder states and weight them in an appropriate manner.

The local-m attention model, even with its simplistic assumption about alignment, outperforms the local-p model on every data set and is the clear best performer on Pronlex. Although the assumption of monotonic alignment turns out to be too simplistic for other tasks, such as machine translation [17], it is a reasonable choice for G2P.

Surprisingly, the local-p attention model remains a distant third among the three attention models across all three data sets. Moreover, it suffers a higher PER even when it obtains comparable WERs, as is the case for Pronlex. This means that words with errors tend to have a large number of errors, which suggests that if an alignment error is made near the beginning of a word, it is hard for the local-p model to recover. This points towards a need for a better alignment prediction strategy. The particularly poor performance of local-p on NetTalk also suggests that it may need a larger training set for learning the alignment prediction parameters.

Model Changes                                  Dev WER (%)
No changes (full global attention model)       21.81
No sampling                                    22.05
No dropout                                     22.17
No input feeding                               22.06
No attention                                   22.98
No attention (rev. unidirectional encoder)     22.65
# of LSTM units - 256                          24.00
# of LSTM units - 50                           32.70
2-layer LSTM                                   22.36
1-layer LSTM                                   23.67
Rev. unidirectional encoder                    22.12
Rev. unidirectional encoder + GRU              23.78

Table 2. Ablation study on the CMUDict development set.

For all of our results, we choose the dropout probability and whether to use input feeding by tuning on the development set. The typical choice of dropout probability tends to be around 0.2–0.3, while the decision of whether to use input feeding tends to vary with the choice of attention model and the data set used. However, performance does not vary greatly with these two parameters.

4.5.1. Ablation analysis for CMUDict

In order to measure the contributions of various components of our attention models, we performed an ablation analysis for the global attention model evaluated on the CMUDict development data. The results are shown in Table 2. As can be seen from the table, the removal of input feeding results in a very minor drop in performance. The use of regularization in the form of dropout and scheduled sampling also provides a minor boost. The importance of attention is reflected in an almost 1% absolute drop in performance when attention is removed.

Page 17: Automatic Speech Recognition (CS753)

Sub-phonetic feature-based models

Page 18: Automatic Speech Recognition (CS753)

Pronunciation Model

Phone-Based:
• Each word is a sequence of phones
• Tends to be highly language dependent
• PHONE: s eh n s

Articulatory Features:
• Parallel streams of articulator movements
• Based on the theory of articulatory phonology¹

¹ [C. P. Browman and L. Goldstein, Phonology ’86]

Page 19: Automatic Speech Recognition (CS753)

Pronunciation Model

Articulatory feature values for the phone sequence s eh n s (a single value can span more than one phone):

LIP:       open/labial
TON.TIP:   critical/alveolar → mid/alveolar → closed/alveolar → critical/alveolar
TON.BODY:  mid/uvular → mid/palatal → mid/uvular
GLOTTIS:   open → critical → open
VELUM:     closed → open → closed

are more easily explained if a single feature value is allowed to span (what appears on the surface to be) more than one segment. Autosegmental phonology posits some relationships (or associations) between segments in different tiers, which limit the types of transformations that can occur. We will not make use of the details of this theory, other than the motivation that features inherently lie in different tiers of representation.

2.4.3 Articulatory phonology

In the late 1980s, Browman and Goldstein proposed articulatory phonology [BG86, BG92], a theory that differs from previous ones in that the basic units in the lexicon are not abstract binary features but rather articulatory gestures. A gesture is essentially an instruction to the vocal tract to produce a certain degree of constriction at a given location with a given set of articulators. For example, one gesture might be “narrow lip opening”, an instruction to the lips and jaw to position themselves so as to effect a narrow opening at the lips. Figure 2-3 shows the main articulators of the vocal tract to which articulatory gestures refer. We are mainly concerned with the lips, tongue, glottis (controlling voicing), and velum (controlling nasality).

Figure 2-3: A midsagittal section showing the major articulators of the vocal tract, reproduced from [oL04].


[Figure labels: articulatory feature tiers LIP-LOC, LIP-OPEN, TT-LOC, TT-OPEN, TB-LOC, TB-OPEN, VELUM, GLOTTIS]

PHONE s eh n s

Articulatory Features

Parallel streams of articulator movements

Based on the theory of articulatory phonology¹

Page 20: Automatic Speech Recognition (CS753)

Example: Pronunciations for the word “sense”

Simple asynchrony across feature streams can surface as many different phone alterations

[Adapted from Livescu ’05]

CANONICAL:
PHONE:  s eh n s
LIP:    open/labial
TT:     critical/alveolar → mid/alveolar → closed/alveolar → critical/alveolar
TB:     mid/uvular → mid/palatal → mid/uvular
GLOT:   open → critical → open
VEL:    closed → open → closed

OBSERVED (e.g.):
PHONE:  s eh_n n t s
LIP:    open/labial
TT:     critical/alveolar → mid/alveolar → closed/alveolar → critical/alveolar
TB:     mid/uvular → mid/palatal → mid/uvular
GLOT:   open → critical → open
VEL:    closed → open → closed

(The per-stream feature values are the same in both cases; only their relative timing differs, which surfaces as the nasalized vowel eh_n and the inserted [t].)

Page 21: Automatic Speech Recognition (CS753)

Dynamic Bayesian Networks (DBNs)

• Provides a natural framework to efficiently encode multiple streams of articulatory features

• Simple DBN with three random variables in each time frame (a sketch of its factorization follows below)

[Diagram: random variables A_t, B_t, C_t replicated at frames t−1, t, and t+1, with dependencies connecting variables within and across frames.]
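As an illustration, assuming each variable depends on its own value in the previous frame and on its parent within the frame (the slide's exact arrows are not reproduced here), the joint distribution factorizes into per-frame conditional probability tables:

    P(A_{1:T}, B_{1:T}, C_{1:T}) \;=\; \prod_{t=1}^{T}
        P(A_t \mid A_{t-1})\, P(B_t \mid B_{t-1}, A_t)\, P(C_t \mid C_{t-1}, B_t)

The same conditional tables are reused at every frame, which is why DBNs give a compact encoding of several parallel feature streams.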

Page 22: Automatic Speech Recognition (CS753)

DBN model of pronunciation [P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech ’12]

[Diagram: one frame of the DBN, with variables
• Word, Posn (position within the word), Phn, Trans (transition), Prev-Phone
• asynchrony lags L-Lag, T-Lag, G-Lag
• per-stream phones L-Phn, T-Phn, G-Phn
• articulatory feature values Lip-Op, TT-Op, Glot
• observed (surface) feature values surLip-Op, surTT-Op, surGlot]

Example: Word = “solve”, with phones s ao l v at positions 0 1 2 3. The lag variables shift which phone a feature stream is producing at each position, e.g. for L-Lag:

L-Lag   Posn 0   Posn 1   Posn 2   Posn 3
0       s        ao       l        v
1       -        s        ao       l
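As a toy illustration of this lag mechanism (a hypothetical helper mirroring the table above, not part of the cited model's code):

    def lagged_phone(phones, posn, lag):
        """Phone seen by a stream that is `lag` positions behind the main
        phone stream; '-' marks positions before the word starts."""
        i = posn - lag
        return phones[i] if i >= 0 else "-"

    # e.g. for "solve": lagged_phone(["s", "ao", "l", "v"], 1, 1) == "s"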

Page 23: Automatic Speech Recognition (CS753)

Factorized DBN model¹

[Diagram: the same variables as on the previous slide — Word, Posn, Phn, Trans, Prev-Phone; L-Lag, T-Lag, G-Lag; L-Phn, T-Phn, G-Phn; Lip-Op, TT-Op, Glot; surLip-Op, surTT-Op, surGlot — partitioned into five groups labeled Set1–Set5.]

Page 24: Automatic Speech Recognition (CS753)

Cascade of Finite State Machines

The factored model is realized as a cascade of finite-state transducers F1–F5 over the same variables:
• F1: Word → (Phn, Trans), keeping track of the position Posn within the word
• F2: (Phn, Trans) → (Phn, Trans, L-Lag, T-Lag, G-Lag), introducing the asynchrony lags
• F3: (Phn, Prev-Phn, lags) → (L-Phn, T-Phn, G-Phn), the per-stream phones
• F4: (L-Phn, T-Phn, G-Phn) → (Lip-op, TT-op, Glot), the target articulatory feature values
• F5: (Lip-op, TT-op, Glot) → (surLip-op, surTT-op, surGlot), the surface feature values

¹ [P. Jyothi, E. Fosler-Lussier, K. Livescu, Interspeech ’12]
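In practice such a cascade would be built and composed with a WFST toolkit (e.g., OpenFst); the toy Python sketch below only illustrates the underlying idea of composing weighted relations whose path weights add, which is what chaining F1 through F5 relies on:

    def compose(f, g):
        """Compose two weighted relations, each given as a function
        input -> list of (output, weight); weights add along a path."""
        def fg(x):
            return [(out, w1 + w2)
                    for mid, w1 in f(x)
                    for out, w2 in g(mid)]
        return fg

    # cascade = compose(compose(compose(compose(F1, F2), F3), F4), F5)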

Page 25: Automatic Speech Recognition (CS753)

Weighted Finite State Machine

[Example weighted finite-state machine: arcs labeled input:output/weight — x1:y1/1.5, x2:y2/1.3, x3:y3/2.0, x4:y4/0.6]

Page 26: Automatic Speech Recognition (CS753)

Weighted Finite State Machine

Decoding: For input X, find the path with minimum cost.

a* = argmin_{path a} w_α(X, a)

where w_α(X, a) is the weight of path a on input X, and α are learned parameters.

Linear model: w_α(X, a) = α · φ(X, a), where φ is a feature function.
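A minimal Python sketch of scoring and minimizing this linear path weight; the feature function, the candidate paths, and the function names are hypothetical stand-ins rather than the actual system's features:

    import numpy as np

    def path_weight(alpha, phi):
        """Linear model: w_alpha(X, a) = alpha . phi(X, a)."""
        return float(np.dot(alpha, phi))

    def decode(alpha, candidate_paths, feature_fn, x):
        """Return the candidate path a* with minimum weight alpha . phi(x, a).
        In practice the minimization runs over all paths of the FST cascade
        via shortest-path algorithms rather than explicit enumeration."""
        return min(candidate_paths, key=lambda a: path_weight(alpha, feature_fn(x, a)))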


Page 27: Automatic Speech Recognition (CS753)

Discriminative Training

• Online discriminative training algorithm to learn α

• Similar to structured perceptron [Collins ’02]:

• Each training sample gives a “decoded path” and a “correct path”. Update α to bias it towards the correct path (see the sketch after this list).

• Use a large-margin training algorithm adapted to work with a cascade of finite state machines¹
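A structured-perceptron-style update, sketched in Python under assumed placeholder decoding and feature functions (the actual system uses a large-margin variant adapted to the FST cascade):

    import numpy as np

    def perceptron_update(alpha, x, correct_path, feature_fn, decode_fn, lr=1.0):
        """One online update: decode with the current alpha, then move alpha so
        the correct path scores a lower cost than the decoded path."""
        decoded_path = decode_fn(alpha, x)
        if decoded_path != correct_path:
            # Path costs are alpha . phi, so subtracting the correct path's features
            # (and adding the decoded path's) lowers the correct path's relative cost.
            alpha = alpha - lr * (feature_fn(x, correct_path) - feature_fn(x, decoded_path))
        return alpha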


¹ [P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech ’13]