
Modeling pronunciation variation for ASR: A survey of the literature

Helmer Strik *, Catia Cucchiarini

A2RT, Department of Language and Speech, University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands

Abstract

The focus in automatic speech recognition (ASR) research has gradually shifted from isolated words to conversational speech. Consequently, the amount of pronunciation variation present in the speech under study has gradually increased. Pronunciation variation will deteriorate the performance of an ASR system if it is not well accounted for. This is probably the main reason why research on modeling pronunciation variation for ASR has increased lately. In this contribution, we provide an overview of the publications on this topic, paying particular attention to the papers in this special issue and the papers presented at 'the Rolduc workshop'.¹ First, the most important characteristics that distinguish the various studies on pronunciation variation modeling are discussed. Subsequently, the issues of evaluation and comparison are addressed. Particular attention is paid to some of the most important factors that make it difficult to compare the different methods in an objective way. Finally, some conclusions are drawn as to the importance of objective evaluation and the way in which it could be carried out. © 1999 Elsevier Science B.V. All rights reserved.

Zusammenfassung

Research in automatic speech recognition (ASR) has gradually developed from the recognition of isolated words towards the recognition of freely spoken language. As a consequence, pronunciation variation, as it surfaces in free speech, has become an intervening factor in speech recognition. Indeed, the performance of an ASR system is considerably impaired if this factor is not taken into account. This is presumably the main reason why the systematic treatment of pronunciation variation in ASR has increased strongly of late. This article presents an overview of the literature on this topic, with particular attention paid to the contributions in this special issue and to those of the Rolduc workshop. First, the most important differences between the numerous studies on modeling pronunciation variation are discussed. This is followed by a discussion of the evaluation and comparison of the various methods on which the modeling is based. Particular attention is paid to the most important factors that hamper an objective comparison of the methods. Finally, some conclusions follow concerning the relevance of objective evaluation and how it might be realized. © 1999 Elsevier Science B.V. All rights reserved.


* Corresponding author. Tel.: +31-24-3616104; fax: +31-24-3612907.

E-mail address: [email protected] (H. Strik)

¹ Whenever we mention 'the Rolduc workshop' in the text we refer to the ESCA Tutorial and Research Workshop "Modeling pronunciation variation for ASR" that was held in Rolduc from 4 to 6 May 1998. This special issue of Speech Communication contains a selection of papers presented at that workshop.


Résumé

The focus in automatic speech recognition (ASR) research, which started from isolated words, has moved towards conversational speech. As a consequence, the amount of pronunciation variation present in the speech under study has gradually increased. Pronunciation variation will deteriorate the performance of an ASR system if it is not accounted for. This is probably the main reason why research in the field of modeling pronunciation variation for ASR has increased recently. In this contribution we provide an overview of the publications on this subject, referring in particular to the articles in this special issue and to the contributions presented at the sessions held at Rolduc. First, the most important characteristics that distinguish the various studies on the modeling of pronunciation variation are discussed. Then the questions of evaluation and comparison are addressed. Particular attention is paid to some of the most important factors that make it difficult to compare the different methods in an objective way. Finally, some conclusions are drawn as to the importance of objective evaluation and the way in which it could be carried out. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Pronunciation variation; Automatic speech recognition

1. Introduction

If words were always pronounced in the same way, automatic speech recognition (ASR) would be relatively easy. However, for various reasons words are almost always pronounced differently. The most important sources of pronunciation variation will be discussed here.

A first major distinction in pronunciation variation can be drawn between intraspeaker and interspeaker variation. Intraspeaker variation refers to the fact that the same speaker can pronounce the same word in different ways depending on various factors. The first important factor that may affect the way in which words are pronounced is the fact that "they are strung together into connected speech" (Kaisse, 1985, p. 1) as opposed to when they are pronounced in isolation. In connected speech, all sorts of interactions may take place between words, which will result in the application of various phonological processes such as assimilation, co-articulation, reduction, deletion and insertion. The degree to which these phenomena occur will vary depending on the style of speaking the speaker is engaged in (for an overview of the literature on speaking styles see (Eskenazi, 1993)). Stylistic variations are usually interpreted as variations in the degree of formality of speech (Labov, 1972). As speech becomes less formal, the syllabic structure of words may be reorganized, speech rate may increase, and there may be changes in pitch and loudness (Laver, 1994, pp. 66–69). Besides stylistic variation there is also free variation, in which the speaker is free to choose from among different pronunciations of the same word, without this having any implications for speaking style.

Another important source of variation in speech is the interlocutor, since it is known that speakers are influenced by the person they are talking to (Coupland, 1984; Giles and Powesland, 1975; Giles and Smith, 1979). In certain accounts of language variation, "the interlocutor" is not viewed as a separate factor, but is incorporated in the style dimension, since style is considered as the "speakers' response to their audience" (Bell, 1984, p. 145).

The types of variation mentioned so far can all take place in the speech of one and the same speaker, though the degree to which they occur is likely to vary between speakers. In addition to this, there is variation in pronunciation between speakers since speakers of the same language may speak different dialects or speak with a different accent (Laver, 1994, pp. 55–56). The specific dialect or accent of a speaker will depend on factors such as region of origin, socioeconomic background, level of education, sex, age, group membership and mother tongue (an overview of such factors, and their effect on pronunciation, is provided in (Scherer and Giles, 1979)). In this respect, it is important to note that intraspeaker variation and interspeaker variation (in the sociolinguistic literature often referred to as stylistic and social variation, respectively) are not independent of each other (Labov, 1972; Romaine, 1980). On the contrary, according to Bell (1984, p. 158) "intraspeaker variation derives from and echoes interspeaker variation".

Over and above the variation due to the factors mentioned so far, which can be called linguistic variation, there is also variation caused by anatomical differences between speakers, such as differences in vocal tract length, variation due to environmental factors such as background noise (the Lombard effect) and variation caused by paralinguistic factors such as emotional state (joy, anger, sorrow, etc.) (Polzin and Waibel, 1998). A good review of the literature on human vocal emotion is given in (Murray and Arnott, 1993).

Owing to all these sources of variation, each word in a language can be pronounced in many different ways, which constitutes a major problem for ASR. In the beginning of ASR research, the amount of pronunciation variation was limited by using isolated words. In isolated word recognition, the speakers have to pause between words, which of course reduces the degree of interaction between words. Moreover, in this case speakers also have the tendency to articulate more carefully. Although using isolated words makes the task of an ASR system easier, it certainly does not do the same for the speaker, because pausing between words is highly unnatural. Therefore, attempts were made in ASR research to improve technology, so that it could handle less artificial speech. Consequently, the type of speech used in ASR research has gradually progressed from isolated words, via connected words and carefully read speech, to conversational or spontaneous speech. Although many current applications still make use of isolated word recognition (e.g., dictation), in ASR research the emphasis is now on spontaneous or conversational speech. It is clear that in going from isolated words to conversational speech the amount of pronunciation variation increases. Since the presence of variation in pronunciation may cause errors in ASR, modeling pronunciation variation is seen as a possible way of improving the performance of the current systems.

The fact that pronunciation variation should be accounted for in ASR was already noted in the early 1970s. For instance, in many articles in the proceedings of the "IEEE Symposium on Speech Recognition" from April 1974 (Erman, 1974) it is mentioned that multiple pronunciations should be present in the lexicon and that phonological rules can be used to generate them (Barnett, 1974; Cohen and Mercer, 1974; Friedman, 1974; Jelinek et al., 1974; O'Malley and Cole, 1974; Oshika et al., 1974; Rabinowitz, 1974; Rovner et al., 1974; Shockey and Erman, 1974; Tappert, 1974). In the last decade, there has been an increase in the amount of research on this topic, which is evident from the growing number of contributions to conferences (see e.g. Strik, 1998), from the organization of the Rolduc workshop (Strik et al., 1998), and also from the appearance of this special issue of Speech Communication.

In spite of the importance that this topic has acquired, it seems that giving a precise definition of pronunciation variation modeling for ASR is not easy. Strictly speaking, one could say that almost all ASR research is about modeling pronunciation variation. As a matter of fact, the ubiquitous "hidden Markov models" (HMMs) are a way of accounting for segmental and temporal variation. In order to better account for coarticulation effects, HMMs have been further refined and made specific for the various contexts: the context-dependent HMMs. Furthermore, the use of multiple Gaussian mixtures has been introduced as a better way of modeling segmental variation. By now, these techniques have become standard, and they are no longer considered as ways of modeling pronunciation variation. In general, when speaking about pronunciation variation modeling for ASR, one thinks of techniques other than these standard ones, as appears from a review of the papers that are presented at conferences under this heading. It follows that it is difficult to say where standard ASR ends and pronunciation variation modeling for ASR begins.

Similarly, it is difficult to define the type of pronunciation variation that has been modeled for the purpose of ASR. First of all, this variation cannot be characterized in terms of the categories that are usually applied in (socio)linguistic research. In general, it can be stated that when the term pronunciation variation is used within the context of ASR, it usually refers to the types of linguistic variation mentioned above under the categories of intraspeaker and interspeaker variation. Although linguistic variation takes place both at the segmental and at the suprasegmental level, in general only segmental variation has been modeled so far in ASR. Furthermore, variation due to environmental characteristics and paralinguistic factors normally is not explicitly modeled. Interspeaker variation is sometimes (partly) modeled by using speaker adaptation (e.g., anatomical differences are modeled with vocal tract length normalization techniques). Second, the choice of what should be modeled is based on the phenomena observed in the speech material and on the possible effects of these phenomena on recognition performance. For instance, researchers choose to model French liaison, /n/-deletion or voicing assimilation regardless of the factors that may have caused these processes. In many cases, researchers do not even choose which processes to model; they just emerge from analyses of the speech material. This means that it is often difficult to give a precise definition of the type of variation that is modeled in a specific approach, because in most cases it is a combination of different types. In turn, this makes it difficult to compare the various approaches, and to interpret the results of this kind of research from the point of view of human speech processing.

A survey of the literature on pronunciation variation modeling in ASR reveals that most of these papers are directly concerned with testing a specific method for variation modeling to determine whether it leads to an improvement in recognition performance. In addition, there have been studies that were specifically aimed at getting more insight into pronunciation variation so as to develop better approaches to modeling it in ASR (e.g., Adda-Decker and Lamel, 1998, 1999; Fosler-Lussier and Morgan, 1998, 1999; Greenberg, 1998, 1999; Peters and Stubley, 1998; Strik and Cucchiarini, 1998). However, studies of this kind are less frequent.

In this paper we intend to give an overview of the various approaches to modeling pronunciation variation that have been proposed so far, paying particular attention to the papers contained in this special issue and to those presented at the Rolduc workshop (Strik et al., 1998). Where necessary, reference will be made to related research that has been presented elsewhere. Providing such an overview turned out to be difficult because not all authors present their work with the same perspective and the same degree of detail. For instance, the majority of the papers on this topic listed in the reference section of the present article are papers that have appeared in proceedings of conferences and workshops. Since the size of such proceedings papers is always limited, it is inevitable that details will be missing.

The presentation of the various methods in Section 2 will be organized around some of the major characteristics that distinguish pronunciation variation modeling techniques from each other. After having presented the different techniques (in Section 2), we will address the issues of evaluation and comparison (in Section 3), which are crucial if we want to draw conclusions as to the merits of the various proposals. In particular, we will discuss the most important factors that make it difficult to compare the different methods in an objective way.

2. Characteristics of the methods

As explained above, it is difficult to classify the type of pronunciation variation modeled in ASR in terms of the categories that are usually applied in linguistics and phonetics. That is why we decided to adopt another framework for classification. The framework we will use is based on the decisions that are made when one has to choose a method for modeling pronunciation variation. These decisions concern the following questions:

1. Which type of pronunciation variation should be modeled?
2. Where should the information on variation come from?
3. Should the information be formalized or not?
4. In which component of the automatic speech recognizer should variation be modeled?

It is obvious that these questions cannot be answered in isolation. On the contrary, the answers will be highly interdependent. Depending on the decision taken for each of the above questions, different methods for pronunciation variation modeling can be distinguished, as will appear from the following sections. For each question it is possible to identify a specific dimension along which a choice can be made. In this way a descriptive framework can be obtained to classify the various contributions to modeling pronunciation variation in ASR. Although it is certainly possible that the extremes of some of these dimensions will not occur in practice, this is irrelevant since their main function is to provide us with a framework for description.

2.1. Type of pronunciation variation

The majority of the contributions are concerned with variation at the segmental level. A common way of describing segmental pronunciation variation in the context of ASR is by indicating whether it refers to word-internal or to cross-word processes, because this choice is strongly related to the properties of the speech recognizer being used. As a matter of fact, the choice for word-internal variation, cross-word variation or both is determined by factors such as the type of ASR, the language, and the level at which modeling will take place.

Modeling within-word variation is an obvious choice if the ASR system makes use of a lexicon with word entries, because in this case variants can simply be added to the lexicon. Given that almost all ASR systems use a lexicon, within-word variation is modeled in the majority of the methods (Adda-Decker and Lamel, 1998, 1999; Aubert and Dugast, 1995; Bacchiani and Ostendorf, 1998, 1999; Beulen et al., 1998; Blackburn and Young, 1995, 1996; Bonaventura et al., 1998; Cohen and Mercer, 1975; Cremelie and Martens, 1995, 1997, 1998, 1999; Ferreiros et al., 1998; Finke and Waibel, 1997; Fosler-Lussier and Morgan, 1998, 1999; Fukada and Sagisaka, 1997; Fukada et al., 1998, 1999; Heine et al., 1998; Holter, 1997; Holter and Svendsen, 1998, 1999; Imai et al., 1995; Kessens and Wester, 1997; Kessens et al., 1999; Lamel and Adda, 1996; Lehtinen and Safra, 1998; Mercer and Cohen, 1987; Mirghafori et al., 1995; Mokbel and Jouvet, 1998; Ravishankar and Eskenazi, 1997; Riley et al., 1998, 1999; Ristad and Yianilos, 1998; Schiel et al., 1998; Sloboda and Waibel, 1996; Svendsen et al., 1995; Torre et al., 1997; Wester et al., 1998a; Williams and Renals, 1998; Zeppenfeld et al., 1997).

Besides within-word variation, cross-word variation also occurs, especially in continuous speech. Therefore, cross-word variation should also be accounted for. A sort of compromise solution between the ease of modeling at the level of the lexicon and the need to model cross-word variation is to use multi-words (Beulen et al., 1998; Finke and Waibel, 1997; Kessens et al., 1999; Nock and Young, 1998; Pousse and Perennou, 1997; Ravishankar and Eskenazi, 1997; Riley et al., 1998; Sloboda and Waibel, 1996; Wester et al., 1998a). In this approach, sequences of words (usually called multi-words) are treated as one entity in the lexicon (see also Section 2.4.1) and the variations that result when the words are strung together are modeled by including different variants of the multi-words. It is important to note that, in general, with this approach only a small portion of cross-word variation is modeled, e.g., that occurring between words that figure in very frequent sequences. Besides the multi-word approach, other methods have been proposed to model cross-word variation (Aubert and Dugast, 1995; Blackburn and Young, 1995, 1996; Cohen and Mercer, 1975; Cremelie and Martens, 1995, 1997, 1998, 1999; Mercer and Cohen, 1987; Perennou and Brieussel-Pousse, 1998; Pousse and Perennou, 1997; Safra et al., 1998; Schiel et al., 1998; Wiseman and Downey, 1998).

Given that both within-word and cross-word variation occur in running speech, it will probably be necessary to model both of them. This has already been done in (Beulen et al., 1998; Blackburn and Young, 1995, 1996; Cohen and Mercer, 1975; Cremelie and Martens, 1995, 1997, 1998, 1999; Finke and Waibel, 1997; Kessens et al., 1999; Mercer and Cohen, 1987; Riley et al., 1998, 1999; Schiel et al., 1998; Sloboda and Waibel, 1996; Wester et al., 1998a).

2.2. Information sources

Another feature that distinguishes the various approaches to modeling pronunciation variation in ASR is the source from which information on pronunciation variation is derived. In this connection, a distinction can be drawn between data-driven and knowledge-based methods. The major difference between these two types of approaches is that in the former case the assumption is that the information on pronunciation variation has to be obtained in the first place. In knowledge-based approaches, on the other hand, it is assumed that this information is already available in the literature.

The idea behind data-driven methods is that information on pronunciation variation has to be obtained directly from the signals (Bacchiani and Ostendorf, 1998, 1999; Blackburn and Young, 1995, 1996; Cremelie and Martens, 1995, 1997, 1998, 1999; Fosler-Lussier and Morgan, 1998, 1999; Fukada and Sagisaka, 1997; Fukada et al., 1998, 1999; Greenberg, 1998, 1999; Heine et al., 1998; Holmes and Russell, 1996; Holter, 1997; Holter and Svendsen, 1998, 1999; Imai et al., 1995; Mirghafori et al., 1995; Mokbel and Jouvet, 1998; Nock and Young, 1998; Peters and Stubley, 1998; Polzin and Waibel, 1998; Ravishankar and Eskenazi, 1997; Riley et al., 1998, 1999; Ristad and Yianilos, 1998; Sloboda and Waibel, 1996; Svendsen et al., 1995; Torre et al., 1997; Williams and Renals, 1998). To this end, the acoustic signals are analyzed in order to determine all possible ways in which the same word or phoneme is realized. A common stage in this analysis is transcribing the acoustic signals. Subsequently, the transcriptions can be used for different purposes, as will be explained in Sections 2.3 and 2.4. Transcriptions of the acoustic signals can be obtained either manually (Cremelie and Martens, 1995, 1997; Downey and Wiseman, 1997; Fosler-Lussier and Morgan, 1998, 1999; Greenberg, 1998, 1999; Heine et al., 1998; Mirghafori et al., 1995; Riley et al., 1998, 1999; Ristad and Yianilos, 1998; Wiseman and Downey, 1998) or (semi-)automatically (Adda-Decker and Lamel, 1998, 1999; Bacchiani and Ostendorf, 1998, 1999; Beulen et al., 1998; Cremelie and Martens, 1997, 1998, 1999; Fukada and Sagisaka, 1997; Fukada et al., 1998, 1999; Holter, 1997; Kessens and Wester, 1997; Kessens et al., 1999; Lehtinen and Safra, 1998; Mokbel and Jouvet, 1998; Ravishankar and Eskenazi, 1997; Riley et al., 1998, 1999; Schiel et al., 1998; Wester et al., 1998a; Svendsen et al., 1995; Torre et al., 1997; Williams and Renals, 1998). The latter is usually done either with a phone(me) recognizer (Fukada and Sagisaka, 1997; Fukada et al., 1998, 1999; Mokbel and Jouvet, 1998; Ravishankar and Eskenazi, 1997; Torre et al., 1997; Williams and Renals, 1998) or by means of forced recognition (Adda-Decker and Lamel, 1998, 1999; Bacchiani and Ostendorf, 1998, 1999; Beulen et al., 1998; Cremelie and Martens, 1997, 1998, 1999; Kessens and Wester, 1997; Kessens et al., 1999; Lehtinen and Safra, 1998; Riley et al., 1998, 1999; Schiel et al., 1998; Wester et al., 1998a).

In forced recognition (also called forced alignment), the ASR system can only choose between the pronunciation variants of a word, and not between all words present in the lexicon, as is the case for a "normal" ASR system. Consequently, forced recognition can be employed to decide which pronunciation variant best matches the signal, and in this way a new transcription can be obtained (see also Section 2.4.2). The performance of forced recognition has been evaluated in (Wester et al., 1998a,b, 1999) by comparing it with the performance of humans who carried out the same tasks. It turned out that for the tasks studied in (Wester et al., 1998a,b, 1999) the performance of the human listeners and that of forced recognition were similar, and that on average the degree of agreement between the ASR system and the listeners was only slightly lower than that between listeners. Therefore, forced recognition seems to be a suitable tool for obtaining information on pronunciation (variation). However, since it is shown in (Wester et al., 1999) that the agreement depends on the properties of the ASR system used, one should be cautious in applying such a tool.
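To make the procedure concrete, here is a minimal sketch of variant selection by forced alignment. It assumes the word sequence is known (as it is in forced recognition) and, for simplicity, that each word's stretch of signal is given; `score` is a hypothetical stand-in for the acoustic log-likelihood computed by the recognizer's models, and all names and data structures are illustrative rather than taken from any of the systems cited above.

```python
# Sketch: choosing, per word, the pronunciation variant that best
# matches the signal. `score` stands for the acoustic log-likelihood
# log P(segment | phone string) supplied by the acoustic models.

def forced_transcription(score, word_sequence, segments, lexicon):
    """For each word and its stretch of signal, keep the lexicon variant
    with the highest acoustic score; the concatenation of the winning
    variants is the new transcription of the utterance."""
    transcription = []
    for word, segment in zip(word_sequence, segments):
        variants = lexicon[word]   # e.g. {"lopen": ["l o p @ n", "l o p @"]}
        best = max(variants, key=lambda v: score(segment, v))
        transcription.append(best)
    return transcription
```

In a real system the segment boundaries are not known in advance; a Viterbi search finds them jointly with the variant choice over the whole utterance. The per-word loop above merely isolates the idea.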

Given that making manual transcriptions is extremely time-consuming, and therefore costly, it is not feasible to obtain manual transcriptions of large corpora. As a consequence, the use of automatically obtained transcriptions is becoming more common. Moreover, there is another reason why transcriptions obtained automatically with the ASR system itself could be beneficial, viz., that these transcriptions are more in line with the phone strings obtained later during recognition with the same ASR system. This is also mentioned by Riley et al. (1998, 1999).

In knowledge-based studies, information on pronunciation variation is primarily derived from sources that are already available (Adda-Decker and Lamel, 1998, 1999; Aubert and Dugast, 1995; Bonaventura et al., 1998; Cohen and Mercer, 1975; Downey and Wiseman, 1997; Ferreiros et al., 1998; Finke and Waibel, 1997; Kessens and Wester, 1997; Kessens et al., 1999; Kipp et al., 1996, 1997; Lamel and Adda, 1996; Lehtinen and Safra, 1998; Mercer and Cohen, 1987; Mouria-Beji, 1998; Nock and Young, 1998; Perennou and Brieussel-Pousse, 1998; Pousse and Perennou, 1997; Roach and Arnfield, 1998; Safra et al., 1998; Schiel et al., 1998; Wesenick, 1996; Wester et al., 1998a; Wiseman and Downey, 1998; Zeppenfeld et al., 1997). The existing sources can be pronunciation dictionaries (see e.g., Roach and Arnfield, 1998) and linguistic studies on pronunciation variation. However, these sources usually only provide information as to the form of the possible variants, while quantitative information on the frequency of the alternative variants still has to be obtained from the acoustic signals, as is the case for data-driven methods. Furthermore, not many suitable pronunciation dictionaries exist.

The distinction between data-driven and knowledge-based is related to that between bottom-up and top-down, which are also commonly used terms in the ASR literature. However, in this paper these terms will not be used interchangeably. More explicitly, in our taxonomy the terms data-driven and knowledge-based are taken to refer to the starting point of the research, be it the acoustic signals (data) or the literature (knowledge). On the other hand, the terms bottom-up and top-down refer to the direction of the development process, which can be upward or downward.

Although many studies contained in this issue are not completely data-driven or knowledge-based, it is generally possible to say whether the starting point of the research was mainly data-driven or knowledge-based (see the references above). However, most of them cannot be said to be completely bottom-up or top-down, because in none of these studies is the direction of the development process solely upward or downward; rather, the flow of information goes in both directions. For example, in many data-driven studies the results of the bottom-up analyses are used to change the lexicon, and the altered lexicon is then used during recognition in a top-down manner. Similarly, knowledge-based methods are usually not strictly top-down, e.g. because in many of them the rules applied to generate pronunciation variants may be altered on the basis of information derived from analysis of the acoustic signals.

In general terms, it is not possible to say whether a data-driven study is to be preferred to a knowledge-based one. A possible drawback of knowledge-based studies is that there could be a mismatch between the information found in the literature and the data for which it has to be used. In the introduction it was stated that many current systems are designed for spontaneous speech. However, the knowledge on pronunciation variation that can be found in the literature usually concerns other speaking styles. Therefore, it is possible that the information obtained from the literature does not cover the type of variation in question, whereas information obtained from data could be more effective for this purpose. This form of mismatch between the knowledge and the data may lead to overcoverage, i.e., the addition of variants that do not figure in the corpus, or to undercoverage, i.e., the exclusion of variants that do figure in the corpus. To overcome these problems, one can resort to a combination of top-down and bottom-up approaches, as explained above.

On the other hand, a possible drawback of data-driven studies is that for every new corpus and/or ASR system the whole process of transcribing the speech material and deriving information on pronunciation variation has to be started again. In other words, information obtained on the basis of data-driven studies does not easily generalize to situations other than the one in question. Moreover, the problem of undercoverage may also arise in data-driven approaches, if the corpus is not representative enough.

A good option might be to use a method consisting of two stages. In the first stage, a knowledge-based approach is used, which has the advantage that it can easily be ported to a new task. In the second stage, a data-driven approach is used to model (part of) the remaining pronunciation variation. The data-driven approach should also be used to test existing knowledge and acquire new knowledge. In this way, the amount of pronunciation variation modeled in the first stage will gradually increase, and the importance of the second stage will gradually diminish.

2.3. Information representation

Regardless of whether a data-driven or a knowledge-based approach is used, it is possible to choose between formalizing the information on pronunciation variation or not. In general, formalization means that a more abstract and compact representation is chosen, e.g., rewrite rules or artificial neural networks.

In a data-driven method, the formalizations are derived from the data (Cremelie and Martens, 1995, 1997, 1998, 1999; Deshmukh et al., 1996; Fukada and Sagisaka, 1997; Fukada et al., 1998, 1999; Imai et al., 1995; Ravishankar and Eskenazi, 1997; Torre et al., 1997). In general, this is done in the following manner. The bottom-up transcription of an utterance is aligned with its corresponding top-down transcription, obtained by concatenating the transcriptions of the individual words contained in the lexicon. Alignment is done by means of a dynamic programming (DP) algorithm (Cremelie and Martens, 1995, 1997, 1998, 1999; Fosler-Lussier and Morgan, 1998, 1999; Fukada and Sagisaka, 1997; Fukada et al., 1998, 1999; Heine et al., 1998; Ravishankar and Eskenazi, 1997; Riley et al., 1998, 1999; Torre et al., 1997; Williams and Renals, 1998; Wiseman and Downey, 1998). The resulting DP alignments can then be used to:

· derive rewrite rules (Cremelie and Martens, 1995, 1997, 1998, 1999; Ravishankar and Eskenazi, 1997);

· train decision trees (Fosler-Lussier and Morgan, 1998, 1999; Riley et al., 1998, 1999);

· train an artificial neural network (ANN) (Deshmukh et al., 1996; Fukada and Sagisaka, 1997; Fukada et al., 1998, 1999);

· calculate a phone confusion matrix (Torre et al., 1997).

In these four cases, the information about pronunciation variation present in the DP alignments is formalized in terms of rewrite rules, decision trees, ANNs and a phone confusion matrix, respectively.
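As an illustration of the alignment step, the sketch below aligns a canonical (top-down) phone string with an observed (bottom-up) transcription by dynamic programming and reads substitution, deletion and insertion candidates off the result. It uses plain edit-distance costs, not the cost functions of the studies cited above, and the phone symbols are illustrative.

```python
# Sketch: DP (edit-distance) alignment of a canonical phone string with
# an observed transcription; the resulting (canonical, observed) pairs
# are the raw material for rewrite rules, decision trees, etc.

def dp_align(canonical, observed):
    n, m = len(canonical), len(observed)
    # cost[i][j] = cheapest alignment of canonical[:i] with observed[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0:
                cost[i][j] = j
            elif j == 0:
                cost[i][j] = i
            else:
                sub = cost[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1])
                cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # backtrace; None marks a deletion or an insertion
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           cost[i][j] == cost[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1]):
            pairs.append((canonical[i - 1], observed[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((canonical[i - 1], None))   # deletion
            i -= 1
        else:
            pairs.append((None, observed[j - 1]))    # insertion
            j -= 1
    return list(reversed(pairs))

# e.g., word-final /n/-deletion surfaces as the pair ('n', None):
print(dp_align(list("lopen"), list("lope")))
```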

In a knowledge-based approach, formalized information on pronunciation variation can be obtained from linguistic studies in which rules have been formulated. In general, these are optional phonological rules concerning deletions, insertions and substitutions of phones (Adda-Decker and Lamel, 1998, 1999; Aubert and Dugast, 1995; Cohen and Mercer, 1975; Ferreiros et al., 1998; Finke and Waibel, 1997; Kessens et al., 1999; Lamel and Adda, 1996; Lehtinen and Safra, 1998; Mercer and Cohen, 1987; Nock and Young, 1998; Perennou and Brieussel-Pousse, 1998; Pousse and Perennou, 1997; Safra et al., 1998; Schiel et al., 1998; Wester et al., 1998a; Wiseman and Downey, 1998; Zeppenfeld et al., 1997). The formalizations (whether obtained from data or from linguistic studies) can then be used to generate surface forms (pronunciation variants) from the base forms.
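A sketch of that generation step, assuming rules are represented as simple regular-expression rewrites over a space-separated phone string; the two rules shown (and the '@' symbol for schwa) are purely illustrative, and real rule sets, notations and rule-interaction handling differ per study.

```python
# Sketch: generating pronunciation variants by applying optional
# rewrite rules to a base form. Every subset of applicable rules
# yields a candidate surface form; rules are applied in list order.
import re
from itertools import combinations

RULES = [
    (r"@ n$", "@"),   # optional word-final /n/-deletion after schwa
    (r"^h ", ""),     # optional /h/-deletion in onset (illustrative)
]

def variants(base_form):
    """Return all surface forms obtained by applying any subset of rules."""
    forms = {base_form}
    for r in range(1, len(RULES) + 1):
        for subset in combinations(RULES, r):
            form = base_form
            for pattern, replacement in subset:
                form = re.sub(pattern, replacement, form)
            forms.add(form)
    return sorted(forms)

print(variants("h o p @ n"))
# ['h o p @', 'h o p @ n', 'o p @', 'o p @ n']
```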

The obvious alternative to using formalizations is to use information that is not formalized, but enumerated. Again, this can be done either in a data-driven or in a knowledge-based manner. In data-driven studies, the bottom-up transcriptions can be used to list all pronunciation variants of one and the same word. These variants and their transcriptions (or a selection of them) can then be added to the lexicon. Alternatively, in knowledge-based studies it is possible to add all the variants of one and the same word contained in a pronunciation dictionary. Quite clearly, when no formalization is used, it is not necessary to generate the variants because they are already available.


It is not easy to determine a priori whether formalized information will work better than enumerated information. It may at first seem that using formalizations has two important advantages. First, one has complete control over the process of variant generation: at any moment it is possible to select variants automatically in different ways. Second, since the information on pronunciation variation is expressed in more abstract terms, it is not limited to a specific corpus and can easily be applied to other corpora. Both these operations will be less easy with enumerated information. On the other hand, the use of formalizations also has some disadvantages, such as overgeneration and undergeneration, owing to incorrect specifications of the rules applied (Cohen, 1989, p. 51). Neither type of problem should arise when using enumerated information.

2.4. Level of modeling

Given that the recognition engines of most ASR systems consist of three components, there are three levels at which variation can be modeled: the lexicon, the acoustic models, and the language model. This is not to say that modeling at one level precludes modeling at one of the other levels; on the contrary, to obtain a good recognition system, the modeling at the three levels has to be concerted. Therefore, in most studies modeling takes place at more than one level. Nevertheless, in order to categorize the various studies, each category will be discussed separately in the following subsections.

2.4.1. Lexicon

At the level of the lexicon, pronunciation variation is usually modeled by adding pronunciation variants (and their transcriptions) to the lexicon (Adda-Decker and Lamel, 1998, 1999; Aubert and Dugast, 1995; Beulen et al., 1998; Bonaventura et al., 1998; Cohen and Mercer, 1975; Cremelie and Martens, 1995, 1997, 1998, 1999; Downey and Wiseman, 1997; Ferreiros et al., 1998; Finke and Waibel, 1997; Fukada et al., 1998, 1999; Holter, 1997; Holter and Svendsen, 1998, 1999; Imai et al., 1995; Kessens and Wester, 1997; Kessens et al., 1999; Lamel and Adda, 1996; Lehtinen and Safra, 1998; Mercer and Cohen, 1987; Mokbel and Jouvet, 1998; Nock and Young, 1998; Ravishankar and Eskenazi, 1997; Riley et al., 1998, 1999; Roach and Arnfield, 1998; Sloboda and Waibel, 1996; Torre et al., 1997; Wester et al., 1998a; Williams and Renals, 1998; Wiseman and Downey, 1998; Zeppenfeld et al., 1997). The rationale behind this procedure is that with multiple transcriptions of the same word the chance is increased that for an incoming signal the speech recognizer selects a transcription belonging to the correct word. In turn, this should lead to lower error rates.

However, adding pronunciation variants to the lexicon usually also introduces new errors because the acoustic confusability within the lexicon increases, i.e., the added variants can be confused with those of other entries in the lexicon. This can be minimized by making an appropriate selection of the pronunciation variants, for instance by adding only the set of variants for which the balance between solving old errors and introducing new ones is positive. Therefore, in many studies tests are carried out to determine which set of pronunciation variants leads to the largest gain in performance (Cremelie and Martens, 1995, 1997, 1998, 1999; Fukada et al., 1998, 1999; Holter, 1997; Holter and Svendsen, 1998, 1999; Imai et al., 1995; Kessens and Wester, 1997; Kessens et al., 1999; Lehtinen and Safra, 1998; Mokbel and Jouvet, 1998; Nock and Young, 1998; Riley et al., 1998, 1999; Sloboda and Waibel, 1996; Torre et al., 1997; Wester et al., 1998a). For this purpose, different criteria can be used, such as:

· frequency of occurrence of the variants (Kessens and Wester, 1997; Kessens et al., 1999; Ravishankar and Eskenazi, 1997; Riley et al., 1998, 1999; Schiel et al., 1998; Wester et al., 1998a; Williams and Renals, 1998);

· a maximum likelihood criterion (Holter, 1997; Holter and Svendsen, 1998, 1999; Imai et al., 1995);

· confidence measures (Sloboda and Waibel, 1996); and

· the degree of confusability between the variants (Sloboda and Waibel, 1996; Torre et al., 1997).

A description of a method to detect confusable pairs of words or transcriptions is also given in (Roe and Riley, 1994). If rules are used to generate pronunciation variants, then certain rules can be selected (and others discarded), as in (Cremelie and Martens, 1995, 1997, 1998, 1999; Lehtinen and Safra, 1998; Schiel et al., 1998), where rules are selected on the basis of their frequency and application likelihood.
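A minimal sketch of the first of these criteria, frequency of occurrence: count how often each variant is actually realized in a transcribed training corpus (e.g., one obtained by forced recognition, Section 2.2) and keep only the variants above a relative-frequency threshold. The threshold value and the data layout are illustrative, not taken from any of the cited studies.

```python
# Sketch: frequency-based selection of pronunciation variants.
# `corpus` is a list of (word, realized_variant) pairs, for instance
# read off a forced-alignment transcription of the training material.
from collections import Counter, defaultdict

def select_variants(corpus, min_rel_freq=0.05):
    counts = defaultdict(Counter)
    for word, variant in corpus:
        counts[word][variant] += 1
    lexicon = {}
    for word, variant_counts in counts.items():
        total = sum(variant_counts.values())
        # keep variants accounting for at least min_rel_freq of the
        # word's realizations (the canonical form could be kept regardless)
        lexicon[word] = [v for v, c in variant_counts.items()
                         if c / total >= min_rel_freq]
    return lexicon
```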

As was mentioned earlier, multi-words can also be added to the lexicon, in an attempt to model cross-word variation at the level of the lexicon. Optionally, the pronunciation variants of multi-words can also be included in the lexicon. By using multi-words, Beulen et al. (1998) and Wester et al. (Kessens et al., 1999; Wester et al., 1998a) achieve a substantial improvement, while for Nock and Young (1998) this was not the case.

Before variants can be selected, they have to be obtained in the first place. Sometimes the pronunciation variants are generated manually (Aubert and Dugast, 1995; Riley et al., 1998, 1999) or selected from enumerated lists (Flach, 1995), but usually they are generated automatically by means of various procedures:

· rules (Adda-Decker and Lamel, 1998, 1999; Aubert and Dugast, 1995; Cohen and Mercer, 1975; Cremelie and Martens, 1995, 1997, 1998, 1999; Flach, 1995; Kessens and Wester, 1997; Kessens et al., 1999; Mercer and Cohen, 1987; Nock and Young, 1998; Ravishankar and Eskenazi, 1997; Schiel et al., 1998; Wester et al., 1998a);

· ANNs (Fukada and Sagisaka, 1997; Fukada et al., 1998, 1999);

· grapheme-to-phoneme converters (Lehtinen and Safra, 1998);

· phone(me) recognizers (Mokbel and Jouvet, 1998; Nock and Young, 1998; Ravishankar and Eskenazi, 1997; Sloboda and Waibel, 1996; Williams and Renals, 1998);

· optimization with a maximum likelihood criterion (Holter, 1997; Holter and Svendsen, 1998, 1999); and

· decision trees (Fosler-Lussier and Morgan, 1998, 1999; Riley et al., 1998, 1999).

In (Aubert and Dugast, 1995; Beulen et al., 1998) the transcriptions of the variants of function words and foreign names are generated manually, while the variants for all other words are generated by rule.

Since rule-based methods are probably used most often, it is interesting to note that Nock and Young (1998) conclude that (for their English task) "rule-based learning methods may not be the most appropriate for learning pronunciations when starting from a carefully constructed, multiple pronunciation dictionary". The question here is whether this conclusion is also valid for other applications in other languages, and whether it is possible to decide in which cases the starting point is a carefully constructed, multiple pronunciation dictionary (see also Section 3).

Given that there are various ways to obtain pronunciation variants, one might wonder what the optimal way is. So far, not many studies have been reported in which different methods were compared. An exception is (Flach, 1995), in which two types of methods for obtaining variants, rule-based and enumerated, are compared. The baseline system makes use of a canonical lexicon with 194 words. If the variants generated by rule are added to the canonical lexicon, making a total of 291 entries, a substantial improvement is observed. However, if all variants observed in the transcriptions of a corpus are added to the canonical lexicon, making a total of 897 entries, an even larger improvement is found. In this particular example, adding all variants found in the corpus would seem to produce better results than adding a smaller number of variants generated by rule. In this respect, some comment is in order.

First, in this example the number of entries in the lexicon was small. It is not clear whether similar results would be obtained with larger lexica. One could imagine that confusability does not increase linearly, and with many entries and many variants it could lead to less positive results.

Second, the fact that a method in which variants are taken directly from transcriptions of the acoustic signals works better than a rule-based one could also be due to the particular nature of the rules in question. As was pointed out in Section 2.2, rules taken from the literature are not always the optimal ones to model variation in spontaneous speech, while information obtained from data may be much better suited for this purpose.


2.4.2. Acoustic models

Pronunciation variation can also be represented at the level of the acoustic models, for instance by optimizing the acoustic models (Aubert and Dugast, 1995; Bacchiani and Ostendorf, 1998, 1999; Beulen et al., 1998; Bonaventura et al., 1998; Deng and Sun, 1994; Finke and Waibel, 1997; Godfrey et al., 1997; Greenberg, 1998, 1999; Heine et al., 1998; Holter, 1997; Kessens and Wester, 1997; Kessens et al., 1999; Lamel and Adda, 1996; Mirghafori et al., 1995; Nock and Young, 1998; Riley et al., 1998, 1999; Schiel et al., 1998; Sloboda and Waibel, 1996; Wester et al., 1998a). Optimization can be attained in different ways, as will be discussed in the current section.

An obvious way of optimizing the acoustic models is by using forced recognition. In Section 2.2 we already explained how forced recognition can be employed to calculate new transcriptions of the signals. In turn, the new transcriptions can be used to train new acoustic models. These new acoustic models can then be used to do forced recognition again, etc. In other words, this process can be iterated. We will refer to this procedure as iterative transcribing. Forced recognition and iterative transcribing have been used often to obtain improved transcriptions and improved acoustic models (Aubert and Dugast, 1995; Bacchiani and Ostendorf, 1998, 1999; Beulen et al., 1998; Finke and Waibel, 1997; Kessens and Wester, 1997; Kessens et al., 1999; Riley et al., 1998, 1999; Schiel et al., 1998; Sloboda and Waibel, 1996; Wester et al., 1998a).

In order to evaluate these procedures, the error rates obtained with the new acoustic models can be compared to those obtained with the old acoustic models. In general, these procedures seem to improve the performance of the ASR systems (Aubert and Dugast, 1995; Bacchiani and Ostendorf, 1998, 1999; Finke and Waibel, 1997; Kessens and Wester, 1997; Kessens et al., 1999; Riley et al., 1998, 1999; Schiel et al., 1998; Sloboda and Waibel, 1996; Wester et al., 1998a). However, Beulen et al. (1998) found that in some cases the performance does not improve, but remains unchanged or even deteriorates. In our own research (Kessens and Wester, 1997; Kessens et al., 1999; Wester et al., 1998a), we found that for iterative transcribing the improvement during the first iteration is almost always (much) larger than that obtained during the following iterations, so usually one iteration is sufficient.
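In outline, iterative transcribing is a simple loop. In the sketch below the training and forced-recognition steps are passed in as functions; they stand for a complete acoustic-model training and alignment pipeline, which is not shown, and the default of one retranscription pass reflects the observation above that one iteration usually suffices.

```python
# Sketch: iterative transcribing. Alternate between retraining the
# acoustic models on the current transcriptions and re-transcribing
# the corpus by forced recognition with the retrained models.

def iterative_transcribing(train, force_align, corpus, lexicon,
                           initial_transcriptions, n_iterations=1):
    transcriptions = initial_transcriptions
    models = train(corpus, transcriptions)
    for _ in range(n_iterations):
        # forced recognition: per word, choose among the lexicon's
        # pronunciation variants only (Section 2.2)
        transcriptions = force_align(corpus, lexicon, models)
        models = train(corpus, transcriptions)
    return models, transcriptions
```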

In forced recognition, pronunciation variants are present in the lexicon during training in order to train new acoustic models. Optionally, pronunciation variants can be retained in the lexicon during recognition (testing). In general, using the variants during recognition is more beneficial than using variants during training, while the best results are obtained when multiple variants are included during both training and recognition (Kessens and Wester, 1997; Kessens et al., 1999; Lamel and Adda, 1996; Wester et al., 1998a). Therefore, it seems worthwhile to test the procedure of forced recognition, because it is a relatively straightforward procedure that can be applied almost completely automatically and because it usually gives an improvement over and above that of using multiple variants during recognition only.

Optimizing the existing acoustic models is one way in which one could try to improve the performance of an ASR system. Another way is by searching for other (hopefully better) basic units. In most ASR systems, the phone is used as the basic unit and, consequently, the lexicon contains transcriptions in the form of strings of phone symbols. However, in some studies experiments are performed with basic units of recognition other than the phone.

For this purpose, sub-phonemic models have been proposed (Deng and Sun, 1994; Godfrey et al., 1997). In (Deng and Sun, 1994) a set of multi-valued phonological features is used. First, the feature values of the speech units in isolation are defined, followed by the (often optional) spreading of features for speech units in context. On the basis of the resulting feature-overlap pattern a pronunciation network is created. The starting point in (Godfrey et al., 1997) is a set of symbols for (allo)phones and sub-phonemic units. These symbols are used to model pronunciation variation due to context, coarticulation, dialect, speaking style and speaking rate. The resulting descriptions (in which almost half of the segments are optional) are used to create pronunciation networks. In both cases, the ASR system decides during decoding what the optimal path in the pronunciation networks is.

Besides sub-phonemic models it is also possible to use basic units larger than phones, e.g., (demi-)syllables (Greenberg, 1998, 1999; Heine et al., 1998) or even whole words. It is clear that using word models is only feasible for tasks with a limited vocabulary (e.g., digit recognition). For most tasks the number of words, and thus the number of word models to be trained, is simply too large. Therefore, in some cases word models are only trained for the words occurring most frequently, while for the less frequent words sub-word models are used. Since the number of syllables is usually much smaller than the number of words (Greenberg, 1998, 1999; Heine et al., 1998), the syllable would seem to be suited as the basic unit of recognition. Greenberg (1998, 1999) mentions several other reasons why, given the existing pronunciation variation, the syllable is a suitable candidate.

If syllable models are used, the within-syllable variation can be modeled by the stochastic model for the syllable, just as the within-phone variation is modeled by the acoustic model of the phone (see e.g., Heine et al., 1998). For instance, in phone-based systems deletions, insertions and substitutions of phones have to be modeled explicitly (e.g., by including multiple pronunciations in the lexicon), while in a syllable-based system these processes would result in different realizations of the syllable.

In most ASR systems, the basic units are defined a priori. Furthermore, while the acoustic models for these basic units are calculated with an optimization procedure, the pronunciations in the lexicon are usually handcrafted. However, it is also possible to allow an optimization procedure to decide what the optimal pronunciations in the lexicon and the optimal basic units (i.e., both their size and the corresponding acoustic models) are (Bacchiani and Ostendorf, 1998, 1999; Holter, 1997). In both (Bacchiani and Ostendorf, 1998, 1999) and (Holter, 1997) the optimization is done with a maximum likelihood criterion.

In (Greenberg, 1998, 1999) no tests are described. For the syllable models in (Heine et al., 1998) the resulting levels of performance are lower than those of standard ASR systems. Furthermore, in (Bacchiani and Ostendorf, 1998, 1999; Deng and Sun, 1994; Godfrey et al., 1997; Holter, 1997) the observed levels of performance are comparable to those of phone-based ASR systems (usually for limited tasks). Although these results are interesting, it remains to be seen whether these methods are more suitable for modeling pronunciation variation than standard phone-based ASR systems, especially for tasks in which a large amount of pronunciation variation is present (e.g., for conversational speech).

2.4.3. Language models

Another component in which pronunciation variation can be taken into account is the language model (LM) (Cremelie and Martens, 1995, 1997, 1998, 1999; Deshmukh et al., 1996; Finke and Waibel, 1997; Fukada et al., 1998, 1999; Kessens et al., 1999; Lehtinen and Safra, 1998; Perennou and Brieussel-Pousse, 1998; Pousse and Perennou, 1997; Schiel et al., 1998; Wester et al., 1998a; Zeppenfeld et al., 1997). This can be done in several ways, as will be discussed below.

Let X be the speech signal that has to be recognized. The goal during decoding then is to find the string of words W that maximizes P(X|W) · P(W). Usually N-grams are used to calculate P(W). If there is one entry for every word in the lexicon, the N-grams can be calculated in the standard way. As we have seen above, the most common way to model pronunciation variation is by adding pronunciation variants to the lexicon. The problem then is how to deal with these pronunciation variants at the level of the LM.

Method 1. The easiest solution is to simply add the variants to the lexicon, and not to change the LMs at all. In this case, for every variant the probabilities for the word it belongs to are used. Since the statistics for the variants are not used, it is obvious that this is a sub-optimal solution. In the following two methods the statistics for the variants are employed.

Method 2. The second solution is to use the variants themselves (instead of the underlying words) to calculate the N-grams (Kessens et al., 1999; Schiel et al., 1998; Wester et al., 1998a). For this procedure, a transcribed corpus is needed which contains information about the realized pronunciation variants. These transcriptions can be obtained in various ways, as has been discussed in Sections 2.2 and 2.4. The goal of this method is to find the string of variants V which maximizes P(X|V) · P(V).

Method 3. A third possibility is to introduce an intermediate level: P(X|V) · P(V|W) · P(W). The goal now is to find the string of words W and the corresponding string of variants V that maximizes the latter expression (Cremelie and Martens, 1995, 1997, 1998, 1999; Fukada et al., 1998, 1999; Perennou and Brieussel-Pousse, 1998; Pousse and Perennou, 1997). P(V|W) determines the probability of the variants given the words, while P(W) describes the probabilities of sequences of words. In this case, the first probabilities can also be calculated on the basis of a transcribed corpus. However, they can also be obtained otherwise. If the pronunciation variants are generated by rule, the probabilities of these rules can be used to determine the probabilities of the pronunciation variants (Cremelie and Martens, 1998, 1999; Lehtinen and Safra, 1998). Likewise, if an ANN is used to generate pronunciation variants, the ANN itself can produce probabilities of the pronunciation variants (Deshmukh et al., 1996; Fukada et al., 1998, 1999).
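In log-probability terms, the three methods combine their component scores as sketched below. The component models (acoustic likelihood, word and variant N-grams, variant-given-word probabilities) are passed in as functions and are hypothetical stand-ins; building them is where all the real work lies, and the per-word factorization of P(V|W) in Method 3 is one common simplifying assumption.

```python
# Sketch: hypothesis scoring for the three ways of handling
# pronunciation variants in the LM, written in log probabilities.
import math

def score_method1(acoustic, word_lm, X, words, variants):
    # P(X|V) * P(W): the LM is unchanged; variants inherit word statistics
    return acoustic(X, variants) + word_lm(words)

def score_method2(acoustic, variant_lm, X, variants):
    # P(X|V) * P(V): N-grams trained on the variants themselves
    return acoustic(X, variants) + variant_lm(variants)

def score_method3(acoustic, word_lm, p_v_given_w, X, words, variants):
    # P(X|V) * P(V|W) * P(W): an intermediate variant level
    lexical = sum(math.log(p_v_given_w(v, w))
                  for v, w in zip(variants, words))
    return acoustic(X, variants) + lexical + word_lm(words)
```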

It is obvious that the number of pronunciation variants is larger than the number of words. As a consequence, more parameters have to be trained for the second method than for the third. This could be a disadvantage of the second method, since sparsity of data is a common problem during the training of LMs. A way of reducing the number of parameters for both methods is to use thresholds, i.e., only pronunciation variants which occur often enough are taken into account.

Another important difference between the two methods is that in the third method the context-dependence of pronunciation variants is not modeled directly in the LM. This can be a disadvantage as pronunciation variation is often context-dependent, e.g., liaison in French (Perennou and Brieussel-Pousse, 1998; Pousse and Perennou, 1997). Within the third method, this deficiency can be overcome by using classes of words instead of the words themselves, i.e., the classes of words that do or do not allow liaison (Perennou and Brieussel-Pousse, 1998; Pousse and Perennou, 1997). The probability of a pronunciation variant for a certain class is then represented in P(V|W), while the probability of sequences of word classes is stored in P(W).

3. Evaluation and comparison

In the previous section, the various approaches to modeling pronunciation variation were described according to their major properties. In that presentation, the emphasis was on the various characteristics of the methods, and not so much on their merits. That is not to say that the effectiveness of a method is not important. On the contrary, the extent to which each study contributes to modeling pronunciation variation in ASR, be it by reducing the number of errors caused by pronunciation variation or by getting more insight into pronunciation variation, is a fundamental aspect, especially if we want to draw general conclusions as to the different ways in which pronunciation variation in ASR can best be addressed.

Considering that pronunciation variation research in ASR still has a long way to go, it would seem that studies that provide insight into the processes underlying pronunciation variation (e.g., Adda-Decker and Lamel, 1998, 1999; Fosler-Lussier and Morgan, 1998, 1999; Greenberg, 1998, 1999; Peters and Stubley, 1998; Strik and Cucchiarini, 1998) are very useful. However, the majority of the papers about modeling pronunciation variation for ASR focus on the effectiveness of a given method of variation modeling in improving recognition performance. It is obvious that studies of this kind could also contribute to gaining insight, if, for instance, some kind of error analysis were carried out, but this does not seem to be a priority.

The effectiveness of word-error-rate (WER) reducing studies is usually established by comparing the performance of the baseline system (the starting point) with the performance obtained after the method has been applied. For every individual study, this seems a plausible procedure. The amounts of improvement reported in the literature (see e.g., the papers in this special issue and in the proceedings of the Rolduc workshop; Strik et al., 1998) range from almost none (and occasionally even a deterioration) to substantial.

In trying to draw general conclusions as to the effectiveness of these methods, one is tempted to conclude that the method for which the largest improvement was observed is the best one. In this respect some comment is in order. First, it is unlikely that there will be one single best approach, as the tasks of the various systems are very different. Second, we are not interested in finding a winner, but in understanding how pronunciation variation can best be approached. So, even a method that does not produce any significant reduction in WER may turn out to be extremely valuable because it increases our understanding of pronunciation variation. Third, it is wrong to take the change in WER as the only criterion for evaluation, because this change is dependent on at least three different factors: (1) the corpora, (2) the ASR system and (3) the baseline system. This means that improvements in WER can be compared with each other only if in the methods under study these three elements were identical or at least similar. It is obvious that in the majority of the methods presented these three elements are not kept constant, but are usually very different. In the following sections we discuss these differences and try to explain why this makes it difficult to compare the various methods and, in particular, the results obtained with each of them.

3.1. Differences between corpora

Corpora are used to gauge the performance of ASR systems. In studies on pronunciation variation modeling, many different corpora are used. The choice of a given corpus at the same time implies the choice of the task, the type of speech and the language. This means that there are at least three respects in which corpora may differ from each other.

Both with respect to task and type of speech it is possible to distinguish between cases with little pronunciation variation (carefully read speech) and cases with much more variation (conversational, spontaneous speech). Given this difference in amount of variation, it is possible that a method for pronunciation variation modeling that performs well for read speech does not perform equally well for conversational speech.

Another important aspect of the corpus is the language. Since pronunciation variation will also differ between languages, a method which gives good results in one language need not be as effective in another language. For example, Beulen et al. (1998) report improvements for English corpora, while with the same method no improvements were obtained for a German corpus. Another example concerns the pronunciation variation caused by liaison in French. Perennou et al. (Perennou and Brieussel-Pousse, 1998; Pousse and Perennou, 1997) propose a method to model this type of pronunciation variation, and for their French corpus this yields an improvement. However, it remains to be seen how effective their method is in modeling pronunciation variation in languages in which other processes take place.

3.2. Differences between ASR systems

As we all know, not all ASR systems are similar. A method that is successful with one ASR system may be less successful with another. This will already be the case for ASR systems with a similar architecture (e.g., a "standard ASR system" with the common phone-based HMMs), but it will certainly be true for ASR systems with totally different architectures. For instance, Cremelie and Martens (1998, 1999) obtain large improvements with a rule-based method for their segment-based ASR system, which does not imply that the same method will be equally successful for another type of ASR system.

Moreover, a method can be successful with a given ASR system, not so much because it models pronunciation variation in the correct way, but because it corrects for the peculiarities of the ASR system. To illustrate this point let us assume that a specific ASR system very often recognizes /n/ in certain contexts as /m/. If the method for pronunciation variation modeling replaces the proper occurrences of /n/ by /m/ in the lexicon, the performance will certainly go up. Such a transformation is likely to occur in a data-driven method in which a DP-alignment is used (see Section 2.3).


By looking at the numbers alone (the performance before and after the method was applied), one could conclude that the method is successful. However, in this particular case, the method is successful because it corrects the errors made by the ASR system. Although one could argue that the error made by the ASR system (i.e., recognizing certain /n/s as /m/) is in fact due to pronunciation variation, the example clearly demonstrates that certain methods may work with a specific ASR system, but do not necessarily generalize to other systems.

Let us state clearly that being able to correct for the peculiarities of an ASR system is not a bad property of a method. On the contrary, if a method has this property it is almost certain that it will increase the performance of the ASR system. This is probably why in (Riley et al., 1998, 1999) it is argued that the ASR system itself should be used to make the transcriptions. The point to be made in the example above is that a posteriori it is not easy to determine which part of the improvement is due to correct modeling of pronunciation variation by the method and which part is due to other causes. In turn, this will make it difficult to estimate how successful a method will be for another ASR system. After all, the peculiarities of all ASR systems are not the same.

3.3. Differences in the measures used for evaluation

In the various studies, different measures are used to express the change in WER. The three measures used most often are discussed in this subsection. In order to illustrate these three measures, some examples with fictitious numbers are presented in Table 1.

First, the performance is calculated for the baseline system, say WERbegin. Then the method is applied, e.g., by adding pronunciation variants to the lexicon, and the performance of the new system is determined, say WERend. The absolute improvement then is

%abs = WERbegin − WERend.

This can also be expressed in relative terms:

%rel1 = 100 · (WERbegin − WERend) / WERbegin.

The measure %rel1 yields higher numbers than the measure %abs, but even higher numbers can be obtained by using

%rel2 = 100 · (WERbegin − WERend) / WERend.

The last equation is generally considered to be less correct. Furthermore, for most people %rel1 is more in agreement with their intuition than %rel2, i.e., most people would say that an improvement from 20% to 10% WER is an improvement of 50% and not an improvement of 100% (see Table 1).

In general, %rel1 is also a better measure than %abs. Most people will agree that the improvements obtained in examples 2 and 3 are more similar than those in examples 1 and 2. To summarize, %rel1 is a measure that is more in line with our intuitions about the amount of improvement than %abs or %rel2. Therefore, it is probably better to use the measure %rel1 to express the changes in WER. In addition, WERbegin and optionally WERend should also be specified.
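For concreteness, the three measures can be computed in a few lines; the sketch below (ours) reproduces the fictitious numbers of Table 1:

    def improvement(wer_begin, wer_end):
        """Return (%abs, %rel1, %rel2) for a baseline WER and a final WER,
        both given in percent."""
        abs_impr = wer_begin - wer_end
        rel1 = 100.0 * (wer_begin - wer_end) / wer_begin
        rel2 = 100.0 * (wer_begin - wer_end) / wer_end
        return abs_impr, rel1, rel2

    # The three fictitious examples of Table 1:
    for wer_begin, wer_end in [(50, 40), (20, 10), (50, 25)]:
        print(improvement(wer_begin, wer_end))
    # -> (10, 20.0, 25.0), (10, 50.0, 100.0), (25, 50.0, 100.0)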

3.4. Differences in the baseline systems

Another reason why it is difficult to compare methods is related to the baseline systems (the starting points) used. Let us explain why. Whatever equation is used, it is clear that the outcome of the equation depends on two numbers: WERbegin and WERend. In most studies, a lot of work is done in order to decrease WERend, and this work is generally described in detail.

Table 1
Some fictitious numerical examples to illustrate the measures %abs, %rel1 and %rel2

Example   WERbegin (%)   WERend (%)   %abs (%)   %rel1 (%)   %rel2 (%)
1         50             40           10         20          25
2         20             10           10         50          100
3         50             25           25         50          100


However, more often than not the baseline system is not clearly described and no attempt is made to improve it. Usually, the starting point is simply an ASR system that was available at the beginning of the research, or an ASR system that is quickly trained with resources available at the beginning of the research. It is clear that for a relatively bad baseline system it is much easier to obtain improvements than for a good baseline system. For instance, a baseline system may contain errors, e.g., errors in the canonical lexicon. During research, part of these errors may be corrected, e.g., by changing the transcriptions in the lexicon. If corrections are made, similar corrections should also be made in the baseline system and WERbegin should be calculated again. If this is not done, part of the resulting improvement is due to the correction of errors and possibly other sources. This makes it difficult to estimate which part of the improvement is really due to the modeling of pronunciation variation.

Besides the presence of errors, other properties of the canonical lexicon will also, to a large extent, determine the amount of improvement obtained with a certain method. Let us assume, for the sake of argument, that the canonical lexicon contains pronunciations (i.e., transcriptions) for a certain accent and speaking style (e.g., read speech). A method is then tested with a corpus that contains speech of another accent and another speaking style (e.g., conversational speech). The method succeeds in improving the lexicon in the sense that the new pronunciations in the lexicon are more appropriate for the speech in the corpus, and a large improvement in the performance is observed. Although it is clear that the method has succeeded in modeling pronunciation variation, it is also clear that the amount of improvement would have been (much) smaller if the lexicon had contained more appropriate transcriptions from the start and not those of another accent and another speech type.

In short, a large amount of research and written explanation is devoted to the reduction of WERend, while relatively little effort is put into WERbegin. Since both quantities determine the amount of improvement, and since the baseline systems differ between studies, it becomes difficult to compare the various methods.

3.5. Objective evaluation

The question that arises at this point is: Is an objective evaluation and comparison of these methods at all possible? This question is not easy to answer. An obvious solution seems to be to use benchmark corpora and standard methods for evaluation (e.g., to give everyone the same canonical lexicon), like the NIST evaluations for automatic speech recognition and automatic speaker verification. This would solve a number of the problems mentioned above, but certainly not all of them. The most important problem that remains is the choice of the language. Like many other benchmark tests it could be (American) English. However, pronunciation variation and the ways in which it should be modeled can differ between languages, as argued above. Furthermore, for various reasons it would favor groups who do research on (American) English. Finally, using benchmarks would not solve the problem of differences between ASR systems.

Still, the large-scale (D)ARPA projects and the NIST evaluations have shown that the combination of competition and objective evaluation (i.e., the possibility to obtain an objective comparison of methods) is very useful. Therefore, it seems advisable to strive towards objective evaluation methods within the field of pronunciation modeling.

4. Discussion and conclusions

One of the most common ways of modeling pronunciation variation is to add pronunciation variants to the lexicon (see Section 2.4.1). This method can be applied fairly easily and it appears to improve recognition performance. However, a problem with this approach is that certain words have numerous variants with very different frequencies of occurrence. Some quantitative data on this phenomenon can be found in Table 2 on page 50 of Greenberg (1998). For instance, if we look at the data for "that", we can see that this word appears 328 times in the corpus used by Greenberg, that it has 117 different pronunciations and that the single most common variant only covers 11% of the pronunciations. The coverage of the other variants will probably decrease gradually from 11% to almost zero. In principle one could include all 117 variants in the lexicon and it is possible that this will improve recognition of the word "that". However, this is also likely to increase confusability. If many variants of a large number of words are included in the lexicon, the confusability can increase to such an extent that recognition performance may eventually decrease. This implies that variant selection constitutes an essential part of this approach.

An obvious criterion for variant selection is frequency of occurrence. Adding very frequent variants is likely to produce a more substantial improvement than adding infrequent variants. However, there is no clear distinction between frequent and infrequent pronunciation variants. Furthermore, besides frequency of occurrence, there will be other important factors that influence recognition performance. For instance, some pronunciation variants will probably constitute no problem for the ASR system, in the sense that they will be recognized correctly even though they (slightly) differ from the representation stored in the lexicon. Other spoken variants will probably cause frequent errors during recognition. In order to improve the performance of the ASR system, it is necessary to know which variants cause recognition errors (and which do not). Furthermore, adding pronunciation variants to the lexicon can solve some recognition errors, but it will certainly also introduce new ones. To optimize performance, one should add those variants for which the balance between solving old errors and introducing new errors is positive. Confusability during the decoding process is a central issue in this respect. However, it will be difficult to predict a priori what the confusability during decoding will be. One way in which our insight into this topic could be enhanced is by doing error analysis, as will be discussed below.
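The balance mentioned above can be made concrete on a development set: for each candidate variant, count how many old errors it solves and how many new ones it introduces, and keep it only if the balance is positive. A minimal sketch (ours; the per-variant statistics and the example numbers are hypothetical, and obtaining them requires recognition runs with and without each variant):

    def select_by_net_gain(variant_stats):
        """variant_stats: variant -> (errors_solved, errors_introduced),
        both measured on a development set; keep variants whose balance
        between solving old errors and introducing new ones is positive."""
        return [v for v, (solved, introduced) in variant_stats.items()
                if solved > introduced]

    # Hypothetical counts for two variants of "that":
    print(select_by_net_gain({"dh ax t": (12, 4), "ae t": (3, 9)}))
    # -> ['dh ax t']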

In most studies mentioned above the emphasis was on reduction of the error rates. However, the difference in the error rates of two systems is only a global measure which does not provide information about the details of the differences in the recognition results. Consequently, in most studies it is not possible to find out how and why improvements were obtained. In order to do so, the recognition errors should be studied in more detail, i.e., more detailed error analysis should be carried out. This can be done by comparing the errors in the recognition results between the old system and the new one. In addition, error analysis could be used not only post hoc, to test the effect of a specific method, but also before applying the method. For instance, it would be informative to know beforehand how many and what kind of errors are made, so as to be able to estimate the maximum amount of improvement that can be achieved. In turn, this could constitute a criterion in deciding whether to test the method at all.

One of the reasons why error analysis is often omitted is related to the availability of data. In order to test the performance of an ASR system, a test corpus is needed. According to the rules of the game, nothing must be known about the test corpus. As soon as one starts doing a detailed error analysis on the test corpus, one learns about the corpus. In turn, this entails that this specific corpus can no longer be used for objective evaluation. For instance, suppose that error analysis revealed that recognition of the word "that" is problematic. In this case one could try to solve this problem, thus improving the performance of the ASR system on that specific corpus. Quite clearly, this is not fair. A possible alternative would be to use the training corpus for error analysis. However, since in this case the material used for training and for testing would be the same, this could influence the outcome of the error analysis. The best option probably would be to use an independent development test set for error analysis.
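As an illustration of the minimal bookkeeping such an error analysis involves, the sketch below compares the word errors made by an old and a new system per utterance on a development test set (ours; a real analysis would additionally align the hypotheses and inspect the individual errors):

    def word_errors(ref, hyp):
        """Minimum number of substitutions, insertions and deletions
        needed to turn the hypothesis into the reference (both are lists
        of words); standard edit distance."""
        prev_row = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            row = [i]
            for j, h in enumerate(hyp, 1):
                row.append(min(prev_row[j] + 1,              # deletion
                               row[j - 1] + 1,               # insertion
                               prev_row[j - 1] + (r != h)))  # substitution
            prev_row = row
        return prev_row[-1]

    def compare_systems(refs, old_hyps, new_hyps):
        """Per-utterance comparison: how many errors the new system
        solved and how many it introduced relative to the old one."""
        solved = introduced = 0
        for ref, old, new in zip(refs, old_hyps, new_hyps):
            e_old, e_new = word_errors(ref, old), word_errors(ref, new)
            solved += max(e_old - e_new, 0)
            introduced += max(e_new - e_old, 0)
        return solved, introduced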

At this point, it may be useful to try to make a general assessment of the state of the art in research on modeling pronunciation variation for ASR. For example, we could start by relating the results obtained so far to the expectations researchers had at the beginning. It is difficult, though, to estimate the researchers' expectations about the gain in recognition performance that could be obtained by modeling pronunciation variation. In any case, judging by the effort that has gone into this type of research, one could conclude that there were high expectations. The results reported so far vary from 0% to 20% relative reduction in the WER. These findings can be interpreted either positively or negatively. Positively: modeling pronunciation variation often improves recognition performance, sometimes even by 20%. Negatively: sometimes recognition performance increases by about 20%, but in most cases improvements are marginal. At the Rolduc workshop the general feeling seemed to be that the results obtained so far did not live up to the expectations.

However, the general idea also seemed to be that research on pronunciation variation modeling has made various useful contributions to ASR. For example, this research has shown the importance of systematic lexicon design and has produced improved, more consistent lexica. Furthermore, different methods for pronunciation variation modeling have been proposed that, in general, lead to some improvement. For instance, it has now become standard procedure to include multiple variants for some of the words in the lexicon. Finally, this research has produced the knowledge that the problem of pronunciation variation is not a simple one.

In any case it is clear that the right solutions have not been found yet. On the contrary, the modest improvements suggest that we are just at the beginning of the path towards the optimal solutions. In other words, more research should be carried out, but the questions are: what kind of research, in which direction?

First of all, more fundamental research is needed to gather more knowledge on pronunciation variation. The papers at the Rolduc workshop that mainly aimed at gaining insight into pronunciation variation are (Adda-Decker and Lamel, 1998, 1999; Fosler-Lussier and Morgan, 1998, 1999; Greenberg, 1998, 1999; Peters and Stubley, 1998; Strik and Cucchiarini, 1998). In (Adda-Decker and Lamel, 1998, 1999; Fosler-Lussier and Morgan, 1998, 1999; Greenberg, 1998, 1999) various frequency counts on the basis of corpora are reported, in (Peters and Stubley, 1998) a method is presented for visualizing speech trajectories, and in (Strik and Cucchiarini, 1998) an overview of the literature is presented. In addition to analysis of speech corpora, it would be useful to study the speech production processes that lead to pronunciation variation and the type of problems that pronunciation variation causes in human speech perception and in ASR.

Finally, it is worth mentioning that at present most researchers use "standard ASR systems" based on discrete segmental representations, HMMs to model the segments, and features that are computed per frame (usually cepstral features and their derivatives). Possibly, the underlying assumptions in these standard ASR systems are not optimal. One of the assumptions is that speech is made up of discrete segments, usually phone(me)s. Although this has long been one of the assumptions in linguistics too, the idea that speech can be phonologically represented as a sequence of discrete entities (the "absolute slicing hypothesis", as formulated in (Goldsmith, 1976, pp. 16-17)) has proved to be untenable. In non-linear, autosegmental phonology (Goldsmith, 1976, 1990) an analysis has been proposed in which different features are placed on different tiers. The various tiers represent the parallel activities of the articulators in speech, which do not necessarily begin and end simultaneously. In turn, the tiers are connected by association lines. In this way, it is possible to indicate that the mapping between tiers is not always one to one. Assimilation phenomena can then be represented by the spreading of one feature from one segment to the adjacent one. On the basis of this theory, Li Deng and his colleagues have built ASR systems with which promising results have been obtained (Deng and Sun, 1994).

Another important assumption of standard ASR systems, which is of course related to the first one, is that feature values can be calculated locally, per individual frame. In addition to this knowledge on the static properties of features, information on their dynamic characteristics can be obtained by computing derivatives of these features. However, the problem remains that the analysis window on which feature values are calculated is very small, while it is known from research on human perception that for perceiving one speech sound subjects rely on information contained in adjacent sounds. If this works satisfactorily in human perception, it is perhaps worthwhile to investigate whether there are better feature representations for ASR, which go beyond segment boundaries and span a larger window. Since the standard ASR systems described above have given promising results so far, researchers are reluctant to abandon this path. The future will show whether this is the right path or a dead end.

Acknowledgements

The research of Dr. H. Strik has been made possible by a fellowship of the Royal Netherlands Academy of Arts and Sciences. We would like to thank Loe Boves, Judith Kessens, Mirjam Wester, Febe de Wet and three anonymous reviewers for their comments on previous versions of this article.

References

Adda-Decker, M., Lamel, L., 1998. Pronunciation variants across systems, languages and speaking style. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 1-6.
Adda-Decker, M., Lamel, L., 1999. Pronunciation variants across system configuration, language and speaking style. Speech Communication 29 (2-4), 83-98.
Aubert, X., Dugast, C., 1995. Improved acoustic-phonetic modeling in Philips' dictation system by handling liaisons and multiple pronunciations. In: Proceedings of Eurospeech-95, Madrid, pp. 767-770.
Bacchiani, M., Ostendorf, M., 1998. Joint acoustic unit design and lexicon generation. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 7-12.
Bacchiani, M., Ostendorf, M., 1999. Joint lexicon, acoustic unit inventory and model design. Speech Communication 29 (2-4), 99-114.
Barnett, J., 1974. A phonological rule compiler. In: Erman, L. (Ed.), Proceedings of the IEEE Symposium on Speech Recognition, Carnegie-Mellon University (IEEE Catalog No. 74CH0878-9 AE), 15-19 April 1974, Pittsburgh, PA, pp. 188-192.
Bell, A., 1984. Language style as audience design. Language in Society 13 (2), 145-204.
Beulen, K., Ortmanns, S., Eiden, A., Martin, S., Welling, L., Overmann, J., Ney, H., 1998. Pronunciation modelling in the RWTH large vocabulary speech recognizer. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 13-16.
Blackburn, C.S., Young, S.J., 1995. Towards improved speech recognition using a speech production model. In: Proceedings of Eurospeech-95, Madrid, pp. 1623-1626.
Blackburn, C.S., Young, S.J., 1996. Pseudo-articulatory speech synthesis for recognition using automatic feature extraction from X-ray data. In: Proceedings of ICSLP-96, Philadelphia, pp. 969-972.
Bonaventura, P., Gallocchio, F., Mari, J., Micca, G., 1998. Speech recognition methods for non-native pronunciation variations. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 17-22.
Cohen, M., 1989. Phonological structures for speech recognition. Ph.D. Thesis, University of California, Berkeley, USA.
Cohen, P.S., Mercer, R.L., 1974. The phonological component of an automatic speech-recognition system. In: Erman, L. (Ed.), Proceedings of the IEEE Symposium on Speech Recognition, Carnegie-Mellon University (IEEE Catalog No. 74CH0878-9 AE), 15-19 April 1974, Pittsburgh, PA, pp. 177-187.
Cohen, P.S., Mercer, R.L., 1975. The phonological component of an automatic speech-recognition system. In: Reddy, D.R. (Ed.), Speech Recognition. Academic Press, New York, pp. 275-320.
Coupland, N., 1984. Accommodation at work: Some phonological data and their implications. International Journal of the Sociology of Language 46, 49-70.
Cremelie, N., Martens, J.-P., 1995. On the use of pronunciation rules for improved word recognition. In: Proceedings of Eurospeech-95, Madrid, pp. 1747-1750.
Cremelie, N., Martens, J.-P., 1997. Automatic rule-based generation of word pronunciation networks. In: Proceedings of Eurospeech-97, Rhodes, pp. 2459-2462.
Cremelie, N., Martens, J.-P., 1998. In search of pronunciation rules. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 23-28.
Cremelie, N., Martens, J.-P., 1999. In search of better pronunciation models for speech recognition. Speech Communication 29 (2-4), 115-136.
Deng, L., Sun, D., 1994. A statistical approach to automatic speech recognition using the atomic speech units constructed from overlapping articulatory features. Journal of the Acoustical Society of America 95 (5), 2702-2719.
Deshmukh, N., Weber, M., Picone, J., 1996. Automated generation of N-best pronunciations of proper nouns. In: Proceedings of ICASSP-96, Atlanta, pp. 283-286.
Downey, S., Wiseman, R., 1997. Dynamic and static improvements to lexical baseforms. In: Proceedings of Eurospeech-97, Rhodes, pp. 1027-1030.
Erman, L., 1974. Proceedings of the IEEE Symposium on Speech Recognition, Carnegie-Mellon University (IEEE Catalog No. 74CH0878-9 AE), 15-19 April 1974, Pittsburgh, PA, 295 pp.
Eskenazi, M., 1993. Trends in speaking styles research. In: Proceedings of Eurospeech-93, Berlin, pp. 501-509.
Ferreiros, J., Macías-Guarasa, J., Pardo, J.M., Villarrubia, L., 1998. Introducing multiple pronunciations in Spanish speech recognition systems. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 29-34.
Finke, M., Waibel, A., 1997. Speaking mode dependent pronunciation modeling in large vocabulary conversational speech recognition. In: Proceedings of Eurospeech-97, Rhodes, pp. 2379-2382.
Flach, G., 1995. Modelling pronunciation variability for spectral domains. In: Proceedings of Eurospeech-95, Madrid, pp. 1743-1746.
Fosler-Lussier, E., Morgan, N., 1998. Effects of speaking rate and word frequency on conversational pronunciations. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 35-40.
Fosler-Lussier, E., Morgan, N., 1999. Effects of speaking rate and word frequency on pronunciations in conversational speech. Speech Communication 29 (2-4), 137-158.
Friedman, J., 1974. Computer exploration of fast speech rules. In: Erman, L. (Ed.), Proceedings of the IEEE Symposium on Speech Recognition, Carnegie-Mellon University (IEEE Catalog No. 74CH0878-9 AE), 15-19 April 1974, Pittsburgh, PA, pp. 197-203.
Fukada, T., Sagisaka, Y., 1997. Automatic generation of a pronunciation dictionary based on a pronunciation network. In: Proceedings of Eurospeech-97, Rhodes, pp. 2471-2474.
Fukada, T., Yoshimura, T., Sagisaka, Y., 1998. Automatic generation of multiple pronunciations based on neural networks and language statistics. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 41-46.
Fukada, T., Yoshimura, T., Sagisaka, Y., 1999. Automatic generation of multiple pronunciations based on neural networks. Speech Communication 27 (1), 63-73.
Giles, H., Powesland, P., 1975. Speech Style and Social Evaluation. Cambridge University Press, Cambridge.
Giles, H., Smith, P., 1979. Accommodation theory: Optimal levels of convergence. In: Giles, H., St. Clair, R. (Eds.), Language and Social Psychology. Blackwell, Oxford.
Godfrey, J.J., Ganapathiraju, A., Ramalingam, C.S., Picone, J., 1997. Microsegment-based connected digit recognition. In: Proceedings of ICASSP-97, Munich, pp. 1755-1758.
Goldsmith, J., 1976. Autosegmental phonology. Doctoral thesis, Massachusetts Institute of Technology, Cambridge. Indiana University Linguistics Club, Bloomington, Indiana; Garland, New York, 1979.
Goldsmith, J.A., 1990. Autosegmental and Metrical Phonology. Blackwell, Oxford.
Greenberg, S., 1998. Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 47-56.
Greenberg, S., 1999. Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation. Speech Communication 29 (2-4), 159-176.
Heine, H., Evermann, G., Jost, U., 1998. An HMM-based probabilistic lexicon. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 57-62.
Holmes, W.J., Russell, M.J., 1996. Modeling speech variability with segmental HMMs. In: Proceedings of ICASSP-96, Atlanta, pp. 447-450.
Holter, T., 1997. Maximum likelihood modelling of pronunciation in automatic speech recognition. Ph.D. Thesis, Norwegian University of Science and Technology, December 1997.
Holter, T., Svendsen, T., 1998. Maximum likelihood modelling of pronunciation variation. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 63-66.
Holter, T., Svendsen, T., 1999. Maximum likelihood modelling of pronunciation variation. Speech Communication 29 (2-4), 177-191.
Imai, T., Ando, A., Miyasaka, E., 1995. A new method for automatic generation of speaker-dependent phonological rules. In: Proceedings of ICASSP-95, Detroit, pp. 864-867.
Jelinek, F., Bahl, L.R., Mercer, R.L., 1974. Design of a linguistic statistical decoder for the recognition of continuous speech. In: Erman, L. (Ed.), Proceedings of the IEEE Symposium on Speech Recognition, Carnegie-Mellon University (IEEE Catalog No. 74CH0878-9 AE), 15-19 April 1974, Pittsburgh, PA, pp. 255-260.
Kaisse, E., 1985. Connected Speech: The Interaction of Syntax and Phonology. Academic Press, Orlando.
Kessens, J., Wester, M., 1997. Improving recognition performance by modelling pronunciation variation. In: Proceedings of the CLS opening Academic Year '97-'98, pp. 1-20 (http://lands.let.kun.nl/literature/kessens.1997.1.html).
Kessens, J.M., Wester, M., Strik, H., 1999. Improving the performance of a Dutch CSR by modelling within-word and cross-word pronunciation variation. Speech Communication 29 (2-4), 193-207.
Kipp, A., Wesenick, M.-B., Schiel, F., 1996. Automatic detection and segmentation of pronunciation variants in German speech corpora. In: Proceedings of ICSLP-96, Philadelphia, pp. 106-109.
Kipp, A., Wesenick, M.-B., Schiel, F., 1997. Pronunciation modeling applied to automatic segmentation of spontaneous speech. In: Proceedings of Eurospeech-97, Rhodes, pp. 1023-1026.
Labov, W., 1972. Sociolinguistic Patterns. University of Pennsylvania Press, Philadelphia.
Lamel, L., Adda, G., 1996. On designing pronunciation lexicons for large vocabulary continuous speech recognition. In: Proceedings of ICSLP-96, Philadelphia, pp. 6-9.
Laver, J., 1994. Principles of Phonetics. Cambridge University Press, Cambridge.
Lehtinen, G., Safra, S., 1998. Generation and selection of pronunciation variants for a flexible word recognizer. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 67-72.
Mercer, R., Cohen, P., 1987. A method for efficient storage and rapid application of context-sensitive phonological rules for automatic speech recognition. IBM J. Res. Develop. 31 (1), 81-90.
Mirghafori, N., Fosler, E., Morgan, N., 1995. Fast speakers in large vocabulary continuous speech recognition: analysis and antidotes. In: Proceedings of Eurospeech-95, Madrid, pp. 491-494.
Mokbel, H., Jouvet, D., 1998. Derivation of the optimal phonetic transcription set for a word from its acoustic realisations. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 73-78.
Mouria-Beji, F., 1998. Context and speed dependent phonemic models for continuous speech recognition. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 79-84.
Murray, I.R., Arnott, J.L., 1993. Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. Journal of the Acoustical Society of America 93 (2), 1097-1108.
Nock, H.J., Young, S.J., 1998. Detecting and correcting poor pronunciations for multiword units. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 85-90.
O'Malley, M.H., Cole, A., 1974. Testing phonological rules. In: Erman, L. (Ed.), Proceedings of the IEEE Symposium on Speech Recognition, Carnegie-Mellon University (IEEE Catalog No. 74CH0878-9 AE), 15-19 April 1974, Pittsburgh, PA, pp. 193-196.
Oshika, B.T., Zue, V.W., Weeks, R.V., Neu, H., 1974. The role of phonological rules in speech understanding research. In: Erman, L. (Ed.), Proceedings of the IEEE Symposium on Speech Recognition, Carnegie-Mellon University (IEEE Catalog No. 74CH0878-9 AE), 15-19 April 1974, Pittsburgh, PA, pp. 204-207.
Perennou, G., Brieussel-Pousse, L., 1998. Phonological component in automatic speech recognition. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 91-96.
Peters, S.D., Stubley, P., 1998. Visualizing speech trajectories. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 97-102.
Polzin, T.S., Waibel, A.H., 1998. Pronunciation variations in emotional speech. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 103-108.
Pousse, L., Perennou, G., 1997. Dealing with pronunciation variants at the language model level for automatic continuous speech recognition of French. In: Proceedings of Eurospeech-97, Rhodes, pp. 2727-2730.
Rabinowitz, A.S., 1974. Phonetic to graphemic transformation by use of a stack procedure. In: Erman, L. (Ed.), Proceedings of the IEEE Symposium on Speech Recognition, Carnegie-Mellon University (IEEE Catalog No. 74CH0878-9 AE), 15-19 April 1974, Pittsburgh, PA, pp. 212-217.
Ravishankar, M., Eskenazi, M., 1997. Automatic generation of context-dependent pronunciations. In: Proceedings of Eurospeech-97, Rhodes, pp. 2467-2470.
Riley, M., Byrne, W., Finke, M., Khudanpur, S., Ljolje, A., McDonough, J., Nock, H., Saraclar, M., Wooters, C., Zavaliagkos, G., 1998. Stochastic pronunciation modelling from hand-labelled phonetic corpora. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 109-116.
Riley, M., Byrne, W., Finke, M., Khudanpur, S., Ljolje, A., McDonough, J., Nock, H., Saraclar, M., Wooters, C., Zavaliagkos, G., 1999. Stochastic pronunciation modelling from hand-labelled phonetic corpora. Speech Communication 29 (2-4), 209-224.
Ristad, E.S., Yianilos, P.N., 1998. A surficial pronunciation model. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 117-120.
Roach, P., Arnfield, S., 1998. Variation information in pronunciation dictionaries. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 121-124.
Roe, D.B., Riley, M.D., 1994. Prediction of word confusabilities for speech recognition. In: Proceedings of ICSLP-94, Yokohama, pp. 227-230.
Romaine, S., 1980. Stylistic variation and evaluative reactions to speech. Language and Speech 23, 213-232.
Rovner, P., Makhoul, J., Wolf, J., Colarusso, J., 1974. Where the words are: lexical retrieval in a speech understanding system. In: Erman, L. (Ed.), Proceedings of the IEEE Symposium on Speech Recognition, Carnegie-Mellon University (IEEE Catalog No. 74CH0878-9 AE), 15-19 April 1974, Pittsburgh, PA, pp. 160-164.
Safra, S., Lehtinen, G., Huber, K., 1998. Modeling pronunciation variations and coarticulation with finite-state transducers in CSR. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 125-130.
Scherer, K.R., Giles, H., 1979. Social Markers in Speech. Cambridge University Press, Cambridge.
Schiel, F., Kipp, A., Tillmann, H.G., 1998. Statistical modelling of pronunciation: It's not the model, it's the data. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 131-136.
Shockey, L., Erman, L.D., 1974. Sub-lexical levels in the HEARSAY II speech understanding system. In: Erman, L. (Ed.), Proceedings of the IEEE Symposium on Speech Recognition, Carnegie-Mellon University (IEEE Catalog No. 74CH0878-9 AE), 15-19 April 1974, Pittsburgh, PA, pp. 208-210.
Sloboda, T., Waibel, A., 1996. Dictionary learning for spontaneous speech recognition. In: Proceedings of ICSLP-96, Philadelphia, pp. 2328-2331.
Strik, H., 1998. Publications on pronunciation variation and ASR. http://lands.let.kun.nl/TSpublic/strik/pron-var/references.html.
Strik, H., Cucchiarini, C., 1998. Modeling pronunciation variation for ASR: overview and comparison of methods. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 137-144.
Strik, H., Kessens, J.M., Wester, M., 1998. Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, 168 pp.
Svendsen, T., Soong, F., Purnhagen, H., 1995. Optimizing acoustic baseforms for HMM-based speech recognition. In: Proceedings of Eurospeech-95, Madrid, pp. 783-786.
Tappert, C.C., 1974. Experiments with a tree search method for converting noisy phonetic representation into standard orthography. In: Erman, L. (Ed.), Proceedings of the IEEE Symposium on Speech Recognition, Carnegie-Mellon University (IEEE Catalog No. 74CH0878-9 AE), 15-19 April 1974, Pittsburgh, PA, pp. 261-266.
Torre, D., Villarrubia, L., Hernández, L., Elvira, J.M., 1997. Automatic alternative transcription generation and vocabulary selection for flexible word recognizers. In: Proceedings of ICASSP-97, Munich, pp. 1463-1466.
Wesenick, M.-B., 1996. Automatic generation of German pronunciation variants. In: Proceedings of ICSLP-96, Philadelphia, pp. 125-128.
Wester, M., Kessens, J.M., Strik, H., 1998a. Improving the performance of a Dutch CSR by modelling pronunciation variation. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 145-150.
Wester, M., Kessens, J.M., Cucchiarini, C., Strik, H., 1998b. Selection of pronunciation variants in spontaneous speech: Comparing the performance of man and machine. In: Proceedings of the ESCA Workshop SPoSS 98 - Sound Patterns of Spontaneous Speech: Production and Perception, Aix-en-Provence, France, 24-26 September 1998, pp. 157-160.
Wester, M., Kessens, J.M., Cucchiarini, C., Strik, H., 1999. Comparison between expert listeners and continuous speech recognizers in selecting pronunciation variants. In: Proceedings of the 14th International Congress of Phonetic Sciences (ICPhS-99), San Francisco, USA, 1999.
Williams, G., Renals, S., 1998. Confidence measures for evaluating pronunciation models. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 151-156.
Wiseman, R., Downey, S., 1998. Dynamic and static improvements to lexical baseforms. In: Strik, H., Kessens, J.M., Wester, M. (Eds.), Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, Rolduc, Kerkrade, 4-6 May 1998. A2RT, University of Nijmegen, pp. 157-162.
Zeppenfeld, T., Finke, M., Ries, K., Westphal, M., Waibel, A., 1997. Recognition of conversational speech using the JANUS speech engine. In: Proceedings of ICASSP-97, Munich, pp. 1815-1818.
