RNN-based prosodic modeling for Mandarin speech and its application to speech-to-text conversion

Wern-Jun Wang a,b,*, Yuan-Fu Liao a, Sin-Horng Chen a,1

a Department of Communication Engineering, National Chiao Tung University, Taiwan, ROC
b Advanced Technology Research Laboratory, Chunghwa Telecommunication Laboratories, Taiwan, ROC

Received 24 December 1999; received in revised form 5 September 2000; accepted 7 November 2000

Abstract

In this paper, a recurrent neural network (RNN) based prosodic modeling method for Mandarin speech-to-text conversion is proposed. The prosodic modeling is performed in the post-processing stage of acoustic decoding and aims at detecting word-boundary cues to assist in linguistic decoding. It employs a simple three-layer RNN to learn the relationship between input prosodic features, extracted from the input utterance with syllable boundaries pre-determined by the preceding acoustic decoder, and output word-boundary information of the associated text. After the RNN prosodic model is properly trained, it can be used to generate word-boundary cues that help the linguistic decoder resolve word-boundary ambiguity. Two schemes for using these word-boundary cues are proposed. Scheme 1 modifies the baseline conventional linguistic decoding search by directly taking the RNN outputs as additional scores and adding them to all word-sequence hypotheses to assist in selecting the best recognized word sequence. Scheme 2 extends Scheme 1 by further using the RNN outputs to drive a finite state machine (FSM) that sets path constraints to restrict the linguistic decoding search. Character accuracy rates of 73.6%, 74.6% and 74.7% were obtained for the systems using the baseline scheme, Scheme 1 and Scheme 2, respectively. In addition, a 17% reduction in the computational complexity of the linguistic decoding search was obtained for Scheme 2. The proposed prosodic modeling method is therefore promising for Mandarin speech recognition. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Recurrent neural network; Prosodic modeling; Speech-to-text conversion; Acoustic decoding; Linguistic decoding

1. Introduction

Prosody is an inherent supra-segmental feature of human speech. It carries the stress, intonation pattern and timing structure of continuous speech, which in turn determine the naturalness of the speech (Wightman and Ostendorf, 1994). In the past, prosodic information was rarely used in speech recognition, but it has become an interesting research issue in recent years. Generally stated, the function of prosodic modeling for speech recognition is to explore the prosodic phrasing of the testing utterance in order to provide useful information to help linguistic decoding in the next stage. The main concern of the prosodic phrasing issue is to

Speech Communication 36 (2002) 247–265
www.elsevier.com/locate/specom

* Corresponding author. Mailing address: 6F, No. 82 Tz-Chung Street, Chung-Li, Taoyuan 320, Taiwan. Tel.: +886-3-4244536; fax: +886-3-4244147. E-mail addresses: [email protected] (W.-J. Wang), [email protected] (S.-H. Chen).
1 Tel.: +886-3-5731822; fax: +886-3-5710116.

0167-6393/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.

PII: S0167-6393(01)00006-1

build a model describing the relationship between input prosodic features extracted from the testing utterance and output linguistic features of the associated text. Two primary approaches to prosodic modeling are based on the boundary-labeling scheme, with or without the use of phonological knowledge. Most methods in the approach using phonological knowledge employ statistical models, such as decision trees and hidden Markov models (HMMs), to detect prosodic phrase boundaries, word prominence, or word accent type (Bou-Ghazale and Hansen, 1998; Wightman and Ostendorf, 1994; Iwano and Hirose, 1998, 1999). The detected cues are used to help resolve syntactic boundary ambiguity (Niemann et al., 1997; Price et al., 1991), to reorder N-best acoustically decoded word sequences (Hunt, 1994; Kompe et al., 1995), or to improve mora recognition for Japanese speech (Iwano and Hirose, 1999). The statistical model used in this approach can be trained on a large speech database in which the major and minor breaks of the prosodic phrase structure and/or the prominence levels of words are given via proper labeling. A well-known prosody labeling system is the Tones and Break Indices (TOBI) system (Grice et al., 1996; Silverman et al., 1992), which labels prosodic phrase boundaries using a seven-level scale. Two main problems with this approach can be identified. One is that the prosodic labeling of the training utterances must be done by linguistic experts. This is cumbersome work, and consistency in labeling is difficult to maintain over the whole database. The other is that the relationship between prosodic phrase-boundary labels and the syntactic structure of the associated text must be explored further before the detected prosodic phrasing information can be used properly in linguistic decoding.
The other approach, which does not use phonological knowledge, directly uses syntactic features of the associated text as the output targets for modeling the prosodic features of the input utterance (Batliner et al., 1996; Kompe et al., 1997; Price et al., 1991; Hirose and Iwano, 1997, 1998). One problem with this approach is that the syntactic phrase structure does not completely match the prosodic phrase structure. Most prosodic phrases contain one to several syntactic phrases, and in some cases a long syntactic phrase can be split into several prosodic phrases. This mismatch may degrade the accuracy of the prosodic modeling and hence decrease its usability in linguistic decoding. A hybrid approach which takes both prosodic tendency and syntactic compatibility into consideration has also been studied recently (Batliner et al., 1996; Kompe et al., 1997). A new prosody-labeling scheme which includes perceptual-prosodic boundaries and syntactic boundaries was developed in that hybrid approach.

Prosodic modeling is even more important for Mandarin speech recognition than for word-based languages such as English. This is because Chinese is a character-based language. A Chinese text is composed of clauses and sentences that end with punctuation marks (PMs) such as the comma and period. A clause or sentence is formed by concatenating words, each consisting of one to several characters. Although the word is the smallest meaningful unit in syntax, the character is the basic pronunciation unit. Each character is pronounced as a syllable with an associated tone. Because no special marks are used to delimit word boundaries, inter-word syllable boundaries are in general not specially emphasized when a Chinese text is pronounced. This makes the determination of word boundaries a problem to be solved in Mandarin speech-to-text conversion. Conventionally, the problem is solved in linguistic decoding using the acoustic decoding results and statistical language models such as a word-class bigram model. However, since human beings rely mainly on prosodic information in word perception, proper prosodic modeling can surely provide useful cues to assist in solving the problem. In some cases, cues provided by a prosodic model are more efficient than a statistical language model. This is demonstrated by the following example, in which S1 and S2 show two candidate output sentences of linguistic decoding, with and without the help of prosodic information, for the same acoustically decoded syllable sequence:
S1. (Chu-Yung Kuan is a military strategic frontier pass.)
S2. (Chu-Yung Kuan soldier's home military strategic frontier pass.)


Here, the vertical bars in S1 indicate the word-boundary cues suggested by a prosodic model (to be discussed in Section 2). Sentence S2, an incorrect result, is generated by a conventional linguistic decoder based on a language model involving word unigram and word-class bigram probabilities. Sentence S1, the correct one, is obtained by our proposed method, which incorporates an RNN prosodic model into the above conventional linguistic decoder. In the past, studies of prosodic modeling for Mandarin speech recognition were few (Bai et al., 1997; Hsieh et al., 1996; Lyu et al., 1995). In the Golden Mandarin dictation machine (Lyu et al., 1995), the concept of the breath group was used to make recognition operate in a prosodic-segment-by-prosodic-segment mode. Bai et al. (1997) used pitch and energy features to detect possible syllable and word boundaries for pruning unnecessary path searches in keyword recognition. Hsieh et al. (1996) used energy and pitch information to detect syllable boundaries and a syllable-lengthening factor to detect phrase boundaries, thereby reducing the computational complexity of the recognition search. Although recognition speed improved significantly in those studies, slight losses in recognition rate were incurred.

In this paper, a new RNN-based prosodic modeling method for Mandarin speech recognition is proposed. It is performed in the post-processing stage of acoustic decoding and aims at detecting word-boundary cues of the input utterance to help the subsequent linguistic decoder resolve word-boundary ambiguity. It uses an RNN to detect word-boundary information from input prosodic features extracted from the testing utterance, with syllable boundaries pre-determined by the preceding acoustic decoder. The detected word-boundary information is then used in linguistic decoding to assist in determining the best word (or character) sequence.

Two distinct properties of the proposed method

can be found in comparison with previous studies (Batliner et al., 1996; Grice et al., 1996; Kompe et al., 1997; Silverman et al., 1992). One is that it adopts word-boundary information as the output targets to be modeled, instead of the conventional multi-level prosodic marks, such as those of the TOBI system (Grice et al., 1996; Silverman et al., 1992), or the prosodic-syntactic features (Batliner et al., 1996; Kompe et al., 1997). This leads to the following three advantages, although word-boundary detection performance may degrade in some cases when two or more words are combined into a word chunk and pronounced within a single prosodic phrase. First, the method uses an RNN to automatically learn to perform prosodic phrasing of Mandarin utterances, implicitly storing the mapping within the internal RNN representation; no explicit prosodic labeling of the speech signals is needed. Second, it is easy to incorporate the prosodic model into the linguistic decoder, either by a soft-decision scheme which directly takes the RNN outputs as additional scores, or by a partial-soft-and-partial-hard-decision scheme which uses the RNN outputs to drive an FSM that sets path constraints to restrict the linguistic decoding search (to be discussed in Section 3). Both schemes can cope with the degradation in word-boundary detection caused by pronouncing a word chunk within a prosodic phrase. Third, it is relatively easy to prepare a large training database without the help of linguistic experts. Only a simple word tokenization system is needed to analyze the texts associated with the training utterances to find the output targets to be modeled; neither complicated syntactic analyses nor cumbersome prosodic-mark labeling is needed. The other property of the proposed method lies in its use of neural network technology to solve the prosodic phrasing problem.

The organization of the paper is as follows. Section 2 presents the proposed prosodic modeling method. Section 3 describes the functional blocks of the Mandarin speech-to-text conversion system; the way the prosodic model is incorporated into linguistic decoding is discussed in detail. The effectiveness of the prosodic modeling method and its usefulness in helping linguistic decoding were evaluated by simulation experiments, which are discussed in Section 4. Conclusions are given in Section 5.

2. The proposed prosodic modeling method

Before discussing the proposed prosodic modeling method, we briefly introduce the characteristics


of Mandarin Chinese. Mandarin Chinese is a tonal, syllabic language. There exist more than 80,000 words, each composed of one to several characters. There are more than 10,000 commonly used characters, each pronounced as a mono-syllable with one of five tones. The total number of phonologically allowed mono-syllables is only 1345. All mono-syllables have a very regular, hierarchical phonetic structure, as shown in Fig. 1 (Cheng, 1973). A mono-syllable is composed of a base-syllable and a tone. There are in total 411 base-syllables. A base-syllable can be further decomposed into two parts: an optional initial (onset (Yin, 1989)) and a final (rime (Yin, 1989)). The initial part contains a single consonant, if it exists. The final part consists of an optional medial (semi-vowel (Wu, 1998)), a vowel nucleus, and an optional nasal ending (coda (Yin, 1989)). The 411 base-syllables are formed by all legal combinations of 21 initials and 39 finals; the 39 finals are, in turn, formed by combinations of 3 medials, 9 vowel nuclei and 5 nasal endings. There are only five lexical tones: Tone 1 (high-level), Tone 2 (high-rising), Tone 3 (low-dipping), Tone 4 (high-falling) and Tone 5 (neutral). Conventionally, a complete continuous Mandarin speech recognition system is composed of two components: acoustic decoding for mono-syllable identification and linguistic decoding for word (or character) string recognition. Owing to the regular hierarchical phonetic structure of mono-syllables, acoustic decoding is traditionally further decomposed into two sub-components: base-syllable recognition and tone recognition. In this study, we add a new RNN-based prosodic model to the conventional continuous Mandarin speech recognition system by inserting it between acoustic decoding and linguistic decoding.
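As a concrete illustration, the hierarchical decomposition described above can be sketched in code. This is only an illustrative sketch; the class name, the romanization scheme and the example segmentation below are our own assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MonoSyllable:
    """One Mandarin mono-syllable, following the hierarchy of Fig. 1:
    base-syllable = [initial] + final; final = [medial] + nucleus + [ending];
    plus one of the five lexical tones."""
    initial: Optional[str]  # onset consonant (21 types), None if absent
    medial: Optional[str]   # semi-vowel (3 types), None if absent
    nucleus: str            # vowel nucleus (9 types)
    ending: Optional[str]   # nasal coda (5 types), None if absent
    tone: int               # 1..5 (Tone 5 is the neutral tone)

    def base_syllable(self) -> str:
        # The base-syllable is the tone-independent part of the pronunciation.
        return "".join(p for p in (self.initial, self.medial, self.nucleus, self.ending) if p)

# Hypothetical example: the syllable /chiu2/ from Fig. 5, segmented here as
# initial "ch", medial "i", nucleus "u", no nasal ending, Tone 2.
chiu2 = MonoSyllable(initial="ch", medial="i", nucleus="u", ending=None, tone=2)
print(chiu2.base_syllable())  # chiu
```

Note that only 411 of the possible initial/final combinations, and only 1345 of the 411 × 5 tonal combinations, are phonologically legal, so a real system would validate candidates against a syllabary rather than generate combinations freely.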

Fig. 2 shows a block diagram of the proposed RNN-based prosodic modeling method. It operates in two phases: a training phase and a testing phase. In the training phase, each training utterance is first processed by acoustic decoding to obtain the best syllable-boundary segmentation matching the associated text. Prosodic features are then extracted based on this segmentation. Meanwhile, the text associated with the input utterance is tokenized using a statistical model-based method to extract word-boundary information. An RNN prosodic model is then trained to learn the relationship between the input prosodic features of the training utterance and the output word-boundary information of the associated text. In the testing phase, the input utterance is first processed by acoustic decoding to generate a top-N base-syllable lattice (discussed in detail in Section 3). Prosodic features are then extracted based on the syllable-boundary segmentation of the top-1 base-syllable sequence. The well-trained RNN prosodic model is then employed, with these prosodic features as inputs, to generate word-boundary cues for the base-syllable lattice. An FSM is then used to discriminate reliable word-boundary cues from unreliable ones. These detected word-boundary cues are finally used to help linguistic decoding in the next stage. Fig. 3 shows the architecture of the simple RNN. It is a three-layer network with all outputs of the hidden layer fed back to the input layer as additional inputs (Elman, 1990). An RNN of this type has been shown in previous studies (Chen et al., 1998; Elman, 1990; Robinson, 1994) to possess a good ability to learn the complex relationship between an input feature-vector sequence and the output targets by implicitly storing the

Fig. 1. The phonological hierarchy of Mandarin syllables. Here, the number in parentheses indicates the total number of the specified unit in Mandarin Chinese.


contextual information of the input sequence in its hidden layer. It is therefore suitable for realizing a complex mapping between input prosodic features and output linguistic features. The RNN can be trained by the backpropagation-through-time (BPTT) algorithm (Haykin, 1994).

Inputs of the RNN prosodic model include acoustic features extracted from several syllable segments of the top-1 base-syllable sequence surrounding the current syllable segment. The features used in this study were selected for their close relation to the prosody of the speech signal. Pitch, energy and timing information are known to be prosody-related features and have therefore been widely used in previous prosodic-modeling studies (Campbell, 1993; Hirose and Iwano, 1997, 1998; Kompe et al., 1995; Wightman and Ostendorf, 1994). Since prosody is a supra-segmental feature of the speech signal, the prosody-related features considered must span speech segments much larger than a frame. In this study, we choose the syllable segment as the basic unit from which to extract features for prosodic modeling, because the syllable is the basic pronunciation unit. For each syllable segment of the top-1 base-syllable sequence, two prosodic feature sets are extracted. One contains local features of the current syllable segment, while the other contains contextual features extracted from the syllable segment and its two nearest neighbors. Fig. 4 shows a schematic diagram of the feature extraction. Local features in the first set include: (1) the mean M_t and slope S_t of the pitch contour of the current syllable segment, (2) the log-energy mean E_t and the normalized log-energy mean NE_t of the final part of the syllable segment, and (3) the normalized duration ND_t of the syllable segment. Here t is the index of the syllable segment, and the two normalization operations are performed with respect to the final type F(t) of the tth

Fig. 2. A block diagram of the RNN-based prosody modeling method.

Fig. 3. The structure of the prosodic-modeling RNN.


syllable segment. The normalization operation is defined as follows. Let the mean and standard deviation of the log-energies of finals of type F(t) be μ^E_F(t) and σ^E_F(t), respectively. Then NE_t is obtained by subtracting μ^E_F(t) from the log-energy mean E_t and dividing by σ^E_F(t), i.e., NE_t = (E_t − μ^E_F(t)) / σ^E_F(t). The same normalization process is applied to the duration of the syllable segment to calculate ND_t. We note that these two normalization operations compensate for the high variability of the final in both log-energy level and syllable duration. The reasons for using these local features in the prosodic modeling study are briefly discussed as follows. The pitch mean and log-energy mean of a syllable segment are useful in discriminating different states of a prosodic phrase, because both the pitch level and the log-energy level in the beginning part of a prosodic phrase are usually much higher than those in the ending part. The duration of a syllable segment is useful in identifying the ending point of a prosodic phrase, because a lengthening effect always occurs at the last syllable of a prosodic phrase.
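The normalization just described is a per-category z-score. The following minimal sketch (the helper name and the example statistics are our own, not from the paper) normalizes a syllable's log-energy mean against the training statistics of its final type F(t); the same helper applies unchanged to syllable durations to obtain ND_t.

```python
from statistics import mean, stdev

def zscore_by_category(value, category, samples_by_category):
    """Normalize `value` against the mean and standard deviation of all
    training samples sharing the same category (here, the final type F(t)):
    NE_t = (E_t - mu) / sigma."""
    pool = samples_by_category[category]
    mu, sigma = mean(pool), stdev(pool)
    return (value - mu) / sigma

# Hypothetical training statistics: log-energy means of finals, grouped by final type.
energy_stats = {"an": [1.0, 2.0, 3.0], "iu": [0.5, 1.5, 2.5]}
ne = zscore_by_category(2.5, "an", energy_stats)
print(round(ne, 2))  # 0.5
```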

The second set contains contextual features extracted from the current syllable segment and its two nearest neighbors. They include: (1) two flags, FB and FE, indicating whether the current syllable segment is, respectively, the beginning or the ending syllable segment of a sentence; (2) two values giving the durations, d_pp and d_sp, of the non-pitch segments between the current pitch contour and its two nearest neighbors; (3) the two normalized inter-syllable pause durations, Nd_ps and Nd_ss, on either side of the current syllable; (4) the two pitch-mean differences, M_t − M_{t−1} and M_t − M_{t+1}; and (5) the two log-energy-mean differences, E_t − E_{t−1} and E_t − E_{t+1}, between the current syllable and its two nearest neighbors. Here the normalizations for Nd_ps and Nd_ss are performed with respect to the initial types, I(t) and I(t+1), of the current and succeeding syllables, respectively. Specifically, let the mean and standard deviation of the pause durations preceding initials of type I(t) be μ^pd_I(t) and σ^pd_I(t). Then Nd_ps is obtained by subtracting μ^pd_I(t) from the pause duration d_ps preceding the syllable segment and dividing by σ^pd_I(t). The same

Fig. 4. A schematic diagram showing the extraction of the input prosodic features for the RNN-based prosodic modeling.


normalization process is applied to the pause duration d_ss following the syllable segment to obtain Nd_ss. These two normalization operations compensate for the effect of the succeeding initial on pause duration. Note that the pitch mean and log-energy mean are set to zero for a missing preceding or succeeding syllable segment; likewise, d_pp and d_sp are set to zero for the first and last syllable segments of a sentence, respectively. The reasons for using these features in the prosodic modeling study are as follows. The two flags in (1) set the boundary conditions. The inter-syllable pause duration is used because it takes relatively large values at both major and minor breaks. The pitch-mean and log-energy-mean differences are used because of the relatively large jumps of the pitch level and the energy level at prosodic phrase boundaries. Notice that similar features have been used in previous studies; for instance, the fundamental-frequency mean difference of morae was used in detecting the accent type of prosodic words (Iwano and Hirose, 1998, 1999; Iwano, 1999).

In total, 15 prosodic features are extracted for each syllable segment. The number of input features for a syllable segment together with its six nearest neighbors therefore equals 7 × 15 = 105. The question of the suitable number of input features for this problem could be raised. However, owing to the dimensionality-reduction function of a multi-layer neural network (Morgan and Scofield, 1991), the features formed in the hidden layer correspond to a low-dimensional projection of the input feature space, and information non-essential for output classification can be removed automatically.

The evolution of the calculation of prosodic

features in a sample utterance is shown in Fig. 5. The waveform and the segmentation information are shown in the upper part of the figure. The utterance contains 11 syllables. The solid, dashed and dotted lines represent the starting points of the syllables, the junctions of initial and final, and the ending points of the syllables, respectively. The middle part of the figure shows the F0 contour of the utterance; the star symbols represent the F0 means of the 11 syllables. The energy

Fig. 5. The waveform, segmentation information, F0 contour and energy contour of the sample utterance ‘‘/ch2/ /yeh4/ /lan2/ /chiu2/ /dui4/ /yuan2/ /shen1/ /tsai2/ /hsiang1/ /dang1/ /kao1/’’ (The professional basketball players are very tall.).


contour of the utterance is shown in the lower part of the figure; the star symbols represent the energy means of the 11 syllables. All the prosodic features can be calculated from the segmentation information, the F0 means and the energy means shown in this figure. It can be seen from the figure that the F0-mean and energy-mean features represent the global variation of the utterance well. Moreover, in order to take the effect of a wider context into account in the prosodic modeling, the prosodic feature sets of the current syllable segment and its six nearest neighbors are assembled and taken as the inputs of the RNN prosodic model. In the feature-assembling process, two things need special care. One is the elimination of duplicate contextual prosodic features collected across neighboring syllable segments, which increases the efficiency of the prosodic modeling. The other is the setting of boundary conditions for syllable segments in the beginning and ending parts of each utterance. A simple scheme that sets all non-existing prosodic features to zero is adopted in this study. Investigating the responses of the RNN prosodic model to these boundary conditions confirmed the suitability of this simple boundary-condition-setting scheme.

The outputs of the prosodic-modeling RNN

include four linguistic features: four flags indicating whether the current syllable is a mono-syllabic word, or the beginning, an intermediate, or the ending syllable of a polysyllabic word. These four output features are denoted MW (a mono-syllabic word), BPW (the beginning syllable of a polysyllabic word), IPW (an intermediate syllable of a polysyllabic word) and EPW (the ending syllable of a polysyllabic word), respectively. To prepare output targets for training the RNN, the texts associated with all training utterances are tokenized into word sequences in advance by an automatic statistical model-based algorithm with a long-word-first criterion (Su, 1994). The algorithm was found to achieve a word-tokenization accuracy rate of around 97% (Su, 1994). A lexicon containing 111,243 words is used in the tokenization process. In addition, several simple word-merging rules are used to improve the word tokenization. These are bracketing rules which

construct certain types of compound words missing from the current lexicon, including character-duplicated compound words (e.g., (happy)), determiner-measure compound words (e.g., (one type)), short verb–preposition compound words (e.g., (sit at)), noun-localizer compound words (e.g., (inside forest)), short adverb–verb compound words (e.g., (can predict)), negation-verb compound words (e.g., (not keep time)), short adjective–noun compound words (e.g., (small stone)), etc. Some tokenization errors are corrected manually. All output targets for training the RNN are extracted from these tokenized word sequences. It is worth noting that, for the following two reasons, we do not use high-level syntactic features, such as syntactic phrase boundaries, in this study. One is that it is generally not easy to perform syntactic analysis on unrestricted texts of natural Chinese. The other is that the syntactic structure of a Chinese text is not isomorphic to the prosodic phrase structure of the corresponding Mandarin speech.

To check the classification ability of the RNN,

the four flags showing the location of the current syllable within a word are considered. This flag set is referred to as the Word-tag set. An FSM is used to examine whether the responses of the RNN are good enough to make reliable classifications for this flag set. The purpose of the FSM is to provide a partial-hard-and-partial-soft classification scheme for safely invoking the classification results in the speech recognition process. We can thus exploit pre-classification of the input speech signal to design a sophisticated search procedure for recognition when the pre-classification is reliable, while avoiding unrecoverable speech recognition errors caused by pre-classification mistakes. The topology of the Word-tag FSM is shown in Fig. 6. The symbols ‘‘M’’, ‘‘B’’, ‘‘I’’, ‘‘E’’ and ‘‘U’’ in the Word-tag FSM denote the ‘‘MW’’, ‘‘BPW’’, ‘‘IPW’’ and ‘‘EPW’’ states and an uncertain state, respectively. The operation of the FSM is as follows. When one RNN output is higher than the other three outputs by a threshold Thd and is also higher than a high threshold Thh, the FSM moves to the associated stable state if that transition is legal; otherwise it moves to the uncertain (U) state. These two thresholds are determined empirically by considering the tradeoff between the classification accuracy rate and the number of undetermined responses.
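The threshold test described above can be sketched as follows. The legality table and the threshold values here are illustrative assumptions only; the actual FSM topology is defined by Fig. 6, and Thd and Thh were set empirically by the authors.

```python
TAGS = ["M", "B", "I", "E"]  # MW, BPW, IPW, EPW

# Hypothetical legality table (the true topology is given by Fig. 6):
# after a word ends (M or E) a new word begins; inside a word it continues or ends.
LEGAL = {
    "M": {"M", "B"}, "E": {"M", "B"},
    "B": {"I", "E"}, "I": {"I", "E"},
    "U": {"M", "B", "I", "E"},  # from the uncertain state, any tag is reachable
}

def fsm_step(state, rnn_out, thd=0.3, thh=0.6):
    """One FSM transition given the four RNN outputs for the current syllable.
    Moves to a stable state only if the winning output exceeds the other three
    by thd, exceeds thh, and the transition is legal; otherwise falls into U."""
    best = max(TAGS, key=lambda t: rnn_out[t])
    margin_ok = all(rnn_out[best] - rnn_out[t] > thd for t in TAGS if t != best)
    if rnn_out[best] > thh and margin_ok and best in LEGAL[state]:
        return best
    return "U"

print(fsm_step("E", {"M": 0.9, "B": 0.05, "I": 0.03, "E": 0.02}))  # M
print(fsm_step("B", {"M": 0.9, "B": 0.05, "I": 0.03, "E": 0.02}))  # U  (M is illegal after B)
```

A low-confidence response (e.g., two outputs close together, or all below Thh) also yields U, which is how unreliable word-boundary cues are kept out of the linguistic decoding search.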

3. The Mandarin speech-to-text conversion system incorporating the RNN prosodic model

A complete block diagram of the proposed speech-to-text system incorporating the RNN prosodic model is shown in Fig. 7. The input speech is first preprocessed to extract acoustic features for base-syllable recognition. The recognition features include 12 Mel-frequency cepstral coefficients (MFCCs), 12 delta MFCCs, and a delta log-energy. Then, HMM-based base-syllable recognition is performed to generate a top-N base-syllable lattice. The HMM recognizer uses 100 three-state right-final-dependent (RFD) initial models and 39 five-state context-independent (CI) final models to form 411 eight-state base-syllable models, and is trained using the maximum likelihood criterion. For silence, a single-state model is used. The observation features in each HMM state are modeled by a Gaussian mixture distribution. The number of mixture components in each state varies from 1 to 20, depending on the amount of training data available for that state. The top-N base-syllable lattice is generated by the Viterbi-parallel-backtrace method (Huang and Wang, 1994). The method consists of two steps: a forward one-stage Viterbi search and a parallel-backtracking (PB) procedure. In the first step, the top-1 base-syllable string matching the input utterance is obtained by a one-stage Viterbi search. The corresponding base-syllable boundaries are detected by decoding the top-1 base-syllable string through a simple backtracking procedure. In the second step, the PB procedure is applied to expand the top-1 base-syllable string into the top-N base-syllable lattice. The PB procedure keeps all base-syllable boundaries and expands all top-1 base-syllables in parallel to produce the base-syllable candidates of the top-N lattice. For each top-1 base-syllable, it first sets a search range by fixing the end boundary to the ending point of the top-1 base-syllable segment and relaxing the beginning boundary to a small range around the beginning point of the top-1 base-syllable. It then matches the speech segment in the search range against the HMM models of all base-syllables other than the top-1 base-syllable to find the remaining top-N candidates by time-reversed Viterbi searches.

With all base-syllable boundaries pre-determined by the HMM base-syllable recognizer, tone recognition and prosodic modeling are then performed. Since tonality carries lexical meaning, as discussed in Section 2, tone recognition is important in Mandarin speech recognition. An RNN tone recognizer (Wang and Chen, 1994) is used in this study. It also employs an RNN with the same

Fig. 6. The topology of the Word-tag FSM.

Fig. 7. A functional block diagram of the proposed Mandarin speech-to-text conversion system.


structure shown in Fig. 3 to discriminate the five lexical tones using recognition features extracted from the speech surrounding the current syllable. The feature extraction is as follows. Given all base-syllable boundaries, two sets of features, local and contextual, are extracted for every syllable. The local features are extracted from the current syllable segment and include: (1) four orthogonal transform coefficients (Chen et al., 1998) representing the mean and the shape of the pitch contour, (2) the log-energy mean, (3) the duration of the syllabic pitch contour, (4) the syllable initial type, (5) the syllable medial type, (6) the syllable final type, and (7) the syllable nasal-ending type. The contextual features include: (1) eight orthogonal transform coefficients representing the pitch contours of the two nearest neighboring syllables, (2) FB and FE, (3) dpp and dsp, and (4) Ndps and Ndss. In total, 53 features are used in the tone recognition. Note that some features are shared by the RNN prosodic model and the RNN tone recognizer. Using these tonal features, the RNN tone recognizer generates the best M tone candidates for each base-syllable segment.

Then the outputs of the base-syllable recognizer, the tone recognizer, and the prosodic models are all fed into the linguistic decoder to generate the recognized word (character) string. The linguistic decoder first combines the top-N base-syllable lattice generated by the base-syllable recognizer with the best M tone candidates generated by the tone recognizer to form an M × N tonal-syllable lattice. It then performs a decoding search on the syllable lattice to find the best word string. The decoding search employs a Viterbi search algorithm combined with a word-construction process and a statistical language model. The word-construction process uses a lexicon containing 111,243 entries. Each entry consists of one to five syllables. A backward lexical tree is built from the lexicon for word construction. Each node of the lexical tree represents a tonal syllable. The number of nodes in each layer of the lexical tree is listed in Table 1. Fig. 8 shows a small portion of the lexical tree. Here, nodes circled with solid and dashed lines represent leaf and intermediate nodes, respectively. A list of homonym words may exist on a leaf node. The information registered in each node includes the address of the next node in the same layer, the address of the child node in the succeeding layer, the tonal-syllable code, a flag showing whether the current syllable is the beginning syllable of a word, the word unigram probability, and a list of all homonym words. The statistical language model used in the linguistic decoding search accommodates both word-unigram and word-class-bigram probabilities. The word classes used in calculating the word-class-bigram probabilities are generated by a simple word classification scheme that exploits a special property of the Chinese language: many polysyllabic words sharing the same beginning or ending character have the same

Table 1
The node number distribution of the lexical tree

Layer        First   Second  Third   Fourth  Fifth   Total
Node number  1217    69811   37169   14238   808     123243

Fig. 8. A small part of the backward lexical tree.


function in syntax or semantics (Yang et al., 1994). For example, (train), (car), (bicycle), etc., all end with the character ‘‘ ’’, which is a generic name for vehicles; (vice chairman), (vice president), (vice principal), etc., all begin with the character ‘‘ ’’. Two sets of word classes are generated to exploit this property. One is formed by clustering all words with the same beginning base-syllable into a class; it is referred to as CR. The other is formed by clustering all words with the same ending base-syllable into a class and is referred to as CL. Both CR and CL contain 411 word classes. Each word-class-bigram probability therefore represents the frequency of a word-class pair whose left and right word classes belong to CL and CR, respectively. A well-tagged corpus containing about three million words (Chinese Knowledge Information Processing Group, 1995) is used to train the statistical language model. Although the number of all possible combinations of right and left word classes is 411 × 411, only 73,761 combinations occur in the corpus. All missing word-class-bigram probabilities are simply assigned a low probability value.

Incorporating the word-construction process

and the statistical language model, the Viterbi search calculates discriminant scores for all word-sequence hypotheses and finds the word sequence with the maximal discriminant score as the recognized result. Three searching schemes are used in this study: the conventional (baseline) scheme and two schemes assisted by the proposed RNN prosodic model. In the baseline scheme, without the prosodic model, the Viterbi search starts a searching process at each syllable segment of the syllable lattice. The searching process continues backward along the syllable lattice and ends at the location of the fourth syllable segment ahead, because the maximum word length is five syllables for the lexicon used in this study. In the backward searching process, all possible words with lengths from one to five syllables are constructed using the backward lexical tree. The score of each word-sequence candidate W is formed by combining the likelihood scores of all constituent base-syllables provided by the HMM base-syllable recognizer, the scores of all constituent tones provided by the RNN tone recognizer, the unigram probabilities of all constituent words, and the bigram probabilities of all constituent word-class pairs, i.e.,

L(W) = \sum_{k=1}^{K} w_{HMM} \ln p_{HMM}(b_k) + \sum_{k=1}^{K} w_{tone} \ln o(t_k)
     + \sum_{l=1}^{L} w_{UG} \ln p_{UG}(w_l) + \sum_{l=2}^{L} w_{BG} \ln p_{BG}(C_R(w_l) | C_L(w_{l-1})),   (1)

where W = {w_1, w_2, ..., w_L} = {s_1, s_2, ..., s_K} consists of L words or, equivalently, K syllables; w_l is the l-th word; s_k = (b_k, t_k) is the k-th syllable, formed by base-syllable b_k and tone t_k; o(t_k) is the score of tone t_k; p_HMM(b_k) is the likelihood score of base-syllable b_k; p_UG(w_l) is the unigram probability of word w_l; p_BG(C_R(w_l) | C_L(w_{l-1})) is the word-class-bigram probability of the word pair (w_{l-1}, w_l); and w_HMM, w_tone, w_UG and w_BG denote the weights for the four different scores. A discriminative optimization experiment (Chiang et al., 1996) was conducted to find suitable weights for integrating the four scores. Two schemes of incorporating the RNN prosodic model into the linguistic decoding are proposed. Scheme 1 modifies the baseline scheme by directly taking the outputs of the prosodic-modeling RNN as additional scores, changing the recognition score to

L'(W) = L(W) + \sum_{k=1}^{K} w_{PM} \, l_{PM}(s_k),   (2)

where

l_{PM}(s_k) = \begin{cases}
\ln o_{MW}(k)  & \text{if } s_k \text{ is a mono-syllabic word,} \\
\ln o_{BPW}(k) & \text{if } s_k \text{ is the beginning syllable of a polysyllabic word,} \\
\ln o_{IPW}(k) & \text{if } s_k \text{ is an intermediate syllable of a polysyllabic word,} \\
\ln o_{EPW}(k) & \text{if } s_k \text{ is the ending syllable of a polysyllabic word,}
\end{cases}   (3)


o_MW(k), o_BPW(k), o_IPW(k) and o_EPW(k) are, respectively, the MW, BPW, IPW and EPW outputs of the prosodic-modeling RNN at syllable s_k, and w_PM denotes the corresponding weight for this score. Suitable values of w_HMM, w_tone, w_UG, w_BG and w_PM are obtained via a joint optimization experiment (Chiang et al., 1996). Scheme 2 is an extended version of Scheme 1. In addition to using the new recognition score defined in Eq. (2), Scheme 2 also uses the word-boundary information provided by the Word-tag FSM to set path constraints that further restrict the Viterbi search. Different constraints are set for the five states M (MW), B (BPW), I (IPW), E (EPW) and U (uncertain). Generally, more restrictive searches are used for the three decisive states M, B and E, while full searches are used for the I and U states. Specifically, when a syllable segment of the input syllable lattice is classified as an M, B or E state, we restrict it to be a mono-syllabic word, the beginning syllable of a polysyllabic word, or the ending syllable of a polysyllabic word, respectively, in the Viterbi search. Otherwise, for an I or U state, we set no restrictions on the searching process. In realization, Scheme 2 modifies the Viterbi search of the baseline scheme by letting a backward searching process be activated only when the current syllable segment is in the M, E, I or U state, and be stopped at a syllable segment in the B or M state, or before a syllable segment in the M or E state. Note that the I state is treated in the same way as the U state because many MWs, BPWs and EPWs are erroneously classified as IPWs (discussed in detail in Section 4). Fig. 9 compares the backward searching processes of the baseline scheme and Scheme 2 for a simplified top-2 syllable lattice labeled with the outputs of the Word-tag FSM. The figure displays all partial paths that need to be considered for constructing candidate words via accessing the word lexical tree. As Fig. 9 shows, Scheme 2 has far fewer partial paths to consider than the baseline scheme, and is therefore more efficient in computational complexity.
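The Scheme 1 score augmentation of Eqs. (2) and (3) can be sketched as follows. This is an illustrative fragment, not the authors' code: the function name, the representation of a hypothesis as a list of word lengths, and the per-syllable dictionary of the four RNN outputs are all our own assumptions; the weight w_PM would be tuned jointly with the other weights, as described above.

```python
import math

# For each syllable of a word-sequence hypothesis, add the weighted log of
# the matching RNN output (MW / BPW / IPW / EPW, chosen by the syllable's
# position in its word) to the baseline score L(W), per Eqs. (2)-(3).
# `rnn_outputs[k]` is assumed to map "MW"/"BPW"/"IPW"/"EPW" to the RNN
# output values for syllable k.

def prosodic_score(word_lengths, rnn_outputs, w_pm=1.0):
    """Return the added term w_PM * sum_k l_PM(s_k) for one hypothesis."""
    score, k = 0.0, 0
    for n in word_lengths:            # n = number of syllables in this word
        for pos in range(n):
            if n == 1:
                tag = "MW"            # mono-syllabic word
            elif pos == 0:
                tag = "BPW"           # beginning syllable of a polysyllabic word
            elif pos == n - 1:
                tag = "EPW"           # ending syllable of a polysyllabic word
            else:
                tag = "IPW"           # intermediate syllable
            score += w_pm * math.log(rnn_outputs[k][tag])
            k += 1
    return score
```

For a three-syllable hypothesis segmented as a two-syllable word followed by a mono-syllabic word, the tags applied are BPW, EPW, MW; a competing segmentation of the same syllables draws on different RNN outputs, which is how the prosodic evidence discriminates between hypotheses.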

4. Experimental results

The effectiveness of the proposed prosodic modeling method was examined through simulation experiments on a speaker-dependent, continuous Mandarin speech recognition task using a large single-speaker database. The database contained 452 sentential utterances and 200 paragraphic utterances. The texts of the 452 sentential utterances were well-designed, phonetically balanced short sentences of fewer than 18 characters. The texts of the 200 paragraphic utterances were

Fig. 9. Comparison of the backward searching processes of the baseline scheme and Scheme 2 for a simplified top-2 syllable lattice labeled with the outputs of the Word-tag FSM. All partial paths that need to be considered for constructing candidate words via accessing the word lexical tree are shown.


news articles selected from a large news corpus to cover a variety of subjects, including business (12.5%), medicine (12.0%), social events (12.0%), sports (10.5%), literature (9.0%), computers (8.0%), food and nutrition (8.0%), etc. All utterances were produced by a single male speaker, spoken naturally at a rate of 3.5–4.5 syllables per second. The database was divided into two parts: one containing 491 utterances (28,060 syllables) was used for training, and the other containing 161 utterances (7034 syllables) was used for testing.

4.1. Results of the prosodic modeling

To test the proposed prosodic modeling method, we first trained the HMM base-syllable recognizer and used it to segment all speech utterances into syllable sequences. Pitch was then detected by the simplified inverse filter tracking (SIFT) algorithm (Markel and Gray, 1976). The SIFT algorithm first uses a low-pass filter to reduce the bandwidth of the input signal to 0–1 kHz. The low-pass-filtered signal is then down-sampled. Next, a fourth-order inverse filter, designed using the autocorrelation method of linear prediction analysis, is applied to flatten the signal spectrum. The pitch period is then detected from the inverse-filtered signal by the autocorrelation method. A second-order interpolation is applied to increase the resolution of the detected pitch period. In addition, a simple error-correction step eliminates errors with abrupt pitch jumps. Lastly, all remaining errors are corrected manually. Note that this manual correction effort is much smaller than a manual prosodic-labeling effort, because the former task is much easier. After pitch detection, features for RNN-based prosodic modeling were extracted from all segmented training utterances. Meanwhile, the texts of all training utterances were tokenized into word sequences, from which the word-boundary features used as RNN output targets were extracted. The details of these feature extraction processes are described in Section 2. The RNN prosodic model was then trained by the BPTT algorithm. The number of nodes in the hidden layer of the RNN was determined empirically and set to 30 in this study.
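The core of the SIFT-style pipeline above can be sketched in a few lines of numpy. This is a rough illustration, not the authors' or Markel and Gray's implementation: it assumes an already down-sampled voiced frame, builds the fourth-order LPC inverse filter by the autocorrelation (Levinson-Durbin) method, and picks the pitch period from the autocorrelation of the residual; the interpolation, automatic error-correction and manual-correction stages are omitted, and all names and thresholds are our own.

```python
import numpy as np

def lpc(x, order=4):
    """LPC coefficients [1, a1..a_order] via the Levinson-Durbin recursion
    on the frame's autocorrelation sequence."""
    r = np.correlate(x, x, "full")[len(x) - 1: len(x) + order]
    a = np.zeros(order + 1)
    a[0], e = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]  # reflection update
        e *= (1.0 - k * k)
    return a

def detect_pitch(frame, fs, fmin=50.0, fmax=400.0, order=4):
    """Return the pitch (Hz) of one voiced frame, or 0.0 if no strong
    periodicity is found in the inverse-filtered residual."""
    a = lpc(frame, order)
    resid = np.convolve(frame, a)[:len(frame)]   # spectrum-flattened signal
    ac = np.correlate(resid, resid, "full")[len(resid) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)      # candidate lag range
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag if ac[lag] > 0.3 * ac[0] else 0.0
```

On a synthetic pulse train with a 10-sample period at a 2 kHz sampling rate, the residual autocorrelation peaks at lag 10 and the sketch reports 200 Hz.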

Table 2 shows the classification results of the RNN-based prosodic modeling for the Word-tag output set without the FSM. An accuracy rate of 71.9% was achieved for the Word-tag set. Next, the performance of the RNN-based prosodic modeling combined with the Word-tag FSM was examined. The Word-tag FSM used thresholds of Thh = 0.6 and Thd = 0.3. In this FSM, an uncertain state was added for the cases in which the RNN did not respond well enough to make a reliable classification. Experimental results are shown in Table 3. The accuracy rate increased to 87.6%, with 32.3% of the outputs staying in uncertain states. Table 4 shows the classification performance of the Word-tag FSM for two sets of thresholds. It can be seen from the table that the classification accuracy rate increased, at the cost of putting more syllables into the U state, as a more restrictive threshold setting was used. Fig. 10 shows a typical example of the Word-tag responses of the prosodic-modeling RNN and the corresponding FSM outputs for an input sentential utterance. It can be seen from the figure that the FSM functioned well, reliably determining all B states and some I and E states. Lastly, it is noted that, although the above prosodic modeling test was conducted in a

Table 2
The confusion matrix of Word-tag classification for the RNN prosodic model. The overall classification rate is 71.9%

Desired   Result
          BPW     IPW     EPW     MW
BPW       1861    317     110     52
IPW       288     1106    341     11
EPW       135     369     1774    62
MW        160     28      107     313

Table 3
The confusion matrix of Word-tag classification for the RNN prosodic model combined with an FSM using Thh = 0.6, Thd = 0.3. The classification rate is 87.6%

Desired   Result
          BPW     IPW     EPW     MW      Uncertain
BPW       1547    147     27      14      605
IPW       141     720     118     1       766
EPW       56      163     1420    37      664
MW        97      7       67      198     239


speaker-dependent mode, it could be adapted to a speaker-independent mode by properly normalizing the input pitch and energy features by the pitch level and loudness of each individual speaker. This is worth further study in the future.
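The speaker normalization suggested above could, for example, take the form of a per-speaker z-score on the pitch and energy feature dimensions. This is only a sketch of one common choice, not a method described by the authors; the function name and the feature-matrix layout are our own assumptions.

```python
import numpy as np

# Shift and scale each feature dimension (e.g., log-F0, log-energy) by the
# speaker's own mean and standard deviation, so the prosodic model sees
# speaker-relative rather than absolute values.

def normalize_per_speaker(features):
    """features: (num_syllables, num_dims) array for ONE speaker.
    Returns the array normalized to zero mean and unit variance per
    dimension, using that speaker's own statistics."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant dimensions
    return (features - mu) / sigma
```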

4.2. An error analysis of the prosodic modeling

In this study, the linguistic features used as output targets to train the RNN prosodic model are primarily extracted from the word tokenization results obtained by a simple statistical model-based method. However, owing to the out-of-vocabulary (OOV) problem, the tokenization cannot be very accurate. Moreover, the prosodic phrasing of an utterance can be influenced by many factors other than the linguistic features of the associated text, such as word emphasis, the need for breathing, speaking rate and emotional status. These mismatches are harmful to the accuracy of the RNN prosodic modeling. A detailed error analysis is therefore needed in order to compensate for these effects and make the classification results more useful in linguistic decoding. In the following, the classification errors shown in Table 3 for the Word-tag set are investigated.

Table 4
Performance comparison of Word-tag classifications for the RNN prosodic model without FSM and with FSM using two different sets of thresholds

Word-tag   No threshold   Thh = 0.6, Thd = 0.3   Thh = 0.8, Thd = 0.6
BPW        76.1%          84.0%                  91.2%
IPW        60.8%          69.4%                  78.2%
EPW        76.8%          87.0%                  94.8%
MW         71.5%          79.2%                  87.6%

Fig. 10. An example of the Word-tag responses and the corresponding FSM outputs for a sentential utterance.


MW errors. From Table 3, an MW was easily misclassified as a BPW or an EPW. This occurred because the MW was combined with the following or preceding words to form a compound word. About 70% of the errors of misclassifying MWs as BPWs occurred when the MWs were short adverbs, prepositions, conjunctions or transitive verbs (see example sentences M-a and M-b, where the Chinese character in question and the corresponding word in the English translation are marked with an underline). This also occurred easily when the MWs were special function words, such as ‘‘ ’’ (is), ‘‘ ’’ (has) and ‘‘ ’’ (of) (see M-c and M-d). For the case of misclassifying MWs as EPWs, we found that 17%, 17%, 30% and 10% of the MWs were, respectively, prepositions, auxiliaries, pronouns and verbs (see M-e and M-f). This also occurred easily when the MWs were ‘‘ ’’ and ‘‘ ’’. In some cases, an MW was combined simultaneously with both the following and preceding words to form a word chunk, when it was a preposition or conjunction (see M-g).
M-a: (Dirty things are all wiped off.);
M-b: (Just like working in a dark room.);
M-c: (It ought to be no change at all.);
M-d: (The chord organ match is an annual international music contest.);
M-e: (Driving to west Texas.);
M-f: (Dirty air makes me feel dizzy.);
M-g: (Talking about physical agility and arm strength.).

BPW errors. Most classification errors of BPWs occurred when they were combined with the preceding words and misclassified as IPWs. This occurred especially easily when both the current and the preceding word were short (see example sentences B-a and B-b). For the cases of misclassifying BPWs as EPWs, we found that 42% of them were syllables with Tone 2 or Tone 3 that were pronounced lightly (see B-c and B-d). The Chinese characters in question and the corresponding words in the English translation are marked with an underline in the following examples. It is noted that, owing to the difficulty of finding a one-to-one correspondence between Chinese characters and English words, we choose the suitable English word by checking whether it contains the meaning of the underlined Chinese character.
B-a: (The students line up to welcome the triumphant players.);
B-b: (The climate in Taiwan is humid.);
B-c: (The chief of the labor administration section of the social department.);
B-d: (The final of the national contest.).

IPW errors. Most IPW errors occurred when long words were prosodically broken into pairs of short words and misclassified as BPWs and EPWs (see I-a and I-b). Some other errors were owing to concatenation with low-energy syllables carrying Tone 2 or Tone 3 (see I-c and I-d). Some interesting cases occurred when unfamiliar words, such as long transliterated foreign names and infrequently used characters, were pronounced: the speaker seemed to hesitate in uttering such characters and inserted breaks around them (see I-e and I-f).
I-a: (The company of flower production and distribution in Taipei city.);
I-b: (The sightseeing leisure area and lakeside vacation area.);
I-c: (The association of motor sports in the Mainland.);
I-d: (Sixteen million US dollars.);
I-e: (He looks around fearfully.);
I-f: (The works of Euripides will be performed on the stage.).

EPW errors. Most errors were owing to connections with short succeeding words, turning EPWs into IPWs (see E-a and E-b). Moreover, over 95% of the errors of misclassifying EPWs as MWs occurred at the ending syllables of compound words formed by combining words with special function words like ‘‘ ’’ (of), ‘‘ ’’ (a unit) and ‘‘ ’’ (at) (see E-c and E-d). Some other errors resulted from interference by preceding low-energy syllables with Tone 2 or Tone 3 (see E-e and E-f).
E-a: (The degree of senior high school and above.);


E-b: (To manage a used motorcycle shop.);
E-c: (The future job opportunity.);
E-d: (The Chinese folk fair is held regularly during the Lantern Festival.);
E-e: (Cannot contend with big and tall players.);
E-f: (The man married an older woman.).

Generally speaking, most word-boundary detection errors are due to the tendency to group function words and some MWs with adjacent words. This phenomenon was noted by Campbell (1993) in a study of detecting prosodic boundaries in British English, which showed that a function word or an MW can be merged either with the preceding word, in accordance with rhythmic principles, or with the following word, in accordance with syntactic principles, to form a word chunk pronounced within a prosodic phrase. Beyond this phenomenon, we also found in the current study that, for Mandarin speech, the merging may occur on both sides simultaneously when the two adjacent words are both short. Some other word-boundary detection errors are owing to false alarms occurring in the neighborhoods of characters with Tone 2 or Tone 3. This mainly results from the ambiguity in deciding whether an abrupt F0 jump is owing to the F0 resetting that occurs at a prosodic boundary or to the characteristics of these two lexical tones.

4.3. Results of the speech-to-text conversion

The same database used in the prosodic modeling was used to test the speech-to-text conversion system incorporating the proposed RNN prosodic model. A sub-syllable-based HMM recognizer was constructed from the training set by the maximum likelihood training algorithm. Each testing utterance was first processed by the HMM base-syllable recognizer to generate a top-N base-syllable lattice. The top-1 base-syllable recognition rate was 81.4%, with substitution, insertion and deletion error rates of 16.5%, 1.1% and 1.0%, respectively. The base-syllable inclusion rate was 96.5% for the top-10 base-syllable lattice. Then, features for both prosodic modeling and tone recognition were extracted based on the syllable-boundary segmentation given by the top-1 base-syllable recognition. The top-2 tone inclusion rate for the case of using hand-corrected pitch contours was 98.0%. After performing prosodic modeling and tone recognition, we combined their results with the base-syllable recognition results in the linguistic decoder to generate the best recognized character (word) string.

The two schemes, Schemes 1 and 2, of incorporating the RNN prosodic model into the linguistic decoding were then examined. Experimental results are displayed in Table 5. The character accuracy rate was calculated according to the following formula:

\text{character accuracy rate} = 1 - \frac{\text{insertion} + \text{deletion} + \text{substitution}}{\text{total character number}}.   (4)
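Given the three error counts from a standard alignment against the reference transcription, Eq. (4) is straightforward to compute; the function name below is our own.

```python
# Character accuracy rate as defined in Eq. (4): one minus the fraction of
# reference characters accounted for by insertion, deletion and
# substitution errors.

def character_accuracy(insertions, deletions, substitutions, total_chars):
    """Return the character accuracy rate of Eq. (4)."""
    return 1.0 - (insertions + deletions + substitutions) / total_chars
```

For instance, with the baseline system's reported error rates (0.5% insertions, 1.1% deletions, 24.8% substitutions), the formula yields the 73.6% accuracy shown in Table 5.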

The search complexity counted the number of accesses to the lexical tree for constructing candidate words to be disambiguated in the linguistic decoding search. It can be seen from Table 5 that the character accuracy rate achieved by the baseline scheme, without the prosodic model, was 73.6%. Detailed analyses revealed that the

Table 5
The performance comparison for speech-to-text conversion using three different linguistic decoding schemes

Method       Use of Word-tag RNN outputs   Word-tag FSM            Character accuracy   Complexity reduction
Baseline     No                            –                       73.6%                –
Scheme 1     Yes                           –                       74.6%                –
Scheme 2-1   Yes                           Thh = 0.6, Thd = 0.3    73.9%                30.5%
Scheme 2-2   Yes                           Thh = 0.8, Thd = 0.6    74.7%                17%


substitution, insertion and deletion error rates were 24.8%, 0.5% and 1.1%, respectively. The character accuracy rate improved from 73.6% to 74.6% when the RNN prosodic model was incorporated into the linguistic decoding by Scheme 1. The improvement resulted from a decrease in the substitution error rate of about 1%, with almost no change in the insertion and deletion error rates. Accuracy rates of 73.9% and 74.7%, with search complexity reductions of 30.5% and 17%, were achieved for the two cases of Scheme 2. This shows that the threshold setting of the Word-tag FSM in Scheme 2 trades off improvement in recognition performance against reduction in computational complexity. Comparing these experimental results with those obtained in two related studies that used prosodic information to assist speech recognition (Hsieh et al., 1996; Kompe et al., 1997), the proposed method is better because it improves not only the computational complexity but also the recognition performance; both previous methods improved computational complexity at the cost of minor losses in recognition performance. Compared with a method using prosodic modeling to improve mora recognition for Japanese speech (Iwano and Hirose, 1999), the proposed method is slightly less efficient in performance improvement. Based on the above discussion, we conclude that the proposed method of using prosodic information in Mandarin speech recognition is a promising one.

4.4. An error analysis of the speech-to-text conversion

To better understand the effect of the RNN-based prosodic modeling on the linguistic decoding, we made a detailed error analysis for Scheme 2-2. Table 6 lists six main types of recognition errors. The first three error types, E1–E3, resulted from three types of acoustic processing errors: (1) base-syllable recognition error, where the correct base-syllable is not included in the top-10 base-syllable candidates; (2) tone recognition error, where the correct tone is not included in the top-2 tone candidates; and (3) syllable-boundary segmentation error, owing to insertion and deletion errors in the top-1 base-syllable recognition. The HMM base-syllable recognizer was responsible for the E1- and E3-type errors; improving its performance would surely be very helpful in correcting them. The E2-type errors were caused by the RNN tone recognizer and may be corrected by including more tone candidates, but this would increase the computational complexity of the linguistic decoding. The error types E4 and E5 were owing to the incompleteness of the word-merging rules applied to construct compound words; a more sophisticated word-construction algorithm would surely help correct them. The error type E6 is due to the out-of-vocabulary problem; increasing the size of the lexicon may be useful in alleviating it. Since correct words could not be constructed to form a correct word-sequence hypothesis when errors of these six types occurred, the word-boundary information provided by the RNN prosodic model is of no use in correcting those errors in the linguistic decoding. If we do not count those uncorrectable errors, the improvement in the character accuracy rate by the proposed method is significant.

Lastly, many errors were owing to homonymic ambiguity. In Mandarin speech-to-text conversion, homonymic ambiguity is a serious problem for recognizing mono-syllabic words, because each syllable maps, on average, to over 10 characters. A more sophisticated language model would help solve this problem. If we neglect all homonym errors, the character accuracy rate increases to 82.5%.

Table 6
An error analysis of the linguistic decoding

Error type                                                        Percentage
E1: Base-syllable missing in the base-syllable lattice            11.1%
E2: Tone recognition error                                        21.6%
E3: Speech segmentation error                                     9.3%
E4: Compound word error owing to no morphological rules applied   7.0%
E5: Quantity word error owing to no morphological rules applied   4.8%
E6: OOV of the lexicon                                            8.0%
E7: Others                                                        38.2%


5. Conclusions

A new RNN-based prosodic modeling method for Mandarin speech recognition has been discussed in this paper. It uses an RNN to detect word-boundary information from the input prosodic features, with base-syllable boundaries pre-determined by an HMM-based acoustic decoder. Two schemes of using the word-boundary information to assist the linguistic decoder in resolving word-boundary ambiguity, as well as in pruning unlikely paths, were proposed. Experimental results on a speaker-dependent speech-to-text conversion test have confirmed that the RNN prosodic model is effective in detecting useful word-boundary information from the input test utterance for assisting in the linguistic decoding. An accuracy rate of 71.9% was obtained for Word-tag detection. The character accuracy rate of speech-to-text conversion increased from 73.6% to 74.7%, with an additional gain of 17% reduction in the computational complexity of the linguistic decoding search. It is therefore a promising prosodic modeling method for Mandarin speech recognition.

Some further studies to improve the proposed RNN-based prosodic modeling method are worth pursuing in the future. One is to improve the efficiency of the RNN prosodic model by compensating for the effects of other factors, such as the tone and phonemic constituents of a syllable, on the input prosodic features. Specifically, both the pitch and energy levels of a syllable are seriously affected by its tone, and the energy level of a syllable is also seriously affected by its final type. If we could isolate these factors from the prosodic modeling, we might achieve a better prosodic phrasing result. Another worthwhile study is to improve the effectiveness of the RNN prosodic model by providing more accurate prosodic phrasing information for the training utterances to help prepare the output targets. There are many cases in Mandarin speech in which two or more words are combined to form a word chunk and pronounced within a prosodic phrase. Experimental results have confirmed that this effect degrades the performance of word-boundary detection. If we could extend the current study to additionally include word-chunk boundary information when setting the output targets, we might obtain a more precise RNN prosodic model. A third direction worth further study is to find new ways of incorporating the RNN prosodic model into speech recognition. In the current study, the RNN prosodic model is combined with the linguistic decoder to provide additional scores for discriminating word boundaries from non-word-boundaries (Scheme 1) and to set path constraints restricting the linguistic decoding search (Scheme 2). Other ways of using prosodic modeling information in either linguistic decoding or acoustic decoding are worth further exploring.

Acknowledgements

The database was provided by Chunghwa Telecommunication Laboratories and the basic lexicon was supported by Academia Sinica of Taiwan.

W.-J. Wang et al. / Speech Communication 36 (2002) 247–265