
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 1, JANUARY 2001

Concatenative Synthesis Based on a Harmonic Model

Darragh O'Brien and A. I. C. Monaghan

Abstract—One of the currently most successful approaches to synthesizing speech, concatenative synthesis, combines recorded speech units to build full utterances. However, the prosody of the stored units is often not consistent with that of the target utterance and must be altered. Furthermore, several types of mismatch can occur at unit boundaries and must be smoothed. Thus, both pitch and time-scale modification techniques as well as smoothing algorithms play a crucial role in such concatenation based systems.

In this paper, we describe novel approaches to each of these issues. First, we present a conceptually simple technique for pitch and time-scale modification of speech. Our method is based upon a harmonic coding of each speech frame, and operates entirely within the original sinusoidal model [1]. Crucially, it makes no use of "pitch pulse onset times." Instead, phase coherence, and thus shape invariance, is ensured by exploiting the harmonic relation existing between the sine waves used to code each analysis frame so that their phases at each synthesis frame boundary are consistent with those derived during analysis. Secondly, a smoothing algorithm, aimed specifically at correcting phase mismatches at unit boundaries, is described. Results are presented showing our prosodic modification techniques to be highly suitable for use within a concatenative speech synthesizer.

Index Terms—Prosodic modification, sinusoidal modeling, smoothing, speech synthesis.

I. INTRODUCTION

In concatenative synthesis recorded speech units are conjoined to create synthetic utterances. The prosodic characteristics of the stored units, i.e., their pitch, duration and intensity, may often be at odds with the target prosodic contour, since natural utterances are infinitely variable but databases are always finite. Furthermore, as the units are extracted from disjoint phonetic contexts, discontinuities in spectral shape as well as phase mismatches can occur at unit boundaries. Techniques for prosodically modifying and smoothing acoustic units are thus an integral part of most concatenative synthesizers, and the output speech quality is critically dependent on the performance of these algorithms.

This paper presents a novel and conceptually simple approach to pitch and time-scale modification of speech. Previously, pitch pulse onset times have played a crucial role in sinusoidal model based speech transformation techniques [2], [3]. At each onset time, all waves are assumed to be in phase, i.e., the phase of each is assumed to be some integer multiple of $2\pi$. Onset time estimates thus provide a means of maintaining waveform shape and phase coherence in the modified speech. However, accurate onset time estimation is a difficult problem [4] and any errors result in poor speech quality.

Manuscript received June 26, 2000; revised September 18, 2000. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Michael W. Macon.

D. O’Brien is with Sun Microsystems, Inc., Dublin 3, Ireland (e-mail: [email protected]).

A. I. C. Monaghan is with Aculab plc, Milton Keynes MK1 1PT, U.K. (e-mail: [email protected]).

Publisher Item Identifier S 1063-6676(01)00330-3.


In our approach, both voiced and partially voiced frames are coded as a set of harmonics, while spectral peaks are used to code voiceless frames. Pitch pulse onset times are not used to maintain phase coherence. Instead, post-modification waveform shape is preserved by exploiting the harmonic relationship existing between the sinusoids such that their phases at synthesis frame boundaries are kept consistent with those measured during analysis. Furthermore, our modification algorithms do not rely on time-domain techniques such as PSOLA [5] and therefore, unlike in HNM [6], [7], analysis need not be pitch synchronous. The duplication/deletion of frames during scaling is avoided as a single pitch and time-scale modification factor is assigned to each frame. Lastly, time-scale expansion of voiceless regions is handled not through the use of a hybrid model but by increasing the variation in frequency of "noisy" sinusoids, thereby smoothing the spectrum and alleviating the problem of tonal artifacts. Importantly, our approach allows a straightforward combination of pitch and time-scale modification.

We also present a smoothing algorithm which pays particular attention to the problem of phase mismatch correction. This is a direct extension of our prosodic modification techniques. Again, ours, unlike other approaches such as that detailed in [8], is not reliant on pitch synchronous analysis.

The rest of this paper is organized as follows. Section II reviews McAulay and Quatieri's original sinusoidal model. Section III briefly outlines the analysis stage in our model, which is similar to that used in the original model except that we use harmonics rather than spectral peaks to code voiced and partially voiced frames. Section IV shows how, using this new approach, both voiced and voiceless speech can be time-scaled. The algorithm is extended to handle pitch modification in Section V. Joint pitch and time-scale modifications are described in Section VI. Section VII presents results of an evaluation experiment in which speech prosodically transformed using our approach was compared to speech similarly transformed by two other sinusoidal coders. (A web address where readers can access all of the stimuli used in the experiment is also provided in this section.) Section VIII presents our smoothing algorithm, and the paper concludes with a discussion of possible future improvements to the model in Section IX.

II. SINUSOIDAL MODELING

This section briefly reviews the sinusoidal model. In McAulay and Quatieri's original formulation [1], peaks extracted from the FFT of one speech frame are matched with those of the next frame by means of a nearest neighbor algorithm.


Let $A_k^l$, $\omega_k^l$ and $\theta_k^l$ be the instantaneous amplitude, frequency and phase of the $k$th sinusoid at the center of frame $l$, and $A_k^{l+1}$, $\omega_k^{l+1}$ and $\theta_k^{l+1}$ the corresponding values at the center of frame $l+1$. Amplitude is interpolated linearly using (1), where $T$ is the time interval from the center of frame $l$ to the center of frame $l+1$:

$$\hat{A}_k(t) = A_k^l + \frac{A_k^{l+1} - A_k^l}{T}\,t \qquad (1)$$

A cubic polynomial (2) is introduced to model phase interpolation. Given that instantaneous frequency is defined as the derivative of phase, the phase and frequency of each sine wave at any time are given by (2) and (3), respectively (the sinusoid index $k$ is dropped for clarity):

$$\tilde{\theta}(t) = \zeta + \gamma t + \alpha t^2 + \beta t^3 \qquad (2)$$

$$\tilde{\omega}(t) = \gamma + 2\alpha t + 3\beta t^2 \qquad (3)$$

Substituting the known boundary values when $t = 0$, obtained from the FFT analysis, into (2) and (3) gives

$$\zeta = \theta^l, \qquad \gamma = \omega^l \qquad (4)$$

Similarly, substituting the known boundary values when $t = T$ gives

$$\theta^l + \omega^l T + \alpha T^2 + \beta T^3 = \theta^{l+1} + 2\pi M \qquad (5)$$

$$\omega^l + 2\alpha T + 3\beta T^2 = \omega^{l+1} \qquad (6)$$

The target phase $\theta^{l+1}$ is measured modulo $2\pi$, so phase unwrapping must be carried out and the term $2\pi M$ is added to (5), where $M$ is an integer. We now solve for the three unknowns $\alpha$, $\beta$ and $M$. For any $M$, $\alpha(M)$ and $\beta(M)$ can be calculated using (from [1])

$$\begin{bmatrix} \alpha(M) \\ \beta(M) \end{bmatrix} = \begin{bmatrix} 3/T^2 & -1/T \\ -2/T^3 & 1/T^2 \end{bmatrix} \begin{bmatrix} \theta^{l+1} - \theta^l - \omega^l T + 2\pi M \\ \omega^{l+1} - \omega^l \end{bmatrix}$$

The value of $M$ is chosen such that the smoothest frequency track is obtained [1]. This is achieved by minimizing $f(x)$ in (7) with respect to the continuous variable $x$:

$$f(x) = \int_0^T \left[\ddot{\tilde{\theta}}(t; x)\right]^2 dt \qquad (7)$$

The minimizing value of $x$ can be shown to be that given by (8). Rounding to the closest integer gives $M^*$, as shown in (9), where $\lfloor \cdot \rceil$ denotes the "nearest integer" operator:

$$x^* = \frac{1}{2\pi}\left[\left(\theta^l + \omega^l T - \theta^{l+1}\right) + \left(\omega^{l+1} - \omega^l\right)\frac{T}{2}\right] \qquad (8)$$

$$M^* = \lfloor x^* \rceil \qquad (9)$$

Once $M^*$ has been determined the model is completed by calculating $\alpha(M^*)$ and $\beta(M^*)$. Speech is then resynthesized from (10), where $K^l$ is the number of waves in frame $l$:

$$\hat{s}(t) = \sum_{k=1}^{K^l} \hat{A}_k(t) \cos\bigl(\tilde{\theta}_k(t)\bigr) \qquad (10)$$
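As an illustration of the frame interpolation and synthesis just reviewed, the following Python sketch implements the standard formulation of [1]. Variable names, the rad/s frequency convention and the omission of the nearest neighbor matching step are our own simplifications, not details taken from the original paper.

```python
import numpy as np

def cubic_phase_coeffs(theta0, omega0, theta1, omega1, T):
    """Return (zeta, gamma, alpha, beta) of the cubic phase polynomial
    theta(t) = zeta + gamma*t + alpha*t**2 + beta*t**3 on [0, T]."""
    # Unwrapping integer M giving the smoothest frequency track, (8)-(9).
    x_star = ((theta0 + omega0 * T - theta1) + (omega1 - omega0) * T / 2) / (2 * np.pi)
    M = np.round(x_star)
    # Solve the 2x2 system (5)-(6) for alpha and beta.
    b1 = theta1 + 2 * np.pi * M - theta0 - omega0 * T
    b2 = omega1 - omega0
    alpha = 3 * b1 / T**2 - b2 / T
    beta = -2 * b1 / T**3 + b2 / T**2
    return theta0, omega0, alpha, beta

def synthesize_frame(amps0, amps1, freqs0, freqs1, phases0, phases1, T, fs):
    """Synthesize one frame (T seconds) from matched sinusoid parameters
    at the two frame centers; frequencies are in rad/s."""
    t = np.arange(int(round(T * fs))) / fs
    frame = np.zeros_like(t)
    for A0, A1, w0, w1, th0, th1 in zip(amps0, amps1, freqs0, freqs1, phases0, phases1):
        zeta, gamma, alpha, beta = cubic_phase_coeffs(th0, w0, th1, w1, T)
        amp = A0 + (A1 - A0) * t / T                          # linear amplitude track (1)
        phase = zeta + gamma * t + alpha * t**2 + beta * t**3  # cubic phase track (2)
        frame += amp * np.cos(phase)                          # sum of sinusoids (10)
    return frame
```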

III. ANALYSIS

Analysis in our model is slightly different to that outlined above. F0 analysis is carried out on the speech signal using Entropic's pitch detection software,1 which is based on an algorithm presented in [9]. The resulting pitch contour, after smoothing, is used to assign an F0 estimate to each frame (zero if voiceless).

Over voiced (and partially voiced) regions the length of each frame is set at three times the local pitch period; 20 ms frames are used in voiceless regions. A constant frame interval of 10 ms is used throughout the analysis. A Hanning window is applied to each frame and its FFT calculated. For voiced frames the amplitudes and phases of sinusoids at harmonic frequencies are coded. Voiceless frames are coded using a simple peak-picking algorithm.
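The harmonic coding of a single voiced frame can be sketched as follows. This is only an illustration of the procedure described above (window, FFT, sample the spectrum at harmonic frequencies); the names are illustrative, and amplitude scaling as well as the voiceless peak-picking branch are omitted.

```python
import numpy as np

def code_voiced_frame(frame, f0, fs):
    """Pick amplitude and phase at each harmonic of f0 from a windowed FFT."""
    n = len(frame)
    spectrum = np.fft.rfft(frame * np.hanning(n))
    freqs, amps, phases = [], [], []
    k = 1
    while k * f0 < fs / 2:                      # code harmonics up to Nyquist
        b = int(round(k * f0 * n / fs))         # FFT bin nearest the kth harmonic
        amps.append(np.abs(spectrum[b]))        # amplitude scaling omitted for brevity
        phases.append(np.angle(spectrum[b]))
        freqs.append(2 * np.pi * k * f0)        # store frequency in rad/s
        k += 1
    return np.array(freqs), np.array(amps), np.array(phases)
```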

Clearly, small errors in F0 estimation at this stage will cause the locations of our higher order harmonics to differ significantly from those of the signal's actual harmonics. However, for our purposes it is sufficient that the coded harmonics define a waveform whose period is reasonably close to the true pitch period. Higher order harmonics serve merely to provide spectral definition.

IV. TIME-SCALE MODIFICATION

This section presents our method for time-scale modification. Because of the differences in the transformation techniques employed, time-scaling of voiced and voiceless speech is treated separately.

A. Voiced Speech

If their frequencies are kept constant, the phases of the harmonics used to code each voiced frame repeat periodically every $2\pi/\omega_0$ s, where $\omega_0$ is the fundamental frequency expressed in rad/s. Each parameter set (i.e., the amplitudes, phases and frequencies at the center of each analysis frame) can therefore be seen as defining a periodic waveform. For any phase adjustment factor $\phi$, a new set of "valid" phases (where valid means being linearly related to the phases measured during analysis) can be calculated from

$$\hat{\theta}_k = \theta_k + k\phi \qquad (11)$$

where $\hat{\theta}_k$ is the new phase and $\theta_k$ is the original phase of the $k$th sinusoid, with frequency $k\omega_0$. After time-scale modification, the harmonics' phases should occupy a valid state at each synthesis frame interval, i.e., their new and original phases should be related by (11). Thus, the task during time-scaling is to estimate the factor $\phi$ for each frame, from which a new set of phases at each synthesis frame interval can be calculated. Equipped with phase information consistent with the new time-scale, synthesis is straightforward (see Section II). A procedure for estimating $\phi$ is presented as follows.

After nearest neighbor matching (over voiced frames this simplifies to matching corresponding harmonics), the frequency track connecting the fundamental of one frame with that of the next is computed as in Section II and may be written as

$$\tilde{\omega}_1(t) = \gamma_1 + 2\alpha_1 t + 3\beta_1 t^2 \qquad (12)$$

where the subscript 1 denotes the fundamental.

1 get_f0, Copyright Entropic Research Laboratory, Inc., 5/24/93.


TABLE I. TIME-SCALING ALGORITHM

Time-scaling (12) is also straightforward. For a given time-scaling factor, a new target phase for the fundamental must be determined. Let the new time-scaled frequency function be

(13)

The new target phase is found by integrating (13) over the new time interval (where T is the analysis frame interval) and adding back the start phase:

(14)

By evaluating (14) modulo $2\pi$, the new phase of the fundamental, $\hat{\theta}_1$, is determined. Scaling is completed by solving for $M$, $\alpha$ and $\beta$, again as outlined in Section II.

Applying the same procedure to each remaining matched pair of harmonics would, however, lead to a breakdown in phase coherence after several frames as the waves gradually move out of phase. To overcome this, and to keep the waves in phase, the phase adjustment factor $\phi$ is calculated from (11) as

$$\phi = \hat{\theta}_1 - \theta_1 \qquad (15)$$

where $\phi$ simply represents the linear phase shift from the fundamental's old phase to its new target phase value. Once $\phi$ has been determined, all new target phases $\hat{\theta}_k$ are calculated from (11). Cubic phase interpolation functions may then be calculated for each sinusoid using the method outlined in Section II. Resynthesis of time-scaled speech is carried out according to

(16)

It is necessary to keep track of previous phase adjustments when moving from one frame to the next. This is handled by a cumulative correction term (see Table I) which must be added, along with $\phi$, to the target phases to compensate for phase adjustments in previous frames. The complete time-scaling algorithm is presented in Table I.
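For concreteness, the per-frame phase bookkeeping can be sketched roughly as below. This is a simplified reading of the algorithm rather than a reproduction of Table I: a linearly interpolated fundamental track stands in for the cubic track of (13) and (14), the running adjustment is carried implicitly in the already-adjusted start phases, and all names are illustrative.

```python
import numpy as np

def time_scale_target_phases(theta_start, theta_meas, w0_start, w0_end, T, alpha):
    """Return adjusted target phases for the harmonics of one time-scaled voiced frame.

    theta_start -- phases of harmonics 1..K at the frame start, already adjusted
                   for previous frames (this plays the role of the running
                   correction kept in Table I)
    theta_meas  -- target phases of harmonics 1..K as measured during analysis
    w0_start, w0_end -- fundamental frequency at the frame boundaries (rad/s)
    T           -- analysis frame interval (s); alpha -- time-scale factor
    """
    k = np.arange(1, len(theta_meas) + 1)
    # Advance the fundamental over the scaled interval alpha*T.  A linear
    # frequency track is used here for brevity; the paper integrates the
    # cubic-phase track of Section II ((13) and (14)).
    theta1_new = np.mod(theta_start[0] + 0.5 * (w0_start + w0_end) * alpha * T,
                        2 * np.pi)
    # Linear phase shift from the fundamental's measured target phase to its
    # new target phase, as in (15).
    phi = theta1_new - theta_meas[0]
    # Shift every harmonic consistently with the harmonic relation (11).
    return np.mod(theta_meas + k * phi, 2 * np.pi)
```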

In summary, each frame is assigned a time-scale modification factor from which a new set of target phases is derived. Synthesis is then carried out as in the original sinusoidal model. Some example waveforms, taken from speech time-scaled using our method, are given in Figs. 1–3. The shape of the original is well preserved in the modified speech.

Fig. 1. Original speech (unmodified).

Fig. 2. Time-scaled speech (time-scale factor 0.6).

Fig. 3. Time-scaled speech (time-scale factor 1.3).


We tested our approach using values of the time-scale factor ranging from 0.5 to 2 and found the resulting modified speech to be of high quality. Readers can access speech which has been time-scaled using this method at the website given in Section VII.

B. Voiceless Speech

An attempt was made in our previous work [10] to minimize the difference between the original and time-scaled frequency tracks. Such an approach, we thought, would help to preserve the random nature of frequency tracks in voiceless regions, thus avoiding the need for phase and frequency dithering or hybrid modeling and providing a unified treatment of voiced and voiceless speech during time-scale modification. Using this approach, as opposed to computing the smoothest frequency track, meant that slightly larger scaling factors could be accommodated before tonal artifacts were introduced. The improvement, however, was deemed insufficient to outweigh the extra computational cost incurred.

For these reasons, we implemented frequency dithering techniques to be applied over voiceless speech during time-scale expansion. Initially, two simple methods of increasing randomness in voiceless regions were incorporated into the model:

• upon birth or death of a sinusoid in a voiceless frame a random start or target phase is assigned;

• upon birth or death of a sinusoid in a voiceless frame a random start or target frequency is assigned.

These simple procedures can be combined if necessary with shorter analysis frame intervals to handle most time-scale expansion requirements. In a concatenative synthesizer, scaling factors are likely to be relatively small since the database of stored units should provide an approximate match in most cases. However, for larger time-scale expansion factors these measures may not be adequate to prevent spurious tonality. Our solution in such cases is to split "noisy" sinusoids into two separate components, each following a different frequency track. The spectrum is thus smoothed and perceptual randomness preserved. This is illustrated in Fig. 4.

Because each frequency track is modeled with a parabola, it is necessarily constrained to lie either above or below the line connecting the start and target frequencies. In order to increase (in this case, double) the variation in frequency of each sinusoid, it is simply reflected through the line to give an auxiliary track. Amplitude interpolation must be adapted to take the existence of this new track into account. During time-scale expansion each phase function can be written as

(17)

The variation in frequency, var, of the corresponding track (i.e., its maximum distance from the line connecting the start and target frequencies) can be shown to be that given by

(18)

Using this equation, the parameters of each track can be chosen such that it has the desired frequency behavior. In our experiments, var was set to 100 Hz, thus ensuring that after reflection the combined frequency variation for both sinusoids was at least 200 Hz.

Fig. 4. Bandwidth expansion in voiceless speech.

The phase interpolating function of the auxiliary frequency track, obtained by reflecting the original through the line, can be given by

(19)

where

(20)

Each sinusoid is effectively split in two and both parts are synthesized separately. Amplitude is interpolated linearly as depicted in Fig. 4. Using this approach, we found the tonal quality associated with time-scale expanded voiceless speech to be eliminated even for the largest scaling factors tested. Again, readers can access examples of speech which has been time-scaled using this method at the website given in Section VII.
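The splitting of a "noisy" sinusoid can be sketched as follows. The reflection of the frequency parabola through the chord joining the start and target frequencies follows the description above; the particular integration scheme and the variable names are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def split_noisy_sinusoid(zeta, gamma, a, b, w_start, w_end, T, fs):
    """Split one voiceless sinusoid into an original and a reflected track.

    The original phase track is theta(t) = zeta + gamma*t + a*t**2 + b*t**3,
    so its frequency track is the parabola w(t) = gamma + 2*a*t + 3*b*t**2.
    The auxiliary frequency track is that parabola reflected through the
    straight line joining w_start and w_end, which doubles the frequency
    variation around that line.
    """
    t = np.arange(int(round(T * fs))) / fs
    w_orig = gamma + 2 * a * t + 3 * b * t**2
    w_line = w_start + (w_end - w_start) * t / T      # chord between the endpoints
    w_aux = 2 * w_line - w_orig                       # mirror image of the parabola
    theta_orig = zeta + gamma * t + a * t**2 + b * t**3
    theta_aux = zeta + np.cumsum(w_aux) / fs          # integrate the auxiliary track
    return theta_orig, theta_aux
```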

V. PITCH MODIFICATION

Our time-scale modification algorithm can easily be converted to implement frequency-scaling. To frequency-scale by a given factor, simply time-scale by a corresponding factor as outlined in the previous section and resynthesize using

(21)

Original waveform shape is successfully retained but the resulting speech is distorted as formant locations have been altered.

In order to extend our approach to time-scale modification to perform pitch modification, it is necessary to separate the vocal tract and excitation contributions to the speech production process. Here, an LPC-based inverse filtering technique, iterative adaptive inverse filtering (IAIF) [11], is applied to the speech signal to yield a glottal excitation estimate, which is then sinusoidally coded using the approach in Section III.



Assuming sinusoidal analysis has been carried out on the glottal wave estimate, the frequency track connecting the fundamental of one frame with that of the next is given by

(22)

Pitch-scaling (22) is quite simple. Pitch modification factors are associated with the two frames; interpolating linearly, the modification factor across the frame is given by

(23)

where T is the analysis frame interval. The pitch-scaled fundamental can then be written as

(24)

The new (unwrapped) target phase is found by integrating (24) over the frame interval and adding back the start phase

(25)

Evaluating (25) modulo $2\pi$ gives the new target phase of the fundamental, from which the phase adjustment factor can be calculated and a new set of target phases derived, as in Section IV.

Each start and target frequency is scaled by the corresponding pitch modification factor. Composite amplitude values are calculated by multiplying excitation amplitude values by the LPC system magnitude response at each of the scaled frequencies. (Note that the excitation magnitude spectrum is not re-sampled, but frequency-scaled.) Composite phase values are calculated by adding the new excitation phase values to the LPC system phase response measured at each scaled frequency. Resynthesis of the pitch-scaled speech may then be carried out as in Section II by computing a phase interpolation function for each sinusoid and substituting into (26)

(26)
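A sketch of the recombination step is given below, assuming the vocal tract is represented as an all-pole LPC filter evaluated at the scaled harmonic frequencies. The helper names, the pitch-factor variable `beta`, the use of scipy.signal.freqz and the absence of any gain normalization are assumptions rather than details taken from the paper.

```python
import numpy as np
from scipy.signal import freqz

def composite_params(exc_amps, exc_phases, harm_freqs, beta, lpc_a, fs):
    """Recombine pitch-scaled excitation harmonics with the LPC envelope.

    exc_amps, exc_phases -- excitation amplitudes/phases at the harmonics
    harm_freqs           -- original harmonic frequencies (Hz)
    beta                 -- pitch modification factor at this frame boundary
    lpc_a                -- LPC denominator coefficients (vocal tract filter 1/A(z))
    """
    scaled_freqs = beta * np.asarray(harm_freqs)       # frequency-scale the excitation
    w = 2 * np.pi * scaled_freqs / fs                  # digital frequencies (rad/sample)
    _, H = freqz(b=[1.0], a=lpc_a, worN=w)             # vocal tract response at scaled freqs
    amps = np.asarray(exc_amps) * np.abs(H)            # composite amplitudes
    phases = np.asarray(exc_phases) + np.angle(H)      # composite phases
    return scaled_freqs, amps, phases
```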

Except for the way the new target phase of the fundamental is calculated, pitch modification is similar to the time-scaling algorithm in Table I. The pitch-scaling algorithm is given in Table II. This approach is somewhat different to that which we presented in [12], where pitch-scaling was, in effect, converted to a time-scaling problem.

In the evaluation presented below, a number of speech samples were pitch modified using the method described and results were found to be of high quality for values of the pitch modification factor ranging from 0.5 to 3 (examples can be accessed at the website given in Section VII). Some example waveforms, taken from pitch-scaled speech, are given in Figs. 5–7. Again, it should be noted that the original waveform shape has been generally well preserved.

TABLE II. PITCH-SCALING ALGORITHM

VI. JOINT PITCH AND TIME-SCALE MODIFICATION

The algorithms for pitch and time-scale modification presented can be readily combined to perform simultaneous pitch and time-scale modification. The frequency track linking the fundamental of one frame with that of the next can again be written as

(27)

The pitch-scaled and time-scaled track, where the time-scaling factor is that associated with the frame and the pitch modification factors are those associated with its two boundary frames, is given by

(28)

where the linearly interpolated pitch modification factor is that given in (23). Integrating (28) over the scaled frame interval and adding back the start phase gives

(29)

Evaluating (29) modulo $2\pi$ gives the new target phase of the fundamental, from which the phase adjustment factor can be calculated and a new set of target phases derived. Using the scaled harmonic frequencies and new composite amplitudes and phases, synthesis (again as outlined in Section II) is carried out to produce speech that is both pitch-scaled and time-scaled.


Fig. 5. Original speech (pitch factor 1).

Fig. 6. Pitch-scaled speech (pitch factor 0.7).

Fig. 7. Pitch-scaled speech (pitch factor 1.6).

Some example waveforms showing speech (from Fig. 5) which has been simultaneously pitch-scaled and time-scaled using this method are given in Figs. 8 and 9. In these examples the same pitch and time-scaling factors have been assigned to each frame, although this need not be the case as the two factors are mutually independent. As with the previous examples, waveform shape has been well preserved. Testing our approach by jointly varying the time-scale factor between 0.5 and 2.0 and the pitch factor between 0.5 and 3.0, we achieved high quality results.

Fig. 8. Pitch- and time-scaled speech (pitch factor 0.7, time-scale factor 0.7).

Fig. 9. Pitch- and time-scaled speech (pitch factor 1.6, time-scale factor 1.6).

Speech samples are available at http://www.icp.grenet.fr/cost258/evaluation/server/interactif.html.

VII. EVALUATION

The time-scale and pitch modification algorithms described in Sections IV and V were tested against other models in a prosodic transplantation task. The COST-258 coder evaluation server2 provides a set of speech samples with neutral prosody, and for each sample there is a set of associated target prosodic contours. The target contours are taken from natural utterances by the same speaker, who deliberately altered his prosody. Speech samples to be modified include vowels, fricatives (both voiced and voiceless) and continuous speech.

2 At http://www.icp.grenet.fr/cost258/evaluation/server/interactif.html, interested readers can find examples of speech which has been both pitch and time-scale modified using the algorithms presented in this paper. In our evaluation SHMDCU (our model) Version 1 was compared to PSSVGO Version 0 and HNMICP Version 0 using samples taken from the AT and EM corpora. (Later versions of these systems, which are also based on a sinusoidal model, were not available at the time the evaluation was carried out.) Examples of our model's performance in scaling vowels and voiced/voiceless fricatives are also available at the above website in the VO and FD corpora, respectively. Note: in the samples contained in the AT and EM corpora no frequency dithering was applied to voiced fricatives; they were modeled as if completely voiced. However, in order to prevent the introduction of tonal artifacts, the frequency dithering technique presented in Section IV-B was applied above 2 kHz to the voiced fricatives contained in the FD corpus.


Results from a formal listening test show our model's (SHMDCU) performance to compare very favorably with that of other coders, including HNM (as implemented by Institut de la Communication Parlée, Grenoble, France) [13] and a pitch synchronous sinusoidal technique developed at the University of Vigo, Spain [14].

A. Formal Listening Test

We randomly selected 28 utterance-length target contours from the prosodic transplantation task, and the versions produced by the SHMDCU model were compared against the versions from the HNM implementation from Grenoble (HNMICP) and the Vigo technique (PSSVGO). It should be emphasized that all three models are development systems only, and are constantly being improved and extended. The versions tested here were those existing in August 1999.

In a pairwise comparison, listeners were asked to indicate which of two versions of the same target utterance they considered to be of higher quality. The utterances were in French and Czech, and the listeners were not native speakers of those languages, nor did they hear the natural utterances which provided the target contours, so that all they could judge was acoustic quality rather than intelligibility or closeness to the natural target.

Listeners were split into experts (those with a high level of familiarity with synthetic speech) and nonexperts. There were four experts and ten nonexperts. The experts judged two sets of pairs: SHMDCU versus HNMICP, and SHMDCU versus PSSVGO. The nonexperts judged only one set of pairs. The stimuli were arranged in two different balanced random orders, where one order was the reverse of the other (i.e., BA-BA-AB became BA-AB-AB): expert listeners judged one set of pairs in each order, and nonexpert listeners were divided equally between the two orders.

Listeners were allowed to listen to each pair as many times as they liked before making their judgment. They were not allowed to judge both elements of a pair to be equally good. Listeners' comments indicated that in some cases the decision was difficult, but in most cases there was a clear difference in quality. A spot check on consistency of judgments showed that the same listener would make the same choices a few minutes later with an agreement of between 75% and 80%.

B. Results

The results showed a high degree of consistency between listeners. There was no appreciable difference between the judgments of experts and nonexperts, or between the different orders of presentation. We will therefore present the overall average scores without further analysis.

Average scores for the SHMDCU versus HNMICP comparison were 87.4% for SHMDCU and 12.6% for HNMICP.

Average scores for the SHMDCU versus PSSVGO comparison were 70.9% for SHMDCU and 29.1% for PSSVGO.

All these stimuli were prepared in the institutions concerned, and were freely available on the Internet. In this respect, our evaluation represents a very fair test. However, as we mentioned above, these systems are not commercial products, nor are they stable versions: they are part of the continuing research program on signal generation within COST-258, and they have probably all been modified since these stimuli were produced. Nevertheless, we consider that these results show the SHMDCU model to be at least as good as other state-of-the-art systems for pitch and time-scale modification of speech.

VIII. SMOOTHING

A concatenative speech synthesis system based on the pitch and time-scale modification algorithms presented in Sections IV and V is currently under development. Our system uses a diphone database made available with the Festival TTS system.3

Each diphone is coded as described in Section III, i.e., voiced frames are coded as a set of harmonics and voiceless frames as a set of peak frequencies.

At synthesis time the sequence of diphones necessary to synthesize the target speech is retrieved from the speech unit database. Unit selection is a trivial process as the database contains only a single copy of each diphone. Target prosody (currently extracted from natural speech) is used to assign pitch, time-scale and energy modification factors to each frame. Each modification factor is simply given by

$$\text{modification factor} = \frac{\text{target value}}{\text{measured value}} \qquad (30)$$

Once scaling factors have been assigned and harmonic amplitudes scaled to match the target energy level, synthesis is carried out as described earlier. Spectral smoothing across voiced concatenation points, as in AT&T's Next Generation TTS system [15]–[17], is implemented by simple linear interpolation of harmonic amplitudes. Similarly, across voiceless concatenation points peak frequency amplitudes are also interpolated linearly.
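The amplitude smoothing can be sketched as a simple linear cross-fade of matched harmonic amplitudes. The number of frames over which the blend is applied, and the truncation to the shorter harmonic set, are illustrative choices not specified above.

```python
import numpy as np

def smooth_amplitudes(amps_left, amps_right, n_frames):
    """Linearly interpolate harmonic amplitudes across a concatenation point.

    amps_left  -- harmonic amplitudes of the last frame of the left unit
    amps_right -- harmonic amplitudes of the first frame of the right unit
    n_frames   -- number of frames over which the two sets are blended
    """
    K = min(len(amps_left), len(amps_right))        # match as many harmonics as possible
    a_left = np.asarray(amps_left)[:K]
    a_right = np.asarray(amps_right)[:K]
    w = np.linspace(0.0, 1.0, n_frames)[:, None]    # blending weights, one row per frame
    return (1 - w) * a_left + w * a_right
```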

As with other sinusoid based systems, phase mismatch at concatenation points is a serious issue. Specifically, at diphone boundaries target phase values bear no relation to start phase values. As a result, frequencies may "struggle" to meet their target phase and the consequent contorted frequency tracks can lead to waveform dispersion (a loss of waveform shape) and noisy transitions. The resulting synthetic speech quality can be seriously degraded.

The solution we propose is simply to make the frequency track transitions across diphone boundaries as smooth as possible. Let one frame be the last frame of a diphone and the next frame be the first frame of the following diphone. As was the case with the pitch and time-scaling algorithms above, we rely on the fundamental to keep the harmonics "locked" in phase. Consider the pitch-scaled fundamentals in these two frames. Discarding the measured target phase value, the smoothest frequency track between the fundamentals is calculated by simple linear interpolation and is given by

(31)

where the quantities involved are the analysis frame interval, the pitch modification factors associated with the two frames on either side of the boundary, and the time-scale modification factor.

3 http://www.cstr.ed.ac.uk/projects/festival/festival.html


Integrating (31) and adding back the start phase gives the new target phase value. Again, as was the case for pitch and time-scale modification, the amount by which all other target phases must be adjusted in order to calculate their new values is given by

(32)

The phases of all harmonics in the frame are then made consistent with that of their fundamental by applying

(33)

Synthesis may then be carried out by computing the smoothest track from each harmonic's start to target parameters (see Section II).
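A sketch of this boundary correction is given below, under the same assumptions as the time-scaling sketch in Section IV: harmonics are locked to the fundamental by a single linear phase shift, a linearly interpolated fundamental track stands in for (31), and all names are illustrative.

```python
import numpy as np

def correct_boundary_phases(theta1_start, theta_meas, w0_start, w0_end, T, alpha):
    """Adjust the target phases of the first frame of the next diphone.

    theta1_start -- phase of the fundamental at the end of the previous diphone (rad)
    theta_meas   -- measured target phases of harmonics 1..K in the new frame (rad)
    w0_start, w0_end -- pitch-scaled fundamental frequencies on either side of
                        the boundary (rad/s)
    T, alpha     -- analysis frame interval (s) and time-scale factor
    """
    k = np.arange(1, len(theta_meas) + 1)
    # Smoothest (linear) fundamental track across the boundary, integrated over
    # the scaled interval, in place of the measured target phase (31).
    theta1_new = np.mod(theta1_start + 0.5 * (w0_start + w0_end) * alpha * T,
                        2 * np.pi)
    # Amount by which every other target phase must be shifted (32).
    phi = theta1_new - theta_meas[0]
    # Lock all harmonics to the fundamental, as in (33).
    return np.mod(theta_meas + k * phi, 2 * np.pi)
```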

Note that the above solution to phase mismatch relies on the assumption that, across voiced concatenation points, phase relations between harmonics and their fundamental will be quite similar. Thus, by making the fundamental's transition as smooth as possible all other transitions will also be made smoother. Other criteria could also be used to solve for the phase adjustment when resolving phase mismatches. For example, it could be chosen such that all harmonics' transitions, not only the fundamental's, were made as smooth as possible when passing from one diphone into another. However, given our aim of the development of a TTS system, where performance is a priority, only the fundamental frequency is used in our approach.

It should be pointed out that this approach adds no computational overhead to the synthesis procedure presented in Section VI. When performing pitch and time-scale modification the same process must be followed, i.e., a new target phase configuration must be generated based on the pitch- and time-scaled fundamental. Thus our solution to the problem of phase mismatch fits neatly into and follows directly from our existing pitch and time-scale modification algorithms. Furthermore, unlike other proposed methods for removing local phase mismatches [8], we do not rely on pitch-synchronous analysis.

That this approach is indeed effective at removing phase mismatches is illustrated by Figs. 10 and 11. Fig. 10 shows a section of waveform traversing a diphone boundary where phase mismatch correction has not been applied. Waveform shape dispersion is evident. In contrast, in Fig. 11 the phase mismatch procedure outlined above has been applied and waveform shape is well preserved.

Lastly, this solution to the phase mismatch problem suggests that a similar approach could be adopted throughout synthesis. Currently, a target phase set is calculated based on interpolating the track connecting the fundamental frequency's start and target parameters. Using the approach above, no interpolation would be required; rather, the fundamental component's target phase would be chosen such that its transition from one frame to the next was as smooth as possible (as mentioned above, other criteria could also be used). Target phases in the new frame, taken from FFT analysis, would be synchronized on this value, with synthesis then proceeding as normal. A consequence of such a synthesis technique would be that, even in the absence of any modification (i.e., with pitch and time-scale factors of one), the measured phase values would be adjusted at synthesis time such that the smoothest transition from one frame to the next was obtained. Furthermore, using this approach would mean that diphone concatenation points would not be a special case, the same synthesis procedure being applied for all frames. Synthesis then would not be strictly bound to measured phase values: instead, each set would serve as a template for all valid target phase configurations. This approach has been tested and achieves both high quality resynthesis and modification.

Fig. 10. Diphone boundary with phase mismatch.

Fig. 11. Diphone boundary with phase mismatch correction.

An informal copy synthesis experiment was carried out using natural prosody (taken from a speaker different to that used to produce the diphone database) to generate the phrase "I need to arrive by 10:30 am on Saturday." We found the resulting synthetic speech to be of a high quality, close to that of natural speech.

IX. DISCUSSION

We have described a high quality yet conceptually simple approach to pitch and time-scale modification of speech, which is particularly suitable for concatenative synthesis. Using only the harmonic structure of the sinusoids in each frame, phase coherence and waveform shape are well preserved after modification.


The simplicity of our approach stands in contrast to other shape invariant algorithms. In [2], pitch pulse onset times, used to preserve waveform shape, must be estimated in both the original and target speech. In the approach presented here, onset times play no role and need not be calculated. Furthermore, in [2] onset times are used to impose a structure on phases, and errors in their location lead to unnaturalness in the modified speech. In the approach described here, the phase relations inherent in the original speech are preserved during modification. Phase coherence is thus guaranteed and waveform shape retained. Obviously, our approach has a similar advantage over the ABS/OLA modification techniques [18], [3], which also make use of pitch pulse onset times.

Unlike the PSOLA-inspired HNM [6], [7] approach to speech transformation, using our technique no mapping need be generated from synthesis to analysis short-time signals. Furthermore, the duplication/deletion of information in the original speech, a characteristic of PSOLA type techniques, is avoided. Every frame, to which a single pitch and time-scale modification factor is assigned, is used exactly once during resynthesis.

The time-scaling technique presented here is somewhat similar to that used in the ABS/OLA model in that the harmonic nature of the sinusoids used to code each frame is exploited by both models. However, the frequency (and associated phase) tracks linking one frame with the next, which are absent from the ABS/OLA model, are retained here. Furthermore, our new pitch modification algorithm is a direct extension of our time-scaling approach and is simpler than the "phasor interpolation" [18] mechanism used in the ABS/OLA model.

The incorporation of modification techniques specific to voiced and voiceless speech brings to light deficiencies in the analysis model presented in Section III. Voicing errors can seriously lower the quality of the resynthesized speech. For example, where voiced speech is deemed voiceless, frequency dithering is wrongly applied, waveform dispersion occurs, and the speech is perceived as having an unnatural "rough" quality. Correspondingly, where voiceless speech is analyzed as voiced, its random nature is not preserved and the speech can take on a tonal character.

Voicing errors are not the only problem area. Voiced fricatives, by definition, consist of a deterministic and a stochastic component and, because our model applies a binary voicing distinction, cannot be accurately modeled.

The model could be improved and the problems outlined above alleviated by incorporating several of the elements used in HNM analysis. First, leaving the rest of the model as it stands, a more refined pitch estimation procedure could be added, e.g., F0 could be chosen to be the value whose harmonics best fit the spectrum. Secondly, the incorporation of a voicing cut-off frequency would add the flexibility required to solve the problem with voiced fricatives. Modeling such sounds as a set of harmonics, above the cut-off frequency the dithering technique described in Section IV-B could be used to prevent tonality while still retaining some time-domain characteristics (start and target phases are imposed consistent with a purely voiced sound). This latter point is important, as the stochastic and deterministic components of voiced fricatives can be perceived as having been generated by separate sources if they are not properly synchronized.

During testing of this approach by modifying voiced fricatives, a cut-off frequency of 2 kHz was imposed. Although obviously less sophisticated than HNM's approach, where an optimal cut-off frequency value is calculated for each frame, our approach produces good quality results. (Examples are available in the FD corpus at the website mentioned in Section VII.) Furthermore, given that our algorithm is intended for use in a concatenative TTS system, where the speech segment under analysis is known, cut-off frequencies can be determined in advance, possibly on a segment-by-segment basis. Lastly, the inverse filtering technique currently being used [11] is quite simple and is designed for efficiency rather than accuracy. A more accurate algorithm should yield better quality results.

The main computational burden incurred in implementing pitch and time-scale modification using our approach centers on keeping frequencies in phase. In purely voiceless regions, however, phases can be considered random and thus would not require explicit monitoring, thereby improving efficiency. Another computationally expensive process is sinusoid duplication during time-scale modification of voiceless regions, where the number of frequency components per frame is doubled. This approach is justified on the basis that it is simple, produces good quality results and, based on the test suite used in Section VII, is not often required.

Implicit in our approach is the assumption that, although the preservation of phase coherence between harmonics is crucial for high-quality synthesis, the actual value of phase at any given point is unimportant. Our approach changes the absolute value of phase significantly, especially in longer utterances, but we maintain the phase relations between harmonics. This assumption has allowed us to simplify the processing required for prosodic modification of speech whilst maintaining synthesis quality. Indeed, the simplicity of the processing may help to preserve the naturalness of the speech.

A speech synthesis system using our techniques for prosodicmodification and smoothing is under development.

REFERENCES

[1] R. J. McAulay and T. F. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, pp. 744–754, Aug. 1986.

[2] T. F. Quatieri and R. J. McAulay, "Shape invariant time-scale and pitch modification of speech," IEEE Trans. Signal Processing, vol. 40, pp. 497–510, Mar. 1992.

[3] E. B. George and M. J. T. Smith, "Speech analysis/synthesis and modification using an analysis-by-synthesis/overlap-add sinusoidal model," IEEE Trans. Speech Audio Processing, vol. 5, pp. 389–406, Sept. 1997.

[4] M. W. Macon, "Speech synthesis based on sinusoidal modeling," Ph.D. dissertation, Georgia Inst. Technol., Atlanta, 1996.

[5] E. Moulines and F. Charpentier, "Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Commun., vol. 9, pp. 453–467, Dec. 1990.

[6] J. Laroche, Y. Stylianou, and E. Moulines, "HNS: Speech modification based on a harmonic + noise model," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, vol. 2, Apr. 1993, pp. 550–553.

[7] Y. Stylianou, J. Laroche, and E. Moulines, "High quality speech modification based on a harmonic + noise model," Proc. Eurospeech '95, pp. 451–454, Sept. 1995.

[8] Y. Stylianou, "Removing phase mismatches in concatenative speech synthesis," in Proc. 3rd ESCA/COCOSDA Workshop Speech Synthesis, Jenolan Caves, NSW, Australia, Nov. 1998.

[9] W. B. Kleijn and K. K. Paliwal, Eds., "A robust algorithm for pitch tracking (RAPT)," in Speech Coding and Synthesis. New York: Elsevier.

[10] D. O'Brien and A. I. C. Monaghan, "Shape invariant time-scale modification of speech using a harmonic model," in Proc. Int. Conf. Acoustics, Speech, Signal Processing, Phoenix, AZ, Mar. 1999, pp. 381–384.

[11] P. Alku, E. Vilkman, and U. K. Laine, "Analysis of glottal waveform in different phonation types using the new IAIF-method," in Proc. Int. Congr. Phonetic Sciences, 1991.

[12] D. O'Brien and A. I. C. Monaghan, "Shape invariant pitch modification of speech using a harmonic model," Proc. Eurospeech '99, 1999.

[13] G. Bailly, "A parametric harmonic plus noise model," in Proc. COST 258, G. Keller et al., Ed., to be published.

[14] E. R. Banga, X. Fernando-Salgado, and C. Garcia-Mateo, "Concatenative text-to-speech synthesis based in sinusoidal modeling," in Proc. COST 258, C. Keller et al., Ed., to be published.

[15] M. Beutnagel, A. Conkie, and A. Syrdal, "Diphone synthesis using unit selection," in Proc. 3rd ESCA/COCOSDA Workshop Speech Synthesis, Jenolan Caves, NSW, Australia, Nov. 1998.

[16] Y. Stylianou, "Concatenative speech synthesis using a harmonic plus noise model," in Proc. 3rd ESCA/COCOSDA Workshop Speech Synthesis, Jenolan Caves, NSW, Australia, Nov. 1998.

[17] M. Beutnagel, A. Conkie, J. Schroeter, and A. Syrdal, "The AT&T next-gen TTS system," in Proc. Joint Meeting ASA, EAA, DAGA, Berlin, Germany, Mar. 1999.

[18] E. B. George, "An analysis-by-synthesis approach to sinusoidal modeling applied to speech and music signal processing," Ph.D. dissertation, Georgia Inst. Technol., Atlanta, Nov. 1991.

Darragh O'Brien received the B.Sc. degree in applied computational linguistics and the Ph.D. degree in computer science from Dublin City University, Dublin, Ireland, in 1995 and 2000, respectively.

He is currently with Sun Microsystems, Inc., Dublin.

A. I. C. Monaghan received the Ph.D. degree in linguistics from the University of Edinburgh, Edinburgh, U.K., in 1991.

After six years of speech synthesis research at the Center for Speech Technology Research, Edinburgh, he took up a lectureship in computational linguistics at Dublin City University, Dublin, Ireland. He was Director of the National Center for Language Technology from 1995 to 2000, when he joined Aculab plc, Milton Keynes, U.K., to develop speech synthesis systems for computer telephony. He is Vice-Chairman for Prosody in the European COST 258 Research Action "Naturalness in Speech Synthesis."