[IEEE GLOBECOM '95 - Singapore (13-17 Nov. 1995)] Proceedings of GLOBECOM '95 - Fixed bit-rate PWI...

FIXED BIT-RATE PWI SPEECH CODING WITH VARIABLE FRAME LENGTH

K.W. Tang and B.M.G. Cheetham

Dept. of Electrical Engineering and Electronics, University of Liverpool, Liverpool L69 3BX, UK.

Abstract

This paper presents a harmonic vocoder for digitising speech at fixed low bit-rates based on the Prototype Waveform Interpolation (PWI) technique. A variable frame-length form of this technique is proposed. The basic idea is consistent with that of the original PWI algorithm, but it improves performance at low bit-rates and reduces computational complexity. This variable frame-length coding technique, which uses vector quantised cepstral coefficients to represent the spectral envelope of individual pitch-periods, produces good perceptual quality at 2.4 kb/s.

1. Introduction

The Prototype Waveform Interpolation (PWI) method proposed by Kleijn [1] is one technique which is claimed to be able to reproduce high quality speech at bit-rates around 4 kb/s. Recent work by Shoham [2] proposes a Time-Frequency Interpolation (TFI) technique which in some ways is a generalised form of the PWI method. The success of these and other techniques has generated considerable interest in the potential of interpolation methods for low bit-rate speech coding. This interest has led us to investigate a form of PWI coding which defines each frame to consist of an integer number of pitch-periods [3]. This paper presents a modified form of PWI, termed variable frame-length PWI (VPWI), with lower computational complexity and yet with the potential for producing high quality speech at low bit-rates. Each frame is represented by a fixed number of bits produced at regular intervals of, say, 20-30 ms; the bit-rate is therefore fixed. The number of pitch-cycles in each frame is chosen dynamically, allowing a reserve of up to 150 samples, an assumed maximum pitch-period, to accumulate in a receiver buffer. An advantage of this approach is that shorter interpolation lengths than the average can be used for transition frames. Also, prototype waveforms are phase-aligned automatically, thus eliminating the need for time-alignment between prototype waveforms, which complicates the decoding process with PWI [1].

0-7803-2509-5/95 US$4.00 © 1995 IEEE

The performance of this approach has been further improved by encoding only the harmonic amplitudes of each prototype waveform and regenerating the phase at the decoder under the assumption that the vocal tract transfer function is minimum phase.

2. Prototype Waveform Interpolation

Voiced speech is characterised by a high level of periodicity, and is commonly modelled in terms of a periodic excitation waveform applied as input to an all-pole digital filter. The shape of this excitation waveform and the pitch-period change slowly over time, as do the parameters of the digital filter. The idea of PWI is to periodically extract from the excitation signal representative segments, i.e. prototype waveforms, and to efficiently encode the shapes and the durations of these waveforms. The prototype waveforms are segments of one pitch-period duration, starting at "update-points" separated by suitable intervals of time. Their wave-shapes are characterised by the Fourier series coefficients of a periodic waveform whose fundamental period is the specified pitch-cycle duration. At the decoder, the waveform between update-points, constituting a frame, is modelled as a Fourier series, i.e.

where $G_k$ and $H_k$ are the Fourier series coefficients encoded at the update-points, assumed to be at $t = t_0$ and $t = t_1$, and $\alpha[t]$ is an interpolation function which gradually increases from $\alpha[t_0] = 0$ to $\alpha[t_1] = 1$. The instantaneous fundamental frequency $\omega_0[t]$ is equal to $2\pi / p[t]$, where $p[t]$ interpolates between the pitch-periods at the update-points, i.e.:

$p[t] = (1 - \alpha[t])\, p[t_0] + \alpha[t]\, p[t_1]$ (2)

The update-points are normally spaced regularly in time, and the extracted prototype waveforms will produce Fourier series components that are not necessarily in phase from one prototype to the next. Therefore a "time-alignment" procedure must be applied which delays each Fourier series to achieve maximum similarity between the $G_k$ coefficients from one update-point to the next, and similarly for the $H_k$ coefficients.
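As an illustration only, the decoder-side interpolation described in this section can be sketched in Python. The sketch assumes a linear ramp for the interpolation function, linear interpolation of both the coefficients and the pitch-period, and a sample-by-sample accumulation of the fundamental phase starting from zero. The function name `synthesize_frame` and its arguments are hypothetical and are not taken from the coder itself.

```python
import math


def synthesize_frame(g0, h0, g1, h1, p0, p1, n_samples):
    """Interpolate Fourier series coefficients and pitch-period over one frame.

    g0/h0 and g1/h1 are the cosine/sine coefficient lists (harmonics 1, 2, ...)
    at the two update-points; p0 and p1 are the pitch-periods in samples.
    """
    n_harm = min(len(g0), len(h0), len(g1), len(h1))
    out = []
    phase = 0.0  # accumulated fundamental phase (assumed to start at zero)
    for n in range(n_samples):
        alpha = n / n_samples                # assumed linear ramp from 0 towards 1
        p = (1.0 - alpha) * p0 + alpha * p1  # interpolated pitch-period
        sample = 0.0
        for k in range(1, n_harm + 1):
            g = (1.0 - alpha) * g0[k - 1] + alpha * g1[k - 1]
            h = (1.0 - alpha) * h0[k - 1] + alpha * h1[k - 1]
            sample += g * math.cos(k * phase) + h * math.sin(k * phase)
        out.append(sample)
        phase += 2.0 * math.pi / p           # omega_0[t] = 2*pi / p[t]
    return out
```

With identical prototypes at both update-points the output reduces to a stationary periodic waveform, which gives a simple sanity check of the sketch.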

Equation (1) may be re-expressed in the form

3. The VPWI method

In our fixed bit-rate variable frame-length coder, each frame contains an integer number of pitch-periods and is represented by a fixed number of bits. The number of pitch-periods in a frame is determined independently at the encoder and decoder from a knowledge of the pitch-period and the state of the receiver buffer; no additional information about this need be encoded. This is done in such a way that the average frame-length is a constant, T say. The differences between frame boundaries and the corresponding sampling times in the speech waveform never exceed 150 samples; the extra delay introduced by this technique is therefore 150 samples, i.e. about 20 ms in 8 kHz sampled speech. At update-point $t_m$, let the number of pitch-periods in a frame be M+R where:

(4)

with $P_{m-1}$ equal to the pitch-period of the previous prototype and $P_m$ equal to the pitch-period of the current prototype. When the pitch-period of the prototype waveform is more than 49 samples, the value of R is either one or zero depending on the buffer state, i.e. the number of samples, S, currently stored in the buffer:

$R = \begin{cases} 1 & \text{when } \sum_{i=1}^{M+1} P_i + S \le T + 150 \\ 0 & \text{otherwise} \end{cases}$ (5)

When the pitch-period of the prototype waveform is less than 50 samples, R is made equal to 2 or 0, depending on the buffer state, to achieve a wider variation in frame-lengths. The threshold of 50 samples was chosen such that the resulting frame-length for interpolation is not less than half of the average frame-length (i.e. T/2).
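The buffer-driven choice of the number of pitch-periods can be sketched as follows. This is a minimal sketch under stated assumptions, not the coder's actual rule: M is assumed to be the integer part of T divided by the mean of the previous and current pitch-periods, the cycle lengths inside a frame are assumed to step linearly from the previous to the current pitch-period, and the 50-sample threshold is assumed to apply to the current prototype. The name `cycles_in_frame` and its parameters are hypothetical.

```python
def cycles_in_frame(p_prev, p_cur, buffered, avg_len=160, max_reserve=150):
    """Return (number of pitch-cycles, new buffer level) for one voiced frame."""
    mean_pitch = (p_prev + p_cur) / 2.0
    m = max(1, int(avg_len / mean_pitch))  # assumed form of M
    extra = 1 if p_cur > 49 else 2         # R is 0 or 1 (long pitch), 0 or 2 (short pitch)

    def frame_length(n_cycles):
        # Assumed: cycle lengths step linearly from p_prev to p_cur.
        total = 0.0
        for i in range(1, n_cycles + 1):
            total += p_prev + (p_cur - p_prev) * i / n_cycles
        return int(round(total))

    n_cycles = m + extra
    length = frame_length(n_cycles)
    if length + buffered > avg_len + max_reserve:
        # The longer frame would overfill the reserve, so R = 0.
        n_cycles = m
        length = frame_length(n_cycles)
    # One average frame is played out per frame interval.
    return n_cycles, buffered + length - avg_len
```

For a constant 80-sample pitch-period and an empty buffer the sketch selects three cycles and leaves 80 samples in reserve; once the reserve is at 80 samples it falls back to two cycles, so the reserve stays within the 150-sample limit.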

3.1 Voicing transitions

Voicing transitions are often a combination of both periodic and random signal components. The PWI technique was designed primarily for producing voiced speech, assumed to be a concatenation of slowly evolving pitch-period waveforms. However, interpolation of Fourier series coefficients in voicing transitions often changes signal components with random noise-like characteristics to components with periodic properties. Thus an increase in periodicity of the reconstructed speech signal over that of the original speech signal may be obtained.

Inaccuracies in the description of the pitch-cycle waveform dynamics and durations may introduce perceptual distortion in the synthetic speech. Synthetic speech with a "buzzy" quality may be obtained when it is reconstructed with modifications to the character of the periodicity [1]. This kind of distortion often causes the reconstructed speech to have too much periodicity in comparison with the original.

In our coder, the speech signal can be coded with shorter frame-lengths than average during transition frames. At the onset of a voiced section at update-point $t_b$, a past prototype waveform at update-point $t_{b-1}$ is invented by extracting a waveform segment of length $p_b$ samples starting at the update-point $t_{b-1}$. The pitch-period $p_b$ is taken to be that of the current prototype waveform starting at update-point $t_b$. Similarly, at the offset of a voiced section, a prototype waveform of the current frame is invented by extracting, for update-point $t_b$, a waveform of length $p_{b-1}$, i.e. the pitch-period at the previous update-point.

As for voiced frames, the lengths of transition frames are determined independently at the encoder and decoder. At the encoder, the frame-length is made equal to the average frame-length, T, which is 160 samples in this work. The frame-length at the decoder, $T_d$, is estimated from a knowledge of the duration of the prototype waveform and the state of the receiver buffer, i.e.:

where M is the number of pitch-periods in a frame, P is equal to the pitch-period of the current (or previous) prototype waveform and S is the number of samples currently in the buffer. Since $T_d$ is never greater than T, the value of R in Equation (5) is always zero at the decoder for transition frames. Thus the number of pitch-periods is M where

M = int[ $1 (7)

The frame-length at the decoder may not be the same as the frame-length at the encoder. The variable frame-length technique tends to fill up the receiver buffer so that a shorter length than average can be used for transition frames. This inequality of frame-length between encoder and decoder implies that a degree of time-scale modification occurs at the decoder for transition frames. However, since the reconstructed speech waveforms are made to change smoothly, the auditory quality should not be obviously affected.

3.2 Unvoiced and silence frames

The waveform interpolation technique is used for voiced speech only. During unvoiced segments, the speech is coded by a simplified form of CELP [4] with frame-length equal to T at both the encoder and decoder.

Longer than average frame-lengths are used to fill up the receiver buffer during periods of silence. The frame-length at silence is $T_s$ where

$T_s = T + 150 - S$ (8)

Filling up the receiver buffer has the advantage of allowing a shorter interpolation length than average to be used in the first frame after silence or unvoiced speech. The first frame after silence is usually a voicing transition and a reduction in the interpolation interval tends to reduce the effect of excess periodicity introduced by PWI at these points.
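The effect of Equation (8) on the receiver buffer can be checked with a few lines of Python. The bookkeeping in `buffer_after_frame` (one average frame of T samples played out per frame interval) is an assumption made for this sketch, and both function names are hypothetical.

```python
def silence_frame_length(buffered, avg_len=160, max_reserve=150):
    """Decoder frame-length during silence: T_s = T + 150 - S."""
    return avg_len + max_reserve - buffered


def buffer_after_frame(buffered, decoded_len, avg_len=160):
    """Samples left in the buffer after one average frame has been played out (assumed model)."""
    return buffered + decoded_len - avg_len
```

Under this assumed model a single silence frame brings the reserve up to the full 150 samples whatever the starting level: with S = 40, the frame-length is 270 samples and the buffer then holds 40 + 270 - 160 = 150 samples.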

4. Low Bit-Rate Speech Coding based on VPWI

In our coder, an accurate pitch-classifier sorts the input speech into one of a number of predefined categories at each update-point, i.e. voiced, unvoiced, transition or silence. For a voiced frame, the pitch-period, P, assumed to lie between 20 and 150 sampling intervals, is determined from the frame. A waveform segment of length P samples starting at the update-point is extracted as a prototype waveform for voiced speech, and a pitch-synchronous DFT is used to determine its harmonic amplitudes, $D_k[t_0]$, for $k = 0, 1, \ldots, \frac{P}{2} - 1$. The real cepstral coefficients:

$c[t] = \frac{1}{P} \sum_{k=0}^{P-1} \ln D_k[t_0]\, e^{j 2\pi k t / P}, \quad t = 0, \ldots, P-1$ (9)

are now calculated via an inverse DFT and these (or the first 30 of these when P>60) are vector quantised to represent the prototype waveform. The cepstral coefficients are encoded along with the pitch-period measurement and voicing decision. No phase information is included.

At the decoder, the decoded cepstral coefficients $c[t]$ are converted back to harmonic amplitudes $D_k[t_0]$ by a P-point DFT, reversing Equation (9). The phase offsets $\phi_k[t_0]$ are deduced from the cepstral coefficients assuming a minimum phase relationship [5]:

$\phi_k[t_0] = -2 \sum_{t=1}^{P-1} c[t] \sin\left(\frac{2\pi k t}{P}\right), \quad k = 0, \ldots, P-1$ (10)

Fourier series coefficients $G_k[t_0]$ and $H_k[t_0]$ in Equation (1) may be calculated from $D_k[t_0]$ and $\phi_k[t_0]$, and these may be interpolated between update-points as in Kleijn [1].

The cepstral-based approach assumes the prototype waveform to begin at an excitation point, i.e. that it is a minimum phase impulse response. This assumption of the cepstrum may introduce a time-shifting effect on the reconstructed prototype waveform. However because of the pitch synchronous nature of the PWI mechanism, the reconstructed prototype waveform will have maximum similarity from one prototype to the next. This has the advantage of eliminating the need for time alignment between prototypes which is required prior to the interpolation procedure.
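The cepstral analysis and minimum-phase regeneration described above can be sketched in plain Python. This is an illustrative sketch rather than the coder's implementation: it assumes the P harmonic amplitudes are supplied as a full symmetric set (D[k] equal to D[P - k]), and it sums only the first half of the real cepstrum when forming the phase, which is the usual folding used for minimum-phase reconstruction and may differ from the exact limits printed in the paper. The function names are hypothetical.

```python
import math


def real_cepstrum(amplitudes):
    """Inverse DFT of the log harmonic amplitudes.

    `amplitudes` holds P positive values with D[k] == D[P - k], so the
    imaginary part of the inverse DFT cancels and only cosines are needed.
    """
    P = len(amplitudes)
    logs = [math.log(d) for d in amplitudes]
    return [
        sum(logs[k] * math.cos(2.0 * math.pi * k * t / P) for k in range(P)) / P
        for t in range(P)
    ]


def minimum_phase(cepstrum):
    """Phase offsets implied by the cepstrum under a minimum-phase assumption."""
    P = len(cepstrum)
    half = (P - 1) // 2  # assumption: only the first half of the cepstrum is summed
    return [
        -2.0 * sum(cepstrum[t] * math.sin(2.0 * math.pi * k * t / P)
                   for t in range(1, half + 1))
        for k in range(P)
    ]
```

A convenient check is a response that is known to be minimum phase, such as 1 + 0.5 z^-1 sampled at P = 64 points: the phase recovered from the magnitudes alone should agree closely with the true phase.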


5. Experimental Results

This section discusses an implementation of the VPWI vocoder for a bit-rate of 2.4 kb/s. Table 1 gives the bit-allocations for voiced speech and unvoiced speech respectively. The coder has been applied to an 8 kHz sampled voiced speech segment with a 20 ms average interpolation interval. The 30 cepstral coefficients are split into two codebooks. The first codebook contains the first coefficient and the second codebook contains the remaining 29 coefficients. The first cepstral coefficient is important because it contains the energy level information. A single-stage scalar quantiser is used for the first codebook and a multistage vector quantiser for the second codebook.

Unvoiced speech is coded by using a CELP procedure [4] without LTP. For each 10 ms subframe, the optimal entry of a 6-bit stochastic codebook and the corresponding codebook gain are found.

Informal listening tests have shown that the reconstructed speech produced by this coder is better than that of LPC-10. A slight muffling effect can sometimes be detected, especially for female speech. The difficulties with female speech are common with this kind of coder because the short pitch-period makes pitch-period estimation errors more likely. Also, the number of pitch-cycles which have to be generated between update-points is greater than for male speech. Any unnaturalness in the way the instantaneous frequency and the Fourier series coefficients vary across the synthesis frame will therefore be more apparent. A non-linear interpolation approach [6] may help to reduce this kind of distortion. Further, the minimum phase assumption is not entirely accurate [7], and the effect of this inaccuracy may be a cause of some perceived unnaturalness.
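The per-frame bit budget implied by these figures follows from simple arithmetic, shown here as a short Python sketch. Only the quantities stated in the text are used (2.4 kb/s, a 20 ms average frame, 10 ms unvoiced subframes and a 6-bit stochastic codebook); the function name is hypothetical, and the allocation of the remaining bits is not reproduced because it depends on Table 1.

```python
def bits_per_frame(bit_rate_bps, frame_ms):
    """Number of bits available for one frame at a fixed bit-rate."""
    return bit_rate_bps * frame_ms // 1000


frame_bits = bits_per_frame(2400, 20)        # budget for one 20 ms frame
subframes = 20 // 10                         # 10 ms subframes in an unvoiced frame
codebook_index_bits = subframes * 6          # stochastic codebook indices only
```

At 2.4 kb/s a 20 ms frame therefore carries 48 bits, of which the two stochastic codebook indices of an unvoiced frame take 12.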

6. Conclusion

This paper discusses the use of a variable frame-length waveform interpolation technique where each frame contains an integer number of pitch-periods. Only the spectral envelope of each prototype waveform is encoded, the phase being regenerated at the decoder. This technique has the advantage of reducing the complexity of the interpolation process and of using shorter than average frame-lengths for transitions. It has been found to produce good quality speech at a fixed bit-rate of 2.4 kb/s.

References

[1] W.B. Kleijn, "Continuous Representations in Linear Predictive Coding", Proc. ICASSP, pp. 201-204, 1991.

[2] Y. Shoham, "Low-Rate Speech Coding based on Time-Frequency Interpolation", Proc. ICSLP, pp. 37-40, 1992.

[3] K.W. Tang and B.M.G. Cheetham, "Variable Frame Length Prototype Waveform Interpolation for Low Bit Rate Speech Coding", IEE Colloquium 1993/234, 1993.

[4] M.R. Schroeder and B.S. Atal, "Code Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates", Proc. ICASSP, pp. 937-940, 1985.

[5] R.J. McAulay, T.F. Quatieri and T.G. Champion, "Sinewave Amplitude Coding using High-Order Allpole Model", pp. 395-398, 1994.

[6] H. Li and G.B. Lockhart, "Non-linear Interpolation in Prototype Waveform Interpolation (PWI) Encoders", IEE Colloquium, 1994.

[7] X.Q. Sun, B.M.G. Cheetham and W.T.K. Wong, "Spectral Envelope and Phase Optimisation for Sinusoidal Speech Coding", 1995 IEEE Workshop on Speech Coding, Maryland, Sep. 1995.

Acknowledgement: The authors acknowledge support from BT Laboratories and a British Government ORS award for this work.
