
ACCURATE SPEECH DECOMPOSITION INTO PERIODIC AND APERIODIC COMPONENTS BASED ON DISCRETE HARMONIC TRANSFORM

Piotr Zubrycki and Alexander Petrovsky

Department of Real-Time Systems, Bialystok Technical University Wiejska 45A street, 15-351 Bialystok, Poland

phone: + (48 85) 746–90–50, fax: + (48 85) 746–90–57 email: [email protected]

ABSTRACT

This paper presents a new method for the decomposition of the speech signal into periodic and aperiodic components. The proposed method is based on the Discrete Harmonic Transform (DHT), a transformation able to analyze the signal spectrum in the harmonic domain. Another feature of the DHT is its ability to synchronize the transformation kernel with the time-varying pitch frequency. The system works without a priori knowledge of the pitch track. Unlike most applications, the proposed method estimates the fundamental frequency change within a frame before estimating the fundamental frequency itself. The periodic component is modelled as a sum of harmonically related sinusoids, and the DHT is used for accurate estimation of the amplitudes and initial phases. The aperiodic component is defined as the difference between the original speech and the estimated periodic component.

1. INTRODUCTION

The speech signal is generally assumed to be a composition of two major components: periodic (harmonic) and aperiodic (noise). The decomposition of speech into these two basic components is a major challenge in many speech processing systems. In general, the task lies in the accurate estimation of the periodic and aperiodic components so that they can be analyzed separately, which plays an important role in many speech applications such as synthesis or coding. The periodic component is generated by the vibrations of the vocal folds, while the aperiodic component is generated by the modulation of the air flow. The modulated air flow is responsible for generating fricative or plosive sounds, but it is also present in voiced sounds. The basic speech production model assumes that speech is either voiced or unvoiced. In this basic model the unvoiced part of speech is generated by passing a white Gaussian noise signal through a linear filter which represents the vocal tract characteristics. Voiced parts of speech are modelled as a time-varying impulse train modulated by the vocal tract filter. In this model it is assumed that no noise signal is present in the voiced parts of speech. In fact, real voiced speech contains some noise. The speech signal can be viewed as a mixed-source signal with both periodic and aperiodic excitation. In the sinusoidal and noise speech models this mixed-source speech signal is generally modelled as [1]:

$$ s(n) = \sum_{k=1}^{K} A_k(n)\cos\varphi_k(n) + r(n), \qquad (1) $$

where $A_k(n)$ is the instantaneous amplitude of the k-th harmonic, $K$ is the number of harmonics present in the speech signal, $r(n)$ is the noise component, and $\varphi_k(n)$ is the instantaneous phase of the k-th harmonic, defined as:

$$ \varphi_k(n) = 2\pi \sum_{i=0}^{n} \frac{f_k(i)}{F_s} + \varphi_k(0), \qquad (2) $$

where $f_k(i)$ is the instantaneous frequency of the k-th harmonic, $F_s$ is the sampling frequency and $\varphi_k(0)$ is the initial phase of the k-th harmonic.

Sinusoidal speech modelling treats the speech signal as a sum of periodic and aperiodic components, where the periodic signal is defined as a sum of sinusoids with time-varying amplitudes and frequencies. If the $f_k$ obey $f_k = k f_0$, where $f_0$ is the fundamental frequency, the sinusoids in the model are harmonically related and the model is called Harmonic+Noise.
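To make the model concrete, the following minimal NumPy sketch generates one frame of the Harmonic+Noise signal of Eqs. (1)-(2); the function name and parameter layout are ours, not the paper's.

```python
import numpy as np

def synth_harmonic_noise(f0, amps, phi0, noise, fs):
    """Synthesize s(n) per Eqs. (1)-(2) with harmonically related
    sinusoids (f_k = k*f0): amps is (K, N), phi0 is (K,), noise is (N,)."""
    K, N = amps.shape
    s = np.zeros(N)
    for k in range(1, K + 1):
        # Eq. (2): instantaneous phase as a running sum of f_k(i)/F_s
        phase = 2 * np.pi * np.cumsum(k * f0) / fs + phi0[k - 1]
        s += amps[k - 1] * np.cos(phase)   # Eq. (1), periodic part
    return s + noise                       # plus the aperiodic r(n)

# e.g. 16 harmonics, pitch gliding from 200 Hz to 220 Hz over 256 samples
fs, N, K = 8000, 256, 16
f0 = np.linspace(200.0, 220.0, N)
s = synth_harmonic_noise(f0, np.ones((K, N)), np.zeros(K),
                         0.01 * np.random.randn(N), fs)
```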

There are several variations of sinusoidal speech modelling [1,2]. The sinusoidal speech model presented by McAulay and Quatieri [3], and further developed by George and Smith [4], treats voiced speech as a sum of harmonically related sinusoids with the amplitudes and phases obtained directly from the Short-Time Fourier Transform (STFT) spectrum. Unvoiced speech is modelled as a sum of randomly distributed sinusoids with random initial phases. Stylianou presented a more accurate approach to voiced speech modelling based on the harmonic+noise model [5]. In this approach the maximum voicing frequency is determined on the basis of the analysis of the speech spectrum. The speech band is divided by the maximum voicing frequency into a lower, voiced band and a higher, unvoiced band. In the Multiband Excitation Vocoder (MBE) presented by Griffin and Lim [6] the speech spectrum is divided into a set of bands with respect to the pitch frequency. Each band is analysed and a binary voiced/unvoiced decision is taken. Voiced bands are modelled as sinusoids and unvoiced bands as band-limited noise.

Periodic and aperiodic speech decomposition in the methods discussed above involves a binary voiced/unvoiced decision, which is not valid from the speech production point of view.


Yegnanarayana et al. [7] proposed a speech decomposition method which considers both the voiced and the noise components to be present in the whole speech band. The idea of this work is to use an iterative algorithm based on Discrete Fourier Transform (DFT)/Inverse Discrete Fourier Transform (IDFT) pairs for estimation of the noise component. Another decomposition method, using the Pitch Scaled Harmonic Filter (PSHF), is presented by Jackson and Shadle [8]. The speech signal is windowed with a window length chosen with respect to the known pitch frequency, so that the analysed segment contains an integer multiple of pitch cycles. The pitch-scaled frame length enables the pitch harmonics to be aligned with the frequency bins of the STFT and thus minimises leakage, but complicates the windowing process. The PSHF algorithm performs the decomposition in the frequency domain by selecting only those STFT bins which are aligned with the pitch harmonics.

The most common assumption made about the speech signal is its local stationarity, i.e. it is assumed that the parameters of the pitch harmonics are slowly varying and that locally these variations can be neglected. When dealing with real speech, these variations can decrease the quality of the separation of the speech components, especially if the STFT is used as the spectral analysis tool. The accuracy of the speech decomposition can be improved if the nonstationarity of the speech signal is taken into account.

In this paper we propose a new periodic-aperiodic decomposition method which assumes the periodic and aperiodic components to be present in the whole speech band, similarly to the approach presented in [9]. The motivation of our approach to the speech separation problem was to develop a system able to accurately separate the speech components while taking into account the nonstationary nature of speech, and without a priori knowledge of the pitch frequency track. In our system we use the speech model defined by (1). The basic concept of our method lies in the analysis of the speech spectrum in the harmonic domain rather than the frequency domain, in order to provide accurate estimation of the model parameters. For our purposes we have adopted the Harmonic Transform (HT) idea proposed by Zhang et al. [10]. The HT is a spectral analysis tool able to analyse a harmonic signal with time-varying frequency and produce a pulse-train spectrum in the harmonic domain. The first step of the designed system is the estimation of the optimal change of the speech fundamental frequency on a frame-by-frame basis using the HT. Once the optimal change of the pitch track is found, the fundamental frequency is estimated by analysing the harmonic domain spectrum. On this basis the periodic component is estimated by selecting the HT local maxima corresponding to the pitch harmonics. The aperiodic component is defined as the difference between the input speech and the estimated periodic component.

The paper is organized as follows. In Section 2 we discuss the Harmonic Transform and define the speech model used in our system. In Section 3 the optimal pitch track estimation method is presented. Finally, in Section 4 we present the decomposition scheme. Some experimental results are given in Section 5.

2. DISCRETE HARMONIC TRANSFORM

Most speech analysis applications based on sinusoidal speech modelling use the STFT spectrum for the estimation of the harmonic parameters, under the assumption of local stationarity of speech, i.e. that the fundamental frequency is constant within the analysis frame. This is often a coarse assumption in the case of real speech signals. In fact the fundamental frequency varies in time, and thus only the first several harmonics are distinguishable in the DFT spectrum (Fig. 1).

Figure 1 – Harmonic Transform: harmonic signal with 16 harmonics and the fundamental frequency changing from 200 Hz to 220 Hz (top), DFT (middle) and DHT (bottom) of this signal.

This fact decreases STFT performance in the harmonic parameter estimation process. The basic concept of harmonic domain spectral analysis is to perform the analysis along the instantaneous harmonic frequencies rather than fixed frequencies as in the STFT. Two main strategies are possible. One is to time-warp the input signal in order to convert the time-varying frequency into a constant one and then use the STFT. The other is to use a spectral analysis tool which transforms the input signal directly into the harmonic domain. Zhang et al. [10] proposed the Harmonic Transform (HT), which is a transformation with a "built-in" time-warping function. The HT of a signal s(t) is defined as:

$$ S_{\phi_u}(\omega) = \int_{-\infty}^{+\infty} \phi_u'(t)\, s(t)\, e^{-j\omega\phi_u(t)}\, dt, \qquad (3) $$

where $\phi_u(t)$ is the unit phase function, i.e. the phase of the fundamental divided by its instantaneous frequency [10], and $\phi_u'(t)$ is the first-order derivative of $\phi_u(t)$. The Inverse Harmonic Transform is defined as:

$$ s(t) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} S_{\phi_u}(\omega)\, e^{j\omega\phi_u(t)}\, d\omega. \qquad (4) $$


In real speech the fundamental frequency is slowly time-varying, i.e. it cannot change rapidly within a short time period. On this basis, in our approach we assume a linear frequency change within a given speech segment. The instantaneous phase $\varphi(t)$ of a sinusoid with a linear frequency change is given by the well-known formula (for simplicity the initial phase is omitted):

$$ \varphi(t) = 2\pi\left(f_0 t + \frac{\varepsilon t^2}{2}\right), \qquad (5) $$

where $f_0$ is the initial frequency and $\varepsilon = \Delta f_0 / T$ is the fundamental frequency change divided by the length of the segment (i.e. the time in which the frequency change occurs). Considering discrete-time signals and a segment length of $N$ samples ($T = N/F_s$), this formula can be written as:

$$ \varphi(n) = 2\pi\left(\frac{f_0 n}{F_s} + \frac{\Delta f_0\, n^2}{2 N F_s}\right). \qquad (6) $$

The initial fundamental frequency within a given segment can be written as:

$$ f_0 = f_c - \frac{a f_c}{2}, \qquad a = \frac{\Delta f_0}{f_c}, \qquad (7) $$

where $f_c$ is the central fundamental frequency within a given segment of length $N$. Substituting $f_0$ and $\Delta f_0$ in (6) with (7), we get:

$$ \varphi(n) = 2\pi \frac{f_c}{F_s}\, \alpha_a(n), \qquad \alpha_a(n) = n\left(1 - \frac{a}{2} + \frac{a n}{2N}\right). \qquad (8) $$

Now, let us consider the Discrete Harmonic Transform for signals with a linearly changing fundamental frequency. The frequencies of the spectral lines of the Discrete Fourier Transform are spaced by:

$$ f_c = \frac{F_s}{N}. \qquad (9) $$

In the HT the central frequencies of the spectral lines are aligned with the frequencies of the DFT spectral lines. Using (9) in (8), we get:

$$ \varphi(n) = \frac{2\pi}{N}\, \alpha_a(n). \qquad (10) $$

Finally, we can define the Short Time Discrete Harmonic Transform (STHT) for signals with a linear frequency change:

$$ S(k) = \sum_{n=0}^{N-1} \alpha'(n)\, s(n)\, e^{-j\frac{2\pi}{N} k\, \alpha(n)}, \qquad (11) $$

where $\alpha'(n)$ is defined as:

$$ \alpha'(n) = 1 - \frac{a}{2} + \frac{a n}{N}. \qquad (12) $$

The inverse STHT is defined as:

$$ s(n) = \frac{1}{N} \sum_{k=0}^{N-1} S(k)\, e^{j\frac{2\pi}{N} k\, \alpha(n)}. \qquad (13) $$

An example of the STFT spectrum and the STHT spectrum of a test signal is shown in Fig. 1. The input harmonic signal consists of 16 harmonics, and the fundamental frequency changes linearly from 200 Hz to 220 Hz within a segment of length 256 samples ($F_s$ = 8000 Hz). Note that only the first few harmonics in the STFT spectrum can be distinguished, while in the STHT spectrum all of the harmonics are visible. A second example, comparing spectrograms of a speech signal processed by the STFT and the STHT, is shown in Fig. 2.

Figure 2 – Example spectrograms of the speech signal using the STFT (top) and the STHT (bottom).

3. PITCH TRACK ESTIMATION

The pair of transforms given by (11) and (13) allows harmonic signals to be analyzed in the harmonic domain when the fundamental frequency track is known. In the case of speech, both the central fundamental frequency and its change are unknown. A block diagram of the pitch detection algorithm is shown in Fig. 3.

The proposed algorithm starts by searching for the fundamental frequency change, examining the STHT spectrum for different unit phase functions (12), i.e. unit phase functions with different values of the parameter $a$. The optimal value of $a$ is defined as the value which minimises the Spectral Flatness Measure:

$$ a_{opt} = \arg\min_a \mathrm{SFM}(a), \qquad \mathrm{SFM}(a) = \frac{\left(\prod_{k=0}^{N-1} \left|STHT(a,k)\right|\right)^{\frac{1}{N}}}{\frac{1}{N}\sum_{k=0}^{N-1} \left|STHT(a,k)\right|}, \qquad (14) $$

where $STHT(a,k)$ is the harmonic spectrum of a given speech segment for a given $a$, and $|\cdot|$ denotes absolute value. The minimal spectral flatness value indicates the highest spectral concentration, which in our algorithm means an optimal fit between the signal and the STHT kernel. This also means that the optimal fundamental frequency change has been found for the given speech segment.
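A minimal sketch of this search, reusing the stht routine above: the paper does not specify the search strategy over a, so the exhaustive grid and its range (up to 30% relative change per frame, cf. Section 5) are our assumptions.

```python
import numpy as np

def sfm(mag, eps=1e-12):
    # Eq. (14): geometric mean over arithmetic mean of the magnitude spectrum
    return np.exp(np.mean(np.log(mag + eps))) / (np.mean(mag) + eps)

def optimal_pitch_change(frame, a_grid):
    # pick the warping parameter a whose STHT spectrum is least flat,
    # i.e. maximally concentrated on the harmonic spectral lines
    flatness = [sfm(np.abs(stht(frame, a))) for a in a_grid]
    return a_grid[int(np.argmin(flatness))]

a_opt = optimal_pitch_change(s, np.linspace(-0.3, 0.3, 61))
```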


Once this is done, the pitch frequency is estimated. The first step of this stage is the determination of pitch frequency harmonic candidates $f_i$ by peak picking of the STHT spectrum, based on the algorithm proposed in [11]. Harmonic candidates with a central frequency located between 50 and 450 Hz are considered as pitch candidates. For each pitch candidate the algorithm tries to find its harmonics. If it fails to find three of the first four harmonics, the candidate is discarded. In order to prevent pitch doubling or halving, the following factor is computed for each pitch candidate:

$$ r = \frac{1}{n_{h\max}} \left( \sum_{n=1}^{n_{h\max}} a_n^2 \right)^{2}, $$

where $a_n$ is the amplitude of the n-th harmonic of the pitch candidate and $n_{h\max}$ is the number of all possible harmonics for a given pitch candidate. This formula can be viewed as the mean energy of the harmonic signal per single harmonic for the particular pitch, multiplied by the total energy carried by the signal. It prevents pitch halving, since the mean energy per harmonic is smaller for halved pitch candidates; on the other hand, the energy of the harmonic signal is higher for lower pitch candidates, which prevents pitch doubling. The pitch candidate with the greatest factor $r$ is selected as the pitch for the given frame. Finally, the pitch value is refined using the following formula:

$$ f_r = \frac{1}{n_{h\max}} \sum_{n=1}^{n_{h\max}} \frac{f_n}{n}, $$

where $f_n$ is the frequency of the n-th harmonic candidate.
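The candidate scoring could be sketched as follows; peak_freqs and peak_amps stand for the output of the peak-picking step of [11] (not detailed here), and the tolerance used to match a peak to the n-th harmonic is our heuristic.

```python
import numpy as np

def score_pitch_candidate(f_cand, peak_freqs, peak_amps, fs, tol=0.1):
    """Return the factor r and the refined pitch f_r for one candidate."""
    n_hmax = int((fs / 2.0) // f_cand)   # all possible harmonics of this pitch
    a2_sum, ratio_sum, found_low = 0.0, 0.0, 0
    for n in range(1, n_hmax + 1):
        i = int(np.argmin(np.abs(peak_freqs - n * f_cand)))
        if abs(peak_freqs[i] - n * f_cand) < tol * f_cand:
            a2_sum += peak_amps[i] ** 2
            ratio_sum += peak_freqs[i] / n
            if n <= 4:
                found_low += 1
    if found_low < 3:            # fewer than three of the first four harmonics
        return 0.0, f_cand       # -> the candidate is discarded
    r = a2_sum ** 2 / n_hmax     # mean energy per harmonic times total energy
    f_r = ratio_sum / n_hmax     # refined pitch; missing harmonics contribute 0
    return r, f_r
```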

Figure 3 – Pitch detection algorithm

The described procedure estimates the central pitch frequency for one frame. Further prevention of pitch halving or doubling is provided by the use of a tracking buffer which stores the fundamental frequency estimates from several consecutive frames. The final pitch estimation is done for the frame in the middle of the tracking buffer, so the resulting pitch estimate is delayed by several frames. In our system we used a buffer length of 5. As the tracking algorithm we use median filtering, which we found simple and robust against gross pitch errors.
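A sketch of such a tracking buffer (the class name and interface are ours):

```python
from collections import deque
import numpy as np

class PitchTracker:
    """Median-filter tracking over a buffer of consecutive frame estimates;
    an estimate is released only once the buffer is full, i.e. with a delay
    of (length - 1) / 2 frames for the middle frame."""
    def __init__(self, length=5):
        self.buf = deque(maxlen=length)

    def push(self, f0_estimate):
        self.buf.append(f0_estimate)
        if len(self.buf) == self.buf.maxlen:
            return float(np.median(self.buf))  # robust to gross halving/doubling
        return None                            # buffer still filling
```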

4. PERIODIC-APERIODIC DECOMPOSITION

Speech decomposition in our system is performed in the time domain. First, the periodic component is estimated; the aperiodic component is then defined as the difference between the input speech signal and the estimated periodic component.

Figure 4 – Example of the speech decomposition: original speech (top), estimated periodic (middle) and aperiodic (bottom) components.

On the basis of the speech model discussed in Section 2, the periodic component is defined as:

$$ h(n) = \sum_{k=1}^{K} A_k \cos\left(k\varphi(n) + \varphi_k(0)\right), \qquad (15) $$

where $A_k$ is the amplitude of the k-th harmonic, $\varphi(n)$ is the instantaneous phase defined in (8) with the central frequency $f_c$ given by the pitch frequency, and $\varphi_k(0)$ is the initial phase of the k-th harmonic. Unfortunately, the pitch harmonics are not aligned with the spectral lines and thus cannot be estimated directly from the STHT spectrum. One possible solution to this problem is interpolation of the adjacent STHT coefficients. In our system we propose a more accurate way of finding the harmonic amplitudes and phases. In order to perform the spectral analysis exactly at the frequencies aligned with the pitch harmonics, we use the same formula (8) as in (15). In this way we obtain a special case of the HT which we have used in our previous work [12]. The DHT variant aligned with the pitch is defined as:

$$ S(k) = \sum_{n=0}^{N-1} \alpha'(n)\, s(n)\, e^{-j 2\pi \frac{k f_r}{F_s} \alpha(n)}, $$

where $f_r$ is the refined pitch frequency and $k = 1 \ldots K$, with $K$ the number of pitch harmonics. The amplitudes and phases of the harmonics can be computed directly from the $S(k)$ coefficients:

$$ A_k = \sqrt{\mathrm{Re}^2 S(k) + \mathrm{Im}^2 S(k)}, \qquad \varphi_k(0) = -\arctan\frac{\mathrm{Im}\, S(k)}{\mathrm{Re}\, S(k)}, $$

where Re and Im stand for the real and imaginary parts of $S(k)$, respectively. The periodic component is generated using formula (15), and the aperiodic component is defined as:

$$ r(n) = s(n) - h(n). $$

An example of the speech decomposition is given in Fig. 4.
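Putting the pieces together, a sketch of the decomposition step is given below, reusing alpha and alpha_prime from Section 2. The 2/N amplitude gain and the use of the complex argument for the phase are our normalization assumptions, tied to the $e^{-j\cdot}$ kernel as coded here; the paper computes $A_k$ and $\varphi_k(0)$ directly from $S(k)$.

```python
import numpy as np

def pitch_aligned_dht(s, a, fr, K, fs):
    # DHT with spectral lines at the pitch harmonics k*fr, k = 1..K (Sec. 4)
    N = len(s)
    n = np.arange(N)
    k = np.arange(1, K + 1)[:, None]
    kernel = np.exp(-2j * np.pi * (k * fr / fs) * alpha(n, a, N))
    return kernel @ (alpha_prime(n, a, N) * s)

def decompose(s, a, fr, fs):
    N = len(s)
    K = int((fs / 2.0) // fr)              # harmonics up to the Nyquist limit
    S = pitch_aligned_dht(s, a, fr, K, fs)
    A = 2.0 * np.abs(S) / N                # amplitudes (assumed 2/N analysis gain)
    phi0 = np.angle(S)                     # initial phases, matching the kernel sign
    phi = 2.0 * np.pi * (fr / fs) * alpha(np.arange(N), a, N)   # Eq. (8)
    h = np.zeros(N)
    for k in range(1, K + 1):              # Eq. (15): periodic component
        h += A[k - 1] * np.cos(k * phi + phi0[k - 1])
    return h, s - h                        # h(n) and r(n) = s(n) - h(n)
```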


5. EXPERIMENTAL RESULTS

In order to verify the proposed decomposition algorithm, we performed a set of experiments on synthetic speech-like signals. The testing procedure was as follows: two sets of synthetic speech were prepared, one for a male voice (central frequency 120 Hz) and one for a female voice (central frequency 200 Hz). In order to verify the Short Time Harmonic Transform performance, different fundamental frequency changes were used in both sets. The fundamental frequency change parameters were chosen randomly within given boundaries, selected so as not to exceed 30% of the central fundamental frequency within a test frame. We tested our algorithm for several Harmonic to Noise Ratios (HNR) by adding noise signals of different energies to the input signal. The results of the experiment are shown in Table 1.

Central Pitch     HNR [dB]    Measured     Estimated periodic
Frequency [Hz]                HNR [dB]     component SNR [dB]
120               ∞           59.6         59.6
120               30          29.2         33.9
120               10          10.6         20.9
120               5           5.7          16.1
120               0           1.05         11.3
200               ∞           68.3         68.3
200               30          30.3         38.9
200               10          10.6         21.5
200               5           5.54         16.3
200               0           1.06         11.2

Table 1 – Results of experiments

In the table, the 'HNR' column gives the original HNR of the input signal. After the estimation of the periodic and aperiodic components, the HNR was measured; the mean value of this measure is shown in the column 'Measured HNR'. Finally, the quality of the estimated periodic component was evaluated by measuring its SNR, defined as the ratio of the estimated periodic component energy to the error signal energy, where the error signal is the difference between the original and the estimated periodic components.
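For reference, the two measures could be computed as below; h_est, r_est and h_true are placeholders for the estimated components and the known synthetic periodic part.

```python
import numpy as np

def energy_ratio_db(x, y, eps=1e-12):
    # 10*log10 of the energy ratio between two signals
    return 10.0 * np.log10(np.sum(x ** 2) / (np.sum(y ** 2) + eps))

measured_hnr = energy_ratio_db(h_est, r_est)            # periodic vs. aperiodic
periodic_snr = energy_ratio_db(h_est, h_true - h_est)   # estimate vs. error
```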

6. CONCLUSIONS

In this paper we proposed a new speech decomposition scheme based on the Harmonic Transform. For our purposes we developed two variants of the Short Time Discrete Harmonic Transform for the case of a linear frequency change within the analysis frame. The first variant allows spectral analysis in the harmonic domain and has the ability to synchronize its kernel with the input signal. The second variant allows accurate estimation of the amplitudes and phases of the pitch harmonics, because its spectral lines are aligned with the pitch frequency. There are two main advantages of using the STHT compared to conventional spectral analysis using the STFT. The first is the ability to estimate the fundamental frequency change without knowledge of the fundamental frequency itself.

The second is the prevention of spectral smearing, especially for higher-order harmonics, which is important when a spectral-domain fundamental frequency estimation algorithm is used. This feature makes the algorithm more robust for highly intonated speech segments as well as transient speech segments. The experiments confirm the robustness of the proposed approach.

7. ACKNOWLEDGEMENTS

This work was supported by Bialystok Technical University under the grant W/WI/2/05.

REFERENCES

[1] A.M. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems, New York: John Wiley & Sons, 1996.
[2] A.S. Spanias, "Speech coding: a tutorial review", Proc. IEEE, vol. 82, no. 10, pp. 1541-1582, 1994.
[3] R.J. McAulay, T.F. Quatieri, "Sinusoidal Coding", in Speech Coding and Synthesis (W.B. Kleijn and K.K. Paliwal, eds.), Amsterdam: Elsevier Science Publishers, 1995.
[4] E.B. George, M.J.T. Smith, "Speech Analysis/Synthesis and Modification Using an Analysis-by-Synthesis/Overlap-Add Sinusoidal Model", IEEE Trans. on Speech and Audio Processing, vol. 5, no. 5, pp. 389-406, 1997.
[5] Y. Stylianou, "Applying the Harmonic Plus Noise Model in Concatenative Speech Synthesis", IEEE Trans. on Speech and Audio Processing, vol. 9, no. 1, 2001.
[6] D.W. Griffin, J.S. Lim, "Multiband Excitation Vocoder", IEEE Trans. on Acoust., Speech and Signal Processing, vol. ASSP-36, pp. 1223-1235, 1988.
[7] B. Yegnanarayana, C. d'Alessandro, V. Darsinos, "An Iterative Algorithm for Decomposition of Speech Signals into Voiced and Noise Components", IEEE Trans. on Speech and Audio Processing, vol. 6, no. 1, pp. 1-11, 1998.
[8] P.J.B. Jackson, C.H. Shadle, "Pitch-Scaled Estimation of Simultaneous Voiced and Turbulence-Noise Components in Speech", IEEE Trans. on Speech and Audio Processing, vol. 9, no. 7, pp. 713-726, Oct. 2001.
[9] X. Serra, "Musical Sound Modeling with Sinusoids plus Noise", in Musical Signal Processing (C. Roads, S. Pope, A. Picialli, and G. De Poli, eds.), Swets & Zeitlinger Publishers, 1997, pp. 91-122.
[10] F. Zhang, G. Bi, Y.Q. Chen, "Harmonic Transform", IEE Proc. - Vision, Image and Signal Processing, vol. 151, no. 4, pp. 257-264, Aug. 2004.
[11] V. Sercov, A. Petrovsky, "The Method of Pitch Frequency Detection on the Base of Tuning to its Harmonics", in Proc. of the 9th European Signal Processing Conference, EUSIPCO'98, vol. II, Rhodes, Greece, Sep. 8-11, 1998, pp. 1137-1140.
[12] V. Sercov, A. Petrovsky, "An Improved Speech Model with Allowance for Time-Varying Pitch Harmonic Amplitudes and Frequencies in Low Bit-Rate MBE Coders", in Proc. of the 6th European Conf. on Speech Communication and Technology, EUROSPEECH'99, Budapest, Hungary, 1999, pp. 1479-1482.
