Sinusoidal Synthesis of Speech Using MATLAB


SINUSOIDAL SYNTHESIS OF SPEECH USING MATLAB

Thesis

Submitted in partial fulfillment of the requirement of

BITS C421T Thesis

BY

AKSHAY VIJAY JAIN

2009B4A8568P

Under the supervision of

Dr. RAHUL SINGHAL

Assistant Professor, EEE Dept.

BITS-Pilani

AT

BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI

November, 2013



ACKNOWLEDGEMENT

I would like to thank the Almighty first of all for his blessings.

I am obliged to Prof. B.N. JAIN, Vice Chancellor, Birla Institute of Technology & Science, Pilani, for providing us with a course pattern where a student gets exposure to projects.

I wish to express a deep sense of gratitude to Dr. Rahul Singhal, my supervisor for the Thesis named "Sinusoidal Synthesis of Speech Using MATLAB", for providing me this wonderful opportunity to learn about the various parameters associated with speech and the synthesis of speech from a spectrogram. I would also like to thank him for his constant advice, encouragement and support in the study.

I wish to express gratitude to all other people, as well as all the websites, for the content they provided me for the performance of the research work.

Last but not least, I would like to thank my parents for their constant support and motivation.



CERTIFICATE

This is to certify that the Thesis entitled "Sinusoidal Synthesis of Speech using Matlab", submitted by Akshay Vijay Jain, ID No. 2009B4A8568P, in partial fulfillment of the requirement of BITS C421T Thesis, embodies the work done by him under my supervision.

Signature of Supervisor

Date: 25 November 2013

Dr Rahul Singhal

Assistant Professor, EEE Department,

BITS PILANI, PILANI CAMPUS



Thesis Abstract

This thesis report discusses the speech signal: how it is stored on a computer, how it is analyzed and how it is synthesized. One way of analyzing a speech signal is the Short Time Fourier Transform, which is discussed in this report along with its parameters. Based on this analysis, we extract a matrix containing the frequencies present in the signal as a function of time. Having obtained this matrix from the spectrogram generated by MATLAB, we then resynthesize the speech signal by sinusoidal addition using MATLAB code.



TABLE OF CONTENTS

1) Introduction

2) Recording of speech signal

3) Analysis of speech signal

a) Long term frequency analysis

b) Window sequence

c) Effect of window

d) Choice of window

e) Parameters of Short Term Frequency Spectrum

f) Time-Frequency domain: spectrogram

g) Length of window and fundamental frequency

4) Why sinusoids?

5) Additive synthesis

6) Frequency Vs Time matrix from spectrogram in MATLAB

1. GenerateFreqVsTime MATLAB Code

2. Croplimits MATLAB Code

3. Screenshots

7) Speech signal from Frequency Vs Time matrix in MATLAB

1. GenerateSoundData MATLAB Code

2. TestAtLevel MATLAB Code

8) Results

9) Conclusion

10) Bibliography/References



1) Introduction

Speech is an acoustic signal, by which we mean that it is a mechanical wave: an oscillation of pressure transmitted through a solid, liquid or gas, composed of frequencies within the hearing range. Sound is a sequence of waves of pressure that propagates through compressible media such as air or water. (Sound can propagate through solids as well, but there are additional modes of propagation.) Sound that is perceptible by humans has frequencies from about 20 Hz to 20,000 Hz. In air at standard temperature and pressure, the corresponding wavelengths of sound waves range from 17 m to 17 mm. During propagation, waves can be reflected, refracted, or attenuated by the medium.

Figure 1. Typical sound signal
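The wavelength range quoted above follows from the relation wavelength = speed of sound / frequency. A minimal check in MATLAB, assuming a speed of sound of 343 m/s (a typical value for air near room temperature; the text does not state the value used):

```matlab
% Wavelength of sound at the limits of human hearing.
c = 343;                  % assumed speed of sound in air, m/s
f = [20 20000];           % hearing limits in Hz, as quoted in the text
lambda = c ./ f;          % wavelength in metres, lambda = c / f

% Print one line per frequency (fprintf cycles through the columns).
fprintf('%g Hz -> %.4g m\n', [f; lambda]);
```

With this assumed value the results are roughly 17 m and 17 mm, in agreement with the figures in the text.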



2) Recording of Speech

Sound recording is an electrical or mechanical inscription of sound waves, such as spoken voice, singing, instrumental music, or sound effects. The two main classes of sound recording technology are analog recording and digital recording. Acoustic analog recording is achieved by a small microphone diaphragm that can detect changes in atmospheric pressure (acoustic sound waves) and record them as a graphic representation of the sound waves on a medium such as a phonograph record (in which a stylus senses grooves on the record). In magnetic tape recording, the sound waves vibrate the microphone diaphragm and are converted into a varying electric current, which is then converted to a varying magnetic field by an electromagnet, which makes a representation of the sound as magnetized areas on a plastic tape with a magnetic coating on it.

Digital recording converts the analog sound signal picked up by the microphone to a digital form by a process of digitization, allowing it to be stored and transmitted by a wider variety of media. Digital recording stores audio as a series of binary numbers representing samples of the amplitude of the audio signal at equal time intervals, at a sample rate high enough to convey all sounds capable of being heard. Digital recordings are considered higher quality than analog recordings, not necessarily because they have higher fidelity (wider frequency response or dynamic range), but because the digital format can prevent much of the loss of quality found in analog recording due to noise and electromagnetic interference in playback, and mechanical deterioration or damage to the storage medium. A digital audio signal must be reconverted to analog form during playback before it is applied to a loudspeaker or earphones.
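The digitization described above (sampling at equal time intervals, then storing each sample as a binary number) can be sketched as follows. This is only an illustration under assumed values: a 440 Hz test tone stands in for the microphone signal, and the 8000 Hz rate and 8 bits per sample mirror the settings used for recording later in this report.

```matlab
fs    = 8000;                          % samples per second
nBits = 8;                             % bits per sample
t     = (0:fs-1)' / fs;                % sample instants covering one second
x     = 0.8 * sin(2*pi*440*t);         % hypothetical test tone in place of real speech

% Uniform quantisation of the range [-1, 1) into 2^nBits levels.
step  = 2 / 2^nBits;                   % width of one quantisation level
codes = floor(x / step) + 2^(nBits-1); % integer codes in 0..255: the stored numbers
xq    = (codes - 2^(nBits-1) + 0.5) * step;   % amplitudes recovered at playback

maxErr = max(abs(x - xq));             % never exceeds half a quantisation step
fprintf('%d samples, %d levels, max error %.4f (step/2 = %.4f)\n', ...
        numel(x), 2^nBits, maxErr, step/2);
```

A sample rate of 8000 Hz can represent frequencies up to 4000 Hz (half the sample rate), which is why the spectrograms in later sections extend to 4 kHz.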



3) Analysis of Speech Signal

The long-term frequency analysis of speech signals yields good information about the overall frequency spectrum of the signal, but no information about the temporal location of those frequencies. Since speech is a very dynamic signal with a time-varying spectrum, it is often insightful to look at frequency spectra of short sections of the speech signal.

a) Long-term frequency analysis

The frequency response of a system is defined as the discrete-time Fourier transform (DTFT) of the system's impulse response h[n]:

$$H(\omega) = \sum_{n=-\infty}^{\infty} h[n]\, e^{-j\omega n}$$

Similarly, for a sequence x[n], its long-term frequency spectrum is defined as the DTFT of the sequence:

$$X(\omega) = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}$$

Theoretically, we must know the sequence x[n] for all values of n (from n = -∞ until n = +∞) in order to compute its frequency spectrum. Fortunately, all terms where x[n] = 0 do not matter in the sum, and therefore an equivalent expression for the sequence's spectrum is

$$X(\omega) = \sum_{n=0}^{N-1} x[n]\, e^{-j\omega n}$$

Here we've assumed that the sequence starts at 0 and is N samples long. This tells us that we can apply the DTFT only to the non-zero samples of x[n] and still obtain the sequence's true spectrum X(ω). But what is the correct mathematical expression to compute the spectrum over a short section of the sequence, that is, over only part of the non-zero samples of the sequence?
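The finite sum over the N non-zero samples can be evaluated directly. The sketch below is an illustration only: the test sequence is hypothetical, and the frequency grid is chosen so that the result can be compared against MATLAB's built-in fft.

```matlab
N = 64;
n = (0:N-1)';
x = cos(2*pi*0.1*n) + 0.5*cos(2*pi*0.25*n);   % hypothetical test sequence

% Evaluate X(w) = sum_{n=0}^{N-1} x[n] exp(-j*w*n) on a grid of frequencies.
w = 2*pi*(0:N-1)/N;               % N equally spaced frequencies in [0, 2*pi)
X = exp(-1j * w(:) * n.') * x;    % each row of the matrix is one value of w

% On this particular grid the finite DTFT sum coincides with the DFT from fft.
maxDiff = max(abs(X - fft(x)));
fprintf('largest difference from fft: %.2e\n', maxDiff);
```

The difference is at the level of floating-point rounding, which is why fft is used for all spectra in practice.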



b) Window sequence

It turns out that the mathematically correct way to do that is to multiply the sequence x[n] by a window sequence w[n] that is non-zero only for n = 0, ..., L-1, where L, the length of the window, is smaller than the length N of the sequence x[n]:

$$x_w[n] = x[n]\, w[n]$$

Then we compute the spectrum of the windowed sequence x_w[n] as usual:

$$X_w(\omega) = \sum_{n=-\infty}^{\infty} x_w[n]\, e^{-j\omega n}$$

The following figure illustrates how a window sequence w[n] is applied to the sequence x[n]:

Figure 2 Result of application of windowed sequence to data sequence



As the figure shows, the windowed sequence is shorter in length than the original sequence. So we can further truncate the DTFT of the windowed sequence:

$$X_w(\omega) = \sum_{n=0}^{L-1} x_w[n]\, e^{-j\omega n}$$

Using this windowing technique, we can select a section of arbitrary length of the input sequence x[n] by choosing the length and location of the window accordingly. The only question that remains is: how does the window sequence w[n] affect the short-term frequency spectrum?
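Selecting a section with a window can be sketched as below. The 500 Hz test tone, the window position and the FFT size are assumed values chosen for illustration; the Hamming window is computed from its formula so that no toolbox is required.

```matlab
fs = 8000;
N  = 800;
n  = (0:N-1)';
x  = sin(2*pi*500*n/fs);                       % hypothetical 500 Hz test tone

L     = 200;                                    % window length (25 ms at 8 kHz)
start = 301;                                    % first sample of the section (1-based)
w     = 0.54 - 0.46*cos(2*pi*(0:L-1)'/(L-1));   % Hamming window of length L

% Windowed sequence: zero everywhere except inside the window.
xw = zeros(N, 1);
xw(start:start+L-1) = x(start:start+L-1) .* w;

% Only the L non-zero samples contribute, so the truncated transform has the
% same magnitude as the transform of the full-length windowed sequence.
nfft    = 1024;
XwShort = fft(xw(start:start+L-1), nfft);
XwFull  = fft(xw, nfft);
magDiff = max(abs(abs(XwShort) - abs(XwFull)));

[~, k] = max(abs(XwShort(1:nfft/2)));
fPeak  = (k-1) * fs / nfft;                     % frequency of the spectral peak in Hz
```

The two transforms differ only by a phase factor caused by the position of the window, so their magnitudes agree, and the peak lies at the frequency of the test tone.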

c) Effect of the window

To answer that question, we need to introduce an important property of the Fourier transform. The diagram below illustrates the property graphically:

I. Implementation of an LTI system in the time domain.

II. Equivalent implementation of an LTI system in the frequency domain.

The two implementations of an LTI system are equivalent: they will give the same output for the same input. Hence, convolution in the time domain = multiplication in the frequency domain:

$$y[n] = x[n] * h[n] \;\longleftrightarrow\; Y(\omega) = X(\omega)\, H(\omega)$$

And since the time domain and the frequency domain are each other's duals in the Fourier transform, it is also true that multiplication in the time domain = convolution in the frequency domain:

$$x_w[n] = x[n]\, w[n] \;\longleftrightarrow\; X_w(\omega) = \frac{1}{2\pi}\, X(\omega) \circledast W(\omega)$$

This shows that multiplying the sequence x[n] with the window sequence w[n] in the time domain is equivalent to convolving the spectrum of the sequence, X(ω), with the spectrum of the window, W(ω). The result of the convolution of the spectra in the frequency domain is that the spectrum of the sequence is smeared by the spectrum of the window. This is best illustrated by the example in the figure below:



Figure 3 Result of application of window sequence in time and

frequency domain
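Both properties can be checked numerically for finite sequences, where the DTFT is replaced by the DFT and the frequency-domain convolution becomes circular. The sequences below are hypothetical, and the double loop is a deliberately plain implementation of circular convolution rather than an efficient one.

```matlab
% Property 1: convolution in time <-> multiplication in frequency.
x = [1 2 3 4 0 -1]';                 % hypothetical input sequence
h = [0.5 0.25 0.25]';                % hypothetical impulse response
M = numel(x) + numel(h) - 1;         % length of the linear convolution
yTime = conv(x, h);
yFreq = real(ifft(fft(x, M) .* fft(h, M)));
err1  = max(abs(yTime - yFreq));

% Property 2: multiplication in time <-> circular convolution in frequency.
N = numel(x);
w = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));   % Hamming window, same length as x
X = fft(x);
W = fft(w);
C = zeros(N, 1);
for k = 0:N-1
    for m = 0:N-1
        C(k+1) = C(k+1) + X(m+1) * W(mod(k-m, N) + 1);
    end
end
err2 = max(abs(fft(x .* w) - C / N));       % the factor 1/N plays the role of 1/(2*pi)

fprintf('property 1 error: %.2e, property 2 error: %.2e\n', err1, err2);
```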

d) Choice of window

Because the window determines the spectrum of the windowed sequence to a great extent, the choice of the window is important. Matlab supports a number of common windows, each with their own strengths and weaknesses. Some common choices of windows are shown below.

Figure 4 Rectangular window sequence



Figure 5 Triangular and Hamming window sequence

All windows share the same characteristics. Their spectrum has a peak, called the main lobe, and ripples to the left and right of the main lobe, called the side lobes. The width of the main lobe and the relative height of the side lobes are different for each window. The main lobe width determines how accurately a window is able to resolve different frequencies: wider is less accurate. The side lobe height determines how much spectral leakage the window has. An important thing to realize is that we can't have short-term frequency analysis without a window. Even if we don't explicitly use a window, we are implicitly using a rectangular window.
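The trade-off between main lobe width and side lobe height can be measured directly from the window spectra. The sketch below compares a rectangular and a Hamming window; the window length of 64 samples and the zero-padded FFT size are assumed values, and the main lobe edge is located simply as the first local minimum of the magnitude spectrum.

```matlab
L    = 64;                                   % window length in samples (assumed)
nfft = 4096;                                 % dense frequency grid via zero-padding
n    = (0:L-1)';
wins  = {ones(L,1), 0.54 - 0.46*cos(2*pi*n/(L-1))};
names = {'rectangular', 'Hamming'};

mainLobeWidth = zeros(1, 2);                 % rad/sample, measured null to null
sideLobe_dB   = zeros(1, 2);                 % highest side lobe relative to the peak
for i = 1:2
    W  = abs(fft(wins{i}, nfft));
    W  = W(1:nfft/2) / W(1);                 % keep 0..pi and normalise the peak to 1
    k0 = find(diff(W) > 0, 1);               % first local minimum: edge of the main lobe
    mainLobeWidth(i) = 2 * (k0-1) * 2*pi/nfft;
    sideLobe_dB(i)   = 20*log10(max(W(k0:end)));
    fprintf('%-12s main lobe %.3f rad, peak side lobe %.1f dB\n', ...
            names{i}, mainLobeWidth(i), sideLobe_dB(i));
end
```

The rectangular window has the narrower main lobe but much higher side lobes; the Hamming window trades a wider main lobe for lower leakage.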

e) Parameters of the short-term frequency spectrum

Besides the type of window (rectangular, Hamming, etc.), there are two other factors in Matlab that control the short-term frequency spectrum: window length and the number of frequency sample points.

The window length controls the fundamental trade-off between time resolution and frequency resolution of the short-term spectrum, irrespective of the window's shape. A long window gives poor time resolution, but good frequency resolution. Conversely, a short window gives good time resolution, but poor frequency resolution. For example, a 250 millisecond window can, roughly speaking, resolve frequency components when they are 4 Hz or more apart (1/0.250 = 4), but it can't tell where within those 250 milliseconds those frequency components occurred. On the other hand, a 10 millisecond window can only resolve frequency components when they are 100 Hz or more apart (1/0.010 = 100), but the uncertainty in time about the location of those frequencies is only 10 milliseconds. The result of short-term spectral analysis using a long window is referred to as a narrowband spectrum (because a long window has a narrow main lobe), and the result of short-term spectral analysis using a short window is called a wideband spectrum.

In short-term spectral analysis of speech, the window length is often chosen with respect to the fundamental period of the speech signal, i.e., the duration of one period of the fundamental frequency. A common choice for the window length is either less than one fundamental period, or greater than 2-3 fundamental periods.

Examples of narrowband and wideband short-term spectral analysis of speech are given in the figures below:

Figure 6 Wideband and Narrowband analysis of speech
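The arithmetic behind the two examples can be written out as follows. The sample rate of 8000 Hz and the fundamental frequency of 80 Hz are assumed values for illustration; the wideband and narrowband tests apply the rule of thumb stated above (shorter than one fundamental period, or longer than two).

```matlab
fs = 8000;                        % sample rate in Hz (assumed)
T  = [0.250 0.010];               % window durations in seconds, from the examples above
winSamples = round(T * fs);       % window lengths in samples
deltaF     = 1 ./ T;              % approximate frequency resolution in Hz

f0 = 80;                          % assumed fundamental frequency of a voice, in Hz
T0 = 1 / f0;                      % fundamental period: 12.5 ms
isWideband   = T < T0;            % window shorter than one fundamental period
isNarrowband = T > 2*T0;          % window longer than two fundamental periods

for i = 1:numel(T)
    fprintf('%4.0f ms window: %4d samples, resolves about %g Hz\n', ...
            1000*T(i), winSamples(i), deltaF(i));
end
```

For this assumed voice, the 250 ms window gives a narrowband analysis and the 10 ms window a wideband analysis.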

The other factor controlling the short-term spectrum in Matlab is the number of points at which the frequency spectrum H(ω) is evaluated. The number of points is usually equal to the length of the window. Sometimes a greater number of points is chosen to obtain a smoother looking spectrum. Evaluating H(ω) at fewer points than the window length is possible, but very rare.



f) Time-frequency domain: Spectrogram

An important use of short-term spectral analysis is the short-time Fourier transform or spectrogram of a signal. The spectrogram of a sequence is constructed by computing the short-term spectrum of a windowed version of the sequence, then shifting the window over to a new location and repeating this process until the entire sequence has been analyzed. The whole process is illustrated in the figure below:

Figure 7 Demonstration of making of spectrogram

Together, these short-term spectra (bottom row) make up the

spectrogram, and are typically shown in a two-dimensional plot, where

the horizontal axis is time, the vertical axis is frequency, and magnitude

is the color or intensity of the plot. For example:



Figure 8 A typical spectrogram

The appearance of the spectrogram is controlled by a third parameter: window overlap. Window overlap determines how much the window is shifted between repeated computations of the short-term spectrum. Common choices for window overlap are 50% or 75% of the window length. For example, if the window length is 200 samples and window overlap is 50%, the window would be shifted over 100 samples between each short-term spectrum. If the overlap were 75%, the window would be shifted over 50 samples.

The choice of window overlap depends on the application. When a temporally smooth spectrogram is desirable, window overlap should be 75% or more. When computation should be at a minimum, no overlap or 50% overlap are good choices. If computation is not an issue, you could even compute a new short-term spectrum for every sample of the sequence. In that case, window overlap = window length - 1, and the window would only shift 1 sample between the spectra. But doing so is wasteful when analyzing speech signals, because the spectrum of speech does not change at such a high rate. It is more practical to compute a new spectrum every 20-50 milliseconds, since that is the rate at which the speech spectrum changes.
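The shift-and-transform procedure, with the overlap arithmetic from the example above, can be sketched by hand as follows. This is an illustration under assumed values (a 300 Hz test tone, a 256-point FFT), not the implementation used by MATLAB's spectrogram command.

```matlab
fs   = 8000;
x    = sin(2*pi*300*(0:fs-1)'/fs);           % one second of a hypothetical 300 Hz tone
L    = 200;                                   % window length in samples
ovl  = 0.75;                                  % fractional overlap
hop  = L - round(ovl*L);                      % shift between frames: 50 samples
nfft = 256;                                   % number of frequency points
w    = 0.54 - 0.46*cos(2*pi*(0:L-1)'/(L-1));  % Hamming window

nFrames = floor((numel(x) - L)/hop) + 1;      % frames that fit entirely in the signal
S = zeros(nfft/2 + 1, nFrames);               % rows: frequency, columns: time
for m = 1:nFrames
    idx    = (m-1)*hop + (1:L)';              % samples covered by frame m
    Xm     = fft(x(idx) .* w, nfft);          % short-term spectrum of this frame
    S(:,m) = abs(Xm(1:nfft/2 + 1));           % keep 0 .. fs/2
end

[~, k] = max(S(:,1));
fPeak  = (k-1) * fs / nfft;                   % strongest frequency in the first frame
```

Plotting 20*log10(S) with time on the horizontal axis and frequency on the vertical axis gives the familiar spectrogram picture.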

g) Length of the window and fundamental frequency

In a wideband spectrogram (i.e., using a window shorter than the fundamental period), the fundamental frequency of the speech signal resolves in time. That means that you can't really tell what the fundamental frequency is by looking at the frequency axis, but you can see energy fluctuations at the rate of the fundamental frequency along the time axis. In a narrowband spectrogram (i.e., using a window 2-3 times the fundamental period), the fundamental frequency resolves in frequency, i.e., you can see it as an energy peak along the frequency axis. See for example the figures below:

Figure 9. Wideband Speech Spectrogram

Figure 10. Narrowband Speech Spectrogram



4) Why Sinusoids?

In general, the goal of modelling a signal is to reduce redundancy and to get a more compact representation of the data. There are different techniques to model a time series, and which technique to apply depends on the signal. Sinusoids are especially suited for modelling speech with harmonic content. Most natural acoustical sounds exhibit this attribute, and the reason for this sinusoidal character can be found in the way speech is produced. The human voice production system consists of two fundamental parts working together, namely the vocal cords (the excitation source) and the pharynx with the mouth and nasal cavities, acting as an acoustical filter. During voiced parts of speech the vocal cords open and close at a certain frequency (the fundamental frequency, f0), modulating the airstream coming from the lungs. The harmonic overtone structure results from the structure of the pharynx, which in a simplified way can be seen as an open tube that lets all overtones develop, the overtones f1 ... fn being integer multiples of the fundamental f0.

5) Additive Synthesis

Sine waves can be considered the building blocks of speech. In fact, it was shown in the 19th century by the mathematician Joseph Fourier that any periodic function can be expressed as a series of sinusoids of varying frequencies and amplitudes. This concept of constructing a complex sound out of sinusoidal terms is the basis for additive synthesis, sometimes called Fourier synthesis for the aforementioned reason. In addition to this, the concepts of additive synthesis have also existed since the introduction of the organ, where different pipes of varying pitch are combined to create a sound or timbre.

A simple block diagram of the additive form may appear like



Figure 11. Block Diagram representation of Sinusoidal Synthesis

Its mathematical form based on Fourier series will be

$$y(t) = a_0 + \sum_{k=1}^{K} a_k \sin(2\pi f_k t)$$

where a_0 is an offset value for the whole function (typically 0),

a_k = the amplitude weighting for each sine term,

f_k = the frequency of each sine term (in a Fourier series, the multiplier k times the fundamental frequency).

With hundreds of terms, each with their own individual frequency and amplitude weightings, we can design and specify some incredibly complex sounds, especially if we can modulate the parameters over time.
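A minimal sketch of additive synthesis for a harmonic sound is given below. The fundamental of 100 Hz, the ten harmonics and the 1/k amplitude weights are assumed values chosen for illustration; they are not parameters taken from this thesis.

```matlab
fs = 8000;                          % sample rate in Hz
t  = (0:fs/2-1)' / fs;              % half a second of sample instants
f0 = 100;                           % assumed fundamental frequency in Hz
k  = 1:10;                          % harmonic numbers
a  = 1 ./ k;                        % assumed amplitude weights, decaying with k
a0 = 0;                             % offset for the whole function

% Each column of sin(2*pi*t*(k*f0)) is one sine term; the matrix product with
% the weights forms the sum over all terms.
y = a0 + sin(2*pi*t*(k*f0)) * a(:);
y = y / max(abs(y));                % normalise to the range [-1, 1]

% soundsc(y, fs);                   % uncomment to listen to the result
```

Because every term is an integer multiple of 100 Hz, the result repeats every 80 samples (10 ms) at this sample rate.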



6) Frequency Vs Time Matrix from Spectrogram in MATLAB

The frequency content present in speech at a particular instant of time can be determined approximately by the Short Term Fourier Transform (STFT). For our thesis work we use the narrowband spectrogram produced by Matlab. We choose narrowband because it gives better frequency resolution and acceptable time resolution. We also tried the wideband spectrogram, but the speech synthesized using information from the wideband spectrogram was very noisy.

First of all, we take the spectrogram of the speech signal with the help of the MATLAB command spectrogram. The spectrogram figure produced by this command, once saved, is an RGB image on a decibel scale, in which the intensities above 0 dB are expressed in varying shades of red. We therefore separate out the red component from the RGB image. In the separated component we can easily identify the frequencies which had higher intensities in the speech, since the pixels corresponding to high intensity frequencies appear white, the others appear black, and the intermediate ones appear in shades of gray. The red component is then appropriately cropped and resized so that the number of rows equals 400, implying that every row covers a 10 Hz range, and the number of columns equals one hundred times the duration of the speech signal in seconds, implying that each column of the image corresponds to 10 milliseconds of speech.

It has been found that when we convert the resized image into black and white, by converting gray pixels nearer to white into white and gray pixels nearer to black into black, the quality of the synthesized speech is very near to the original speech. So we produce the black and white image, which corresponds to the Frequency Vs Time graph for the speech signal.
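The black and white conversion amounts to comparing every pixel against a threshold. The sketch below applies this to a small hypothetical patch of red-channel values; it approximates what a thresholding function does for 8-bit images without requiring any toolbox, and the row-to-frequency and column-to-time mapping assumes the image has already been resized and flipped as described above.

```matlab
% Hypothetical 2-by-3 patch of 8-bit red-channel intensities.
redPatch = uint8([ 10 200 250;
                  240  30 128]);
level = 0.9;                              % threshold between 0 and 1

% Pixels brighter than level (scaled to the 0..255 range) become 1, others 0.
bw = double(redPatch) > level * 255;

% Read the binary matrix as a Frequency Vs Time table.
[rowIdx, colIdx] = find(bw);
freqHz = 10 * rowIdx;                     % assumed mapping: row r <-> r*10 Hz
timeMs = 10 * (colIdx - 1);               % assumed mapping: column c starts at (c-1)*10 ms
```

A higher level keeps only the strongest frequency components; a lower level admits more components and with them more noise.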



a) The MATLAB code for performing the above task is as follows

    % function GenerateFreqVsTime()
    % Record your voice for f seconds.
    f = input('Enter the time in seconds for which you want to record: ');
    recObj = audiorecorder(8000, 8, 1);          % 8 kHz, 8 bits, one channel
    disp('Start speaking.');
    recordblocking(recObj, f);
    disp('End of Recording.');
    % Play back the recording.
    play(recObj);
    % Store data in double-precision array.
    myRecording = getaudiodata(recObj);
    figure(1)
    plot(myRecording); title('sound');
    % Plot the spectrogram
    figure(2)
    spectrogram(myRecording, 1000, 923, 1024, 8E3, 'yaxis');
    h = gcf;
    set(gcf, 'Position', get(0,'Screensize'));   % Maximize figure.
    level = input('Please enter level between 0 and 1: ');
    % Save the figure and read it back as an RGB image.
    saveas(h, 'spectrogram1.jpg');
    fig = imread('spectrogram1.jpg');
    figGray = rgb2gray(fig);
    figure(9)
    imshow(figGray); title('FigGray');
    % Keep only the red component.
    figRed = fig(:,:,1);
    figure(3)
    imshow(figRed);
    title('figRed');
    % Crop to the plot area and resize to 400 rows and 100 columns per second.
    [xmin, ymin, width, height] = croplimits(figRed);
    figure(4)
    figRedCropped = imcrop(figRed, [xmin ymin width height]);
    imshow(figRedCropped); title('figRed Cropped');
    figure(5)
    figRedCroppedResized = imresize(figRedCropped, [400 100*f]);
    imshow(figRedCroppedResized); title('figRedCroppedResized');
    % Flip so that the first row corresponds to the lowest frequency.
    figRedCroppedResizedCorrected = flipud(figRedCroppedResized);
    figure(6)
    figRedCroppedResizedBW = im2bw(figRedCroppedResized, level);
    imshow(figRedCroppedResizedBW); title('figRedCroppedResizedBW');
    figure(7)
    figRedCroppedResizedBWCorrected = flipud(figRedCroppedResizedBW);
    imshow(figRedCroppedResizedBWCorrected);

b) MATLAB code for the croplimits function used in the above code is as follows

    function [xmin, ymin, width, height] = croplimits(img)
    % Locate the plot area of the saved spectrogram by scanning for
    % non-white pixels (value other than 255) from each side.
    xmin = 0; r2 = 0; ymin = 0; c2 = 0;
    [row, column] = size(img);
    midCol = floor(column/2);        % integer indices even for odd image sizes
    midRow = floor(row/2);
    % Top edge: first non-white pixel in the middle column.
    for i = 30:90
        if img(i, midCol) ~= 255
            ymin = i + 5;
            break
        end
    end
    % Bottom edge: scan upwards from the last row.
    count = 0;
    for ki = row:-1:row-120
        if img(ki, midCol) ~= 255
            for kj = midCol:midCol+50
                if img(ki, kj) ~= 255
                    count = count + 1;
                else
                    count = count - 1;
                end
                if count > 0
                    r2 = ki;
                    break            % leaves only the inner kj loop
                end
            end
        end
    end
    % Left edge: scan columns from left to right.
    count = 0;
    for j = 80:180
        if img(floor(row/3), j) ~= 255
            for i = midRow:midRow+40
                if img(i, j) ~= 255
                    count = count + 1;
                else
                    count = count - 1;
                end
            end
            if count > 24
                xmin = j + 8; break;
            end
        end
    end
    % Right edge: scan columns from right to left.
    count = 0;
    for j = column:-1:column-120
        if img(midRow, j) ~= 255
            for i = midRow:midRow+100
                if img(i, j) ~= 255
                    count = count + 1;
                else
                    count = count - 1;
                end
            end
            if count > 0
                c2 = j; break;
            end
        end
    end
    height = r2 - ymin + 1;
    width  = c2 - xmin + 1;
    end

c) Screenshots

i. Speech Waveform

Figure 12 Speech Waveform

ii. Spectrogram of above speech using Matlab

Figure 13 Spectrogram of above speech using Matlab

iii. Grayscale Spectrogram

Figure 14 Grayscale Spectrogram

iv. Image of the red component of the spectrogram, since the red component represents positive magnitude

Figure 15 Red component of spectrogram

v. Same figure after being cropped by the Matlab function croplimits

Figure 16 Same figure after being cropped by the Matlab function croplimits

vi. Above figure resized by Matlab function to generate a column of pixels corresponding to 10 milliseconds

Figure 17 Resized using Matlab

vii. Above figure inverted so as to make the first row correspond to 10 Hz, the next row to 20 Hz, and the last (400th) row to 4 kHz

Figure 18 Same figure as previous but inverted

viii. Same figure as above with pixels having intensity less than 0.9 reduced to zero while others extended to 1

Figure 19 Same figure as above with pixels having intensity less than 0.9 reduced to zero while others extended to 1

7) Speech signal from Frequency Vs Time Matrix in MATLAB

Once we have the Frequency Vs Time matrix, we can generate all the frequencies present in a column using the sin function of MATLAB and add them together, and do this for every column, each of which corresponds to 10 milliseconds. We then concatenate the data generated for each column, and the result is the speech signal.

The MATLAB code for performing the above series of tasks is as follows

a) GenerateSoundData MATLAB Code:

    function sounddata = GenerateSoundData(img)
    % img: Frequency Vs Time matrix (row j <-> j*10 Hz, one column per 10 ms)
    [row, column] = size(img);
    img = double(img) / .255;            % constant scale factor; soundsc rescales on playback
    sounddata = zeros(1, 80*column);
    timeResolution = .01;                % 10 milliseconds
    samplingRate = 8000;                 % 8000 Hz
    % 80 sample instants per column
    time = (1:round(timeResolution*samplingRate)) / samplingRate;
    for i = 1:column
        y = zeros(size(time));
        for j = 10:row-100               % rows 10..300, i.e. 100 Hz to 3000 Hz
            y = y + sqrt(img(j,i)) * sin(2*pi*time*j*10);
        end
        sounddata(80*(i-1)+1 : 80*i) = y;
    end
    sounddata = sounddata';
    end

In this code we generate only frequencies in the range 100 Hz to 3000 Hz, because frequencies outside this range contribute much less to what is heard in speech.
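The column-by-column synthesis can be tried on a tiny hand-made matrix. The sketch below is a compact variant written for illustration, not the thesis function itself: the matrix M, its active rows and the vectorised sum are assumptions, while the row spacing of 10 Hz, the column duration of 10 ms and the 100 Hz to 3000 Hz range follow the text.

```matlab
fs  = 8000;                         % sample rate in Hz
spc = round(fs * 0.01);             % samples per column: 80 (10 ms)
t   = (1:spc) / fs;                 % time axis of one column, restarted for each column

M = false(400, 3);                  % hypothetical Frequency Vs Time matrix, 3 columns
M(44, 1:2) = true;                  % 440 Hz present during the first 20 ms
M(88, 2)   = true;                  % 880 Hz present during the second column only

rows = 10:300;                      % rows for 100 Hz .. 3000 Hz
out  = zeros(1, spc * size(M, 2));
for c = 1:size(M, 2)
    active = rows(M(rows, c));                     % rows whose pixel is 1 in this column
    y = sum(sin(2*pi*(10*active(:))*t), 1);        % add one sinusoid per active row
    out(spc*(c-1)+1 : spc*c) = y;                  % concatenate the columns
end

% soundsc(out, fs);                 % uncomment to listen to the 30 ms result
```

The third column has no active rows and therefore produces silence. Because the time axis restarts in every column, the phase of each sinusoid also restarts every 10 ms.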

b) TestAtLevel MATLAB Code:

    function sdata = TestAtLevel(spectrograph, level)
    bwspectrograph = im2bw(spectrograph, level);   % pixels above level -> 1, others -> 0
    sdata = GenerateSoundData(bwspectrograph);
    soundsc(sdata, 8000);                          % play back at 8 kHz
    end

To the function TestAtLevel we pass the matrix figRedCroppedResizedCorrected obtained from GenerateFreqVsTime, along with level, which specifies the threshold: values below level are converted to 0, while values greater than level are converted to 1.

8) Results

The speech waveforms generated with different values of level for the conversion of the red component of the spectrogram into a black and white image are demonstrated below, along with their spectrograms.

a) Level = 0.8

b) Level = 0.9

c) Level = 0.95

9) Conclusion

From the above three speech waveforms, it seems that a level of around 0.9 is the best threshold for the red component of the spectrogram generated by Matlab, in the sense that the speech generated by the Matlab function GenerateSoundData then matches the original speech most closely.

The sinusoidal model, a framework for modelling speech and music signals, has been presented. Sinusoidal synthesis of speech by extracting frequency and time information from the spectrogram gives an acceptable quality of speech. Another strategy would be to decompose the signal into deterministic and stochastic parts and to use different models for the different portions of the speech, as proposed by [5].

10) Bibliography/References

[1] R. McAulay, Th. Quatieri: Speech Analysis/Synthesis Based on a Sinusoidal Representation, in IEEE Transactions on Acoustics, Speech, and Signal Processing, August 1986

[2] J. Smith III, X. Serra: PARSHL: An Analysis/Synthesis Program for Non-Harmonic Sounds Based on a Sinusoidal Representation

[3] K. Fitz, L. Haken: On the Use of Time-Frequency Reassignment in Additive Sound Modelling

[4] M. Lagrange, S. Marchand, M. Raspaud, J.-B. Rault: Enhanced Partial Tracking using Linear Prediction, in Proc. of the 6th Int. Conference on Digital Audio Effects (DAFx-03), September 2003

[5] X. Serra: A System for Sound Analysis/Transformation/Synthesis based on a Deterministic plus Stochastic Decomposition, Thesis, Stanford University, 1989

