
A SYLLABLE BASED SEGMENT VOCODER

A THESIS

submitted by

SADHANA CHEVIREDDY

for the award of the degree

of

MASTER OF SCIENCE

(by Research)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY MADRAS

DECEMBER 2008

THESIS CERTIFICATE

This is to certify that the Thesis entitled A Syllable based Segment Vocoder

submitted by Sadhana Chevireddy to the Indian Institute of Technology Madras,

for the award of the degree of Master of Science (by research), is a bonafide record

of the research carried out by her under our supervision and guidance. The con-

tents of this thesis, in full or in parts, have not been submitted to any other

Institute or University for the award of any degree or diploma.

Dr. Hema A. Murthy Dr. C. Chandra Sekhar

Chennai-600036

Date:

To my late grandparents

Sri B. Pulla Reddy and Smt B. Jayamma

for their enduring love that still touches me


ACKNOWLEDGEMENTS

First and foremost, I would like to express my sincere gratitude towards my guides

Prof. Hema A. Murthy and Prof. C. Chandra Sekhar for their constant support

and guidance. Hema Ma’am has always been my role model and whatever progress

I have made, both personally and academically in these three years of stay, I owe it

to her. Working under her provides an immense scope to learn and a few minutes

of discussion with her opens up a new insight into the problem. Personally, she has

showered her motherly love whenever I was dejected. I cherish all the moments

I spent with her. Thank you Ma’am. I thank Chandra Sekhar sir for providing

his invaluable feedback during the correction of my papers and Thesis. I always

remember the two most important interactions with him - one that made me come to the lab regularly, sharp at 8 A.M., and the other that made me write my

Thesis better. I am extremely grateful to you sir, for your remarks always made

me stronger and humble.

I thank my GTC members, Prof. B. Ravindran and Prof. Andrew Thangaraj

for their valuable suggestions in my GTC and Synopsis meetings. I am thankful to

the Head of the Department, Prof. Timothy A. Gonsalves for his encouragement

and nice words in my M.S Seminar. I also thank him for providing excellent

facilities in the Department.

I thank Sarada Ma’am for her kind assistance whenever required, Natarajan

Sir and Radhai Ma’am for their co-operation in scheduling my GTC meetings.

I am indebted to Prema Ma’am for her love and for presenting me with one of her

beautiful paintings. Most importantly, I would like to thank every person in the

department who helped directly or indirectly in successfully completing my course

work.

I would like to thank my loving friends, Ramya who had always been there

for me during my difficult times. More than a friend, she had been a wonderful

critic who helped in understanding my work better. And Sravanti with whom I

enjoy chatting, for she is the one who listens to me with utmost concentration and

makes me happy. I would like to thank someone who always says I am correct

(mostly to please me), Deivapalan. I always remember the light hearted moments

shared with him. And Pappu sir, who is right there to extend his help whenever

required.

I would like to thank my friend Abhijit for his valuable suggestions regard-

ing coding. The discussions with him made me understand my work better.

I thank all my friends Deepti, Venu, Ranbir sir, Chaitanya, Vinodh and Ra-

jesh for providing friendly environment in Lab. I am extremely thankful to my

friends Syam, Deepthy, Bama, Ramya, Sahiti, Radhika, Neetha, Jyothi, Suneetha,

Samaja, Harini and Mrudula for making my stay in campus a memorable one.

On the personal front, I firstly thank Madhu uncle for nurturing the art of

thinking through never ending discussions on whatnot on earth. Ramesh uncle for

his all good words during my tough times. I thank my doctor uncle, Srinath mama

for ensuring that my health is fine during my stay here. And Sujatha atta for her

immense love towards me. I thank all my best friends Shallu, Suni, Summu, Siri

and Navya for their love and affection. Not even once did they make me feel that I was away from them.

I am extremely grateful to my dearest friend Akku who had been everything to

me during my stay in the campus. She is still the one who takes all my nonsense

with lot of patience.

I thank my loving Chaakha for bothering Gods on behalf of me whenever I am

worried and Chinanna for his love and care towards me.

I am extremely fortunate to have wonderful parents, my strength. Daddy has

been my first guru from whom I learnt the value of education and life. Amma

has been my dearest friend, philosopher and guide who always shows me the best

path when I am confused. I express my sincere gratitude to them.

Finally, I thank the Almighty for showering His immense grace upon me by

placing me among these wonderful people and in the most beautiful place, IIT

Madras...

Sadhana


ABSTRACT

Keywords: Very low bit rate speech coding, Segment vocoders, Syllable segmen-

tation, Unsupervised HMM clustering, MELP residual modeling

In modern voice communication technology, the allocation of the available bandwidth to an increasing number of users is a challenge. This has paved the way for the development of very low bit rate speech coders. There has been progressive research on reducing transmission bit rates without sacrificing consumer satisfaction. Low bit rate coders aim to achieve good intelligibility at bit rates of less than 4.8 Kbps. Segment vocoders are one such class of very low bit rate coders, which reduce the bit rates to below 2.4 Kbps while still retaining intelligibility.

In the present work, a syllable based segment vocoder that operates at 1.4 Kbps

is developed. Implementation of a segment vocoder has many stages: segmenta-

tion of speech into predefined units, preparation of the segment codebook and

residual modeling. Many vocoders use phonemes or diphones as segmental units

and employ appropriate automatic segmentation techniques to obtain the units.

Most of the segmentation techniques used are either iterative or use parametric

models built on a large speech corpus. In the present work, a signal process-

ing technique called automatic group delay based segmentation is used to obtain

syllable like segments. These segments are used as the units for compression. Syllable like segments, being larger units than phonemes, offer better compression.

A Hidden Markov Model (HMM) based incremental training algorithm is used

for clustering the segmental units. Unlike most of the clustering techniques, in

the proposed clustering technique, HMM training is unsupervised and does not

require any transcription for building the HMMs. The clusters are later used to

form the system codebook. To preserve the naturalness in the synthesized speech,

the residual is modeled using the Mixed Excitation Linear Prediction (MELP) codec op-

erating at 2.4 Kbps. When the residual is modeled using MELP, an average bit

rate of 1.4 Kbps is achieved for the proposed syllable based segment vocoder. The

synthesized speech quality is compared with that of the MELP codec using the

objective evaluation measure, Perceptual Evaluation of Speech Quality (PESQ).


TABLE OF CONTENTS

ACKNOWLEDGEMENTS ii

ABSTRACT v

LIST OF TABLES x

LIST OF FIGURES xii

ABBREVIATIONS xiii

1 OVERVIEW OF THE THESIS 1

1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Scope of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . 6

1.4 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . 7

2 SPEECH CODERS 8

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Waveform Coders . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 Vocoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4 Segment Vocoders . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.5 Issues in Implementing a Segment Vocoder . . . . . . . . . . . . 19

2.6 A Review of Segment Vocoders . . . . . . . . . . . . . . . . . . 22

2.7 Syllable based Segment Vocoder . . . . . . . . . . . . . . . . . . 28

2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3 SPEECH SEGMENTATION 30

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2 Segmentation in Segment Vocoders . . . . . . . . . . . . . . . . 30


3.2.1 Maximum Likelihood (ML) Segmentation . . . . . . . . . 32

3.2.2 Spectral Transition Measure (STM) based Segmentation 34

3.3 Syllable Segmentation . . . . . . . . . . . . . . . . . . . . . . . 35

3.4 Group Delay based Segmentation . . . . . . . . . . . . . . . . . 36

3.5 Syllable Segmentation for the Syllable based Segment Vocoder . 40

3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4 SYSTEM CODEBOOK PREPARATION 42

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.2 Need for System Compression . . . . . . . . . . . . . . . . . . . 42

4.3 Review of Clustering Algorithms . . . . . . . . . . . . . . . . . . 43

4.4 Motivation for HMM based syllable segment clustering . . . . . 46

4.5 System Codebook Preparation . . . . . . . . . . . . . . . . . . . 47

4.5.1 MFS and MFR Techniques . . . . . . . . . . . . . . . . . 48

4.5.2 Clustering of syllable segments . . . . . . . . . . . . . . . 48

4.5.3 Color palette analogy . . . . . . . . . . . . . . . . . . . . 51

4.5.4 Selection of representative syllable segments . . . . . . . 53

4.5.5 Preparation of system codebook . . . . . . . . . . . . . . 57

4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

5 SYLLABLE BASED SEGMENT VOCODER 59

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

5.2 Operation of the Vocoder . . . . . . . . . . . . . . . . . . . . . . 59

5.3 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.3.1 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 61

5.3.2 System Quantization . . . . . . . . . . . . . . . . . . . . 62

5.3.3 Source Encoding . . . . . . . . . . . . . . . . . . . . . . 62

5.4 Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.4.1 Extraction of system parameters . . . . . . . . . . . . . . 65

5.4.2 Source Modeling . . . . . . . . . . . . . . . . . . . . . . 65

5.5 Synthesis of Speech . . . . . . . . . . . . . . . . . . . . . . . . . 68

5.6 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . 70


5.7 Issues in the Implementation . . . . . . . . . . . . . . . . . . . . 76

5.7.1 Residual Modeling . . . . . . . . . . . . . . . . . . . . . 76

5.7.2 Syllable Recognition . . . . . . . . . . . . . . . . . . . . 78

5.7.3 Duration Matching . . . . . . . . . . . . . . . . . . . . . 82

5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

6 CONCLUSIONS 88

6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

6.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

6.3 Criticism of the work . . . . . . . . . . . . . . . . . . . . . . . . 90

6.4 Directions for Future Work . . . . . . . . . . . . . . . . . . . . . 91

LIST OF PUBLICATIONS 96

LIST OF TABLES

3.1 Syllable boundaries with single and two-level segmentations . . 39

5.1 PESQ scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.2 Comparison of PESQ scores for the syllable based segment vocoder and 2.4 Kbps MELP codec for four sentences in the single speaker database . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.3 Comparison of PESQ scores for the syllable based segment vocoder and 2.4 Kbps MELP codec for six sentences in the multi speaker database . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.4 Comparison of PESQ scores for the syllable based segment vocoder and 2.4 Kbps MELP codec for six sentences of Tamil language in the multi speaker, multi lingual database . . . . . . . . . . . . . 75

5.5 Comparison of PESQ scores for the syllable based segment vocoder and 2.4 Kbps MELP codec for six sentences of Hindi language in the multi speaker, multi lingual database . . . . . . . . . . . . . 75

5.6 PESQ scores of an utterance after appropriate frame repetition . 85


LIST OF FIGURES

1.1 Role of a speech codec in mobile phone technology. . . . . . . . 2

2.1 Block diagram of a waveform coder . . . . . . . . . . . . . . . . 11

2.2 LPC model for speech production . . . . . . . . . . . . . . . . . 12

2.3 Block diagram of a vocoder showing the process in (a) Transmitter and (b) Receiver . . . . . . . . . . . . . . . . . . . . . . . . 14

2.4 Block diagram of a segment vocoder showing the process in (a) Transmitter and (b) Receiver . . . . . . . . . . . . . . . . . . . 18

2.5 Performance of different coders . . . . . . . . . . . . . . . . . . 21

3.1 Group delay based segmentation of a speech signal . . . . . . . 38

3.2 Illustration of the two-level group delay based segmentation (a) Waveform of the speech signal for a sentence (b) Result of the first level of segmentation and (c) Result of the second level of segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.1 Color palette analogy to explain the proposed HMM based technique for segment clustering. . . . . . . . . . . . . . . . . . . . . 53

4.2 Illustration of selection of a representative syllable segment . . . 54

4.3 Waveforms and Spectrograms of five syllable segments in a cluster formed using the proposed clustering algorithm. The cluster contains the syllable segments that sound like /va/. . . . . . . . 56

4.4 System codebook entries. Here Si denotes the representative syllable for the cluster Ci and ni is the duration of Si. . . . . . . . . 58

5.1 Block diagram of the syllable based segment vocoder . . . . . . 60

5.2 Segmentation of a Tamil sentence into syllable like units . . . . 61

5.3 System encoding in the syllable based segment vocoder . . . . . 62

5.4 Block diagram of the MELP encoder used for source encoding in the syllable based segment vocoder . . . . . . . . . . . . . . . . 63

5.5 Block diagram of the MELP decoder . . . . . . . . . . . . . . . 66

5.6 Comparison of the MELP model residual with the original residual 67


5.7 Waveform of (a) The original speech signal and (b) The clipped signal of the synthesized waveform . . . . . . . . . . . . . . . . 69

5.8 Block diagram of the modified segment vocoder . . . . . . . . . 70

5.9 Comparison of spectrograms: (a) Original signal and (b) Synthesized signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.10 Comparison of the residual for the original speech signal and the residual used in the vocoder: (a) Residual of the original speech signal and (b) Residual of the speech signal passed through the filter from the LP coefficient vectors of the representative syllable segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.11 Different combinations of syllables for the utterance /vanakkam/ 79

5.12 PESQ improvement by selecting perceptually close representative syllable segments . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.13 Comparison of the residual for the (a) original speech signal, (b) first best results of the representative syllable segments and (c) perceptually close representative syllable segments from 3-best results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.14 Comparison of the segment duration of (a) original syllable segment and (b) representative syllable segment . . . . . . . . . . . 82

5.15 Analysis of spectrograms for (a) Input syllable segment, (b) Synthesized syllable segment obtained with middle frame repetition, (c) Synthesized syllable segment with appropriate frame repetition and (d) Synthesized syllable segment with appropriate frame repetition but different from the frame repetition used in (c). . . . . 84

5.16 (a) Synthesized speech obtained using the MELP codec and (b) Synthesized speech obtained using the syllable based segment vocoder. 86


ABBREVIATIONS

ADPCM  Adaptive Differential Pulse Code Modulation
CELP  Code Excited Linear Prediction
DP  Dynamic Programming
DPCM  Differential Pulse Code Modulation
DTW  Dynamic Time Warping
GD  Group Delay
HMM  Hidden Markov Model
LP  Linear Prediction
LPC  Linear Predictive Coding
LPCC  Linear Prediction Cepstral Coefficient
LSF  Line Spectral Frequency
MBE  Multiband Excitation
MELP  Mixed Excitation Linear Prediction
MFCC  Mel Frequency Cepstral Coefficient
MFR  Multiple Frame Rate
MFS  Multiple Frame Size
ML  Maximum Likelihood
MLSA  Mel Log Spectrum Approximation
MOS  Mean Opinion Score
PCM  Pulse Code Modulation
PESQ  Perceptual Evaluation of Speech Quality
RELP  Residual Excited Linear Prediction
RMS  Root Mean Square
STE  Short Term Energy
STM  Spectral Transition Measure
TTS  Text-to-Speech
VLSI  Very Large Scale Integration
VQ  Vector Quantization


CHAPTER 1

OVERVIEW OF THE THESIS

1.1 Introduction

The digital revolution has had its impact on every field of science and technology

over the last two decades. The field of communications in general, and the voice

communication technology in particular, has witnessed drastic changes. Advances

in Very Large Scale Integration (VLSI) and sophisticated speech coding algorithms

have enabled the miniaturization of handheld devices that operate with very low

power and communicate at very low transmission bit rates.

Speech coding is a set of techniques used to compress the digital speech data.

Speech is represented in the form of a code and the code is stored or transmitted.

During the playback or reception, the stored code is converted back to the speech

signal. Speech coding algorithms aim at achieving very high compression rates of

speech without an objectionable loss of speech quality. High compression of speech reduces the storage space required and enables low transmission rates.

The software or hardware that implements a speech coding algorithm is called

a speech coder. Speech communication between two points is complete when

a speech signal is compressed and coded at the transmitter, and decoded and

regenerated at the receiver. The coding and decoding operations require a speech

coder at the transmitter and a speech decoder at the receiver. The combination

of a speech coder and speech decoder forms a speech codec. Speech codecs find

applications in all forms of voice communications including cellular telephony,

voice over IP etc. The role of a speech codec in mobile phone technology is shown

in Figure 1.1.

Figure 1.1: Role of a speech codec in mobile phone technology.

The implementation of a speech codec essentially means the implementation

of a speech coder. The operation of the speech decoder depends on the method of

coding employed in the speech coder.

With the number of users increasing day by day, accommodating more and more users within the available bandwidth is unarguably a big challenge. Thus the primary goal of any speech coder is to represent the speech data with the minimum possible number of bits. This decreases the transmission bit rate and thereby increases bandwidth utilization, without compromising the speech quality at the receiver. Normally, as more compression is achieved,

the distortion in the regenerated speech increases, resulting in a decrease in the

quality of speech. Therefore, there is always a trade-off between speech quality

and bit rate.

Speech coders are broadly classified into two groups: waveform coders and

source coders [1]. The two classes differ in the methods used for speech repre-

sentation while coding. A waveform coder does not use the knowledge of signal

(speech) generation while coding the input signal. A waveform coder tries to

code the signal in such a way that the reproduced signal at the receiver is as


close as possible to its original. Therefore, the waveform coders have very high

bit rates and yield high quality speech. Source coders, also called vocoders, use

the knowledge of speech signal generation while coding. Prior to coding, vocoders

represent the speech signal in the form of a model. They try to extract the param-

eters of the model which are then encoded and transmitted. At the receiver, the

vocoder decodes the parameters, rebuilds the model using the decoded parameters

and synthesizes the speech using the model. Generally, the Linear Predictive Coding (LPC) model is used to realize vocoders. Bit rates of less than 8 Kbps are achieved in these vocoders by compressing the system and source information to the maximum extent. However, the quality of the synthesized speech signal is poor

when compared to that of the waveform coders. Segment vocoders are the class

of vocoders where the system compression is done at the segment level instead

of at the frame level. Compression of the source is carried out by modeling it with the minimum possible number of bits. This high compression of system and source enables the vocoders to encode speech at very low bit rates (less than 2.4 Kbps), which in turn reduces the speech quality. The focus of research in segment vocoders

is on improving the speech quality at very low bit rates.

Many very low bit rate coders use a segment based approach to coding [2][3][4][5][6]

[7][8]. The first phase in building such a coder is the segmentation task. Here,

the utterances in large speech corpora are automatically segmented into spectrally

stable units. The second phase is the development of codebooks for the segments

obtained in the first phase. Assuming a source-system model for speech produc-

tion, each segment is deconvolved into its source and system components. In LPC

based analysis, the residual corresponds to the source and the LP coefficients cor-

respond to the system [9]. The system and source components of a segment are

separately coded. To prepare the codebook of system parameters, the sequences

of spectral parameter vectors like Mel Frequency Cepstral Coefficients (MFCCs),

Linear Prediction Cepstral Coefficients (LPCCs), and Line Spectral Frequencies

(LSFs) are clustered. The codebook for the source is prepared by quantizing the


parameters such as pitch, gain and voicing. The third and final phase in a segment

vocoder is the encoding of the speech input. The input speech utterance is seg-

mented into variable length segments. Each segment is quantized using the system

codebook. The codebook index that best matches the given segment is transmit-

ted. The residual is encoded using a standard technique such as LPC10 or MELP [10]. At

the receiver, the speech is synthesized using the sequence of vectors corresponding

to the codebook index received, and the decoded residual.
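To make these phases concrete, the following sketch mimics the encoding flow in miniature (an illustrative Python toy only: fixed-length chunks stand in for the automatic segmentation, mean log-spectra stand in for the system codebook entries, and residual coding is omitted; none of these stand-ins are the techniques actually used in this thesis):

import numpy as np

# Toy sketch of the segment-vocoder encoding flow described above.
def toy_segments(speech, seg_len=1600):
    # Placeholder segmentation: cut the utterance into fixed-length chunks.
    return [speech[i:i + seg_len] for i in range(0, len(speech), seg_len)]

def toy_feature(segment, frame=200):
    # Placeholder segment feature: mean log magnitude spectrum of its frames.
    frames = [segment[i:i + frame] for i in range(0, len(segment) - frame + 1, frame)]
    return np.mean([np.log(np.abs(np.fft.rfft(f)) + 1e-8) for f in frames], axis=0)

def encode(speech, codebook):
    # One (codebook index, duration in samples) pair is produced per segment;
    # the residual would additionally be coded with LPC10/MELP.
    coded = []
    for seg in toy_segments(speech):
        feat = toy_feature(seg)
        index = int(np.argmin([np.linalg.norm(feat - c) for c in codebook]))
        coded.append((index, len(seg)))
    return coded

# Example: a random "codebook" of 8 entries and one second of noise at 8 KHz.
rng = np.random.default_rng(0)
codebook = [toy_feature(rng.standard_normal(1600)) for _ in range(8)]
print(encode(rng.standard_normal(8000), codebook))

In the actual vocoder, each placeholder above is replaced by the group delay based segmentation, the HMM based clustering of syllable segments, and the MELP residual coding described in the later chapters.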

1.2 Scope of the Thesis

The two components of speech, source and system, differ in their characteristics.

Hence, the extent of compression that can be achieved for the two components

is different. The source, which is fast varying, is generally quantized frame by

frame. Since the system characteristics vary slowly, quantization can be achieved

over a set of frames or a segment. This facilitates better compression. In segment

vocoders, the savings in bandwidth primarily come from the choice of the segment; the larger the segment, the lower the bit rate. The segment should not be

so large that it cannot be modeled properly. Therefore, choosing an appropriate

segmental unit which is reasonably stable and is not too small becomes an issue.

Generally, segments which are acoustically stable within the segment duration

are chosen. For example, phonetic vocoders are built using phonemes [3] as basic

segmental units. Attempts are also made to use diphones and multi-gram [5] units

as basic segments. Automatic segmentation methods like Maximum Likelihood

(ML) segmentation and Spectral Transition Measure (STM) based segmentation

are used to automatically segment a speech utterance into phoneme like units [11].

The above methods assume piecewise stationarity of speech as an acoustic criterion

for determining the segments. The criterion fails when there is a significant ‘intra-

segmental distortion’. Further, the number of segment boundaries is fixed in the


ML segmentation. The phoneme rate can vary quite significantly from one speech

utterance to another. Segmentation is also done by building parametric models

for the segmental units. This method requires the models to be built using a

large database to effectively capture the variations in the segmental units. The

variations in the units increase with an increase in size of the segmental unit. In

the present work, a signal processing technique called Group Delay (GD) based

segmentation, which has been successfully used to give syllable like units [12][13],

is applied for the segmentation task. The algorithm does not require any prior

information or prior processing to locate the boundaries. Hence, an attempt is

made in this work to use the syllable as the basic segmental unit for compression. The

unit is also large enough to enable better compression.

The system codebook is prepared by clustering the segmental units obtained

from a large speech corpus. Conventional methods include vector quantization

(VQ) and its modifications [2][14][15][16][17][18] to cluster the segmental units

in vector space. In VQ, the clusters are formed based on the spatial variation

among the features of the segments. Duration information is not modeled in

VQ. Hence, duration and mapping analysis have to be addressed in the clustering

process. When clustering using VQ, the sequence information is completely lost.

Since syllables are larger units, the sequence information is crucial. Hidden Markov

Models (HMMs) are also used to model the segment characteristics acoustically

[6][4][5][7][8]. However, most of these methods are supervised, and need transcriptions and enough examples for training the HMMs. Therefore, in this work, an acoustic

clustering technique called unsupervised Hidden Markov Model (HMM) based

clustering method [19] is used. In this clustering technique, the acoustically similar

syllables are grouped into one cluster. Each cluster is identified by an HMM.

A codebook is built using the HMMs of clusters generated. The compression of

the system information is around 100 bps. The source (residual) is coded using the

residual modeling technique implemented in the Mixed Excitation Linear Prediction (MELP) codec [20][21][22]. Encoding the residual using MELP results in a bit rate of 1.2 Kbps. When the duration information of the syllable segments is also encoded, a bit rate of 1.4 Kbps is obtained. Residual modeling using MELP ensures

naturalness in the synthesized speech.
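As a rough check of the bit budget implied by these figures (the duration component below is inferred as the remainder needed to reach the stated total; it is not a number quoted in the thesis):

# Approximate bit budget of the proposed vocoder, from the figures in this section.
system_bps = 100      # syllable codebook indices
residual_bps = 1200   # MELP coded residual
duration_bps = 100    # inferred: remainder needed to reach the stated 1.4 Kbps
print((system_bps + residual_bps + duration_bps) / 1000.0, "Kbps")   # 1.4 Kbps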

1.3 Organization of the Thesis

Chapter 2 starts with a brief introduction to speech coding. The different speech

coding techniques are discussed and the significance of low bit rate speech coding

is established. Later, segment vocoders which are capable of reducing the bit rates

to less than 2 Kbps are discussed in detail. Different segment vocoders proposed

in the literature are briefly described and the motivation for developing a syllable

based segment vocoder is presented.

Chapter 3 describes the group delay based segmentation technique, which is the

first task in the implementation of the proposed vocoder. Different automatic seg-

mentation algorithms used in segment vocoders are discussed. Then the phoneme

segmentation techniques like ML segmentation and STM based segmentation are

presented. The importance of group delay based segmentation for getting syllable

like units is also discussed.

Chapter 4 explains the preparation of the system codebook. An overview of ap-

proaches to system codebook preparation is presented. The proposed technique

to cluster the large corpus of segmental units called the automatic HMM based

unsupervised clustering algorithm, and the system codebook preparation are de-

scribed.

Chapter 5 presents the proposed syllable based segment vocoder. The process

of encoding the system and source is explained. Since MELP is used to encode the residual, a brief overview of the MELP codec is also presented. Finally, the

decoding and synthesis of speech are explained. The issues in the implementation

of the proposed coder are also discussed in detail.


1.4 Contributions of the Thesis

The following are the major contributions of the thesis:

1. A novel approach to build a segment vocoder is proposed. In this approach,

syllables are considered as the segmental units for compression.

2. An unsupervised clustering algorithm, which does not require any transcrip-

tions or predefined examples to train HMMs, is used to prepare the system

codebook for the vocoder. A system compression of 100 bps is achieved

using the proposed clustering technique.

3. A syllable based segment vocoder capable of producing natural sounding

speech at 1.4 Kbps, is built using the proposed segmentation and clustering

techniques.


CHAPTER 2

SPEECH CODERS

2.1 Introduction

Speech coders are used to give an efficient digital representation of telephone

bandwidth speech, particularly in narrow band telephone communication. Often, the speech signal is band-limited to 3.4 KHz and sampled at 8 KHz. The

objective of any speech coder is to represent the digital speech with as few bits as

possible, while producing the reconstructed speech that sounds identical or almost

identical to the original speech. However, in practice, there is always a trade-off

between the bit rate and the speech quality. The four main issues to be addressed

in implementing a speech coder are as follows:

1. Speech quality: Speech quality depends on the method used for coding.

Depending on the application at hand, an estimate of the required quality

is decided and the method for coding is chosen accordingly. Generally, high bit rate coders provide high quality speech.

2. Bit rate: The coding efficiency of a speech coder is expressed in terms of

bits per second (bps). The bit rate depends on the method used for coding.

3. Communication delay: The time taken to process the input speech for

coding is called the communication delay. Delay is an important issue in

real time implementation of the speech coder.

4. Complexity: The complexity is expressed in terms of memory requirement

and processing capability. A high complexity may lead to a high power

consumption in the hardware. The complexity involved in realizing the

coder should be as low as possible to incorporate it in any hardware such as

a cell phone.

In this chapter, different types of speech coders are briefly reviewed. The per-

formance evaluation of coders in terms of speech quality and bit rates is presented.

As the focus of the present research work is not on the implementation of the real-

time coder, emphasis is not given to the delay and complexity factors. Sections 2.2

and 2.3 explain the waveform coders and the vocoders respectively. The need for

low bit rate coders is established in Sections 2.4 and 2.5, thereby paving the way to

the area of interest, the segment vocoders. Section 2.6 gives a brief review of some

of the segment vocoders proposed in the literature. The chapter concludes with

a discussion on the proposed syllable based segment vocoder and its objectives in

Section 2.7.

Speech coders are broadly classified into two groups: waveform coders and

vocoders. Waveform coders operate at high bit rates and yield high quality speech

at the receiver. Vocoders or source coders operate at very low bit rates, but the

quality of speech at the receiver is reduced. Hybrid coders form a sub-category

of coders that use techniques from both source coding and waveform coding to

produce good quality speech at intermediate bit rates. A detailed description of

the coders is presented in the following two sections.

2.2 Waveform Coders

Waveform coders do not use the knowledge of signal generation while coding the

input speech signal. In other words, a waveform coder tries to reproduce the signal

whose waveform (in time domain) or spectrum (in frequency domain) is as close

as possible to that of its original. In the time domain, a waveform coder quantizes

and encodes the speech signal sample by sample. The coder requires only the


information about the sample value at the time of encoding. The complexity of

coding is thus very low. Each sample is coded using at least two bits (in delta

modulation) and hence the bit rates are greater than 16 Kbps. The simplest form

of waveform coding, called Pulse Code Modulation (PCM), samples the speech signal at 8 KHz and encodes each sample with 16 bits, resulting in a bit rate of 128 Kbps.

The output speech quality of such a coder is very high and indistinguishable from

the original speech. The only error present between the input speech signal and

the reconstructed speech signal is the quantization error.

Several modifications are made to the basic technique to reduce the bit rates

without affecting the speech quality. The reduction in bit rates is achieved by using

the redundancy in the speech signal. Redundancy in speech enables prediction

of the value of speech signal at a particular time instant based on the values

at preceding time instants. Differential Pulse Code Modulation (DPCM) and

Adaptive DPCM (ADPCM) are the modifications to PCM that use prediction for

coding the speech signal.

In summary, waveform coders are low complexity coders that code the speech

signal sample by sample. These coders operate at very high bit rates and produce high quality speech. The block diagram of a PCM coder that shows the encoding

and decoding processes involved is shown in Figure 2.1. The input speech signal

is sampled using a sampling circuit. The vertical lines denote the speech samples.

Each sample of speech is fed to a quantizer. The quantized value of the sample

is binary encoded and transmitted as a bit stream. At the receiver, the decoder

extracts the sample information from the bit stream. Samples are regenerated

using the quantized values. A digital-to-analog converter is used to convert the

digital sample stream to an analog speech signal. If the sampling rate of the

sampler is S samples per second and each sample is encoded using m bits, the bit

rate of the coder is mS bits/s or mS bps.
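As a minimal illustration of sample-by-sample waveform coding, the sketch below uniformly quantizes a signal and reports the resulting PCM bit rate (the 8 KHz sampling rate and 16-bit allocation are the example values used above; the mid-rise quantizer and the test tone are arbitrary illustrative choices, not those of any standard codec):

import numpy as np

def pcm_quantize(x, bits):
    # Map samples in [-1, 1) to integer codes and back (uniform mid-rise quantizer).
    levels = 2 ** bits
    codes = np.clip(np.floor((x + 1.0) / 2.0 * levels), 0, levels - 1)
    reconstructed = (codes + 0.5) / levels * 2.0 - 1.0
    return codes.astype(int), reconstructed

S, m = 8000, 16                                   # samples per second, bits per sample
x = np.sin(2 * np.pi * 440 * np.arange(S) / S)    # one second of a 440 Hz tone
codes, x_hat = pcm_quantize(x, m)
print("bit rate =", m * S, "bps")                 # 16 x 8000 = 128000 bps (128 Kbps)
print("max quantization error =", float(np.max(np.abs(x - x_hat))))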

Figure 2.1: Block diagram of a waveform coder

2.3 Vocoders

Vocoders, also called source coders, use the knowledge of the speech signal gener-

ation while coding. Prior to coding, vocoders represent the speech signal as the

output of a physical model. The parameters of the model are extracted, quan-

tized, encoded and transmitted. At the receiver, the decoder extracts the encoded

parameters, rebuilds the model using the decoded parameters and synthesizes the

speech using the model. The model used for coding is based on the human speech

production mechanism.

Human speech production involves a series of excitation pulses (puffs of air)

passing through a filter (vocal tract) to produce the sound. The type of sound

produced depends on the type of excitation as well as the vocal tract shape. The

human speech production system can be modeled using linear prediction as shown

in Figure 2.2. The model represents the digital speech signal as the output of a dig-

ital filter (called LPC filter) whose input (excitation) is either a train of impulses

or a white noise sequence. From the figure, it is seen that speech is synthesized

by sending either a series of impulses or a white noise sequence through a filter.

The type of input to the filter depends on the nature of the sound unit. A voiced

sound is synthesized with an impulse train as input, and an unvoiced sound is

Figure 2.2: LPC model for speech production

synthesized with noise as input. The LPC model is also called the source-system

model of speech. In this model, the LPC filter represents the system and the

excitation (either train of impulses or white noise sequence) represents the source.

The process of synthesizing speech using the LPC filter and excitation is called

LP synthesis. In vocoders, the system and source information of the input speech

signal is encoded and transmitted. An LP synthesis filter is used to synthesize

the speech at the receiver. The process of extracting the system and source in-

formation from the speech is done using LP analysis. In LP analysis, the speech

input is given to an LPC filter to get the residual as the output. The LPC filter coef-

ficients represent the system characteristics of the speech input and the residual,

also called the prediction error, represents the source information. Since speech is a

quasi-stationary signal, the system and source characteristics change rapidly. For

all practical purposes, speech is assumed to be stationary within a duration of

approximately 25 milliseconds, called a frame. The system and source parameters

are extracted over a frame of speech. In vocoders, the system and source parame-

ters obtained from each frame of speech are quantized using predefined codebooks.

The codebooks are prepared separately for system and source. The system infor-

mation for a frame of speech is generally captured by some spectral parameter

vector. The codebook for system is generated using the spectral properties of the


speech signal. Vector Quantization (VQ) is popularly used for codebook genera-

tion. The codebook for system is prepared by clustering the spectral parameter

vectors computed for each frame of speech. Line Spectral Frequencies (LSFs),

LP Cepstral Coefficients (LPCC), Mel Frequency Cepstral Coefficients (MFCC)

are the commonly used spectral parameters for clustering. The centroids of the

clusters formed are considered as codewords. The source information is defined

in terms of parameters like pitch, gain and voicing. Multiple codebooks are pre-

pared for source because a separate codebook is prepared for every parameter.

Simple scalar quantization techniques are used to prepare the codebooks for these

parameters. The process of encoding and decoding the speech using a vocoder

is shown in Figure 2.3. The speech input is fed frame by frame through an LP

analysis filter to get the system and source information in the form of LP coef-

ficient vector and residual respectively. The LPC vector is quantized using the

system codebook. The index corresponding to the codebook entry which matches

with the input spectral parameter vector is encoded. Similarly, for encoding the

residual, parameters like pitch, gain and voicing are quantized using the respec-

tive codebooks. High compression is achieved in vocoders because the system and

source information of the input speech is represented by one of the entries in the

respective codebooks and the indices corresponding to the entries are encoded.
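The codebook based quantization just described can be sketched as a toy vector quantizer trained with plain k-means over per-frame spectral vectors (the feature dimension, codebook size and random training data below are placeholders for illustration; scalar quantization of the source parameters such as pitch, gain and voicing is omitted):

import numpy as np

def train_codebook(vectors, size=16, iters=20, seed=0):
    # Plain k-means: the cluster centroids become the codewords.
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        for j in range(size):
            members = vectors[nearest == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook

def quantize(frame_vectors, codebook):
    # One codebook index per frame; these indices are what gets binary coded.
    d = np.linalg.norm(frame_vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

rng = np.random.default_rng(1)
training = rng.standard_normal((500, 10))          # stand-in for LSF/MFCC vectors
codebook = train_codebook(training, size=16)
indices = quantize(rng.standard_normal((40, 10)), codebook)   # one second of frames
print(indices, "-> each index needs", int(np.ceil(np.log2(len(codebook)))), "bits")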

The bit rates of vocoders depend on the codebook sizes. The calculation of the bit

rate for a typical vocoder is shown below:

Number of frames per second = N

Size of the system codebook = Cs

Size of the source codebook = Ce

Number of bits needed to encode the system,ms = dlog2(Cs)e

Number of bits needed to encode the source,me = dlog2(Ce)e

Total Bit rate = N × (ms + me) bps
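For example, plugging illustrative values into the expression above (the frame rate and codebook sizes are arbitrary examples, not those of any particular codec):

import math

N = 40                            # frames per second (25 ms frames)
Cs, Ce = 1024, 256                # system and source codebook sizes
ms = math.ceil(math.log2(Cs))     # 10 bits per frame for the system index
me = math.ceil(math.log2(Ce))     # 8 bits per frame for the source index
print(N * (ms + me), "bps")       # 40 x (10 + 8) = 720 bps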

Figure 2.3: Block diagram of a vocoder showing the process in (a) Transmitter and (b) Receiver

The decoding process at the receiver to synthesize speech follows the same series of

operations performed at the transmitter, but in reverse order as shown in Figure

2.3(b). The input bit stream is decoded to find the codebook indices. Using the

indices, the system and source parameters are obtained from the respective code-

books. The excitation signal is modeled using the quantized source parameters

obtained from the source codebook. An LP synthesis filter is formed using the

system information obtained from the system codebook. The excitation signal is

passed through the LP synthesis filter to give the synthesized speech output. The

quality of speech obtained from this type of coding is less. Since the LP synthesis

filter is formed from the quantized parameters, the speech synthesized using this

filter differs significantly from the original speech. The other reason for deteriora-

tion of speech quality in vocoders is the modeled excitation signal. The excitation

signal modeled from source parameters is very different from the residual obtained

from the LP analysis filter at the transmitter. Ideally, the excitation to the LP

synthesis filter should be identical to the residual, for high quality speech output.

The inability to model the excitation signal is due to the complex nature of the

residual signal. The closer the modeled excitation signal is to the residual, the

higher is the quality of the synthesized speech. The focus of research in vocoders

is to achieve a high quality speech at low bit rates. The emphasis is laid on better

clustering processes to compress the system and source information further, and

efficient modeling of the excitation signal.
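A minimal sketch of this synthesis step, assuming the decoded parameters are already in hand (the single-pole LP vector, pitch period, gain and voicing flag below are illustrative values, not parameters of an actual codec):

import numpy as np
from scipy.signal import lfilter

def synthesize_frame(lpc, pitch_period, gain, voiced, n=200, seed=0):
    # Build the excitation from the decoded source parameters ...
    if voiced:
        excitation = np.zeros(n)
        excitation[::pitch_period] = 1.0          # impulse train at the pitch period
    else:
        excitation = np.random.default_rng(seed).standard_normal(n)   # white noise
    excitation *= gain
    # ... and pass it through the all-pole synthesis filter 1 / A(z).
    return lfilter([1.0], np.concatenate([[1.0], -np.asarray(lpc)]), excitation)

frame = synthesize_frame(lpc=[0.9], pitch_period=80, gain=0.5, voiced=True)
print(frame[:5])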

There is a sub-class of coders referred to as ‘hybrid coders’ where an attractive

trade-off between waveform coding and vocoding is achieved, both in terms of

speech quality and transmission bit rate. They are also referred to as analysis-

by-synthesis coders. The Code Excited Linear Prediction (CELP) coder is an example of a hybrid coder. Hybrid coders attempt to fill the gap between the waveform and

source coders. As described above, waveform coders are capable of providing good

quality speech at bit rates of 16 Kbps, but are of limited use at rates below this.

Vocoders, on the other hand, can provide intelligible speech at 2.4 Kbps and below,


but cannot provide natural sounding speech. Hybrid coders use the same linear

prediction filter model of the vocal tract as found in LPC vocoders. However,

instead of using the pitch and voiced/unvoiced decisions to model the excitation

input to the LP synthesis filter, the excitation signal is chosen by attempting to

match the reconstructed speech waveform as closely as possible to the original

speech waveform as done in waveform coders. Such a process ensures a better

speech quality when compared to the vocoders.

2.4 Segment Vocoders

In segment vocoders, the system compression is performed at segment level instead

of at the frame level. Typically, a segment is much larger than a frame. The num-

ber of frames in a segment depends on the type of the segmental unit. However,

the source compression in segment vocoders is generally done at frame level to

account for the rapid variations in the source characteristics. The implementation

of a segment vocoder is carried out in the following four stages.

1. Segmentation:

Segmentation is done on utterances of a huge corpus of speech data to yield

a large number of variable length segments.

2. System codebook and source codebook preparation:

A system codebook is prepared using the large corpus of variable length

segments. Each entry in the codebook corresponds to a sequence of vectors

of Linear Prediction Coefficients (LPCs). A source codebook is also prepared

for the residual parameters.

3. Encoding and transmission:

The input speech utterance is segmented into variable length segments. Sys-

tem quantization is done on each of these segments using a system codebook.


The best-match codebook index is binary coded and transmitted. Some bits

are also allocated for the duration information of the input segment. The

residual is encoded using the source codebook.

4. Speech synthesis at the receiver:

The LPCs corresponding to the transmitted system codebook indices and

the excitation modeled from the quantized source parameters are used for

synthesizing the speech using the LP synthesis filter.

The block diagram that shows the working of a segment vocoder is shown in

Figure 2.4. The input speech is divided into variable length speech units called

‘segments’. For each segment, the system characteristics are quantized using the

system codebook. The residual from the LP analysis filter is encoded using the

source codebook. Since the segments are of variable length, the duration infor-

mation of each segment is also encoded. At the receiver, the decoder decodes

the codebook indices, obtains the system information from the system codebook

to form a sequence of LP coefficient vectors and models the excitation signal us-

ing the decoded source parameters. The excitation signal is passed through the

LP synthesis filter formed using LP coefficient vector sequence from the system

codebook to produce the speech output.

Figure 2.4: Block diagram of a segment vocoder showing the process in (a) Transmitter and (b) Receiver

The quality of synthesized speech is drastically reduced in the case of segment

vocoders because the unit of compression in segment vocoders is larger than a

frame. Phonemes and diphones are generally considered as the segmental units.

As speech is a quasi-stationary signal, the spectral parameters vary for every

frame. Units like phonemes have variations within the segment duration. The

method used for system codebook preparation for such units should be able to

capture the variations present within the segments. Hence, the focus of research

in segment vocoders is on the system compression aspect of coding. With a high

compression of system and source information, segment vocoders give a synthetic

quality speech at average bit rates much less than 2.4 Kbps. The average bit rate

calculation for a segment vocoder is given below:

Average number of segments per second = M

Frames per second used for source coding = N

System codebook size = Cs

Source codebook size = Ce

Number of bits required to encode the system, ms = ⌈log2(Cs)⌉

Number of bits required to encode the source, me = ⌈log2(Ce)⌉

Average bit rate = M × ms + N × me bps

Since the number of segments in a second of speech varies, the bit rate of segment

vocoders is expressed as the average bit rate.
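Plugging illustrative numbers into this expression shows where the savings of segment level coding come from (the values are examples only, not the configuration of the proposed vocoder):

import math

M, N = 5, 40                      # segments per second, frames per second
Cs, Ce = 1024, 256                # system (segment) and source codebook sizes
ms = math.ceil(math.log2(Cs))     # bits per segment index
me = math.ceil(math.log2(Ce))     # bits per frame of source parameters
frame_level = N * (ms + me)       # system index sent every frame (Section 2.3)
segment_level = M * ms + N * me   # system index sent once per segment
print(frame_level, "bps vs", segment_level, "bps")   # 720 bps vs 370 bps

Since the system index is transmitted once per segment rather than once per frame, the system contribution falls from N x ms to M x ms bits per second, which is the source of the bandwidth savings discussed earlier.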

2.5 Issues in Implementing a Segment Vocoder

The segment vocoders differ mainly in the method used for segmentation to obtain

the segment boundaries and the method used for the system compression and

the system codebook preparation. The residual (source) is coded using similar

techniques as used in frame vocoders. The naturalness in the quality of output


speech depends on the quality of modeled residual (excitation) at the receiver.

The following are the issues in the implementation of a segment vocoder:

1. Choice of segmental unit:

(a) The segmental unit should be large enough to enable better compres-

sion.

(b) The segmental unit should be small enough to be modeled effectively.

(c) The segmental unit should be obtained automatically.

2. Segmentation

(a) Depending on the segmental unit chosen, segmentation should yield the

boundaries of segmental units in an utterance.

(b) The segmentation algorithm should not be computationally expensive.

3. System codebook

(a) Segmental units have varying spectral characteristics within the seg-

ment duration. The clustering procedure used for codebook prepara-

tion should effectively capture the variations in the segment.

(b) The clustering algorithm should also be able to handle the differences

in the duration of the segments.

(c) The clustering mechanism should allow a high compression to generate

a codebook of small size for bit rate reduction.

Figure 2.5: Performance of different coders (speech quality versus bit rate in Kbps)

Figure 2.5 shows the summary of the performance of different types of coders

[1]. The figure shows the performance of different speech coders in terms of the

speech quality obtained at respective bit rates of operation. It can be seen that the

speech quality increases with an increase in the bit rate. The vocoders operating at

very low bit rates have a poor speech quality. Further, it is also seen that segment

vocoders are poorer in quality. The waveforms generated using PCM, ADPCM,

MELP and the proposed segment vocoder are incorporated under the directory

named ‘Chapter-2’ in the CD-ROM attached to the Thesis. The next section

presents a review of some approaches to build the segment vocoders. The review

gives a brief overview of the methods used for the segment vocoder implementation.

The performance of the coders is discussed in terms of the output speech quality

and average bit rates of operation. The detailed review of segmentation and

clustering techniques used in the segment vocoders is presented later in the

respective chapters.


2.6 A Review of Segment Vocoders

Roucos, et al., [2] have proposed an early segment vocoder that operates at

150 bps. Spectrally steady state regions of speech are chosen as segmental units.

An automatic segmentation method based on a heuristic algorithm is used to

search the segment boundaries. System codebook is generated using a binary

clustering technique based on K-means clustering. For the single speaker mode, a

bit rate of about 150 bps is obtained. Of these, 110 bps are used for system represen-

tation and 44 bps are used for encoding source parameters like pitch and gain for

each segment. The output speech signal is evaluated as intelligible but synthetic.

It is also observed that there are intonation changes in the output speech.

The poor speech quality of the above mentioned segment vocoder is avoided

by using template waveforms to synthesize the speech at the receiver [15]. The

encoding part of the waveform segment vocoder is similar to the earlier segment

vocoder. A system compression of 170 bps has been achieved. The information

regarding pitch, gain and duration of the input segment is encoded using another

130 bps to obtain an overall bit rate of 300 bps. During synthesis, the waveform

segment corresponding to the spectral parameter vector, called template wave-

form is considered instead of the matched spectral parameter vector from the

codebook. The template waveform segment is modified to match the correspond-

ing input parameters like pitch, gain and duration. The synthesis is performed

using the Residual Excited Linear Prediction (RELP) filter whose inputs are the

LP coefficient vector sequence and the residual of the modified template waveform.

The vocoder, when implemented for the speaker independent case, produced an output

speech quality better than the earlier vocoder with reduced buzziness. However,

the speech sounds choppy due to the discontinuities at the segment boundaries.

Shiraki, et al., [14] have developed a variable length segment vocoder based

on joint segmentation and quantization. The coding problem is to search for


the segment boundaries and the sequence of code segments so as to minimize

the spectral distortion measure for the given interval. The problem is solved

by a dynamic programming algorithm. An iterative algorithm for designing the

variable length segment quantizer is also proposed. When pitch information is also

coded, the coder resulted in bit rates of around 150 bps for a single male speaker

database. It has been shown that the coder offers sufficient intelligibility at such

low rates.

Bardenhagen, et al., [23] have used the fenone as the unit of segmentation. The quasi-

stationary regions in speech are termed as fenones. A system codebook is prepared

using 3-way split VQ technique over spectral envelopes. A harmonic speech model

is used to realize the coder. The vocal tract is represented by the spectral envelope.

In harmonic speech model, the spectral envelope is represented by the harmonic

magnitudes. The glottal excitation is modeled by a mixture of voiced and unvoiced

components. The voiced component is the impulse train with the period modified

based on the pitch lag. The unvoiced component is modeled using the white noise.

For a speaker independent case, the coder achieved good intelligibility and quality

at bit rates around 1.8 Kbps.

Siva Kumar, et al., [24] have developed a segment based MBE speech coder

at 1000 bps. A Multiband Excitation (MBE) speech model that provides natural

sounding speech and robustness to acoustic background noise is considered for very

low bit rate coding based on speech segmentation. An ML segmentation algorithm

is used to obtain the segment boundaries. The system codebook is prepared by

clustering LSF vector sequences of the segments using split-vector quantization.

When pitch, gain and voicing are coded, a bit rate of 1000 bps is achieved. Unlike

other segment coders, whose bit rate varies according to the segment rate, this vocoder has a fixed bit rate. The average PESQ score for the vocoder is evaluated

to be 2.69 for TIMIT database.

Apart from the vector quantization techniques, a new recognition/synthesis


paradigm has evolved in the implementation of segment vocoders where the system

codebook is prepared using parametric models such as Hidden Markov Models

(HMMs). The system codebook contains the parametric models built for the

segmental units.

Picone, et al., [3] have proposed a phonetic vocoder with phonemes as seg-

mental units. A system compression of less than 200 bps is achieved. The

codebook is built using the HMMs trained for phonemes. The models are trained

for each phoneme using contextually rich examples and transcriptions from the

TIMIT database. The trained models are also used to obtain the segment bound-

aries. Source information such as pitch and energy are also coded to retain the

naturalness in the output speech. However, the output speech is reported to have

synthetic quality.

Cernocky, et al., [5] developed a segment vocoder with units obtained from

an automatic segmentation algorithm. Segmentation uses an iterative algorithm

called temporal decomposition that gives the unigram, bigram and trigram units.

The units are modeled using the HMMs in the unsupervised mode. In the single

speaker mode, the output speech is reported to be intelligible at 211 bps. The

system compression is done at 120 bps. Source parameters are not modeled except

for energy which is varied according to the energy of the input segment. The

output quality is observed to be good for short words or digits. The intelligibility

is reduced as the length of the input segments is increased. Good intelligibility is

obtained for segments whose length is close to that of phonemes.

Felici, et al., [25] proposed a diphone based segment vocoder for the Italian language, which operates at bit rates of 300 bps. The codebooks are prepared for each speaker separately. Diphone segmentation is carried out using a neural network

based diphone recognizer. The codebook of diphone segments is prepared in a

supervised mode. For each diphone, the codebook contains examples featuring

energy and pitch close to the average values of that class. No particular clustering


mechanism is used for codebook preparation. The grouping of diphone examples

is done until the segment distortion measure is below a particular threshold. The

source information such as pitch and energy is coded separately. The speech quality

is assessed using Mean Opinion Score (MOS) which is around 2.

Ismail, et al., [7] proposed a segment vocoder where a continuous speech rec-

ognizer is used to transcribe the incoming speech as a sequence of subword units

termed as acoustic segments. Prosodic information is combined with the segment

identity to form a serial data stream suitable for transmission. A rule-based system

maps the segment identity and the prosodic information to parameters suitable for

driving a parallel formant speech synthesizer. Acoustic segment HMMs are used

to build a recognizer. A segment error rate of 3.8% was achieved in a speaker-

dependent and task-dependent configuration. An average data rate of 262 bps was

obtained.

Tokuda, et al., [6] proposed a vocoder using HMM based synthesis. The basic segmental unit chosen is the phoneme. The system codebook is a phoneme recognizer built using a large phoneme database. At the receiver, instead of obtaining the spectral parameter vector from the codebook, it is generated using the HMM corresponding to the codebook index. The synthetic quality speech is obtained using a Mel Log Spectrum Approximation (MLSA) filter formed from the generated spectral parameter vectors. The coder operates for a single speaker at a bit rate of 150 bps. Hoshiya, et al., [8] improved the performance of the above vocoder by transmitting extra information regarding formants and state duration. The coder is again tested in a single speaker mode.

Low bit rate coders are also developed based on Text-to-Speech (TTS) synthe-

sis scheme. In a TTS system, a large database of waveform units is prepared based

on the text information. Synthesis is done by concatenating the waveform units

corresponding to the input text. Prosody modifications are later incorporated on

the synthesized speech to make the speech sound natural. In a TTS based coder,


this large database of waveform units can be seen as a codebook. The earliest

work in this direction is by Benbassat, et al., [26], where a text message and spo-

ken utterance are jointly used to provide a TTS input stream. A small number

of prototypes for pitch patterns and duration patterns are also included in the

codebook for prosody coding. The codebooks are prepared using a single male

speaker database. The bit rates achieved are around 150 bps. The output speech

quality is reported to be intelligible. A similar kind of work is reported by Vepyek, et al., [27], where a TTS system is used to generate synthetic speech from text and then speech conversion is used to convert the voice characteristics, including the speaking style and emotion. The coder is reported to operate at 300 bps. The two coders mentioned above require a transcription of the input speech sentence for coding. The transcription is carried out at the phoneme level. To improve on

this, a speech coder based on automatic speech recognition and TTS synthesis is

developed. Chen, et al., [28] used the HMM based phoneme recognition and Pitch

Synchronous Overlap Addition (PSOLA) based TTS for the implementation of

speech coder at 750 bps. The individual segments recognized by the HMMs are

quantized using a phonetic inventory. The output speech quality is reported to

have a MOS score of 3.0. Later Lee, et al., [4] developed a unit selection based

TTS coder operating at 1000 bps on a single speaker database, whose speech

quality is comparable to that of a MELP coder. A large TTS labeled database is

used as the codebook. The codebook contains the identities of phonemes, their

durations and their pitch contours. The synthesis performance is increased using

an acoustic target cost function and a concatenation cost. The coder is reported

to have a speech quality close to that of a conventional MELP coder at 2.4 Kbps.

A modified version of the coder [29] is also proposed where the input speech signal

is segmented using a joint segmentation/classification scheme. The segments are

coded and synthesized using the TTS based coding technique described in [4]. In

single speaker tests, the coder gave an intelligible and natural sounding speech at

an average bit rate of about 580 bps.


It may be noted that almost all the above segment vocoders are implemented

for the single speaker case. The segmentation techniques are designed to obtain

steady state regions in speech which eventually turn out to be phone like units.

The automatic segmentation techniques use high complexity algorithms which are

mostly iterative in nature. Moreover, the segmentation methods are not capable

of generating segments which are larger than phonemes. For example, in [5] multi-

gram units that are larger than phonemes are considered. But it is observed that

intelligibility is preserved only when the units are close to phonemes. Segmen-

tation is also accomplished using parametric models trained over the segmental

units. But, this also requires the models to be prepared using a sufficient number of examples. The models should also be able to capture the prosody variations

in the same segmental unit. If HMMs are used for obtaining segmental units (as

in CSR), a large amount of data is needed to train the models properly. Since the segmental units have prosody variations, HMMs should be trained for a wide range of prosody variations within a segment. Only then will the models be able to recognize the segment properly in the input speech utterance. In the coders which

use joint segmentation and recognition (while encoding), the models are built in a

highly supervised manner. In [5], an unsupervised clustering is used for modeling

HMMs but prior to that vector quantization is performed on the automatically

generated segmental units to group similar units into one cluster. This ensures

that the HMM is trained with sufficient number of examples of the same segmen-

tal unit. The models may fail to recognize the boundaries in case of the prosody

variations in the segments which are not captured by the HMMs while training.

Moreover, it is also observed that in all the vocoders, the emphasis was laid more on system compression than on the modeling of source characteristics. Most of the

system compression techniques involve clustering under supervised conditions. In

the coders, where the source characteristics are modeled well, the output speech

preserves the naturalness. Attempts are being made to get the synthesized speech

as natural as possible by coding the pitch, gain and fundamental frequency (F0)

characteristics for each segment. When source characteristics are modeled well, the bit rates of the vocoders are comparatively higher than in the coders without

sophisticated source modeling. It is also observed that when the units of com-

pression are larger, source characteristics have to be modeled well, to preserve the

quality of the output speech.

2.7 Syllable based Segment Vocoder

Based on the observations made, the objective of the present work is to develop a

vocoder which

• provides a simple and flexible automatic segmentation algorithm that is ca-

pable of obtaining segmental units without any prior information.

• develops a system codebook without supervision, requiring no transcription or knowledge about the segmental units for clustering.

• works in a speaker independent environment and produces natural sounding speech.

With these objectives, in the proposed segment vocoder, a much larger unit, the syllable, is chosen as the segmental unit. A simple and flexible signal processing based algorithm, the group delay based segmentation algorithm proposed earlier for obtaining syllable boundaries [30], is used. Since the chosen segmental unit is large, two similar segments can have significant prosody variations. Therefore, modeling the parameters of such a large segment is an issue. Vector quantization techniques that

are popularly used for clustering may not be suitable because the methods cannot

capture the sequence information which is crucial in the case of larger segments

like syllables. Also, clustering is difficult due to the huge duration mismatch

among similar units. The HMM based parametric modeling is one alternative

that can be used to model the syllable like units. Parametric modeling using

HMMs for syllable like segments has been successfully used in Automatic Speech Recognition (ASR) and TTS based synthesis systems [30][31][32][33][34]. In the

present work, the emphasis is laid on the system compression aspect. The system

is compressed to less than 100 bps using an unsupervised HMM based clustering

algorithm [19]. For source compression, MELP is used as it preserves the

necessary source characteristics for a better quality of the synthesized speech. In

[35][36], the importance of good residual modeling for natural quality speech is

discussed. It is argued that for good quality speech the pitch, gain, voicing and

jitter are to be modeled well. MELP generates a mixed excitation signal which

is close to the residual of the input signal. Also, in order to check whether the

proposed segmentation and clustering techniques work, a well modeled

residual is needed. Hence, in this work the residual is modeled using MELP. This

results in a bit rate of 1.4 Kbps.

2.8 Summary

In this chapter, an introduction to speech coding methods is given. Different

coding schemes are discussed in detail. High bit rate speech coders yield very

good speech quality and are less complex. Vocoders, which revolutionized present day communication, are made to operate at much lower bit rates while still preserving the intelligibility of speech. Segment vocoders form a class of vocoders in which compression is done at the segment level, thereby reducing the bit rates further. Due to high compression, segment vocoders produce synthetic quality speech. Implementation of segment vocoders addresses issues regarding the automatic segmentation and clustering methods. The need for a segment vocoder that uses a flexible segmentation algorithm and an unsupervised clustering algorithm for system codebook preparation motivated the formulation of the proposed approach to build a segment vocoder.


CHAPTER 3

SPEECH SEGMENTATION

3.1 Introduction

Segmentation is the process of dividing a speech utterance into small predefined

units. Any segment vocoder operation starts with the segmentation task. Seg-

mentation is used twice in the implementation of a segment vocoder. In preparing

the system codebook, a large corpus of speech data is used to obtain the segments.

Once the codebook is prepared, it is fixed for the vocoder. During the encoding

operation of the vocoder, the speech utterance is segmented into predefined units.

In this chapter, the importance of automatic segmentation and an overview of

the earlier approaches to segmentation are presented in Section 3.2. The need for

choosing syllable as the segmental unit and the proposed automatic segmentation

algorithm called group delay based segmentation are explained in Sections 3.3

and 3.4 respectively. The chapter is concluded with Section 3.5 by presenting the

results obtained using the segmentation algorithm.

3.2 Segmentation in Segment Vocoders

In segment vocoders, the initial step towards encoding the speech sentence is to

divide it into segments. Hence, the development of an automatic segmentation

algorithm is necessary [11][37][38][39]. As discussed in Section 2.5, the segment

vocoders differ in the type of segmentation method used. The method of segmen-

tation depends on the type of segmental unit chosen. Prior to the development

of segment vocoders, compression of speech was done at frame level where each

frame is typically of 25 millisecond duration. Going beyond the frame level, a clear

definition of the segment is necessary to obtain segment boundaries. Phonemes

and diphones are the most common segments chosen for compression. Depend-

ing on the segmental unit chosen, several techniques for segmentation have been

proposed in the literature.

In [3][6][7][8], vocoders are built with phone as the segmental unit. Here, the

TIMIT database is used where the transcriptions are provided for phones. Using

these transcriptions, the HMMs are trained on the entire database and are then used to recognize the phone sequence as in continuous speech recognition.

Cernocky, et al., [5] found the segmental units automatically using an iterative

technique called temporal decomposition and vector quantization. The iterative

technique resulted in the units which are eventually phonetic in nature. The

transcriptions are prepared using the boundary information obtained.

Felici, et al., [25] have proposed a neural network based segmentation algorithm

to get the diphone boundaries. A neural network is trained with 30 examples of

each diphone. The input speech is decoded against the neural network model

to track the boundaries. Here again, the neural network model is trained in a

supervised manner.

The above segment vocoders use the parametric models to obtain the segment

boundary information. The segmentation and segment quantization are performed

in a single step. Vocoders are also developed wherein the segmentation and seg-

ment quantization are done in two different steps. In such vocoders, the automatic

segmentation is performed on the input speech to get desired segments and the

obtained segments are later used for codebook design and segment quantization.

Bardenhagen, et al., [23] introduced the fenone as a segmental unit, which is

defined as an acoustically homogeneous unit of speech. Quasi-stationary regions

of speech are represented by fenones. It is shown that fenones can capture fine


discriminatory differences in speech. An automatic segmentation algorithm uses

the frame-based correlation measure to group the frames based on their quasi-

stationary acoustic characteristics.

Roucos, et al., [2][16] proposed an early segment vocoder where the segmental

unit is a region between the middle points of two consecutive steady state regions.

The segmentation algorithm is a heuristic algorithm that uses a set of thresholds

on two spectral derivatives to determine the spectral steady state regions in the

input speech.

Ramasubramanian, et al., [11] discuss different automatic segmentation tech-

niques to derive phone and diphone boundaries. In their work, they used two

automatic segmentation techniques - Maximum Likelihood (ML) segmentation

and Spectral Transition Measure (STM) based segmentation. It is shown that the

phone like units from the ML segmentation outperformed the diphone like units

obtained using the STM based segmentation. These two methods are popularly

used to obtain the segment boundaries automatically. A brief overview of the two

techniques is presented below.

3.2.1 Maximum Likelihood (ML) Segmentation

The ML segmentation assumes piecewise stationarity of speech as the criterion

to obtain segments which are acoustically homogeneous within their boundaries.

Homogeneity here refers to the similarity in spectral properties of the speech signal.

Speech is a quasi-stationary signal. For all practical purposes, speech is assumed

to be stationary for a duration of about 25 milliseconds, which is often termed

as a ‘frame’. In other words, it is assumed that the spectral properties of speech

do not vary significantly. Similarly, the ML segmentation checks for homogeneity

within the segment, i.e., absence of significant variation in the spectral properties

within the segment duration. Since a segment is made up of more than one


frame, there will be spectral differences from one frame to the other. The ML

segmentation looks for segments where the overall spectral distortion within the

segment is small. The spectral distortion is measured in terms of ‘intra segmental

distortion’, given by the sum of distances from the frames that span the segment,

to the centroid of the frames comprising the segment. The distance is computed

between the feature vectors that represent the spectral properties of each frame.

The ML segmentation algorithm [11] is as follows:

1. The speech utterance of T frames is represented by a sequence of T vectors

X = (x1, x2, · · · , xT ) where xn is a p-dimensional parameter vector for the

nth frame.

2. The aim of the ML segmentation algorithm is to find m consecutive segments

in the observation sequence X. Here, m is predefined based on the desired

segment rate.

3. Initially, some m boundaries are arbitrarily marked. The boundary sequence

is denoted by B = (b0, b1, · · · , bm), where bm refers to the boundary at the Tth frame.

4. The distortion measure is defined in terms of a distance measure between a

frame in the segment and the centroid of the frames comprising the segment.

For the Euclidean distance measure, the centroid is the average of all the

frames in the segment. The total distortion measure is computed for all the

frames within the segment.

5. The solution is obtained by dynamic programming and the boundaries are

recovered by backtracking after the first pass, as in Viterbi decoding. This is termed the optimal boundary sequence. The intra segmental distortion

for a typical boundary sequence is given by

D(B) = \sum_{i=1}^{m} \sum_{n=b_{i-1}+1}^{b_i} d(x_n, \mu_i)            (3.1)

where D(B) is the total distortion of an m-segment segmentation of X, µi is the centroid of the ith segment and d(xn, µi) is the distortion measure between the nth frame xn and the centroid of the ith segment, µi.

6. The optimal segment boundaries are obtained using a dynamic programming

algorithm given in [38].
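To make the procedure concrete, a minimal Python sketch of the dynamic programming search is given below. It assumes the Euclidean distance measure of step 4 (so the centroid is the segment mean); the function name, interface and the O(mT^2) search are illustrative and are not taken from [11] or [38].

import numpy as np

def ml_segmentation(X, m):
    """ML segmentation of a T x p feature sequence X into m contiguous
    segments, minimising the total intra-segment distortion D(B) of Eq. (3.1)
    with a Euclidean distance measure (centroid = segment mean)."""
    X = np.asarray(X, dtype=float)
    T = len(X)

    # Prefix sums let the distortion of any candidate segment be evaluated cheaply.
    prefix = np.vstack([np.zeros(X.shape[1]), np.cumsum(X, axis=0)])
    prefix_sq = np.concatenate([[0.0], np.cumsum((X * X).sum(axis=1))])

    def seg_cost(s, e):
        # Distortion of a segment covering frames s .. e-1:
        # sum ||x_n - mean||^2 = sum ||x_n||^2 - n_frames * ||mean||^2
        n = e - s
        mean = (prefix[e] - prefix[s]) / n
        return (prefix_sq[e] - prefix_sq[s]) - n * float(np.dot(mean, mean))

    INF = float("inf")
    D = np.full((m + 1, T + 1), INF)      # D[i][t]: best cost of i segments over frames 0..t-1
    back = np.zeros((m + 1, T + 1), dtype=int)
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for t in range(i, T + 1):
            for s in range(i - 1, t):
                c = D[i - 1][s] + seg_cost(s, t)
                if c < D[i][t]:
                    D[i][t], back[i][t] = c, s

    # Backtrack the optimal boundary sequence b_0 = 0 < b_1 < ... < b_m = T.
    boundaries, t = [T], T
    for i in range(m, 0, -1):
        t = back[i][t]
        boundaries.append(t)
    return boundaries[::-1], D[m][T]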

The ML segmentation results in the phoneme boundaries when m is fixed to

the phone rate of speech. However, sometimes the resulting segments are as short

as one frame or as long as one word. This makes modeling the segments a difficult

task. In order to control the duration of segments formed, a duration constraint

is also added to the ML segmentation algorithm.

3.2.2 Spectral Transition Measure (STM) based Segmen-

tation

The STM based segmentation is the earliest method used for speech segmentation.

The segmentation is based on the principle of measuring the spectral deviation at

every frame instant. For phone like units, boundary is detected when there is a

large spectral deviation between two frames. The STM segmentation gives both

phone and diphone boundaries. Many algorithms for STM based segmentation

are available. One of these algorithms is described here [11].

1. The STM is defined as follows: Let xn be the parameter vector for the nth

frame. The STM at frame n, di(n), is given by di(n) = ||xn − xn−i||^2.


2. d1(n) as a function of n is the distance between the spectral parameter vector

for frame n and its preceding frame n − 1.

3. d3(n) gives a smooth measure of the spectral derivative.

4. The plot of spectral derivative will have peaks at the places where the spec-

tral transitions are significant and valleys near the steady state regions.

5. Successive peaks of d1(n) or d3(n) locate the phone boundaries and successive

valleys mark the diphone like units.

An extrema picking algorithm [39] is used to pick the peaks and valleys effec-

tively. But, the algorithm uses a threshold δ to detect the peaks which is optimized

for a desired segment rate.
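As a simple illustration (not the exact algorithm of [11] or [39]), the sketch below computes di(n) for a feature-vector sequence and applies a naive neighbour test together with the threshold δ; the helper names and the crude extrema test are assumptions of the sketch.

import numpy as np

def stm(X, i=3):
    """d_i(n) = ||x_n - x_{n-i}||^2 for n >= i; the first i values are set to 0."""
    X = np.asarray(X, dtype=float)
    d = np.zeros(len(X))
    d[i:] = ((X[i:] - X[:-i]) ** 2).sum(axis=1)
    return d

def pick_extrema(d, delta):
    """Crude extrema picking: a frame is a peak if it exceeds both neighbours
    and the threshold delta (candidate phone boundary); a frame smaller than
    both neighbours is a valley (candidate diphone-like unit)."""
    peaks, valleys = [], []
    for n in range(1, len(d) - 1):
        if d[n] > d[n - 1] and d[n] > d[n + 1] and d[n] > delta:
            peaks.append(n)
        elif d[n] < d[n - 1] and d[n] < d[n + 1]:
            valleys.append(n)
    return peaks, valleys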

3.3 Syllable Segmentation

The review of segmentation techniques in the previous section suggests that the techniques for segmentation can be broadly classified into two categories. In the

first category, the parametric models are built for each segmental unit. This re-

quires that the models be built using a large training dataset. The second category

uses automatic segmentation algorithms that continuously try to find the regions

of interest in speech based on some rules. The algorithms used are either iterative

in nature as in ML segmentation or use predefined thresholds as in STM seg-

mentation that have to be optimized to obtain correct boundaries. Further, some

parameters in the algorithms are assigned values based on the segment rate. The

values need to be optimized for different segment rates. In 2-pass ML segmenta-

tion, the segment rate is fixed to get the boundaries close to phoneme boundaries.

But, the phoneme rate can vary quite significantly from one speech utterance to

another. Also, the techniques are computationally intensive. So, there is a need

to develop a simple and flexible algorithm capable of giving the desired segments.


In the proposed segment vocoder, the syllable is chosen as the segmental unit. The primary reason for choosing the syllable is that a signal processing algorithm called group delay based segmentation can be used to get the boundaries. The syllable is a larger unit than the phoneme or diphone, and hence enables more compression. Also, it has been suggested that the syllable has better representational and durational stability relative to the phoneme [40]. Systems for Automatic Speech Recognition (ASR)

and Text to Speech (TTS) synthesis that use syllable as a basic unit have been

developed successfully for Indian languages which are syllable timed in nature

[30][31][32][33][34].

Before going into the working of the proposed segmentation algorithm for ob-

taining syllable boundaries, it is necessary to study the nature of a syllable seg-

ment. A syllable consists of three major parts - onset, nucleus and coda. Generally,

the nucleus is a vowel and the other two are consonants. Spectrally, a syllable dis-

plays a high energy region of the vowel surrounded by low energy regions of the

consonants. This property forms the basis for the boundary demarcation of a

syllable in a given speech utterance.

3.4 Group Delay based Segmentation

Since an automatic segmentation algorithm is important in the implementation of

a segment vocoder, a less complex algorithm is needed. The method for segment-

ing speech into syllable like units uses the short term energy (STE) function of the

speech signal. But the local fluctuations in the consonant regions result in spu-

rious boundaries. It is shown in [41] that if the signal is a minimum phase signal,

the group delay computed on the spectrum will resolve the peaks and valleys well.

The peaks and valleys in the group delay function of the signal will now correspond

to the peaks and valleys in the short term energy function. If we consider a speech

utterance as made up of syllables, the valleys in the energy spectrum correspond


to the syllable boundaries. If group delay is computed on such a signal, the valleys

of the group delay function correspond to the syllable boundaries. To resolve the

boundaries better, the inverted group delay is computed where the peaks mark

the syllable boundaries. The group delay based method for segmentation [30] is

as follows:

1. Let s(n) be the speech signal.

2. Short term energy (STE), E(n), is calculated with overlapped windows for

s(n).

3. This energy function is viewed as some arbitrary magnitude spectrum E(K).

4. Since any real signal should have even symmetry in its magnitude spectrum,

E(K) is symmetrised along the Y-axis. Now, the magnitude spectrum E(K)

of a real signal is obtained.

5. E(K) is inverted to get Ei(K). This step reduces the dynamic range and

thus prevents large excursions near the peaks.

6. IDFT of the sequence Ei(K) is computed. The resultant signal ei(n) is

called the root cepstrum. The causal portion of this signal, which has the

properties of a minimum phase signal, is used to compute the group delay

function.

7. Group delay is computed using a certain window length ‘Nc’ called the cep-

stral lifter window. Let the group delay function be Egd(K).

8. Now, the locations of the positive peaks of Egd(K) give the approximate

syllable boundaries.
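A rough Python sketch of steps 1 to 8 is given below for a signal x sampled at rate sr. The window and hop sizes, the cepstral lifter length Nc, the small flooring constants and the naive positive-peak test are illustrative assumptions; the tuned settings and refinements used in [30] are not reproduced here.

import numpy as np

def group_delay_boundaries(x, sr, win_ms=20.0, hop_ms=10.0, Nc=100):
    """Group delay based estimation of approximate syllable boundaries
    (returned as frame indices), following steps 1-8 above."""
    # 1-2. Short term energy E(n) with overlapping windows.
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    ste = np.array([np.sum(x[n:n + win] ** 2)
                    for n in range(0, len(x) - win, hop)])
    N = len(ste)

    # 3-4. View E(n) as a magnitude spectrum and enforce even symmetry,
    #      as required of the magnitude spectrum of a real signal.
    E = np.concatenate([ste, ste[::-1]])

    # 5. Invert to reduce the dynamic range (a small floor avoids division by zero).
    Ei = 1.0 / (E + 1e-12)

    # 6. IDFT gives the root cepstrum; its causal portion behaves like a
    #    minimum phase signal.
    e = np.real(np.fft.ifft(Ei))
    c = e[:N][:Nc]                       # causal portion, liftered to Nc samples

    # 7. Group delay of the liftered sequence on the same K axis:
    #    Egd(K) = Re{ X(K) conj(Y(K)) } / |X(K)|^2, with Y = DFT{ n * c[n] }.
    n = np.arange(len(c))
    Xf = np.fft.fft(c, 2 * N)
    Yf = np.fft.fft(n * c, 2 * N)
    gd = np.real(Xf * np.conj(Yf)) / (np.abs(Xf) ** 2 + 1e-12)
    gd = gd[:N]                          # aligned with the N short term energy frames

    # 8. Positive peaks of Egd(K) mark the approximate syllable boundaries.
    return [k for k in range(1, N - 1)
            if gd[k] > 0 and gd[k - 1] < gd[k] > gd[k + 1]]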

Figure 3.1: Group delay based segmentation of a speech signal

The block diagram of the group delay based segmentation method is given in Figure 3.1. Though the group delay segmentation on speech is successful in locating many syllable boundaries, 30% of boundaries are missed [42]. It is also

shown that the group delay based segmentation [12][43] on shorter segments of

speech resulted in more accurate boundaries. Therefore, in order to retrieve the

boundaries missed in the single level segmentation, a duration criterion is employed

to find the segments of speech where the boundaries might be missing. After the

first level of segmentation, segments that are larger than the expected duration

for a syllable are segmented again [32]. It is found that the two-level segmentation

improves the segmentation performance by 3 − 5% [32]. For the sake of completeness, Table 3.1 is reproduced from [32], which shows the performance of single and two-level GD based segmentation. The two-level group delay based segmentation of a speech

utterance is illustrated in Figure 3.2.

Table 3.1: Syllable boundaries with single and two-level segmentations

Speaker Type         Tamil Female        Tamil Male          Hindi Female
Segmentation type    1-level   2-level   1-level   2-level   1-level   2-level
Syllable count       14268     14787     6931      7585      7688      7865
Insertions           352       394       66        165       83        52
Deletions            1464      987       1390      835       1450      1275
Performance          88.19%    91.02%    82.36%    87.89%    83.99%    85.18%

Figure 3.2: Illustration of the two-level group delay based segmentation (a) Waveform of the speech signal for a sentence (b) Result of the first level of segmentation and (c) Result of the second level of segmentation

Figure 3.2(a) shows a Tamil sentence with the corresponding syllable tran-

scription on top of it. Figure 3.2(b) shows the first level of segmentation where

the vertical lines represent the boundaries. Figure 3.2(c) shows the second level

of segmentation of three words /mudala/, /urAlyA/ and /tinnAr/ for which the

boundaries are missed in the first level of segmentation.

3.5 Syllable Segmentation for the Syllable based

Segment Vocoder

The two level group delay based segmentation method is used to get the syllable

segments for the proposed segment vocoder. Initially, a large number of syllable

segments are obtained for preparation of the system codebook. For this purpose,

three databases are created using Tamil and Hindi languages. The databases are

taken from the Database for Indian Languages (DBIL) [44]. The DBIL consists

of news bulletins in eight Indian languages. The first database includes the data

for a single female speaker, and the second database includes the data for four male

and four female Tamil speakers. The third database is a multi speaker and multi

language database. 12 speakers are considered for preparing the database in Tamil

and Hindi languages. The two-level group delay based segmentation for the three

databases resulted in 1683, 9465 and 17739 syllable segments respectively. An

example Tamil sentence, the syllable like units obtained using group delay based

segmentation and the duration file that shows the segment boundaries are incorpo-

rated under the directory named 'Chapter-3' in the CD-ROM attached with the

Thesis. The syllable segments are clustered for preparation of the system code-

book. Segmentation is also used during the encoding operation of the segment

vocoder, where an input sentence is segmented into syllable like segments.


3.6 Summary

In this chapter, the importance of segmentation in the implementation of a seg-

ment vocoder is discussed. It is observed that the method of segmentation depends

on the choice of segmental unit. Most of the segmentation techniques either use

parametric modeling or some automatic segmentation techniques based on spectral

variation. Parametric modeling requires models to be built on a large database.

The models should be able to capture the variations among the segmental units

to obtain the segment boundaries. As the segment becomes large, the variations

among the same segmental units also increase. The second type of techniques that

consider the spectral variation among adjacent frames to detect the boundaries

use iterative algorithms. In such techniques, some of the parameters are tuned in advance to obtain the segment boundaries. It is seen that the phoneme has been extensively used as the basic segmental unit of compression. Earlier techniques for

segmenting the speech signals are statistical or iterative in nature. In the present

work a simple signal processing based algorithm called group delay based segmen-

tation is used. The algorithm is shown to successfully locate the syllable segment

boundaries in a given speech signal. Since a larger unit allows better compression,

the syllable is chosen as the unit of compression.


CHAPTER 4

SYSTEM CODEBOOK PREPARATION

4.1 Introduction

Segment vocoders differ mostly in the type of segmental unit chosen, segmentation

method and system codebook preparation as discussed earlier. In the present

work, a syllable is chosen as the basic segmental unit for compression. The two-

level group delay based segmentation algorithm is used to obtain the syllable like

segments. In order to compress the system characteristics and prepare a system

codebook, an unsupervised HMM based clustering algorithm [19][45][46] is used

to form the clusters of syllable segments. Each cluster consists of acoustically

similar syllable segments. A ‘representative’ syllable segment is identified from

each cluster to build the system codebook.

The need for system compression is discussed in Section 4.2. An overview

of clustering techniques is given in Section 4.3. The motivation for the clustering method used and the system codebook preparation are presented in Sections 4.4

and 4.5 respectively.

4.2 Need for System Compression

Vocoders are built based on the fact that the speech signal has redundancy. Adjacent frames of a speech signal resemble one another in their spectral characteristics (features).

Vocoders exploit the similarities in these features to compress the system. Gen-

erally, MFCCs, LPCCs or LSFs are used to represent the spectral characteristics.

Speech is a random process, and hence the spectral features keep changing. Since

the system codebook is prepared using the spectral features, there is a need to

design a sophisticated clustering algorithm that captures the similarities and dis-

similarities in the spectral features to the maximum possible extent and at the

same time allows good compression to reduce the bit rates.

In the segment vocoders, the segmental units are compressed instead of frames.

Compression at the segment level is possible because the same sounds occur at

different instants due to the finite structure of a language. But, the same segments may sound different depending on the context. Hence, the spectral properties of the same sound units are not identical. In addition, the durations of the segments differ. Therefore, the clustering process should consider variable length

segments and still should be able to cluster the similar sound units.

4.3 Review of Clustering Algorithms

Many clustering techniques are employed for system codebook preparation. The majority of these methods are based on vector quantization (VQ) and its advancements, or on parametric modeling using HMMs or neural networks.

Roucos, et al., [2] used a two-pass vector quantization based clustering

algorithm. In the first pass, called binary clustering, the clusters are formed and

in the second pass templates are selected for each cluster. In the clustering phase,

the training examples are divided into two clusters using the K-means clustering

algorithm and each cluster is represented by its mean vector. Later, the cluster

which has the largest total quantization error is subdivided into two clusters. Eu-

clidean distance is used as the distance measure. The process is repeated until

the desired number of clusters are formed. For selecting the template of each

cluster, either the mean vector or the vector which is closest to the mean vector

is selected. The codebook is built using the templates of the clusters. The au-

thors also proposed a waveform segment vocoder [15] in which templates are the


waveform segments corresponding to the spectral vectors. The source characteris-

tics of the template waveform are changed according to the input segment source

characteristics during synthesis.

Shiraki, et al., [14] developed a segment vocoder using the variable length seg-

ment quantization. Here, the variable length code segments are obtained by time-

warping the fixed length code segments. A dynamic programming algorithm is used

to determine the best codebook segment with the desired duration. The speech

parameter sequence is generated by concatenating the variable length spectra in

such a way that the spectral distortion between the original and the synthesized

speech is minimized.

Bardenhagen, et al., [23] introduced fenones as the basic segmental units for

coding. Fenones are defined as acoustically homogeneous units of speech. The

system quantization is accomplished using a 3-way split vector quantization for

LSF parameters of the fenonic unit.

After the development of speech recognizers based on parametric modeling

techniques like HMMs and neural networks, the recognition/synthesis paradigm

has evolved for the design of segment vocoders. The units for modeling in speech

recognizers like phones and diphones are used as units for compression in the seg-

ment vocoders. The system codebooks are prepared using the trained parametric

models of segmental units.

In [3], Picone, et al., introduced a phonetic vocoder with phonemes as seg-

mental units. To model the spectral characteristics of phonemes for codebook

preparation, a high performance HMM based recognition scheme is used. In the segment clustering technique, the seed models for each phoneme are built using the TIMIT

transcription. For each phoneme, codebooks of size 2, 4 and 8 are built using

contextually rich phonemes. Once the initial phoneme models are built, they are

re-estimated using the entire training database in a supervised mode.


In [25], Felici, et al., proposed a diphone recognizer as the codebook for en-

coding diphone segments. Here, each diphone cluster is referred to as a class. To

form a diphone class, contextually rich examples of diphones are clustered until

the segment distortion measure within the cluster is below a threshold value. For

a large diphone class, multiple templates are identified based on the duration and

prosody differences. For a small diphone class, a single template that represents

the diphone class is selected.

Cernocky, et al., [5] also used the HMM based training method to build the

codebook for units that are larger than phonemes. The models are obtained in two

phases - first generation and next generation. In the first generation, the HMMs

are built using the labeled training corpus of automatically derived segments.

In the next generation phase, models formed in first generation are re-estimated

using more number of examples. The HMMs formed after a few iterations of re-

estimation are used for codebook preparation. Each of these final HMMs contains

8 examples. While matching the input segment, the first 4 examples (out of

8 examples) whose durations are well matched with the duration of the input

segment are chosen. Out of the 4 examples, the example that has the minimum

Dynamic Time Warping (DTW) distance to the input segment is chosen as the

template segment for synthesis.

In [6] [8] [47], a phoneme recognizer is used as codebook. The codebook is gen-

erated from phonetically balanced sentences. No specific clustering procedure is

used for HMM training. Models are trained using the examples identified for each

phoneme. A total of 34 phonemes are identified for modeling. No template vector

of waveform is selected for a model as mentioned in earlier methods. Instead, an HMM based synthesis technique is developed where the spectral parameter vector

sequence for a given duration is generated using the model parameters. The spec-

tral vector sequence of the recognized phoneme derived from the corresponding

HMM is used for synthesis.


Another way to represent the system codebook is to use a large TTS labeled

database [26][27][28][4]. The identities of the segmental units (generally phonemes),

their durations, their pitch contours and all speech coding parameters are included

in the database. Different algorithms are developed to choose the appropriate unit

to represent the input speech segment. For example, Dynamic Programming (DP)

is applied to unit selection [4] in which two cost functions, an acoustic target cost

and a concatenation cost, are optimized while choosing the best representative

units for the input segmental units.

4.4 Motivation for HMM based syllable segment

clustering

From the overview given in the previous section, it is seen that the clustering

techniques either use VQ or parametric modeling. The reason for choosing the

parametric modeling over VQ for syllable segment clustering is already discussed

in Section 2.7. The sequence information which is crucial in modeling larger

segments like syllables is not captured in VQ. A modified K-Means algorithm for VQ can be used for clustering. In that method, the distance matrix is prepared using the Dynamic Time Warping (DTW) distance between the segments. Since DTW is a preliminary step in developing HMMs, HMM based parametric modeling is chosen

over other kinds of clustering including VQ. But, it is seen that the parametric

modeling is mostly done in a supervised mode. Supervised training consumes a

considerable amount of time and effort to segment and label the syllable segments

manually before clustering. Also, there is a need to search for a sufficient number of examples before training the HMMs. The proposed method of clustering for

system codebook preparation is based on the fact that the primary motive of

any segment vocoder is to convey the information at the receiver without an

objectionable loss in intelligibility. Therefore, even if there is some error in syllable


recognition while encoding, it should not affect the intelligibility of the synthesized

speech at the receiver.

To justify the proposed method of clustering, an example of a speech utterance

in Tamil language is chosen. Let the input speech utterance be /vanakkam/. The

syllable segments that resulted from the segmentation task are /va/, /nak/ and

/kam/. The following two scenarios are considered:

Case I: Suppose the syllable /va/ is replaced by /ba/. The output speech will

be /banakkam/.

Case II: Suppose the syllable /kam/ is replaced by /gam/. The output speech

will be /vanakgam/.

In both the scenarios, a native Tamil speaker is able to understand the utter-

ance as /vanakkam/. This extra information comes from the fact that depending

on the context, the person might make out the spoken word without any difficulty.

This particular tendency has been exploited in the present clustering task where

an unsupervised training algorithm is used to group similar sounding syllables.

By the end of the clustering process, the similar sounding syllable segments fall

into one cluster.

4.5 System Codebook Preparation

System codebook is prepared from the syllable segments obtained from the seg-

mentation process. The codebook is generated in the following three stages.

1. Clustering of syllable segments

2. Selection of representative syllable segment

3. Preparation of system codebook


Before discussing the clustering method, the Multiple Frame Size (MFS) and

Multiple Frame Rate (MFR) techniques for speech signal processing [46] that

enable the clustering to be an unsupervised method are briefly explained.

4.5.1 MFS and MFR Techniques

In speech signal processing, the features are extracted by windowing the signal at

frame level, typically of 25 milliseconds. This is because of the fact that the speech

is assumed to be stationary in that interval. This method of feature extraction

works in most of the cases. However, this may face problems when the pitch

frequency or speaking rate for the test speaker is very different from that of the

speakers' data used in training [46]. Another problem is that such a technique for

feature extraction may not capture the sudden changes in the spectral information.

To avoid these effects, the MFS and MFR techniques are proposed. In the present

work, MFS and MFR techniques are used to generate multiple examples from

a single training example. Physically, it is like considering the same example at different resolutions. These techniques were first used in a spoken language identification

task, where the models are initialized with a single training example. In this work

[19], it is shown that the MFS feature extraction ensures a reasonable variance for

each Gaussian mixture in the models. Additionally, the speaking rate varies for

different speakers. If the rate of speaking for the test speaker is different from that

of the trained speakers, the models generated with the single frame size technique

may not be able to capture the changes. It is shown that along with the MFS technique, the MFR technique should also be used to build the models.
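As an illustration of how one segment can yield several training examples, the sketch below extracts 39-dimensional features (13 MFCC + 13 delta + 13 acceleration) for a few frame size / frame rate combinations. The particular grid of sizes and rates, and the use of the librosa library for feature extraction, are assumptions made for the sketch and are not the settings of [46].

import numpy as np
import librosa

# Illustrative (frame size, frame shift) pairs in milliseconds.
MFS_MFR_GRID = [(16, 8), (20, 10), (24, 10), (32, 16)]

def mfs_mfr_examples(y, sr):
    """Generate multiple feature-vector sequences from one syllable segment
    by varying the analysis frame size and frame rate."""
    examples = []
    for win_ms, hop_ms in MFS_MFR_GRID:
        n_fft = int(sr * win_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=n_fft, hop_length=hop)
        feats = np.vstack([mfcc,
                           librosa.feature.delta(mfcc),
                           librosa.feature.delta(mfcc, order=2)])
        examples.append(feats.T)         # one (frames x 39) example per setting
    return examples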

4.5.2 Clustering of syllable segments

The syllable segments obtained from a large speech corpus are automatically clus-

tered using an unsupervised and incremental HMM based clustering algorithm


proposed in [19]. After clustering is performed, syllable segments which are acous-

tically close to one another are grouped together into one cluster. A separate HMM is built for each cluster.

The clustering is accomplished in two stages - initial cluster selection and

unsupervised incremental training. The description of these two stages is adapted

from [19][48].

Initial cluster selection

Initial cluster selection chooses a set of syllable clusters on which the entire clus-

tering process takes place. The steps in the procedure for initial cluster selection

are given below.

1. All the M syllable segments obtained from the training data are used for

clustering. The silence regions are added to each of the syllable segments

at the beginning and at the end. This is called silence normalization [31].

Segmentation might result in segments which have silence regions at the

beginning or at the end. So, when silence normalization is done, a sepa-

rate state is assigned to the silence portion at the syllable boundaries while

modeling.

2. An HMM is built for each syllable segment. To generate a sufficient number of examples for model initialization, multiple frame size (MFS) and multiple frame rate (MFR) techniques are used. In this technique, features (13

MFCC/LPCC + 13 delta + 13 acceleration) are extracted from the same

syllable segment with different frame size and frame rate to generate mul-

tiple examples. Both MFCC and LPCC are able to capture the spectral

information. Hence, either of them can be used. There is no significant

change in the number of clusters formed when any of the features is used.

3. Similarly, M HMMs are initialized for M syllables. The initial models, which depend solely on a single syllable segment, are first obtained.

4. Each of the M syllable segments is recognized using M HMMs. For any

segment Si, the model HMMi is expected to give the largest likelihood score

because the model HMMi is trained with the examples generated from Si. In

order to initiate the clustering process, the model HMMj, j ≠ i, that gives

the second largest score for Si is also considered. This process is repeated

for all the M segments.

5. If the HMMj gives the second highest score for any segment Si, i ≠ j, then

the model corresponding to Sj is not used further. This is done to reduce

the number of clusters in the subsequent iterations. Let M ′ be the number

of clusters at the end of this step.

6. For each of the M ′ segments identified in the previous step, a new HMM is

trained. The examples used for training the new model HMMi are derived

from the segments Si and Sj, where Sj is the segment of the model that

gave the second highest score for Si in step 4. Multiple examples are derived

from Si and Sj using MFS and MFR techniques.

7. Steps 4 to 6 are repeated for m iterations. After m iterations, the number

of segments associated with each HMM is equal to 2m. For small values of m

(m = 2), the number of syllable segments associated with each model is small

and the differences among the segments are expected to be insignificant.

At the end of this stage, CI initial clusters are formed where CI < M. Initial

cluster selection is done to ensure fast convergence of the incremental training

algorithm used in the following stage.

Unsupervised incremental training

1. For each of the CI clusters formed in the initial cluster selection, an HMM is

trained using the examples derived from the syllable segments in the cluster.


2. Each of the M segments is decoded or recognized using the CI HMMs trained

in step 1. It is expected that the similar syllable segments will be recognized

by the same HMM. All the segments for which a particular HMM gives the

highest likelihood score are grouped into a cluster.

3. The clusters with less than three syllable segments in them are removed.

However, the segments in these clusters are not excluded from the clustering

process.

4. An HMM is re-trained for each of the clusters formed, using the examples

derived from segments in the cluster.

5. Steps 2 to 4 are repeated until convergence is met. Convergence is reached when there is no migration of syllable segments from one cluster to another cluster [49]. This can be described as follows: Since the HMMs are re-estimated every time, there is a possibility that the models may change

after each iteration. When the models change continuously, the syllables

which match against them also change. But at a certain point, the models

stop changing and hence the syllables in that cluster remain the same.

At the end of this stage, C clusters are formed, where C < CI. Each of the C

clusters is represented by an HMM.
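An outline of the incremental stage is sketched below. The helpers train_hmm and log_likelihood are placeholders for the HMM training and scoring routines (provided in this work by the HTK toolkit), and the MFS/MFR expansion of each segment into several examples is assumed to happen inside train_hmm; this is a sketch of steps 1-5, not the exact implementation.

def incremental_clustering(segments, initial_clusters, train_hmm, log_likelihood,
                           min_cluster_size=3, max_iters=20):
    """Unsupervised incremental HMM based clustering (steps 1-5 above).
    `segments` is the list of all M syllable segments; `initial_clusters` is a
    list of lists of segment indices from the initial cluster selection stage."""
    clusters = [list(c) for c in initial_clusters]
    models = [train_hmm([segments[i] for i in c]) for c in clusters]   # step 1

    for _ in range(max_iters):
        # Step 2: re-assign every segment to the model with the highest score.
        assignments = [[] for _ in models]
        for i, seg in enumerate(segments):
            scores = [log_likelihood(m, seg) for m in models]
            assignments[scores.index(max(scores))].append(i)

        # Step 3: drop clusters that attracted fewer than three segments;
        # their segments remain in the pool and are re-assigned next round.
        kept = [c for c in assignments if len(c) >= min_cluster_size]

        # Step 5: stop when no segment migrates between clusters.
        if kept == clusters:
            break

        # Step 4: re-train one HMM per surviving cluster.
        clusters = kept
        models = [train_hmm([segments[i] for i in c]) for c in clusters]

    return clusters, models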

4.5.3 Color palette analogy

The proposed method for clustering syllable segments is explained by using a color

palette analogy with the help of a diagram shown in Figure 4.1.

Each of the M syllable segments is represented by a unique color from the

color palette. In this explanation that follows, a color is analogous to a syllable

segment. Variations in the syllable segments are analogous to variations in the

color. For example, two variations of a syllable segment /va/ are considered as the


syllable segments /va1/ and /va2/. The differences in the syllable segments are

due to the mismatch of their duration and spectral characteristics. Analogously

in the color palette, two different shades of a color X are considered for syllable

segments /va1/ and /va2/.

Initially, a color model is prepared for each color. The color model is analogous

to an HMM in the clustering algorithm. The color model is assumed to be prepared

by considering some of the color properties like the percentage of red, blue and

green colors present in it. It is also assumed that a color model cannot be prepared

from a single color, hence different examples of the same color are aggregated to

estimate its properties. A color model is built for each color. In the next step, each

color is compared against the color model. The color model that best matches the

input color is considered. Initially, the color models are prepared from a single

color, hence, the input color is matched to its own model. So, the second best

match is also considered. The color which closely resembles the original color will

be the second best match. The two colors which are similar are paired up. Since the second color is already paired up with the first one, it is not considered in

the next iteration. This procedure is repeated for all the colors. After the first

iteration, the color and its closest match are grouped together. The properties of

these two colors are used to re-estimate the color model. In this process, the basic

properties of the color model are not changed but the parameters are varied to

account for the changes in the two colors. The colors which have similar properties

(for example the two shades of dark green), are clustered into one group and the

color model (of dark green) is re-estimated using both the shades. Once the colors

are clustered, they are not used again for clustering in that iteration. This results

in a reduced number of clusters in the next iteration. This process of combining

similar colors is carried out for a number of iterations till a set of distinct colors

is obtained. It is seen that color models of green and color models of red can never

mix with each other. If the modeling is good enough, even the color models of light

green and dark green will not mix. Hence, this ensures that the basic properties


of syllables will not change. The syllables which are very different will never occur

in one cluster. But, there is a chance for confusability. For example, if there is a

brown color, it may be confused with dark red and dark brown. Instead of falling

into the brown cluster, it may fall into the dark red cluster. This is the reason

why the recognition of syllable segments is not always accurate. The final color

models are shown in Figure 4.1. It can be seen that all shades of blue are modeled

by a single blue color model, all shades of green by a single green color model and

so on. Similarly, in the HMM based segment clustering process, similar syllable

segments are clustered into one group and modeled by a single HMM.

Figure 4.1: Color palette analogy to explain the proposed HMM based technique for segment clustering.

4.5.4 Selection of representative syllable segments

Clustering makes it possible to compress the system information. Now, in order to build a codebook from the clusters, a representative for each cluster should be chosen. The cluster contains similar sounding segments. In order to represent the cluster, one syllable segment is to be chosen from each cluster. That syllable segment is called the 'representative syllable segment'. It represents the acoustic properties of all the other syllable segments in the cluster. The other syllable segments are variations of the representative syllable segment.

Figure 4.2: Illustration of selection of a representative syllable segment

The selection of a representative syllable segment is illustrated in Figure 4.2.

A blue cluster that resulted from the color palette analogy is considered. It is seen

that all examples are of blue shade. The different sizes represent the differences

in the durations of the syllable segments. The different textures represent the

different spectral characteristics. Among these color elements, the element which

has properties close to all the remaining elements is chosen as the representative

color element. Similarly, to build a system codebook, a ‘representative’ syllable

segment is chosen for each cluster. The procedure for identifying the representative

syllable segments is given below:

1. Let C be the number of clusters obtained. Let Ci denote the ith cluster and

Ni denote the number of syllable segments present in that cluster. Let Sij

denote the jth syllable segment in cluster Ci.

2. For each syllable segment Sij, the 10th order LP Cepstral Coefficient (LPCC)

vectors using a frame size of 20 milliseconds and a frame shift of 10 millisec-

onds are computed.

3. The average Dynamic Time Warping (DTW) distance [9] Dij for Sij is com-

puted as follows:

D_{ij} = \frac{1}{N_i - 1} \sum_{k=1, k \neq j}^{N_i} \mathrm{DTWdistance}(S_{ij}, S_{ik})

The DTW distance is used to compute the dissimilarity between two vector

sequences of different lengths.

4. The syllable segment which has the minimum average DTW distance is

chosen as the representative syllable segment for the cluster Ci.
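The selection rule can be written compactly as follows. The quadratic-time DTW with a Euclidean local cost used here is a textbook formulation adopted for illustration; the exact distance computation of [9] may differ in its local path constraints.

import numpy as np

def dtw_distance(A, B):
    """Plain DTW between two LPCC sequences A (n x p) and B (m x p) with a
    Euclidean local cost and the usual three-way recursion."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(A[i - 1]) - np.asarray(B[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def representative_segment(cluster_features):
    """Index of the segment with the minimum average DTW distance Dij to all
    other segments of the cluster."""
    N = len(cluster_features)
    avg = [sum(dtw_distance(cluster_features[j], cluster_features[k])
               for k in range(N) if k != j) / (N - 1)
           for j in range(N)]
    return int(np.argmin(avg))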

Figure 4.3 shows the waveforms and corresponding spectrograms of the syllable

segments in one of the clusters. There are five syllables in the cluster which are

variations of the syllable /va/. The duration mismatch among the waveforms is

clearly observed. On the other hand, spectrograms show the similarities in the

spectral properties of the syllable segments. It is observed that all the five syllable

segments have the same formant structure. The segment /va3/ is chosen as the

representative syllable segment of the cluster.

Figure 4.3: Waveforms and spectrograms of five syllable segments in a cluster formed using the proposed clustering algorithm. The cluster contains the syllable segments that sound like /va/.

4.5.5 Preparation of system codebook

The codebook of size C is prepared using the C representative syllable segments.

The sequences of LP coefficient vectors for the representative segments are used to synthesize the speech at the receiver. Therefore, the sequences of LP coefficient vectors corresponding to the representative syllable segments are stored in the

codebook. Since MELP is used for residual modeling, the frame size used in

MELP analysis, i.e., 22.5 milliseconds is used in the LP analysis. As the duration

differs from one syllable segment to another, the number of LP coefficient vectors

corresponding to each syllable segment is different. The information regarding the

duration of a representative syllable segment is also made available in the system

codebook. Figure 4.4 illustrates the system codebook. The widths of the codebook

entries are shown to be different to emphasize the difference in the durations of

the representative syllable segments.

As already mentioned in the previous chapter, three databases are created us-

ing Tamil and Hindi news bulletins from DBIL. The HTK toolkit [50] is used for

implementing the clustering procedure. For a single speaker database, 1683 sylla-

ble segments are used to form 295 clusters. For the multi speaker Tamil database,

9465 syllable segments are used to form 1583 clusters. The multi speaker, multi

lingual database that has 17739 syllable segments resulted in 2980 clusters. Hence,

the codebook sizes for the three databases are 295, 1583 and 2980 respectively.

The codebooks are incorporated at both the transmitter and receiver sections of

the vocoder.


Figure 4.4: System codebook entries. Here Si denotes the representative syllable for the cluster Ci and ni is the duration of Si.

4.6 Summary

This chapter discusses the preparation of the system codebook for the proposed segment vocoder. The different schemes for codebook preparation proposed in the literature use either vector quantization or HMM modeling. It is observed that the syllable segments cannot be modeled properly by vector quantization, so HMM training has been chosen for syllable segment clustering. Most of the methods in the literature are supervised and need transcriptions for HMM training. In

the present chapter, an unsupervised clustering algorithm is used for clustering

the syllable segments into groups of acoustically similar segments. Representative

syllable segments that represent the common properties of all the segments in re-

spective syllable clusters are identified to build the system codebook used in the

segment vocoder.


CHAPTER 5

SYLLABLE BASED SEGMENT VOCODER

5.1 Introduction

The working of a segment vocoder involves encoding the speech signal at the

transmitter, decoding and regenerating the speech back at the receiver. Encoding

of speech involves segmentation of the speech signal, quantizing the system and

source characteristics of the segment using codebooks, and then transmitting the

binary encoded codebook indices. At the receiver, the transmitted bit stream is

decoded to retrieve the codebook indices of each segment. System and source

characteristics of each segment are modeled using parameters derived from the

respective codebooks. The modeled parameters are used to synthesize speech. In

this chapter, the operation of the proposed syllable based segment vocoder is discussed.

Initially, the operation of the proposed vocoder, encoding and decoding are presented in Sections 5.2, 5.3 and 5.4 respectively. Later, a sanity check is performed

on the proposed methods by passing the residual as it is during synthesis. A mod-

ification of the vocoder along with the reasons for modification are discussed in

Section 5.5. The implementation of the proposed syllable based segment vocoder

and the performance evaluation are presented in Section 5.6. Finally, the chapter

concludes by discussing some of the implementation issues in Section 5.7.

5.2 Operation of the Vocoder

The structure of the syllable based segment vocoder is shown in Figure 5.1. The

input speech signal is segmented using group delay based segmentation. The seg-

ments are recognized against the HMMs that are trained using a large corpus

of syllable like segments using the unsupervised clustering algorithm discussed in

Chapter 4. The index of the HMM which decodes the input syllable segment is

encoded and transmitted. Additionally, duration information of each segment is

also encoded for transmission. Simultaneously, the input speech signal is passed

through an LP analysis filter to obtain the residual. The parameters of the residual

are encoded using MELP. At the receiver, residual is modeled using the decoded

source parameters. System information in the form of LP coefficients is obtained

from the system codebook. The modeled residual is passed through the LP synthe-

sis filter formed from LP coefficients to obtain the synthesized speech. Description

of encoding and decoding processes is given in the following sections.


Figure 5.1: Block diagram of the syllable based segment vocoder


5.3 Encoding

Encoding in a speech coder is the process of representing the input speech signal in

the form of a bit stream. The stages involved in encoding the input speech signal

in the proposed syllable based segment vocoder are explained in this section.

5.3.1 Segmentation

The input speech signal is segmented using the two-level group delay based seg-

mentation algorithm to obtain syllable like units. The duration of each syllable

is calculated in terms of the number of frames with the frame size being 22.5

milliseconds. If the number of frames is not a whole number, it is rounded off

to the next integer. The speech signal waveform for a sentence, the syllable like

segments obtained from the segmentation and the durations of the segments are

shown in Figure 5.2. Vertical lines in the figure represent syllable boundaries. The

transcription and duration of each syllable segment are also given.
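A one-line illustration of the frame count computation, assuming segment durations are available in milliseconds; the 22.5 ms frame size is the one stated above.

    import math

    FRAME_MS = 22.5  # frame size used throughout the vocoder

    def duration_in_frames(segment_ms):
        # Duration of a syllable segment in frames, rounded up to the next integer.
        return math.ceil(segment_ms / FRAME_MS)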

[Figure 5.2 content: the syllable sequence /sa ma dU rat til vi la hi niR pa de^n Rum/ with the duration of each syllable marked in frames.]

Figure 5.2: Segmentation of a Tamil sentence into syllable like units


5.3.2 System Quantization

System quantization of each syllable segment is accomplished using the HMMs

formed from the clustering task. The system encoding of a syllable segment is

shown in Figure 5.3. Each syllable segment is recognized against the cluster mod-

els. The index of the HMM which decodes the syllable segment is used for en-

coding. At the same time, the duration of the syllable segment is also encoded.
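A sketch of this per-segment quantization step. The scoring function is a stand-in for recognition against the cluster HMMs (done with HTK in this work); it is passed in as a parameter because the exact toolkit call is not shown here.

    def encode_segment(features, cluster_models, n_frames, score_fn):
        # score_fn(model, features) should return a (log-)likelihood of the
        # feature sequence under one cluster HMM; higher means a better match.
        scores = [score_fn(model, features) for model in cluster_models]
        best_index = max(range(len(scores)), key=lambda i: scores[i])
        return best_index, n_frames  # both are binary encoded for transmission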


Figure 5.3: System encoding in the syllable based segment vocoder

5.3.3 Source Encoding

The input speech signal is passed through an LP analysis filter to obtain the

residual. Pitch, gain and voicing are the main source parameters that are to be

encoded. The MELP residual coding is used for this purpose. Block diagram

showing the MELP encoding process [20] is given in Figure 5.4. Residual encod-

ing in MELP is done at frame level with the duration of each frame being 22.5

milliseconds.


Figure 5.4: Block diagram of the MELP encoder used for source encoding in the syllable based segment vocoder


The input signal is first filtered to remove any low frequency noise components.

Initial pitch estimation is carried out on the filtered speech. The next step is to

perform bandpass voicing analysis. The speech signal is passed through five filters

with the following frequency bands: 0 − 500 Hz, 500 − 1000 Hz, 1000 − 2000 Hz,

2000− 3000 Hz and 3000− 4000 Hz. The voicing decisions (voiced/unvoiced) are

made for each band. The extent of voicing in a frequency band is determined by

the strength of periodicity in that frequency band. Later, the peakiness of each

frame of the residual is calculated. The peakiness refers to the presence of samples

having relatively high magnitudes (peaks) with respect to the average magnitude

of the group of samples. The peakiness measure helps in estimating the voicing

strengths as voiced frames generally have high peakiness values when compared to

unvoiced frames. Later, final pitch is estimated over the low pass filtered residual

signal. A series of pitch processing techniques results in final pitch estimation

for a frame of speech. Once the pitch is determined, the gain is estimated. The

estimated gain is the Root Mean Square (RMS) value of the input signal in dB.

Before encoding all the parameters, Fourier magnitudes are calculated from the

residual. The Fourier Magnitudes are the first ten pitch harmonics extracted

from the residual signal. They are identified by using the spectral peak picking

algorithm on the Fourier Transform of the residual signal. Fourier magnitudes are

obtained only if the speech frame is voiced. For unvoiced frames, the unoccupied

bit positions are used for forward error correction. All the source parameters are

encoded using the respective codebooks.
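A sketch of two of the per-frame source measurements described above, gain as the RMS value in dB and peakiness as the ratio of RMS to mean absolute value; this is a simplified reading of the MELP analysis, not the standard's exact routines.

    import numpy as np

    def frame_gain_db(frame):
        # Gain of a frame (1-D NumPy array of samples) as its RMS value in dB.
        rms = np.sqrt(np.mean(frame.astype(float) ** 2))
        return 20.0 * np.log10(rms + 1e-12)

    def peakiness(frame):
        # Ratio of RMS to mean absolute value; voiced residual frames, which
        # contain sharp pitch pulses, tend to give higher values.
        x = frame.astype(float)
        rms = np.sqrt(np.mean(x ** 2))
        return rms / (np.mean(np.abs(x)) + 1e-12)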

5.4 Decoding

Decoding is the process of retrieving the useful information from the coded input.

The bit stream transmitted by the encoder contains the information regarding

the indices of system and source codebooks. The bit stream is decoded to get

the codebook indices. System information and duration are decoded segment


by segment using the system codebook and the duration codebook respectively.

Source information is decoded for every frame. Residual is modeled from the

decoded source parameters using MELP. The decoded system parameters and the

modeled residual are used for synthesizing speech using an LP synthesis filter. A

detailed explanation of system and source decoding is given in this section.

5.4.1 Extraction of system parameters

The system parameters are extracted from the system codebook. The system

codebook contains the LP coefficient vector sequences for the representative syl-

lable segment of each cluster. System codebook preparation is already explained

in Section 4.5.5. The decoding process involves extraction of system parameters.

The bit stream is decoded to get system codebook indices. The LP coefficient

vector sequences corresponding to the decoded indices are used to form an LP

synthesis filter. The duration mapping between the input syllable segment and

the corresponding representative syllable segment is done by the duration match-

ing section. The duration matching technique used in this vocoder is discussed in

Section 5.5.
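A sketch of the filtering operations implied here, assuming the stored LP coefficient vector for a frame is the analysis polynomial A(z) written as [1, a1, ..., ap]; SciPy's lfilter is used purely for illustration.

    from scipy.signal import lfilter

    def synthesize_frame(lp_coeffs, excitation_frame):
        # LP synthesis: pass the modeled residual through 1/A(z).
        return lfilter([1.0], lp_coeffs, excitation_frame)

    def analysis_residual(lp_coeffs, speech_frame):
        # Inverse operation: pass speech through A(z) to obtain the residual.
        return lfilter(lp_coeffs, [1.0], speech_frame)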

5.4.2 Source Modeling

The source is modeled using MELP. MELP was developed as an extension of the LPC vocoder [20], with the objective of synthesizing speech which sounds natural and close to human speech. The synthesized speech from an LPC vocoder sounds

synthetic because the excitation used is either a series of impulses or random

noise depending on the nature of speech (voiced/unvoiced). But, the excitation of

human speech cannot be represented by a perfect impulse train in voiced regions

and a pure random noise in unvoiced regions. So, in order to mimic the human

speech, MELP introduces ‘mixed excitation’ which is calculated as a combination


of both impulses and noise. The features which make the MELP synthesized

speech sound more natural are: (i) mixed pulse and noise excitation, (ii) periodic

and aperiodic pulses and (iii) adaptive spectral enhancement. The above three

features are accommodated in the decoder part of MELP, to produce a more

natural excitation signal that resembles the residual. Modeling the excitation

signal requires source parameters like pitch, gain, voicing and Fourier magnitudes

which are quantized, encoded and transmitted. The excitation modeling using

MELP is shown in Figure 5.5.


Figure 5.5: Block diagram of the MELP decoder

The decoder retrieves the source information in the form of indices correspond-

ing to the respective source codebooks. The decoded values of source parameters

are used to generate a mixed excitation signal. All the decoded parameters are

interpolated pitch synchronously. The mixed excitation is generated as the sum

of filtered pulse and noise excitations. Pulse excitation is generated using the

pitch information. The noise excitation is generated by a uniform random number

generator. These excitations are filtered and added together. Mixed excitation

removes the buzziness in the synthesized speech. One more major contribution in


the MELP residual modeling is the use of aperiodic pulses. This is because the

voiced regions of speech do not have perfect periodicity. There are slight variations

in the periodicity within a voiced region. In MELP, the periodicity in a voiced

region is modeled to mimic the erratic glottal pulses produced by humans. This

is accomplished by varying the pitch period length using the pulse position jitter.

The mixed excitation is then passed through an adaptive spectral enhancement

filter. The coefficients of the filter are calculated using the interpolated LP co-

efficient vector sequences. Adaptive spectral enhancement helps in matching the

synthesized speech with the original speech in formant regions. The excitation sig-

nal filtered through adaptive spectral enhancement filter is used for synthesizing

speech.
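A toy sketch of the mixed excitation idea for one voiced frame: a pitch-spaced impulse train and uniform noise are generated and summed with a fixed weighting. The per-band voicing strengths, pulse shaping filters, jitter and adaptive spectral enhancement of the real MELP decoder are omitted here.

    import numpy as np

    def mixed_excitation(frame_len, pitch_period, pulse_weight=0.7, seed=0):
        # Weighted sum of an impulse train (spaced by the pitch period)
        # and uniform random noise.
        rng = np.random.default_rng(seed)
        pulses = np.zeros(frame_len)
        pulses[::pitch_period] = 1.0                 # impulse train at the pitch rate
        noise = rng.uniform(-1.0, 1.0, frame_len)    # noise excitation
        return pulse_weight * pulses + (1.0 - pulse_weight) * noise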

The residual modeled using MELP increases the overall bit rate of the coder, as the major portion of the bit rate is occupied by the residual bits. To check the sanity of the proposed techniques, care is taken that the residual is well modeled; hence, MELP is chosen for residual coding. Figure 5.6 shows the residual modeled by MELP. It can be seen that the MELP modeled residual is

very close to the original residual.


Figure 5.6: Comparison of the MELP modeled residual with the original residual


5.5 Synthesis of Speech

After decoding the system and source parameters, an LP synthesis filter is used

to synthesize the speech sentence. Prior to this, duration adjustment is to be made

to match the input segment duration to the duration of the representative syl-

lable segment. The duration of the representative syllable segment is obtained

from the system codebook. In order to validate the proposed algorithms, initially

the residual is passed as is at the synthesizer. Ideally, the synthesized waveform

should match the original signal. But, the duration of the representative sylla-

ble segment is very different from the actual segment duration. The duration

mismatch is compensated by repeating or deleting the middle frames of the LP coefficient vector sequence of the representative syllable segment, so that the number of frames in the representative syllable segment becomes equal to the number of frames required by the duration of the input syllable segment. When speech is synthesized

using the duration matched LP coefficient vector sequences of the representative

syllable segments and the original residual, clipping is observed in some sentence

waveforms. Figure 5.7 shows the waveforms for the original speech signal and the

clipped signal of the synthesized speech. This clipping effect shows that there is

still a mismatch between the residual and LP coefficient vector sequence obtained

from representative syllables. In order to remove the mismatch, a modified version

of the segment vocoder shown in Figure 5.8 is implemented. In this vocoder, the

residual which is to be encoded and later modeled by the MELP, is obtained by

passing the input speech signal through the filter formed by LP coefficients of the

representative syllable segments. This ensures matching because the residual is

the prediction error obtained when the speech signal is passed through the LP

analysis filter. When such a residual signal is filtered through the inverse filter

formed by the same set of LP coefficients, the original signal is retrieved. To match

the durations, the frames in the middle of the LP coefficient vector sequence of

the representative syllable segment are repeated or deleted.
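A sketch of the middle-frame repetition and deletion used for duration matching, assuming the representative segment's LP coefficient vector sequence is stored with one row per 22.5 ms frame; frames around the centre are duplicated or dropped until the lengths agree.

    import numpy as np

    def match_duration(rep_lp_vectors, target_frames):
        # Repeat or delete frames in the middle of the representative segment's
        # LP coefficient vector sequence until it has target_frames rows.
        seq = list(rep_lp_vectors)
        while len(seq) != target_frames:
            mid = len(seq) // 2
            if len(seq) < target_frames:
                seq.insert(mid, seq[mid])   # repeat a middle frame
            else:
                del seq[mid]                # delete a middle frame
        return np.asarray(seq)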


Figure 5.7: Waveform of (a) the original speech signal and (b) the clipped signal of the synthesized waveform


Figure 5.8: Block diagram of the modified segment vocoder

5.6 Experiments and Results

When the input speech utterance to be coded is given, group delay based segmen-

tation is done to get the syllable segments. The segments are recognized using the

HMM cluster models. The indices corresponding to the recognized cluster models

are encoded. The speech signal is then passed through the filter formed from the

LP coefficient vector sequence of the representative syllable segments after the du-

ration adjustment, i.e., any duration mismatch present between the input syllable segment and its representative syllable segment is removed by either repeating or deleting frames in the middle section of the LP coefficient vector sequence of the representative syllable segment. The residual obtained from the filter is coded using the MELP residual coding algorithm [51]. The residual is coded for every 22.5 milliseconds using 28 bits. This results in about 1.2 Kbps for residual coding alone. A syllable rate of 8 syllables/s is assumed. If a maximum syllable duration of


16 frames is assumed where each frame is of 22.5 ms, 4 bits are needed to encode

the duration of each segment and the overall bit rate will be about 1.4 Kbps.

Frame size for source encoding = 22.5 ms
Number of bits required to encode source parameters = 28 bits/frame
Number of frames = 44.4 frames/s

Codebook size for single-speaker database = 295
Number of bits required to encode system parameters = 9 bits/segment
Total bit rate for the single-speaker database = 44.4 × 28 + 7 × (9 + 4) ≈ 1400 bps

Codebook size for multi-speaker database = 1583
Number of bits required to encode system parameters = 11 bits/segment
Total bit rate for the multi-speaker database = 44.4 × 28 + 7 × (11 + 4) ≈ 1400 bps

Codebook size for multi-speaker, multi-lingual database = 2980
Number of bits required to encode system parameters = 12 bits/segment
Total bit rate for the multi-speaker, multi-lingual database = 44.4 × 28 + 7 × (12 + 4) ≈ 1400 bps
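The same arithmetic restated as a small script. The 44.4 frames/s and 7 segments/s figures are the ones used in the totals above (the prose assumes up to 8 syllables/s; either figure gives roughly the same total), and the system bits per segment are simply ceil(log2(codebook size)).

    import math

    FRAMES_PER_SEC = 1000.0 / 22.5     # about 44.4 source frames per second
    SOURCE_BITS_PER_FRAME = 28         # MELP residual bits per frame
    SEGMENTS_PER_SEC = 7               # syllable segments per second used above
    DURATION_BITS = 4                  # enough for a maximum of 16 frames

    def total_bit_rate(codebook_size):
        system_bits = math.ceil(math.log2(codebook_size))  # bits per segment index
        return (FRAMES_PER_SEC * SOURCE_BITS_PER_FRAME
                + SEGMENTS_PER_SEC * (system_bits + DURATION_BITS))

    for size in (295, 1583, 2980):
        print(size, round(total_bit_rate(size)))   # all of the order of 1400 bps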

It can be observed clearly that even if the codebook size is increased nearly

ten times the bit rate does not change significantly. But, the computation time

increases with increase in the number of HMMs. This forms a major area of focus

when real time implementation of the coder is considered. The spectrograms of

the original speech and the synthesized speech for a part of a test sentence are

shown in Figure 5.9(a) and Figure 5.9(b) respectively. It is observed that the

formant structure is preserved. The output speech quality of the proposed seg-

mental vocoder is compared with that of the 2.4 Kbps standard MELP codec.

Figure 5.9: Comparison of spectrograms: (a) original signal and (b) synthesized signal

The Perceptual Evaluation of Speech Quality (PESQ) score [52] is used for this comparison. PESQ, described in ITU-T Rec. P.862, is an objective measurement tool that predicts the results of subjective listening tests on telephony systems and provides a rapid and repeatable result. PESQ uses a sensory model to compare the original, unprocessed signal with the degraded signal from the network or network element. The resulting quality score is analogous to the subjective Mean Opinion Score (MOS) measured using panel tests according to ITU-T P.800. The PESQ scores are calibrated using a large database of subjective tests. PESQ takes into account coding distortions, errors, packet loss, delay and variable delay, and filtering in analogue network components. The range of values PESQ takes is the same as that of the MOS; Table 5.1 shows this range.

Table 5.1: PESQ scores

MOS   Quality     Impairment
5     Excellent   Imperceptible
4     Good        Perceptible but not annoying
3     Fair        Slightly annoying
2     Poor        Annoying
1     Bad         Very annoying

The results for the segment vocoder are shown separately for each database

in Table 5.2, Table 5.3, Table 5.4 and Table 5.5. The original and synthesized speech waveforms are available in the directory named ‘Chapter-5’ in the CD-ROM attached with the Thesis. It can be observed that the waveforms synthesized using

the proposed segment vocoder are natural and intelligible.


Table 5.2: Comparison of PESQ scores for the syllable based segment vocoder and 2.4 Kbps MELP codec for four sentences in the single speaker database

Sentence Proposed method MELP

1 1.66 2.55

2 1.60 2.87

3 1.45 2.50

4 1.82 2.78

Table 5.3: Comparison of PESQ scores for the syllable based segment vocoder and 2.4 Kbps MELP codec for six sentences in the multi speaker database

Sentence Proposed method MELP

1 1.70 2.44

2 1.37 2.95

3 2.05 2.90

4 1.42 2.73

5 1.80 2.58

6 1.61 2.34


Table 5.4: Comparison of PESQ scores for the syllable based segment vocoder and 2.4 Kbps MELP codec for six sentences of Tamil language in the multi speaker, multi lingual database

Sentence Proposed method MELP

1 1.72 2.63

2 1.45 2.28

3 1.71 2.44

4 1.85 2.50

5 1.52 2.32

6 1.33 2.14

Table 5.5: Comparison of PESQ scores for the syllable based segment vocoder and 2.4 Kbps MELP codec for six sentences of Hindi language in the multi speaker, multi lingual database

Sentence Proposed method MELP

1 1.82 2.53

2 1.70 2.61

3 1.75 2.38

4 1.42 2.04

5 1.42 2.59

6 1.38 2.49


5.7 Issues in the Implementation

The speech synthesized from the proposed syllable based segment vocoder is in-

telligible and preserves the naturalness. However, it has been observed that there

are several issues which lead to deterioration in the speech quality. The major

issues are:

1. Residual modeling

2. Syllable recognition

3. Duration matching

5.7.1 Residual Modeling

In linear prediction analysis, the residual is the error signal between the predicted and original speech signals. The coefficients of the LP analysis filter are computed such that the error (residual) obtained is minimum. During coding, MELP uses the residual signal to extract source parameters, and quantizes and encodes them for transmission. In the proposed segment vocoder, the LP analysis filter is formed from the LP coefficient vectors of the representative syllable segments. This results in an error signal which is not the minimum error that would be obtained if LP analysis were performed on the original signal. When the representative syllable segments are very different from the input syllable segments, as happens with recognition errors, the error signal obtained is larger in amplitude, almost resembling the speech signal. Since MELP compresses an 8000 Hz signal to 2400 bits/s, it is difficult to quantize the residual if the error is large. The residual almost resembles the speech signal in the regions where the error is large. This

is clearly seen in Figure 5.10. Figure 5.10(a) shows the residual of a speech signal. Figure 5.10(b) shows the residual obtained by passing the speech signal through the LP analysis filter formed from the LP coefficient vectors of the representative syllable segment. This residual is used for MELP modeling in the proposed segment vocoder. It is seen that the error is very high in some regions. In such cases, the MELP residual modeling may not be able to model the source parameters effectively. This may also lead to the poor quality of the synthesized speech.

Figure 5.10: Comparison of the residual for the original speech signal and the residual used in the vocoder: (a) residual of the original speech signal and (b) residual of the speech signal passed through the filter formed from the LP coefficient vectors of the representative syllable segments

5.7.2 Syllable Recognition

As discussed previously, unsupervised clustering of syllables gives a syllable recog-

nition accuracy of 48% [31]. The recognition accuracy of the clustering algorithm

is not given much importance in the present work. The reason is that the error

in the recognition is expected to get captured in the residual as we pass the input

speech signal through the inverse filter formed from the representative syllable seg-

ment. If the residual is well modeled and used for synthesis, it is possible to get back the original signal with minimum distortion. But the first best result of syllable recognition is not always correct. Because of this, sometimes there will be a huge mismatch between the recognized and original signal, which leads to poor quality

of the synthesized speech waveform. Sentence 2 in Table 5.3, whose PESQ score is 1.37, is one such example. To improve the quality of the synthesized speech,

instead of selecting only the 1-best result cluster for an input syllable segment,

3-best results are selected. Representative syllable segments are chosen from the

3-best clusters. The index of the cluster whose representative syllable segment is closest to the input syllable segment is encoded and transmitted. In other words,

the syllable combination that yields the best PESQ score is encoded and trans-

mitted. Figure 5.11 illustrates this method by encoding a part of a test utterance.

The syllables corresponding to the path −×− can be encoded and transmitted.

The above suggested method is implemented for a speech utterance manually.
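A sketch of how this selection could be automated, assuming per-segment N-best cluster hypotheses and some distance function between the input segment's feature sequence and a representative's stored sequence (the thesis makes this choice perceptually, by listening; a DTW distance over LPCC vectors would be one automatic stand-in).

    def choose_among_nbest(input_features, nbest_indices, codebook, distance_fn):
        # Pick, from the N-best cluster hypotheses for one input segment, the
        # cluster whose representative is closest to the input segment.
        best_index, best_dist = None, float("inf")
        for index in nbest_indices:
            dist = distance_fn(input_features, codebook[index]["lp_vectors"])
            if dist < best_dist:
                best_index, best_dist = index, dist
        return best_index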


Figure 5.11: Different combinations of syllables for the utterance /vanakkam/

The 3-best results for every syllable segment are heard and the representative

syllable segment that is perceptually close to the input syllable segment is used

for encoding. This improved the performance from a PESQ score of 1.30 to a

PESQ score of 1.60. Figure 5.12 shows the selection of the best representative syllable

segment that is close to the input syllable segment. It can be seen that except

for the second and fourth syllable, all other syllables are close to their second

and third best results. It is also observed that the residual obtained using these

2 best 2 best 2 best 2 best 2 best 2 best

syl 1 syl 2 syl 3 syl 4 syl 5 syl 6

Recognition results

3 best 3 best

1 best 1 best 1 best

3 best 3 best 3 best 3 best

rep 11 best

rep 2

rep 3

1 best

rep 4

rep 5

rep 6

Input syllable segments

1 best

1 best

rep 2 rep 3 rep 5

Recognition results

1 best 1 best 1 best 1 best 1 best

rep 1 rep 4 rep 6PESQ= 1.30

PESQ = 1.60

Figure 5.12: PESQ improvement by selecting perceptually close representative syl-lable segments

representative syllable segments is close to the original residual than the residual

79

that is obtained from the set of first best representative syllable segments. Fig-

ure 5.13(a) shows the original residual. Figure 5.13(b) contains the residual as

obtained by considering the first best results of recognition, while Figure 5.13(c)

shows the residual obtained from the representative syllables that are manually

selected from the 3-best results. It is clearly observed that the error in Figure

5.13(c) is much small when compared to the error in 5.13(b).


Figure 5.13: Comparison of the residual for the (a) original speech signal, (b) first best results of the representative syllable segments and (c) perceptually close representative syllable segments from 3-best results.


The analysis establishes that there is a need to develop an automatic technique to identify the representative syllable that is closest to the input syllable segment.

5.7.3 Duration Matching

The frame repetition and deletion technique used to match the durations of the input syllable segment and the corresponding representative syllable segment may not yield good results in all cases. This technique is suitable when the duration mismatch is small. When the duration mismatch is large, extensive repetition or deletion in the middle portion may totally change the spectral characteristics of the syllable segment. In order to demonstrate the mismatch between the actual syllable segment and the representative syllable segment, an example of the syllable /va/ is considered in Figure 5.14(a) and Figure 5.14(b). It is seen that the duration difference is as large as half of the duration of the original speech signal. So, nearly half of the frames need to be repeated.


Figure 5.14: Comparison of the segment duration of (a) original syllable segment and (b) representative syllable segment


The synthesized syllable segment obtained using such a feature vector sequence

has very different characteristics from the original input syllable. This is shown

in Figure 5.15(a) and Figure 5.15(b). It is observed that the spectrogram of the synthesized syllable in Figure 5.15(b) is significantly different from the spectrogram of the input syllable shown in Figure 5.15(a). Alternatively, instead of repeating the middle frame, an attempt is made to repeat the frames in the place where the time domain waveforms of the two syllables are similar. The syllable synthesized using such a feature vector sequence showed better similarity to the spectrogram of the input syllable. Figure 5.15(c) and Figure 5.15(d) are the spectrograms of the synthesized syllables obtained by using the feature vectors whose frames are repeated in accordance with the similarities in the time domain waveforms of the input and representative syllables. It can be observed clearly that the topmost formant, lost in Figure 5.15(b), is restored in Figure 5.15(c) and Figure 5.15(d).


Figure 5.15: Analysis of spectrograms for (a) input syllable segment, (b) synthesized syllable segment obtained with middle frame repetition, (c) synthesized syllable segment with appropriate frame repetition and (d) synthesized syllable segment with appropriate frame repetition but different from the frame repetition used in (c).


The analysis is also verified by synthesizing a part of utterance. The utterance

chosen here is again “vanakkam”. The utterance consists of three syllable units

- /va/, /nak/ and /kam/. The duration mismatch between the input syllable

segments and the corresponding representative syllable segments is such that, for

/va/ frames are to be repeated, for /nak/ frames are to be deleted and for /kam/

no repetition or deletion is required. The utterance is synthesized in three different

forms:

1. Form 1: The LP coefficient vector sequence of the representative syllable

segments of /va/ and /nak/ are formed by repeating and deleting frames in

the middle.

2. Form 2: The LP coefficient vector sequence of the representative syllable

segments of /va/ and /nak/ are formed by repeating and deleting appropri-

ate frames.

3. Form 3: The LP coefficient vector sequence of the representative syllable

segments of /va/ and /nak/ are formed by repeating and deleting appropri-

ate frames, but the positions of repetition and deletion are slightly altered

from Form 2.

It is seen from Table 5.6 that the PESQ scores of Form 2 and Form 3 are higher

than that for Form 1.

Table 5.6: PESQ scores of an utterance after appropriate frame repetition

Synthesis type PESQ score

Form 1 1.886

Form 2 2.224

Form 3 2.295


Moreover, while encoding the duration of the syllable segment, the number of

frames is rounded off to the next integer value. This results in frame mismatch at

the boundary where the last frame of the present syllable segment gets convolved

with the residual of the next syllable segment during synthesis. The mismatch manifests itself as clipping in the synthesized waveform. A large duration mismatch at the boundaries may lead to chopping of the synthesized signal, which in turn reduces the PESQ score. Figure 5.16(a) and Figure 5.16(b) show the waveforms synthesized from MELP and the proposed vocoder respectively. It can be clearly observed that the synthesized waveforms are similar except for the clipping. This sentence is the same as the 4th sentence in Table 5.3, whose PESQ score is 1.42.


Figure 5.16: (a) Synthesized speech obtained using the MELP codec and (b) synthesized speech obtained using the syllable based segment vocoder.


5.8 Summary

The operation of a syllable based segment vocoder is presented. The encoding

operation includes segmentation, system and source quantization, and duration

quantization. The decoding operation involves representing the system character-

istics in the form of LP coefficient vector sequences of the representative syllable

segments, modeling the excitation signal using the MELP, duration matching and

finally synthesizing the speech using an LP synthesis filter. The quality of the

output speech is compared with that of the 2.4 Kbps MELP standard codec. The

output speech is intelligible and natural. It has been observed that the choice of

the representative syllable segment is not always the best. Analysis indicates that

the performance is poor when a large number of syllables are wrongly recognized.

It is also observed that the duration mismatch near the segment boundaries deteriorates the quality.


CHAPTER 6

CONCLUSIONS

6.1 Summary

The excessive demand for bandwidth in voice communications has triggered the

evolution of low bit rate speech coders. The coders are developed to preserve the

quality of the output speech at very low transmission bit rates. However, there

is always a trade-off between the bit rate and the speech quality. As the bit rate

decreases, the quality of the output speech also decreases. Research in low bit rate speech coding aims at producing good quality speech at very low bit rates. Segment vocoders are such low bit rate coders that are capable of achieving bit rates less than 2.4 Kbps. However, the quality of speech is very low and often synthetic. This is because of the high compression at the segment level.

Most of the segment vocoders use phonemes or diphones as the segmental

units for compression. The segmental units are obtained using various segmenta-

tion techniques. The techniques are broadly classified into two categories. One

category uses algorithms which dynamically search for the regions of interest to

locate the boundaries. The algorithms are iterative and complex in nature. The

other category uses parametric modeling, where the models are built from a large

amount of segmental data. The models are then used to locate the boundaries

in the incoming speech signal. Parametric modeling requires the models to be

built on a large amount of data so that the variations of the segmental units are

captured effectively. Hence, there is a need to develop a simple and flexible seg-

mentation algorithm which does not require any prior information to locate the

boundaries. A signal processing technique called group delay based segmentation

is one such technique which is capable of locating the syllable boundaries. Since

the syllable is a larger unit than a phoneme or a diphone, selecting the syllable

as a segmental unit ensures better compression. Hence, in this work syllable like

units are chosen for compression.

The compression is achieved by quantizing the system and source parame-

ters of the segmental units. Previously prepared codebooks are used to quantize the

system and source parameters of the segmental units. Generally, the system char-

acteristics are quantized at the segmental level, whereas the source parameters

are quantized at the frame level. For system codebook preparation, Vector Quan-

tization (VQ) and parametric modeling techniques are widely employed. As the

syllable is chosen as the segmental unit, VQ may not be feasible for clustering as it

does not capture the sequence information in syllables. Hence, parametric model-

ing using HMMs is employed. Unlike most of the parametric modeling techniques

that operate in the supervised mode, an unsupervised clustering algorithm is used.

The algorithm results in clusters that have similar sounding segments. For each

cluster, a representative syllable segment that represents the characteristics of all

the other syllable segments in the cluster is selected.

The system codebook contains the LP coefficient vector sequences of the repre-

sentative syllable segments. For source coding, parameters like pitch, voicing and

jitter are quantized using the respective codebooks. In the present work, source

is encoded and modeled using the MELP. Source modeling using MELP preserves

naturalness in the synthesized speech. With a system compression of 100 bps and

source compression of 1.2 Kbps, the proposed vocoder operates at 1.4 Kbps.

Speech is synthesized at the receiver using an LP synthesis filter. The filter

is formed from the LP coefficient vector sequences of the representative syllable

segments stored in the system codebook. The durations of the input syllable

segment and the corresponding representative syllable segment are matched using

the frame repetition and deletion technique. The excitation signal is modeled


using MELP residual modeling. The modeled residual is passed through the LP

synthesis filter to produce the output speech which is intelligible and natural.

6.2 Conclusions

The following are the main conclusions drawn from the present work:

1. Syllable like units that are much larger than phoneme and diphone units can

be considered as units for compression in the segment vocoders.

2. The complexity of segmentation can be reduced using signal processing based

technique like group delay based segmentation.

3. Syllable recognition and duration matching have a significant effect on the

quality of the output speech.

6.3 Criticism of the work

A novel syllable based segment vocoder is proposed in the present work. The

vocoder is capable of producing natural sounding speech with good intelligi-

bility. The following drawbacks are observed in the implementation of the vocoder.

1. The issue regarding the selection of the best representative unit is not fully

addressed. An automatic technique has to be developed to obtain a better

recognition unit for synthesis. The method used to find the representative

syllable segment may not be correct.

2. Long silences in the input speech signal have to be modeled effectively.

Presently, some manual correction of duration is done near the long silence


regions. Since the database used in the present work is NEWS data, there

are no significantly long silences. But if the coder has to work on casual

speech, long silence regions have to be modeled separately.

3. The frame repetition and deletion technique is not an effective way to match

the durations. The quality of speech drastically reduces when there is a

significant mismatch near the segment boundaries.

6.4 Directions for Future Work

The proposed vocoder can be enhanced by extending the work in following areas:

1. Speech quality can be improved by addressing the issues like syllable recog-

nition and duration matching. A two level encoding and varying the number

of states according to the duration of the syllable segment can be considered

to improve the recognition performance.

2. Instead of using frame repetition, techniques like DTW can be used to ef-

fectively match the durations of the input syllable segment and the corre-

sponding representative syllable segment.

3. Silence regions in speech are to be modeled separately for better quality of

the synthesized speech.

4. The vocoder has to be implemented for multiple languages.

5. Bit rates can be reduced by modeling the source information with a smaller number of bits.


REFERENCES

[1] L. Hanzo, F. C. Somerville, and J. Woodard, Voice and Audio Compression for Wireless Communications. Wiley, 2007.

[2] S. Roucos, R. M. Schwartz, and J. Makhoul, "A segment vocoder at 150 bits/s," Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 8, pp. 61-64, April 1983.

[3] J. Picone and G. Doddington, "A phonetic vocoder," Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 580-583, May 1989.

[4] K. S. Lee and R. V. Cox, "A very low bit rate speech coder based on recognition/synthesis paradigm," IEEE Transactions on Speech and Audio Processing, vol. 9, pp. 482-491, July 2001.

[5] J. Cernocky, G. B. Baudoin, and G. Chollet, "Segmental vocoder - Going beyond the phonetic approach," Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 605-608, May 1998.

[6] K. Tokuda, T. Masuko, J. Hiroi, T. Kobayashi, and T. Kitamura, "A very low bitrate speech coder using HMM-based speech recognition/synthesis techniques," Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 609-612, May 1998.

[7] M. Ismail and K. Ponting, "Between recognition and synthesis - 300 b/s speech coding," Proceedings of EUROSPEECH, vol. 1, pp. 441-444, 1997.

[8] T. Hoshiya, S. Sako, H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Improving performance of HMM-based very low bitrate speech coding," Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 800-803, April 2003.

[9] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Pearson Education Pvt. Ltd., 2003.

[10] L. Rabiner and R. Schafer, Digital Processing of Speech Signals. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1978.

[11] V. Ramasubramanian and T. V. Sreenivas, "Automatically derived units for segment vocoders," Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 473-476, May 2004.

[12] T. Nagarajan and H. A. Murthy, "Group delay based segmentation of spontaneous speech into syllable-like units," EURASIP Journal of Applied Signal Processing, vol. 17, pp. 2614-2625, 2004.


[13] A. Lakshmi and H. A. Murthy, "A syllable based continuous speech recognizer for Tamil," Proceedings of INTERSPEECH, pp. 1878-1881, September 2006.

[14] Y. Shiraki and M. Honda, "LPC speech coding based on variable-length segment quantization," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 36, pp. 1437-1444, 1988.

[15] S. Roucos and A. M. Wilgus, "A waveform segment vocoder: A new approach for very low rate speech coding," Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 10, pp. 236-239, April 1985.

[16] S. Roucos, R. M. Schwartz, and J. Makhoul, "Vector quantization for very-low-rate coding of speech," Proceedings of IEEE Globecom'82, pp. 1074-1078, 1982.

[17] D. Wong, B. H. Juang, and A. Gray, "An 800 bit/s vector quantization LPC vocoder," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 30, no. 5, pp. 770-780, October 1982.

[18] C. Tsao and R. M. Gray, "Matrix quantizer design for LPC speech using the generalized Lloyd algorithm," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 33, no. 3, pp. 537-545, June 1985.

[19] T. Nagarajan and H. A. Murthy, "Language identification using acoustic log-likelihoods of syllable like units," Speech Communication, vol. 48, pp. 913-926, 2006.

[20] A. V. McCree and T. P. Barnwell, "A mixed excitation LPC vocoder model for low bit rate speech coding," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 4, pp. 242-250, July 1995.

[21] L. M. Supplee, R. P. Cohn, J. S. Collura, and A. V. McCree, "MELP: the new federal standard at 2400 bps," Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 1591-1594, April 1997.

[22] W. C. Chu, Speech Coding Algorithms. Wiley-IEEE, 2003.

[23] S. T. Bardenhagen, K. L. Brown, and R. D. Braun, "Low bit rate speech compression using hidden Markov models," Proceedings of MILCOM, vol. 1, pp. 507-511, 1997.

[24] R. S. Kumar, N. Tamrakar, and P. Rao, "Segment based MBE speech coding at 1000 bps," Proceedings of National Conference on Communication, February 2008.

[25] M. Felici, M. Borgatti, and R. Guerrieri, "Very low bit rate speech coding using a diphone-based recognition and synthesis approach," IEE Electronic Letters, vol. 34, no. 9, pp. 859-860, 1998.

[26] G. Benbassat and X. Delon, "Low bit rate speech coding by concatenation of sound units and prosody coding," Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 9, pp. 5-8, March 1984.


[27] P. Vepyak and A. B. Bradley, "Consideration of processing strategies for very low rate compression of wide band speech signal with known text transcription," Proceedings of EUROSPEECH, vol. 3, pp. 1279-1282, 1997.

[28] H. C. Chen, C. Y. Chen, K. M. Tsou, and O. T. Chen, "A 0.75 kbps speech codec using recognition and synthesis schemes," Proceedings of IEEE Workshop on Speech Coding in Telecommunications, vol. 3, pp. 27-29, 1997.

[29] K. S. Lee and R. V. Cox, "A segmental speech coder based on a concatenative TTS," Speech Communication, vol. 38, no. 1, pp. 89-100, September 2002.

[30] V. Kamakshi Prasad, "Segmentation and Recognition of Continuous Speech," Ph.D. dissertation, Department of Computer Science and Engineering, Indian Institute of Technology Madras, India, January 2002.

[31] G. Lakshmi Sarada, "Automatic Transcription of Continuous Speech for Indian Languages," Master's thesis, Department of Computer Science and Engineering, Indian Institute of Technology Madras, India, December 2005.

[32] A. Lakshmi, "A Syllable based Continuous Speech Recognizer for Indian Languages," Master's thesis, Department of Computer Science and Engineering, Indian Institute of Technology Madras, India, April 2007.

[33] Samuel Thomas, "Natural Sounding Text-to-Speech Synthesis based on Syllable like Units," Master's thesis, Department of Computer Science and Engineering, Indian Institute of Technology Madras, India, February 2007.

[34] N. Sridhar Krishna, "Text-to-Speech Synthesis system for Indian Languages within Festival Framework," Master's thesis, Department of Computer Science and Engineering, Indian Institute of Technology Madras, India, January 2004.

[35] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Mixed excitation for HMM-based speech synthesis," Proceedings of EUROSPEECH, pp. 600-604, 2001.

[36] R. Maia, "A novel excitation approach for HMM-based speech synthesis - report IV," A report by HTS group - Nagoya Institute of Technology, May 2007.

[37] Y. J. Kim and A. Conkie, "Automatic segmentation combining an HMM based approach and spectral boundary correction," Proceedings of International Conference on Spoken Language Processing, pp. 145-148, September 2002.

[38] T. Svendsen and F. K. Soong, "On the automatic segmentation of speech signals," Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 12, pp. 77-80, April 1987.

[39] J. P. Martens and L. Depuydt, "Broad phonetic classification and segmentation of continuous speech by means of Neural Networks and Dynamic Programming," Speech Communication, vol. 10, pp. 81-90, February 1991.


[40] S. Wu, E. D. Kingsbury, N. Morgan, and S. Greenberg, "Incorporating information from syllable length time scales into automatic speech recognition," Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 2, pp. 987-990, April 1997.

[41] H. A. Murthy and B. Yegnanarayana, "Formant extraction from minimum phase group delay function," Speech Communication, vol. 10, pp. 209-221, August 1991.

[42] V. K. Prasad, T. Nagarajan, and H. A. Murthy, "Automatic segmentation of continuous speech using minimum phase group delay functions," Speech Communication, vol. 42, pp. 429-446, 2004.

[43] T. Nagarajan, R. M. Hegde, and H. A. Murthy, "Segmentation of speech into syllable-like units," Proceedings of EUROSPEECH, pp. 2893-2896, September 2003.

[44] "Database for Indian Languages," Speech and Vision Lab, Indian Institute of Technology Madras, India, 2001.

[45] T. Nagarajan and H. A. Murthy, "Language identification using parallel syllable-like unit recognition," Proceedings of International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 401-404, May 2004.

[46] G. L. Sarada, T. Nagarajan, and H. A. Murthy, "Multiple frame size and multiple frame rate feature extraction for speech recognition," Proceedings of SPCOM, pp. 592-595, December 2004.

[47] T. Masuko, K. Tokuda, and T. Kobayashi, "A very low bit rate speech coder using HMM with speaker adaptation," Proceedings of 5th International Conference on Spoken Language Processing, vol. 2, pp. 507-510, December 1998.

[48] G. L. Sarada, N. Hemalatha, T. Nagarajan, and H. A. Murthy, "Automatic transcription of continuous speech using unsupervised and incremental training," Proceedings of INTERSPEECH, pp. 405-408, October 2004.

[49] T. Nagarajan and H. A. Murthy, "An approach to segmentation and labeling of continuous speech without bootstrapping," Proceedings of National Conference on Communication, pp. 508-512, January 2004.

[50] "HTK Speech Recognition Toolkit," http://htk.eng.cam.ac.uk.

[51] "2.4 Kbps MELP: Federal standard speech coder, version 1.2," Texas Instruments, 1996.

[52] "Perceptual Evaluation of Speech Quality (PESQ), ITU-T P.862 (02/2001)," www.pesq.org.


LIST OF PUBLICATIONS

1. Sadhana Chevireddy, Hema A. Murthy and C. Chandra Sekhar “Signal pro-

cessing based segmentation and HMM based acoustic clustering of syllable

segments for low bitrate segment vocoder at 1.4 Kbps,” In the proceedings

of 16th European Signal Processing Conference (EUSIPCO-2008), August 25

- 29, Lausanne, Switzerland.

2. Sadhana Chevireddy, Hema A. Murthy and C. Chandra Sekhar “A syllable

based segment vocoder,” in Proceedings of National Conference on Commu-

nication (NCC-2008), pp. 442-445, February 2 - 4, IIT Bombay.


CURRICULUM VITAE

1. Name: Sadhana Chevireddy

2. Date of Birth: 13th August, 1984

3. Educational Qualifications:

(a) 2008 - Master of Science (M.S)

(b) 2005 - Bachelor of Technology (B.Tech)

4. Permanent address:

8-186, New Balaji Colony

M R Palle Road

Tirupati - 517502

Andhra Pradesh

Ph No: 0877 - 2243678

email: [email protected]


GENERAL TEST COMMITTEE

1. Chairperson: Dr. P. Sreenivas Kumar

2. Guides: Dr. Hema A. Murthy, Dr. C. Chandra Sekhar

3. Members:

(a) Dr. B. Ravindran (CSE)

(b) Dr. Andrew Thangaraj (EE)
